
The Short Text Matching Model Enhanced with Knowledge via Contrastive Learning

Ruiqiang Liu
Alibaba
[email protected]
Qiqiang Zhong
Xiaohongshu
[email protected]
Mengmeng Cui
East China University of Science and Technology
[email protected]
Hanjie Mai
Yunnan University
[email protected]
Qiang Zhang
Alibaba
[email protected]
Shaohua Xu
Alibaba
[email protected]
Xiangzheng Liu
Alibaba
[email protected]
Yanlong Du
Alibaba
[email protected]
These authors contributed to the work equally and should be regarded as co-first authors.
Abstract

In recent years, short Text Matching tasks have been widely applied in advertising search and recommendation. The difficulty lies in the lack of semantic information and the word ambiguity caused by the short length of the text. Previous works have introduced complement sentences or knowledge bases to provide additional feature information. However, these methods have not fully modeled the interaction between the original sentence and the complement sentence, and have not considered the noise that may be introduced by external knowledge bases. Therefore, this paper proposes a short Text Matching model that combines contrastive learning and external knowledge. The model uses a generative model to produce corresponding complement sentences and uses contrastive learning to guide the model toward more semantically meaningful encodings of the original sentence. In addition, to avoid noise, we use keywords as the main semantics of the original sentence to retrieve corresponding knowledge words from the knowledge base and construct a knowledge graph. A graph encoding model then integrates the knowledge base information into the model. Our model achieves state-of-the-art performance on two publicly available Chinese Text Matching datasets, demonstrating its effectiveness.

1 Introduction

Short Text Matching (STM) is a task of determining semantic similarity or relevance between two short texts, such as sentence pairs, phrases or keywords. This is a crucial task in many natural language processing applications, including information retrieval, question answering, and document classification.

Search advertising is a form of advertising displayed on search engine results pages. When a user searches for a keyword in a search engine, advertisements are triggered if the keyword matches the keywords purchased by the advertiser. Matching methods include exact match, phrase match, and broad match. Exact match and phrase match are traditional string matches, which makes it difficult to match text that has slight variations in wording but the same meaning. Broad match is usually used when there is little textual overlap between what a user searches for and the keywords purchased by the advertiser, even though the expressed meaning is the same. Compared to exact and phrase match, broad match therefore triggers more ads because its matching conditions are less strict.

With the rapid development of deep learning in recent years, vector retrieval for Text Matching has achieved significant success in multiple fields. Vector retrieval refers to the method of using mathematical representations of data to perform similarity searches in large databases. This method can be applied to various types of data, including text, images, and audio. In vector retrieval, the text searched by the user and the keywords purchased by the advertiser are first represented as high-dimensional vectors. Then, a similarity measure is used to compare these vectors to find the most relevant results for a given query. Short-text vector retrieval is the most common form of broad matching, and our statistics show that over 85 percent of search queries and advertiser-purchased keywords in our search engine are less than 25 characters long. Therefore, short-text vector retrieval matching is particularly important.

However, short text vector retrieval matching currently faces many challenges. Because the short texts that users search for often lack context, multiple interpretations are possible, and the limited vocabulary means that many words have multiple meanings. As shown in the left part of Figure 1, when a user searches for "Changan," it can refer to both a car brand and a city, making accurate matching difficult. Even though the query and the entity returned in the search results are the same, the meaning expressed in the sentence is completely different and contrary to the user's search intent. As a result, many of the keywords purchased by advertisers are matched ineffectively.

Figure 1: The left image showcases bad cases in the UC Browser, while the right image demonstrates the process of generating a complement sentence.

To address the issue of entity ambiguity in short texts, previous research has often relied on incorporating external knowledge to supplement the semantic information of the text. For example, Chen et al. [1] collected external knowledge by gathering text data from user queries and clicks in search engines, and then used it to enhance the semantic information of the text. In the works of Lyu et al. [2] and Ma et al. [3], a multi-dimensional attention mechanism was proposed to interactively fuse each word in the word lattice graph with external words retrieved from HowNet [4], in order to eliminate word ambiguity in the sentence.

While these methods have been proven effective through experiments, they also have certain limitations. Firstly, Chen et al. [1] did not pay sufficient attention to the relationship between the original and supplemented sentences, and lacked effective interaction between the two. In fact, the original and supplemented sentences form a semantic matching sentence pair, so the supplemented sentence can guide the model to generate encoding vectors that better align with the semantics of the original sentence. In the works of Lyu et al. [2] and Ma et al. [3], all words in the sentence were retrieved from the knowledge base, but some words are not main components of the sentence, and using them to retrieve knowledge words may introduce some noise.

Therefore, from these perspectives, we propose a short Text Matching model (KSTM) that enhances knowledge through contrastive learning. The model utilizes the SimBERT text generation model (https://github.com/ZhuiyiTechnology/simbert) to generate complement text for the two input sentences. To enable SimBERT to generate text that fits the context of the original sentences, we train it on a large amount of UC and Quark browser search data from within our company. In our experiments, we use the generated complement text as positive samples, as shown on the right side of Figure 1, and the other samples in the same batch as negative samples. Additionally, we extract keywords from the original sentences and use the HowNet knowledge base [4] to obtain the top k knowledge words that are most similar to the keywords. These words are used to construct a knowledge graph, where the nodes are keywords and knowledge words, and the edges carry the similarity weights between the keywords and their similar words. The model's discriminative ability is enhanced through graph learning. Our model achieves state-of-the-art performance on two publicly available Chinese datasets.

In conclusion, we made the following contributions:

  • We use a generative model to generate complement text for contrastive learning, in order to better guide the model to encode the original text and incorporate more contextual information.

  • We extract keywords from the original text and construct a graph by querying related words from a knowledge base, in order to eliminate entity ambiguity issues.

  • Based on the above two points, we have built a short-Text Matching model, which achieved state-of-the-art results on two publicly available Chinese datasets, demonstrating the effectiveness of the model.

2 Related work

Short Text Matching is a widely used technology in the field of Natural Language Processing (NLP), and in recent years a series of neural network-based Text Matching models have emerged, which can be roughly categorized as follows:

(a) By utilizing contextual features. Hu et al. [5] proposed a context-aware interaction network for question matching. The interaction network is a multi-layer network with a multi-head attention mechanism similar to the Transformer. Each layer includes a context-aware cross-attention mechanism and a gated fusion layer: the former learns contextual interaction information when aligning the two sentences, while the latter flexibly selects useful alignment interaction information. Previous sentence matching models directly perform attention interaction after embedding the two sentences; Hu et al. [6] proposed an enhanced sentence alignment network with a simple gate feature that flexibly integrates original words and contextual features to improve cross-sentence attention interaction. In short Text Matching, discriminative features are rarely identified further and features are rarely denoised to enhance relevance learning, so Li et al. [7] designed ADDAX to clearly distinguish discriminative features and filter out irrelevant features in a context-aware manner.

(b) Using token features. Chen et al. [8] proposed a neural graph matching method (GMN) for Chinese short Text Matching. Instead of segmenting each sentence into a single word sequence, all possible word segmentation paths are retained to form a word lattice graph, and node representations are updated with a graph matching attention mechanism. In addition, current interactive matching methods for sentence modeling are relatively simple and ignore the importance of the relative distance between words. Deng et al. [9] therefore established a high-performance distance-aware self-attention and multi-level matching model (DSSTM) for sentence semantic matching tasks. By considering the importance of tokens at different distances, better sentence semantics can be obtained.

(c) Using additional features. Short Text Matching focuses on learning semantic features based on character and word-level features, to some extent ignoring the special features of Chinese, such as Pinyin, radicals, and keywords. Zhao et al. [10] proposed a multi-granularity interaction model based on Pinyin and radicals (MIPR) for Chinese semantic matching. Keywords represent factual information that needs to be strictly matched, such as actions, entities, and events, while intents convey abstract concepts and ideas that can be interpreted as various expressions. A simple yet effective training strategy was proposed in Zou et al. [11], which involves breaking down keywords and intents and performing semantic matching on the text in a divide-and-conquer approach. Lu et al. [12] presented a multi-keyword matching and sentence matching method that utilizes keyword pairs from two sentences to represent their semantic relationship, thereby avoiding noise redundancy and interference that may arise from using the entire sentence.

(d) Utilizing external information. Chen et al. [1] obtained an external knowledge collection for each short text by using a search engine; the corresponding clicked titles were used as a context sentence set to enhance the model's understanding of short texts. Lyu et al. [2] proposed a multi-dimensional attention mechanism to enable interaction and fusion between each word in the word lattice graph and the external words retrieved from HowNet, aiming to eliminate ambiguity among the words in the sentence. Building on this line of research, the OTE model of Ma et al. [3] used SoftLexicon to provide more detailed information at different levels, and used the LaserTagger [13] model and EDA [14] for data augmentation.

3 Methodology

Figure 2: Framework of KSTM

The overview of our model (KSTM) is shown in Figure 2. It consists of four major components: 1) a Text Encoding Module, 2) a Graph Encoding Module, 3) an Aggregation Layer, and 4) a Binary Classifier Layer.

In the following section, we provide a definition of the STM task and introduce each component of the model.

3.1 Task Definition

Given two original sentences $s_1$ and $s_2$, the objective of STM is to determine whether they have the same semantic meaning. It can be considered a binary classification task. To capture more comprehensive semantic information about $s_1$ and $s_2$, we generate complement sentences $s'_1$ and $s'_2$ specifically for them. The details of the generation process can be found in Section 3.7.

3.2 Text Encoding Module

Given the remarkable performance of the pre-trained model BERT [15] on a range of natural language processing tasks, we utilize it as the encoder to separately encode the original and complementary sentences into high-dimensional vectors. Specifically, we first insert the [CLS] token at the beginning of each sentence and then encode the sentence with BERT. By extracting the representation vector of the [CLS] token, we obtain the sentence representations $h_1 \in \mathbb{R}^{1 \times d}$ and $h_2 \in \mathbb{R}^{1 \times d}$ for the original sentences, and $h'_1 \in \mathbb{R}^{1 \times d}$ and $h'_2 \in \mathbb{R}^{1 \times d}$ for the complementary sentences, where $d$ is the embedding dimension. It is worth noting that, to reduce the parameter count and prevent over-fitting, all sentences share the same encoder.

Finally, we concatenate the semantic representation of each complementary sentence with that of its original sentence to obtain the enhanced sentence representations $h_{e1}$ and $h_{e2}$. This improves the model's ability to discern the semantic similarity between two sentences.

h_{e1} = [h_1; h'_1]    (1)
h_{e2} = [h_2; h'_2]    (2)
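
As a concrete illustration, the following minimal sketch shows how the shared-encoder [CLS] extraction and the concatenation of Eqs. (1)-(2) could be implemented, assuming the HuggingFace transformers API; the example sentences and variable names are purely illustrative and not taken from the paper.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")  # one encoder shared by all sentences

def encode_cls(sentence):
    # The tokenizer prepends [CLS]; we take its hidden state as the sentence vector (1 x d)
    batch = tokenizer(sentence, return_tensors="pt", truncation=True)
    return encoder(**batch).last_hidden_state[:, 0, :]

# h_1, h_2 for the original sentences; h'_1, h'_2 for their generated complements
h1, h2 = encode_cls("original sentence 1"), encode_cls("original sentence 2")
h1p, h2p = encode_cls("complement sentence 1"), encode_cls("complement sentence 2")

# Eqs. (1)-(2): enhanced representations by concatenation
h_e1 = torch.cat([h1, h1p], dim=-1)
h_e2 = torch.cat([h2, h2p], dim=-1)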

3.3 Graph Encoding Module

The keywords of a sentence contain the crucial information that the sentence intends to convey. By analyzing the relationships between keywords, the model can better infer the semantic relationship between two sentences. Therefore, to extract and represent these relationships, in this module we use the keywords of the original sentences to retrieve relevant knowledge words from an external knowledge source [4]. We then construct a graph with the keywords and knowledge words as nodes, where the edges represent the relationships between words. Finally, we use a GCN [16] to extract the relationship features between words and obtain a graph representation vector.

3.3.1 Knowledge Graph Construction

To construct the relationship graph, we first extract the keywords $k_1$ and $k_2$ from the two original sentences using the TextRank algorithm [17], where $k_1$ is the keyword of the original sentence $s_1$ and $k_2$ is the keyword of the original sentence $s_2$. Next, we retrieve the knowledge words $W = \{w_1, w_2, ..., w_n\}$ related to these keywords, together with their relevance scores, from the external knowledge source. Finally, we take the keywords and knowledge words as the nodes of the relationship graph, where each knowledge word is connected to its related keyword by an edge weighted by the relevance score. The relationship graph is defined as $G = (V, E, A)$, where $V = \{k_1, k_2, w_1, ..., w_n\}$ is the set of graph nodes, $E$ is the set of edges, and $A$ is the adjacency matrix of $G$.
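
To make the construction concrete, here is a hedged sketch of the keyword extraction and graph building, assuming jieba's TextRank implementation and networkx for the graph; the HowNet lookup is represented by a placeholder function, since the exact retrieval interface is not specified in the paper.

import jieba.analyse
import networkx as nx

def retrieve_knowledge_words(keyword, top_k=5):
    # Placeholder for the HowNet lookup: returns [(knowledge_word, relevance_score), ...].
    # A toy stub is used here so the sketch runs end to end.
    return [(f"{keyword}_related_{i}", 1.0 - 0.1 * i) for i in range(top_k)]

def build_knowledge_graph(s1, s2, top_k=5):
    # One keyword per original sentence via TextRank
    k1 = jieba.analyse.textrank(s1, topK=1)[0]
    k2 = jieba.analyse.textrank(s2, topK=1)[0]

    g = nx.Graph()
    g.add_nodes_from([k1, k2])
    for kw in (k1, k2):
        for word, score in retrieve_knowledge_words(kw, top_k):
            g.add_edge(kw, word, weight=score)   # edge weight = relevance score
    nodes = list(g.nodes)
    # Weighted adjacency matrix A consumed by the GCN in Section 3.3.2
    A = nx.to_numpy_array(g, nodelist=nodes, weight="weight")
    return nodes, A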

3.3.2 Graph Encoder

The Graph Convolutional Network (GCN) [16] is a multi-layer neural network that operates directly on graph-structured data and produces node representation vectors based on the neighborhood features of each node. With only one layer, a GCN captures information only from the direct neighbors of each node, so multiple layers are stacked to integrate information from more nodes and encode the entire graph. After constructing the relation graph, we obtain the initial node encodings $H$ from the sentence encoder, and then feed the adjacency matrix $A$ and the node encodings $H$ into the GCN.

H^{(i+1)} = \mathrm{ReLU}(\widetilde{A} H^{(i)} W^{(i)})    (3)
\widetilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}    (4)

where $H^{(i)} \in \mathbb{R}^{(n+2) \times d}$ is the representation of all nodes in the $i$-th GCN layer, $H^{(0)} = H$, $\widetilde{A}$ is the normalized symmetric adjacency matrix, $W^{(i)}$ is the weight matrix of the $i$-th GCN layer, and $D$ is the degree matrix of $A$ with $D_{ii} = \sum_{j} A_{ij}$.

Finally, we merge all the node representations outputted by the last layer of GCN to obtain the graph representation vector.

h_{graph} = \sum_{i=1}^{n} H^{last}_{i}    (5)

where $H^{last}$ is the representation of all nodes output by the last layer of the GCN.
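
As a reference point, the sketch below implements Eqs. (3)-(5) in PyTorch under simplifying assumptions (two layers, no bias, sum pooling over all nodes); it is an illustration of the graph encoder rather than the authors' exact implementation.

import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, dim, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(num_layers))

    @staticmethod
    def normalize(A):
        # Eq. (4): A_tilde = D^{-1/2} A D^{-1/2}, with D_ii = sum_j A_ij
        d_inv_sqrt = A.sum(dim=-1).clamp(min=1e-12).pow(-0.5)
        return d_inv_sqrt.unsqueeze(-1) * A * d_inv_sqrt.unsqueeze(0)

    def forward(self, A, H):
        A_tilde = self.normalize(A)
        for layer in self.layers:
            # Eq. (3): H^{(i+1)} = ReLU(A_tilde H^{(i)} W^{(i)})
            H = torch.relu(A_tilde @ layer(H))
        # Eq. (5): sum the node representations of the last layer into the graph vector
        return H.sum(dim=0)

# Usage: A is the (n+2) x (n+2) weighted adjacency matrix (as a float tensor) from
# Section 3.3.1, H the (n+2) x d initial node encodings; h_graph = GCNEncoder(768)(A, H)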

3.4 Aggregation Layer and Binary Classifier

During the prediction process, our model combines the representations of sentences and graphs to predict the similarity scores between the original sentences.

h_{final} = [h_{e1}; h_{e2}; |h_{e1} - h_{e2}|; h_{graph}]    (6)
p = \mathrm{Sigmoid}(w^T h_{final} + b)    (7)
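
A minimal sketch of Eqs. (6)-(7), assuming the dimensions introduced above ($h_{e1}, h_{e2} \in \mathbb{R}^{1 \times 2d}$ and $h_{graph} \in \mathbb{R}^{d}$); the layer sizes are illustrative.

import torch
import torch.nn as nn

d = 768
classifier = nn.Linear(3 * 2 * d + d, 1)   # implements w^T h_final + b

def match_probability(h_e1, h_e2, h_graph):
    # Eq. (6): concatenate the two enhanced sentence vectors, their element-wise
    # absolute difference, and the graph vector
    h_final = torch.cat([h_e1, h_e2, (h_e1 - h_e2).abs(), h_graph.unsqueeze(0)], dim=-1)
    # Eq. (7): a sigmoid over a linear projection gives the matching probability p
    return torch.sigmoid(classifier(h_final))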

3.5 Contrastive Learning

To help the model capture the shared features and differentiate the non-shared features between original and complement text, we propose a semi-supervised contrastive learning strategy [18] over the enhanced representations. We treat the encoded vectors of an original sentence and its corresponding complement sentence as a positive pair, since they are semantically very close. Moreover, to obtain more negative examples, we regard the other complement sentences in the same batch as negatives for the original sentence. By optimizing the contrastive objective, the model reduces the distance between the encoded vectors of an original sentence and its corresponding complement while enlarging the distance between the original sentence and other, unrelated sentences. As a result, the final model can better distinguish the shared semantic parts between original and complement sentences.

To be more specific, we first project the encoded vectors of the original sentences and the complement sentences through a feed-forward network (FNN) to obtain the vectors used for contrastive learning.

\hat{h}_1 = \mathrm{FNN}(h_1)    (8)
\hat{h}_2 = \mathrm{FNN}(h_2)    (9)
\hat{h}'_1 = \mathrm{FNN}(h'_1)    (10)
\hat{h}'_2 = \mathrm{FNN}(h'_2)    (11)

The training objective is to minimize the semantic distance between the original sentence and its complement sentence, while maximizing the distance between unrelated sentences. We utilize the InfoNCE loss [19] as the training loss and cosine similarity to measure the semantic distance between sentences.

\mathcal{L}_{\text{contrast1}} = -\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(\hat{h}_{i,1}, \hat{h}'_{i,1})/\tau}}{\sum_{j=1}^{BatchSize} e^{\operatorname{sim}(\hat{h}_{i,1}, h'_{j,1})/\tau}}    (12)
\mathcal{L}_{\text{contrast2}} = -\sum_{i=1}^{N} \log \frac{e^{\operatorname{sim}(\hat{h}_{i,2}, \hat{h}'_{i,2})/\tau}}{\sum_{j=1}^{BatchSize} e^{\operatorname{sim}(\hat{h}_{i,2}, h'_{j,2})/\tau}}    (13)

where $N$ is the total number of samples, $h'_{j,k}$ denotes the encoding of the complement of the $k$-th sentence of the $j$-th sample in the current batch, and $\tau$ is the temperature hyperparameter, which controls how sharply the model separates negative samples.
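
A hedged PyTorch sketch of Eqs. (12)-(13) with in-batch negatives is given below; the temperature value is an assumption, not a value reported in the paper.

import torch
import torch.nn.functional as F

def info_nce(h_orig, h_comp, tau=0.05):
    # h_orig, h_comp: (batch_size, d) projected vectors of original / complement sentences
    h_orig = F.normalize(h_orig, dim=-1)
    h_comp = F.normalize(h_comp, dim=-1)
    sim = h_orig @ h_comp.t() / tau          # sim(i, j) = cosine similarity / temperature
    # Row i has its positive (the sentence's own complement) on the diagonal;
    # the other complements in the batch act as negatives, as in Eqs. (12)-(13).
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)

# L_contrast1 = info_nce(h1_hat, h1p_hat); L_contrast2 = info_nce(h2_hat, h2p_hat)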

3.6 Loss Function

We apply the BCE loss to optimize the sentence matching task during the training process.

\mathcal{L}_{\text{binary}} = -\sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]    (14)

The final total training loss function:

\mathcal{L} = \alpha \mathcal{L}_{\text{binary}} + \beta \mathcal{L}_{\text{contrast1}} + \gamma \mathcal{L}_{\text{contrast2}}    (15)

where $\alpha$, $\beta$, and $\gamma$ are hyperparameters set to 0.8, 0.1, and 0.1, respectively.
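
Putting the pieces together, the following sketch combines Eqs. (14)-(15) with the weights stated above; `p`, `y`, and the two contrastive terms are assumed to come from the modules sketched earlier, and the library's mean-reduced BCE is used in place of the explicit sum.

import torch.nn.functional as F

ALPHA, BETA, GAMMA = 0.8, 0.1, 0.1

def total_loss(p, y, l_contrast1, l_contrast2):
    # Eq. (14): binary cross-entropy between the predicted matching probability and the label
    l_binary = F.binary_cross_entropy(p.squeeze(-1), y.float())
    # Eq. (15): weighted sum of the matching loss and the two contrastive losses
    return ALPHA * l_binary + BETA * l_contrast1 + GAMMA * l_contrast2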

3.7 Generating Complementary Sentences

To generate high-quality complement sentences, we constructed a dataset for training the SimBERT model, which we then use to generate complementary sentences for existing data. SimBERT modifies the attention mask mechanism of the Transformer [20], making it efficient at generating similar sentences.

Specifically, during the data collection stage, we collected user query statements and the corresponding advertisements returned by the search engines (UC and Quark). Advertisements with low click-through rates were considered irrelevant to the user's search intent and query, so we filtered out ad texts whose click-through rate fell below a threshold $K$ and paired the remaining ad texts with their corresponding queries to form sentence pairs. During the training phase, we obtained a total of 30,000k sentence pairs for model training. Finally, in the testing phase, the trained SimBERT model was frozen and used to generate complement sentences for the original sentences in the public datasets.
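
A hedged sketch of this pair-construction step is shown below; the record fields and the form of the threshold $K$ (a click-through-rate cutoff) are illustrative assumptions.

def build_simbert_pairs(ad_logs, k=0.05):
    # ad_logs: iterable of records with 'query', 'ad_text' and 'ctr' fields (assumed schema)
    pairs = []
    for log in ad_logs:
        if log["ctr"] >= k:                    # drop low click-through-rate ads as irrelevant
            pairs.append((log["query"], log["ad_text"]))
    return pairs

# The resulting (query, ad_text) pairs are used to fine-tune SimBERT as a
# similar-sentence generator.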

4 Experiments

4.1 Experimental Settings

We applied Bert-base-chinese as our encoder and AdamW as the optimizer during the training phase. In addition, we implemented a hierarchical learning rate strategy, with a learning rate of 1e-4 for the pre-training model and 2e-5 for the other parts. The model was trained for 30 epochs on a GPU server with a 3080 GPU, using a batch size of 16. Early stopping was employed.
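
The hierarchical learning-rate setup described above can be expressed with optimizer parameter groups, as in the sketch below; the attribute name `model.bert` is an assumption about the implementation.

import torch

def build_optimizer(model):
    bert_params = list(model.bert.parameters())          # pre-trained encoder
    bert_ids = {id(p) for p in bert_params}
    other_params = [p for p in model.parameters() if id(p) not in bert_ids]
    return torch.optim.AdamW([
        {"params": bert_params, "lr": 1e-4},   # learning rate for the pre-trained model
        {"params": other_params, "lr": 2e-5},  # learning rate for the remaining modules
    ])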

4.2 Dataset

To validate the effectiveness of the proposed KSTM model, we conducted experiments on two publicly available Chinese datasets, i.e., BQ [21] and LCQMC [22].

Bank Question(BQ): This is a large-scale question matching dataset related to the banking and finance domain, which has been widely used in sentence matching tasks. It contains a total of 120k sentence pairs, with the training, validation, and test sets containing 100k, 10k, and 10k sentence pairs, respectively.

Large-scale Chinese Question Matching Corpus (LCQMC): This is a semantic matching dataset of questions collected from the Baidu Knowledge website. The dataset focuses on matching intentions rather than meanings. It contains a total of 260k sentence pairs, with the training, validation, and test sets containing 239k, 8.4k, and 12.5k sentence pairs, respectively.

4.3 Metrics

Following previous research, we evaluated our approach on the two publicly available datasets using the same evaluation metrics as previous studies, i.e., ACC and F1, computed as follows,

ACC = \frac{TP + TN}{TP + TN + FP + FN}    (16)
P = \frac{TP}{TP + FP}    (17)
R = \frac{TP}{TP + FN}    (18)
F1 = \frac{2 \cdot P \cdot R}{P + R}    (19)

where TPTP and FPFP respectively represent the number of cases correctly predicted as positive and the number of cases incorrectly predicted as positive. TNTN and FNFN respectively represent the number of cases correctly predicted as negative and the number of cases incorrectly predicted as negative.
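
For completeness, a direct transcription of Eqs. (16)-(19) in code:

def accuracy_and_f1(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)               # Eq. (16)
    precision = tp / (tp + fp)                          # Eq. (17)
    recall = tp / (tp + fn)                             # Eq. (18)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (19)
    return acc, f1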

4.4 Baselines

To demonstrate the effectiveness of our approach, we compared the KSTM model with the following strong baseline models, all of which use BERT as the encoder.

Glyce[23]: A glyph-based model for Chinese NLP tasks.

GMN[8]: A sentence matching framework capable of dealing with multi-granular input information.

LET[2]: A short Text Matching model using external knowledge to eliminate sentence ambiguity.

DSSTM[9]: Combining the relative distance information of words, a distance-aware self-attention and multi-level matching model is proposed to solve the Chinese short Text Matching task.

OTE[3]: A short Text Matching model based on external knowledge, combining SoftLexicon with hybrid data augmentation to obtain multi-granularity features.

CBM[1]: This method proposes to use external knowledge to enhance the semantic information of the original short text.

4.5 Experimental Results

We compared our KSTM model with other short Text Matching models that use pre-trained BERT models as encoders on the two datasets, using F1 and accuracy as evaluation metrics. As shown in Table 1, the KSTM model outperforms all baseline models on both datasets. On the BQ dataset, KSTM achieves 1.46% higher accuracy and 1% higher F1 than the previous SOTA model CBM. On the LCQMC dataset, KSTM outperforms the previous best results by 0.1% in accuracy (DSSTM) and 1.1% in F1 (CBM). This demonstrates the effectiveness of our proposed KSTM model.

Table 1: Performance comparison results of KSTM and baseline models on BQ and LCQMC datasets.
Model         BQ ACC   BQ F1   LCQMC ACC   LCQMC F1
Glyce         85.80    85.50   88.70       88.80
LET           85.30    84.98   88.38       88.85
GMN           85.60    85.50   87.30       88.00
DSSTM         85.40    -       88.90       -
OTE           85.26    84.77   86.68       88.29
CBM           86.16    87.44   88.80       89.10
KSTM (ours)   87.62    88.44   89.00       90.20

4.6 Ablation Experiment

Table 2: Ablation experiment results of the KSTM model.
Model                         BQ ACC   BQ F1   LCQMC ACC   LCQMC F1
w/o Contrastive Learning      86.10    87.50   88.10       88.80
w/o Knowledge Graphs          85.30    85.88   85.38       86.85
w/o Complementary Sentence    85.68    85.50   84.30       86.10
KSTM (ours)                   87.62    88.44   89.00       90.20

In order to demonstrate the effectiveness of the components of KSTM, we conducted three ablation experiments on the BQ and LCQMC datasets, as shown in Table 2. The first experiment verifies the effectiveness of contrastive learning: we removed the contrastive learning between the original sentence and its complementary sentence and used only binary cross-entropy as the overall loss function. The second experiment measures the performance gain from introducing external knowledge: we did not use the external knowledge graph or the graph encoding, and used only the enhanced sentence encodings for sentence matching. The third experiment demonstrates the effectiveness of the complementary sentences we constructed. Following the unsupervised contrastive learning approach of SimCSE [24], we replaced the complementary sentences with the original sentences: each original sentence was fed into the model twice, producing two encodings that differ due to dropout, and these two encodings were used as the positive pair for contrastive learning.

The experimental results demonstrate that removing any part of the model has a negative impact on its performance. Specifically, when we remove contrastive learning, the model's accuracy and F1 score on the BQ dataset decrease by 1.52% and 0.94%, respectively; on the LCQMC dataset, they decrease by 0.9% and 1.4%, respectively. When we remove the knowledge graph, the model's accuracy and F1 score on the BQ dataset decrease by 2.32% and 2.56%, respectively; on the LCQMC dataset, they decrease by 3.62% and 3.35%, respectively. When we remove the complementary sentences and use unsupervised contrastive learning instead, the model's accuracy and F1 score on the BQ dataset decrease by 1.94% and 2.94%, respectively; on the LCQMC dataset, they decrease by 4.7% and 4.1%, respectively.

Comparing the three ablation strategies, we find that removing contrastive learning has a smaller impact on the model than removing external knowledge. We believe this is because the role of contrastive learning is to guide the model toward more semantically meaningful encodings: it encourages the model to map texts with similar semantics to vectors that are close in space and texts with different semantics to vectors that are far apart. For short-text datasets, pre-trained models already have strong encoding capabilities for understanding and encoding semantic information, so the gain from contrastive learning is smaller. In addition, the complementary sentences and the knowledge graph both have a significant positive effect on model performance. This is because both introduce external knowledge, which largely alleviates the lack of information the model faces when judging whether two short sentences match. Comparing the results on the two datasets, removing the knowledge graph or the complementary sentences causes a greater performance drop on the LCQMC dataset than on the BQ dataset. We believe this is because sentence matching in the LCQMC dataset focuses more on sentence intent than on literal meaning, which requires the model to have more information to understand the sentence and infer its intent. Therefore, the use of external knowledge yields a more significant improvement on models trained on this dataset.

5 Conclusion

In this paper, we propose a model (KSTM) that combines external knowledge and contrastive learning to address the problem of insufficient semantic information in short Text Matching tasks. We introduce external knowledge in two ways: generating complement sentences with the SimBERT model, and constructing a graph from keywords and knowledge words. Additionally, we use contrastive learning to improve the model's ability to encode sentence representations. We validate the KSTM model on two Chinese short Text Matching datasets, and the results show that KSTM outperforms the previous baseline models. In future work, we plan to introduce more powerful models such as ChatGPT to generate more accurate complement sentences. Additionally, when constructing the knowledge graph, we will design a refined filtering mechanism to filter out retrieved knowledge words that are unrelated to the keywords, thereby reducing the influence of noise.

References

  • [1] Mao Yan Chen, Haiyun Jiang, and Yujiu Yang. Context enhanced short text matching using clickthrough data. arXiv preprint arXiv:2203.01849, 2022.
  • [2] Boer Lyu, Lu Chen, Su Zhu, and Kai Yu. Let: Linguistic knowledge enhanced graph transformer for chinese short text matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13498–13506, 2021.
  • [3] Haoyang Ma, Zhaoyun Ding, Zeyu Li, and Hongyu Guo. Ote: An optimized chinese short text matching algorithm based on external knowledge. In Knowledge Science, Engineering and Management: 15th International Conference, KSEM 2022, Singapore, August 6–8, 2022, Proceedings, Part I, pages 15–30. Springer, 2022.
  • [4] Zhendong Dong and Qiang Dong. Hownet-a hybrid language and knowledge resource. In International conference on natural language processing and knowledge engineering, 2003. Proceedings. 2003, pages 820–824. IEEE, 2003.
  • [5] Zhe Hu, Zuohui Fu, Yu Yin, and Gerard de Melo. Context-aware interaction network for question matching. arXiv preprint arXiv:2104.08451, 2021.
  • [6] Zhe Hu, Zuohui Fu, Cheng Peng, and Weiwei Wang. Enhanced sentence alignment network for efficient short text matching. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 34–40, 2020.
  • [7] Yan Li, Chenliang Li, and Junjun Guo. Adaptive feature discrimination and denoising for asymmetric text matching. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1146–1156, 2022.
  • [8] Lu Chen, Yanbin Zhao, Boer Lyu, Lesheng Jin, Zhi Chen, Su Zhu, and Kai Yu. Neural graph matching networks for chinese short text matching. In Proceedings of the 58th annual meeting of the Association for Computational Linguistics, pages 6152–6158, 2020.
  • [9] Yao Deng, Xianfeng Li, Mengyan Zhang, Xin Lu, and Xia Sun. Enhanced distance-aware self-attention and multi-level match for sentence semantic matching. Neurocomputing, 501:174–187, 2022.
  • [10] Pengyu Zhao, Wenpeng Lu, Shoujin Wang, Xueping Peng, Ping Jian, Hao Wu, and Weiyu Zhang. Multi-granularity interaction model based on pinyins and radicals for chinese semantic matching. World Wide Web, 25(4):1703–1723, 2022.
  • [11] Yicheng Zou, Hongwei Liu, Tao Gui, Junzhe Wang, Qi Zhang, Meng Tang, Haixiang Li, and Daniel Wang. Divide and conquer: Text semantic matching with disentangled keywords and intents. arXiv preprint arXiv:2203.02898, 2022.
  • [12] Xin Lu, Yao Deng, Ting Sun, Yi Gao, Jun Feng, Xia Sun, and Richard Sutcliffe. Mkpm: Multi keyword-pair matching for natural language sentences. Applied Intelligence, 52(2):1878–1892, 2022.
  • [13] Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil Mirylenka, and Aliaksei Severyn. Encode, tag, realize: High-precision text editing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5054–5065, 2019.
  • [14] Jason Wei and Kai Zou. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, 2019.
  • [15] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • [16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.
  • [17] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, pages 404–411, 2004.
  • [18] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in neural information processing systems, 33:18661–18673, 2020.
  • [19] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [21] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. The bq corpus: A large-scale domain-specific chinese corpus for sentence semantic equivalence identification. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 4946–4951, 2018.
  • [22] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. Lcqmc: A large-scale chinese question matching corpus. In Proceedings of the 27th international conference on computational linguistics, pages 1952–1962, 2018.
  • [23] Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and Jiwei Li. Glyce: Glyph-vectors for chinese character representations. Advances in Neural Information Processing Systems, 32, 2019.
  • [24] Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. In 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 6894–6910. Association for Computational Linguistics (ACL), 2021.