
1 University of Information Technology, Ho Chi Minh City, Vietnam
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 National Institute of Informatics, Tokyo, Japan
Email: [email protected], [email protected], [email protected]

Multi-stage Information Retrieval for Vietnamese Legal Texts

Nhat-Minh Pham 1,2    Ha-Thanh Nguyen 3    Trong-Hop Do 1,2
Abstract

This study deals with the problem of information retrieval (IR) for Vietnamese legal texts. Despite being well researched in many languages, information retrieval has still received little attention from the Vietnamese research community, especially for legal documents, which are hard to process. This study proposes a new approach to information retrieval for Vietnamese legal documents using sentence-transformers. In addition, various experiments are conducted to compare different transformer models, ranking scores, and syllable-level versus word-level training. The experimental results show that the proposed model outperforms the models used in current research on information retrieval for Vietnamese documents.

Keywords:
Sentence-transformers, information retrieval, Vietnamese legal texts, retrieval systems.

1 Introduction

Legal information retrieval is a specialized task in natural language processing (NLP) that involves retrieving the legal documents relevant to a given query (in this paper, "text" and "document" are used interchangeably). Compared with traditional text retrieval, legal text retrieval is more difficult since legal documents are often long and complicated. Moreover, the questions are often highly complex and require a person with solid expertise to give the correct answer.

Legal information retrieval for English documents has received a lot of attention: COLIEE (https://sites.ualberta.ca/~rabelo/COLIEE2022/) is an annual competition on legal information extraction/entailment, and JURIX (http://jurix.nl/conferences/) is an annual conference on legal AI. However, there are very few studies on Vietnamese legal information retrieval. Recent studies of Vietnamese legal text retrieval have mainly relied on CNN architectures combined with attention mechanisms [9, 14] or on fine-tuned multilingual transformer models such as XLM-RoBERTa [14]. In recent years, monolingual transformer models like PhoBERT [13] and ViBERT [21] have beaten the state of the art (SOTA) on many Vietnamese NLP tasks, including part-of-speech (POS) tagging, named-entity recognition (NER), and dependency parsing (DP). Sentence-BERT (SBERT) [15] and models evolved from it [20] have recently achieved significant results on tasks such as semantic search and information retrieval. In this paper, we focus on exploring fine-tuned monolingual sentence-transformer models for Vietnamese legal text retrieval.

The contribution of this paper is two-fold. First, we propose a novel multi-stage information retrieval pipeline based on sentence-transformers for Vietnamese legal texts. Experiments were conducted, and the results show that the proposed pipeline significantly outperforms existing models. Second, we empirically analyze the effects of multiple factors, including the language model (LM), the word segmentation method, and the ranking score, on retrieval performance. The results show that the SPhoBERT-large language model, word-based segmentation, and a system with a combined ranking score yield the best results in information retrieval for Vietnamese legal documents.

2 Background and Related Work

In this paper, we focus on answering legal queries at the article level. Given a legal query, our goal is to retrieve all the relevant articles that can serve as the basis for answering the query.

Many approaches, especially for ad-hoc text retrieval, have been proposed from past decades to recent years.

2.0.1 Non-neural approaches

These methods decide relevance based on the frequency and occurrence of words in the query and the documents. Non-neural approaches do not perform well on semantic search problems because of lexical mismatch between queries and answers. However, they remain useful alongside current SOTA models. BM25 [16] and tf-idf [18] are the best known of this family, while BM25+ [17] is currently the most effective. To overcome the lexical mismatch challenge, dense embeddings can be used to represent queries and documents; the main idea of this method was proposed with the LSI approach [3].
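To illustrate the kind of lexical scoring this family uses, here is a minimal self-contained BM25+ scorer. This is a simplified sketch: production code would use a library such as rank_bm25, and the smoothed IDF here is one common variant, not the exact formulation of [17].

```python
import math

def bm25_plus_scores(query, docs, k1=1.5, b=0.75, delta=1.0):
    """Score tokenized documents against a tokenized query with BM25+.

    BM25+ adds a lower bound `delta` to the term-frequency component so
    that long documents containing a query term are never under-scored.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}  # doc frequency
    scores = []
    for d in docs:
        s = 0.0
        for t in query:
            tf = d.count(t)
            if tf == 0:
                continue
            idf = math.log((N + 1) / (df[t] + 0.5))   # smoothed IDF variant
            K = k1 * (1 - b + b * len(d) / avgdl)     # document-length normalization
            s += idf * ((k1 + 1) * tf / (K + tf) + delta)
        scores.append(s)
    return scores

corpus = [["quyền", "sử", "dụng", "đất", "nông", "nghiệp"],
          ["thi", "hành", "án", "dân", "sự"],
          ["đấu", "giá", "quyền", "sử", "dụng", "đất", "lâu", "dài"]]
scores = bm25_plus_scores(["quyền", "sử", "dụng", "đất"], corpus)
```

The first and third documents share all query terms, but the shorter one scores higher due to length normalization; the second document, sharing no terms, scores zero, which is exactly the lexical-mismatch weakness described above.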

2.0.2 Attention mechanism approaches

The attention mechanism enables a model to focus on the main keywords or the most informative sentences in the original text. Kien et al. designed a simple attentive convolutional neural network for Vietnamese legal text retrieval [9]. Nguyen also used this method in his dissertation [14]. Their retrieval system can capture both local and global contexts to build representation vectors.

2.0.3 Transformer cross-encoder approaches

BERT [4] brought breakthroughs in NLP. The cross-encoder approach was the first widely adopted application of BERT in information retrieval: the query and the document are passed simultaneously through the BERT network. Birch [1] is a system that combines lexical matching with the cross-encoder approach for better performance. Several other studies [11, 8, 10] applied the cross-encoder approach and achieved significant results, and BERT-PLI [19] is a retrieval system built on it. However, these approaches require a lot of training time and costly computational resources.

2.0.4 Transformer bi-encoder approaches

Cross-encoder approaches are costly and time-consuming to train. Motivated by this, Reimers and Gurevych presented SBERT, which uses a Siamese BERT network to produce semantically meaningful sentence embeddings. SBERT encodes each sentence independently, so comparing two sentences only requires computing the cosine similarity of their two existing vectors. For question answering (QA) and IR tasks, many studies have extended or applied SBERT [23, 2] with high performance. According to recent research by Gao and Callan [5], a standard LM's internal attention structure is not ready to use for dense encoders; the authors proposed a new transformer architecture, Condenser, as an improvement for the bi-encoder approach. For Vietnamese, to the best of our knowledge, there is only one paper [6] on the bi-encoder approach, plus a solution (https://github.com/CuongNN218/zalo_ltr_2021) from an annual artificial intelligence competition.
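The bi-encoder idea can be illustrated with a toy sketch: each text is embedded once, independently, and comparison reduces to a cheap cosine computation. The three-dimensional vectors below are stand-ins for real sentence-transformer embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# A bi-encoder embeds each article ONCE; at query time only the cheap
# cosine comparison is repeated, unlike a cross-encoder, which must run
# the full network for every (query, article) pair.
article_embeddings = {
    "art_1": [0.9, 0.1, 0.3],   # toy embedding, not real model output
    "art_2": [0.2, 0.8, 0.5],
}
query_embedding = [0.8, 0.2, 0.4]

ranked = sorted(article_embeddings,
                key=lambda a: cosine(query_embedding, article_embeddings[a]),
                reverse=True)
```

Here `art_1`, whose vector points in nearly the same direction as the query, is ranked first.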

3 Sentence-transformer-based Multi-stage Information Retrieval for Vietnamese Legal Texts

The general idea of our approach is to use both lexical matching and semantic search to improve the performance of our system. For lexical matching, we used the BM25+ implementation in the rank_bm25 package (https://pypi.org/project/rank-bm25/). For semantic search, we trained sentence-transformer models with contrastive learning. For each query in the training dataset, the relevant articles are labeled 1 and the negative samples are labeled 0. To obtain the negative samples, we took the top-k highest-ranked articles from the previous training round, so each query yields k + (number of positive articles) positive (pos) and negative (neg) pairs. We first used BM25+ and then trained the sentence-transformer model for three rounds. Our training pipeline is shown in Figure 1.

Figure 1: Our proposed pipeline for training
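The negative-sampling rule in the training pipeline (take the top-k highest-ranked articles that are not relevant as hard negatives, label positives 1 and negatives 0) can be sketched as follows; the helper names are ours, not from any library.

```python
def mine_hard_negatives(ranked_ids, positive_ids, k):
    """Take the top-k highest-ranked articles that are NOT relevant
    as hard negatives for one query (selection rule described above)."""
    negatives = []
    for art_id in ranked_ids:          # ranked by the previous round's scorer
        if art_id not in positive_ids:
            negatives.append(art_id)
        if len(negatives) == k:
            break
    return negatives

def build_pairs(query, ranked_ids, positive_ids, k):
    """Label positives 1 and mined negatives 0, giving
    k + len(positive_ids) training pairs per query."""
    pairs = [(query, p, 1) for p in positive_ids]
    pairs += [(query, n, 0) for n in mine_hard_negatives(ranked_ids, positive_ids, k)]
    return pairs

# Round 1 would rank with BM25+; later rounds with the previous model.
pairs = build_pairs("q1", ["a3", "a1", "a7", "a2"], {"a1"}, k=2)
```

For round 1 the ranking comes from BM25+; for rounds 2 and 3 it comes from the sentence-transformer trained in the previous round.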

4 Experiments

4.1 Datasets

The dataset used in our paper comes from the original dataset published by Kien et al. [9]. We cleaned the data to reduce noise in the legal corpus: the removed laws and articles do not appear in the QA dataset, and the QA dataset itself is unchanged, so data cleaning does not affect the evaluation results. Finally, we obtained a legal corpus containing 8,436 documents with 114,177 articles. In the legal corpus, we concatenated the title and text of each article into one "very long" sentence.

Table 1: Distribution of article lengths by syllables
Length    Amount    Proportion
< 100     32,950    28.86%
101-256   43,923    38.47%
257-512   23,799    20.84%
513+       9,483    11.83%

Table 2: Distribution of article lengths by words
Length    Amount    Proportion
< 100     43,763    38.33%
101-256   42,800    37.48%
257-512   18,636    16.33%
513+       6,111     7.86%

Unlike English, Vietnamese syllable tokens and word tokens are different. For example, the 4-syllable written text "Bài báo khoa học" (scientific paper) forms 2 words: "Bài_báo" (paper) and "khoa_học" (scientific). We used VnCoreNLP [22] for word segmentation. We divided the corpus into 4 length buckets: < 100, 101-256, 257-512, and 513+. 256 is the maximum number of tokens supported by PhoBERT, while 512 is the maximum number supported by ViBERT. Tables 1 and 2 show the percentage of articles in each bucket.
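The syllable-to-word merging that a segmenter like VnCoreNLP performs can be illustrated with a toy greedy segmenter; the two-syllable lexicon below is a stand-in for VnCoreNLP's real model, which is far more sophisticated.

```python
def word_segment(syllables, compounds):
    """Greedily merge known 2-syllable compounds into single word tokens,
    joining them with "_" as Vietnamese word-segmented corpora do."""
    out, i = [], 0
    while i < len(syllables):
        if i + 1 < len(syllables) and (syllables[i], syllables[i + 1]) in compounds:
            out.append(syllables[i] + "_" + syllables[i + 1])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return out

syllables = ["Bài", "báo", "khoa", "học"]        # 4 syllable tokens
compounds = {("Bài", "báo"), ("khoa", "học")}    # toy lexicon
words = word_segment(syllables, compounds)       # 2 word tokens
```

Word segmentation roughly halves the token count here, which is why the word-level buckets in Table 2 skew shorter than the syllable-level ones in Table 1.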

For the QA dataset, we found that only 1,709 articles (about 1.5% of all articles in the legal corpus) appear as answers, and about 95% of questions in both the training and testing sets have one to three relevant articles.

4.2 Experimental procedure

We used a single NVIDIA Tesla P100 GPU via Google Colaboratory to train all the models. Figure 2 presents an overview of the experimental procedure in this paper.

Figure 2: Experiment procedure

4.2.1 Monolingual pre-trained models

  • PhoBERT's pre-training approach is based on RoBERTa [12], which uses dynamic masking to create masked tokens. Because both PhoBERT versions are trained on a large-scale Vietnamese dataset, they achieve great results on several Vietnamese NLP tasks such as NER, POS tagging, and DP.

  • ViBERT leverages a checkpoint from mBERT [4] and continues training on 10 GB of Vietnamese news data. ViBERT's pre-training approach is based on BERT, whose architecture is a multilayer bidirectional Transformer encoder. ViBERT supports sequence lengths of up to 512 tokens, an advantage over the 256 tokens supported by PhoBERT.

4.2.2 Word-level and syllable-level approaches

In a field full of semantic challenges such as law, we want to evaluate how Vietnamese words versus syllables affect the results of the models. We trained the transformer models on word-level and syllable-level data independently in each step below.

4.2.3 Fine-tuning monolingual language models

We used our legal corpus to fine-tune the masked language modeling objective of the monolingual language models. We set gradient_accumulation_steps = 4, train_batch_size = 8, eval_batch_size = 8, and epochs = 20.

4.2.4 Fine-tuning Condenser

After the masked language modeling training, we used both versions of PhoBERT to train Condenser. We used the official source code (https://github.com/luyug/Condenser) to create data in Condenser's format and to fine-tune Condenser from PhoBERT.

4.2.5 Lexical matching

For lexical matching, we used BM25+. VnCoreNLP was used for word segmentation, and stopwords were removed to make the model focus on important words. The goal of this step is to obtain, for each query, the list of articles with the highest lexical similarity.

4.2.6 Training sentence-transformer

We trained the sentence-transformer for 3 rounds using the negative-sampling method described above; for short, we call the number of negative samples in each round k. In the first round, we picked 35 negative samples from BM25+; these negatives mostly have high lexical similarity to the queries but low semantic similarity. In the second round, we picked 20 negative samples from the first round; these have better semantic similarity to the queries. In the third round, we picked 15 negative samples from the second round; these are semantically very close to the queries. We set batch_size = 8, epochs = 4, and learning_rate = 1e-5. For the loss function, we used contrastive loss [7].
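The contrastive loss of Hadsell et al. [7] for one labeled pair can be written directly; the 0.5 scaling and the margin value below are common implementation choices, not values specified in this paper.

```python
def contrastive_loss(distance, label, margin=0.5):
    """Contrastive loss for one (query, article) pair.

    label = 1 (relevant): pulls the pair together -> penalty distance^2
    label = 0 (negative): pushes the pair apart   -> penalty max(0, margin - distance)^2
    """
    if label == 1:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2
```

Hard negatives matter under this loss: a negative already farther away than the margin contributes zero gradient, so only lexically similar but irrelevant articles, like those mined from BM25+, actually push the encoder to learn semantic distinctions.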

4.2.7 Evaluation metrics

The first measure we use to evaluate system performance is (macro) recall@20, where 20 is the number of top selected articles. The second evaluation metric is the F2 measure, defined below:

\[
\mathrm{Precision}_i = \frac{\text{number of correctly retrieved articles of query } i}{\text{number of retrieved articles of query } i}
\]
\[
\mathrm{Recall}_i = \frac{\text{number of correctly retrieved articles of query } i}{\text{number of relevant articles of query } i}
\]
\[
\mathrm{F2}_i = \frac{5 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{4 \cdot \mathrm{Precision}_i + \mathrm{Recall}_i}
\]
\[
\mathrm{F2} = \text{average of } \mathrm{F2}_i \text{ over all queries}
\]
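The metrics translate directly into code; this is a straightforward sketch with variable names of our choosing.

```python
def f2_per_query(retrieved, relevant):
    """F2 for one query, weighting recall four times as heavily as precision."""
    correct = len(set(retrieved) & set(relevant))
    if correct == 0:
        return 0.0
    precision = correct / len(retrieved)
    recall = correct / len(relevant)
    return 5 * precision * recall / (4 * precision + recall)

def macro_recall_at_k(results, k=20):
    """Mean per-query recall over the top-k retrieved articles.
    `results` is a list of (retrieved_ids, relevant_ids) pairs."""
    total = 0.0
    for retrieved, relevant in results:
        hit = len(set(retrieved[:k]) & set(relevant))
        total += hit / len(relevant)
    return total / len(results)
```

For example, retrieving ["a", "b"] when only "a" is relevant gives precision 0.5 and recall 1.0, hence F2 = 5/6, illustrating how F2 rewards recall.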

4.3 Results of BM25+ and sentence-transformer

The models we selected can only compute scores for query-article pairs, and the F2 of a model decreases if we select too many articles, so to get good F2 results we have to set a suitable threshold. Let highest_score be the highest score between a query and all articles; an article is chosen if its score lies in the range [highest_score - threshold, highest_score]. For short, in the tables below we write SConPBB and SConPBL for the sentence-transformer Condenser trained from PhoBERT-base and PhoBERT-large, respectively.
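The threshold rule can be sketched as a small helper; the names are illustrative.

```python
def select_by_threshold(scores, threshold):
    """Keep every article whose score falls within `threshold` of the
    single best score, i.e. in [highest_score - threshold, highest_score]."""
    highest = max(scores.values())
    return [art for art, s in scores.items() if s >= highest - threshold]

# Two near-tied candidates survive; the clearly worse one is dropped.
chosen = select_by_threshold({"a1": 0.91, "a2": 0.87, "a3": 0.40}, threshold=0.05)
```

A small threshold keeps precision high (fewer retrieved articles) at the cost of recall, which is the trade-off tuned for F2 here.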

Table 3: Results of BM25+ and sentence-transformer
Model Recall@20 F2
BM25+ 0.557 0.221
SPhoBERT-base(syl) 0.944 0.700
SConPBB(syl) 0.953 0.697
SPhoBERT-large(syl) 0.940 0.715
SConPBL(syl) 0.953 0.719
SViBERT(syl) 0.942 0.679
SPhoBERT-base(word) 0.954 0.721
SConPBB(word) 0.957 0.704
SPhoBERT-large(word) 0.954 0.727
SConPBL(word) 0.954 0.724
SViBERT(word) 0.921 0.669

The results in Table 3 show that the sentence-transformer models outperform BM25+. However, we want to build a retrieval system that uses both lexical matching and semantic search, so we consider combining the scores of the two approaches. Since more than 95% of questions have one to three relevant articles, we visualized the average of the top 3 highest BM25+ scores.

Figure 3: Average score of top 3 highest scores of BM25+

4.4 Retrieval system using BM25+ and sentence-transformer

A sentence-transformer ranks using the cosine similarity (cos_sim) of the query and article vectors, and cos_sim always lies in [-1, 1]. As Figure 3 shows, the average BM25+ score is much larger than cos_sim, while the semantic approach accounts for most of the performance gap. A combined ranking score of the form (1 - α) * lexical_score + α * semantic_score, as in previous work, will therefore not work well here, or takes much time to find a good α. Instead, we propose two combined scores, sqrt(BM25_score) * cos_sim and BM25_score * cos_sim, for ranking in our retrieval systems. As with a single model, we used a suitable threshold for each retrieval system to get the best F2 results. For short, we name each retrieval system after the sentence-transformer model within it.
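The two proposed combined ranking scores differ only in whether the BM25+ score is square-rooted before multiplication; a minimal sketch:

```python
import math

def combined_score(bm25_score, cos_sim, use_sqrt=True):
    """Combine lexical and semantic scores multiplicatively.

    The sqrt variant damps BM25+'s much larger scale so the semantic
    score is not drowned out; the plain product keeps the raw scale.
    """
    lexical = math.sqrt(bm25_score) if use_sqrt else bm25_score
    return lexical * cos_sim
```

With bm25_score = 9.0 and cos_sim = 0.5, the sqrt variant gives 1.5 while the plain product gives 4.5, showing how the sqrt compresses the lexical contribution.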

Table 4: Results of retrieval systems
sqrt(BM25_score)*cos_sim BM25_score*cos_sim
System Recall@20 F2 Recall@20 F2
SPhoBERT-base(syl) 0.958 0.715 0.960 0.703
SConPBB(syl) 0.967 0.712 0.965 0.692
SPhoBERT-large(syl) 0.951 0.726 0.955 0.706
SConPBL(syl) 0.953 0.729 0.956 0.710
SViBERT(syl) 0.951 0.695 0.963 0.684
SPhoBERT-base(word) 0.963 0.725 0.964 0.706
SConPBB(word) 0.962 0.719 0.964 0.700
SPhoBERT-large(word) 0.967 0.741 0.970 0.721
SConPBL(word) 0.960 0.738 0.967 0.718
SViBERT(word) 0.948 0.683 0.950 0.661

Based on the results in Tables 3 and 4, we make four observations:

  • If an LM was pre-trained on word-level data, like PhoBERT and Condenser-PhoBERT, training the sentence-transformer on word-level data brings better results.

  • If an LM was pre-trained on syllable-level data, like ViBERT, training the sentence-transformer on syllable-level data brings better results.

  • A system with both lexical matching and semantic search performs better than the sentence-transformer model within it.

  • sqrt(BM25_score) * cos_sim is the best ranking score on F2, while BM25_score * cos_sim is the best on recall@20.

4.5 Comparison with previous Vietnamese legal text retrieval systems

Our best-performing system on F2 is the SPhoBERT-large(word) system with ranking score sqrt(BM25_score) * cos_sim; on recall@20, it is the SPhoBERT-large(word) system with ranking score BM25_score * cos_sim. For short, we call the former our best system 1 and the latter our best system 2.

Table 5: Comparison with previous Vietnamese legal text retrieval systems on F2
System F2
Our best system 1 0.741
Attentive CNN [14] 0.4774
XLM-RoBERTa [14] 0.2006
Table 6: Comparison with previous Vietnamese legal text retrieval systems on Recall@20
System Recall@20
Our best system 2 0.970
ElasticSearch + Attentive CNN [9] 0.825
Birch(256 first words) [9] 0.763
Birch(title) [9] 0.783

Table 5 and Table 6 show the results of our best systems compared with the previous systems [9, 14]. Our systems yield the best performance on both metrics.

4.6 Analysis by the number of relevant articles related to queries

We used the best retrieval system from each sentence-transformer model to analyze the effect of the number of relevant articles per query on both metrics.

Figure 4: Effect of the number of relevant articles to queries on F2
Figure 5: Effect of the number of relevant articles to queries on recall@20

Our observations based on the results from Figures 4 and 5:

  • The more relevant articles a query has, the lower the F2 of the systems. Part of the reason is that we set the threshold quite low, so the range [highest_score - threshold, highest_score] is not large enough to cover many of the relevant articles.

  • Queries with 3, 5, or 7 relevant articles make the systems' recall@20 significantly lower than queries with 2, 4, or 6 relevant articles; most of the systems' recall@20 results are pulled down by queries with 3, 5, or 7 relevant articles.

  • The SViBERT system achieves much better recall@20 than the other systems on queries with 5 or 6 relevant articles.

  • SPhoBERT-large achieves the best overall results because it outperforms the other systems on queries with 3, 5, or 7 relevant articles.

4.7 Data-driven error analysis

We found two main kinds of errors during the experiments. The first is incorrectly crawled relevant articles for a query in the dataset. After thorough discussion, we all agreed that the answer to the question in Table 7 is wrong: the relevant article was collected incorrectly. The question is about coercion of a person who is obliged under a judgment to return papers but does not, while the "relevant" article is about the principles of auctioning land use rights. We did not find any correlation between the question and the answer.

Table 7: Example of wrong crawling relevant articles
Q/A/AC Content
Query(Q) Người có nghĩa vụ phải thi hành án trả giấy tờ không thực hiện thì cơ quan thi hành án dân sự cưỡng chế như thế nào ? (If the person who is obliged to execute the judgment returns the non-implementation papers, how does the civil judgment enforcement agency coerce ?)
Answer(A) Article 117 in law id:45/2013/QH13
Article content (AC) Nguyên tắc đấu giá quyền sử dụng đất (Principles of auction of land use rights)
1 . Đấu giá quyền sử dụng đất được thực hiện công khai , liên tục , khách quan , trung thực , bình đẳng , bảo vệ quyền và lợi ích hợp pháp của các bên tham gia. (Auction of land use rights is conducted publicly, continuously, objectively, honestly, equally, protecting the legitimate rights and interests of the participating parties)
2. Việc đấu giá quyền sử dụng đất phải đúng trình tự , thủ tục theo quy định của pháp luật về đất đai và pháp luật về đấu giá tài sản(The auction of land use rights must comply with the order and procedures prescribed by the law on land and the law on asset auction.)

The second is queries and relevant articles with very hard semantics. These queries often mismatch lexically with their relevant articles, and both are long and contain difficult specialized words. Understanding the semantics of such queries and articles remains a big challenge for retrieval systems.

Table 8: Example of very hard semantic query and relevant articles
Q/A/AC Content
Query(Q) Có thể kê biên tài sản riêng của giám đốc để đảm bảo trả nợ cho công ty không ? (Can the director’s personal assets be distraint to ensure the repayment of the company’s debt ?)
Answer(A) Article 74 in law id:91/2015/QH13
Article content (AC) Pháp nhân
1. Một tổ chức được công nhận là pháp nhân khi có đủ các điều kiện sau đây (An organization is recognized as a "pháp nhân" when the following conditions are satisfied): a) Được thành lập theo quy định của Bộ Luật này, luật khác có liên quan; (a) Established under the provisions of this law and other relevant laws;) b) Có cơ cấu tổ chức theo quy định tại Điều 83 của bộ luật này; (b) Having an organizational structure as prescribed in Article 83 of this law;) c) Có tài sản độc lập với cá nhân, pháp nhân khác và tự chịu trách nhiệm bằng tài sản của mình; (c) Having assets independent of other individuals or legal entities and taking responsibility for their own property;) d) Nhân danh mình tham gia quan hệ pháp luật một cách độc lập. (d) Independently participate in legal relations in their own name.)
2 . Mọi cá nhân , pháp nhân đều có quyền thành lập pháp nhân , trừ trường hợp luật có quy định khác . (2 . Every individual and juridical person has the right to establish a juridical person, unless otherwise provided for by law.)

In Table 8, "pháp nhân" is a specialized term in Vietnamese legal language; to the best of our knowledge, the English word closest in meaning is "corporation". The answer to the query is found in clause c) of this article. The article and the query mismatch lexically, so the full retrieval system does not work as well as the sentence-transformer model alone. We used SPhoBERT-large, our best semantic search model, to evaluate the cos_sim between the question and clause c) only; the result is 0.351. This shows that our model is still limited when facing questions and answers of high semantic difficulty.

5 Conclusion and future work

In this paper, we proposed a multi-stage information retrieval approach based on sentence-transformers for Vietnamese legal texts. We also compared our best-performing model to the Attentive CNN model (the previous SOTA on this QA dataset). The results indicate that our models outperform the previous SOTA on both evaluation metrics.

In future work, we will try other techniques for long articles, such as summarization and keyword extraction. We will also improve the models to learn the relationship between queries and articles of very high semantic complexity.

References

  • [1] Akkalyoncu Yilmaz, Z., Wang, S., Yang, W., Zhang, H., Lin, J.: Applying BERT to document retrieval with birch. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. pp. 19–24. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-3004, https://aclanthology.org/D19-3004
  • [2] Condor, A., Litster, M., Pardos, Z.: Automatic short answer grading with sbert on out-of-sample questions. International Educational Data Mining Society (2021)
  • [3] Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 391–407 (1990)
  • [4] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [5] Gao, L., Callan, J.: Condenser: a pre-training architecture for dense retrieval. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 981–993. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.75, https://aclanthology.org/2021.emnlp-main.75
  • [6] Ha, T.T., Nguyen, V.N., Nguyen, K.H., Nguyen, K.A., Than, Q.K.: Utilizing sbert for finding similar questions in community question answering. In: 2021 13th International Conference on Knowledge and Systems Engineering (KSE). pp. 1–6. IEEE (2021)
  • [7] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). vol. 2, pp. 1735–1742. IEEE (2006)
  • [8] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6769–6781. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.550, https://aclanthology.org/2020.emnlp-main.550
  • [9] Kien, P.M., Nguyen, H.T., Bach, N.X., Tran, V., Nguyen, M.L., Phuong, T.M.: Answering legal questions by learning neural attentive text representation. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 988–998. International Committee on Computational Linguistics, Barcelona, Spain (Online) (Dec 2020). https://doi.org/10.18653/v1/2020.coling-main.86, https://aclanthology.org/2020.coling-main.86
  • [10] Laskar, M.T.R., Huang, J.X., Hoque, E.: Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 5505–5514. European Language Resources Association, Marseille, France (May 2020), https://aclanthology.org/2020.lrec-1.676
  • [11] Lee, K., Chang, M.W., Toutanova, K.: Latent retrieval for weakly supervised open domain question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 6086–6096. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-1612, https://aclanthology.org/P19-1612
  • [12] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. ArXiv abs/1907.11692 (2019)
  • [13] Nguyen, D.Q., Tuan Nguyen, A.: PhoBERT: Pre-trained language models for Vietnamese. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 1037–1042. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.findings-emnlp.92, https://aclanthology.org/2020.findings-emnlp.92
  • [14] Nguyen, H.T.: Toward improving attentive neural networks in legal text processing. arXiv preprint arXiv:2203.08244 (2022)
  • [15] Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019)
  • [16] Robertson, S., Zaragoza, H., Taylor, M.: Simple bm25 extension to multiple weighted fields. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management. p. 42–49. CIKM ’04, Association for Computing Machinery, New York, NY, USA (2004). https://doi.org/10.1145/1031171.1031181, https://doi.org/10.1145/1031171.1031181
  • [17] Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: SIGIR’94. pp. 232–241. Springer (1994)
  • [18] Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988). https://doi.org/https://doi.org/10.1016/0306-4573(88)90021-0, https://www.sciencedirect.com/science/article/pii/0306457388900210
  • [19] Shao, Y., Mao, J., Liu, Y., Ma, W., Satoh, K., Zhang, M., Ma, S.: Bert-pli: Modeling paragraph-level interactions for legal case retrieval. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. pp. 3501–3507 (2020)
  • [20] Thakur, N., Reimers, N., Daxenberger, J., Gurevych, I.: Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 296–310. Association for Computational Linguistics, Online (Jun 2021). https://doi.org/10.18653/v1/2021.naacl-main.28, https://aclanthology.org/2021.naacl-main.28
  • [21] Tran, T.O., Le Hong, P., et al.: Improving sequence tagging for vietnamese text using transformer-based neural models. In: Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation. pp. 13–20 (2020)
  • [22] Vu, T., Nguyen, D.Q., Nguyen, D.Q., Dras, M., Johnson, M.: VnCoreNLP: A Vietnamese natural language processing toolkit. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. pp. 56–60. Association for Computational Linguistics, New Orleans, Louisiana (Jun 2018). https://doi.org/10.18653/v1/N18-5012, https://aclanthology.org/N18-5012
  • [23] Wang, B., Kuo, C.C.J.: SBERT-WK: A sentence embedding method by dissecting bert-based word models. arXiv preprint arXiv:2002.06652 (2020)