Ruri: Japanese General Text Embeddings
Abstract
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development for Japanese remains insufficient, primarily due to the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthetic datasets generated by LLMs, the construction of a reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
Hayato Tsukagoshi, Ryohei Sasano
Graduate School of Informatics, Nagoya University
[email protected], [email protected]
1 Introduction
Text embeddings are widely used for tasks such as retrieval-augmented generation (RAG) and similar document retrieval Reimers and Gurevych (2019); Gao et al. (2021); Wang et al. (2022). In recent years, the development of general-purpose text embedding models trained on diverse datasets has become increasingly common Wang et al. (2022); Li et al. (2024b, 2023); Xiao et al. (2024); Günther et al. (2024). However, these efforts have mainly focused on English and multilingual models, where the proportion of Japanese vocabulary and training datasets is relatively small. Building embedding models using large-scale Japanese datasets may enable the creation of higher-performing models.
In this report, we present a general-purpose text embedding model specialized for Japanese, which was developed through contrastive pre-training, the construction of synthetic training datasets using LLMs, and fine-tuning on high-quality datasets. Our contributions are summarized as follows:
1. We collected datasets for building Japanese embedding models and released them under a permissive license.
2. To address the lack of Japanese retrieval datasets, we constructed a synthetic dataset using LLMs. A performance comparison with and without the synthetic dataset in benchmark tests showed a difference of over 1 point, confirming its utility in training Japanese embedding models.
3. We created a large-scale dataset for contrastive pre-training in Japanese, demonstrating its effectiveness by outperforming existing multilingual models even when using contrastive pre-training alone.
4. We developed a Japanese reranker, achieving the highest performance among existing Japanese rerankers.
5. We built the Japanese embedding model Ruri, which significantly outperformed existing models in text embedding benchmarks.
Our models and datasets are publicly available at https://huggingface.co/collections/cl-nagoya/ruri-japanese-general-text-embeddings-66cf1f3ee0c8028b89d85b5e.
2 Contrastive Pre-training
Recent research on text embeddings has seen a growing interest in a two-stage learning approach Wang et al. (2022); Li et al. (2023); Xiao et al. (2024). This approach consists of two steps: contrastive pre-training and fine-tuning. First, contrastive learning is performed using a large-scale, weakly-supervised dataset. The dataset used in the first stage typically consists of text pairs extracted from sources like Wikipedia and web corpora. Although this dataset is noisy and may contain false positives/negatives, large-scale training with substantial batch sizes has been shown to improve the quality of the embeddings. After contrastive pre-training, the model is fine-tuned using a manually labeled dataset. While the model from the first stage already yields reasonably effective embeddings, its performance can be further enhanced through fine-tuning with high-quality, human-labeled data.
Building on these methods, this report aims to develop a robust base model that can adapt to various domains through contrastive pre-training for Japanese. However, unlike in English or multilingual models, there are several challenges in developing a general text embedding model for Japanese. One of the most significant challenges is the limited availability of training datasets. To address this, in addition to collecting and preprocessing existing datasets, we synthesized additional training data using Large Language Models (LLMs) for contrastive pre-training. In this section, we provide a detailed explanation of our contrastive pre-training process.
2.1 Existing Datasets
First, we collected and preprocessed available open datasets suitable for training text embedding models. Specifically, we standardized the format and applied common preprocessing to seven datasets: Japanese Wikipedia (https://huggingface.co/datasets/hpprc/jawiki), WikiBooks (https://huggingface.co/datasets/hpprc/jawiki-books), Wiktionary (https://huggingface.co/datasets/hpprc/jawiki-wiktionary), the Japanese split of MQA (https://huggingface.co/datasets/clips/mqa), the Japanese split of CC News (https://huggingface.co/datasets/intfloat/multilingual_cc_news), the Japanese Research Corpus (JRC, https://huggingface.co/datasets/kunishou/J-ResearchCorpus), and Wiki Atomic Edits Faruqui et al. (2018). JRC is a high-quality corpus of Japanese academic papers; because it includes papers from the journal "Natural Language Processing," which is part of the evaluation benchmark JMTEB Li et al. (2024a), we excluded the data from that journal from our training set to prevent leakage. We applied NFKC normalization and removed invisible characters. Each dataset consists of pairs of a "query," typically an article title, and a "passage," typically an article body or other longer text.
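The preprocessing itself is simple; the following minimal sketch illustrates NFKC normalization and invisible-character removal (the exact set of characters stripped in our pipeline is an assumption here):

```python
import re
import unicodedata

# A few common invisible characters (zero-width spaces/joiners, BOM); the exact
# set removed in our pipeline may differ.
INVISIBLE_CHARS = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def preprocess(text: str) -> str:
    """NFKC-normalize a string and strip invisible characters."""
    text = unicodedata.normalize("NFKC", text)  # e.g. half-width kana and full-width digits are unified
    return INVISIBLE_CHARS.sub("", text).strip()

print(preprocess("ｱﾙｺｰﾙ\u200b度数は１０％"))  # -> アルコール度数は10%
```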
Source | Anchor | Positive | Negative | Dataset size |
Wikipedia (1) | title + section title | 1-paragraph | random 1-paragraph | 19,361,464 |
Wikipedia (3) | title + section title | 3-paragraphs | random 3-paragraphs | 10,010,462 |
Wikipedia (long) | title / abst. | abst. / article body | random abst. / article body | 7,889,486 |
Wiktionary | title | article body | random article body | 697,405 |
WikiBooks | title + section title | 1-paragraph | random 1-paragraph | 314,207 |
MQA | title | article body | BM25 mined article body | 25,165,824 |
CC News (long) | title | article body | BM25 mined article body | 6,248,336 |
CC News (short) | random sentence | sentence in the same article | sentence in other articles | 2,795,632 |
AutoWikiQA (MX) | question | passage | BM25 mined passage | 11,563,562 |
AutoWikiQA (Nemo) | question | passage | BM25 mined passage | 495,062 |
JRC | title + section title | section body | BM25 mined section body | 131,072 |
Wiki Atomic Edits | sentence | edited sentence | random sentence | 3,679,939 |
AutoWikiNLI | premise | hypothesis (entailment) | hypothesis (contradiction) | 203,147 |
JSNLI | premise | hypothesis (entailment) | hypothesis (contradiction) | 180,146 |
Total | | | | 88,735,744 |
Model | #Params. | GPUs | Base LM |
---|---|---|---|
Ruri-PT-small (cl-nagoya/ruri-pt-small) | 68M | A6000 ×4 | line-corporation/line-distilbert-base-japanese
Ruri-PT-base (cl-nagoya/ruri-pt-base) | 111M | A100 ×4 | tohoku-nlp/bert-base-japanese-v3
Ruri-PT-large (cl-nagoya/ruri-pt-large) | 337M | A100 ×4 | tohoku-nlp/bert-large-japanese-v2
2.2 Synthesized Datasets
The use of synthetic datasets in training text embeddings is very promising and is actively explored Zhang et al. (2023); Wang et al. (2024); Sato et al. (2024); Lee et al. (2024b). This is particularly true for languages like Japanese, where there are few available datasets for training embedding models, and licensing is crucial.
Therefore, we created synthetic datasets using LLMs for two types of data commonly used in embedding model training, QA and natural language inference (NLI), and used them for model training. While the synthetic datasets are of relatively high quality, they may still contain noise or bias; we therefore included them in the pre-training data rather than the fine-tuning data.
- AutoWikiQA is a dataset consisting of queries and answers generated from random Wikipedia passages. For generation, we mainly used Swallow-MX (https://huggingface.co/tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1), a model continually pre-trained from Mixtral-8x7B (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) on the large cleaned Japanese corpus of Okazaki et al. (2024), as was done for Swallow Fujii et al. (2024). We also used Nemotron-4 340B Nvidia et al. (2024). The source passages used to generate queries and answers were constructed by concatenating three paragraphs from random Wikipedia articles to ensure sufficient text length; as a result, a single passage usually consists of multiple sentences rather than just one. The resulting dataset consists of over 250 million query–passage pairs.
- AutoWikiNLI is a synthesized natural language inference (NLI) dataset generated using Nemotron-4 340B. We sampled random sentences from Wikipedia as premises and generated both entailment and contradiction hypotheses for each. Generating both from a single premise yields triplets containing harder negatives, which are crucial for contrastive learning, because entailment and contradiction hypotheses are lexically similar to each other. Initially, we observed that when the LLM generated the entailment hypothesis first and the contradiction second, the contradictions were often simple negations; we therefore reversed the generation order, producing the contradiction first and the entailment second. Additionally, some generated hypotheses were of low quality. To address this, we used the Nemotron-4 340B reward model (https://build.nvidia.com/nvidia/nemotron-4-340b-reward) to score the generated outputs and removed the bottom 20% of examples by helpfulness score, as sketched below.
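The reward-based filtering step can be sketched as follows; the field names and score range are illustrative, not the exact format used in our pipeline.

```python
import numpy as np

def filter_by_helpfulness(examples, helpfulness_scores, drop_percent=20):
    """Keep only examples whose reward-model helpfulness score is at or above
    the `drop_percent`-th percentile, i.e. drop the lowest-scored 20% by default."""
    threshold = np.percentile(helpfulness_scores, drop_percent)
    return [ex for ex, s in zip(examples, helpfulness_scores) if s >= threshold]

# Toy usage with illustrative fields and scores.
examples = [{"premise": f"p{i}", "entailment": f"e{i}", "contradiction": f"c{i}"} for i in range(10)]
scores = np.random.default_rng(0).uniform(0.0, 4.0, size=10)
print(len(filter_by_helpfulness(examples, scores)))  # roughly 80% of the examples remain
```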
2.3 Pre-training Dataset
Table 1 shows the datasets used for contrastive pre-training. Our training strategy aims to achieve high-performance models by leveraging a diverse range of datasets, including both noisy and high-quality, manually curated sources; this allows the model to learn robust representations from varied data before fine-tuning on more specific, high-quality datasets. Wikipedia was utilized in multiple ways, with several different extraction and pairing methods, to maximize the value of this rich information source for pre-training. We also incorporated JSNLI Yoshikoshi et al. (2020), a Japanese NLI dataset created by machine-translating the English SNLI corpus Bowman et al. (2015).
We implemented hard negative mining during the pre-training phase, a technique shown to enhance model performance as reported in Wang et al. (2022). Although Wang et al. (2022) suggests that hard negative mining becomes impractical for datasets approaching 200 million samples, our relatively smaller dataset size allowed us to effectively employ this technique. We utilized BM25 for generating hard negatives, which required preprocessing the entire document corpus to create searchable indexes. To optimize this process, we created separate indexes for each dataset, thereby reducing the time and computational costs associated with indexing and searching.
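As an illustration of the mining step, the sketch below builds a BM25 index over the passages of a single source dataset with the rank_bm25 library and ranks candidates for one query; in practice Japanese text would be tokenized with a morphological analyzer, and each dataset in Table 1 gets its own index.

```python
from rank_bm25 import BM25Okapi

# Toy passage collection standing in for one source dataset; character-level
# tokenization is a crude placeholder for a proper Japanese tokenizer.
passages = [
    "瑠璃色は紫みを帯びた濃い青色のことである。",
    "名古屋市は愛知県西部に位置する政令指定都市である。",
    "深層学習は機械学習の一分野である。",
]
tokenize = list
bm25 = BM25Okapi([tokenize(p) for p in passages])

query = "青色の顔料"
scores = bm25.get_scores(tokenize(query))
ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
# High-ranking passages that are not the annotated positive become hard negative candidates.
print([passages[i] for i in ranked])
```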
2.4 Training Details
An overview of the contrastive pre-trained models is shown in Table 2. In the contrastive pre-training stage, we built three models of different sizes: small, base, and large. The small model was based on line-corporation/line-distilbert-base-japanese, the base model on tohoku-nlp/bert-base-japanese-v3, and the large model on tohoku-nlp/bert-large-japanese-v2, with contrastive pre-training applied on top of these checkpoints. All of these models are Japanese BERT Devlin et al. (2019) models. We used the improved contrastive loss proposed by Li et al. (2023) for training. In addition to the query–passage similarities used in standard contrastive learning with in-batch negatives, this loss function also considers query–query, passage–query, and passage–passage similarities, lowering the similarity scores of all non-positive pairs. In this regard, it is similar to the loss function proposed by Zhang et al. (2021a), which likewise makes maximal use of in-batch negatives. For training the small model, we used four NVIDIA A6000 GPUs, while for the base and large models, we used four NVIDIA A100 (80GB) GPUs.
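A minimal PyTorch sketch of this loss is shown below. It follows the description above (all query–passage, query–query, and passage–passage pairs in the batch act as negatives); the temperature value and the exact treatment of the positive pair in the denominator are illustrative and may differ slightly from the released GTE formulation.

```python
import torch
import torch.nn.functional as F

def improved_contrastive_loss(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss in the spirit of GTE (Li et al., 2023).

    q, p: (batch, dim) embeddings of queries and their positive passages.
    For query i, the partition function covers q_i-p_j, q_i-q_j (j != i),
    p_i-p_j (j != i), and q_j-p_i pairs, so every non-positive pair in the
    batch acts as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    b = q.size(0)

    sim_qp = q @ p.T / temperature  # query-passage similarities
    sim_qq = q @ q.T / temperature  # query-query similarities
    sim_pp = p @ p.T / temperature  # passage-passage similarities

    # Self-similarities (q_i with q_i, p_i with p_i) must not act as negatives.
    diag = torch.eye(b, dtype=torch.bool, device=q.device)
    neg_inf = torch.finfo(sim_qp.dtype).min
    sim_qq = sim_qq.masked_fill(diag, neg_inf)
    sim_pp = sim_pp.masked_fill(diag, neg_inf)

    positives = sim_qp.diagonal()
    # Denominator for example i: all pair types listed above. The positive pair
    # appears in both the q->p and p->q directions here; the original
    # formulation's exact treatment may differ slightly.
    logits = torch.cat([sim_qp, sim_qq, sim_pp, sim_qp.T], dim=1)  # (b, 4b)
    return (torch.logsumexp(logits, dim=1) - positives).mean()

# Toy usage: a batch of 8 (query, positive passage) pairs with 768-dim embeddings.
q, p = torch.randn(8, 768), torch.randn(8, 768)
print(improved_contrastive_loss(q, p))
```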
Prefix
Recent embedding models commonly use prefixes in addition to the text being embedded Wang et al. (2022); Li et al. (2024b, 2023); Xiao et al. (2024); Wang et al. (2024). This approach is known to be particularly effective for tasks requiring asymmetric similarity, such as retrieval tasks. Therefore, we also utilized prefixes during model training. Specifically, we added the prefix “クエリ: ” to search queries and “文章: ” to target passages. While English or multilingual models often use “query: ” for queries and “passage: ” for passages, since our model is a Japanese model, we simply translated these prefixes into Japanese.
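For example, assuming the released checkpoints load as standard Sentence Transformers models (the model name below is taken from Table 9), queries and passages are encoded with their respective prefixes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-base")

queries = ["クエリ: 瑠璃色とはどんな色?"]
passages = [
    "文章: 瑠璃色は紫みを帯びた濃い青色のことである。",
    "文章: 深層学習は機械学習の一分野である。",
]

# Prefixes mark the asymmetric roles of the two sides; embeddings are compared by cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # the first passage should score higher for this query
```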
Batching Strategy
As with GTE Li et al. (2023) and Instructor Su et al. (2023), only triplets from the same dataset were included in a single batch to prevent shortcut learning. This batching strategy is called task-homogeneous batching. It has another advantage: it prevents datasets with different sequence lengths from being mixed in the same batch, reducing the number of padding tokens and thus the training time. Furthermore, because identical sentences appearing in the same batch would act as false negatives, we removed duplicate sentences within each batch in advance.
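A minimal sketch of task-homogeneous batching as a PyTorch batch sampler is given below; the class and argument names are ours, and the duplicate-removal step is handled separately in our pipeline.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class TaskHomogeneousBatchSampler(Sampler):
    """Yield batches whose examples all come from the same source dataset.

    `source_ids[i]` names the source dataset of example i. Drawing each batch
    from a single source avoids cross-dataset shortcut learning and keeps
    sequence lengths within a batch similar, reducing padding.
    """

    def __init__(self, source_ids, batch_size, drop_last=True, seed=42):
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.rng = random.Random(seed)
        self.by_source = defaultdict(list)
        for idx, source in enumerate(source_ids):
            self.by_source[source].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.by_source.values():
            indices = indices[:]
            self.rng.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batch = indices[i:i + self.batch_size]
                if len(batch) == self.batch_size or not self.drop_last:
                    batches.append(batch)
        self.rng.shuffle(batches)  # interleave sources across training steps
        yield from batches

    def __len__(self):
        if self.drop_last:
            return sum(len(v) // self.batch_size for v in self.by_source.values())
        return sum(-(-len(v) // self.batch_size) for v in self.by_source.values())

# Usage: DataLoader(dataset, batch_sampler=TaskHomogeneousBatchSampler(source_ids, batch_size=8192))
```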
Hyperparameters and Implementations
Following Li et al. (2023), which reported minimal performance improvements for batch sizes larger than 8192, we set the batch size to 8192 for the contrastive pre-training phase. As with E5, the positional embeddings were kept frozen (https://github.com/microsoft/unilm/issues/1120). Unlike SimCSE, but similar to SimLM Wang et al. (2023) (https://github.com/microsoft/unilm/blob/9c0f1ff7ca53431fe47d2637dfe253643d94185b/simlm/src/config.py#L54) and E5, we did not use a pooler layer. We performed data augmentation by shuffling the sentence order of positive documents; for sentence splitting, we used Konoha (https://github.com/himkt/konoha). For other hyperparameters, refer to the left side of Table 14.
3 Building Reranker
A reranker is a model typically used in retrieval tasks: the query and document are concatenated and fed into the model, which outputs a relevance score. Unlike a dual-encoder, which embeds the query and document independently and measures similarity in vector space, the reranker, also known as a cross-encoder, directly captures the interaction between the query and the document. This allows it to measure relevance more accurately than a dual-encoder Wang et al. (2023). Recent embedding models have found it effective to incorporate knowledge distillation from cross-encoders in addition to contrastive learning Wang et al. (2023, 2022). In this setting, the dual-encoder is trained so that its query–document similarity score distribution aligns with that produced by the cross-encoder.
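Concretely, scoring with a cross-encoder looks like the following; the checkpoint name is taken from Table 5, and we assume here that it loads as a standard single-logit sequence-classification model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-nagoya/ruri-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "瑠璃色とはどんな色?"
passages = [
    "瑠璃色は紫みを帯びた濃い青色のことである。",
    "今日の天気は晴れである。",
]

# The query and each passage are concatenated into one input, so self-attention
# can model their interaction directly.
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per (query, passage) pair
print(scores)  # higher means more relevant; a sigmoid can map the scores to [0, 1]
```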
In the model constructed in this report, we follow E5 and apply knowledge distillation from cross-encoders. To achieve this, we first built a reranker for Japanese. The biggest challenge in constructing a Japanese reranker is the dataset. Japanese retrieval datasets are extremely limited in size, making it difficult to prepare training datasets on the scale of those used in English or multilingual models. Therefore, we adopted a two-stage learning approach, starting with training on noisy datasets, including the synthesized dataset from Section 2, and then fine-tuning on higher-quality datasets.
Source | Dataset size |
---|---|
JSQuAD | 212,352 |
AutoWikiQA (Nemo) | 190,743 |
JaQuAD | 108,068 |
Quiz No Mori | 36,120 |
Quiz Works | 29,112 |
JQaRA | 16,260 |
MIRACL | 13,968 |
Mr. TyDi | 7,394 |
MKQA | 6,636 |
Total | 620,653 |
Source | Dataset size |
---|---|
Quiz No Mori | 18,060 |
Quiz Works | 14,556 |
JQaRA | 8,130 |
MIRACL | 6,984 |
Mr. TyDi | 3,697 |
Total | 51,427 |
3.1 Datasets
We trained the reranker in two stages: the first stage used large but noisy datasets, while the second stage employed smaller, higher-quality datasets, following JaColBERTv2.5 Clavié (2024). The datasets used in the first stage are listed in Table 3, and those used in the second stage are shown in Table 4. We did not use the mMARCO Bonifacio et al. (2021) dataset, a translated version of MS MARCO Bajaj et al. (2018): MS MARCO is primarily intended for non-commercial research, which raised licensing concerns for publishing models under a commercial-use license; moreover, the quality of its Japanese translations is relatively low, and in preliminary experiments, training with mMARCO negatively affected reranker performance. For reranker training, we utilized existing Japanese retrieval and QA datasets, including JSQuAD Kurihara et al. (2022), JaQuAD So et al. (2022), JQaRA Tateno (2024b), MKQA Longpre et al. (2021), Mr. TyDi Zhang et al. (2021b), and MIRACL Zhang et al. (2022). We also used a synthesized dataset generated by Nemotron-4 340B Nvidia et al. (2024), as well as high-quality QA datasets extracted from Japanese quiz websites that are available under a free license (https://huggingface.co/datasets/hpprc/quiz-works, https://huggingface.co/datasets/hpprc/quiz-no-mori). For datasets with predefined splits, we used only the train set; for the others, we used all available data. As a result, no examples from the benchmark test sets were used for training.
Pseudo Positives and Hard Negative Mining
To train rerankers, each query is paired with one positive document and multiple negative documents. The model is trained to ensure that the relevance score for the query–positive document pair is higher than for any query–negative document pair. There are two key points in reranker training; the first is to use challenging examples as hard negatives, and the second is to avoid false negatives, where a negative document is actually a positive example.
To collect hard negatives for each query, we combined BM25-based negative mining with nearest neighbor search over embeddings from the multilingual E5-large model Li et al. (2024b) (mE5-large). Specifically, we merged the BM25 and mE5-large rankings using reciprocal rank fusion (RRF) Cormack et al. (2009) and selected hard negatives while excluding the top-ranked candidates: documents ranked between 30th and 100th in the fused ranking were used as hard negatives. To mitigate the issue of false negatives, we exploited the answers in the QA datasets: after hard negative mining, documents containing the query's answer were removed from the negatives and instead treated as pseudo positives ("mined positives"), as in JAQKET Suzuki et al. (2020).
For hard negative mining on the Japanese QA datasets other than Mr. TyDi and MIRACL, we used a collection of concatenated three-paragraph passages from Japanese Wikipedia, Wiktionary, and WikiBooks as the pool of negative candidates. For Mr. TyDi and MIRACL, we used their predefined document collections as-is.
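Reciprocal rank fusion itself is straightforward; the sketch below fuses a BM25 ranking and an mE5-large ranking for one query and takes the rank-30 to rank-100 window of the fused list as the hard negative pool (document ids are toy values).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """RRF (Cormack et al., 2009): score(d) = sum over rankings of 1 / (k + rank(d)),
    where each ranking lists document ids from best to worst."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings standing in for the BM25 and mE5-large results for one query.
bm25_ranking = [f"doc{i}" for i in range(200)]
me5_ranking = [f"doc{i}" for i in reversed(range(200))]

fused = reciprocal_rank_fusion([bm25_ranking, me5_ranking])
# Skip the top of the fused list (likely relevant, i.e. potential false negatives)
# and keep ranks 30-100 as hard negative candidates; documents containing the gold
# answer string are additionally moved to the "mined positives" side.
hard_negative_pool = fused[29:100]
print(len(hard_negative_pool))  # 71 candidates
```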
3.2 Training Details
We trained the reranker in two stages. In the first stage, we built a reranker using the noisy data, and in the second stage, we fine-tuned it on the higher-quality data. In the first stage, we applied data augmentation by shuffling the sentence order of each positive document in the datasets, but no shuffling was done in the second stage. The maximum sequence length during training was set to a relatively short 256 in the first stage and increased to 512 in the second stage. The number of hard negatives was set to 63 in both stages. As the loss function, we used a cross-entropy loss in which the score of the positive document is maximized relative to the scores of the 63 negative documents for each query. We used the contrastive pre-trained models constructed in Section 2 as the base models. For other hyperparameters, refer to Table 15.
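The loss is a standard listwise cross-entropy over the positive and the 63 hard negatives of each query; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def reranker_loss(scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over 1 positive + N hard negatives per query.

    `scores` has shape (batch, 1 + num_negatives), with column 0 holding the
    cross-encoder score of the positive document; the loss pushes that score
    above the scores of all negatives for the same query."""
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, labels)

# e.g. a batch of 4 queries, each scored against 1 positive and 63 hard negatives
print(reranker_loss(torch.randn(4, 64)))
```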
Model | #Param. (w/o Emb.) | JQaRA | JaCWIR | MIRACL |
---|---|---|---|---|
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1 | 107M (11M) | 61.4 | 93.8 | 90.6 |
hotchpotch/japanese-reranker-cross-encoder-small-v1 | 118M (21M) | 62.5 | 93.9 | 92.2 |
hotchpotch/japanese-reranker-cross-encoder-base-v1 | 111M (86M) | 67.1 | 93.4 | 93.3 |
hotchpotch/japanese-reranker-cross-encoder-large-v1 | 337M (303M) | 71.0 | 93.6 | 91.5 |
hotchpotch/japanese-bge-reranker-v2-m3-v1 | 568M (303M) | 69.2 | 93.7 | 94.7 |
BAAI/bge-reranker-v2-m3 | 568M (303M) | 67.3 | 93.4 | 94.9 |
Ruri-Reranker-small (cl-nagoya/ruri-reranker-small) | 68M (43M) | 64.5 | 92.6 | 92.3
Ruri-Reranker-base (cl-nagoya/ruri-reranker-base) | 111M (86M) | 74.3 | 93.5 | 95.6
Ruri-Reranker-large (cl-nagoya/ruri-reranker-large) | 337M (303M) | 77.1 | 94.1 | 96.1
3.3 Evaluation
Settings
The reranker is used for dataset filtering and knowledge distillation for embedding models, so its performance is expected to impact the performance of the resulting embedding models. Therefore, we first evaluated the reranker for Japanese. In reranker evaluation, the goal is to determine how well the model ranks the relevant documents at the top when given a query and a set of documents. We used JQaRA Tateno (2024b), JaCWIR Tateno (2024a), and the test set of MIRACL Zhang et al. (2022) for evaluation. JQaRA is a dataset designed to evaluate the retrieval of useful data for answering questions, which is important for retrieval-augmented generation (RAG). JaCWIR is a diverse retrieval evaluation dataset based on web articles. For the evaluation of JQaRA and JaCWIR, we used the official evaluation code (https://github.com/hotchpotch/JQaRA, https://github.com/hotchpotch/JaCWIR), and for MIRACL, we used an implementation designed for reranking with mined negatives (https://github.com/oshizo/JapaneseEmbeddingEval). The evaluation metrics were top-10 nDCG (nDCG@10) for JQaRA, top-10 mean average precision (MAP@10) for JaCWIR, and top-30 recall (Recall@30) for MIRACL.
Results
Table 5 shows the evaluation results of major multilingual and Japanese rerankers, as well as our reranker (Ruri-Reranker). Ruri-reranker demonstrated consistently strong performance across the board, with the base model achieving performance comparable to or exceeding existing Japanese and multilingual rerankers, and the large model significantly outperforming them. Notably, our model performed particularly well on JQaRA, which can be attributed to the use of diverse QA datasets during training.
Model | Stage | JQaRA | JaCWIR | MIRACL
---|---|---|---|---
Ruri-Reranker-small | 1 only | 63.9 | 92.5 | 91.2
Ruri-Reranker-small | 2 only | 60.3 | 89.9 | 89.3
Ruri-Reranker-small | 1 → 2 | 64.5 | 92.6 | 92.3
Ruri-Reranker-base | 1 only | 72.9 | 92.4 | 94.2
Ruri-Reranker-base | 2 only | 68.0 | 92.9 | 93.7
Ruri-Reranker-base | 1 → 2 | 74.3 | 93.5 | 95.6
Ruri-Reranker-large | 1 only | 75.8 | 93.4 | 95.4
Ruri-Reranker-large | 2 only | 70.5 | 90.8 | 93.2
Ruri-Reranker-large | 1 → 2 | 77.1 | 94.1 | 96.1
Model | Phase | JQaRA | JaCWIR | MIRACL
---|---|---|---|---
Small, w/o pre-training | stage1 | 63.7 | 89.4 | 90.4
Small, w/o pre-training | stage2 | 64.3 | 91.4 | 91.6
Ruri-Reranker-small (w/ pre-training) | stage1 | 63.9 | 92.5 | 91.2
Ruri-Reranker-small (w/ pre-training) | stage2 | 64.5 | 92.6 | 92.3
Base, w/o pre-training | stage1 | 71.8 | 89.3 | 93.9
Base, w/o pre-training | stage2 | 73.1 | 91.6 | 95.1
Ruri-Reranker-base (w/ pre-training) | stage1 | 72.9 | 92.4 | 94.2
Ruri-Reranker-base (w/ pre-training) | stage2 | 74.3 | 93.5 | 95.6
Large, w/o pre-training | stage1 | 76.1 | 92.2 | 95.2
Large, w/o pre-training | stage2 | 77.3 | 93.5 | 96.0
Ruri-Reranker-large (w/ pre-training) | stage1 | 75.8 | 93.4 | 95.4
Ruri-Reranker-large (w/ pre-training) | stage2 | 77.1 | 94.1 | 96.1
Ablation Study
We conducted several ablation studies on the design of our reranker. The two main points we investigated were: 1) whether the two-stage training of the reranker is effective, and 2) whether using a contrastive pre-trained model as the base model for the reranker is beneficial.
First, for the two-stage reranker training, we experimented with three configurations: 1) training only the first stage, 2) training only the second stage, and 3) training from the first stage to the second stage (i.e. Ruri-Reranker). Table 6 shows the results. From the table, it is clear that two-stage training consistently yielded the best performance across all model sizes. While training only in the first stage achieved reasonable performance, adding fine-tuning with higher-quality datasets in the second stage further improved the results, demonstrating the effectiveness of the two-stage approach. This observation is consistent with Clavié (2024).
Next, we investigated whether using a contrastive pre-trained model (Ruri-PT) as the base model for the reranker is beneficial. Specifically, we compared the performance of a contrastive pre-trained model and a non-pre-trained model, both fine-tuned as rerankers in the same manner. Table 7 shows the results. Observing the models and stages, it is evident that the Ruri-PT generally outperformed the non-pre-trained model. While the training objective of contrastive pre-training differs from that of reranking, this result suggests that contrastive pre-training helps the model learn how to focus on important information in the text, leading to improved performance.
4 Supervised Fine-tuning
Following previous research, we constructed the final embedding model by fine-tuning a contrastive pre-trained model, trained on a weakly supervised dataset, using high-quality datasets. This section describes the datasets used for training, details of the training process, and evaluation.
4.1 Dataset
For fine-tuning the contrastive pre-trained model, we collected high-quality datasets, shown in Table 8. They fall into two categories: retrieval/QA datasets and natural language inference (NLI) datasets. The retrieval/QA datasets were the same as those used in the second stage of reranker training, described in Section 3. For the NLI datasets, we used NU-SNLI (https://huggingface.co/datasets/cl-nagoya/nu-snli) and NU-MNLI (https://huggingface.co/datasets/cl-nagoya/nu-mnli), which were translated from SNLI Bowman et al. (2015) and MNLI Williams et al. (2018) using a machine translation model fine-tuned from Swallow-MX. While NU-SNLI and NU-MNLI were not manually created, their quality is sufficiently high. We also used JaNLI Yanaka and Mineshima (2021) to help the model capture more subtle differences in meaning.
To perform knowledge distillation from the cross-encoder, we used the reranker constructed in Section 3 to score the relevance of the retrieval examples. Additionally, to mitigate the negative effect of noisy examples in the retrieval datasets, we removed examples where the relevance score between the query and the positive document was less than 0.8. As data augmentation, we also added examples in which the sentence order of the positive document was shuffled, alongside the original, unshuffled examples. For the NLI datasets, we did not apply knowledge distillation, so no relevance scoring by the reranker was performed. As in Section 2, we used task-homogeneous batching to ensure that only examples from the same dataset were included in each batch.
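These filtering and augmentation steps can be sketched as follows; the field names are illustrative, and sentence splitting can be done with a tool such as Konoha, as in Section 2.

```python
import random

def filter_and_augment(examples, reranker_scores, threshold=0.8, seed=42):
    """Drop retrieval examples whose query-positive relevance score from the
    reranker is below `threshold`, then add a copy of each surviving example
    with the sentence order of its positive document shuffled."""
    rng = random.Random(seed)
    out = []
    for ex, score in zip(examples, reranker_scores):
        if score < threshold:
            continue
        out.append(ex)
        shuffled = ex["positive_sentences"][:]  # pre-split sentences of the positive document
        rng.shuffle(shuffled)
        out.append({**ex, "positive_sentences": shuffled})
    return out

# Toy usage with illustrative fields: the second example is filtered out (score < 0.8).
examples = [
    {"query": "q1", "positive_sentences": ["文1。", "文2。", "文3。"]},
    {"query": "q2", "positive_sentences": ["文4。", "文5。"]},
]
print(len(filter_and_augment(examples, [0.95, 0.42])))  # -> 2 (one kept example plus its shuffled copy)
```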
Source | Distill. | Dataset size |
---|---|---|
Quiz No Mori | ✓ | 31,232 |
Quiz Works | ✓ | 26,624 |
JQaRA | ✓ | 13,824 |
MIRACL | ✓ | 12,800 |
Mr. TyDi | ✓ | 7,168 |
NU-SNLI | 109,568 | |
NU-MNLI | 77,824 | |
JaNLI | 13,824 | |
Total | | 292,864 |
Model | #Params. | Dim. | #Layer | Pooling | Context Len. | Vocab Size | JMTEB Avg. |
Ruri-small (cl-nagoya/ruri-small) | 68M | 768 | 6 | Mean | 512 | 32,768 | 71.53
Ruri-base (cl-nagoya/ruri-base) | 111M | 768 | 12 | Mean | 512 | 32,768 | 71.91
Ruri-large (cl-nagoya/ruri-large) | 337M | 1024 | 24 | Mean | 512 | 32,768 | 73.31
Model | #Param. | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg. |
cl-nagoya/sup-simcse-ja-base | 111M | 49.64 | 82.05 | 73.47 | 91.83 | 51.79 | 62.57 | 63.36 |
cl-nagoya/sup-simcse-ja-large | 337M | 37.62 | 83.18 | 73.73 | 91.48 | 50.56 | 62.51 | 58.88 |
cl-nagoya/unsup-simcse-ja-base | 111M | 40.23 | 78.72 | 73.07 | 91.16 | 44.77 | 62.44 | 58.39 |
cl-nagoya/unsup-simcse-ja-large | 337M | 40.53 | 80.56 | 74.66 | 90.95 | 48.41 | 62.49 | 59.58 |
pkshatech/GLuCoSE-base-ja | 133M | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 | 67.29 |
sentence-transformers/LaBSE | 472M | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 | 58.01 |
intfloat/multilingual-e5-small | 118M | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 | 67.71 |
intfloat/multilingual-e5-base | 278M | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 | 68.61 |
intfloat/multilingual-e5-large | 560M | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 | 70.90 |
OpenAI/text-embedding-ada-002 | - | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 | 67.21 |
OpenAI/text-embedding-3-small | - | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 | 69.18 |
OpenAI/text-embedding-3-large | - | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 | 74.05 |
Ruri-small (cl-nagoya/ruri-small) | 68M | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 | 71.53
Ruri-base (cl-nagoya/ruri-base) | 111M | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 | 71.91
Ruri-large (cl-nagoya/ruri-large) | 337M | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 | 73.31
Model | JaGovFAQs | JAQKET | Mr. TyDi | NLP Journal Abst.–Intro. | NLP Journal Title–Abst. | NLP Journal Title–Intro. | Avg.
pkshatech/GLuCoSE-base-ja | 63.88 | 39.82 | 30.28 | 78.26 | 82.06 | 59.82 | 59.02 |
intfloat/multilingual-e5-small | 64.11 | 49.97 | 36.05 | 85.21 | 95.26 | 72.99 | 67.27 |
intfloat/multilingual-e5-base | 65.34 | 50.67 | 38.38 | 87.10 | 94.73 | 73.05 | 68.21 |
intfloat/multilingual-e5-large | 70.30 | 58.78 | 43.63 | 86.00 | 94.70 | 72.48 | 70.98 |
OpenAI/text-embedding-ada-002 | 61.02 | 42.56 | 14.51 | 94.99 | 91.23 | 81.98 | 64.38 |
OpenAI/text-embedding-3-small | 64.02 | 33.94 | 20.03 | 98.47 | 91.70 | 90.17 | 66.39 |
OpenAI/text-embedding-3-large | 72.41 | 48.21 | 34.88 | 99.33 | 96.55 | 95.47 | 74.48 |
Ruri-small (cl-nagoya/ruri-small) | 73.65 | 48.44 | 33.43 | 87.69 | 97.17 | 76.09 | 69.41
Ruri-base (cl-nagoya/ruri-base) | 74.56 | 50.12 | 35.45 | 86.89 | 96.57 | 75.31 | 69.82
Ruri-large (cl-nagoya/ruri-large) | 76.68 | 61.74 | 38.03 | 87.12 | 96.58 | 77.97 | 73.02
4.2 Training Details
We built a high-performance embedding model by fine-tuning the contrastive pre-trained model constructed in Section 2 using high-quality datasets. An overview of each model is shown in Table 9. Following Clavié (2024), we decoupled the loss for knowledge distillation and contrastive learning. Specifically, for retrieval/QA dataset examples, we computed the loss using knowledge distillation, and for NLI examples, we computed the loss using contrastive learning.
During knowledge distillation, inspired by Clavié (2024), we applied min-max normalization to both the student scores, calculated via cosine similarity of embeddings, and the teacher scores, calculated by the cross-encoder. For the NLI dataset, we used the improved contrastive loss, as described in Section 2. The maximum sequence length was set to 512, the batch size to 512, and the number of hard negatives to 15. For other hyperparameters, refer to the right side of Table 14.
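A minimal sketch of the distillation objective is given below. The min-max normalization of student and teacher scores follows the description above; the KL-divergence form (with an illustrative temperature) is one common choice, e.g. in E5-style distillation, and the exact divergence used for Ruri is not spelled out in this report.

```python
import torch
import torch.nn.functional as F

def minmax(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Min-max normalize scores per query (row-wise) to [0, 1]."""
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    return (x - x_min) / (x_max - x_min + eps)

def distillation_loss(student_cos, teacher_scores, temperature=0.05):
    """Align the dual-encoder's cosine-similarity distribution with the
    cross-encoder's score distribution over (positive + hard negative) documents."""
    student = F.log_softmax(minmax(student_cos) / temperature, dim=-1)
    teacher = F.softmax(minmax(teacher_scores) / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy usage: a batch of 4 queries, each with 1 positive + 15 hard negatives.
student_cos = torch.rand(4, 16)     # cosine similarities from the embedding model
teacher_scores = torch.rand(4, 16)  # relevance scores from the reranker
print(distillation_loss(student_cos, teacher_scores))
```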
4.3 Evaluation
We evaluated our Japanese general text embedding model, Ruri, on the Japanese text embedding benchmark.
Settings
For evaluation, we used JMTEB Li et al. (2024a), the Japanese version of the massive text embedding benchmark (MTEB) Muennighoff et al. (2023). JMTEB includes 16 evaluation datasets covering various tasks such as classification, retrieval, and clustering. We used the official implementation for the evaluation.
Results
Table 10 shows the results. The results indicate that our model consistently outperforms existing multilingual embedding models such as mE5 and Japanese embedding models on average. Notably, our base-sized model achieved higher average performance than the mE5-large. Even when compared to proprietary embedding models, our model demonstrates comparable performance.
Model | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg. |
---|---|---|---|---|---|---|---|
Ruri-PT-large (contrastive pre-training w/ synthetic retrieval data) | 71.48 | 82.06 | 76.12 | 92.75 | 53.41 | 62.27 | 72.46
w/o synthetic retrieval data | 68.08 | 82.32 | 76.42 | 92.66 | 51.98 | 62.29 | 71.11
Model | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg.
---|---|---|---|---|---|---|---
Ruri-PT-small (pre-training only) | 67.39 | 81.41 | 75.41 | 92.98 | 51.13 | 62.44 | 70.41
w/o pre-training (fine-tuning only) | 56.62 | 82.45 | 77.30 | 92.01 | 47.77 | 62.42 | 66.49
Ruri-small (pre-training + fine-tuning) | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 | 71.53
Ruri-PT-base (pre-training only) | 68.18 | 81.81 | 74.56 | 92.82 | 53.35 | 62.33 | 70.80
w/o pre-training (fine-tuning only) | 52.99 | 81.95 | 76.19 | 91.60 | 51.85 | 62.20 | 65.25
Ruri-base (pre-training + fine-tuning) | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 | 71.91
Ruri-PT-large (pre-training only) | 71.48 | 82.06 | 76.12 | 92.75 | 53.41 | 62.27 | 72.46
w/o pre-training (fine-tuning only) | 57.84 | 83.66 | 76.50 | 91.51 | 49.56 | 62.35 | 67.09
Ruri-large (pre-training + fine-tuning) | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 | 73.31
To further analyze the behavior of each model, we focused on retrieval tasks, where performance differences tend to be more pronounced. Table 11 shows the performance of each model on these tasks. JaGovFAQs, JAQKET, and Mr. TyDi are standard QA/retrieval tasks, while the three NLP Journal tasks involve retrieving abstracts or introductions given paper titles, or retrieving introductions given abstracts. The results show that our models perform well on JaGovFAQs, a FAQ retrieval task, and on JAQKET, a QA task, outperforming the proprietary embeddings on these tasks. (We used JQaRA to train the reranker and the fine-tuned models, and JQaRA shares some data with JAQKET; to avoid using the JAQKET test set and prevent data leakage, we only used a portion of the JQaRA data, namely the dev and unused splits, for training.) On the other hand, proprietary embedding models performed exceptionally well on the NLP Journal tasks. Although the training data and methods of proprietary embeddings are not disclosed, these NLP Journal tasks can be viewed as topic-similarity search over LaTeX documents, suggesting that these models may have been trained on a large amount of LaTeX text.
Ablation Study
The key differences between our embedding model and existing Japanese models are: 1) the use of a synthesized dataset for model training, and 2) the application of contrastive pre-training. Therefore, we conducted an ablation study on these aspects.
First, Table 12 shows the performance of the large model on JMTEB when contrastive pre-training is performed with or without the synthesized retrieval dataset. The results indicate a significant performance improvement in retrieval tasks when the synthesized retrieval dataset is used. On the other hand, when the synthesized retrieval dataset is excluded, there is a slight improvement in STS and classification tasks, suggesting that the introduction of synthesized datasets may not be beneficial for all tasks. However, since retrieval tasks are generally more challenging compared to other tasks, improving performance in these tasks by using synthesized datasets is valuable.
Next, to verify the effectiveness of contrastive pre-training under the assumption of supervised fine-tuning, we compared the performance of models after supervised fine-tuning, using both contrastive pre-trained models and non-pre-trained models as the base models. Table 13 shows the results. We can clearly observe that the presence or absence of contrastive pre-training has a significant impact on post-supervised fine-tuning performance. The improvement in retrieval task performance is particularly notable, indicating the importance of contrastive pre-training for retrieval tasks.
5 Conclusion and Future Work
In this report, we described the process of building a general-purpose Japanese text embedding model, Ruri. Our contributions can be summarized into the following five points:
1. We collected datasets for building a Japanese embedding model and made them public with a permissive license.
2. To address the shortage of Japanese retrieval datasets, we constructed a synthesized dataset using LLMs and verified its effectiveness.
3. We constructed a large-scale dataset for contrastive pre-training in Japanese and demonstrated its utility.
4. We developed a Japanese reranker, achieving the highest performance among existing Japanese and multilingual rerankers.
5. We built a Japanese embedding model, Ruri, which significantly outperformed existing models on the Japanese text embedding benchmark.
While our model already demonstrates high performance, there are still many challenges left in the realm of Japanese text embedding. Below, we outline some of the challenges and considerations that we were unable to address in this report.
Prefix
Using more diverse prefixes, beyond the simple “クエリ: ” and “文章: ”, could potentially improve overall performance. Indeed, recent models employing instructions Wang et al. (2024); BehnamGhader et al. (2024); Su et al. (2023); Lee et al. (2024a) or more detailed prefixes Nussbaum et al. (2024) (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) have been developed.
Knowledge Distillation from Cross-Encoder
Knowledge distillation from cross-encoders has been introduced in models like SimLM Wang et al. (2023) and E5 Wang et al. (2022), but recent models based on LLMs BehnamGhader et al. (2024); Lee et al. (2024a) do not seem to incorporate knowledge distillation from cross-encoders. Also, during the development of Ruri, we observed that introducing knowledge distillation made training slightly unstable. Whether this technique is truly necessary remains an open question for further investigation.
Dataset
The current dataset is still insufficient, especially in terms of web corpora, compared to GLuCoSE (https://huggingface.co/pkshatech/GLuCoSE-base-ja) and other multilingual embedding models. To build a general-purpose text embedding model usable in various domains, a more diverse and higher-quality dataset may be crucial.
Base LM and Pre-training
There may be room for improvement in the performance of the Japanese BERT used as the base model. It has been reported that even with the same training methods and datasets, the quality of text embeddings can vary greatly depending on the base model Tsukagoshi et al. (2023). Developing base models specifically pre-trained for embedding, such as SimLM Wang et al. (2023), RetroMAE Xiao et al. (2022), and RetroMAE-2 Liu et al. (2023), could potentially lead to even higher-performing embedding models. Also, it should be noted that Ruri does not use code datasets, so it cannot be applied to code search. To support code search in a Japanese-specific model, it may be necessary to develop a bilingual model with both Japanese and English vocabularies, given that program code is primarily written in English.
Context Length
The context length is short. While recent large language models can process sequences as long as 32k tokens or longer, most embedding models can only handle around 512 tokens. There are long-context embedding models, such as Jina BERT Günther et al. (2024), which use ALiBi Press et al. (2022), and it is important to develop robust text embedding models for longer sequences. In particular, Japanese BERT does not incorporate architectural advancements used in recent LLMs, such as RoPE Su et al. (2024) and SwiGLU Shazeer (2020). By backporting these developments from LLM research, it may be possible to create models with longer context lengths and higher performance.
Evaluation
The bias in both training and evaluation datasets is also a concern. There are very few evaluation datasets based on web corpora for assessing Japanese text embeddings. Although solving this issue is complicated by licensing and copyright constraints, it is a challenge that must be addressed to evaluate the model’s broad applicability.
Research on Japanese text embeddings is still in its early stages. We hope this report contributes to further progress in this field.
References
- Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268.
- BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. In First Conference on Language Modeling (COLM).
- Bonifacio et al. (2021) Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2021. mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset. arXiv:2108.13897.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642.
- Clavié (2024) Benjamin Clavié. 2024. JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources. arXiv:2407.20750.
- Cormack et al. (2009) Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR).
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171–4186.
- Faruqui et al. (2018) Manaal Faruqui, Ellie Pavlick, Ian Tenney, and Dipanjan Das. 2018. WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 305–315.
- Fujii et al. (2024) Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities. In Proceedings of the First Conference on Language Modeling (COLM), COLM.
- Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6894–6910.
- Günther et al. (2024) Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv:2310.19923.
- Kurihara et al. (2022) Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata. 2022. JGLUE: Japanese General Language Understanding Evaluation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 2957–2966.
- Lee et al. (2024a) Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024a. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2405.17428.
- Lee et al. (2024b) Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024b. Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv:2403.20327.
- Li et al. (2024a) Shengzhe Li, Masaya Ohagi, and Ryokan Ri. 2024a. JMTEB: Japanese Massive Text Embedding Benchmark. https://huggingface.co/datasets/sbintuitions/JMTEB. [Accessed 31-08-2024].
- Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281.
- Li et al. (2024b) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2024b. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672.
- Liu et al. (2023) Zheng Liu, Shitao Xiao, Yingxia Shao, and Zhao Cao. 2023. RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 2635–2648.
- Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. Transactions of the Association for Computational Linguistics (TACL), pages 1389–1406.
- Muennighoff et al. (2023) Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2014–2037.
- Nussbaum et al. (2024) Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv:2402.01613.
- Nvidia et al. (2024) Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, and Chen Zhu. 2024. Nemotron-4 340B Technical Report. arXiv:2406.11704.
- Okazaki et al. (2024) Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, and Sakae Mizuki. 2024. Building a Large Japanese Web Corpus for Large Language Models. In Proceedings of the First Conference on Language Modeling (COLM), COLM.
- Press et al. (2022) Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
- Sato et al. (2024) Soma Sato, Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda. 2024. Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Volume 4: Student Research Workshop (ACL SRW), pages 519–530.
- Shazeer (2020) Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv:2002.05202.
- So et al. (2022) ByungHoon So, Kyuhong Byun, Kyungwon Kang, and Seongjin Cho. 2022. JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension. arXiv:2202.01764.
- Su et al. (2023) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1102–1121.
- Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568:127063.
- Tateno (2024a) Yuichi Tateno. 2024a. JaCWIR: Japanese Casual Web IR - a small-scale, casual dataset of web page titles and summaries for Japanese information retrieval evaluation.
- Tateno (2024b) Yuichi Tateno. 2024b. JQaRA: Japanese Question Answering with Retrieval Augmentation - a Japanese Q&A dataset for evaluating retrieval augmentation (RAG).
- Tsukagoshi et al. (2023) Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda. 2023. Japanese SimCSE Technical Report. arXiv:2310.19349.
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533.
- Wang et al. (2023) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2023. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 2244–2258.
- Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 11897–11916.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1112–1122.
- Xiao et al. (2022) Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 538–548.
- Xiao et al. (2024) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2024. C-Pack: Packaged Resources To Advance General Chinese Embedding. In The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
- Yanaka and Mineshima (2021) Hitomi Yanaka and Koji Mineshima. 2021. Assessing the Generalization Capacity of Pre-trained Language Models through Japanese Adversarial Natural Language Inference. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), pages 337–349.
- Zhang et al. (2021a) Dejiao Zhang, Shang-Wen Li, Wei Xiao, Henghui Zhu, Ramesh Nallapati, Andrew O. Arnold, and Bing Xiang. 2021a. Pairwise Supervised Contrastive Learning of Sentence Representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5786–5798.
- Zhang et al. (2023) Junlei Zhang, Zhenzhong Lan, and Junxian He. 2023. Contrastive Learning of Sentence Embeddings from Scratch. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pages 3916–3932.
- Zhang et al. (2021b) Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021b. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. In Proceedings of the 1st Workshop on Multilingual Representation Learning (MRL), pages 127–137.
- Zhang et al. (2022) Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2022. Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages. arXiv:2210.09984.
- Yoshikoshi et al. (2020) Takumi Yoshikoshi, Daisuke Kawahara, and Sadao Kurohashi. 2020. Multilingualization of a natural language inference dataset using machine translation (in Japanese). In Proceedings of the 244th Meeting of the IPSJ Special Interest Group on Natural Language Processing (SIG-NL).
- Suzuki et al. (2020) Masatoshi Suzuki, Jun Suzuki, Koji Matsuda, Kyosuke Nishida, and Naoya Inoue. 2020. JAQKET: Construction of a Japanese QA dataset based on quizzes (in Japanese). In Proceedings of the 26th Annual Meeting of the Association for Natural Language Processing (NLP2020).
Appendix A Hyperparameters
Table 14 shows the hyperparameters and settings used during the training of the embedding model, and Table 15 shows the hyperparameters and settings used during the training of the reranker.
Phase | Pre-training | Pre-training | Pre-training | Fine-tuning | Fine-tuning | Fine-tuning
---|---|---|---|---|---|---
Model | Small | Base | Large | Small | Base | Large
learning rate | 1 | 5 | 3 | 1 | 5 | 3 |
max length | 256 | 256 | 192 | 512 | 512 | 512 |
warmup ratio | 10% | 10% | 10% | 10% | 10% | 10% |
batch size | 8192 | 8192 | 8192 | 512 | 512 | 512 |
epochs | 1 | 1 | 1 | 1 | 1 | 1 |
0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | |
weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
hard negatives | 1 | 1 | 1 | 15 | 15 | 15 |
task-homogeneous | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
shuffle positive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
knowledge distillation | | | | ✓ | ✓ | ✓ |
Phase | Stage1 | Stage1 | Stage1 | Stage2 | Stage2 | Stage2
---|---|---|---|---|---|---
Model | Small | Base | Large | Small | Base | Large |
learning rate | 1 | 5 | 3 | 1 | 5 | 3 |
max length | 256 | 256 | 256 | 512 | 512 | 512 |
warmup ratio | 10% | 10% | 10% | 10% | 10% | 10% |
batch size | 512 | 512 | 512 | 64 | 64 | 64 |
epochs | 1 | 1 | 1 | 1 | 1 | 1 |
weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
hard negatives | 63 | 63 | 63 | 63 | 63 | 63 |
task-homogeneous | ||||||
shuffle positive | ✓ | ✓ | ✓ | |||