Ruri: Japanese General Text Embeddings
Abstract
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development for Japanese remains insufficient, primarily due to the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthetic datasets generated by LLMs, the construction of a reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
Hayato Tsukagoshi, Ryohei Sasano
Graduate School of Informatics, Nagoya University
[email protected], [email protected]
1 Introduction
Text embeddings are widely used for tasks such as retrieval-augmented generation (RAG) and similar document retrieval Reimers and Gurevych (2019); Gao et al. (2021); Wang et al. (2022). In recent years, the development of general-purpose text embedding models trained on diverse datasets has become increasingly common Wang et al. (2022); Li et al. (2024b, 2023); Xiao et al. (2024); Günther et al. (2024). However, these efforts have mainly focused on English and multilingual models, where the proportion of Japanese vocabulary and training datasets is relatively small. Building embedding models using large-scale Japanese datasets may enable the creation of higher-performing models.
In this report, we present a general-purpose text embedding model specialized for Japanese, which was developed through contrastive pre-training, the construction of synthetic training datasets using LLMs, and fine-tuning on high-quality datasets. Our contributions are summarized as follows:
1. We collected datasets for building Japanese embedding models and released them under a permissive license.
2. To address the lack of Japanese retrieval datasets, we constructed a synthetic dataset using LLMs. A performance comparison with and without the synthetic dataset in benchmark tests showed a difference of over 1 point, confirming its utility in training Japanese embedding models.
3. We created a large-scale dataset for contrastive pre-training in Japanese, demonstrating its effectiveness by outperforming existing multilingual models even when using contrastive pre-training alone.
4. We developed a Japanese reranker, achieving the highest performance among existing Japanese rerankers.
5. We built the Japanese embedding model Ruri, which significantly outperformed existing models in text embedding benchmarks.
Our models and datasets are publicly available at https://huggingface.co/collections/cl-nagoya/ruri-japanese-general-text-embeddings-66cf1f3ee0c8028b89d85b5e.
2 Contrastive Pre-training
Recent research on text embeddings has seen a growing interest in a two-stage learning approach Wang et al. (2022); Li et al. (2023); Xiao et al. (2024). This approach consists of two steps: contrastive pre-training and fine-tuning. First, contrastive learning is performed using a large-scale, weakly-supervised dataset. The dataset used in the first stage typically consists of text pairs extracted from sources like Wikipedia and web corpora. Although this dataset is noisy and may contain false positives/negatives, large-scale training with substantial batch sizes has been shown to improve the quality of the embeddings. After contrastive pre-training, the model is fine-tuned using a manually labeled dataset. While the model from the first stage already yields reasonably effective embeddings, its performance can be further enhanced through fine-tuning with high-quality, human-labeled data.
Building on these methods, this report aims to develop a robust base model that can adapt to various domains through contrastive pre-training for Japanese. However, unlike in English or multilingual models, there are several challenges in developing a general text embedding model for Japanese. One of the most significant challenges is the limited availability of training datasets. To address this, in addition to collecting and preprocessing existing datasets, we synthesized additional training data using Large Language Models (LLMs) for contrastive pre-training. In this section, we provide a detailed explanation of our contrastive pre-training process.
2.1 Existing Datasets
First, we collected and preprocessed available open datasets suitable for training text embedding models. Specifically, we standardized the format and applied common preprocessing to seven datasets: Japanese Wikipedia (https://huggingface.co/datasets/hpprc/jawiki), WikiBooks (https://huggingface.co/datasets/hpprc/jawiki-books), Wiktionary (https://huggingface.co/datasets/hpprc/jawiki-wiktionary), the Japanese split of MQA (https://huggingface.co/datasets/clips/mqa), the Japanese split of CC News (https://huggingface.co/datasets/intfloat/multilingual_cc_news), the Japanese Research Corpus (JRC, https://huggingface.co/datasets/kunishou/J-ResearchCorpus), and Wiki Atomic Edits Faruqui et al. (2018). JRC is a high-quality corpus of Japanese academic papers; because it includes papers from the journal "Natural Language Processing," which is part of the evaluation benchmark JMTEB Li et al. (2024a), we excluded the data from that journal from our training set to prevent leakage. We applied NFKC normalization and removed invisible characters. Each dataset consists of pairs of a "query," typically an article title, and a "passage," typically an article body or other longer text.
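The preprocessing itself is simple; the following minimal sketch illustrates NFKC normalization and invisible-character removal (the exact set of characters stripped in our pipeline is an assumption here):

```python
import re
import unicodedata

# A few common invisible characters (zero-width spaces/joiners, BOM); the exact
# set removed in our pipeline may differ.
INVISIBLE_CHARS = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def preprocess(text: str) -> str:
    """NFKC-normalize a string and strip invisible characters."""
    text = unicodedata.normalize("NFKC", text)  # e.g. half-width kana and full-width digits are unified
    return INVISIBLE_CHARS.sub("", text).strip()

print(preprocess("ｱﾙｺｰﾙ\u200b度数は１０％"))  # -> アルコール度数は10%
```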
Source | Anchor | Positive | Negative | Dataset size |
Wikipedia (1) | title + section title | 1-paragraph | random 1-paragraph | 19,361,464 |
Wikipedia (3) | title + section title | 3-paragraphs | random 3-paragraphs | 10,010,462 |
Wikipedia (long) | title / abst. | abst. / article body | random abst. / article body | 7,889,486 |
Wiktionary | title | article body | random article body | 697,405 |
WikiBooks | title + section title | 1-paragraph | random 1-paragraph | 314,207 |
MQA | title | article body | BM25 mined article body | 25,165,824 |
CC News (long) | title | article body | BM25 mined article body | 6,248,336 |
CC News (short) | random sentence | sentence in the same article | sentence in other articles | 2,795,632 |
AutoWikiQA (MX) | question | passage | BM25 mined passage | 11,563,562 |
AutoWikiQA (Nemo) | question | passage | BM25 mined passage | 495,062 |
JRC | title + section title | section body | BM25 mined section body | 131,072 |
Wiki Atomic Edits | sentence | edited sentence | random sentence | 3,679,939 |
AutoWikiNLI | premise | hypothesis (entailment) | hypothesis (contradiction) | 203,147 |
JSNLI | premise | hypothesis (entailment) | hypothesis (contradiction) | 180,146 |
Total | | | | 88,735,744 |
Model | #Params. | GPUs | Base LM |
---|---|---|---|
Ruri-PT-small (cl-nagoya/ruri-pt-small) | 68M | A6000 ×4 | line-corporation/line-distilbert-base-japanese
Ruri-PT-base (cl-nagoya/ruri-pt-base) | 111M | A100 ×4 | tohoku-nlp/bert-base-japanese-v3
Ruri-PT-large (cl-nagoya/ruri-pt-large) | 337M | A100 ×4 | tohoku-nlp/bert-large-japanese-v2
2.2 Synthesized Datasets
The use of synthetic datasets in training text embeddings is very promising and is actively explored Zhang et al. (2023); Wang et al. (2024); Sato et al. (2024); Lee et al. (2024b). This is particularly true for languages like Japanese, where there are few available datasets for training embedding models, and licensing is crucial.
Therefore, we created synthetic datasets using LLMs for two types of data commonly used in embedding model training, QA and natural language inference (NLI), and used them for model training. While the synthetic datasets are of relatively high quality, they may still contain noise or bias; we therefore included them in the pre-training data rather than the fine-tuning data.
- AutoWikiQA is a dataset consisting of queries and answers generated from random Wikipedia passages. For generation, we mainly used Swallow-MX (https://huggingface.co/tokyotech-llm/Swallow-MX-8x7b-NVE-v0.1), a model continually pre-trained from Mixtral-8x7B (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) on the large cleaned Japanese corpus of Okazaki et al. (2024), as was done for Swallow Fujii et al. (2024). We also used Nemotron-4 340B Nvidia et al. (2024). The source passages used to generate queries and answers were constructed by concatenating three paragraphs from random Wikipedia articles to ensure sufficient text length; as a result, a single passage usually consists of multiple sentences rather than just one. The resulting dataset consists of over 250 million query–passage pairs.
- AutoWikiNLI is a synthesized natural language inference (NLI) dataset generated using Nemotron-4 340B. We sampled random sentences from Wikipedia as premises and generated both entailment and contradiction hypotheses for each. Generating both from a single premise yields triplets containing harder negatives, which are crucial for contrastive learning, because entailment and contradiction hypotheses are lexically similar to each other. Initially, we observed that when the LLM generated the entailment hypothesis first and the contradiction second, the contradictions were often simple negations; we therefore reversed the generation order, producing the contradiction first and the entailment second. Additionally, some generated hypotheses were of low quality. To address this, we used the Nemotron-4 340B reward model (https://build.nvidia.com/nvidia/nemotron-4-340b-reward) to score the generated outputs and removed the bottom 20% of examples by helpfulness score, as sketched below.
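The reward-based filtering step can be sketched as follows; the field names and score range are illustrative, not the exact format used in our pipeline.

```python
import numpy as np

def filter_by_helpfulness(examples, helpfulness_scores, drop_percent=20):
    """Keep only examples whose reward-model helpfulness score is at or above
    the `drop_percent`-th percentile, i.e. drop the lowest-scored 20% by default."""
    threshold = np.percentile(helpfulness_scores, drop_percent)
    return [ex for ex, s in zip(examples, helpfulness_scores) if s >= threshold]

# Toy usage with illustrative fields and scores.
examples = [{"premise": f"p{i}", "entailment": f"e{i}", "contradiction": f"c{i}"} for i in range(10)]
scores = np.random.default_rng(0).uniform(0.0, 4.0, size=10)
print(len(filter_by_helpfulness(examples, scores)))  # roughly 80% of the examples remain
```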
2.3 Pre-training Dataset
Table 1 shows the datasets used for contrastive pre-training. Our training strategy aims to achieve high-performance models by leveraging a diverse range of datasets, including both noisy and high-quality, manually curated sources; this allows the model to learn robust representations from varied data before fine-tuning on more specific, high-quality datasets. Wikipedia was utilized in multiple ways, with several different extraction and pairing methods, to maximize the value of this rich information source for pre-training. We also incorporated JSNLI Yoshikoshi et al. (2020), a Japanese NLI dataset created by machine-translating the English SNLI corpus Bowman et al. (2015).
We implemented hard negative mining during the pre-training phase, a technique shown to enhance model performance as reported in Wang et al. (2022). Although Wang et al. (2022) suggests that hard negative mining becomes impractical for datasets approaching 200 million samples, our relatively smaller dataset size allowed us to effectively employ this technique. We utilized BM25 for generating hard negatives, which required preprocessing the entire document corpus to create searchable indexes. To optimize this process, we created separate indexes for each dataset, thereby reducing the time and computational costs associated with indexing and searching.
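As an illustration of the mining step, the sketch below builds a BM25 index over the passages of a single source dataset with the rank_bm25 library and ranks candidates for one query; in practice Japanese text would be tokenized with a morphological analyzer, and each dataset in Table 1 gets its own index.

```python
from rank_bm25 import BM25Okapi

# Toy passage collection standing in for one source dataset; character-level
# tokenization is a crude placeholder for a proper Japanese tokenizer.
passages = [
    "瑠璃色は紫みを帯びた濃い青色のことである。",
    "名古屋市は愛知県西部に位置する政令指定都市である。",
    "深層学習は機械学習の一分野である。",
]
tokenize = list
bm25 = BM25Okapi([tokenize(p) for p in passages])

query = "青色の顔料"
scores = bm25.get_scores(tokenize(query))
ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
# High-ranking passages that are not the annotated positive become hard negative candidates.
print([passages[i] for i in ranked])
```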
2.4 Training Details
An overview of the contrastive pre-trained models is shown in Table 2. In the contrastive pre-training stage, we built three models of different sizes: small, base, and large. The small model was based on line-corporation/line-distilbert-base-japanese, the base model on tohoku-nlp/bert-base-japanese-v3, and the large model on tohoku-nlp/bert-large-japanese-v2, with contrastive pre-training applied on top of these checkpoints. All of these models are Japanese BERT Devlin et al. (2019) models. We used the improved contrastive loss proposed by Li et al. (2023) for training. In addition to the query–passage similarities used in standard contrastive learning with in-batch negatives, this loss function also considers query–query, passage–query, and passage–passage similarities, lowering the similarity scores of all non-positive pairs. In this regard, it is similar to the loss function proposed by Zhang et al. (2021a), which likewise makes maximal use of in-batch negatives. For training the small model, we used four NVIDIA A6000 GPUs, while for the base and large models, we used four NVIDIA A100 (80GB) GPUs.
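A minimal PyTorch sketch of this loss is shown below. It follows the description above (all query–passage, query–query, and passage–passage pairs in the batch act as negatives); the temperature value and the exact treatment of the positive pair in the denominator are illustrative and may differ slightly from the released GTE formulation.

```python
import torch
import torch.nn.functional as F

def improved_contrastive_loss(q: torch.Tensor, p: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss in the spirit of GTE (Li et al., 2023).

    q, p: (batch, dim) embeddings of queries and their positive passages.
    For query i, the partition function covers q_i-p_j, q_i-q_j (j != i),
    p_i-p_j (j != i), and q_j-p_i pairs, so every non-positive pair in the
    batch acts as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    b = q.size(0)

    sim_qp = q @ p.T / temperature  # query-passage similarities
    sim_qq = q @ q.T / temperature  # query-query similarities
    sim_pp = p @ p.T / temperature  # passage-passage similarities

    # Self-similarities (q_i with q_i, p_i with p_i) must not act as negatives.
    diag = torch.eye(b, dtype=torch.bool, device=q.device)
    neg_inf = torch.finfo(sim_qp.dtype).min
    sim_qq = sim_qq.masked_fill(diag, neg_inf)
    sim_pp = sim_pp.masked_fill(diag, neg_inf)

    positives = sim_qp.diagonal()
    # Denominator for example i: all pair types listed above. The positive pair
    # appears in both the q->p and p->q directions here; the original
    # formulation's exact treatment may differ slightly.
    logits = torch.cat([sim_qp, sim_qq, sim_pp, sim_qp.T], dim=1)  # (b, 4b)
    return (torch.logsumexp(logits, dim=1) - positives).mean()

# Toy usage: a batch of 8 (query, positive passage) pairs with 768-dim embeddings.
q, p = torch.randn(8, 768), torch.randn(8, 768)
print(improved_contrastive_loss(q, p))
```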
Prefix
Recent embedding models commonly use prefixes in addition to the text being embedded Wang et al. (2022); Li et al. (2024b, 2023); Xiao et al. (2024); Wang et al. (2024). This approach is known to be particularly effective for tasks requiring asymmetric similarity, such as retrieval tasks. Therefore, we also utilized prefixes during model training. Specifically, we added the prefix “クエリ: ” to search queries and “文章: ” to target passages. While English or multilingual models often use “query: ” for queries and “passage: ” for passages, since our model is a Japanese model, we simply translated these prefixes into Japanese.
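For example, assuming the released checkpoints load as standard Sentence Transformers models (the model name below is taken from Table 9), queries and passages are encoded with their respective prefixes:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cl-nagoya/ruri-base")

queries = ["クエリ: 瑠璃色とはどんな色?"]
passages = [
    "文章: 瑠璃色は紫みを帯びた濃い青色のことである。",
    "文章: 深層学習は機械学習の一分野である。",
]

# Prefixes mark the asymmetric roles of the two sides; embeddings are compared by cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)  # the first passage should score higher for this query
```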
Batching Strategy
As with GTE Li et al. (2023) and Instructor Su et al. (2023), only triplets from the same dataset were included in a single batch to prevent shortcut learning. This batching strategy is called task-homogeneous batching. It has another advantage: it prevents datasets with different sequence lengths from being mixed in the same batch, reducing the number of padding tokens and thus the training time. Furthermore, because identical sentences appearing in the same batch would act as false negatives, we removed duplicate sentences within each batch in advance.
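A minimal sketch of task-homogeneous batching as a PyTorch batch sampler is given below; the class and argument names are ours, and the duplicate-removal step is handled separately in our pipeline.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class TaskHomogeneousBatchSampler(Sampler):
    """Yield batches whose examples all come from the same source dataset.

    `source_ids[i]` names the source dataset of example i. Drawing each batch
    from a single source avoids cross-dataset shortcut learning and keeps
    sequence lengths within a batch similar, reducing padding.
    """

    def __init__(self, source_ids, batch_size, drop_last=True, seed=42):
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.rng = random.Random(seed)
        self.by_source = defaultdict(list)
        for idx, source in enumerate(source_ids):
            self.by_source[source].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.by_source.values():
            indices = indices[:]
            self.rng.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batch = indices[i:i + self.batch_size]
                if len(batch) == self.batch_size or not self.drop_last:
                    batches.append(batch)
        self.rng.shuffle(batches)  # interleave sources across training steps
        yield from batches

    def __len__(self):
        if self.drop_last:
            return sum(len(v) // self.batch_size for v in self.by_source.values())
        return sum(-(-len(v) // self.batch_size) for v in self.by_source.values())

# Usage: DataLoader(dataset, batch_sampler=TaskHomogeneousBatchSampler(source_ids, batch_size=8192))
```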
Hyperparameters and Implementations
Following Li et al. (2023), which reported minimal performance improvements for batch sizes larger than 8192, we set the batch size to 8192 for the contrastive pre-training phase. As with E5, the positional embeddings were kept frozen (https://github.com/microsoft/unilm/issues/1120). Unlike SimCSE, but similar to SimLM Wang et al. (2023) (https://github.com/microsoft/unilm/blob/9c0f1ff7ca53431fe47d2637dfe253643d94185b/simlm/src/config.py#L54) and E5, we did not use a pooler layer. We performed data augmentation by shuffling the sentence order of positive documents; for sentence splitting, we used Konoha (https://github.com/himkt/konoha). For other hyperparameters, refer to the left side of Table 14.
3 Building Reranker
A reranker is a model typically used in retrieval tasks: the query and document are concatenated and fed into the model, which outputs a relevance score. Unlike a dual-encoder, which embeds the query and document independently and measures similarity in vector space, the reranker, also known as a cross-encoder, directly captures the interaction between the query and the document. This allows it to measure relevance more accurately than a dual-encoder Wang et al. (2023). Recent embedding models have found it effective to incorporate knowledge distillation from cross-encoders in addition to contrastive learning Wang et al. (2023, 2022). In this setting, the dual-encoder is trained so that its query–document similarity score distribution aligns with that produced by the cross-encoder.
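Concretely, scoring with a cross-encoder looks like the following; the checkpoint name is taken from Table 5, and we assume here that it loads as a standard single-logit sequence-classification model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cl-nagoya/ruri-reranker-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "瑠璃色とはどんな色?"
passages = [
    "瑠璃色は紫みを帯びた濃い青色のことである。",
    "今日の天気は晴れである。",
]

# The query and each passage are concatenated into one input, so self-attention
# can model their interaction directly.
inputs = tokenizer([query] * len(passages), passages,
                   padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per (query, passage) pair
print(scores)  # higher means more relevant; a sigmoid can map the scores to [0, 1]
```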
In the model constructed in this report, we follow E5 and apply knowledge distillation from cross-encoders. To achieve this, we first built a reranker for Japanese. The biggest challenge in constructing a Japanese reranker is the dataset. Japanese retrieval datasets are extremely limited in size, making it difficult to prepare training datasets on the scale of those used in English or multilingual models. Therefore, we adopted a two-stage learning approach, starting with training on noisy datasets, including the synthesized dataset from Section 2, and then fine-tuning on higher-quality datasets.
Source | Dataset size |
---|---|
JSQuAD | 212,352 |
AutoWikiQA (Nemo) | 190,743 |
JaQuAD | 108,068 |
Quiz No Mori | 36,120 |
Quiz Works | 29,112 |
JQaRA | 16,260 |
MIRACL | 13,968 |
Mr. TyDi | 7,394 |
MKQA | 6,636 |
Total | 620,653 |
Source | Dataset size |
---|---|
Quiz No Mori | 18,060 |
Quiz Works | 14,556 |
JQaRA | 8,130 |
MIRACL | 6,984 |
Mr. TyDi | 3,697 |
Total | 51,427 |
3.1 Datasets
We trained the reranker in two stages: the first stage used large but noisy datasets, while the second stage employed smaller, higher-quality datasets, following JaColBERTv2.5 Clavié (2024). The datasets used in the first stage are listed in Table 3, and those used in the second stage are shown in Table 4. We did not use the mMARCO Bonifacio et al. (2021) dataset, a translated version of MS MARCO Bajaj et al. (2018): MS MARCO is primarily intended for non-commercial research, which raised licensing concerns for publishing models under a commercial-use license; moreover, the quality of its Japanese translations is relatively low, and in preliminary experiments, training with mMARCO negatively affected reranker performance. For reranker training, we utilized existing Japanese retrieval and QA datasets, including JSQuAD Kurihara et al. (2022), JaQuAD So et al. (2022), JQaRA Tateno (2024b), MKQA Longpre et al. (2021), Mr. TyDi Zhang et al. (2021b), and MIRACL Zhang et al. (2022). We also used a synthesized dataset generated by Nemotron-4 340B Nvidia et al. (2024), as well as high-quality QA datasets extracted from Japanese quiz websites that are available under a free license (https://huggingface.co/datasets/hpprc/quiz-works, https://huggingface.co/datasets/hpprc/quiz-no-mori). For datasets with predefined splits, we used only the train set; for the others, we used all available data. As a result, no examples from the benchmark test sets were used for training.
Pseudo Positives and Hard Negative Mining
To train rerankers, each query is paired with one positive document and multiple negative documents. The model is trained to ensure that the relevance score for the query–positive document pair is higher than for any query–negative document pair. There are two key points in reranker training; the first is to use challenging examples as hard negatives, and the second is to avoid false negatives, where a negative document is actually a positive example.
To collect hard negatives for each query, we combined BM25-based negative mining with nearest neighbor search over embeddings from the multilingual E5-large model Li et al. (2024b) (mE5-large). Specifically, we merged the BM25 and mE5-large rankings using reciprocal rank fusion (RRF) Cormack et al. (2009) and selected hard negatives while excluding the top-ranked candidates: documents ranked between 30th and 100th in the fused ranking were used as hard negatives. To mitigate the issue of false negatives, we exploited the answers in the QA datasets: after hard negative mining, documents containing the query's answer were removed from the negatives and instead treated as pseudo positives ("mined positives"), as in JAQKET Suzuki et al. (2020).
For hard negative mining on the Japanese QA datasets other than Mr. TyDi and MIRACL, we used a collection of concatenated three-paragraph passages from Japanese Wikipedia, Wiktionary, and WikiBooks as the pool of negative candidates. For Mr. TyDi and MIRACL, we used their predefined document collections as-is.
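Reciprocal rank fusion itself is straightforward; the sketch below fuses a BM25 ranking and an mE5-large ranking for one query and takes the rank-30 to rank-100 window of the fused list as the hard negative pool (document ids are toy values).

```python
def reciprocal_rank_fusion(rankings, k=60):
    """RRF (Cormack et al., 2009): score(d) = sum over rankings of 1 / (k + rank(d)),
    where each ranking lists document ids from best to worst."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings standing in for the BM25 and mE5-large results for one query.
bm25_ranking = [f"doc{i}" for i in range(200)]
me5_ranking = [f"doc{i}" for i in reversed(range(200))]

fused = reciprocal_rank_fusion([bm25_ranking, me5_ranking])
# Skip the top of the fused list (likely relevant, i.e. potential false negatives)
# and keep ranks 30-100 as hard negative candidates; documents containing the gold
# answer string are additionally moved to the "mined positives" side.
hard_negative_pool = fused[29:100]
print(len(hard_negative_pool))  # 71 candidates
```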
3.2 Training Details
We trained the reranker in two stages. In the first stage, we built a reranker using the noisy data, and in the second stage, we fine-tuned it on the higher-quality data. In the first stage, we applied data augmentation by shuffling the sentence order of each positive document in the datasets, but no shuffling was done in the second stage. The maximum sequence length during training was set to a relatively short 256 in the first stage and increased to 512 in the second stage. The number of hard negatives was set to 63 in both stages. As the loss function, we used a cross-entropy loss in which the score of the positive document is maximized relative to the scores of the 63 negative documents for each query. We used the contrastive pre-trained models constructed in Section 2 as the base models. For other hyperparameters, refer to Table 15.
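The loss is a standard listwise cross-entropy over the positive and the 63 hard negatives of each query; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def reranker_loss(scores: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over 1 positive + N hard negatives per query.

    `scores` has shape (batch, 1 + num_negatives), with column 0 holding the
    cross-encoder score of the positive document; the loss pushes that score
    above the scores of all negatives for the same query."""
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, labels)

# e.g. a batch of 4 queries, each scored against 1 positive and 63 hard negatives
print(reranker_loss(torch.randn(4, 64)))
```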
Model | #Param. (w/o Emb.) | JQaRA | JaCWIR | MIRACL |
---|---|---|---|---|
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1 | 107M (11M) | 61.4 | 93.8 | 90.6 |
hotchpotch/japanese-reranker-cross-encoder-small-v1 | 118M (21M) | 62.5 | 93.9 | 92.2 |
hotchpotch/japanese-reranker-cross-encoder-base-v1 | 111M (86M) | 67.1 | 93.4 | 93.3 |
hotchpotch/japanese-reranker-cross-encoder-large-v1 | 337M (303M) | 71.0 | 93.6 | 91.5 |
hotchpotch/japanese-bge-reranker-v2-m3-v1 | 568M (303M) | 69.2 | 93.7 | 94.7 |
BAAI/bge-reranker-v2-m3 | 568M (303M) | 67.3 | 93.4 | 94.9 |
Ruri-Reranker-small (cl-nagoya/ruri-reranker-small) | 68M (43M) | 64.5 | 92.6 | 92.3
Ruri-Reranker-base (cl-nagoya/ruri-reranker-base) | 111M (86M) | 74.3 | 93.5 | 95.6
Ruri-Reranker-large (cl-nagoya/ruri-reranker-large) | 337M (303M) | 77.1 | 94.1 | 96.1
3.3 Evaluation
Settings
The reranker is used for dataset filtering and knowledge distillation for embedding models, so its performance is expected to impact the performance of the resulting embedding models. Therefore, we first evaluated the reranker for Japanese. In reranker evaluation, the goal is to determine how well the model ranks the relevant documents at the top when given a query and a set of documents. We used JQaRA Tateno (2024b), JaCWIR Tateno (2024a), and the test set of MIRACL Zhang et al. (2022) for evaluation. JQaRA is a dataset designed to evaluate the retrieval of useful data for answering questions, which is important for retrieval-augmented generation (RAG). JaCWIR is a diverse retrieval evaluation dataset based on web articles. For the evaluation of JQaRA and JaCWIR, we used the official evaluation code (https://github.com/hotchpotch/JQaRA, https://github.com/hotchpotch/JaCWIR), and for MIRACL, we used an implementation designed for reranking with mined negatives (https://github.com/oshizo/JapaneseEmbeddingEval). The evaluation metrics were top-10 nDCG (nDCG@10) for JQaRA, top-10 mean average precision (MAP@10) for JaCWIR, and top-30 recall (Recall@30) for MIRACL.
Results
Table 5 shows the evaluation results of major multilingual and Japanese rerankers, as well as our reranker (Ruri-Reranker). Ruri-reranker demonstrated consistently strong performance across the board, with the base model achieving performance comparable to or exceeding existing Japanese and multilingual rerankers, and the large model significantly outperforming them. Notably, our model performed particularly well on JQaRA, which can be attributed to the use of diverse QA datasets during training.
Model | Stage | JQaRA | JaCWIR | MIRACL
---|---|---|---|---
Ruri-Reranker-small | 1 only | 63.9 | 92.5 | 91.2
Ruri-Reranker-small | 2 only | 60.3 | 89.9 | 89.3
Ruri-Reranker-small | 1 → 2 | 64.5 | 92.6 | 92.3
Ruri-Reranker-base | 1 only | 72.9 | 92.4 | 94.2
Ruri-Reranker-base | 2 only | 68.0 | 92.9 | 93.7
Ruri-Reranker-base | 1 → 2 | 74.3 | 93.5 | 95.6
Ruri-Reranker-large | 1 only | 75.8 | 93.4 | 95.4
Ruri-Reranker-large | 2 only | 70.5 | 90.8 | 93.2
Ruri-Reranker-large | 1 → 2 | 77.1 | 94.1 | 96.1
Model | Phase | JQaRA | JaCWIR | MIRACL
---|---|---|---|---
Small, w/o pre-training | stage1 | 63.7 | 89.4 | 90.4
Small, w/o pre-training | stage2 | 64.3 | 91.4 | 91.6
Ruri-Reranker-small (w/ pre-training) | stage1 | 63.9 | 92.5 | 91.2
Ruri-Reranker-small (w/ pre-training) | stage2 | 64.5 | 92.6 | 92.3
Base, w/o pre-training | stage1 | 71.8 | 89.3 | 93.9
Base, w/o pre-training | stage2 | 73.1 | 91.6 | 95.1
Ruri-Reranker-base (w/ pre-training) | stage1 | 72.9 | 92.4 | 94.2
Ruri-Reranker-base (w/ pre-training) | stage2 | 74.3 | 93.5 | 95.6
Large, w/o pre-training | stage1 | 76.1 | 92.2 | 95.2
Large, w/o pre-training | stage2 | 77.3 | 93.5 | 96.0
Ruri-Reranker-large (w/ pre-training) | stage1 | 75.8 | 93.4 | 95.4
Ruri-Reranker-large (w/ pre-training) | stage2 | 77.1 | 94.1 | 96.1
Ablation Study
We conducted several ablation studies on the design of our reranker. The two main points we investigated were: 1) whether the two-stage training of the reranker is effective, and 2) whether using a contrastive pre-trained model as the base model for the reranker is beneficial.
First, for the two-stage reranker training, we experimented with three configurations: 1) training only the first stage, 2) training only the second stage, and 3) training from the first stage to the second stage (i.e. Ruri-Reranker). Table 6 shows the results. From the table, it is clear that two-stage training consistently yielded the best performance across all model sizes. While training only in the first stage achieved reasonable performance, adding fine-tuning with higher-quality datasets in the second stage further improved the results, demonstrating the effectiveness of the two-stage approach. This observation is consistent with Clavié (2024).
Next, we investigated whether using a contrastive pre-trained model (Ruri-PT) as the base model for the reranker is beneficial. Specifically, we compared the performance of a contrastive pre-trained model and a non-pre-trained model, both fine-tuned as rerankers in the same manner. Table 7 shows the results. Observing the models and stages, it is evident that the Ruri-PT generally outperformed the non-pre-trained model. While the training objective of contrastive pre-training differs from that of reranking, this result suggests that contrastive pre-training helps the model learn how to focus on important information in the text, leading to improved performance.
4 Supervised Fine-tuning
Following previous research, we constructed the final embedding model by fine-tuning a contrastive pre-trained model, trained on a weakly supervised dataset, using high-quality datasets. This section describes the datasets used for training, details of the training process, and evaluation.
4.1 Dataset
For fine-tuning the contrastive pre-trained model, we collected high-quality datasets, shown in Table 8. They fall into two categories: retrieval/QA datasets and natural language inference (NLI) datasets. The retrieval/QA datasets were the same as those used in the second stage of reranker training, described in Section 3. For the NLI datasets, we used NU-SNLI (https://huggingface.co/datasets/cl-nagoya/nu-snli) and NU-MNLI (https://huggingface.co/datasets/cl-nagoya/nu-mnli), which were translated from SNLI Bowman et al. (2015) and MNLI Williams et al. (2018) using a machine translation model fine-tuned from Swallow-MX. While NU-SNLI and NU-MNLI were not manually created, their quality is sufficiently high. We also used JaNLI Yanaka and Mineshima (2021) to help the model capture more subtle differences in meaning.
To perform knowledge distillation from the cross-encoder, we used the reranker constructed in Section 3 to score the relevance of the retrieval examples. Additionally, to mitigate the negative effect of noisy examples in the retrieval datasets, we removed examples where the relevance score between the query and the positive document was less than 0.8. As data augmentation, we also added examples in which the sentence order of the positive document was shuffled, alongside the original, unshuffled examples. For the NLI datasets, we did not apply knowledge distillation, so no relevance scoring by the reranker was performed. As in Section 2, we used task-homogeneous batching to ensure that only examples from the same dataset were included in each batch.
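These filtering and augmentation steps can be sketched as follows; the field names are illustrative, and sentence splitting can be done with a tool such as Konoha, as in Section 2.

```python
import random

def filter_and_augment(examples, reranker_scores, threshold=0.8, seed=42):
    """Drop retrieval examples whose query-positive relevance score from the
    reranker is below `threshold`, then add a copy of each surviving example
    with the sentence order of its positive document shuffled."""
    rng = random.Random(seed)
    out = []
    for ex, score in zip(examples, reranker_scores):
        if score < threshold:
            continue
        out.append(ex)
        shuffled = ex["positive_sentences"][:]  # pre-split sentences of the positive document
        rng.shuffle(shuffled)
        out.append({**ex, "positive_sentences": shuffled})
    return out

# Toy usage with illustrative fields: the second example is filtered out (score < 0.8).
examples = [
    {"query": "q1", "positive_sentences": ["文1。", "文2。", "文3。"]},
    {"query": "q2", "positive_sentences": ["文4。", "文5。"]},
]
print(len(filter_and_augment(examples, [0.95, 0.42])))  # -> 2 (one kept example plus its shuffled copy)
```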
Source | Distill. | Dataset size |
---|---|---|
Quiz No Mori | ✓ | 31,232 |
Quiz Works | ✓ | 26,624 |
JQaRA | ✓ | 13,824 |
MIRACL | ✓ | 12,800 |
Mr. TyDi | ✓ | 7,168 |
NU-SNLI | 109,568 | |
NU-MNLI | 77,824 | |
JaNLI | 13,824 | |
Total | | 292,864 |
Model | #Params. | Dim. | #Layer | Pooling | Context Len. | Vocab Size | JMTEB Avg. |
Ruri-small (cl-nagoya/ruri-small) | 68M | 768 | 6 | Mean | 512 | 32,768 | 71.53
Ruri-base (cl-nagoya/ruri-base) | 111M | 768 | 12 | Mean | 512 | 32,768 | 71.91
Ruri-large (cl-nagoya/ruri-large) | 337M | 1024 | 24 | Mean | 512 | 32,768 | 73.31
Model | #Param. | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg. |
cl-nagoya/sup-simcse-ja-base | 111M | 49.64 | 82.05 | 73.47 | 91.83 | 51.79 | 62.57 | 63.36 |
cl-nagoya/sup-simcse-ja-large | 337M | 37.62 | 83.18 | 73.73 | 91.48 | 50.56 | 62.51 | 58.88 |
cl-nagoya/unsup-simcse-ja-base | 111M | 40.23 | 78.72 | 73.07 | 91.16 | 44.77 | 62.44 | 58.39 |
cl-nagoya/unsup-simcse-ja-large | 337M | 40.53 | 80.56 | 74.66 | 90.95 | 48.41 | 62.49 | 59.58 |
pkshatech/GLuCoSE-base-ja | 133M | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 | 67.29 |
sentence-transformers/LaBSE | 472M | 40.12 | 76.56 | 72.66 | 91.63 | 44.88 | 62.33 | 58.01 |
intfloat/multilingual-e5-small | 118M | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 | 67.71 |
intfloat/multilingual-e5-base | 278M | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 | 68.61 |
intfloat/multilingual-e5-large | 560M | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 | 70.90 |
OpenAI/text-embedding-ada-002 | - | 64.38 | 79.02 | 69.75 | 93.04 | 48.30 | 62.40 | 67.21 |
OpenAI/text-embedding-3-small | - | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 | 69.18 |
OpenAI/text-embedding-3-large | - | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 | 74.05 |
Ruri-small (cl-nagoya/ruri-small) | 68M | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 | 71.53
Ruri-base (cl-nagoya/ruri-base) | 111M | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 | 71.91
Ruri-large (cl-nagoya/ruri-large) | 337M | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 | 73.31
Model | JaGovFAQs | JAQKET | Mr. TyDi | NLP Journal Abst.–Intro. | NLP Journal Title–Abst. | NLP Journal Title–Intro. | Avg.
pkshatech/GLuCoSE-base-ja | 63.88 | 39.82 | 30.28 | 78.26 | 82.06 | 59.82 | 59.02 |
intfloat/multilingual-e5-small | 64.11 | 49.97 | 36.05 | 85.21 | 95.26 | 72.99 | 67.27 |
intfloat/multilingual-e5-base | 65.34 | 50.67 | 38.38 | 87.10 | 94.73 | 73.05 | 68.21 |
intfloat/multilingual-e5-large | 70.30 | 58.78 | 43.63 | 86.00 | 94.70 | 72.48 | 70.98 |
OpenAI/text-embedding-ada-002 | 61.02 | 42.56 | 14.51 | 94.99 | 91.23 | 81.98 | 64.38 |
OpenAI/text-embedding-3-small | 64.02 | 33.94 | 20.03 | 98.47 | 91.70 | 90.17 | 66.39 |
OpenAI/text-embedding-3-large | 72.41 | 48.21 | 34.88 | 99.33 | 96.55 | 95.47 | 74.48 |
Ruri-small (cl-nagoya/ruri-small) | 73.65 | 48.44 | 33.43 | 87.69 | 97.17 | 76.09 | 69.41
Ruri-base (cl-nagoya/ruri-base) | 74.56 | 50.12 | 35.45 | 86.89 | 96.57 | 75.31 | 69.82
Ruri-large (cl-nagoya/ruri-large) | 76.68 | 61.74 | 38.03 | 87.12 | 96.58 | 77.97 | 73.02
4.2 Training Details
We built a high-performance embedding model by fine-tuning the contrastive pre-trained model constructed in Section 2 using high-quality datasets. An overview of each model is shown in Table 9. Following Clavié (2024), we decoupled the loss for knowledge distillation and contrastive learning. Specifically, for retrieval/QA dataset examples, we computed the loss using knowledge distillation, and for NLI examples, we computed the loss using contrastive learning.
During knowledge distillation, inspired by Clavié (2024), we applied min-max normalization to both the student scores, calculated via cosine similarity of embeddings, and the teacher scores, calculated by the cross-encoder. For the NLI dataset, we used the improved contrastive loss, as described in Section 2. The maximum sequence length was set to 512, the batch size to 512, and the number of hard negatives to 15. For other hyperparameters, refer to the right side of Table 14.
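A minimal sketch of the distillation objective is given below. The min-max normalization of student and teacher scores follows the description above; the KL-divergence form (with an illustrative temperature) is one common choice, e.g. in E5-style distillation, and the exact divergence used for Ruri is not spelled out in this report.

```python
import torch
import torch.nn.functional as F

def minmax(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Min-max normalize scores per query (row-wise) to [0, 1]."""
    x_min = x.min(dim=-1, keepdim=True).values
    x_max = x.max(dim=-1, keepdim=True).values
    return (x - x_min) / (x_max - x_min + eps)

def distillation_loss(student_cos, teacher_scores, temperature=0.05):
    """Align the dual-encoder's cosine-similarity distribution with the
    cross-encoder's score distribution over (positive + hard negative) documents."""
    student = F.log_softmax(minmax(student_cos) / temperature, dim=-1)
    teacher = F.softmax(minmax(teacher_scores) / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy usage: a batch of 4 queries, each with 1 positive + 15 hard negatives.
student_cos = torch.rand(4, 16)     # cosine similarities from the embedding model
teacher_scores = torch.rand(4, 16)  # relevance scores from the reranker
print(distillation_loss(student_cos, teacher_scores))
```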
4.3 Evaluation
We evaluated our Japanese general text embedding model, Ruri, on the Japanese text embedding benchmark.
Settings
For evaluation, we used JMTEB Li et al. (2024a), the Japanese version of the massive text embedding benchmark (MTEB) Muennighoff et al. (2023). JMTEB includes 16 evaluation datasets covering various tasks such as classification, retrieval, and clustering. We used the official implementation for the evaluation.
Results
Table 10 shows the results. The results indicate that our model consistently outperforms existing multilingual embedding models such as mE5 and Japanese embedding models on average. Notably, our base-sized model achieved higher average performance than the mE5-large. Even when compared to proprietary embedding models, our model demonstrates comparable performance.
Model | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg. |
---|---|---|---|---|---|---|---|
Ruri-PT-large (contrastive pre-training w/ synthetic retrieval data) | 71.48 | 82.06 | 76.12 | 92.75 | 53.41 | 62.27 | 72.46
w/o synthetic retrieval data | 68.08 | 82.32 | 76.42 | 92.66 | 51.98 | 62.29 | 71.11
Model | Retrieval | STS | Class. | Reranking | Clustering | Pair. | Avg.
---|---|---|---|---|---|---|---
Ruri-PT-small (pre-training only) | 67.39 | 81.41 | 75.41 | 92.98 | 51.13 | 62.44 | 70.41
w/o pre-training (fine-tuning only) | 56.62 | 82.45 | 77.30 | 92.01 | 47.77 | 62.42 | 66.49
Ruri-small (pre-training + fine-tuning) | 69.41 | 82.79 | 76.22 | 93.00 | 51.19 | 62.11 | 71.53
Ruri-PT-base (pre-training only) | 68.18 | 81.81 | 74.56 | 92.82 | 53.35 | 62.33 | 70.80
w/o pre-training (fine-tuning only) | 52.99 | 81.95 | 76.19 | 91.60 | 51.85 | 62.20 | 65.25
Ruri-base (pre-training + fine-tuning) | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 | 71.91
Ruri-PT-large (pre-training only) | 71.48 | 82.06 | 76.12 | 92.75 | 53.41 | 62.27 | 72.46
w/o pre-training (fine-tuning only) | 57.84 | 83.66 | 76.50 | 91.51 | 49.56 | 62.35 | 67.09
Ruri-large (pre-training + fine-tuning) | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 | 73.31
To further analyze the behavior of each model, we focused on retrieval tasks, where performance differences tend to be more pronounced. Table 11 shows the performance of each model on these tasks. JaGovFAQs, JAQKET, and Mr. TyDi are standard QA/retrieval tasks, while the three NLP Journal tasks involve retrieving abstracts or introductions given paper titles, or retrieving introductions given abstracts. The results show that our models perform well on JaGovFAQs, a FAQ retrieval task, and on JAQKET, a QA task, outperforming the proprietary embeddings on these tasks. (We used JQaRA to train the reranker and the fine-tuned models, and JQaRA shares some data with JAQKET; to avoid using the JAQKET test set and prevent data leakage, we only used a portion of the JQaRA data, namely the dev and unused splits, for training.) On the other hand, proprietary embedding models performed exceptionally well on the NLP Journal tasks. Although the training data and methods of proprietary embeddings are not disclosed, these NLP Journal tasks can be viewed as topic-similarity search over LaTeX documents, suggesting that these models may have been trained on a large amount of LaTeX text.
Ablation Study
The key differences between our embedding model and existing Japanese models are: 1) the use of a synthesized dataset for model training, and 2) the application of contrastive pre-training. Therefore, we conducted an ablation study on these aspects.
First, Table 12 shows the performance of the large model on JMTEB when contrastive pre-training is performed with or without the synthesized retrieval dataset. The results indicate a significant performance improvement in retrieval tasks when the synthesized retrieval dataset is used. On the other hand, when the synthesized retrieval dataset is excluded, there is a slight improvement in STS and classification tasks, suggesting that the introduction of synthesized datasets may not be beneficial for all tasks. However, since retrieval tasks are generally more challenging compared to other tasks, improving performance in these tasks by using synthesized datasets is valuable.
Next, to verify the effectiveness of contrastive pre-training under the assumption of supervised fine-tuning, we compared the performance of models after supervised fine-tuning, using both contrastive pre-trained models and non-pre-trained models as the base models. Table 13 shows the results. We can clearly observe that the presence or absence of contrastive pre-training has a significant impact on post-supervised fine-tuning performance. The improvement in retrieval task performance is particularly notable, indicating the importance of contrastive pre-training for retrieval tasks.
5 Conclusion and Future Work
In this report, we described the process of building a general-purpose Japanese text embedding model, Ruri. Our contributions can be summarized into the following five points:
1. We collected datasets for building a Japanese embedding model and made them public with a permissive license.
2. To address the shortage of Japanese retrieval datasets, we constructed a synthesized dataset using LLMs and verified its effectiveness.
3. We constructed a large-scale dataset for contrastive pre-training in Japanese and demonstrated its utility.
4. We developed a Japanese reranker, achieving the highest performance among existing Japanese and multilingual rerankers.
5. We built a Japanese embedding model, Ruri, which significantly outperformed existing models on the Japanese text embedding benchmark.
While our model already demonstrates high performance, there are still many challenges left in the realm of Japanese text embedding. Below, we outline some of the challenges and considerations that we were unable to address in this report.
Prefix
Using more diverse prefixes, beyond the simple “クエリ: ” and “文章: ”, could potentially improve overall performance. Indeed, recent models employing instructions Wang et al. (2024); BehnamGhader et al. (2024); Su et al. (2023); Lee et al. (2024a) or more detailed prefixes Nussbaum et al. (2024) (https://huggingface.co/nomic-ai/nomic-embed-text-v1.5) have been developed.
Knowledge Distillation from Cross-Encoder
Knowledge distillation from cross-encoders has been introduced in models like SimLM Wang et al. (2023) and E5 Wang et al. (2022), but recent models based on LLMs BehnamGhader et al. (2024); Lee et al. (2024a) do not seem to incorporate knowledge distillation from cross-encoders. Also, during the development of Ruri, we observed that introducing knowledge distillation made training slightly unstable. Whether this technique is truly necessary remains an open question for further investigation.
Dataset
The current dataset is still insufficient, especially in terms of web corpora, compared to GLuCoSE (https://huggingface.co/pkshatech/GLuCoSE-base-ja) and other multilingual embedding models. To build a general-purpose text embedding model usable in various domains, a more diverse and higher-quality dataset may be crucial.
Base LM and Pre-training
There may be room for improvement in the performance of the Japanese BERT used as the base model. It has been reported that even with the same training methods and datasets, the quality of text embeddings can vary greatly depending on the base model Tsukagoshi et al. (2023). Developing base models specifically pre-trained for embedding, such as SimLM Wang et al. (2023), RetroMAE Xiao et al. (2022), and RetroMAE-2 Liu et al. (2023), could potentially lead to even higher-performing embedding models. Also, it should be noted that Ruri does not use code datasets, so it cannot be applied to code search. To support code search in a Japanese-specific model, it may be necessary to develop a bilingual model with both Japanese and English vocabularies, given that program code is primarily written in English.
Context Length
The context length is short. While recent large language models can process sequences as long as 32k tokens or longer, most embedding models can only handle around 512 tokens. There are long-context embedding models, such as Jina BERT Günther et al. (2024), which use ALiBi Press et al. (2022), and it is important to develop robust text embedding models for longer sequences. In particular, Japanese BERT does not incorporate architectural advancements used in recent LLMs, such as RoPE Su et al. (2024) and SwiGLU Shazeer (2020). By backporting these developments from LLM research, it may be possible to create models with longer context lengths and higher performance.
Evaluation
The bias in both training and evaluation datasets is also a concern. There are very few evaluation datasets based on web corpora for assessing Japanese text embeddings. Although solving this issue is complicated by licensing and copyright constraints, it is a challenge that must be addressed to evaluate the model’s broad applicability.
Research on Japanese text embeddings is still in its early stages. We hope this report contributes to further progress in this field.
References
- Bajaj et al. (2018) Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268.
- BehnamGhader et al. (2024) Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. In First Conference on Language Modeling (COLM).
- Bonifacio et al. (2021) Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2021. mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset. arXiv:2108.13897.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642.
- Clavié (2024) Benjamin Clavié. 2024. JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources. arXiv:2407.20750.
- Cormack et al. (2009) Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR).
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 4171–4186.
- Faruqui et al. (2018) Manaal Faruqui, Ellie Pavlick, Ian Tenney, and Dipanjan Das. 2018. WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 305–315.
- Fujii et al. (2024) Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, and Naoaki Okazaki. 2024. Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities. In Proceedings of the First Conference on Language Modeling (COLM), COLM.
- Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6894–6910.
- Günther et al. (2024) Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv:2310.19923.
- Kurihara et al. (2022) Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata. 2022. JGLUE: Japanese General Language Understanding Evaluation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC), pages 2957–2966.
- Lee et al. (2024a) Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024a. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv:2405.17428.
- Lee et al. (2024b) Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, and Iftekhar Naim. 2024b. Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv:2403.20327.
- Li et al. (2024a) Shengzhe Li, Masaya Ohagi, and Ryokan Ri. 2024a. JMTEB: Japanese Massive Text Embedding Benchmark. https://huggingface.co/datasets/sbintuitions/JMTEB. [Accessed 31-08-2024].
- Li et al. (2023) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281.
- Li et al. (2024b) Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2024b. Multilingual E5 Text Embeddings: A Technical Report. arXiv:2402.05672.
- Liu et al. (2023) Zheng Liu, Shitao Xiao, Yingxia Shao, and Zhao Cao. 2023. RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 2635–2648.
- Longpre et al. (2021) Shayne Longpre, Yi Lu, and Joachim Daiber. 2021. MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering. Transactions of the Association for Computational Linguistics (TACL), pages 1389–1406.
- Muennighoff et al. (2023) Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 2014–2037.
- Nussbaum et al. (2024) Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv:2402.01613.
- Nvidia et al. (2024) Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, and Chen Zhu. 2024. Nemotron-4 340B Technical Report. arXiv:2406.11704.
- Okazaki et al. (2024) Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, and Sakae Mizuki. 2024. Building a Large Japanese Web Corpus for Large Language Models. In Proceedings of the First Conference on Language Modeling (COLM), COLM.
- Press et al. (2022) Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. arXiv:2108.12409.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
- Sato et al. (2024) Soma Sato, Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda. 2024. Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Volume 4: Student Research Workshop (ACL SRW), pages 519–530.
- Shazeer (2020) Noam Shazeer. 2020. GLU Variants Improve Transformer. arXiv:2002.05202.
- So et al. (2022) ByungHoon So, Kyuhong Byun, Kyungwon Kang, and Seongjin Cho. 2022. JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension. arXiv:2202.01764.
- Su et al. (2023) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. One Embedder, Any Task: Instruction-Finetuned Text Embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1102–1121.
- Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing, 568:127063.
- Tateno (2024a) Yuichi Tateno. 2024a. JaCWIR: Japanese Casual Web IR - a small-scale, casual dataset of web page titles and summaries for Japanese information retrieval evaluation.
- Tateno (2024b) Yuichi Tateno. 2024b. JQaRA: Japanese Question Answering with Retrieval Augmentation - a Japanese Q&A dataset for evaluating retrieval augmentation (RAG).
- Tsukagoshi et al. (2023) Hayato Tsukagoshi, Ryohei Sasano, and Koichi Takeda. 2023. Japanese SimCSE Technical Report. arXiv:2310.19349.
- Wang et al. (2022) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533.
- Wang et al. (2023) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2023. SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 2244–2258.
- Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 11897–11916.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 1112–1122.
- Xiao et al. (2022) Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 538–548.
- Xiao et al. (2024) Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2024. C-Pack: Packaged Resources To Advance General Chinese Embedding. In The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
- Yanaka and Mineshima (2021) Hitomi Yanaka and Koji Mineshima. 2021. Assessing the Generalization Capacity of Pre-trained Language Models through Japanese Adversarial Natural Language Inference. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), pages 337–349.
- Zhang et al. (2021a) Dejiao Zhang, Shang-Wen Li, Wei Xiao, Henghui Zhu, Ramesh Nallapati, Andrew O. Arnold, and Bing Xiang. 2021a. Pairwise Supervised Contrastive Learning of Sentence Representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5786–5798.
- Zhang et al. (2023) Junlei Zhang, Zhenzhong Lan, and Junxian He. 2023. Contrastive Learning of Sentence Embeddings from Scratch. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), pages 3916–3932.
- Zhang et al. (2021b) Xinyu Zhang, Xueguang Ma, Peng Shi, and Jimmy Lin. 2021b. Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval. In Proceedings of the 1st Workshop on Multilingual Representation Learning (MRL), pages 127–137.
- Zhang et al. (2022) Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. 2022. Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages. arXiv:2210.09984.
- Yoshikoshi et al. (2020) Takumi Yoshikoshi, Daisuke Kawahara, and Sadao Kurohashi. 2020. Multilingualization of a natural language inference dataset using machine translation (in Japanese). In Proceedings of the 244th Meeting of the IPSJ Special Interest Group on Natural Language Processing (SIG-NL).
- Suzuki et al. (2020) Masatoshi Suzuki, Jun Suzuki, Koji Matsuda, Kyosuke Nishida, and Naoya Inoue. 2020. JAQKET: Construction of a Japanese QA dataset based on quizzes (in Japanese). In Proceedings of the 26th Annual Meeting of the Association for Natural Language Processing (NLP2020).
Appendix A Hyperparameters
Table 14 shows the hyperparameters and settings used during the training of the embedding model, and Table 15 shows the hyperparameters and settings used during the training of the reranker.
Phase | Pre-training | Pre-training | Pre-training | Fine-tuning | Fine-tuning | Fine-tuning
---|---|---|---|---|---|---
Model | Small | Base | Large | Small | Base | Large
learning rate | 1 | 5 | 3 | 1 | 5 | 3 |
max length | 256 | 256 | 192 | 512 | 512 | 512 |
warmup ratio | 10% | 10% | 10% | 10% | 10% | 10% |
batch size | 8192 | 8192 | 8192 | 512 | 512 | 512 |
epochs | 1 | 1 | 1 | 1 | 1 | 1 |
0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | |
weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
hard negatives | 1 | 1 | 1 | 15 | 15 | 15 |
task-homogeneous | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
shuffle positive | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
knowledge distillation | | | | ✓ | ✓ | ✓ |
Phase | Stage1 | Stage1 | Stage1 | Stage2 | Stage2 | Stage2
---|---|---|---|---|---|---
Model | Small | Base | Large | Small | Base | Large |
learning rate | 1 | 5 | 3 | 1 | 5 | 3 |
max length | 256 | 256 | 256 | 512 | 512 | 512 |
warmup ratio | 10% | 10% | 10% | 10% | 10% | 10% |
batch size | 512 | 512 | 512 | 64 | 64 | 64 |
epochs | 1 | 1 | 1 | 1 | 1 | 1 |
weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
hard negatives | 63 | 63 | 63 | 63 | 63 | 63 |
task-homogeneous | ||||||
shuffle positive | ✓ | ✓ | ✓ | |||