MiLMo: Minority Multilingual Pre-trained Language Model
Supported by the National Natural Science Foundation of China (No. 61972436).
Abstract
Pre-trained language models are trained on large-scale unsupervised data and can be fine-tuned on small-scale labeled datasets to achieve good results on downstream tasks. Multilingual pre-trained language models are trained on multiple languages and can understand several languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little work on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for them. This paper therefore constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To alleviate the scarcity of minority language datasets and verify the effectiveness of MiLMo, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.
Index Terms:
Multilingual, Pre-trained language model, Datasets, Word2vec
I Introduction
With the development of deep learning, various neural networks are widely used in downstream natural language processing tasks and achieve good performance. These downstream tasks usually rely on large-scale labeled training datasets, but labeling data often requires significant human and material resources. The emergence of pre-trained language models [1, 2, 3, 4, 5] has solved this problem well. A pre-trained model is trained on large-scale unsupervised data to obtain a general model. In downstream tasks, the model can achieve good performance by fine-tuning only on small-scale labeled data, which is crucial for research on low-resource languages.
BERT [1] is the most influential of the various pre-trained language models and has achieved the best results on a variety of downstream tasks. However, BERT still has some problems, and a large number of BERT-variant pre-trained language models [3, 13, 14] have emerged to solve them. This research mainly focuses on resource-rich languages such as English, and there is still little work on low-resource languages. Moreover, these pre-trained models are trained on a single language, and no knowledge is shared between models of different languages. To solve these problems, multilingual pre-trained language models [15, 16, 17, 34] have come into being; they can process multiple languages at the same time. An existing multilingual pre-trained model is trained on an unlabeled multilingual corpus, projects multiple languages into the same semantic space, and has the ability of cross-lingual transfer, which enables zero-shot learning.
At present, pre-trained language models are well developed for resource-rich languages. However, for minority languages, corpus resources are difficult to obtain and related studies are relatively few, so the various publicly available multilingual pre-trained models do not work well on minority languages, which seriously affects the informatization of these languages. Moreover, the existing multilingual pre-trained models cover only a few minority languages. Although cross-lingual transfer can be applied to minority languages, the effect is not ideal. For example, the F1 of mBERT [15] on the Tibetan News Classification Corpus (TNCC) [24] is 5.5 [32], and the F1 of XLM-R-base on TNCC is 21.1 [32]. To further promote natural language processing for minority languages, this paper collects and organizes relevant documents from the Internet, relevant books, documents of the National People's Congress (NPC) and the Chinese People's Political Consultative Conference (CPPCC), and government work reports, and trains a multilingual pre-trained model named MiLMo on these data. The main contributions of this paper are as follows:
•
This paper constructs a pre-trained model, MiLMo, covering five minority languages (Mongolian, Tibetan, Uyghur, Kazakh and Korean) to provide support for various downstream tasks in minority languages.
•
This paper trains word2vec representations for the five languages. By comparing the word2vec representations and the pre-trained model on the downstream task of text classification, this paper provides the best scheme for downstream research on minority languages. The experimental results show that the MiLMo model outperforms the word2vec representations.
•
To solve the problem of scarce minority language datasets, this paper constructs a classification dataset, MiTC, covering the five languages, and publishes the word2vec representations, the multilingual pre-trained model MiLMo and the multilingual classification dataset MiTC at http://milmo.cmli-nlp.com/.
II Related Work
Word representation converts the words of natural language into vectors, which is of great significance for natural language processing tasks. Early word vectors can capture the semantics of words, but they are context-independent and cannot solve the problem of polysemy [6, 7, 8]. To solve this problem, researchers have studied context-sensitive word representation. ELMo [9] is the first model to apply context-sensitive word representation successfully. It uses a language model to learn word vector representations on a large-scale unsupervised corpus; the representations from the corresponding network layers are then extracted from the pre-trained network and added to the downstream task as new features, so that word vectors can be obtained from the contextual information of the data. However, this model simply concatenates two unidirectional LSTMs, so its ability to extract and fuse features is weak.
In 2017, Google proposes the Transformer block [10] for the machine translation task, a new encoder-decoder architecture that uses the attention mechanism to encode each position and can be parallelized. The Transformer solves the problem that recurrent networks must store and process the whole sequence step by step and cannot be parallelized. On this basis, OpenAI proposes GPT, which uses a 12-layer Transformer. GPT is first pre-trained on an unlabeled dataset to obtain a language model, and then handles various downstream tasks by fine-tuning. GPT-1 [4] achieves good results after fine-tuning, but does not work well on tasks without fine-tuning. To train a word vector model with stronger generalization ability, GPT-2 [11] uses more network parameters and larger datasets. GPT-2 treats supervised tasks as sub-tasks of the language model and verifies that models trained with massive data and parameters can be transferred directly to other tasks without fine-tuning on labeled data. To further improve unsupervised learning, GPT-3 [12] increases the training data and parameters again and achieves better results on a variety of downstream tasks.
However, GPT is a unidirectional language model. BERT proposes to pre-train with MLM to generate deep bidirectional language representations. After BERT, various BERT variants have emerged, such as ALBERT [13], SpanBERT [14] and RoBERTa [3], and achieved better results. However, these studies are mainly monolingual and focus on resource-rich languages such as English; there is still little research on low-resource languages. To solve this problem, multilingual pre-trained models [15, 16, 17, 18] begin to emerge. Facebook AI Research proposes XLM [16], which uses Byte Pair Encoding (BPE) to preprocess the training data and expands the vocabulary shared between different languages by dividing text into sub-words. XLM proposes three pre-training tasks, CLM, MLM and TLM, which achieve good results on cross-lingual classification tasks. After that, Facebook AI Research puts forward XLM-R, which increases the number of languages in the training data on the basis of XLM and RoBERTa and upsamples low-resource languages during vocabulary construction and training to generate a larger shared vocabulary and improve the model. mBERT [15] selects the 104 largest languages in Wikipedia as training data and is trained with the MLM and NSP tasks; it uses the same model and weights for all target languages, and the shared parameters give it cross-lingual transfer capabilities. The above cross-lingual models have achieved great success on resource-rich languages such as English. However, due to the scarcity of minority language corpora and the complexity of their grammar rules, these multilingual models cannot handle minority languages well. Therefore, research on downstream tasks in minority languages is still at an early stage, which seriously hinders the development of minority-language natural language processing. To solve these problems, the HIT·iFLYTEK Language Cognitive Computing Lab releases the minority language pre-trained model CINO [32], which can understand Tibetan, Mongolian, Uyghur, Kazakh, Korean, Zhuang, Cantonese, Chinese and Chinese dialects. To further promote pre-trained models for minority languages, this paper trains a multilingual pre-trained model for minority languages. The experimental results show that our model can effectively promote research on downstream tasks in these languages.
Table I: Amount of training data for each language
Language | Amount of Data
Mongolian | 788 MB
Tibetan | 1.5 GB
Uyghur | 397 MB
Kazakh | 620 MB
Korean | 994 MB
Table II: Word segmentation examples for Mongolian, Tibetan_syll, Tibetan_word, Uyghur, Kazakh and Korean (the example segmentations appear as images in the original).

III Model Details
III-A Word2vec’s Data Preprocessing
This paper obtains training data for five minority languages (Mongolian, Tibetan, Uyghur, Kazakh and Korean) from the Internet, relevant books, documents of the NPC and CPPCC sessions, government work reports and other relevant sources. To clean the data, we delete non-text information such as pictures, links and symbols, and discard articles whose text length is less than 20. The final data statistics are shown in Table I.
Before training the model, we need to segment the data. All five minority languages use alphabetic scripts. Mongolian is written from top to bottom and left to right, and Mongolian words are separated by spaces [20], so this paper segments Mongolian directly at the spaces. The smallest unit of a Tibetan word is the syllable, and a syllable contains one to seven characters; syllables carry rich semantic information, so this paper segments Tibetan sentences at both the syllable level and the word level [33]. The morphological structure of Uyghur words is complex. Modern Uyghur has 32 letters, each word is spelled from letters, and grammatical functions are realized by attaching different affixes to the end of a word, so the same root can evolve into different word forms without large differences in meaning. Uyghur is written from right to left, words are separated by spaces, and each Uyghur word can be used as a feature item [22]; this paper therefore segments Uyghur at the spaces. In Kazakh, words are also separated by spaces [21]. In Korean, spaces cannot be used directly as word boundaries; the morpheme is the smallest linguistic unit with semantics, so sentences need to be segmented into morphemes. This paper uses the Korean processing toolkit KoNLPy [19] proposed by Park et al. to segment the Korean corpus, and the resulting morphemes are used as input features. Segmentation examples for the five languages are shown in Table II. This paper uses the skip-gram model to train 300-dimensional word representations.
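As a concrete illustration of this setup, the following is a minimal sketch of how such 300-dimensional skip-gram vectors could be trained on a pre-segmented corpus. The gensim toolkit, the file names and the window/min_count settings are assumptions for illustration only; the paper does not name its word2vec implementation or these hyperparameters.

```python
# Illustrative sketch: training 300-dimensional skip-gram vectors for one language.
# gensim and the file paths are assumptions, not the paper's actual tooling.
from gensim.models import Word2Vec

# One pre-segmented sentence per line, tokens separated by spaces
# (spaces for Mongolian/Uyghur/Kazakh, syllables or words for Tibetan,
#  KoNLPy morphemes for Korean).
with open("mongolian_segmented.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(
    sentences,
    vector_size=300,   # 300-dimensional vectors, as in the paper
    sg=1,              # skip-gram
    window=5,          # context window size (assumed; not reported in the paper)
    min_count=5,       # drop rare tokens (assumed)
    workers=4,
)
model.wv.save_word2vec_format("mongolian_300d.vec")
```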
III-B Pre-trained Model MiLMo
III-B1 MiLMo’s Design
XLM is a cross-lingual pre-trained model proposed by Facebook AI Research. It can be trained on multiple languages, allowing the model to learn more cross-lingual information, so that information learned from other languages can be applied to low-resource languages. XLM proposes three pre-training tasks: Causal Language Modeling (CLM), Masked Language Modeling (MLM) and Translation Language Modeling (TLM). CLM is a Transformer language model that predicts the probability of the next word given a sentence. MLM is a masking task in which the model masks tokens in the input sentence with a certain probability and then predicts the masked tokens. TLM is an extension of MLM that concatenates parallel sentences from different languages as the model input, masks some of the tokens with a certain probability, and then predicts them. CLM and MLM only need unsupervised training on monolingual data, while TLM requires supervised learning on a parallel corpus. This paper uses the MLM task. The input of the model is a sequence of 256 tokens. During training, tokens are masked with a probability of 15%; for each masked token, 80% are replaced by [mask], 10% are replaced by a token randomly selected from the vocabulary, and 10% remain unchanged. The model parameters are shown in Table III. The MiLMo model is trained with 12 Transformer layers.
Table III: MiLMo model parameters
Parameter | Value
emb_dim | 2048
n_layers | 12
n_heads | 8
dropout | 0.1
n_langs | 5
max_len | 256
vocab_size | 70,000
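As an illustration of the masking strategy described in Section III-B1 (15% of tokens selected; of these, 80% replaced by [mask], 10% by a random token, 10% left unchanged), the following is a minimal standalone sketch. The token ids and the ignore-index convention are assumptions for illustration and are not taken from the XLM codebase.

```python
import random

MASK_ID = 4          # placeholder id for [mask]; the actual id depends on the vocabulary
VOCAB_SIZE = 70000   # size of the shared BPE vocabulary

def mask_tokens(token_ids, mask_prob=0.15):
    """MLM masking: select ~15% of positions; of those, 80% become [mask],
    10% become a random token, 10% stay unchanged."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)          # -100 = position ignored by the loss (assumed convention)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID               # 80%: replace with [mask]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # else 10%: keep the original token
    return inputs, labels
```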
III-B2 Shared sub-word vocabulary
BPE [23] is a data compression algorithm. By iteratively merging high-frequency character pairs, it generates variable-length sub-words within a fixed-size vocabulary. The process of building a vocabulary is as follows (a minimal sketch follows the list):
•
Divide the words in a sentence into individual characters and use all the characters to build the initial vocabulary.
•
Count the frequencies of adjacent sub-word pairs within words in the corpus.
•
Select the sub-word pair with the highest frequency, merge it into a new sub-word, and add the new sub-word to the sub-word vocabulary.
•
Delete sub-words that no longer exist in the corpus from the vocabulary.
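The merge loop above can be sketched as follows. This is a simplified, from-scratch illustration of the algorithm (it returns the learned merges and omits the pruning step), not the BPE tooling actually used in this paper.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Simplified BPE over a {word: frequency} dict: each word is split into
    characters, then the most frequent adjacent pair is merged repeatedly."""
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                  # frequency of adjacent sub-word pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)               # highest-frequency pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):                    # merge every occurrence of `best`
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy usage (Latin letters only for readability):
print(learn_bpe({"lower": 2, "lowest": 5, "low": 7}, num_merges=5))
```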
Table IV: Number of samples per language in MiTC and WCM
Language | MiTC | WCM
Mongolian | 1,747 | 2,973
Tibetan | 4,926 | 1,110
Uyghur | 1,304 | 300
Kazakh | 35,826 | 6,258
Korean | 38,859 | 6,558
Total | 82,662 | 17,199
Table V: F1 of word2vec-based classification models on MiTC
Model | Mongolian | Tibetan_syll | Tibetan_word | Uyghur | Kazakh | Korean
TextCNN | 35.12% | 39.60% | 32.85% | 34.59% | 27.52% | 53.17%
TextRNN | 28.77% | 26.26% | 31.48% | 42.19% | 17.95% | 43.22%
TextRNN_Att | 22.17% | 24.04% | 17.20% | 28.93% | 20.21% | 38.88%
TextRCNN | 27.81% | 28.92% | 26.37% | 38.23% | 24.40% | 43.41%
FastText | 17.32% | 11.23% | 16.02% | 26.34% | 11.09% | 19.42%
DPCNN | 49.15% | 34.79% | 34.51% | 32.15% | 30.13% | 52.85%
Transformer | 24.13% | 26.53% | 18.67% | 34.90% | 10.88% | 33.63%
Table VI: F1 of the best word2vec result and MiLMo-base on MiTC
Model | Mongolian | Tibetan | Uyghur | Kazakh | Korean
word2vec_best | 49.15% | 34.51% | 38.23% | 30.13% | 53.17%
MiLMo-base | 81.48% | 76.44% | 74.13% | 71.34% | 85.98%
From the vocabulary-building process, BPE is well suited to alphabetic languages whose morphology involves prefixes and suffixes. The five languages in this paper are all written in alphabetic scripts and are agglutinative: grammatical functions are realized by attaching different affixes to the front, middle or end of a root. Therefore, BPE can be used to preprocess the five minority languages and build a shared vocabulary, which effectively improves segmentation efficiency and keeps the vocabulary compact. Before training the model, this paper splits the corpus into training, validation and test sets with a ratio of 8:1:1, and then uses BPE to preprocess the training corpus.
To cover the corpus as fully as possible, this paper combines the training sets of all five languages and uses BPE to build the vocabulary. The final vocabulary contains 70,000 sub-words and covers 99.95% of the training sets. The trained BPE model and the constructed vocabulary are then used to preprocess the training data of the five minority languages.
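As an illustration, a comparable 70,000-sub-word shared vocabulary could be built over the combined training sets with the HuggingFace tokenizers library. The library choice, file names and special tokens below are assumptions and do not reflect the exact tooling used in this paper.

```python
# Illustrative alternative to the paper's BPE pipeline (tooling and paths assumed).
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()   # words are already space-separated after segmentation

trainer = BpeTrainer(
    vocab_size=70_000,                   # shared vocabulary size used in the paper
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)

# Combined training sets of the five languages (hypothetical file names).
files = [f"{lang}_train.txt" for lang in ("mongolian", "tibetan", "uyghur", "kazakh", "korean")]
tokenizer.train(files, trainer)
tokenizer.save("milmo_shared_bpe.json")

# Apply the shared vocabulary to a segmented sentence before feeding it to the model.
ids = tokenizer.encode("a segmented sentence").ids
```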
IV Experiments
IV-A Dataset
Due to the difficulty of acquiring minority language data and the complexity of their grammar rules, the only publicly available minority language dataset at present is Wiki-Chinese-Minority (WCM) [32], a minority language classification dataset from Harbin Institute of Technology. It is built from the Wikipedia corpora of minority languages and their classification labels, covering Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh and Chinese, with ten categories: art, geography, history, nature, natural science, people, technology, education, economy and health. The total number of WCM samples in the five languages considered here (Mongolian, Tibetan, Uyghur, Kazakh and Korean) is 17,199. While training the minority language pre-trained model, this paper also constructs a multilingual text classification dataset, MiTC, which contains 82,662 samples. The amount of data per language in WCM and MiTC is shown in Table IV.
From Table IV, we can see that MiTC is rich in all five languages, and the numbers of samples in Tibetan, Uyghur, Kazakh and Korean are much larger than in WCM. After analyzing WCM, we find that the categories within a language are not balanced. For example, the Tibetan portion contains eight categories, of which the "education" category accounts for only a small proportion; the Uyghur portion contains six categories with only 300 samples in total, of which 256 belong to the "geography" category. Such an unbalanced distribution leads to low model accuracy and makes it hard to evaluate model performance. To better evaluate the models, this paper balances the MiTC dataset so that the amount of data in each category is relatively balanced within each language. The category distribution of each language in MiTC is shown in Figure I.
IV-B Classification based on Word2vec
This paper trains word2vec representations for the five minority languages. Word2vec alleviates the problems of the "curse of dimensionality" and sparse document vectors by transforming each word into a low-dimensional real-valued vector based on the contextual semantic information of the documents: the more similar the meanings of two words, the closer they are in the word vector space. This paper uses skip-gram to train 300-dimensional word representations and uses them for the text classification task. We use TextCNN, TextRNN, TextRNN_Att, TextRCNN, FastText, DPCNN and Transformer models for classification experiments on MiTC, where TextRNN_Att is a bidirectional LSTM with an attention mechanism. The F1 scores are shown in Table V.
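As a concrete illustration of this pipeline, the following is a minimal PyTorch sketch of a TextCNN classifier whose embedding layer is initialized with the pre-trained word2vec vectors. The filter sizes and filter counts are assumptions, since the paper does not report these hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """TextCNN classifier over pre-trained 300-d word2vec embeddings.
    Filter sizes/counts are illustrative; the paper does not report them."""
    def __init__(self, embedding_matrix, num_classes, filter_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape          # e.g. (V, 300), a numpy array
        self.embed = nn.Embedding.from_pretrained(
            torch.as_tensor(embedding_matrix, dtype=torch.float), freeze=False)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                             # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)             # (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]  # max-over-time pooling
        return self.fc(torch.cat(pooled, dim=1))              # (batch, num_classes) logits
```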
Table VII: F1 of word2vec-based models, CINO-base-v2 and MiLMo-base on WCM
Model | Mongolian | Tibetan_syll | Tibetan_word | Uyghur | Kazakh | Korean
TextCNN | 55.52% | 30.63% | 43.01% | 69.44% | 42.08% | 30.44%
TextRNN | 55.99% | 17.65% | 37.61% | 69.44% | 39.02% | 18.01%
TextRNN_Att | 31.85% | 25.62% | 26.37% | 69.44% | 25.18% | 7.41%
TextRCNN | 55.53% | 19.07% | 46.98% | 69.44% | 32.35% | 17.53%
FastText | 31.85% | 17.34% | 27.23% | 69.44% | 10.56% | 16.77%
DPCNN | 56.69% | 30.01% | 60.98% | 67.92% | 50.81% | 23.69%
Transformer | 31.85% | 25.24% | 45.35% | 69.44% | 15.46% | 11.70%
CINO-base-v2 | 74.44% | - | 75.04% | 69.44% | 72.82% | 73.08%
MiLMo-base | 91.62% | - | 88.15% | 92.81% | 82.05% | 73.34%

From Table V, we can see that DPCNN achieves the best F1 of 49.15% on the Mongolian dataset. On the Tibetan syllable-level data, TextCNN achieves the best F1 of 39.60%, and on the Tibetan word-level data, DPCNN reaches the best F1 of 34.51%. On the Uyghur data, TextRNN achieves the best F1 of 42.19%. On Kazakh, DPCNN achieves the best F1 of 30.13%, and on Korean, TextCNN achieves the best F1 of 53.17%. Overall, DPCNN performs best across the datasets, while Transformer performs poorly; the main reason is that the complex network structure of the Transformer needs more training data, while the classification datasets constructed in this paper are relatively small.
IV-C Classification based on MiLMo
In this paper, we use the trained XLM-style model for the downstream text classification experiment. We first use the MiLMo model to encode the classification text and obtain the representation vector $h$, which contains the shared multilingual information and the text information. A "linear + softmax" structure is then used as the text classification layer: the representation $h$ from the encoding layer is fed to the linear layer to obtain $z = Wh + b$, and the softmax function computes the probability of each text category, $p = \mathrm{softmax}(z)$. Cross entropy is then used as the classification loss, $L = -\sum_i y_i \log p_i$, where $y$ is the true label and $p$ is the prediction. This paper trains the MiLMo model and uses it to conduct experiments on the multilingual classification datasets: the data are preprocessed with BPE, encoded with MiLMo, and classified with the "linear + softmax" structure. The experimental results are shown in Table VI, where word2vec_best is the best classification result of the word2vec models for each of the five minority languages.
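The classification head described above can be sketched as follows. This is an illustrative PyTorch version of the "linear + softmax" layer and cross-entropy loss, not the exact training code of this paper; the dimensions and labels in the usage example are placeholders.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """'linear + softmax' classification layer on top of the MiLMo text representation h."""
    def __init__(self, hidden_dim, num_classes):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, h):            # h: (batch, hidden_dim) sentence representation
        return self.linear(h)        # z = Wh + b

# Illustrative usage with random inputs (dimensions and labels are placeholders):
head = ClassificationHead(hidden_dim=2048, num_classes=10)
h = torch.randn(4, 2048)             # encoder output for a batch of 4 texts
z = head(h)                          # logits
p = torch.softmax(z, dim=-1)         # p = softmax(z), the category probabilities
y = torch.tensor([0, 3, 7, 1])       # true labels
# Cross entropy L = -sum_i y_i log p_i; CrossEntropyLoss applies softmax internally,
# so it takes the logits z rather than the probabilities p.
loss = nn.CrossEntropyLoss()(z, y)
```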
From Table VI, we can see that classification based on the pre-trained model reaches its highest F1 of 85.98% in Korean and its lowest of 71.34% in Kazakh, and in all five languages the classification performance is higher than that of word2vec.
To further explore the effectiveness of the proposed multilingual pre-trained model, this paper extracts the Mongolian, Tibetan, Uyghur, Kazakh and Korean portions of WCM for text classification experiments and compares CINO-base-v2, word2vec and the MiLMo model. CINO-base-v2 is the multilingual pre-trained model for minority languages released by the HIT·iFLYTEK Language Cognitive Computing Lab. The final experimental results are shown in Table VII and Figure II.
From Table VII and Figure II, we can see that MiLMo achieves the best results on all five datasets. The classification performance of CINO-base-v2 on Tibetan, Korean, Mongolian and Kazakh is higher than that of word2vec, but lower than that of MiLMo. On Uyghur, due to the unbalanced distribution of WCM, most articles belong to the "geography" category; the classification F1 of CINO-base-v2 and word2vec is the same, 69.44%, while the classification F1 of MiLMo is 92.81%, which shows that our model can still achieve good results on small-scale datasets. At present, the MiLMo architecture trained in this paper contains only 12 Transformer layers. In future work, we will release a MiLMo-large model with 24 Transformer layers, which should further improve performance on downstream tasks.
V Conclusion
Multilingual pre-trained models provide support for many resource-rich languages. However, due to the rich morphology of minority languages, their different grammatical rules and the difficulty of data acquisition, natural language processing for minority languages is still at an initial stage. To address these problems, this paper takes Mongolian, Tibetan, Uyghur, Kazakh and Korean as examples. We obtain data from relevant books, documents of the NPC and CPPCC sessions and government work reports, construct a multilingual dataset after data cleaning, and build a multilingual pre-trained model, MiLMo, for minority languages. To verify the effectiveness of the pre-trained model, this paper also trains word2vec on the five minority languages and uses both the word2vec representations and the pre-trained model in text classification experiments. The experimental results show that the pre-trained model outperforms word2vec on all five languages.
References
- [1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019: 4171-4186.
- [2] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
- [3] Liu Y, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach[J]. arXiv preprint arXiv:1907.11692, 2019.
- [4] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[J]. 2018.
- [5] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding[J]. Advances in Neural Information Processing Systems, 2019, 32.
- [6] Mikolov T, Grave E, Bojanowski P, et al. 2018. Advances in Pre-Training Distributed Word Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
- [7] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[J]. Advances in neural information processing systems, 2013, 26.
- [8] Pennington J, Socher R, Manning C D. Glove: Global vectors for word representation[C]//Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014: 1532-1543.
- [9] Matthew E. Peters, Mark Neumann, Mohit Iyyer, et al. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.
- [10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
- [11] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI blog, 2019, 1(8): 9.
- [12] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in neural information processing systems, 2020, 33: 1877-1901.
- [13] Lan Z, Chen M, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations[C]//International Conference on Learning Representations, 2019.
- [14] Joshi M, Chen D, Liu Y, et al. SpanBERT: Improving pre-training by representing and predicting spans[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 64-77.
- [15] Pires T, Schlinger E, Garrette D. How Multilingual is Multilingual BERT?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4996–5001.
- [16] Conneau A, Lample G. Cross-lingual language model pretraining[J]. Advances in neural information processing systems, 2019, 32.
- [17] Conneau A, Khandelwal K, Goyal N, et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.
- [18] Ouyang X, Wang S, Pang C, et al. 2021. ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 27–38.
- [19] Park E L, Cho S. KoNLPy: Korean natural language processing in Python[C]//Annual Conference on Human and Language Technology. Human and Language Technology, 2014: 133-136.
- [20] Jian-dong Z, Guang-lai G A O, Fei-long B A O. Research on History-based Mongolian Automatic POS Tagging[J]. Journal of Chinese Information Processing, 2013, 27(5).
- [21] Alimjan A, Jumahun H, Sun T, et al. An approach based on SV-NN for Kazakh language text classification. Journal of Northeast Normal University(Natural Science Edition). 2018, pp.58-65.
- [22] Alimjan A, Turgun I, Hasan O, et al. Machine learning based Uyghur language text categorization[J]. Computer Engineering and Applications, 2012, 48(5): 110-112.
- [23] Sennrich R, Haddow B, Birch A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.
- [24] Qun N, Li X, Qiu X, et al. End-to-end neural text classification for tibetan[C]//Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data: 16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, Nanjing, China, October 13-15, 2017, Proceedings 5. Springer International Publishing, 2017: 472-480.
- [25] Park J, Kim M, Oh Y, et al. An empirical study of topic classification for Korean newspaper headlines[J]. Hum. Lang. Technol, 2021: 287-292.
- [26] Chen Y. Convolutional neural network for sentence classification[D]. University of Waterloo, 2015.
- [27] Liu P, Qiu X, Huang X. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI’16). AAAI Press, 2873–2879.
- [28] Zhou P, Shi W, Tian J, et al. Attention-based bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 54th annual meeting of the association for computational linguistics (volume 2: Short papers). 2016: 207-212.
- [29] Lai S, Xu L, Liu K, et al. Recurrent convolutional neural networks for text classification[C]//Proceedings of the AAAI conference on artificial intelligence. 2015, 29(1).
- [30] Joulin A, Grave E, Bojanowski P, et al. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431.
- [31] Johnson R, Zhang T. Deep pyramid convolutional neural networks for text categorization[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017: 562-570.
- [32] Yang Z, Xu Z, Cui Y, et al. 2022. CINO: A Chinese Minority Pre-trained Language Model. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3937–3949.
- [33] Long C, Liu H, Nuo M, et al. 2015. Tibetan POS Tagging Based on Syllable Tagging. In Journal of Chinese Information Processing. pp.211-216.
- [34] Xue L, Constant N, Roberts A, et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.