Transformer-based Korean Pretrained Language Models: A Survey on Three Years of Progress
Abstract
With the advent of the Transformer, introduced for machine translation in 2017, attention-based architectures began to attract widespread attention. After the emergence of BERT, which builds on the Transformer's encoder to strengthen natural language understanding (NLU), and GPT, which builds on its decoder to strengthen natural language generation (NLG), a variety of methodologies, datasets, and models for training Pretrained Language Models (PLMs) appeared. In particular, over the past three years, numerous PLMs specialized for Korean have been released. In this paper, we quantitatively and qualitatively compare and analyze the Korean PLMs that have been released to the public.
Index Terms:
Computational Linguistics, Natural Language Processing, Machine Learning, AI
1 Introduction
The hottest keywords in Natural Language Processing, and in Machine Learning more broadly, over the last three years have been the Transformer[1]-based BERT[2] and GPT[3] models built on the attention mechanism. The Transformer was originally proposed as a Neural Machine Translation (NMT) model to address the gradient bottleneck that arises when training RNNs[4] for translation. Subsequently, BERT, which keeps only the part of the Transformer responsible for NLU (the encoder), and GPT, which keeps only the part responsible for NLG (the decoder), appeared, and with these two models came a wave of new models, algorithms, and data pre-processing methods. In addition, with the appearance of "Transformers"[5], the pretrained-model sharing platform created by huggingface (https://huggingface.co/), the NLP/AI field achieved unprecedented growth in both academia and industry. Most recently, large-scale models such as GPT-3[6], which scales up the parameters and data of GPT by a factor of hundreds or more, and models that extend (ViT[7]) or mix (DALL-E[8]) modalities have appeared, seemingly bringing us a little closer to AGI. Meanwhile, building on the huggingface platform and the Transformer family of models, active research and development of models specialized for the Korean domain has been carried out by companies, universities, and individuals. Accordingly, we conduct a comprehensive survey that brings together the work of individual researchers/developers and Korean companies such as Naver (https://www.navercorp.com), Kakao (https://www.kakaocorp.com), and SKT (https://www.sktelecom.com). The contributions of this paper are as follows.
- Introduction and summary of the types of Korean models that have been released so far.
- Introduction and organization of the Korean benchmark datasets that have been released so far.
- A comprehensive score analysis of the published models.

2 Related Works
2.1 Neural Machine Translation
The most popular framework for NMT is the encoder-decoder model[9, 10, 11, 12, 1]. Adopting an attention module greatly improved the performance of encoder-decoder models by using a context vector instead of a fixed-length vector[11, 12]. By exploiting multiple attention heads, the Transformer has become the de facto standard model in NMT[1, 13, 14].
2.2 Pretraining with Unsupervised Feature-based Approaches
Recently, several mainstream pretraining approaches have emerged. OpenAI GPT[3] uses the decoder of the Transformer architecture with a next-token prediction (auto-regressive) objective. On the other side, BERT[2] uses the encoder submodule of the Transformer, pretrained with Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. RoBERTa[15] is similar to BERT except that it is trained without the NSP objective and uses dynamic rather than static masking during pretraining. ELECTRA[16] combines an MLM objective with an adversarial objective in the spirit of GAN[17] training, and unlike a GAN only the discriminator is used in fine-tuning. BART[18] uses both the encoder and decoder (i.e., the full Transformer architecture) with several permutation and deletion objectives. In the fine-tuning step, the task-specific inputs and outputs are simply plugged into each PLM, and all parameters are fine-tuned end-to-end. However, recent research on large-scale PLMs such as GPT-3[6] shows that no fine-tuning step is needed when the model and its training data are large enough to absorb the tasks and information contained in the training data. Since very few large-scale Korean PLMs exist, our survey does not cover these models.
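To make the difference between the MLM and auto-regressive objectives concrete, below is a minimal sketch using the huggingface Transformers API; the multilingual BERT and English GPT-2 checkpoints are generic stand-ins for illustration, not any of the Korean PLMs surveyed here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForCausalLM

text = "한국어 사전학습 언어모델을 공부합니다."  # "We study Korean pretrained language models."

# --- MLM objective (BERT-style): mask one position and predict only that token ---
mlm_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
enc = mlm_tok(text, return_tensors="pt")
labels = torch.full_like(enc["input_ids"], -100)      # -100 = ignored by the loss
labels[0, 2] = enc["input_ids"][0, 2]                 # supervise only the masked position
enc["input_ids"][0, 2] = mlm_tok.mask_token_id        # corrupt that position with [MASK]
print("MLM loss:", mlm(**enc, labels=labels).loss.item())

# --- Auto-regressive objective (GPT-style): predict the next token at every position ---
ar_tok = AutoTokenizer.from_pretrained("gpt2")
ar = AutoModelForCausalLM.from_pretrained("gpt2")
enc = ar_tok(text, return_tensors="pt")
print("AR loss:", ar(**enc, labels=enc["input_ids"]).loss.item())  # labels are shifted internally
```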

2.3 Korean NLP Benchmarks
Various fine-tuning datasets and test sets for measuring the performance of Korean natural language tasks have been released. The NSMC dataset (https://github.com/e9t/nsmc) is a sentiment analysis dataset labeled on Naver movie review comments. Naver and Changwon University released NaverNER (https://github.com/naver/nlp-challenge), a Korean NER dataset, at a competition jointly held in 2018. Kakao Brain released the KorNLI and KorSTS[19] datasets for measuring Korean NLU performance in 2020. In 2019, LG CNS released KorQuAD (https://korquad.github.io/), a Korean SQuAD-style dataset for measuring question answering performance. In 2020, the BEEP![20] dataset for the Korean hate speech classification task was released. Most recently, KLUE[21], the Korean counterpart of the GLUE[22] benchmark, was released; however, we do not report results on it because most models have not yet been evaluated on this benchmark.
3 Korean PLM Architectures
Language models after 2018 can be classified into three major types according to the pretraining method (Fig. 2). (1) The first group is encoder-centric models, which focus on language understanding (NLU) by using objectives such as masking tokens in the input sentence and predicting the masked tokens (MLM). These models are later fine-tuned for tasks such as classification or feature extraction; the BERT family of PLMs is representative. (2) The second group uses an objective that predicts the next token for each input token. Since these models are optimized for auto-regressive inference, they are mainly used for downstream NLG tasks (chat-bots, lyric generation, etc.); GPT-based PLMs fall into this category. (3) The third group consists of models that utilize the entire Transformer architecture, an approach that has recently been explored in many variants; T5[23], BART[18], and MASS[24] are representative. Models trained this way show significant performance improvements not only in NLU and NLG, but also in tasks where the effect of a PLM is otherwise hard to see, such as NMT. In this section, we introduce the tokenizers and parameters of the Korean pretrained models released so far, organized by the three categories above.
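The three categories map directly onto the Auto classes of the huggingface Transformers library. As a rough sketch (the checkpoints below are generic multilingual/English stand-ins rather than specific Korean PLMs):

```python
from transformers import (AutoModelForMaskedLM,    # (1) encoder-centric (BERT family)
                          AutoModelForCausalLM,    # (2) decoder-centric (GPT family)
                          AutoModelForSeq2SeqLM)   # (3) full encoder-decoder (T5/BART/MASS family)

encoder_lm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
decoder_lm = AutoModelForCausalLM.from_pretrained("gpt2")
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
```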
3.1 Encoder-Centric Models

Encoder-centric models focus on extracting features of language. Tasks such as classification, clustering, and tagging can use this type of model as a PLM.
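As a hedged sketch of how an encoder-centric PLM is reused for a downstream classification task, a classification head is stacked on top of the pretrained encoder and the whole network is fine-tuned end-to-end; the checkpoint name below is a placeholder, and any public Korean encoder PLM could be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt = "bert-base-multilingual-cased"   # placeholder; swap in any Korean encoder PLM
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

batch = tok(["정말 재미있어요", "시간 낭비였다"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])           # 1 = positive, 0 = negative review
loss = model(**batch, labels=labels).loss
loss.backward()                         # gradients flow into encoder + classification head
```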
3.1.1 KoBERT
KoBERT (https://github.com/SKTBrain/KoBERT) is the first Korean pretrained model shared on huggingface, released by SKT-Brain. Its configuration is mostly identical to BERT's, but the tokenizer uses SentencePiece (https://github.com/google/sentencepiece) rather than the WordPiece tokenizer used in BERT. For pretraining, 5 million sentences and 54 million words from the Korean Wikipedia were used.
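For illustration, the snippet below shows what SentencePiece subword tokenization looks like in isolation; the model file path is hypothetical and stands in for the trained .model file shipped with KoBERT.

```python
import sentencepiece as spm

# Hypothetical path standing in for the SentencePiece model distributed with KoBERT.
sp = spm.SentencePieceProcessor(model_file="kobert_spiece.model")
pieces = sp.encode("한국어 위키백과로 사전학습했습니다.", out_type=str)
print(pieces)   # e.g. ['▁한국어', '▁위키', '백과', '로', ...]; '▁' marks a word boundary
```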
3.1.2 HanBERT
HanBERT (https://github.com/tbai2019/HanBert-54k-N) is a BERT model trained on about 150GB of Korean corpus (general domain: 70GB, patent domain: 75GB) comprising 700 million sentences. It uses a private tokenizer called the Moran tokenizer, with a vocabulary size of 54,000.
3.1.3 KoELECTRA
KoELECTRA (https://github.com/monologg/KoELECTRA) is an ELECTRA-based language model trained on the Modu Corpus (https://corpus.korean.go.kr/) released by the National Institute of Korean Language (NIKL), the Korean Wikipedia, NamuWiki (a large-scale Korean open-domain encyclopedia), and various news data.
3.1.4 KcBERT
KcBERT[25] is a BERT model pretrained on Korean online news comments, targeting the noisy, colloquial text found in user-generated content.
3.1.5 SoongsilBERT (KcBERT2)
SoongsilBERT (https://github.com/jason9693/Soongsil-BERT) is a language model pretrained on Soongsil University community data and the Modu Corpus, in addition to the news comment data used in KcBERT. Most settings are identical to KcBERT, except that it is based on the RoBERTa architecture and uses a byte-level BPE tokenizer. SoongsilBERT fits community terminology better; conversely, it does not perform as well on non-community domains.
3.1.6 KcELECTRA
KcELECTRA (https://github.com/Beomi/KcELECTRA) is a model trained by adding more data (mainly comments) to the data used for KcBERT. It currently records state-of-the-art results on the NSMC task.
3.1.7 DistilKoBERT
DistilKoBERT (https://github.com/monologg/DistilKoBERT) is a lightweight version of KoBERT, distilled following huggingface's DistilBERT[27]. The teacher model and tokenizer are the same as KoBERT's.
3.1.8 KoBigBird
KoBigBird[28] was released for long-range understanding of the Korean language. It handles input sequences more than 8 times longer than the usual 512-token limit of BERT models.
3.2 Decoder-Centric Models

Decoder-centric models focus on language generation. Tasks that generate language, such as dialogue (often called chatbots) or lyric generation, can use this type of model as a PLM. As shown in Fig. 5, the objective function of decoder-centric models is simple: predict the next token at every position of the sequence. Unfortunately, only a few Korean models of this type have been released, as most Korean PLMs focus on NLU rather than NLG.
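Written out explicitly, the next-token objective amounts to shifting the target sequence by one position and applying cross-entropy at every step; a minimal plain-PyTorch sketch with random tensors:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 8
logits = torch.randn(1, seq_len, vocab_size)        # what a decoder LM would output
tokens = torch.randint(vocab_size, (1, seq_len))    # the input token ids

# Position t predicts token t+1: drop the last logit and the first target.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
print(loss.item())
```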
3.2.1 SKT-AI KoGPT2
KoGPT2 (https://github.com/SKT-AI/KoGPT2) is a GPT2[29]-based PLM released by SKT-AI, the first for Korean natural language generation. The Korean Wikipedia, the Modu Corpus, the Blue House National Petition data (https://github.com/akngs/petitions), and private data such as news were used for training. A character-level BPE tokenizer is used, with additional custom (unused) tokens reserved for downstream-task training.
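A hedged sketch of using KoGPT2 for generation through the Transformers API; the checkpoint identifier is our assumption based on the public SKT-AI release on the huggingface Hub and may differ from the version you use.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "skt/kogpt2-base-v2"   # assumed Hub id of the public KoGPT2 release
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

prompt = "오늘 서울의 날씨는"   # "Today's weather in Seoul is"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_length=32, do_sample=True, top_p=0.9)
print(tok.decode(out[0]))
```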
3.2.2 Large-Scale PLM
As mentioned above, we briefly introduce large-scale Korean LMs here, but we do not evaluate them because of computational limitations.
- HyperCLOVA[30]: HyperCLOVA is the first Korean large-scale PLM. Its parameter count reaches 82B, but the models (i.e., the parameters) have not been published.
- SKT KoGPT-trinity: SKT KoGPT-trinity (which we call SKGPT) is the first publicly released large-scale Korean PLM. It has 1.2B parameters and was trained on the Ko-DATA dataset, an internally refined SKT corpus.
- KakaoBrain KoGPT[31] (https://github.com/kakaobrain/kogpt): Kakao Brain's KoGPT (which we call KakaoGPT to avoid confusion with SKT's KoGPT2) is the largest publicly released Korean PLM, with 6B parameters.

SKGPT and KakaoGPT report downstream-task results obtained by fine-tuning, whereas HyperCLOVA reports prompt-tuning results.
3.3 Seq2Seq-Centric Models

Seq2Seq[10]-centric models use the full sequence-to-sequence Transformer architecture for both NLU and NLG. Many pretraining methods are available because many seq2seq tasks exist. Unfortunately, only a few Korean PLMs trained with this method have been released.
3.3.1 KoBART
KoBART (https://github.com/SKT-AI/KoBART) is a seq2seq PLM based on the BART model, whose pretraining objectives are text infilling (for NLU) and auto-regressive generation (for NLG). It was trained on more than 40GB of corpus.
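A rough sketch of BART-style text infilling with the Transformers API: a span of the input is replaced by a single mask token and the decoder is trained to reconstruct the original text. The English facebook/bart-base checkpoint is used as a stand-in; a public KoBART checkpoint could be substituted.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "facebook/bart-base"   # stand-in; substitute a public KoBART checkpoint for Korean
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

original  = "the quick brown fox jumps over the lazy dog"
corrupted = "the quick <mask> over the lazy dog"    # a whole span replaced by one mask token

batch = tok(corrupted, text_target=original, return_tensors="pt")
loss = model(**batch).loss    # decoder learns to regenerate the uncorrupted sentence
print(loss.item())
```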
4 Experiment
TABLE I: Single-sentence task results.

| Models | NSMC¹ | BEEP! (Dev)² | Naver NER³ | Size (MB) |
|---|---|---|---|---|
| KoELECTRA (Small) | 89.36 | 63.07 | 85.4 | 54 |
| KoELECTRA (Base) | 90.63 | 67.61 | 88.11 | 431 |
| DistilKoBERT | 88.6 | 60.72 | 84.65 | 108 |
| KoBERT | 89.59 | 66.21 | 87.92 | 351 |
| SoongsilBERT (Small) | 90.7 | 66 | 84 | 213 |
| SoongsilBERT (Base) | 91.2 | 69 | 85.2 | 370 |
| KcBERT (Base) | 89.62 | 68.78 | 84.34 | 417 |
| KcBERT (Large) | 90.68 | 69.91 | 85.53 | 1200 |
| KoBigBird (Base) | 91.18 | - | - | 436 |
| KoBART | 90.24 | - | - | 473 |
| KoGPT2 | 91.13 | - | - | 490 |
| HanBERT | 90.06 | 68.32 | 87.70 | 614 |
| XLM-RoBERTa (Base) | 89.03 | 64.06 | 86.65 | 1030 |
| KcELECTRA (Base) | 91.71 | 74.05 | 86.90 | 475 |

¹ measured by accuracy. ²,³ measured by F1 score.
We aggregate downstream-task benchmark results of the pretrained models discussed above. Using the benchmark datasets introduced in Related Works, we report results from two perspectives: (1) tasks that deal with a single sentence (TABLE I), and (2) tasks that deal with multiple sentences or involve interaction between multiple agents (TABLE II).
4.1 Single Sentence Tasks
Korean benchmarks with a single sentence mainly focus on classification or tagging tasks. NSMC is a binary Korean sentiment classification benchmark labeled on Naver movie review comments. BEEP! is a Korean hate-speech classification benchmark labeled with the classes "hate", "offensive", and "none". Naver NER is a Korean named entity recognition benchmark released by NAVER Corp.
4.1.1 NSMC Result
NSMC is a benchmark dataset for classifying whether sentiment is positive or negative. All sentences come from Naver movie review comments. The dataset contains 150k sentences for training and 50k sentences for testing. KcELECTRA holds the state-of-the-art (SOTA) on this task with 91.71 accuracy.
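For reference, a small sketch of loading NSMC through the huggingface datasets library and computing the accuracy metric used in TABLE I; the dataset id and column names follow the public Hub dataset card and should be treated as assumptions.

```python
import numpy as np
from datasets import load_dataset

nsmc = load_dataset("nsmc")                 # assumed Hub id; splits: train (150k) / test (50k)
gold = np.array(nsmc["test"]["label"])      # 1 = positive, 0 = negative

def accuracy(preds, golds):
    return float((np.asarray(preds) == np.asarray(golds)).mean())

# Trivial majority-class baseline, just to show how the metric is computed.
print(accuracy(np.ones_like(gold), gold))
```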
4.1.2 BEEP! Result
BEEP! is a human-annotated corpus in which the intensity of hate speech is tagged with the labels 'hate', 'offensive', and 'none', built upon celebrity news comments from a Korean online news platform. KcELECTRA achieved the highest score among the models with an F1 score of 74.05. Interestingly, DistilKoBERT, the lightweight version of KoBERT, drops by more than 5 points relative to KoBERT, even though the two models score nearly the same on NSMC.
4.1.3 Naver NER Result
The Naver NER dataset was published by processing Korean Wikipedia into text form. The training set contains 90,000 examples. The KoELECTRA (Base) model achieved state-of-the-art results on this task. Interestingly, unlike on NSMC and BEEP!, KcBERT and SoongsilBERT do not perform well here, scoring even lower than the general multilingual model XLM-R[32], which is not specialized for Korean.
TABLE II: Multiple-sentence and agent task results.

| Models | KorNLI¹ | KorSTS² | Question Pair³ | KorQuAD (Dev)⁴ | Size (MB) |
|---|---|---|---|---|---|
| KoELECTRA (Small) | 78.6 | 80.79 | 94.85 | 82.11 / 91.13 | 54 |
| KoELECTRA (Base) | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 431 |
| DistilKoBERT | 72 | 72.59 | 92.48 | 54.40 / 77.97 | 108 |
| KoBERT | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 351 |
| SoongsilBERT (Small) | 76 | 74.2 | 92 | - | 213 |
| SoongsilBERT (Base) | 78.3 | 76 | 94 | - | 370 |
| KcBERT (Base) | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 | 417 |
| KcBERT (Large) | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 | 1200 |
| KoBigBird (Base) | - | - | - | 87.08 / 94.71 | 436 |
| KoBART | - | 81.66 | 94.34 | - | 473 |
| KoGPT2 | - | 78.4 | - | - | 490 |
| HanBERT | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | 614 |
| XLM-RoBERTa (Base) | 80.23 | 78.45 | 93.8 | 64.70 / 88.94 | 1030 |
| KcELECTRA (Base) | 81.65 | 82.65 | 95.78 | 70.60 / 90.11 | 475 |

¹,³ measured by accuracy. ² measured by Spearman correlation. ⁴ reported as EM score / F1 score.
4.2 Multiple Sentence and Agent Tasks
The results of these tasks show different patterns from before. KoELECTRA and KoBigBird achieve the best results, whereas KcBERT and SoongsilBERT, which performed better previously, lag behind. Unlike the single-sentence tasks, the texts in these datasets are much longer, since there is interaction between sentences (NLI, STS) or agents (QA). KorNLI and KorSTS are Korean NLI and STS datasets released by Kakao Brain. The Question Pair (Korean) dataset is a paraphrase detection benchmark that measures the similarity between two Korean question sentences; unfortunately, it is no longer accessible because its repository has been removed. Finally, the KorQuAD dataset is the Korean version of the SQuAD (QA) dataset. Although the latest version is 2.0, we use 1.0 because most models report results on it.
4.2.1 KorNLI Result
The NLI task classifies the relationship between two sentences as 'entailment', 'contradiction', or 'neutral'. The KorNLI dataset has 942,854 sentence pairs for training, 2,490 for evaluation, and 5,010 for testing. KoELECTRA achieves the state-of-the-art score on this task. However, most Korean PLMs score lower than XLM-R, which is not a Korean-centric model.
4.2.2 KorSTS Result
The STS task also compares two sentences, but instead of classifying a relation it scores their similarity on a scale from 0 (not similar) to 5 (identical). KorSTS has 5,749 examples for training, 1,500 for evaluation, and 1,379 for testing. As in Section 4.2.1, KoELECTRA records the best score on this task.
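For clarity, the Spearman correlation reported for KorSTS can be computed with scipy; the similarity scores below are toy values for illustration only.

```python
from scipy.stats import spearmanr

gold_scores = [0.0, 1.5, 2.4, 3.8, 5.0]   # toy gold similarity annotations (0-5 scale)
pred_scores = [0.3, 1.1, 2.9, 3.5, 4.7]   # toy model predictions
print(spearmanr(pred_scores, gold_scores).correlation)
```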
4.2.3 Question Pair Result
The Question Pair dataset has 6,888 training examples and 688 test examples. The KcELECTRA model records the best score on this task. However, the Question Pair dataset is currently unavailable because its repository has been removed.
4.2.4 KorQuAD Result
KorQuAD 1.0 consists of 10,645 paragraphs and 66,181 question-answer pairs over 1,560 Wikipedia articles, split into 60,407 Q&A pairs for the training set and 5,774 Q&A pairs for the dev set. KoBigBird scores highest on this task (87.08 EM / 94.71 F1). On the other hand, KcBERT and KoBERT perform poorly, scoring even lower than XLM-R; the sentences in their pretraining corpora appear to be too short for the models to handle long sequences.
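As a simplified sketch of the two KorQuAD metrics: exact match checks string equality of the predicted answer span, and F1 measures token overlap between prediction and gold answer. The official evaluation script additionally normalizes text and handles multiple gold answers, so the functions below are illustrative only.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.split(), gold.split()
    common = Counter(p) & Counter(g)          # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1953년", "1953년"), token_f1("서울 특별시", "서울"))   # prints 1.0 and about 0.67
```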
5 Conclusion
In this survey, we discussed several Korean pretrained language models and benchmarks and compared the models on them. Of course, there are many more publicly available Korean language models beyond those introduced here, but we could not include all of them, either for reasons of space or because their benchmark results have not been reported broadly enough. In future work, we expect the latest Korean benchmarks such as KLUE, together with further surveys, to promote the development of Korean NLP and, more broadly, computational linguistics.
References
- [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017.
- [2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
- [3] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training.”
- [4] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling.”
- [5] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
- [6] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- [7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2020.
- [8] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” CoRR, vol. abs/2102.12092, 2021. [Online]. Available: https://arxiv.org/abs/2102.12092
- [9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
- [10] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
- [11] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- [12] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
- [13] M. Ott, S. Edunov, D. Grangier, and M. Auli, “Scaling neural machine translation,” in Proceedings of the Third Conference on Machine Translation: Research Papers, 2018, pp. 1–9.
- [14] D. So, Q. Le, and C. Liang, “The evolved transformer,” in International Conference on Machine Learning, 2019, pp. 5877–5886.
- [15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [16] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “ELECTRA: Pre-training text encoders as discriminators rather than generators,” in ICLR, 2020. [Online]. Available: https://openreview.net/pdf?id=r1xMH1BtvB
- [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
- [18] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.”
- [19] J. Ham, Y. J. Choe, K. Park, I. Choi, and H. Soh, “Kornli and korsts: New benchmark datasets for korean natural language understanding,” CoRR, vol. abs/2004.03289, 2020. [Online]. Available: https://arxiv.org/abs/2004.03289
- [20] J. Moon, W. I. Cho, and J. Lee, “BEEP! Korean corpus of online news comments for toxic speech detection,” in Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media. Online: Association for Computational Linguistics, Jul. 2020, pp. 25–31. [Online]. Available: https://www.aclweb.org/anthology/2020.socialnlp-1.4
- [21] S. Park, J. Moon, S. Kim, W. I. Cho, J. Han, J. Park, C. Song, J. Kim, Y. Song, T. Oh, J. Lee, J. Oh, S. Lyu, Y. Jeong, I. Lee, S. Seo, D. Lee, H. Kim, M. Lee, S. Jang, S. Do, S. Kim, K. Lim, J. Lee, K. Park, J. Shin, S. Kim, L. Park, A. Oh, J.-W. Ha, and K. Cho, “Klue: Korean language understanding evaluation,” 2021.
- [22] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” 2019.
- [23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” 2020.
- [24] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, “Mass: Masked sequence to sequence pre-training for language generation,” in ICML, 2019.
- [25] J. Lee, “Kcbert: Korean comments bert,” in Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology, 2020, pp. 437–440.
- [26] M. Schuster and K. Nakajima, “Japanese and korean voice search,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 5149–5152.
- [27] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
- [28] J. Park and D. Kim, “Kobigbird: Pretrained bigbird model for korean,” Nov. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.5654154
- [29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
- [30] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, J. D. Hyeon, S. Park, S. Kim, S. Kim, D. Seo et al., “What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3405–3424.
- [31] I. Kim, G. Han, J. Ham, and W. Baek, “Kogpt: Kakaobrain korean(hangul) generative pre-trained transformer,” https://github.com/kakaobrain/kogpt, 2021.
- [32] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” CoRR, vol. abs/1911.02116, 2019. [Online]. Available: http://arxiv.org/abs/1911.02116