In-game Toxic Language Detection: Shared Task and Attention Residuals
Abstract
In-game toxic language has become a pressing problem in the gaming industry and community. Several frameworks and models for online game toxicity analysis have been proposed. However, detecting toxicity remains challenging due to the nature of in-game chat, whose utterances are extremely short. In this paper, we describe how an in-game toxic language shared task was established using real-world in-game chat data. In addition, we introduce the best-performing model/framework for toxic language token tagging (slot filling) from in-game chat. The code is publicly available on GitHub (https://github.com/Yuanzhe-Jia/In-Game-Toxic-Detection).
Introduction
Toxic behaviour has become a severe problem in online games and the gaming industry. The nature of toxic utterances in in-game chat is very different from that of other online domains, such as social media or online news: utterances are much shorter because players type in chat while playing, and longer utterances occur only in pre- or post-game discussions. With this nature of in-game chat in mind, understanding at the slot (word token) level (Weld et al. 2021b) is crucial for detecting in-game toxic language. In this paper, we describe the slot (token)-based in-game toxic language detection shared task we established, and present the best model and its novel components. The results demonstrate the effectiveness of the best model in comparison with the baselines from CONDA (Weld et al. 2021a).
Shared Task and Dataset
We set up a shared task competition (https://www.kaggle.com/competitions/2022-comp5046-a2) for a sequence token labelling task: identifying the type of semantics conveyed by each slot in the in-game chat utterances provided by CONDA (Weld et al. 2021a). CONDA consists of 44,869 utterances from chat logs of 1,921 Dota 2 matches, with 26,921/8,974/8,974 utterances in the training/validation/test sets respectively. CONDA provides 6 distinct slot labels, T (Toxicity), C (Character), D (Dota-specific), S (Game Slang), P (Pronoun) and O (Other), to support a deeper understanding of game context. A total of 312 student teams participated in our shared task. Each team was allowed 20 submissions per day, and in total we received 3,646 submissions across 4 weeks. Most participants pre-processed the provided data simply by tokenizing the utterances into slots by spaces, as CONDA provides cleaned utterances with one label per slot. Some teams also applied data augmentation to increase the number of training instances. For the input embedding, participants tried combinations of syntactic, semantic and domain-related embeddings. For the syntactic embedding, most teams encoded POS tags and/or dependency parsing results produced by spaCy models. For the semantic embedding, participants either directly used pre-trained GloVe/FastText embedding vectors or trained their own FastText/Word2Vec embeddings on the provided game chat corpus. Some teams sought a Dota-related corpus or a toxicity detection dataset collected from social media to train word embeddings and inject more domain-related information into the model. Participants explored a variety of sequence labelling model architectures; below we introduce the best-performing team's method and results.
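The whitespace tokenization most teams used can be sketched as follows; the labels in the usage example are illustrative, not taken from the dataset:

```python
def align_slots(utterance, labels):
    """Split a cleaned CONDA utterance on spaces and pair each
    token (slot) with its slot label. CONDA provides cleaned
    utterances, so whitespace splitting yields exactly one token
    per label."""
    tokens = utterance.split()
    assert len(tokens) == len(labels), "expect one slot label per token"
    return list(zip(tokens, labels))
```

For instance, `align_slots("gg wp", ["S", "S"])` pairs each slang token with its (here hypothetical) label.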
Methodology
Our model achieved the best performance in the shared task. It comprises Bi-LSTM cells, attention residuals, label forcing and a CRF. Since the model uses global information extracted from the attention mechanism as residuals to supplement bi-directional features, it is named Bi-directional Representations with Attention Residuals (BRAR).

Bi-LSTM cells perform feature extraction on the input sequence $x = (x_1, \dots, x_n)$, where $n$ denotes the number of tokens; $y = (y_1, \dots, y_n)$ is the corresponding sequence of slot labels with $k$ unique values. The attention layer aims to capture the global information of the input utterance. Attention is calculated by the following equation, where $h_n$ is the last hidden state of the Bi-LSTM cells, $W_a$ is a trainable weight matrix and $H = (h_1, \dots, h_n)$ is the output of the Bi-LSTM:

$g = \mathrm{softmax}\left(h_n W_a H^{\top}\right) H$   (1)
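A minimal sketch of this global attention step, assuming the common formulation in which the last Bi-LSTM hidden state acts as the query over all token states (the function names and exact parameterisation are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def global_attention(H, h_last, W_a):
    """Attend over all Bi-LSTM outputs H (n x d) using the last
    hidden state h_last (d,) as the query; W_a (d x d) is a
    trainable weight matrix. Returns one global context vector."""
    scores = h_last @ W_a @ H.T   # (n,): one score per token
    alpha = softmax(scores)       # attention weights, sum to 1
    return alpha @ H              # weighted sum of token states
```

Because the weights form a convex combination, the global vector always stays within the per-dimension range of the token states.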
As the global information is not always helpful for understanding each input token, it is treated as a residual of the token-level representation to form the feature $f_i$, where $h_i$ is the Bi-LSTM state of token $i$, $g$ is the global attention vector from Eq. (1), $\lambda$ is a trainable scaling parameter and $W_r$ is a trainable weight matrix for dimension transformation. When the global information benefits the token-level interpretation, it speeds up convergence; otherwise, it does not reduce prediction performance.

$f_i = h_i + \lambda \, g W_r$   (2)
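A sketch of the residual combination, assuming the additive form $f_i = h_i + \lambda \, g W_r$ applied to every token state (names and shapes are illustrative):

```python
import numpy as np

def token_features(H, g, lam, W_r):
    """Add the scaled, projected global vector g as a residual to
    every row of the (n x d) Bi-LSTM output H. When lam is driven
    towards zero during training, the features fall back to the
    plain Bi-LSTM states, so an unhelpful global vector cannot
    degrade the token-level representation."""
    return H + lam * (g @ W_r)
```

Setting `lam = 0` recovers the plain Bi-LSTM features exactly, which is the fallback behaviour described above.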
The feature $f_i$ is enhanced at the label forcing layer to form the emission score, which is then passed to the CRF layer to predict the tag of each token. For example, “gg” stands for “good game”. If every occurrence of “gg” in the training set is labelled “S” (Game Slang), the same word in the test set has a high probability of being labelled “S”. If the correspondence between words and labels in the training set can be learned, it is of great help in predicting on the test set. The proposed model therefore uses a novel technique, label forcing (LF): the label distribution of each token in the training corpus is calculated and normalised by the following equation, where $c(w, l_j)$ is the frequency with which token $w$ is annotated with label $l_j$. The overall architecture is shown in Figure 1.

$p(l_j \mid w) = \dfrac{c(w, l_j)}{\sum_{l'} c(w, l')}$   (3)
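The label forcing table can be sketched directly from (token, label) pairs; the helper name is ours, and the tiny corpus in the usage example is invented for illustration:

```python
from collections import Counter, defaultdict

def label_forcing_table(tagged_corpus, labels=("T", "C", "D", "S", "P", "O")):
    """Build the normalised label distribution p(l | w) from a
    training corpus given as (token, label) pairs: each token's
    per-label counts are divided by its total frequency."""
    counts = defaultdict(Counter)
    for token, label in tagged_corpus:
        counts[token][label] += 1
    return {
        token: {l: c[l] / sum(c.values()) for l in labels}
        for token, c in counts.items()
    }
```

For a token such as “gg” that only ever appears with the S label in training, the table assigns it probability 1.0 for S, which is exactly the bias the emission scores inherit.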
Experiment
The proposed model adopts 50-dimensional FastText embeddings (FastText-50) as input. The Bi-LSTM layer has a hidden size of 5 in each direction, and the number of layers is set to 1. The SGD optimizer is applied with a learning rate of 0.1 and a weight decay of 1e-4. The model is trained for 2 epochs with a batch size of 1 (the same as CONDA). For evaluation, the overall micro F1 and the F1 for each slot label are reported on the test set; the O tag is excluded when calculating the overall F1 score. The 5 baselines are selected from the CONDA paper. Table 1 shows that BRAR outperforms the baselines on most F1 scores, especially for the T, S and D labels.
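A sketch of the token-level micro F1 with the O tag excluded; the exact treatment of O-labelled positions in the official scorer is an assumption here:

```python
def micro_f1_excluding_o(gold, pred, ignore="O"):
    """Token-level micro F1 over the slot labels, skipping positions
    where both gold and predicted tags are the O (Other) label."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == ignore and p == ignore:
            continue  # O tag excluded from the overall score
        if g == p:
            tp += 1
        else:
            if p != ignore:
                fp += 1  # predicted a non-O label incorrectly
            if g != ignore:
                fn += 1  # missed a gold non-O label
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
```

Under this convention, a model that tags everything as O scores 0 whenever any non-O gold labels exist, so the metric cannot be gamed by over-predicting the majority tag.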
Model | F1 | F1(T) | F1(P) | F1(S) | F1(D) | F1(C) | F1(O) |
---|---|---|---|---|---|---|---|
RNN-NLU (2016) | 97.0 | 93.1 | 98.1 | 93.0 | 71.8 | 99.1 | 98.7 |
Slot-gated (2018) | 99.1 | 97.8 | 99.2 | 98.2 | 95.2 | 99.7 | 99.4 |
Inter-BiLSTM (2018) | 86.5 | 87.1 | 88.9 | 86.9 | 78.8 | 94.2 | 92.4 |
Capsule NN (2019) | 99.1 | 97.5 | 99.1 | 98.2 | 94.9 | 99.7 | 99.4 |
Joint BERT (2019) | 98.9 | 97.2 | 99.2 | 97.9 | 91.4 | 99.8 | 99.3 |
BRAR (our model) | 99.9 | 98.6 | 99.4 | 99.4 | 98.1 | 99.0 | 99.5 |
We evaluated the model with different numbers of stacked Bi-LSTM layers. The results in Table 2 indicate that more layers lead to lower F1 scores, though the differences among the three settings are small. We therefore use a single Bi-LSTM layer, as it is computationally more efficient with fewer parameters.
Number of Stacks | F1 | F1(T) | F1(P) | F1(S) | F1(D) | F1(C) | F1(O) |
---|---|---|---|---|---|---|---|
1 layer | 99.9 | 98.6 | 99.4 | 99.4 | 98.1 | 99.0 | 99.5 |
2 layers | 99.6 | 98.1 | 99.2 | 99.3 | 96.4 | 98.1 | 99.5 |
3 layers | 98.6 | 96.5 | 99.1 | 98.3 | 81.8 | 94.4 | 99.4 |
Conclusion
In this paper, we established a shared task for slot (word token)-based in-game toxic language detection and presented the winning (best) model, which integrates bi-directional representations, attention residuals, the label forcing technique and a CRF. Experiments indicate that the proposed model captures global information between the semantic components in the slot filling task more effectively than existing models.
Acknowledgments
This work was supported by FortifyEdge. We would like to specially thank Mr Hyunsuk (David) Chung for his help and guidance for this research work.
References
- Bing and Ian (2016) Bing, L.; and Ian, L. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, 685–689.
- Chenwei et al. (2019) Chenwei, Z.; Yaliang, L.; Nan, D.; Wei, F.; and Yu, P. S. 2019. Joint slot filling and intent detection via capsule neural networks. In Proceedings of the 57th Annual Meeting of the ACL, 5259–5267.
- Chih-Wen et al. (2018) Chih-Wen, G.; Guang, G.; Yun-Kai, H.; Chih-Li, H.; Tsung-Chieh, C.; Keng-Wei, H.; and Yun-Nung, C. 2018. Slot-gated modeling for joint slot filling and intent prediction. In NAACL-HLT 2018, 753–757.
- Weld et al. (2021a) Weld, H.; Huang, G.; Lee, J.; Zhang, T.; Wang, K.; Guo, X.; Long, S.; Poon, J.; and Han, C. 2021a. CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection. In Findings of the Association for Computational Linguistics: ACL 2021, 2406–2416.
- Weld et al. (2021b) Weld, H.; Huang, X.; Long, S.; Poon, J.; and Han, S. C. 2021b. A survey of joint intent detection and slot filling models in natural language understanding. ACM Computing Surveys (CSUR).
- Yu, Yilin, and Hongxia (2018) Yu, W.; Yilin, S.; and Hongxia, J. 2018. A bi-model based rnn semantic frame parsing model for intent detection and slot filling. In NAACL-HLT 2018.