
Towards Patronizing and Condescending Language in Chinese Videos: A Multimodal Dataset and Detector

Hongbo Wang Dalian University of Technology, China Junyu Lu Dalian University of Technology, China Yan Han University of Tsukuba, Japan Kai Ma Dalian University of Technology, China Liang Yang Dalian University of Technology, China Hongfei Lin Dalian University of Technology, China
Abstract

Patronizing and Condescending Language (PCL) is a form of discriminatory toxic speech targeting vulnerable groups, threatening both online and offline safety. While toxic speech research has mainly focused on overt toxicity, such as hate speech, microaggressions in the form of PCL remain underexplored. Additionally, the discriminatory facial expressions and attitudes that dominant groups direct at vulnerable communities can be more impactful than verbal cues, yet these frame-level features are often overlooked. In this paper, we introduce the PCLMM dataset, the first Chinese multimodal dataset for PCL, consisting of 715 annotated videos from Bilibili, with high-quality PCL facial frame spans. We also propose the MultiPCL detector, featuring a facial expression detection module for PCL recognition, demonstrating the effectiveness of modality complementarity in this challenging task. Our work makes an important contribution to advancing microaggression detection within the domain of toxic speech.

Index Terms:
Patronizing and Condescending Language, Toxic Speech, Multimodal, Video, Facial Expression.

I Introduction

The rapid development of social media has exceeded expectations. Since around 2010, self-media has gradually gained prominence in disseminating ideas on mainstream platforms, such as YouTube and TikTok [1] in English-speaking communities and Bilibili [2] in China. While these video platforms have created significant economic benefits and influence, they have also accelerated the spread of harmful content. Mainstream online platforms enforce robust regulations to reduce the risk of dangerous content, but these measures primarily slow the dissemination of videos with clearly aggressive content, such as hate speech [3], while overlooking microaggressions targeting vulnerable communities, known as Patronizing and Condescending Language (PCL) [4].

PCL is a form of discriminatory toxic speech targeting vulnerable groups, such as individuals with disabilities, children, and the elderly, reflecting a superior attitude towards these communities [4]. Although the construction of PCL corpora has advanced [4, 5, 6] and researchers have established specialized evaluation tracks, applying a variety of improved deep learning networks to the task [7, 8, 9], current PCL research remains text-based. Unlike traditional toxic speech such as hate speech, PCL lacks explicit offensive words, making it more subtle and implicit [10]. These characteristics suggest that integrating multimodal signals, especially discriminatory facial expressions, can drive breakthroughs in this field. Although multimodal frameworks have begun to emerge in hate detection [11, 12], this direction remains unexplored for PCL. Moreover, current research is limited to English and lacks attention to vulnerable groups in other language communities.

In this paper, we introduce a multimodal dataset and a corresponding detector designed to improve the automated detection of microaggressions on video platforms, with the aim of protecting vulnerable communities. We present PCLMM, the first multimodal dataset for detecting PCL in videos, comprising 715 annotated videos (over 21 hours of content) from Bilibili, one of China's largest video platforms. PCLMM covers a wide range of vulnerable communities in China and is publicly available to support further research; the dataset and code have been open-sourced at https://github.com/dut-laowang/PCLMM. We also propose the MultiPCL detector, which integrates facial expression features with video, text, and audio modalities to enhance the detection of discriminatory language. Our research focuses on the Chinese context because of the scale of its vulnerable communities, and our findings are also relevant to English-speaking contexts. Our contributions are summarized as follows: (1) We develop and release PCLMM, the first multimodal PCL dataset, comprising 715 Bilibili videos (21+ hours) annotated as patronizing (PCL) or non-patronizing (non-PCL), with annotated PCL facial frame spans. (2) We introduce the MultiPCL detector, which integrates facial expression, video, text, and audio features, demonstrating significant improvements in detection accuracy. (3) Our sentiment and toxicity analysis indicates that PCL exhibits a certain level of ambiguity, and our detector can effectively identify these marginal features.

Figure 1: The presentation of our multimodal PCL framework: (a) Data Collection. Refining annotation guidelines and gathering data from Bilibili. (b) PCLMM dataset. A high-quality annotated dataset with PCL frame spans. (c) MultiPCL detector. Utilizing a cross-attention mechanism to extract and integrate features from facial expressions, video, text, and audio modalities.

II PCLMM Dataset

II-A Overview

In this section, we outline the construction of the PCLMM dataset. We developed a comprehensive semantic definition of PCL in Chinese to create annotation guidelines. Using six key vulnerable community categories from the Chinese internet, we compiled a keyword list and collected videos via targeted searches. The dataset was manually annotated by three annotators, followed by sentiment and toxicity analysis.

II-B Definition Development

PCL typically targets vulnerable groups, but existing definitions often do not align with the Chinese context, where the concept of vulnerable groups differs from that in English-speaking communities. For example, PCL toward immigrants is rare in China for policy reasons. Building on [4, 6], we proposed a comprehensive definition of PCL tailored to the Chinese context, which serves as our annotation guide.

Chinese PCL refers to discriminatory, falsely sympathetic, and hypocritical remarks directed at six vulnerable groups within the Chinese community: disabled individuals, women, the elderly, children, single-parent families, and low-income groups. A key feature of PCL is the speaker’s condescending attitude, making statements that do not improve the group’s situation. PCL expressions are often accompanied by contemptuous and discriminatory facial expressions. To minimize subjective discrepancies, we specified the following cases to be annotated as non-PCL:

  • Vulnerable individuals describing their own experiences of unfair treatment.

  • Objective news reports on discriminatory incidents.

  • Public service announcements containing discriminatory content but lacking discriminatory intent.

II-C Data Collection

Based on the definition in II-B, we identified six major vulnerable communities on the Chinese internet. We expanded each community into a list of 10 commonly used keywords and designed a lexicon of offensive and discriminatory terms to serve as query keys. These were matched with the keyword list to generate the final search set (e.g., pairing the query "discrimination" with the keyword "elderly care"). The search list comprised 1,800 keyword-query pairs, retrieving 2,654 preliminary videos. We retained videos between 30 seconds and 5 minutes long, filtered out damaged and irrelevant videos, and applied masking techniques to obscure any watermarks that might disclose user privacy. This process yielded 715 high-quality annotatable samples.
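
To make the search-set construction concrete, the sketch below pairs community keywords with query terms from the lexicon. All lists and counts in the sketch are illustrative stand-ins, not the released keyword lists.

```python
from itertools import product

# A sketch of the search-set construction: each community contributes 10
# keywords (two shown here), which are paired with a lexicon of offensive /
# discriminatory query terms. All lists below are illustrative stand-ins.
community_keywords = {
    "elderly": ["elderly care", "nursing home"],
    "disabled": ["wheelchair user", "hearing impaired"],
}
query_lexicon = ["discrimination", "pity", "condescension"]

# 6 communities x 10 keywords x a 30-term lexicon yields the 1,800 pairs
# reported above; this toy version produces 2 x 2 x 3 = 12 pairs.
search_set = [
    f"{keyword} {query}"
    for keywords in community_keywords.values()
    for keyword, query in product(keywords, query_lexicon)
]
print(len(search_set))  # 12
```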

TABLE I: Statistics of PCLMM. "PCL Frame Spans" refers to statistics of the patronizing spans within the PCL videos; $\mu$ denotes the average value.

                        Non-PCL   PCL    PCL Frame Spans   Total
Total num                 519     196         330           715
Total len (hrs)          15.1     6.5         2.3          21.6
Total frames (M)          1.6     0.7         0.2           2.3
$\mu$ Video len (min)     1.7     1.9         0.4           1.8
$\mu$ Text len (char)     455     536         158           477

II-D Data Annotation

Two trained Ph.D. students annotated the videos, with a third serving as reviewer (two males, one female, aged 25-30, all in computer science and focused on toxicity detection). Annotators were compensated based on the number of annotations completed. Videos were labeled as PCL or non-PCL following II-B. An initial set of 30 videos (20 non-PCL, 10 PCL) was used to reach consensus on discrepancies. To minimize psychological harm, annotators were limited to 20 videos per day and regularly reported their psychological state. Annotators used the CVAT tool [13] to record facial expression spans within PCL videos; non-PCL videos contain no patronizing spans. Fleiss' Kappa [14] measured inter-annotator agreement (IAA = 0.72), with the third annotator resolving discrepancies. In total, we obtained 196 PCL and 519 non-PCL videos.
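
For reference, the agreement statistic can be computed as in the following sketch, which applies statsmodels' Fleiss' kappa to a toy label matrix rather than the actual annotation records.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy (videos x annotators) matrix of binary labels: 1 = PCL, 0 = non-PCL.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
])
counts, _ = aggregate_raters(ratings)  # -> (videos x categories) count table
print(fleiss_kappa(counts))            # the full PCLMM set yields IAA = 0.72
```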

II-E Data Statistics

The PCLMM dataset contains 715 videos totaling 21.6 hours of content, with an average video length of 1.8 minutes at a frame rate of 30 FPS, comprising 2.3M frames. Approximately 27.4% of the videos were labeled as patronizing, consistent with the distribution of PCL on internet platforms. Detailed dataset statistics are shown in Table I.

II-F Data Analysis

II-F1 Sentiment Analysis

We used the advanced open-source model DeepFace [15] to analyze facial expressions in the PCLMM dataset. We sampled 20 videos per community from both the PCL and non-PCL subsets, totaling 240 samples. For PCL videos, 10 facial expressions were selected from the annotated PCL frame spans; for non-PCL videos, 10 expressions were sampled from general frames. As shown in Figure 2, non-PCL expressions were predominantly positive or neutral, while PCL expressions conveyed more negative emotions such as anger, sadness, and disgust. Some PCL expressions were misclassified as 'happy' despite conveying superiority and contempt, highlighting the limitations of basic positive-negative classification for detecting PCL.
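
A minimal sketch of the per-frame expression analysis with DeepFace [15] follows; the file path is a placeholder, and the exact sampling and API options used in our analysis are not specified here.

```python
from deepface import DeepFace

# "frame.jpg" is a placeholder for a frame sampled from an annotated PCL span.
result = DeepFace.analyze(
    img_path="frame.jpg",
    actions=["emotion"],
    enforce_detection=False,  # keep frames where no face is confidently found
)
# Recent DeepFace versions return a list with one entry per detected face.
print(result[0]["dominant_emotion"])  # e.g., "angry", "disgust", or "happy"
```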

Figure 2: Sentiment analysis for the six vulnerable groups in PCLMM.

II-F2 Toxicity Analysis

We scored the transcribed texts using the Perspective API [16], as shown in Figure 3. PCL samples received higher toxicity scores than non-PCL samples across all community categories (0.37 vs. 0.24 on average). However, PCL toxicity remains well below that of traditional hate speech (typically above 0.7), highlighting its implicit nature and the resulting detection challenge.
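
A hedged sketch of how a transcript can be scored is shown below; the request shape follows the public Perspective API commentanalyzer endpoint [16], and the API key and input text are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")
payload = {
    "comment": {"text": "transcribed video text here"},  # placeholder transcript
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(url, json=payload).json()
print(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])
```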

Figure 3: Average toxicity scores in PCLMM.

III Method

III-A Problem Statement

Given a dataset of video samples $V$, the task is to classify videos targeting vulnerable groups as either PCL ($y=1$) or non-PCL ($y=0$). Each video is represented by a sequence of frames $F=\{f_1, f_2, \dots, f_n\}$ and a subset of facial expression frames $F_v=\{f_{1v}, f_{2v}, \dots, f_{nv}\}$. If a frame $f_i$ lacks facial expressions, $f_{iv}$ is filled with a zero vector. The audio sequence is $A=\{a_1, a_2, \dots, a_l\}$, and the transcribed text sequence is $T=\{w_1, w_2, \dots, w_m\}$. The goal is to learn an attention-based multimodal classifier $X: (F; F_v; A; T) \rightarrow y$, where $y \in \{0, 1\}$.

III-B Video Encoding

We used the Vision Transformer (ViT) [17] to extract visual features from the videos. Given a sequence of frames $F=\{f_1, f_2, \dots, f_n\}$, ViT encodes each frame $f_i$ into a feature vector $\mathbf{z}_i$, computed as:

$\mathbf{z}_i = \text{ViT}(f_i), \quad \mathbf{z}_i \in \mathbb{R}^{d_v}, \quad i = 1, 2, \dots, n$   (1)

Here, $\mathbf{z}_i \in \mathbb{R}^{d_v}$ is the $d_v$-dimensional feature vector that ViT produces for frame $f_i$.
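
A minimal sketch of Eq. (1) using a HuggingFace ViT is given below; the checkpoint name and the pooling choice (the [CLS] embedding) are assumptions, as the paper does not pin down the exact weights.

```python
import torch
from transformers import ViTImageProcessor, ViTModel

# Checkpoint name is an assumption; the paper does not specify the weights.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def encode_frames(frames):
    """frames: list of PIL images sampled from a video -> (n, d_v) features."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0]  # [CLS] embedding per frame, d_v = 768
```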

III-C Facial Expression Encoding

To capture patronizing facial expressions in the video, we first used MTCNN (Multi-task Cascaded Convolutional Networks) [18] for face detection. Next, FER-VT (Facial Expression Recognition using Vision Transformers) [19] encoded facial features using grid-wise attention and visual transformers to capture long-range dependencies. For each video frame fif_{i}, if MTCNN detected a face, FER-VT extracted the facial feature vector 𝐳iv\mathbf{z}_{i}^{v}; otherwise, a zero vector was assigned.

$\mathbf{z}_i^v = \begin{cases} \text{FER-VT}(f_{iv}), & \text{if MTCNN detects a face in } f_i \\ \mathbf{0}, & \text{if no face is detected} \end{cases}$   (2)
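
A sketch of Eq. (2) follows. It uses the facenet-pytorch implementation of MTCNN for detection; since FER-VT [19] has no standard package, the encoder is left as an assumed callable fer_vt, and the feature dimension is a placeholder.

```python
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=False)  # detector used for face presence and cropping
d_f = 512  # assumed facial feature dimension

def encode_face(frame, fer_vt):
    """frame: PIL image; fer_vt: assumed FER-VT encoder mapping a face crop
    to a d_f-dimensional vector (no standard package exists for FER-VT)."""
    face = mtcnn(frame)          # cropped face tensor, or None if no face
    if face is None:
        return torch.zeros(d_f)  # zero vector for faceless frames, as in Eq. (2)
    return fer_vt(face.unsqueeze(0)).squeeze(0)
```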

III-D Audio Encoding

We used FFmpeg [20], a widely used multimedia toolkit, to extract high-quality audio from the videos, and then applied Mel-Frequency Cepstral Coefficients (MFCCs) [21] to extract audio features. The extracted audio sequence $A$ was encoded as $\mathbf{z}^a$.
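
A sketch of this pipeline under stated assumptions: FFmpeg extracts the track, librosa computes the MFCCs, and mean pooling produces a single vector; the sampling rate, n_mfcc, and pooling are illustrative choices rather than the paper's settings.

```python
import librosa

# First extract the track with FFmpeg, e.g.:
#   ffmpeg -i video.mp4 -vn -ar 16000 audio.wav
signal, sr = librosa.load("audio.wav", sr=16000)  # sampling rate is assumed
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)  # (40, time) features
z_a = mfcc.mean(axis=1)  # mean-pool over time into one audio vector (assumed)
```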

III-E Text Encoding

We used Whisper [22], a speech recognition model by OpenAI, to transcribe the audio into text. For text encoding, we utilized RoBERTa-Chinese [23] and a RoBERTa model fine-tuned on the CCPC dataset [6] for patronizing language detection (referred to as BERT-PCL). These models extract the [CLS] token representation from each transcript, producing a feature vector $\mathbf{z}^t$.
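
A sketch of the transcription-plus-encoding pipeline is shown below; the Whisper model size and the hfl/chinese-roberta-wwm-ext checkpoint are assumptions, and BERT-PCL would be loaded the same way from fine-tuned weights.

```python
import torch
import whisper
from transformers import BertTokenizer, BertModel

asr = whisper.load_model("base")  # model size is an assumption
transcript = asr.transcribe("audio.wav", language="zh")["text"]

# hfl/chinese-roberta-wwm-ext is an assumed RoBERTa-Chinese checkpoint;
# BERT-PCL would be loaded identically from CCPC-fine-tuned weights.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
inputs = tokenizer(transcript, return_tensors="pt",
                   truncation=True, max_length=512)
with torch.no_grad():
    z_t = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] feature vector
```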

III-F Cross-Modality Fusion

In our model, we used a unified Cross-Modality Multi-Head Attention (MHCA) mechanism to fuse information across different modalities. The general form of MHCA is:

$\text{MHCA}(\mathbf{Q}_i, \mathbf{K}_j, \mathbf{V}_j) = \text{Softmax}\left(\frac{\mathbf{Q}_i \mathbf{K}_j^\top}{\sqrt{d_k}}\right)\mathbf{V}_j$   (3)

Here, $\mathbf{Q}_i$ is the query from modality $i$, and $\mathbf{K}_j$ and $\mathbf{V}_j$ are the key and value from modality $j$. By varying $i$ and $j$, the interaction between each pair of modalities is expressed as:

$\mathbf{A}_{i,j} = \text{MHCA}(\mathbf{Q}_i, \mathbf{K}_j, \mathbf{V}_j), \quad i, j \in \{\mathbf{z}, \mathbf{z}^v, \mathbf{z}^a, \mathbf{z}^t\}$   (4)

The resulting attention features are then aggregated into a unified multimodal representation:

$\mathbf{Z} = \sum_{i,j} \mathbf{A}_{i,j}$   (5)
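
A minimal sketch of Eqs. (3)-(5) built on PyTorch's nn.MultiheadAttention follows; a single shared attention module stands in for per-pair projections, and the dimensions, head count, and equal sequence lengths across modalities are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MHCAFusion(nn.Module):
    """Pairwise cross-modality attention, summed as in Eq. (5)."""
    def __init__(self, d=768, heads=8):  # dimension and head count assumed
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, mods):
        # mods: dict of (batch, seq, d) tensors, one per modality; equal
        # sequence lengths are a simplifying assumption of this sketch.
        fused = 0
        for q_i in mods.values():          # query from modality i
            for kv_j in mods.values():     # key/value from modality j
                a_ij, _ = self.attn(q_i, kv_j, kv_j)  # Eqs. (3)-(4)
                fused = fused + a_ij       # Eq. (5): sum over all pairs
        return fused

mods = {m: torch.randn(2, 16, 768) for m in ("z", "z_v", "z_a", "z_t")}
Z = MHCAFusion()(mods)  # (2, 16, 768) unified multimodal representation
```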

III-G Loss Function

We employed BCEWithLogitsLoss as our loss function, which is well suited to binary classification tasks. The loss is computed as:

$\text{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \sigma(x_i) + (1 - y_i)\log(1 - \sigma(x_i))\right]$   (6)

where $x_i$ is the model's logit for sample $i$, $y_i$ is the gold label, $\sigma$ is the sigmoid function, and $N$ is the number of samples.
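
The same objective in PyTorch: BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy of Eq. (6) for numerical stability. The logits and labels below are toy values.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()       # fuses sigmoid and BCE of Eq. (6)
logits = torch.tensor([1.2, -0.7, 0.3])  # toy classifier outputs x_i
labels = torch.tensor([1.0, 0.0, 1.0])   # toy gold labels y_i
loss = criterion(logits, labels)
```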

IV Experiment

TABLE II: Model performance on the classification task of PCL videos. $X_p$ denotes metrics computed on the PCL class; $\text{F1}_m$ is the macro-averaged F1 score. Abbreviations: MC (MFCC), RC (RoBERTa-Chinese), BP (BERT-PCL), FT (FER-VT), VM (VideoMAE), VT (ViT).

M         Model       P_p     R_p     F1_p    F1_m    Acc
A         MC          35.81   56.89   45.21   54.28   64.14
T         RC          54.84   50.00   52.31   69.14   78.32
          BP          58.06   52.94   55.38   71.13   79.72
          GPT-4       65.52   55.88   60.32   74.55   82.52
F         FT          65.52   47.50   55.07   70.46   78.47
V         VM          61.76   52.50   56.76   70.90   77.78
          VT          65.62   52.50   58.33   72.22   79.17
A+F       MC+FT       39.13   45.00   41.86   58.55   65.28
A+T       MC+BP       58.82   50.00   54.05   69.08   76.39
T+F       BP+FT       62.89   55.00   58.67   72.06   78.47
A+V       MC+VT       58.00   72.50   64.44   74.14   77.78
V+F       VT+FT       62.79   67.50   65.06   75.46   79.86
V+T       VT+BP       63.04   72.50   67.44   76.79   80.56
A+T+F     MC+BP+FT    61.90   65.00   63.41   74.43   79.17
V+T+F     VT+BP+FT    64.44   72.50   68.24   77.47   81.25
V+T+A     VT+BP+MC    65.91   72.50   69.05   78.15   81.94
V+A+F     VT+MC+FT    67.44   72.50   69.88   78.84   82.64
V+A+T+F   MultiPCL    68.09   80.00   73.56   81.06   84.03

IV-A Experimental Settings

Our experiments were conducted on two NVIDIA A800-80G GPUs with 5-fold cross-validation to ensure robust evaluation. We trained for 20 epochs with a batch size of 10 and a learning rate of 1e-4, averaging the top five performances. All code was implemented in PyTorch. Evaluation metrics included Precision, Recall, F1-score, and Accuracy, which are standard in toxicity detection. Notably, the ViT architecture provides an efficient solution well suited to multimodal models, and its performance on short-video analysis matches that of VideoMAE [24]; we therefore chose ViT rather than VideoMAE as the video backbone for modality fusion.
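
A sketch of the evaluation protocol follows: stratified 5-fold splits over toy labels that mirror the dataset's 196/519 PCL/non-PCL split; the actual training and evaluation steps are elided.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels mirroring PCLMM's 196 PCL / 519 non-PCL split; each fold would
# train MultiPCL for 20 epochs (batch size 10, lr 1e-4) and evaluate
# P/R/F1/Acc on the held-out fold.
labels = np.array([1] * 196 + [0] * 519)
indices = np.arange(len(labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(indices, labels)):
    print(fold, len(train_idx), len(val_idx), round(labels[val_idx].mean(), 3))
```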

IV-B Experimental Result

We employed a strategy of progressively integrating multiple modalities, beginning with single modalities. The experimental results are presented in Table II. (1) In single-modality scenarios, the text modality yielded the highest detection performance, while audio alone performed poorly, underscoring the continued importance of text in toxicity detection. (2) In multi-modality scenarios, incorporating the video modality often led to superior results. Among dual-modality setups, combinations that include video achieved an average macro F1 of 75.46, whereas those without video reached only 66.56. The same trend holds for tri-modal configurations, highlighting the critical supporting role of video in feature understanding. Moreover, the facial expression modality achieves optimal performance only when combined with the video modality. (3) Our proposed MultiPCL, which integrates all four modalities, significantly outperforms all baselines, with absolute macro-F1 gains of 6.51, 4.27, and 2.22 points over the best single-, dual-, and tri-modal setups, respectively. This confirms the effectiveness of our detector.

We further conducted ablation experiments (Table III) on the MultiPCL detector to demonstrate the role of MHCA. Replacing MHCA with a standard fully connected layer reduced the macro F1 score by nearly 4 points, highlighting the critical role of MHCA in capturing relationships between modalities.

TABLE III: Ablation study showing the effectiveness of MHCA.

Model       P_p     R_p     F1_p    F1_m    Acc
MultiPCL    68.09   80.00   73.56   81.06   84.03
 -MHCA      62.50   75.00   68.18   77.09   80.56

V Conclusion

Patronizing and Condescending Language (PCL) is a form of discriminatory speech targeting vulnerable groups and is widespread online, demanding more comprehensive data resources and detection schemes. In this paper, we present PCLMM, the first multimodal PCL video dataset with 715 annotated videos totaling over 21 hours. We also propose the MultiPCL detector, integrating video and discriminatory facial expression features for multimodal detection, achieving state-of-the-art performance. Future work will explore PCL’s impact on microaggressions such as sarcasm and stereotypes, and evaluate existing multimodal large language models, particularly those incorporating audio, using our dataset and detector as benchmarks for microaggression detection.

References

  • [1] Nic Newman, Richard Fletcher, Anne Schulz, Simge Andi, Craig T Robertson, and Rasmus Kleis Nielsen. Reuters institute digital news report 2021. Reuters Institute for the study of Journalism, 2021.
  • [2] Hsin-Pey Peng. Exploring symbolic effect of new media: The impact of bilibili on gen z’s cohort identity and aesthetic choices in fashion. In International Conference on Fashion communication: between tradition and future digital developments, pages 176–187. Springer, 2023.
  • [3] Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. Hatexplain: A benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14867–14875, 2021.
  • [4] Carla Pérez-Almendros, Luis Espinosa-Anke, and Steven Schockaert. Don’t patronize me! an annotated dataset with patronizing and condescending language towards vulnerable communities. arXiv preprint arXiv:2011.08320, 2020.
  • [5] Zijian Wang and Christopher Potts. Talkdown: A corpus for condescension detection in context. arXiv preprint arXiv:1909.11272, 2019.
  • [6] Hongbo Wang, Mingda Li, Junyu Lu, Liang Yang, Hebin Xia, and Hongfei Lin. Ccpc: A hierarchical chinese corpus for patronizing and condescending language detection. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 640–652. Springer, 2023.
  • [7] Carla Pérez-Almendros, Luis Espinosa Anke, and Steven Schockaert. Semeval-2022 task 4: Patronizing and condescending language detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 298–307, 2022.
  • [8] Yong Deng, Chenxiao Dou, Liangyu Chen, Deqiang Miao, Xianghui Sun, Baochang Ma, and Xiangang Li. Beike nlp at semeval-2022 task 4: prompt-based paragraph classification for patronizing and condescending language detection. arXiv preprint arXiv:2208.01312, 2022.
  • [9] Junyu Lu, Hao Zhang, Tongyue Zhang, Hongbo Wang, Haohao Zhu, Bo Xu, and Hongfei Lin. Guts at semeval-2022 task 4: Adversarial training and balancing methods for patronizing and condescending language detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 432–437, 2022.
  • [10] Sik Hung Ng. Language-based discrimination: Blatant and subtle forms. Journal of Language and Social Psychology, 26(2):106–122, 2007.
  • [11] Mithun Das, Rohit Raj, Punyajoy Saha, Binny Mathew, Manish Gupta, and Animesh Mukherjee. Hatemm: A multi-modal dataset for hate video classification. In Proceedings of the International AAAI Conference on Web and Social Media, volume 17, pages 1014–1023, 2023.
  • [12] Krishanu Maity, Poornash Sangeetha, Sriparna Saha, and Pushpak Bhattacharyya. Toxvidlm: A multimodal framework for toxicity detection in code-mixed videos. In Findings of the Association for Computational Linguistics ACL 2024, pages 11130–11142, Bangkok, Thailand and virtual meeting, 2024. Association for Computational Linguistics.
  • [13] Vladimir Manovich, Alexey Davidchack, and Dmitry Tsishkov. CVAT: An open source tool for annotating images and videos. https://github.com/opencv/cvat, 2019.
  • [14] Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971.
  • [15] Sefik Serengil and Alper Ozpinar. A benchmark of facial recognition pipelines and co-usability performances of modules. Journal of Information Technologies, 17(2):95–107, 2024.
  • [16] Jigsaw and Google. Perspective api. https://perspectiveapi.com/, 2017. Accessed: 2024-08-29.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
  • [18] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. In IEEE Signal Processing Letters, volume 23, pages 1499–1503. IEEE, 2016.
  • [19] Qionghao Huang, Changqin Huang, Xizhe Wang, and Fan Jiang. Facial expression recognition with grid-wise attention and visual transformer. Information Sciences, 580:35–54, 2021.
  • [20] FFmpeg Developers. Ffmpeg. https://ffmpeg.org/, 2024. Accessed: 2024-08-30.
  • [21] Xu et al. HMM-based audio feature extraction using the mel-frequency cepstral coefficients (MFCC). In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 912–915. IEEE, 2004.
  • [22] Alec Radford et al. Whisper: OpenAI's speech recognition model. https://github.com/openai/whisper, 2023. Accessed: 2024-08-30.
  • [23] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Haoran Chen, and Yizhong Wang. Revisiting pre-trained models for chinese natural language processing. arXiv preprint arXiv:2004.13922, 2020.
  • [24] Zhan Tong, Yibing Song, Jue Wang, Limin Wang, and Yu Qiao. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:24206–24219, 2022.