Disentangled and Robust Representation Learning
for Bragging Classification in Social Media

Abstract

Research on bragging behavior in social media has attracted growing interest from computational (socio) linguists. However, existing bragging classification datasets suffer from severe class imbalance. Because annotating a balanced dataset is expensive, most methods introduce external knowledge to improve model learning. Nevertheless, such methods inevitably introduce noise and irrelevant information from the external knowledge. To overcome this drawback, we propose a novel bragging classification method with disentangle-based representation augmentation and a domain-aware adversarial strategy. Specifically, the model learns to disentangle and reconstruct representations and to generate augmented features via disentangle-based representation augmentation. Moreover, the domain-aware adversarial strategy constrains the domain of the augmented features to improve their robustness. Experimental results demonstrate that our method achieves state-of-the-art performance compared to other methods.

Index Terms—  Bragging Classification, Disentangled Feature, Adversarial Learning, Social Media

1 Introduction

Bragging classification aims to predict the bragging type of a social media text. As online communication on social media becomes more pervasive and essential in human life, bragging (or self-promotion) classification has become a significant area in computational (socio) linguistics [1, 2]. It has been widely applied in academia and industry, for example in helping linguists examine the context and types of bragging [2], supporting social scientists in studying the relation between bragging and other traits (e.g., gender, age, economic status, occupation) [1, 3], enhancing online users’ self-presentation strategies [4, 5], and many real-world NLP applications in business, economics and education [6, 7].

Although bragging has been widely studied in the context of online communication and forums, all these studies depend on manual analyses of small data sets [8, 4, 9, 10, 3]. To study bragging on social media efficiently, Jin et al. [2] collect the first large-scale dataset for bragging classification in computational linguistics, which contains six bragging types and a non-bragging type. However, the dataset suffers from a heavy data imbalance issue. For example, there are 2,838 examples of the non-bragging type, while each bragging type has only 58 to 166 examples (i.e., 1% to 4%). This severely affects how well the model learns from examples of these bragging types.

To alleviate the data imbalance issue, apart from employing a weighted loss function to balance learning across types [11, 12], many researchers attempt to perform data augmentation by injecting models with external knowledge, such as knowledge graphs [13, 14], pre-trained word embeddings [15, 16], translation [17] and related pragmatics tasks, e.g., complaint severity classification [18]. For bragging classification, Jin et al. [2] inject language models with external knowledge from the NRC word-emotion lexicon, Linguistic Inquiry and Word Count (LIWC) and vectors clustered by Word2Vec. Despite their success, the improvement from external knowledge injection relies on the relevance between bragging classification and the other pragmatics tasks. Moreover, knowledge provided by other pragmatics tasks is fixed and obtained in a model-based manner, which inevitably brings noise.

To get rid of the noise from external knowledge injection, we propose a disentangle-based feature augmentation for disentangled representation and augmented feature learning without any other external knowledge. Specifically, we first disentangle content and bragging-type information from a representation. Next, we generate a reconstructed representation by integrating disentangled information and then constrain consistency between representation and reconstructed representation. To address the data imbalance problem, we fuse disentangled information from different examples to generate augmented features for model training.

Fig. 1: Overview of our model.

Moreover, we propose a domain-aware adversarial strategy to mitigate the domain disorder caused by augmented features. Specifically, we place a discriminator on top of the language model, which is trained to distinguish whether its input is a representation from the encoder or an augmented feature. Meanwhile, jointly with the classification objective, the encoder is trained to fool the discriminator, which pushes the model to generate robust augmented features that are domain-consistent with the representations from the encoder.

In the experiments, we train and evaluate our method on the bragging classification dataset [2]. The results show that our method achieves state-of-the-art performance compared to other strong competitors.

2 Method

This section starts with a base bragging classification model, followed by our proposed methods, i.e., disentangle-based feature augmentation and domain-aware adversarial strategy. Lastly, training details are elaborated.

2.1 Base Bragging Classification Model

The bragging classification task aims to automatically classify the bragging type of a given text from social media. Since pre-trained language models show excellent performance in natural language processing (NLP) tasks, the general paradigm for NLP classification tasks is to fine-tune pre-trained language models (e.g., BERT [19], RoBERTa [20]). In this work, our base bragging classification model is composed of a pre-trained transformer encoder and an MLP with $\mathrm{softmax}$. Given a text ${\bm{T}}_{i}$, the model is to distinguish its bragging type $c$, i.e.,

\begin{align}
{\bm{h}}^{i} &= \operatorname{Trans\text{-}Enc}({\bm{T}}_{i};\theta^{(enc)}) \tag{1}\\
{\bm{p}}^{i} &= \mathrm{softmax}(\operatorname{MLP}_{clf}({\bm{h}}^{i};\theta^{(clf)})) \tag{2}
\end{align}

where $i$ indexes the $i$-th training example; ${\bm{h}}^{i}$ denotes the representation of the text ${\bm{T}}_{i}$, and ${\bm{p}}^{i}$ refers to a probability distribution over all bragging types ${\mathcal{C}}$. Lastly, we train the pre-trained transformer encoder $\theta^{(enc)}$ and the MLP $\theta^{(clf)}$ by maximum likelihood estimation, with the loss function:

\begin{equation}
{\mathcal{L}}^{(clf)} = -\dfrac{1}{|{\mathcal{T}}|}\sum_{{\mathcal{T}}}\log{\bm{p}}^{i}_{[y=c]}, \tag{3}
\end{equation}

where ${\mathcal{T}}$ denotes the collection of training texts; $c$ refers to the ground-truth bragging type, and $c\in{\mathcal{C}}$.
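
As a minimal sketch of Eq. 2 and Eq. 3, the snippet below computes a softmax over classifier logits and the mean negative log-likelihood of the gold labels. The logits, labels and the three-class setup are toy values chosen for illustration only; the actual model would obtain the logits from the encoder and MLP described above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats (Eq. 2)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(batch_logits, labels):
    """Mean negative log-likelihood of the gold labels (Eq. 3)."""
    losses = []
    for logits, c in zip(batch_logits, labels):
        p = softmax(logits)
        losses.append(-math.log(p[c]))
    return sum(losses) / len(losses)


# Toy logits for two texts over three hypothetical classes.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 2]
loss = classification_loss(toy_logits, toy_labels)
```

With uniform logits the per-example loss equals the log of the number of classes, which is a convenient sanity check for an untrained classifier.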

2.2 Disentangle-based Feature Augmentation

To alleviate the data imbalance problem, previous work [2] performs data augmentation by enriching the text embedding with external knowledge from the NRC word-emotion lexicon, Linguistic Inquiry and Word Count (LIWC) and vectors clustered by Word2Vec. However, these methods achieve only a limited gain (i.e., from -0.5% to 1.09% on F1) compared to the base bragging classification model. The reason is that the injected external knowledge is fixed (not trainable) and contains noisy information unrelated to bragging classification, which interferes with model training. Therefore, we introduce a disentangle-based feature augmentation method instead of external knowledge injection.

Since the text is encoded as a representation, it is reasonable to assume that representations contain separable content and specific features from a feature disentanglement perspective [21]. Therefore, we separate the representation ${\bm{h}}^{i}$ into two features: one is closely related to bragging types, denoted as ${\bm{h}}^{i}_{t}$, and the other contains separable content unrelated to bragging, denoted as ${\bm{h}}^{i}_{c}$. To disentangle these two features from the representation ${\bm{h}}^{i}$, we employ two MLPs, one for the content feature and one for the bragging type feature, i.e.,

\begin{align}
{\bm{h}}^{i}_{c} &= \operatorname{MLP}_{cont}({\bm{h}}^{i};\theta^{(cont)}) \tag{4}\\
{\bm{h}}^{i}_{t} &= \operatorname{MLP}_{type}({\bm{h}}^{i};\theta^{(type)}) \tag{5}
\end{align}

where ${\bm{h}}^{i}_{c}$ and ${\bm{h}}^{i}_{t}$ denote the content and bragging type disentangled features, respectively.

Based on the disentangled features, we can obtain new augmented features by integrating content and bragging type features from different samples, i.e.,

\begin{equation}
\hat{{\bm{h}}}^{ji} = {\bm{h}}^{i}_{c} + {\bm{h}}^{j}_{t} \tag{6}
\end{equation}

where $\hat{{\bm{h}}}^{ji}$ denotes an augmented feature and ${\bm{h}}^{j}_{t}$ refers to the bragging type feature of the $j$-th training example. Note that the bragging type of $\hat{{\bm{h}}}^{ji}$ is the same as that of the $j$-th example, because it contains the bragging type feature ${\bm{h}}^{j}_{t}$. The augmented feature $\hat{{\bm{h}}}^{ji}$ is passed into Eq. 2 to derive a probability distribution ${\bm{p}}^{ji}$, and the loss function is defined as:

\begin{equation}
{\mathcal{L}}^{(clf^{+})} = -\dfrac{1}{|{\mathcal{T}}|}\sum_{{\mathcal{T}}}\log{\bm{p}}^{ji}_{[y=c]}. \tag{7}
\end{equation}
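
The augmentation in Eq. 4 to Eq. 6 can be sketched as follows. This is an illustration under simplifying assumptions: each MLP is replaced by a single random linear map, the feature dimension is 4, and the names `linear`, `random_matrix` and `augment` are hypothetical helpers, not part of the original implementation.

```python
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to vector x; stand-in for an MLP."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Random weights in [-1, 1]; a trained model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def augment(h_content_src, h_type_src, w_cont, w_type):
    """Content feature of one example plus type feature of another (Eq. 6)."""
    h_c = linear(w_cont, h_content_src)  # Eq. 4
    h_t = linear(w_type, h_type_src)     # Eq. 5
    return [c + t for c, t in zip(h_c, h_t)]


rng = random.Random(0)
dim = 4
w_cont = random_matrix(dim, dim, rng)
w_type = random_matrix(dim, dim, rng)

h_i = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # toy representation of example i
h_j = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # toy representation of example j
y_i, y_j = 0, 3                                     # toy class indices

h_aug_ji = augment(h_i, h_j, w_cont, w_type)  # content of i, type of j
y_aug_ji = y_j                                # label follows the type feature
h_aug_ij = augment(h_j, h_i, w_cont, w_type)  # content of j, type of i
y_aug_ij = y_i
```

Because the label follows the type feature, pairing a frequent-class example with a rare-class example yields an additional training feature for the rare class.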

Moreover, we employ a Kullback-Leibler (KL) divergence loss to guide representation disentangling and representation reconstruction. Specifically, we disentangle $\hat{{\bm{h}}}^{ji}$ to reconstruct the content feature $\hat{{\bm{h}}}^{i}_{c}$ and the bragging type feature $\hat{{\bm{h}}}^{j}_{t}$ by Eq. 4 and Eq. 5. Then, we optimize the reconstruction loss by minimizing the KL divergence, defined as follows:

\begin{equation}
{\mathcal{L}}^{(kl)} = \mathbb{E}\left(\log{\bm{h}}^{x}_{y}-\log\hat{{\bm{h}}}^{z}_{y}\right), \quad \text{where}~x,z\in{\mathcal{T}},~y\in\{c,t\} \tag{8}
\end{equation}
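
Eq. 8 does not state how the feature vectors are turned into probability distributions. The sketch below assumes, purely for illustration, that both the original and the reconstructed feature are normalized with a softmax and that the expectation is taken under the original feature's distribution, which gives the standard KL divergence. The toy vectors are made up.

```python
import math


def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(h, h_hat):
    """KL(softmax(h) || softmax(h_hat)); the softmax step is an assumption."""
    p = softmax(h)
    q = softmax(h_hat)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


h_c = [0.2, -0.4, 1.0]       # toy disentangled feature
h_c_hat = [0.1, -0.3, 0.8]   # toy reconstruction of the same feature
loss_kl = kl_divergence(h_c, h_c_hat)
```

Under this assumption the loss is zero when the reconstruction matches the original feature and positive otherwise.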

2.3 Domain-aware Adversarial Strategy

Although a model trained on augmented features circumvents the data imbalance problem, it inevitably suffers from a domain discrepancy between encoder representations and augmented features, which has been shown to undermine performance [22]. We thus design an adversarial strategy that constrains the augmented features to follow the same domain as the representations. Formally, a discriminator is built upon the representations and augmented features:

\begin{equation}
p^{(adv)} = \operatorname{Sigmoid}(\operatorname{MLP}_{dis}({\bm{h}};\theta^{(dis)})), \quad \text{where}~{\bm{h}}\in\{{\bm{h}}^{i},{\bm{h}}^{j},\hat{{\bm{h}}}^{ji},\hat{{\bm{h}}}^{ij}\}. \tag{9}
\end{equation}

$p^{(adv)}$ is the probability that the feature is not augmented, i.e., that it is an encoder representation. The discriminator is trained to minimize:

\begin{equation}
{\mathcal{L}}^{(dis)} = -\mathbb{I}_{(rep)}\log p^{(adv)} - \mathbb{I}_{(aug)}\log(1-p^{(adv)}) \tag{10}
\end{equation}

where $\mathbb{I}_{(aug)}$ indicates whether the feature is an augmented feature, and $\mathbb{I}_{(rep)}$ indicates whether the feature is an encoder representation. Conversely, the augmented features are learned to fool the discriminator by minimizing an adversarial loss, i.e.,

\begin{equation}
{\mathcal{L}}^{(adv)} = -\mathbb{I}_{(aug)}\log p^{(adv)}. \tag{11}
\end{equation}
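
The two opposing objectives in Eq. 10 and Eq. 11 can be illustrated for a single feature as below. This is a sketch only: `score` stands for the pre-sigmoid output of the discriminator MLP, and the value 3.0 is a toy number.

```python
import math


def sigmoid(z):
    """Logistic function; adequate for the small toy scores used here."""
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_loss(score, is_augmented):
    """Eq. 10 for one feature; p_adv is the probability of 'not augmented'."""
    p_adv = sigmoid(score)
    if is_augmented:
        return -math.log(1.0 - p_adv)
    return -math.log(p_adv)


def adversarial_loss(score, is_augmented):
    """Eq. 11: only augmented features contribute to this loss."""
    if not is_augmented:
        return 0.0
    return -math.log(sigmoid(score))


# An augmented feature that the discriminator mistakes for an encoder representation.
score = 3.0
loss_dis = discriminator_loss(score, is_augmented=True)
loss_adv = adversarial_loss(score, is_augmented=True)
```

For an augmented feature, a high score makes the adversarial loss small and the discriminator loss large, which is the sense in which the two objectives compete.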

2.4 Training

During training, we alternately train the model and the discriminator. We train our model jointly by minimizing four loss functions:

\begin{equation}
{\mathcal{L}} = \alpha\times{\mathcal{L}}^{(clf)} + \beta\times{\mathcal{L}}^{(clf^{+})} + \lambda\times{\mathcal{L}}^{(kl)} + \gamma\times{\mathcal{L}}^{(adv)} \tag{12}
\end{equation}

Meanwhile, the discriminator is trained with ${\mathcal{L}}^{(dis)}$.
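
As a small worked example of Eq. 12, the function below combines the four model losses with their weights. The default weights are the values reported in Section 3.2; the individual loss values are made-up numbers for illustration, and the function name `total_loss` is hypothetical.

```python
def total_loss(l_clf, l_clf_plus, l_kl, l_adv,
               alpha=1.5, beta=1.5, lam=1.0, gamma=1.5):
    """Weighted sum of the four model losses (Eq. 12)."""
    return alpha * l_clf + beta * l_clf_plus + lam * l_kl + gamma * l_adv


# Toy loss values: 1.5*0.8 + 1.5*1.2 + 1.0*0.1 + 1.5*0.7
combined = total_loss(0.8, 1.2, 0.1, 0.7)
```

The discriminator loss is deliberately absent from this sum, since the discriminator is updated separately with its own objective.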

3 Experiments

3.1 Dataset and Evaluation Metrics

In the experiments, we train and evaluate our proposed approach on the dataset built by [2]. This dataset was collected from Twitter through the Premium Twitter Search API and then manually annotated. Bragging types include “Not Bragging”, “Achievement”, “Action”, “Feeling”, “Trait”, “Possession” and “Affiliation”. We follow the official data split with 3,382/3,314 samples in the training/test sets. Following [2], the evaluation metrics are macro precision, recall and F1 score.
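
For reference, macro-averaged precision, recall and F1 can be computed as in the sketch below. It assumes that a class with no predicted or no gold examples contributes 0 to the corresponding average, which is a common convention but is not stated in the paper; the label lists are toy data.

```python
def macro_scores(y_true, y_pred, num_classes):
    """Return (macro precision, macro recall, macro F1) over all classes."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division rule
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return (sum(precisions) / num_classes,
            sum(recalls) / num_classes,
            sum(f1s) / num_classes)


# Toy data: 7 classes, class 0 dominates, and the classifier always predicts class 0.
y_true = [0] * 14 + [1, 2, 3, 4, 5, 6]
y_pred = [0] * 20
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, num_classes=7)
```

Always predicting the majority class yields a macro recall of 1/7 (about 14.29%) with seven classes, which is why macro metrics are informative on heavily imbalanced data.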

3.2 Experimental Setting

The pre-trained transformer encoder we use is BERTweet [23]. We use the AdamW optimizer [24, 25] with a learning rate of $3\times 10^{-6}$ for the model and $3\times 10^{-4}$ for the discriminator. The maximum training epoch and batch size are set to 26,500 and 35, respectively. The maximum sequence length, weight decay and gradient clipping are set to 128, 0.01 and 1.0, respectively. The dropout rates of the model and the discriminator are set to 0.2 and 0.5, respectively. In Eq. 12, $\alpha$, $\beta$, $\lambda$ and $\gamma$ are set to 1.5, 1.5, 1.0 and 1.5, respectively. Experiments are conducted on an NVIDIA RTX2070 GPU, and training takes around 5 hours.

3.3 Main Results

| Method | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|
| Majority Class [2] | 13.26 | 14.29 | 13.76 |
| LR-BOW [2] | 18.52 | 20.02 | 18.59 |
| BiGRU-Att [2] | 18.32 | 26.16 | 19.19 |
| BERT [19] | 24.16 | 39.66 | 26.85 |
| RoBERTa [20] | 28.99 | 45.90 | 32.82 |
| BERTweet [23] | 30.82 | 47.25 | 34.86 |
| BERTweet-NRC [2] | 30.95 | 47.98 | 34.36 |
| BERTweet-LIWC [2] | 32.06 | 46.68 | 34.83 |
| BERTweet-Clusters [2] | 32.51 | 46.97 | 35.59 |
| Ours | 41.18 | 40.08 | 39.86 |

Table 1: Bragging multi-classification results.

| Method | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|
| Ours | 41.18 | 40.08 | 39.86 |
| Ours w/o DAS | 45.07 | 35.02 | 38.84 (-1.02) |
| Ours w/o DAS, DFA | 30.82 | 47.25 | 34.86 (-3.98) |

Table 2: Ablation study of our approach. “w/o DAS” denotes removing the domain-aware adversarial strategy from our model, and “w/o DAS, DFA” indicates removing both disentangle-based feature augmentation and the domain-aware adversarial strategy.

Comparison results are shown in Table 1. From the table, it is clear that 1) using a pre-trained language model as the base model significantly improves performance, mainly because pre-trained language models have stronger text representation capability; 2) our approach outperforms the methods proposed by [2] on macro precision and F1 score and achieves state-of-the-art F1, demonstrating that the proposed disentangled and robust representation learning is effective. These two observations suggest that learning a better representation from the data itself is more important than external knowledge injection.

| Method | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|
| BERT [19] | 64.24 | 65.91 | 64.58 |
| RoBERTa [20] | 66.53 | 68.43 | 67.34 |
| BERTweet [23] | 70.43 | 72.62 | 71.44 |
| BERTweet-NRC [2] | 72.89 | 70.95 | 71.80 |
| BERTweet-LIWC [2] | 72.65 | 72.21 | 72.42 |
| BERTweet-Clusters [2] | 71.26 | 72.53 | 71.60 |
| Ours | 78.42 | 69.67 | 73.07 |

Table 3: Bragging binary-classification results.

| Method | Macro Precision | Macro Recall | Macro F1 Score |
|---|---|---|---|
| BERTweet [23] | 30.82 | 47.25 | 34.86 |
| Only ${\bm{h}}^{i}_{t}$ | 43.94 | 31.00 | 33.96 |
| Ours | 41.18 | 40.08 | 39.86 |

Table 4: Impact of the disentangled type feature in our method. “Only ${\bm{h}}^{i}_{t}$” denotes classification using the bragging type disentangled features ${\bm{h}}^{i}_{t}$ extracted by our model.

Fig. 2: t-SNE visualization: ours (left) and BERTweet (right).

Fig. 3: Classification loss (left) and discriminator loss (right) in training phase.

3.4 Ablation Study

As shown in Table 2, we conduct an ablation study on our method. First, we remove the domain-aware adversarial strategy, and the F1 score drops by 1.02 points. This demonstrates that constraining the domain of augmented features with the adversarial strategy helps the model learn robust representations. Moreover, we verify the effectiveness of the augmented features by also removing disentangle-based feature augmentation. The F1 score drops by a further 3.98 points, which demonstrates that disentangled representation learning extracts more useful information from the text.

3.5 Binary Classification for Bragging

To comprehensively evaluate our method, we also evaluate our model on the bragging binary-classification task (i.e., bragging vs. non-bragging). As shown in Table 3, our model still achieves a higher precision and F1 score than the other methods. Since data imbalance is greatly alleviated in the binary-classification setting, the improvement of our method is not as large as in the multi-classification setting.

3.6 Impact of Disentangled Type Feature

To delve into the effectiveness of the disentangled features, we employ an MLP to classify the disentangled feature ${\bm{h}}^{i}_{t}$ directly; results are shown in Table 4. In terms of precision, Only ${\bm{h}}^{i}_{t}$ outperforms BERTweet, which shows the effectiveness of feature disentanglement. Moreover, Only ${\bm{h}}^{i}_{t}$ underperforms our full method on F1 score, which indicates that content feature disentanglement and domain consistency benefit bragging classification.

3.7 Visualization Analysis

We apply t-SNE visualization (Fig. 2) to the features obtained by Eq. 1 to illustrate how our approach works. The points produced by our method form more distinct clusters than those of BERTweet, which indicates that our method learns a type-relevant representation.

3.8 Deep Dive into Adversarial Learning

To investigate the impact of the domain-aware adversarial strategy, we show the classification loss and the discriminator loss in Fig. 3. The discriminator loss quickly drops and then slowly goes up, indicating that the discriminator is effective at first and is later fooled by domain-consistent features. Meanwhile, the classification loss decreases rapidly and remains very low in the following steps, showing that the domain-aware adversarial strategy is well integrated into the whole training process.

4 Conclusion

In this study, to address the challenge of data imbalance, we propose a novel augmentation method for learning disentangled and robust representations without any external knowledge. The method includes disentangle-based feature augmentation and a domain-aware adversarial strategy. Experimental results show that our method achieves state-of-the-art performance. Lastly, we conduct extensive analyses to verify the effectiveness of our method.

References

  • [1] Jun Wang, Kelly Yixin Cui, and Bei Yu, “Self promotion in us congressional tweets,” in NAACL, 2021.
  • [2] Mali Jin, Daniel Preotiuc-Pietro, A. Seza Doğruöz, and Nikolaos Aletras, “Automatic identification and classification of bragging in social media,” in ACL, 2022, pp. 3945–3959.
  • [3] Xi Chen, Gang Li, Yundi Hu, and Yujie Li, “How anonymity influence self-disclosure tendency on sina weibo: An empirical study,” The Anthropologist, vol. 26, pp. 217 – 226, 2016.
  • [4] Natalya N. Bazarova, Jessie G. Taft, Yoon Hyung Choi, and Dan Cosley, “Managing impressions and relationships on facebook,” Journal of Language and Social Psychology, vol. 32, pp. 121 – 141, 2013.
  • [5] Carolien Van Damme, Eliane Deschrijver, Eline Van Geert, and Vera Hoorens, “When praising yourself insults others: Self-superiority claims provoke aggression,” Personality and Social Psychology Bulletin, vol. 43, pp. 1008 – 1019, 2017.
  • [6] Emily Prinsloo, Irene Scopelliti, George Loewenstein, and Joachim Vosgerau, “Responses to bragging and self-deprecation. theory and empirical evidence,” Decision Making, 2021.
  • [7] Gregory M Kerr, Clifford Lewis, and Lois Burgess, “Bragging rights and destination marketing: A tourism bragging rights model,” Journal of Hospitality and Tourism Management, vol. 19, pp. 1–8, 2012.
  • [8] Jhih-Syuan Lin, Yen-I Lee, Yan Jin, and Bob Gilbreath, “Personality traits, motivations, and emotional consequences of social media usage,” Cyberpsychology, Behavior, and Social Networking, vol. 20, no. 10, pp. 615–623, 2017, PMID: 29039699.
  • [9] Sofia Rüdiger and Daria Dayter, “Manbragging online: Self-praise on pick-up artists’ forums,” Journal of Pragmatics, vol. 161, pp. 16–27, 2020.
  • [10] Els Tobback, “Telling the world how skilful you are: Self-praise strategies on linkedin,” Discourse & Communication, vol. 13, pp. 647 – 668, 2019.
  • [11] Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li, “Dice loss for data-imbalanced NLP tasks,” in ACL, 2020, pp. 465–476.
  • [12] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in ICCV, 2017, pp. 2999–3007.
  • [13] Yucheng Zhou, Xiubo Geng, Tao Shen, Jian Pei, Wenqiang Zhang, and Daxin Jiang, “Modeling event-pair relations in external knowledge graphs for script reasoning,” in ACL/IJCNLP, 2021, vol. ACL/IJCNLP 2021 of Findings of ACL, pp. 4586–4596.
  • [14] Donghan Yu, Chenguang Zhu, Yiming Yang, and Michael Zeng, “Jaket: Joint pre-training of knowledge graph and language understanding,” in AAAI, 2022.
  • [15] Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” in ICLR, 2013.
  • [16] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.
  • [17] Yucheng Zhou, Xiubo Geng, Tao Shen, Wenqiang Zhang, and Daxin Jiang, “Improving zero-shot cross-lingual transfer for multilingual question answering over knowledge graph,” in NAACL-HLT, 2021, pp. 5822–5834.
  • [18] João Filgueiras, Luís Barbosa, Gil Rocha, Henrique Lopes Cardoso, Luís Paulo Reis, João Pedro Machado, and Ana Maria Oliveira, “Complaint analysis and classification for economic and food safety,” Proceedings of the Second Workshop on Economics and Natural Language Processing, 2019.
  • [19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019, pp. 4171–4186.
  • [20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” ArXiv, vol. abs/1907.11692, 2019.
  • [21] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 1798–1828, 2013.
  • [22] Kibeom Hong, Seogkyu Jeon, Huan Yang, Jianlong Fu, and Hyeran Byun, “Domain-aware universal style transfer,” in ICCV, 2021, pp. 14589–14597.
  • [23] Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen, “BERTweet: A pre-trained language model for English tweets,” in EMNLP, Online, Oct. 2020, pp. 9–14.
  • [24] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [25] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” in ICLR, 2019.