Do Large Language Models Possess Sensitivity to Sentiment?
Abstract
Large Language Models (LLMs) have recently demonstrated extraordinary capabilities in language understanding. However, comprehensively assessing the sentiment capabilities of LLMs remains a challenge. This paper investigates the ability of LLMs to detect and react to sentiment in the text modality. As the integration of LLMs into diverse applications grows, it becomes increasingly important to understand their sensitivity to emotional tone, which can influence user experience and the efficacy of sentiment-driven tasks. We conduct a series of experiments to evaluate how well several prominent LLMs identify and respond appropriately to positive, negative, and neutral sentiment. The models' outputs are analyzed across various sentiment benchmarks, and their responses are compared with human evaluations. Our findings indicate that although LLMs show a basic sensitivity to sentiment, their accuracy and consistency vary substantially, underscoring the need for further improvements to their training processes so that they better capture subtle emotional cues. For example, in some cases the models misclassify strongly positive sentiment as neutral, or fail to recognize sarcasm or irony in the text. Such misclassifications highlight the complexity of sentiment analysis and the areas where the models need refinement. Moreover, different LLMs can perform differently on the same data, depending on their architecture and training datasets. This variance calls for a more in-depth study of the factors that drive these performance differences and of how they can be optimized.
Do Large Language Models Possess Sensitivity to Sentiment?
Yang Liu, Xichou Zhu, Zhou Shen, Yi Liu, Min Li, Yujun Chen, Benzi John, Zhenzhen Ma, Tao Hu, Zhi Li, Zhiyang Xu, Wei Luo, Junhui Wang Machine Learning & AI Team, Privacy and Data Protection Office ByteDance, Beijing, China {liuyang.173, junhui.wang}@bytedance.com
1 Introduction
Recently, large language models (LLMs) have made groundbreaking strides that have dramatically reshaped the artificial intelligence landscape (Brown et al., 2020; Chowdhery et al., 2023). These models have become a cornerstone in natural language processing (NLP), enabling advances in various tasks, from text generation (Mo et al., 2024; Li et al., 2024; Abburi et al., 2023) to question answering (Zhuang et al., 2024; Saito et al., 2024). Despite their widespread adoption, one crucial area that remains insufficiently explored is their capability to accurately perceive and respond to sentiment. Sentiment analysis—the process of identifying the emotional tone within text—is vital for applications such as customer feedback analysis, social media monitoring, and conversational agents (Liao et al., 2023; Zhang et al., 2024b). This raises an important question:
Are these advanced LLMs, trained on extensive datasets, truly capable of understanding sentiment, or are they simply replicating the sentiment patterns they have learned from their training data?
This paper aims to thoroughly evaluate the performance of large language models (LLMs) in sentiment analysis, specifically assessing their ability to detect and generate responses that correspond to the sentiment present in the input text. We explore models with varying architectures and sizes to identify both their strengths and the areas needing improvement in sentiment-related tasks. Our evaluation follows a structured workflow, as illustrated in Figure 2. We begin by selecting a diverse set of prompts, which are then processed by various LLMs to produce soft outputs. These outputs undergo a similarity evaluation using word-vector similarity techniques to assess their alignment with the intended sentiment. This systematic approach enables a comprehensive analysis of the models' performance and makes it possible to pinpoint where LLMs excel and where they require improvement, ultimately contributing to more targeted advancements in sentiment analysis capabilities. Our research not only contributes to the ongoing discourse surrounding LLM evaluation but also highlights the enhancements required to bolster the sentiment sensitivity of these models.
Our contributions are summarized as follows:
• Incorporating sentiment analysis, we develop and introduce the ‘Sentiment Knowledge Workflow’ for LLMs. This framework is pivotal in defining and advancing the self-awareness capacities of LLMs.
• We evaluate the sentiment sensitivity of a diverse array of LLMs across multiple public datasets. Our comprehensive analysis reveals that while these models exhibit a basic ability to detect sentiment, there are significant discrepancies in their accuracy and consistency. These findings underscore the need for further refinements in their training processes to improve their capacity to recognize and respond to subtle emotional cues more effectively.

2 Related Works
LLMs Development.
The development of Large Language Models (LLMs) (Chen et al., 2024; Zhang et al., 2025) represents a significant milestone in the field of artificial intelligence, particularly in natural language processing (NLP) (Yuan et al., 2023; Yang et al., 2024). Originating from earlier efforts in neural networks and deep learning, LLMs have evolved rapidly, driven by advancements in computational power, algorithmic innovations, and the availability of vast amounts of textual data. These models, exemplified by architectures like GPT (Floridi and Chiriatti, 2020; Achiam et al., 2023) and BERT (Devlin et al., 2019), are trained on diverse and extensive datasets, enabling them to generate human-like text, perform complex language tasks, and even demonstrate an understanding of context and nuance. The scaling of model parameters (Zhang et al., 2024a) and data has been a crucial factor in enhancing the capabilities of LLMs, allowing them to achieve state-of-the-art performance across a wide range of applications, from translation (Lu et al., 2024) and summarization (Tang et al., 2024) to more specialized tasks like code generation (Ugare et al., 2024; Zheng et al., 2024) and creative writing (Gómez-Rodríguez and Williams, 2023). As research continues, LLMs are poised to further revolutionize how we interact with and understand language in both digital and real-world environments.
LLMs Sentiment Capability.
Large Language Models (LLMs) have increasingly been designed to understand and emulate human emotions (Zou et al., 2024), enhancing their role in more nuanced and empathetic communication (Sorin et al., 2023; Hasan et al., 2024). These models are trained on vast datasets that include emotionally rich language, enabling them to recognize and generate text that reflects various emotional tones. By interpreting subtle cues (Shukla et al., 2023) in language, such as word choice, tone, and context, LLMs can respond in ways that align with the emotional state of the user. This emotional capability is particularly valuable in applications like virtual assistants (Vu et al., 2024), mental health support (Lai et al., 2023), and customer service (Pandya and Holia, 2023), where understanding and responding to emotions is crucial for effective interaction. However, the development of these capabilities also raises important ethical considerations, as LLMs must navigate complex emotional landscapes without reinforcing biases or generating inappropriate responses. As this technology continues to advance, the emotional intelligence of LLMs (Sabour et al., 2024; Wang et al., 2023) is expected to become increasingly sophisticated, allowing for more personalized and empathetic interactions between humans and machines.
3 Background
3.1 Self-awareness.
Self-awareness (Wang et al., 2024; Yin et al., 2023; Liu et al., 2024) refers to the ability to recognize and comprehend one’s own existence, emotions, thoughts, and behaviors. It encompasses an understanding of one’s identity, abilities, strengths, and weaknesses, as well as an awareness of one’s role and influence within various social and environmental contexts. Self-awareness (Camacho et al., 2012; Hall, 2004; Ortiz and Patton, 2012) can be further categorized into several key aspects, including:
1. Personal Identity Awareness. Knowing who you are, including your name, age, gender, occupation, interests, and hobbies.
2. Sentiment Awareness. The ability to identify and understand your own emotional states, such as happiness, sadness, anger, and fear.
3. Cognitive Self-awareness. Being aware of and reflecting on your own thoughts and beliefs, including how you make decisions, solve problems, and perceive the world.
4. Social Self-awareness. Understanding your role and status in society, as well as being aware of how others perceive you and what they expect from you.
5. Physical Self-awareness. Recognizing your physical state and sensations, including your appearance, health, and bodily movements.
Self-awareness is a unique characteristic of humans that enables individuals to reflect on their actions, set goals, adjust behavior to adapt to changing environments, and interact effectively in complex social settings. Self-awareness can be developed through self-reflection, psychological counseling, meditation, and communication with others. In this paper, we focus on the sentiment aspect.
3.2 Evaluation of LLMs.
We perform a thorough evaluation of the LLMs using a detailed question-answer workflow, as illustrated in Figure 2. The process begins with the creation of precise prompts, leading to the generation of initial outputs by the LLMs. These outputs are then analyzed for similarity. The workflow further includes evaluating multiple word vector similarities and categorizing responses based on emotional tones such as stunning, sentimental, positive, inspiring, uplifting, heartwarming, and hopeful. The evaluation concludes with an in-depth assessment of the overall performance and effectiveness of the LLMs.
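To make the word-vector similarity and tone-categorization steps concrete, below is a minimal sketch, assuming a general-purpose sentence-embedding model; the `all-MiniLM-L6-v2` encoder and the example response are our illustrative choices, not necessarily the exact embeddings used in our pipeline.

```python
# Sketch: map an LLM response to its nearest emotional tone via embedding
# similarity. The encoder and the example input are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

TONES = ["stunning", "sentimental", "positive", "inspiring",
         "uplifting", "heartwarming", "hopeful"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
tone_vecs = encoder.encode(TONES, normalize_embeddings=True)

def nearest_tone(response: str) -> tuple[str, float]:
    """Return the tone whose embedding is most similar to the response."""
    vec = encoder.encode([response], normalize_embeddings=True)[0]
    sims = tone_vecs @ vec  # cosine similarity, since vectors are normalized
    best = int(np.argmax(sims))
    return TONES[best], float(sims[best])

print(nearest_tone("What a moving, beautifully shot film."))
```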
Various metrics based on multiple-choice questions have been utilized in prominent benchmarks such as CommonsenseQA (Talmor et al., 2018), HellaSwag (Zellers et al., 2019), and MMLU (Hendrycks et al., 2020). These benchmarks have laid the groundwork for evaluating the accuracy of knowledge within language models by focusing on the correct responses to these questions. Building on the methodologies employed in these foundational studies (Pan and Zeng, 2023), our approach extends their insights by leveraging questions from our specific target tasks. These questions, which are designed to be seamlessly integrated into our evaluation framework, allow us to assess not only the accuracy of the responses but also the depth of understanding and reasoning capabilities exhibited by the models. By incorporating these metrics, we aim to provide a comprehensive evaluation that mirrors the rigor of the original benchmarks while adapting them to the nuanced requirements of our tasks.
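As a rough illustration of this multiple-choice scoring scheme, the sketch below computes accuracy over lettered options; `ask_llm`, the question fields, and the single-letter answer convention are hypothetical stand-ins rather than the benchmarks' exact evaluation harnesses.

```python
# Sketch: accuracy over multiple-choice questions, in the spirit of
# CommonsenseQA/HellaSwag/MMLU. `ask_llm` is a hypothetical model wrapper.
def multiple_choice_accuracy(questions: list[dict], ask_llm) -> float:
    """Each question dict holds 'stem', 'choices', and a gold letter 'answer'."""
    correct = 0
    for q in questions:
        options = "\n".join(
            f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(q["choices"])
        )
        reply = ask_llm(f"{q['stem']}\n{options}\nAnswer with a single letter.")
        # Count a hit when the reply starts with the gold option letter.
        if reply.strip().upper().startswith(q["answer"].upper()):
            correct += 1
    return correct / len(questions)
```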

4 Experimental Settings
4.1 Dataset.
In this study, we utilized three publicly available datasets: Sentiment140, MyPersonality, and IMDB Reviews. The specific details of each dataset are outlined below.
• Sentiment140 (https://huggingface.co/datasets/stanfordnlp/sentiment140). A dataset developed by Stanford University for sentiment analysis research, consisting of 1.6 million tweets collected from Twitter, each labeled as positive, negative, or neutral. What sets Sentiment140 apart is its labeling approach: emoticons in the tweets (such as :-) or :-( ) serve as sentiment indicators and are then removed to create a more authentic representation of social media content (see the sketch after this list). The dataset is particularly valuable for handling the challenges of noisy text, including abbreviations, spelling errors, and the informal language typical of Twitter, and it is widely used for training and evaluating sentiment analysis models, especially those designed for social media data.
• MyPersonality (https://www.psychometrics.cam.ac.uk/productsservices/mypersonality). A well-known dataset in the fields of psychology and data science, originally created by researchers at the University of Cambridge in 2007. It was generated through an online personality test application hosted on the Facebook platform, where users could take various personality assessments and voluntarily share their Facebook data. This data includes profile information, social network activities, and the results of psychological assessments such as the "Big Five Personality Traits" (OCEAN model). MyPersonality offers a unique opportunity for researchers to study the relationship between social media behavior and personality traits. Although the dataset has been controversial due to privacy concerns and its data collection methods, it remains a valuable resource for research in psychology and social network analysis.
• IMDB Reviews (https://huggingface.co/datasets/stanfordnlp/imdb). A widely used sentiment analysis dataset composed of movie reviews from the Internet Movie Database (IMDb). The dataset includes 50,000 reviews, each labeled as either positive or negative, and is used for text classification tasks, particularly sentiment analysis. Unlike shorter text datasets, IMDB Reviews features longer reviews rich with semantic information such as opinions, emotions, and arguments, making it an ideal dataset for evaluating and training models that must handle more complex semantics and contextual information. Given IMDb's global reach, the dataset encompasses a wide range of expressions and cultural backgrounds, making it valuable for testing the generalization capabilities of sentiment analysis models.
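As noted in the Sentiment140 entry above, labels come from emoticons that are subsequently stripped from the text. A minimal sketch of this distant-labeling idea follows; the emoticon patterns are illustrative and narrower than the original dataset's full set.

```python
# Sketch of Sentiment140-style distant labeling: emoticons determine the
# label, then are removed so models cannot rely on them directly.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")

def distant_label(tweet: str):
    label = None
    if any(e in tweet for e in POSITIVE_EMOTICONS):
        label = "positive"
    elif any(e in tweet for e in NEGATIVE_EMOTICONS):
        label = "negative"
    cleaned = tweet
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        cleaned = cleaned.replace(e, "")
    return label, cleaned.strip()

print(distant_label("just aced my exam :-)"))  # ('positive', 'just aced my exam')
```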
4.2 Tasks.
The focus of this paper is the task of multi-label classification, where each input query can be linked to multiple labels simultaneously rather than being limited to a single label. Formally, given a query $q$, the model outputs a set of labels $\mathbf{y} = (y_1, \dots, y_K)$, with each $y_i \in \{0, 1\}$ representing the presence (1) or absence (0) of the corresponding label. Large language models (LLMs) address this task by leveraging their advanced ability to comprehend complex textual contexts, enabling them to predict multiple relevant labels by identifying intricate patterns embedded within the data.
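As one possible reading of this setup, the sketch below converts a model's free-text response into the binary label vector $\mathbf{y}$; the label space and the naive substring-matching parser are our illustrative assumptions, not the exact parsing logic used in our experiments.

```python
# Sketch: turn a free-text LLM response into a binary multi-label vector.
# The label space and the naive substring parser are illustrative choices.
LABELS = ["positive", "negative", "neutral"]

def to_label_vector(llm_response: str) -> list[int]:
    """y_i = 1 if the i-th label is mentioned in the response, else 0."""
    text = llm_response.lower()
    return [1 if label in text else 0 for label in LABELS]

# "positive, this movie looks wonderful!" -> [1, 0, 0]
print(to_label_vector("positive, this movie looks wonderful!"))
```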
Models | Accuracy | Precision | Recall | F1-score | ROC-AUC |
GPT-3.5-turbo | 0.62 | 0.85 | 0.68 | 0.78 | 0.89 |
GPT-4 | 0.64 | 0.67 | 0.68 | 0.73 | 0.81 |
GPT-4o | 0.72 | 0.72 | 0.59 | 0.65 | 0.78 |
llama_7b_v2 | 0.50 | 0.66 | 0.61 | 0.59 | 0.62 |
llama_8b_v3 | 0.72 | 0.87 | 0.71 | 0.76 | 0.87 |
Mistral_7b | 0.71 | 0.88 | 0.69 | 0.76 | 0.88 |
Doubao | 0.75 | 0.75 | 0.56 | 0.66 | 0.75 |
Doubao-pro | 0.77 | 0.76 | 0.56 | 0.69 | 0.82 |
Models | Accuracy | Precision | Recall | F1-score | ROC-AUC |
GPT-3.5-turbo | 0.52 | 0.55 | 0.80 | 0.61 | 0.77 |
GPT-4 | 0.82 | 0.53 | 0.76 | 0.65 | 0.79 |
GPT-4o | 0.80 | 0.69 | 0.75 | 0.65 | 0.81 |
llama_7b_v2 | 0.38 | 0.39 | 0.52 | 0.38 | 0.39 |
llama_8b_v3 | 0.57 | 0.48 | 0.86 | 0.61 | 0.77 |
Mistral_7b | 0.59 | 0.45 | 0.83 | 0.59 | 0.80 |
Doubao | 0.72 | 0.66 | 0.79 | 0.64 | 0.72 |
Doubao-pro | 0.75 | 0.70 | 0.75 | 0.66 | 0.75 |
Input | Models | Accuracy | Precision | Recall | F1-score | ROC-AUC |
rare input | GPT-3.5-turbo | 0.58 | 0.45 | 0.59 | 0.53 | 0.57 |
rare input | GPT-4 | 0.65 | 0.41 | 0.56 | 0.49 | 0.60 |
rare input | GPT-4o | 0.62 | 0.47 | 0.61 | 0.49 | 0.63 |
rare input | llama_7b_v2 | 0.50 | 0.24 | 0.50 | 0.31 | 0.50 |
rare input | llama_8b_v3 | 0.67 | 0.49 | 0.66 | 0.42 | 0.65 |
rare input | Mistral_7b | 0.60 | 0.44 | 0.60 | 0.42 | 0.60 |
rare input | Doubao | 0.71 | 0.55 | 0.70 | 0.54 | 0.70 |
rare input | Doubao-pro | 0.72 | 0.55 | 0.72 | 0.54 | 0.70 |
processed input | GPT-3.5-turbo | 0.63 | 0.42 | 0.64 | 0.54 | 0.52 |
processed input | GPT-4 | 0.67 | 0.43 | 0.58 | 0.45 | 0.59 |
processed input | GPT-4o | 0.60 | 0.50 | 0.61 | 0.52 | 0.68 |
processed input | llama_7b_v2 | 0.51 | 0.26 | 0.50 | 0.34 | 0.50 |
processed input | llama_8b_v3 | 0.53 | 0.27 | 0.50 | 0.38 | 0.50 |
processed input | Mistral_7b | 0.61 | 0.46 | 0.60 | 0.46 | 0.61 |
processed input | Doubao | 0.72 | 0.55 | 0.70 | 0.55 | 0.68 |
processed input | Doubao-pro | 0.74 | 0.55 | 0.71 | 0.57 | 0.70 |
4.3 Evaluation.
Metrics for multi-label classification are used to evaluate the performance of the models. Key metrics include accuracy, precision, recall, and F1-score, each providing insight into a different aspect of performance. Accuracy measures the proportion of correctly predicted samples out of the total samples. Precision and recall assess the model's performance per class: precision indicates how many of the predicted positives are actually correct, and recall shows how many actual positives were correctly identified. The F1-score, the harmonic mean of precision and recall, offers a balanced evaluation of the model's effectiveness. We also report ROC-AUC, which summarizes how well the model's scores separate the classes across decision thresholds.
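These metrics can be computed with scikit-learn, as in the sketch below; the toy labels, binary setting, and default averaging are illustrative and may differ from our exact multi-label configuration.

```python
# Sketch: computing the reported metrics with scikit-learn on toy data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]               # gold labels
y_pred = [1, 0, 0, 1, 0, 1]               # hard predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]   # scores for the positive class

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
```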
4.4 Baselines.
We utilize well-established LLMs, such as LLaMA, as baseline models in our experiments. Unless specified otherwise, all baseline models are run with the default parameters provided by their respective original APIs.
• ChatGPT (https://openai.com/). An advanced conversational AI developed by OpenAI, designed to generate human-like text from given prompts. It comes in several versions, including GPT-3.5-turbo, GPT-4, and GPT-4o. GPT-3.5-turbo offers efficient performance and responsiveness, making it well suited to a wide range of applications. GPT-4, a more powerful and sophisticated model, provides enhanced language understanding and generation capabilities, ideal for complex tasks. GPT-4o further optimizes performance, delivering faster responses while maintaining the high quality and depth of GPT-4's output.
• LLaMA (https://huggingface.co/docs/transformers/main/model_doc/llama3) (Large Language Model Meta AI). A family of advanced language models developed by Meta, designed to generate and understand human-like text and known for efficiency and effectiveness across a variety of natural language processing tasks. We also evaluate Mistral (https://huggingface.co/mistralai/Mistral-7B-v0.1), a comparable open-weight model from Mistral AI that focuses on optimizing performance and resource usage, offering high-quality outputs while being more computationally efficient.
• Doubao (https://huggingface.co/doubao-llm). An advanced language model designed to excel in a wide range of natural language processing tasks, from text generation and sentiment analysis to machine translation. Built with deep learning techniques and trained on extensive data, Doubao captures intricate linguistic patterns and contextual meanings, enabling it to generate human-like text across various contexts. Its performance and versatility make it a valuable tool for industries such as customer service, content creation, academic research, and data-driven decision-making.
5 Results and Insights
5.1 Results
We present a comprehensive overview of our evaluation results across the three datasets in Table 1, Table 2, and Table 3. These tables cover a range of models and configurations, offering detailed insights into their performance. Additionally, we provide a prompt-sensitivity analysis in Table 4, where we apply various prompt templates to the IMDB dataset; the specific prompt templates are given in Table 5. In the following sections, we discuss the key findings and insights, addressing each observation individually.
5.2 Interesting Insights
Insight 1. LLMs exhibit a distinct sensitivity to sentiment that holds consistently across multiple models. (Refer to Table 1, 2 and 3)
Insight 2. The ability of LLMs to detect sentiment is not easily obscured by prompt processing, particularly under neutral prompts. (Refer to Table 4)
Insight 3. Different versions of the same LLM can exhibit varying behaviors and performance. (Refer to Table 1, 2 and 3)
Models | Prompt ID | Precision | Recall | F1-score |
Doubao | PROMPT_NEUTRAL | 0.55 | 0.70 | 0.55 |
Doubao | PROMPT_POSITIVE | 0.25 | 0.50 | 0.33 |
Doubao | PROMPT_NEGATIVE | 0.26 | 0.50 | 0.34 |
Doubao-pro | PROMPT_NEUTRAL | 0.55 | 0.71 | 0.57 |
Doubao-pro | PROMPT_POSITIVE | 0.24 | 0.50 | 0.34 |
Doubao-pro | PROMPT_NEGATIVE | 0.25 | 0.50 | 0.32 |
6 Discussion
The evaluation results across different datasets and models shed light on the varying capabilities of LLMs, particularly when it comes to sentiment detection.
LLMs exhibit a sensitivity to sentiment that appears consistently across multiple models, as shown in Table 1, Table 2, and Table 3. For example, the Doubao-pro model performs well on sentiment tasks, with strong precision and recall across multiple datasets. This suggests that the underlying architecture of some LLMs may be better suited to capturing emotional subtleties in text, despite the varying nature of the input data. This sensitivity indicates that certain LLMs can be fine-tuned or selected specifically for sentiment-related tasks, even within a competitive landscape of LLM options.
The ability of LLMs to detect sentiment is not easily obscured by prompt processing, particularly when dealing with neutral prompts, as evident from Table 4. Doubao-pro maintains a relatively high performance with neutral prompts, despite changes in input structure. This suggests that the model’s internal mechanisms for identifying sentiment are robust enough to operate even when the prompt is neutral, implying that sentiment detection in LLMs may be deeply ingrained in the model’s learned representations, rather than being highly sensitive to prompt formulations. This highlights the model’s flexibility and adaptability in different real-world scenarios where the exact phrasing of the input may vary.
The comparison between different versions of the same LLM, such as the Doubao and Doubao-pro models, reveals that even slight modifications to the architecture or training procedures can lead to notable differences in performance, as shown in Table 1, Table 2, and Table 3. Doubao-pro consistently outperforms its predecessor across multiple datasets and metrics, showing that model refinement plays a crucial role in enhancing the ability of LLMs to perform on sentiment tasks. This variability underscores the importance of continuous model development and experimentation to achieve optimal results in practical applications. These detailed insights together provide a deeper understanding of how LLMs behave under various conditions and prompt configurations, suggesting potential strategies for optimizing LLMs for sentiment analysis tasks in diverse applications.
References
- Abburi et al. (2023) Harika Abburi, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen, and Sanmitra Bhattacharya. 2023. Generative ai text classification using ensemble llm approaches. arXiv preprint arXiv:2309.07755.
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Camacho et al. (2012) Mar Camacho, Janaina Minelli, and Gabriela Grosseck. 2012. Self and identity: raising undergraduate students’ awareness on their digital footprints. Procedia-Social and Behavioral Sciences, 46:3176–3181.
- Chen et al. (2024) Shengyuan Chen, Qinggang Zhang, Junnan Dong, Wen Hua, Qing Li, and Xiao Huang. 2024. Entity alignment with noisy annotations from large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Floridi and Chiriatti (2020) Luciano Floridi and Massimo Chiriatti. 2020. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694.
- Gómez-Rodríguez and Williams (2023) Carlos Gómez-Rodríguez and Paul Williams. 2023. A confederacy of models: A comprehensive evaluation of llms on creative writing. arXiv preprint arXiv:2310.08433.
- Hall (2004) Douglas T Hall. 2004. Self-awareness, identity, and leader development. In Leader development for transforming organizations, pages 173–196. Psychology Press.
- Hasan et al. (2024) Md Rakibul Hasan, Md Zakir Hossain, Tom Gedeon, and Shafin Rahman. 2024. Llm-gem: Large language model-guided prediction of people’s empathy levels towards newspaper article. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2215–2231.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Lai et al. (2023) Tin Lai, Yukun Shi, Zicong Du, Jiajie Wu, Ken Fu, Yichao Dou, and Ziqi Wang. 2023. Psy-llm: Scaling up global mental health psychological services with ai-based large language models. arXiv preprint arXiv:2307.11991.
- Li et al. (2024) Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, and Kaiqi Huang. 2024. Dtllm-vlt: Diverse text generation for visual language tracking based on llm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7283–7292.
- Liao et al. (2023) Lizi Liao, Grace Hui Yang, and Chirag Shah. 2023. Proactive conversational agents in the post-chatgpt world. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3452–3455.
- Liu et al. (2024) Zhendong Liu, Changhong Xia, Wei He, and Chongjun Wang. 2024. Trustworthiness and self-awareness in large language models: An exploration through the think-solve-verify framework. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16855–16866.
- Lu et al. (2024) Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, and Fei Yuan. 2024. Llamax: Scaling linguistic horizons of llm by enhancing translation capabilities beyond 100 languages. arXiv preprint arXiv:2407.05975.
- Mo et al. (2024) Yuhong Mo, Hao Qin, Yushan Dong, Ziyi Zhu, and Zhenglin Li. 2024. Large language model (llm) ai text generation detection based on transformer deep learning algorithm. arXiv preprint arXiv:2405.06652.
- Ortiz and Patton (2012) Anna M Ortiz and Lori D Patton. 2012. Awareness of self. In Why Aren’t We There Yet?, pages 9–31. Routledge.
- Pan and Zeng (2023) Keyu Pan and Yawen Zeng. 2023. Do llms possess a personality? making the mbti test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180.
- Pandya and Holia (2023) Keivalya Pandya and Mehfuza Holia. 2023. Automating customer service using langchain: Building custom open-source gpt chatbot for organizations. arXiv preprint arXiv:2310.05421.
- Sabour et al. (2024) Sahand Sabour, Siyang Liu, Zheyuan Zhang, June M Liu, Jinfeng Zhou, Alvionna S Sunaryo, Juanzi Li, Tatia Lee, Rada Mihalcea, and Minlie Huang. 2024. Emobench: Evaluating the emotional intelligence of large language models. arXiv preprint arXiv:2402.12071.
- Saito et al. (2024) Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, and Yoshitaka Ushiku. 2024. Unsupervised llm adaptation for question answering. arXiv preprint arXiv:2402.12170.
- Shukla et al. (2023) Utsav Shukla, Manan Vyas, and Shailendra Tiwari. 2023. Raphael at araieval shared task: Understanding persuasive language and tone, an llm approach. In Proceedings of ArabicNLP 2023, pages 589–593.
- Sorin et al. (2023) Vera Sorin, Danna Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. 2023. Large language models (llms) and empathy-a systematic review. medRxiv, pages 2023–08.
- Talmor et al. (2018) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2018. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
- Tang et al. (2024) Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W Vincent, Yu’an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, et al. 2024. Tofueval: Evaluating hallucinations of llms on topic-focused dialogue summarization. arXiv preprint arXiv:2402.13249.
- Ugare et al. (2024) Shubham Ugare, Tarun Suresh, Hangoo Kang, Sasa Misailovic, and Gagandeep Singh. 2024. Improving llm code generation with grammar augmentation. arXiv preprint arXiv:2403.01632.
- Vu et al. (2024) Minh Duc Vu, Han Wang, Zhuang Li, Jieshan Chen, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. Gptvoicetasker: Llm-powered virtual assistant for smartphone. arXiv preprint arXiv:2401.14268.
- Wang et al. (2023) Xuena Wang, Xueting Li, Zi Yin, Yue Wu, and Jia Liu. 2023. Emotional intelligence of large language models. Journal of Pacific Rim Psychology, 17:18344909231213958.
- Wang et al. (2024) Yuhao Wang, Yusheng Liao, Heyang Liu, Hongcheng Liu, Yu Wang, and Yanfeng Wang. 2024. Mm-sap: A comprehensive benchmark for assessing self-awareness of multimodal large language models in perception. arXiv preprint arXiv:2401.07529.
- Yang et al. (2024) Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. 2024. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data, 18(6):1–32.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153.
- Yuan et al. (2023) Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and llms evaluations. Advances in Neural Information Processing Systems, 36:58478–58507.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- Zhang et al. (2024a) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024a. When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193.
- Zhang et al. (2025) Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Junnan Dong, Hao Chen, Yi Chang, and Xiao Huang. 2025. A survey of graph retrieval-augmented generation for customized large language models. arXiv preprint arXiv:2501.13958.
- Zhang et al. (2024b) Zhiping Zhang, Michelle Jia, Hao-Ping Lee, Bingsheng Yao, Sauvik Das, Ada Lerner, Dakuo Wang, and Tianshi Li. 2024b. “it’s a fair game”, or is it? examining how users navigate disclosure risks and benefits when using llm-based conversational agents. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 1–26.
- Zheng et al. (2024) Dewu Zheng, Yanlin Wang, Ensheng Shi, Ruikai Zhang, Yuchi Ma, Hongyu Zhang, and Zibin Zheng. 2024. Towards more realistic evaluation of llm-based code generation: an experimental study and beyond. arXiv preprint arXiv:2406.06918.
- Zhuang et al. (2024) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2024. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36.
- Zou et al. (2024) Zhao Zou, Omar Mubin, Fady Alnajjar, and Luqman Ali. 2024. A pilot study of measuring emotional response and perception of llm-generated questionnaire and human-generated questionnaires. Scientific reports, 14(1):2781.
Appendix A Prompt Design
For each task, we apply a consistent prompt engineering template to generate the input prompt. The templates are listed below.
Dataset | Prompt ID | Prompt Content |
Sentiment140 | PROMPT_NEUTRAL | You are a psychologist. Please analyze the general sentiment of the tweet based on the following text. You should also provide a brief description of the tweet with some more words or sentences. {TWEET CONTENT} |
MyPersonality | PROMPT_NEUTRAL | You are an educational psychologist. Please analyze the personality of the person based on the following detailed information. You should also provide a brief description of the person's personality with some more words or sentences. {PERSON DETAILED INFORMATION} |
IMDB | PROMPT_NEUTRAL | You are a film critic. Please classify the general sentiment of the film based on the following reviews as either "positive" or "negative." You should also provide a brief description of the movie with some more words or sentences, such as "positive, this movie looks wonderful!" or "negative, this movie sucks." {REVIEW CONTENT} |
IMDB | PROMPT_POSITIVE | You are a film critic with a more positive attitude. Please classify the general sentiment of the film based on the following reviews as either "positive" or "negative." You should also provide a brief description of the movie with some more words or sentences, such as "positive, this movie looks wonderful!" or "negative, this movie sucks." {REVIEW CONTENT} |
IMDB | PROMPT_NEGATIVE | You are a film critic with a more negative attitude. Please classify the general sentiment of the film based on the following reviews as either "positive" or "negative." You should also provide a brief description of the movie with some more words or sentences, such as "positive, this movie looks wonderful!" or "negative, this movie sucks." {REVIEW CONTENT} |
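For reference, a minimal sketch of how these templates can be instantiated in code is shown below; the dictionary layout is our own, the template text is abbreviated with "..." (see the table above for the full wording), and the placeholder is renamed `REVIEW_CONTENT` (with an underscore) so it is a valid Python format field.

```python
# Sketch: instantiating the IMDB prompt templates from Table 5. The
# template strings are abbreviated with "..." for brevity.
PROMPTS = {
    "PROMPT_NEUTRAL": ('You are a film critic. Please classify the general '
                       'sentiment of the film ...\n\n{REVIEW_CONTENT}'),
    "PROMPT_POSITIVE": ('You are a film critic with a more positive attitude. '
                        'Please classify the general sentiment of the film ...'
                        '\n\n{REVIEW_CONTENT}'),
    "PROMPT_NEGATIVE": ('You are a film critic with a more negative attitude. '
                        'Please classify the general sentiment of the film ...'
                        '\n\n{REVIEW_CONTENT}'),
}

def build_prompt(prompt_id: str, review: str) -> str:
    """Fill a template with the review text to form the model input."""
    return PROMPTS[prompt_id].format(REVIEW_CONTENT=review)

print(build_prompt("PROMPT_NEUTRAL", "A beautifully acted, moving drama."))
```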