
RITFIS: Robust input testing framework for LLMs-based intelligent software

Mingxuan Xiao¹, Yan Xiao², Hai Dong³, Shunhui Ji¹ and Pengcheng Zhang¹
¹College of Computer Science and Software Engineering, Hohai University, Nanjing, China
[email protected], [email protected], [email protected]
²School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
[email protected]
³School of Computing Technologies, RMIT University, Melbourne, Australia, [email protected]
Abstract

The dependence of Natural Language Processing (NLP) intelligent software on Large Language Models (LLMs) is increasingly prominent, underscoring the necessity for robustness testing. Current testing methods focus solely on the robustness of LLM-based software to prompts. Given the complexity and diversity of real-world inputs, studying the robustness of LLM-based software in handling comprehensive inputs (including prompts and examples) is crucial for a thorough understanding of its performance.

To this end, this paper introduces RITFIS, a Robust Input Testing Framework for LLM-based Intelligent Software. To our knowledge, RITFIS is the first framework designed to assess the robustness of LLM-based intelligent software against natural language inputs. Given a threat model and a prompt, RITFIS defines the testing process as a combinatorial optimization problem: a goal function determines which test cases are successful, perturbation means create a transformation space around the original examples, and a series of search methods filter candidates that meet both the testing objectives and language constraints. With its modular design, RITFIS offers a comprehensive method for evaluating the robustness of LLM-based intelligent software.

RITFIS adapts 17 automated testing methods, originally designed for Deep Neural Network (DNN)-based intelligent software, to the LLM-based software testing scenario. Empirical validation demonstrates the effectiveness of RITFIS in evaluating LLM-based intelligent software. However, existing methods generally have limitations, especially when dealing with lengthy texts and structurally complex threat models. We therefore conducted a comprehensive analysis based on five metrics and provide insightful optimization strategies for testing methods, benefiting both researchers and everyday users.

Index Terms:
intelligent software testing, large language models, natural language processing

I Introduction

Intelligent software based on large language models (LLMs), such as ChatGPT (https://openai.com/chatgpt), New Bing (https://www.bing.com), Chatsonic (https://writesonic.com), and Paradot (https://www.paradot.ai), has garnered widespread attention due to its exceptional semantic understanding and text generation capabilities. Humans can collaborate with LLMs through ‘prompt + example’ inputs [1], thereby accomplishing many security-relevant natural language processing (NLP) downstream tasks, such as financial sentiment analysis [2], public opinion monitoring [3], and fraud detection [4], among others. Taking financial sentiment analysis as an example: the financial market generates a massive amount of news, reports, and social media content, making manual analysis of these texts impractical. Investors and financial analysts need to understand market sentiment to predict the performance of stocks, bonds, or other financial products, so relevant financial texts must be collected and classified by models before they can feed analysis and prediction. If adversarial texts cause inaccurate classification, the resulting misjudgment of market sentiment can misguide investors and decision-makers into unfavorable investment decisions, potentially leading to financial losses, reputation damage, market turbulence, and other severe consequences. Additionally, in the United States alone, the labor force for software testing is estimated to cost $48 billion annually [5]. To develop stable-performing LLM-based intelligent software and to ensure the quality of software responses, automated robustness testing of such software is crucial [6].

Current research focuses on the high sensitivity of LLMs to prompts [7]. Zhu et al. [1] have assessed the robustness of LLMs to adversarial prompts by dynamically creating minor perturbations in characters, words, and sentences. Liu et al. [8] have evaluated the robustness of LLMs to typographical errors in prompts using the Justice dataset. We believe that studying the overall input robustness of LLMs (prompt + example) is more important, as real-world inputs are often complex and varied [9]. Evaluating the model’s robustness to overall input is essential to comprehensively measure its performance, including understanding complex contexts, handling various input errors, and adapting to non-standard usages.

Therefore, this paper proposes RITFIS, a robust input testing framework for LLM-based intelligent software. To our knowledge, RITFIS is the first framework designed to evaluate the robustness of LLM-based intelligent software to natural language inputs. Inspired by the work of Morris et al. [10], RITFIS consists of goal functions, perturbations, constraints, and search methods. It transfers 17 automated testing methods aimed at deep neural network (DNN)-based intelligent software to the LLM-based intelligent software testing scenario, covering character-level, word-level, sentence-level, and combination-level perturbations. Considering the strong generalization capability of large models, and the fact that many companies and individual users will not or cannot fine-tune large models for specific domains due to insufficient data [11], RITFIS focuses on zero-shot settings. We conducted robustness tests on the well-known open-source large model Llama [12], using five of the latest or most commonly used DNN-based intelligent software testing methods on three datasets that cover different language styles and industry terminologies. Experimental results show that current testing methods can reveal certain robustness flaws in LLM-based intelligent software, but their testing capabilities are limited.

Figure 1: RITFIS workflow

II Methodology

Figure 1 outlines the proposed RITFIS, which carries out the following testing task: given the threat model and prompt under test, different testing methods are executed to automatically generate test cases from the original examples in the dataset. Each testing method can be formulated as a combinatorial optimization problem: a goal function defines successful test cases, a perturbation space is established around the original examples according to the chosen perturbation means, and search methods look for a test case that meets the test objectives and certain language constraints. The modular design of RITFIS allows many different testing methods to be implemented within a shared framework. The following is a detailed introduction to the basic modules of RITFIS: goal function, perturbation, constraint, and search method; the sketch below previews how they compose.
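As a preview of how the four modules compose, the following minimal Python sketch wires a goal function, a perturbation, constraints, and a search method into one testing loop for a single original example. All names and signatures here are illustrative assumptions for exposition, not RITFIS's actual API.

```python
# Illustrative composition of the four RITFIS modules into one testing loop.
# All names and signatures are assumptions for exposition, not RITFIS's API.
from typing import Callable, Iterable, List, Optional, Tuple

GoalFunction = Callable[[str, str, str], Tuple[bool, float]]   # (prompt, candidate, label) -> (success, score)
PerturbationFn = Callable[[str], Iterable[str]]                # original example -> candidate examples
ConstraintFn = Callable[[str, str], bool]                      # (original, candidate) -> acceptable?


def run_test_method(prompt: str, example: str, label: str,
                    goal: GoalFunction,
                    perturbation: PerturbationFn,
                    constraints: List[ConstraintFn],
                    search: Callable[..., Optional[str]]) -> Optional[str]:
    """Return a successful test case for one original example, or None."""

    def transformation_space(original: str) -> Iterable[str]:
        # The perturbation defines the transformation space; constraints prune it.
        for candidate in perturbation(original):
            if all(check(original, candidate) for check in constraints):
                yield candidate

    # The search method explores the pruned space, querying the goal function
    # to decide when a candidate misleads the threat model.
    return search(prompt, example, label, transformation_space, goal)
```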

II-A Goal Function

Automated testing methods for prompts are very popular in the robustness research of LLM-based intelligent software. Technically, given a dataset $D=\{(x_i,y_i)\}_{i\in[N]}$ and the original prompt $P$, where $x$ represents an original example in the dataset and $y$ its ground-truth label, the robustness test for prompts aims to mislead the threat model $f$ by perturbing the prompt with a perturbation $\delta$ drawn from a given budget $C$: $\arg\max_{\delta\in C} E_{(x;y)\in D} L[f([P+\delta,x]),y]$. Here $L$ represents the loss function, and $[\cdot,\cdot]$ represents the concatenation operation. Existing work [1] suggests that LLM-based intelligent software inputs can be formulated in this way. For instance, in text classification tasks, $P$ could be “I will input the test text and you will respond with the label of the text and the confidence score corresponding to each label. My first text is:”, and $x$ could be “Net sales in 2007 are expected to be 10% up on 2006.”

In this paper, our focus is on testing the overall input $[P,x]$ rather than just the prompt $P$. In the actual use of LLM-based intelligent software, both prompts and examples are indispensable, and the software often faces non-standardized or diverse inputs. Researching the robustness of the overall input helps enhance the model’s robustness and security. Moreover, LLM-based intelligent software is known to be highly sensitive to prompts [13, 14], and past robustness studies on DNN-based intelligent software [15, 16] have shown that minor perturbations to the example can mislead the software’s decisions. Therefore, research on input robustness is urgent, and we define the objective function for input robustness as follows:

\arg\max_{\delta\in C} E_{(x;y)\in D} L[f([P,x]+\delta),\,y] \qquad (1)

where $\delta$ is the perturbation applied to the entirety of ‘prompt + example’, and $L$ is the confidence associated with the label output by the threat model. This definition is similar to the generation of adversarial test cases in DNN-based intelligent software testing, and we extend this concept to the field of robustness testing for LLM-based intelligent software.
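As one concrete reading of Eq. (1), the sketch below scores a perturbed input by how far it pushes the threat model away from the ground-truth label; `query_llm` is a hypothetical wrapper around the LLM-based software that returns the predicted label and its confidence, not part of RITFIS itself.

```python
from typing import Tuple


def query_llm(prompt: str, example: str) -> Tuple[str, float]:
    """Hypothetical call to the LLM-based software: returns (label, confidence)."""
    raise NotImplementedError


def goal_function(prompt: str, perturbed_example: str, true_label: str) -> Tuple[bool, float]:
    """One possible instantiation of the objective in Eq. (1) for classification.

    Returns (success, score): success means the output label no longer matches
    the ground truth y; the score rewards perturbations that erode the
    confidence assigned to the true label, which is what guides the search.
    """
    predicted_label, confidence = query_llm(prompt, perturbed_example)
    if predicted_label != true_label:
        return True, 1.0                  # label flipped: successful test case
    return False, 1.0 - confidence        # not flipped yet: higher score = closer
```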

II-B Perturbation

The perturbation module is responsible for directly modifying the input data of NLP models. In the robustness testing of LLM-based intelligent software, the purpose of perturbation methods is to subtly yet effectively alter the input text to probe the software’s robustness. Specifically, the types of perturbations in RITFIS include: synonym replacement; confusion word replacement; insertion, deletion, and swapping of words and characters; back-translation; and template-based transformation. Taking the commonly used synonym replacement as an example, the goal of this method is to maintain the original meaning while changing the input to test the model’s understanding ability, for instance, replacing ‘fast’ with ‘rapid.’ These transformation methods can be used individually or in combination to generate potential test cases. In RITFIS, the implementation and application of these perturbation methods are highly flexible, allowing users to customize them according to specific tasks and objectives.
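A minimal sketch of word-level synonym replacement follows, using NLTK's WordNet as one possible synonym source; this is an assumption for illustration, and RITFIS's own perturbation implementations may draw on other resources.

```python
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)


def synonym_candidates(word: str, limit: int = 5):
    """Collect distinct single-word WordNet synonyms of `word`."""
    found = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower() and " " not in name:
                found.append(name)
    return list(dict.fromkeys(found))[:limit]


def synonym_replacements(text: str):
    """Yield variants of `text` with exactly one word swapped for a synonym,
    e.g. 'fast' -> 'rapid'."""
    words = text.split()
    for i, word in enumerate(words):
        for synonym in synonym_candidates(word):
            yield " ".join(words[:i] + [synonym] + words[i + 1:])
```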

II-C Constraint

In RITFIS, constraints are rules or standards used to ensure that the generated test cases remain consistent with the original examples in specific aspects, and these constraints are crucial for ensuring the effectiveness and practicality of the test cases. Common constraints include: stop-word filtering, part-of-speech constraints, maximum change rate constraints, blacklist vocabulary constraints, perturbation number limits, etc. These constraints in RITFIS are configurable, allowing researchers and testers to adjust them according to specific application scenarios and requirements. Properly applying these constraints can ensure that the generated test cases can effectively test the robustness of intelligent software while maintaining acceptability to human users.
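Two of the constraints listed above, sketched as simple predicates over (original, candidate) pairs; the threshold and stop-word list are illustrative defaults, not values prescribed by RITFIS.

```python
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}


def within_max_change_rate(original: str, candidate: str, max_rate: float = 0.1) -> bool:
    """Maximum change rate constraint: reject candidates that alter more than
    `max_rate` of the original words (word-level substitutions assumed)."""
    orig_words, cand_words = original.split(), candidate.split()
    if len(orig_words) != len(cand_words):
        return False
    changed = sum(o != c for o, c in zip(orig_words, cand_words))
    return changed / max(len(orig_words), 1) <= max_rate


def stop_words_untouched(original: str, candidate: str) -> bool:
    """Stop-word filtering: forbid perturbing stop words, which carry little
    task-relevant meaning but strongly affect fluency."""
    return all(
        o == c
        for o, c in zip(original.split(), candidate.split())
        if o.lower() in STOP_WORDS
    )
```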

II-D Search Method

Search methods are responsible for strategically exploring the potential test case space formed by perturbations, to identify successful test cases that can maximize the error in the target model. The core of search methods lies in their optimization ability, that is, intelligently applying perturbations and assessing their impact on the model output, to locate effective perturbation positions. These methods must effectively cross the model’s decision boundaries while maintaining the practicality and reasonableness of the input data. The choice and implementation of search methods depend on various factors, including the specific nature of the threat model, the application scenario of the testing work, and the available computational resources. The effectiveness of search methods directly impacts the success rate and efficiency of testing, and is a key component in assessing the robustness of intelligent software [17]. Depending on the tester’s understanding of the threat model, search methods can be classified and introduced from two perspectives:

  • White-box search methods require access to the model’s internal structure and parameters, utilizing this information to guide the generation of test cases. These methods are usually more efficient but require an in-depth understanding of the model.

  • Black-box search methods do not require access to the internal structure or parameters of the threat model during the testing process and rely solely on the output of the intelligent software. Black-box methods are closer to real-world application scenarios, as testers often cannot obtain detailed information about the threat model.

In RITFIS, these methods provide a diverse range of options for testing the robustness of intelligent software. White-box methods can precisely locate effective test cases when the internal information of the model is accessible, but their applicability is limited by the level of understanding of the intelligent software. In practical applications, intelligent software typically interacts with LLMs through API interfaces, without access to their internal structure and parameters. In such cases, the interaction between intelligent software and LLMs is essentially a black-box environment. RITFIS focuses on black-box search methods, allowing intelligent software to operate based solely on model output, without needing to understand the specific workings or internal mechanisms of LLMs. Focusing on black-box search methods is not only a practical choice but also a key strategy to ensure the efficiency and robustness of intelligent software in a wide range of application environments, providing a way to effectively explore the capabilities of LLMs without in-depth knowledge of the model.
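The sketch below illustrates one black-box search strategy in the spirit of greedy word-importance methods such as PWWS and TextFooler: rank positions by how much removing each word erodes the confidence in the true label, then greedily substitute synonyms at the most important positions until the label flips or a query budget is exhausted. `query_llm`, `synonym_candidates`, and the constraint predicates are the hypothetical helpers sketched in the previous subsections, not RITFIS's actual API.

```python
def greedy_black_box_search(prompt, example, true_label,
                            query_llm, synonym_candidates, constraints,
                            query_budget=500):
    """Return a successful test case or None, using only model outputs."""
    words = example.split()
    queries = 0

    def true_label_confidence(text):
        nonlocal queries
        queries += 1
        label, confidence = query_llm(prompt, text)
        return label, (confidence if label == true_label else 0.0)

    _, base_conf = true_label_confidence(example)

    # 1) Word importance: how much does dropping each word hurt the true label?
    importance = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        _, conf = true_label_confidence(reduced)
        importance.append((base_conf - conf, i))
    importance.sort(reverse=True)

    # 2) Greedy substitution at the most important positions first.
    current = list(words)
    for _, i in importance:
        if queries >= query_budget:
            break
        _, best_conf = true_label_confidence(" ".join(current))
        best_word = current[i]
        for synonym in synonym_candidates(words[i]):
            candidate = " ".join(current[:i] + [synonym] + current[i + 1:])
            if not all(check(example, candidate) for check in constraints):
                continue
            label, conf = true_label_confidence(candidate)
            if label != true_label:
                return candidate                  # successful test case
            if conf < best_conf:
                best_conf, best_word = conf, synonym
        current[i] = best_word                    # keep the most damaging synonym
    return None                                   # no success within the budget
```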

III Experimental setup

To assess the testing performance of existing automated testing methods on LLM-based intelligent software, we implemented 17 robustness testing methods through RITFIS. We initially evaluated the performance and efficiency of all methods, then selected five commonly used or recent testing methods for further experiments and analysis: TextFooler [18], StressTest [19], Checklist [20], TextBugger [21], and PWWS [22]. Experiments were conducted on three datasets covering three scenarios: financial sentiment analysis (Financial) [23], movie review analysis (MR) (https://huggingface.co/datasets/MrbBakh/Rotten_Tomatoes), and news classification (AG’s News) [24], encompassing binary and multi-class classification as well as different text lengths. We used Llama-2-70b (https://github.com/facebookresearch/llama) as the threat model for the experiments. With 70 billion parameters, Llama-2-70b demonstrates exceptional accuracy and adaptability when processing large volumes of complex data, making it an important milestone in the current field of NLP. The evaluation metrics of the experiment include:

1) Success rate (S-rate) [10], which represents the proportion of usable test cases generated by the test method among all tested examples. In this experiment, its formula can be expressed as follows:

\text{S-rate}=\frac{N_{suc}}{N} \qquad (2)

where $N_{suc}$ is the number of test cases that successfully mislead the threat model, and $N$ is the total number of input examples for the current test method ($N$ = 1,000 in our experiment).

2) Change rate (C-rate) [10], which represents the average proportion of changed words in the original text. C-rate can be expressed as:

\text{C-rate}=\frac{1}{N_{suc}}\sum_{k=1}^{N_{suc}}\frac{\operatorname{diff}(T_{k})}{\operatorname{len}(T_{k})} \qquad (3)

where $\operatorname{diff}(T_{k})$ represents the number of words replaced in the input text $T_{k}$ and $\operatorname{len}(\cdot)$ represents the sequence length. C-rate is an indicator designed to measure the difference in content between the generated test cases and the original examples.

3) Perplexity (PPL) [10], an indicator used to assess the fluency of textual test cases. Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $X=(x_{0},x_{1},\ldots,x_{t})$, then the perplexity of $X$ is

\operatorname{PPL}(X)=\exp\left\{-\frac{1}{t}\sum_{i=1}^{t}\log p_{\theta}\left(x_{i}\mid x_{<i}\right)\right\} \qquad (4)

where $\log p_{\theta}\left(x_{i}\mid x_{<i}\right)$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$ according to the language model. Intuitively, for a fixed language model used to compute PPL, the more fluent the test case, the lower its perplexity.

4) Time overhead (T-O) [10], which refers to the average time (in seconds) it takes for a test method to generate a successful test case.

5) Query number (Q-N) [10], which indicates the average number of times a test method queries the threat model when generating a successful test case. Together with the time overhead, the query number reflects the efficiency of the test method. A sketch of how these five metrics can be computed is given after this list.
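The following sketch shows one way to compute the five metrics from a list of per-example test records. The record layout is an assumption for exposition, and GPT-2 is used as an assumed choice of language model for Eq. (4); the paper does not specify which model computes PPL.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed record layout per original example:
#   {"original": str, "test_case": str or None, "seconds": float, "queries": int}

_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def perplexity(text: str) -> float:
    """Eq. (4): exponentiated average negative log-likelihood under GPT-2."""
    encodings = _tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = _lm(encodings.input_ids, labels=encodings.input_ids)
    return float(torch.exp(out.loss))


def evaluate(records):
    successes = [r for r in records if r["test_case"] is not None]
    s_rate = len(successes) / len(records)                               # Eq. (2)

    change_rates = []
    for r in successes:
        orig, case = r["original"].split(), r["test_case"].split()
        changed = sum(o != c for o, c in zip(orig, case))
        change_rates.append(changed / max(len(orig), 1))
    c_rate = sum(change_rates) / max(len(change_rates), 1)               # Eq. (3)

    ppl = sum(perplexity(r["test_case"]) for r in successes) / max(len(successes), 1)
    t_o = sum(r["seconds"] for r in successes) / max(len(successes), 1)  # T-O
    q_n = sum(r["queries"] for r in successes) / max(len(successes), 1)  # Q-N
    return {"S-rate": s_rate, "C-rate": c_rate, "PPL": ppl, "T-O": t_o, "Q-N": q_n}
```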

IV Results and analysis

In this study, we chose the text classification task as the focus for testing the input robustness of LLMs. Text classification, as a widely applied downstream task in the NLP field, aims to assign textual content to predefined category labels, such as ‘positive’ or ‘negative’. With LLMs recently observed to outperform traditional DNNs [25, 26] and crowdsourced workers [27, 28] in text classification, an increasing number of enterprises are inclined to use LLM-based intelligent software for such tasks. Table I presents the results of the five robustness testing methods on the three datasets for 1,000 randomly extracted original examples, with the best value for each metric on each dataset marked with an asterisk.

TABLE I: Performance of five methods to generate test cases (T-O in seconds; * marks the best value per dataset for each metric).

Dataset      Method       S-rate    C-rate    PPL       T-O        Q-N
Financial    TextFooler   0.652     0.012     51.943    339.459    95.372
Financial    StressTest   0.399     0.086     40.496*   9.910*     2.981*
Financial    Checklist    0.402     0.019     52.575    113.312    33.107
Financial    TextBugger   0.639     0.059     52.687    422.331    126.641
Financial    PWWS         0.659*    0.011*    51.368    1419.016   452.283
AG’s News    TextFooler   0.481     0.017     48.587*   1777.568   475.522
AG’s News    StressTest   0.042     0.038     49.588    10.871*    2.749*
AG’s News    Checklist    0.081     0.014*    50.833    123.865    30.456
AG’s News    TextBugger   0.539*    0.105     53.409    1097.442   306.723
AG’s News    PWWS         0.519     0.015     49.739    2625.879   691.261
MR           TextFooler   0.638     0.018     62.861    1101.723   314.526
MR           StressTest   0.119     0.082     47.406*   9.908*     3.342*
MR           Checklist    0.238     0.020     62.479    123.521    42.481
MR           TextBugger   0.659     0.019     61.065    786.444    307.560
MR           PWWS         0.741*    0.013*    60.181    1470.677   531.485

Analyzing the effectiveness of the testing methods in uncovering robustness flaws, these methods face the same difficulty as when testing DNN-based software: the longer the average text length in a dataset, the lower the testing success rate. For example, TextFooler, a commonly used testing method, achieved a 63.856% success rate on the shortest dataset, MR, but only 48.143% on the longest, AG’s News. PWWS averaged a 63.972% success rate across the three datasets, performing better than the other baseline testing methods. Notably, PWWS achieved an 80.308% success rate on traditional DNNs, indicating that existing methods can reveal the robustness flaws of LLM-based software to a certain extent, but their testing capability for such software is still limited. We believe this issue stems primarily from two factors. First, the complexity and high integration of LLM-based software may make it difficult for standard robustness testing methods to fully reveal its potential flaws. Second, most existing testing methods rely on fixed perturbation paradigms, whereas the dynamic, continual-learning characteristics of LLM-based software may mean that test results at a given point in time do not reflect its long-term robustness. To address these issues, we suggest, first, improving how testing algorithms construct perturbation spaces and search methods, tailoring them to LLMs’ unique characteristics and behavioral patterns and thereby increasing the coverage and depth of testing; and second, given the dynamic nature of LLM-based software, adopting continuous, iterative testing methods and adaptive testing strategies to capture behavioral changes of the software in long-term operation.

To further analyze the performance of the testing methods, this paper evaluates the quality of test cases generated by different testing methods through two indicators: change rate and text perplexity. All five testing methods misled LLMs with no more than an 11% change rate, indicating that slight perturbations to the original samples could lead to erroneous outputs by LLMs. In terms of PPL, test cases generated by StressTest performed best, with an average PPL score of 45.830, meaning these test texts were more natural and fluent.

Besides the quality of test cases, the efficiency of testing methods is also a key focus of this study, covering time overhead and the number of queries. On these two metrics, StressTest performed best: per successful test case, it saved at least 103.402 seconds and required at least 27.707 fewer queries to the threat model compared to the next best method. However, this significant improvement in efficiency seems to come at the expense of the effectiveness of robustness testing, as StressTest had the lowest success rate across the three datasets. We believe this phenomenon arises, first, from an overemphasis on completing testing quickly, which neglects the comprehensiveness of test cases and thereby affects the accuracy and reliability of test results; and second, because excessively reducing threat-model queries means insufficient exploration of software robustness flaws, lowering the likelihood of uncovering complex issues. Therefore, when conducting robustness testing on LLM-based software, the relationship between efficiency and effectiveness needs to be rebalanced, ensuring that increasing the speed of testing does not sacrifice the depth and comprehensiveness of test cases. Specific actions include increasing the depth of threat-model queries to assess software robustness more comprehensively, and introducing more sophisticated test case generation algorithms that enhance the coverage and quality of test cases while maintaining efficiency. We also suggest adopting dynamic adjustment strategies that reallocate time and query resources based on feedback during the testing process, thus optimizing the balance between efficiency and effectiveness.

V Conclusion

In this paper, we present RITFIS, which evaluates the input robustness of LLM-based intelligent software in NLP. RITFIS formalizes robustness testing as a combinatorial optimization task, consisting of a goal function, perturbation, constraint, and search method, and applies existing testing methods for DNN-based intelligent software to the robustness testing of LLM-based intelligent software. We selected five testing methods and conducted experiments on three datasets across five testing metrics. The findings indicate that current LLM-based intelligent software is not robust enough, and that the flaw-detection capability of existing testing approaches is also insufficient. Therefore, in future work, we plan to adopt continuous, iterative testing methods and rebalance efficiency and effectiveness, targeting the unique input paradigms and hallucination issues of LLMs, to design more effective robustness testing methods.

References

  • [1] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang, L. Yang, W. Ye, N. Z. Gong, Y. Zhang, et al., “Promptbench: Towards evaluating the robustness of large language models on adversarial prompts,” arXiv preprint arXiv:2306.04528, 2023.
  • [2] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
  • [3] L. Espinosa and M. Salathé, “Use of large language models as a scalable approach to understanding public health discourse,” medRxiv, pp. 2024–02, 2024.
  • [4] L. Jiang, “Detecting scams using large language models,” arXiv preprint arXiv:2402.03147, 2024.
  • [5] M. Davis, S. Choi, S. Estep, B. Myers, and J. Sunshine, “Nanofuzz: A usable tool for automatic test generation,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1114–1126, 2023.
  • [6] T. Ouyang, H.-Q. Nguyen-Son, H. H. Nguyen, I. Echizen, and Y. Seo, “Quality assurance of a gpt-based sentiment analysis system: Adversarial review data generation and detection,” arXiv preprint arXiv:2310.05312, 2023.
  • [7] Z. Guo, R. Jin, C. Liu, Y. Huang, D. Shi, L. Yu, Y. Liu, J. Li, B. Xiong, D. Xiong, et al., “Evaluating large language models: A comprehensive survey,” arXiv preprint arXiv:2310.19736, 2023.
  • [8] Y. Liu, Y. Yao, J.-F. Ton, X. Zhang, R. G. H. Cheng, Y. Klochkov, M. F. Taufiq, and H. Li, “Trustworthy llms: a survey and guideline for evaluating large language models’ alignment,” arXiv preprint arXiv:2308.05374, 2023.
  • [9] Y. Liu, T. Cong, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang, “Robustness over time: Understanding adversarial examples’ effectiveness on longitudinal versions of large language models,” arXiv preprint arXiv:2308.07847, 2023.
  • [10] J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi, “Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 119–126, 2020.
  • [11] C. B. Head, P. Jasper, M. McConnachie, L. Raftree, and G. Higdon, “Large language model applications for evaluation: Opportunities and ethical implications,” New Directions for Evaluation, vol. 2023, no. 178-179, pp. 33–46, 2023.
  • [12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [13] J. Wang, X. Hu, W. Hou, H. Chen, R. Zheng, Y. Wang, L. Yang, H. Huang, W. Ye, X. Geng, et al., “On the robustness of chatgpt: An adversarial and out-of-distribution perspective,” arXiv preprint arXiv:2302.12095, 2023.
  • [14] C.-Y. Ko, P.-Y. Chen, P. Das, Y.-S. Chuang, and L. Daniel, “On robustness-accuracy characterization of large language models using synthetic datasets,” in International Conference on Machine Learning, 2023.
  • [15] P. Zhang, B. Ren, H. Dong, and Q. Dai, “Cagfuzz: coverage-guided adversarial generative fuzzing testing for image-based deep learning systems,” IEEE Transactions on Software Engineering, vol. 48, no. 11, pp. 4630–4646, 2021.
  • [16] Y. Xiao, Y. Lin, I. Beschastnikh, C. Sun, D. Rosenblum, and J. S. Dong, “Repairing failure-inducing inputs with input reflection,” in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp. 1–13, 2022.
  • [17] M. Xiao, Y. Xiao, H. Dong, S. Ji, and P. Zhang, “Leap: Efficient and automated test method for nlp software,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1136–1148, IEEE, 2023.
  • [18] D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits, “Is bert really robust? a strong baseline for natural language attack on text classification and entailment,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, pp. 8018–8025, 2020.
  • [19] A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig, “Stress test evaluation for natural language inference,” in Proceedings of the 27th International Conference on Computational Linguistics, pp. 2340–2353, 2018.
  • [20] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh, “Beyond accuracy: Behavioral testing of nlp models with checklist,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912, 2020.
  • [21] J. Li, S. Ji, T. Du, B. Li, and T. Wang, “Textbugger: Generating adversarial text against real-world applications,” in Proceedings 2019 Network and Distributed System Security Symposium, Internet Society, 2019.
  • [22] S. Ren, Y. Deng, K. He, and W. Che, “Generating natural language adversarial examples through probability weighted word saliency,” in Proceedings of the 57th annual meeting of the association for computational linguistics, pp. 1085–1097, 2019.
  • [23] P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala, “Good debt or bad debt: Detecting semantic orientations in economic texts,” Journal of the Association for Information Science and Technology, vol. 65, no. 4, pp. 782–796, 2014.
  • [24] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, vol. 28, 2015.
  • [25] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, and G. Wang, “Text classification via large language models,” arXiv preprint arXiv:2305.08377, 2023.
  • [26] C. Li, J. Wang, Y. Zhang, K. Zhu, W. Hou, J. Lian, F. Luo, Q. Yang, and X. Xie, “Large language models understand and can be enhanced by emotional stimuli,” arXiv preprint arXiv:2307.11760, 2023.
  • [27] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd-workers for text-annotation tasks,” arXiv preprint arXiv:2303.15056, 2023.
  • [28] X. He, Z. Lin, Y. Gong, A. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, W. Chen, et al., “Annollm: Making large language models to be better crowdsourced annotators,” arXiv preprint arXiv:2303.16854, 2023.