
Better to Ask in English: Evaluation of Large Language Models on English, Low-resource and Cross-Lingual Settings

Krishno Dey^a, Prerona Tarannum^b, Md. Arid Hasan^a,
Imran Razzak^c, Usman Naseem^d
{krishno.dey, arid.hasan}@unb.ca
^a University of New Brunswick, ^b Daffodil International University,
^c University of New South Wales, ^d Macquarie University
Abstract

Large Language Models (LLMs) are trained on massive amounts of data, enabling their application across diverse domains and tasks. Despite their remarkable performance, most LLMs are developed and evaluated primarily in English. Recently, a few multi-lingual LLMs have emerged, but their performance in low-resource languages, especially the most spoken languages of South Asia, is less explored. To address this gap, we evaluate LLMs such as GPT-4, Llama 2, and Gemini to analyze their effectiveness in English compared to low-resource languages from South Asia (e.g., Bangla, Hindi, and Urdu). Specifically, we use zero-shot prompting and five different prompt settings to extensively investigate the effectiveness of the LLMs with cross-lingual translated prompts. The findings of the study suggest that GPT-4 outperformed Llama 2 and Gemini in all five prompt settings and across all languages. Moreover, all three LLMs performed better with English prompts than with prompts in the low-resource languages. This study extensively investigates LLMs in low-resource language contexts to highlight the improvements required in LLMs and language-specific resources to develop more general-purpose NLP applications.


1 Introduction

Large Language Models (LLMs) have recently undergone significant advancements and have transformed the landscape of Natural Language Processing (NLP). LLMs are trained using large datasets, which enable them to recognize, translate, predict, or generate text and other content Yang et al. (2023). Recent advances in LLMs enable the development of more powerful, efficient, and versatile NLP systems with broad applicability across different domains and tasks Zhao et al. (2023).

Researchers have been evaluating LLMs for several NLP tasks, including machine translation Xu et al. (2023); Lyu et al. (2023), text summarization Pu et al. (2023); Zhang et al. (2023a), reasoning Suzgun et al. (2022); Miao et al. (2023), and mathematics Lu et al. (2023); Rane (2023). These works laid the foundation for exploring LLMs for downstream tasks in low-resource languages. LLMs are predominantly used in high-resource languages to evaluate different NLP tasks Xing (2024). Recently, a few benchmarks have utilized LLMs for low-resource languages Aggarwal et al. (2022); Asai et al. (2023). However, they do not extensively study the languages that we evaluate in this study.

On the other hand, the prevailing literature predominantly employs traditional machine learning or transformer-based models for analyzing low-resource languages Hedderich et al. (2020). Recently, a few multi-lingual LLMs have emerged, but their performance in low-resource languages (especially the most spoken languages in South Asia) is not as good as in English Asai et al. (2023). Three of the most widely spoken languages in South Asia are Hindi, with 610 million speakers; Bangla, with 273 million speakers; and Urdu, with 232 million speakers. Moreover, Hindi is the 3rd, Bangla the 7th, and Urdu the 10th most spoken language in the world. Despite collectively representing over 1 billion speakers, computational resources for these three languages remain limited. In this study, we explore the effectiveness of LLMs in Bangla, Hindi, and Urdu compared to English (we chose these languages because the authors are native speakers of Bangla, Hindi, and Urdu).

A few noteworthy works employ LLMs in Bangla Liu et al. (2023); Hasan et al. (2023b); Aggarwal et al. (2022), Urdu Koto et al. (2024); Aggarwal et al. (2022), and Hindi Kumar and Albuquerque (2021); Koto et al. (2024); Aggarwal et al. (2022). These studies show that LLMs can achieve comparable, and in some cases better, results than transformer-based models and traditional machine-learning techniques. However, more intensive studies are required to evaluate LLMs in these low-resource languages to determine their effectiveness and to suggest the improvements required in LLMs.

To address this gap, in this study, we analyze various LLMs (e.g., GPT-4, Llama 2, and Gemini) with zero-shot learning across both English and low-resource languages (e.g., Bangla, Hindi, and Urdu) under different prompt settings. Generic LLM names are used throughout the paper; the exact versions are GPT-4, Llama-2-70b-chat, and Gemini Pro. Specifically, we evaluate their effectiveness in English compared to other low-resource languages from South Asia. We perform cross-lingual translation of the originally designed prompts to develop various prompt settings and investigate the performance of LLMs under those settings. Our results show that GPT-4 outperformed Llama 2 and Gemini in all languages. Furthermore, Llama 2 often fails to recognize prompts in low-resource languages, resulting in the lowest performance. Additionally, LLMs perform better in English than in other languages such as Bangla, Hindi, and Urdu. The key contributions of this work include:

  • We investigate the effectiveness of zero-shot prompting with LLMs (GPT-4, Llama 2, Gemini) for cross-lingual tasks involving low-resource languages from South Asia (Bangla, Hindi, Urdu) on XNLI and SIB-200 datasets. The findings demonstrate that the translation quality of prompts has minimal impact on LLM performance, suggesting potential for language-agnostic prompting techniques in low-resource NLP applications.

  • Our work explores five unique prompt settings to analyze the influence of prompt design on LLM performance in high-resource (English) and low-resource languages. The results show that English prompts consistently outperform prompts in other languages, even with translated prompts.

  • We present a novel approach incorporating Natural Language Inference (NLI) and zero-shot prompting. Providing task descriptions and expected outputs through NLI creates a richer context for LLMs, improving performance in English and low-resource language tasks.

  • We conduct a comparative analysis of 3 prominent LLMs (GPT-4, Llama 2, Gemini) to evaluate their effectiveness in low-resource languages. The findings reveal that GPT-4 outperforms the other two models across all languages and prompt settings.

2 Related Works

The emergence of LLMs marks a major milestone in NLP; they are proficient in various tasks such as language understanding Dou et al. (2019), text generation Huang et al. (2020), question answering Giampiccolo et al. (2007), and sentiment analysis Yu et al. (2018). However, their performance in low-resource languages needs improvement Robinson et al. (2023).

2.1 LLM for English

In recent years, extensive resources and benchmarks for English Wang et al. (2018); Williams et al. (2017) have fueled the development of LLMs. LLMs significantly impact tasks such as question answering Akter et al. (2023); Tan et al. (2023); Zhuang et al. (2023), reasoning Suzgun et al. (2022); Miao et al. (2023), and machine translation Xu et al. (2023); Lyu et al. (2023). They are also applied to NLI Gubelmann et al. (2023), Sentiment Analysis Sun et al. (2023); Zhang et al. (2023b), and Hate Speech Detection Zhang et al. (2024). Multi-lingual LLMs show strong performance across languages Sitaram et al. (2023), yet struggle with low-resource languages Ahuja et al. (2023). Nevertheless, LLMs are known for their ability to understand the relationships among text sequences and produce results comparable to state-of-the-art techniques Pahwa and Pahwa (2023); Gubelmann et al. (2023). The study of Brown et al. (2020) shows that LLMs can perform various tasks by following instructions and learning from just a few examples provided in context. Moreover, English-centric LLMs such as Llama 2 can almost perfectly match the input and output language when tuned with a few multi-lingual conversational instructions, despite their limited exposure to other languages Kew et al. (2023). The study of Asai et al. (2023) focuses on few-shot learning and instruction fine-tuning of smaller LLMs (such as mT5 and mT0) and ChatGPT to improve performance on tasks such as NLI, question answering, sentiment analysis, and commonsense reasoning.

2.2 LLM for Low-resource Languages

Research and practical applications in the field of NLP are centred around high-resource languages, which typically have large annotated corpora, well-established tools and libraries, and robust language models trained on large data. Researchers continue to prioritize analyzing these high-resource languages while overlooking other languages spoken by billions of people Bender (2019).

Most of the work across several downstream tasks in low-resource languages employs traditional machine learning or transformer-based language models Jahan and Oussalah (2023); Chhabra and Vishwakarma (2023). To improve the Natural Language Inference (NLI) task across low-resource languages, several researchers have attempted to develop benchmark datasets and frameworks Bhattacharjee et al. (2021); Rahman et al. (2017); Chakravarthy et al. (2020). Similar attempts were also made in tasks such as Sentiment Analysis Islam et al. (2021); Hasan et al. (2020a, 2023a); Patra et al. (2018); Muhammad and Burney (2023) and Hate Speech Detection Badjatiya et al. (2017); Zimmerman et al. (2018). These works do not explore the use of LLMs for these tasks.

Researchers have recently been shifting their attention towards developing applications for low-resource languages. As a result, several authors have evaluated LLMs in low-resource languages across several tasks Liu et al. (2023); Hasan et al. (2023b); Kabir et al. (2023); Koto et al. (2024); Kumar and Albuquerque (2021); Hee et al. (2024); García-Díaz et al. (2023). However, there are several challenges in evaluating LLMs in low-resource languages for different tasks Ahuja et al. (2023); Chung et al. (2023). Researchers have developed low-resource language benchmarks Asai et al. (2023) and datasets Aggarwal et al. (2022); Khan et al. (2024) to address such challenges. The INDICXNLI Aggarwal et al. (2022) and INDICLLMSUITE Khan et al. (2024) datasets cover several Indian regional languages but do not include resources in Urdu. The benchmark BUFFET Asai et al. (2023) incorporates the most spoken South Asian languages in NLI tasks; however, it does not incorporate Bangla in its experiments. Along with these low-resource benchmarks and datasets, many prompting techniques have also been explored to evaluate LLMs on low-resource languages Qin et al. (2023); Huang et al. (2023).

Recently, a few multi-lingual LLMs have emerged, but their performance in low-resource languages, especially the most spoken languages in South Asia, is less explored. To address this gap, in this study, we evaluate LLMs such as GPT-4, Llama 2, and Gemini to analyze their effectiveness in English compared to other low-resource languages from South Asia (e.g., Bangla, Hindi, and Urdu).

3 Methodology

Model Prompt Template
GPT-4 [ { 'role': 'user', 'content': "Classify the following 'premise' and 'hypothesis' into one of the following classes: 'Entailment', 'Contradiction', or 'Neutral'. Provide only label as your response." premise: [PREMISE_TEXT] hypothesis: [HYPOTHESIS_TEXT] label: }, { 'role': 'system', 'content': "You are an expert data annotator and your task is to analyze the text and find the appropriate output that is defined in the user content." } ]
Llama 2 and Gemini Classify the following ‘premise’ and ‘hypothesis’ into one of the following classes: ‘Entailment’, ‘Contradiction’, or ‘Neutral’. Provide only label as your response. premise: [PREMISE_TEXT] hypothesis: [HYPOTHESIS_TEXT] label:
Table 1: Prompts used for zero-shot learning on XNLI dataset

3.1 Prompt Approach

The quality of the prompt impacts the performance of LLMs White et al. (2023). A well-designed prompt helps users and developers leverage the capabilities of LLMs to retrieve information that aligns with their specific interests. LLMs are usually trained on diverse datasets; thus, providing clear instructions on how to interact with them and produce the desired information is essential. Designing a good prompt is an iterative process that requires refining instructions through successive interactions, enabling the LLMs to produce the desired results. In this study, we employed zero-shot prompting and provided instructions in natural language. To help the LLMs generate more relevant output, we provided detailed instructions containing the task descriptions, following the prompt templates in Table 1 (and Table 7 in the Appendix). We used the same instruction format for all languages and LLMs. A minimal sketch of this setup follows.
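The following sketch (not the authors' exact code) assembles the Table 1 template and sends it to GPT-4 through the OpenAI chat API; the model identifier, temperature, and helper function name are illustrative assumptions.

# Minimal sketch: zero-shot XNLI prompting with GPT-4 via the OpenAI Python SDK (>= 1.0).
# The instruction and system texts mirror Table 1; other details are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "Classify the following 'premise' and 'hypothesis' into one of the following "
    "classes: 'Entailment', 'Contradiction', or 'Neutral'. "
    "Provide only label as your response."
)
SYSTEM = (
    "You are an expert data annotator and your task is to analyze the text and "
    "find the appropriate output that is defined in the user content."
)

def classify_nli(premise: str, hypothesis: str) -> str:
    """Return the raw label string produced by GPT-4 for one premise-hypothesis pair."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": f"{INSTRUCTION}\npremise: {premise}\nhypothesis: {hypothesis}\nlabel:"},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages, temperature=0)
    return response.choices[0].message.content.strip()

# Llama 2 and Gemini receive the same user text as a single plain prompt (no system role).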

Furthermore, to leverage the ability of GPT-4 to receive role information and behave according to that role, we provided role information along with the prompt. During the experiments, Gemini Pro blocked most prompts flagged as containing harmful or inappropriate text and made no predictions for them. To obtain predictions for those contents, we changed the safety settings of the Gemini Pro model. Despite the changed safety settings, the model still made no predictions for a small number of samples. See Appendix A for the safety settings.

Model Lang. P1 P2 P3 P4 P5
Acc. F1macro Acc. F1macro Acc. F1macro Acc. F1macro Acc. F1macro
GPT-4 BN 68.72 69.05 68.72 69.05 - - 70.73 71.18 70.26 70.54
EN 86.73 86.79 - - 87.03 87.08 82.42 81.99 86.73 86.81
HI 71.52 71.97 70.26 70.73 68.52 68.80 71.20 71.67 - -
UR 65.07 64.77 66.45 66.73 65.31 65.48 - - 66.77 66.95
Llama 2 BN 35.49 36.73 33.24 31.27 - - 35.42 35.91 32.05 33.59
EN 65.73 58.76 - - 60.80 56.78 62.59 59.17 63.59 60.11
HI 36.27 38.57 33.53 33.66 38.02 40.89 33.19 33.43 - -
UR 36.39 35.56 39.48 35.83 36.77 37.65 - - 33.31 36.35
Gemini BN 62.19 61.74 60.56 60.16 - - 56.56 52.41 60.67 60.30
EN 73.71 73.36 - - 73.65 73.09 62.94 60.73 74.71 74.72
HI 61.90 61.72 61.88 61.70 62.10 61.47 52.10 48.19 - -
UR 49.75 46.11 57.30 56.47 59.04 58.40 - - 57.25 56.34
Table 2: Performance of the LLMs across settings and languages for the XNLI dataset. Underline shows the best F1 score among the four languages across all settings and LLMs. Blue indicates the best F1 score among the three LLMs for each language and setting. Lang.: Language, Acc.: Accuracy, BN: Bangla, EN: English, HI: Hindi, and UR: Urdu

3.2 Experimental Details

3.2.1 Data

In this study, we employed LLMs to evaluate the NLI and classification tasks due to the absence of cross-lingual datasets for other downstream NLP tasks.
We used the most widely known and publicly available datasets, the cross-lingual natural language inference (XNLI) dataset Conneau et al. (2018) and SIB-200 Adelani et al. (2024). XNLI: The dataset covers 15 languages, including low-resource languages such as Urdu and Hindi. The English validation and test sets of MultiNLI were manually translated, while the train set was machine-translated for all languages. The dataset contains 392,702 train, 2,490 validation, and 5,010 test samples. Each data sample contains a premise, a hypothesis, and a corresponding label (entailment, neutral, or contradiction). In our study, we selected the English, Hindi, and Urdu test sets from the XNLI dataset. We also selected a publicly available Bangla XNLI dataset, which was generated by translating the XNLI dataset using an English-to-Bangla translation model Hasan et al. (2020b), which is widely used and the only large-scale study on English-to-Bangla machine translation. The Bangla XNLI dataset contains 381,449 train, 2,419 validation, and 4,895 test samples.

SIB-200: We used the SIB-200 dataset for the classification task. SIB-200 is a simple, inclusive, and big evaluation dataset that covers more than 200 languages Adelani et al. (2024) and is mostly used for topic classification. The dataset has 701 train samples, 99 validation samples, and 204 test samples per language. SIB-200 was developed from the machine-translation corpus Flores-200 Team et al. (2022) and later extended to 203 languages with sentence-level annotation.

The most spoken languages in South Asia (Hindi, Bangla, and Urdu) are used daily by approximately 1.1 billion people (https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers). Despite their widespread use, these languages are considered low-resource in NLP. We included all three languages from the XNLI dataset to assess the LLMs' performance compared to English. Having at least one native speaker of each of these languages among the authors helped in organizing and interpreting the LLMs' responses during the experiments.
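For context, the following is a small sketch of how the evaluated test splits could be loaded with the Hugging Face datasets library; the dataset identifiers and configuration names (e.g., "Davlan/sib200", "ben_Beng") are assumptions about publicly hosted copies rather than the exact files used in this study, and the Bangla XNLI translation is not covered.

# Sketch: loading XNLI and SIB-200 test splits (dataset IDs and configs are assumptions).
from datasets import load_dataset

# XNLI test sets (premise, hypothesis, label) for English, Hindi, and Urdu.
xnli = {lang: load_dataset("xnli", lang, split="test") for lang in ("en", "hi", "ur")}

# SIB-200 test sets (text, category) for the four languages of this study.
sib_configs = {"en": "eng_Latn", "bn": "ben_Beng", "hi": "hin_Deva", "ur": "urd_Arab"}
sib = {lang: load_dataset("Davlan/sib200", cfg, split="test") for lang, cfg in sib_configs.items()}

print(len(xnli["en"]), len(sib["en"]))  # 5,010 and 204 test samples, as reported above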

3.2.2 Prompt Settings

Table 8 in the Appendix shows the prompt settings and the prompts in all languages used in our study for the XNLI dataset. Setting P1 consists of four language-specific prompts, each designed by a native speaker of Bangla, English, Hindi, or Urdu. Settings P2, P3, P4, and P5 were created by translating the prompt from English, Bangla, Urdu, and Hindi, respectively. Class names such as 'Entailment', 'Contradiction', and 'Neutral' were used consistently (in English) for the prompts in all languages. Table 9 in the Appendix shows the analogous prompt settings used for the SIB-200 dataset. We used Google Translate for translating the prompts across the four languages, as sketched below.
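The sketch uses the deep_translator package's GoogleTranslator as an assumed stand-in for the Google Translate interface, keeping the class labels in English; the helper name is illustrative.

# Sketch: producing a translated prompt setting (e.g., P2 = the English prompt
# translated into Bangla, Hindi, and Urdu). The package choice is an assumption.
from deep_translator import GoogleTranslator

ENGLISH_PROMPT = (
    "Classify the following 'premise' and 'hypothesis' into one of the following "
    "classes: 'Entailment', 'Contradiction', or 'Neutral'. "
    "Provide only label as your response."
)

def translate_prompt(prompt: str, source: str, targets: list[str]) -> dict[str, str]:
    """Translate a prompt from `source` into each target language code."""
    translations = {source: prompt}
    for lang in targets:
        translations[lang] = GoogleTranslator(source=source, target=lang).translate(prompt)
    return translations

# Setting P2: the English prompt translated into Bangla, Hindi, and Urdu.
p2_prompts = translate_prompt(ENGLISH_PROMPT, source="en", targets=["bn", "hi", "ur"])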

3.2.3 Post-Processing

Our designed prompts instructed the LLMs to output only the class label (e.g., entailment, neutral, or contradiction) as the response. However, the LLMs often returned extra characters and words along with the label. To handle such responses, we processed the outputs by removing unknown characters and filtering the class labels using regular expressions. When an LLM response contained no class label, we assigned the label "None" when computing the final results.
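A minimal sketch of this post-processing step is shown below; the regular expression and helper name are illustrative, not the exact implementation.

# Sketch: cleaning raw LLM outputs and extracting an XNLI class label.
import re

LABELS = ("entailment", "contradiction", "neutral")
LABEL_RE = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)

def extract_label(raw_response: str) -> str:
    """Strip stray characters and return the first recognized label, else 'None'."""
    cleaned = re.sub(r"[^\w\s/]", " ", raw_response)  # drop punctuation and stray symbols
    match = LABEL_RE.search(cleaned)
    return match.group(1).capitalize() if match else "None"

# Illustrative messy outputs:
# extract_label("label: **Entailment**")  -> "Entailment"
# extract_label("The answer is neutral.") -> "Neutral"
# extract_label("I cannot decide.")       -> "None"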

3.2.4 Evaluation Metrics

We computed accuracy, precision, recall, and F1 scores for all experimental settings to evaluate the effectiveness of LLMs on cross-lingual translated prompts. We computed weighted precision and recall and macro F1 scores to account for class imbalance.
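A short sketch of the metric computation, assuming scikit-learn, is given below; how invalid ("None") predictions are penalized is simplified relative to the procedure described in Section 4.3.

# Sketch: accuracy, weighted precision/recall, and macro F1 with scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Compute the metrics reported in this study for one language/setting."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_weighted": precision_score(y_true, y_pred, average="weighted", zero_division=0),
        "recall_weighted": recall_score(y_true, y_pred, average="weighted", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }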

4 Results and Discussion

In this section, we present and discuss the experimental results of our study. First, we discuss the performance of LLMs in English compared to the low-resource languages (i.e., Bangla, Hindi, and Urdu). Then, we compare the LLMs across the different settings. Finally, we delve into the results of each setting (P1, P2, P3, P4, and P5). Additionally, we provide class-wise F1 scores in Tables 4 and 5, with class-wise precision and recall in Tables 10 and 11 in the Appendix.

Model Lang. P1 P2 P3 P4 P5
Acc. F1macro Acc. F1macro Acc. F1macro Acc. F1macro Acc. F1macro
GPT-4 BN 86.27 84.35 86.76 85.00 - - 85.78 83.80 86.76 84.77
EN 86.27 84.35 - - 87.75 85.45 88.72 86.63 87.74 85.52
HI 84.80 83.01 87.74 86.72 86.76 85.25 85.78 83.29 - -
UR 86.76 84.46 86.76 84.03 85.78 83.02 - - 74.51 68.61
Llama 2 BN 25.33 17.22 40.00 11.43 - - 24.29 5.91 12.86 4.87
EN 58.97 57.42 - - 71.43 67.48 50.00 46.54 69.88 66.68
HI 25 8.20 34.97 22.79 27.09 10.25 23.53 7.86 - -
UR 19.89 5.14 25.00 5.76 23.76 5.59 - - 23.64 5.67
Gemini BN 84.80 83.93 85.29 83.79 - - 82.35 80.80 84.80 82.57
EN 84.80 82.66 - - 83.82 81.75 82.84 80.19 85.29 83.10
HI 83.33 79.75 87.25 85.85 82.35 79.09 83.82 81.80 - -
UR 81.37 77.86 81.37 79.17 86.76 84.10 - - 73.04 67.05
Table 3: Performance of the LLMs across settings and languages for the SIB-200 dataset. Underline shows the best F1 score among the four languages across all settings and LLMs. Blue indicates the best F1 score among the three LLMs for each language and setting. Lang.: Language, Acc.: Accuracy, BN: Bangla, EN: English, HI: Hindi, and UR: Urdu

4.1 English vs Low-resource Languages

The experimental results presented in Tables 2 and 3 indicate that all LLMs exhibit superior performance for English prompts (indicated by Underline) in most of the prompt settings. Note that the P2 setting does not contain English prompts. In the P1 setting (original prompts), GPT-4's English performance exceeds that of Bangla, Hindi, and Urdu by 17.75%, 14.82%, and 22.02%, respectively, on the XNLI dataset. This trend persists across all settings, with English consistently demonstrating superior performance compared to the other languages, even with translated prompts. For the SIB-200 dataset, GPT-4's English prompts also perform at least as well as prompts in the other languages; however, GPT-4's performance on SIB-200 is more balanced across languages than on XNLI. The smaller size of the SIB-200 dataset could contribute to this balanced performance. Although Bangla and Hindi perform better than Urdu, there remains a significant performance gap compared to English on the XNLI dataset. However, the performance differences among the low-resource languages across all settings are very small.

Similarly, Llama 2 performs better in English but poorly in the other languages. The performance gap between English and the other languages is substantial in setting P1, with English outperforming Bangla, Hindi, and Urdu by 22.03%, 20.19%, and 23.20%, respectively, on the XNLI dataset. This trend persists across all settings, with English consistently performing better than the other languages. On the SIB-200 dataset, the performance gap between English and the other three languages in the P1 setting is 40.20, 49.22, and 52.28 points for Bangla, Hindi, and Urdu, respectively. English prompts produce better performance than the other languages in all other settings as well.

In contrast, the performance disparity between English and the other languages is slightly lower for Gemini Pro than for GPT-4 and Llama 2. Gemini Pro performs better in English than in Bangla, Hindi, and Urdu by 11.62%, 11.64%, and 27.25%, respectively, in the P1 setting on the XNLI dataset. This trend continues in the translated prompt settings (P3, P4, and P5), with English consistently outperforming the other languages. For the SIB-200 dataset, English prompts do not yield the best performance in any setting except P5. However, the performance differences among the languages are very small.

Such bias towards English in LLMs can be attributed to several factors. One significant reason is the disproportionate training distribution, with most LLMs predominantly trained on extensive English corpora. For instance, in Llama 2, approximately 90% of the training data is sourced from English, leaving only about 10% to represent the other languages of the world. This skewed training distribution significantly contributes to the observed bias against low-resource languages. The findings underscore the need for substantial improvements in LLMs to ensure better generalization across low-resource languages. The more balanced performance of LLMs on the SIB-200 dataset could be due to several factors. SIB-200 is significantly smaller than XNLI, and its smaller size may have contributed to higher data quality, making it easier for the LLMs to understand and respond.

Model Lang P1 P2 P3 P4 P5
Cont. Ent. Neut. Cont. Ent. Neut. Cont. Ent. Neut. Cont. Ent. Neut. Cont. Ent. Neut.
GPT-4 BN 73.51 66.72 66.91 73.51 66.72 66.91 - - - 76.81 69.93 66.80 74.92 68.59 68.11
EN 90.90 87.56 81.92 - - - 91.22 87.84 82.17 89.83 83.44 72.71 90.91 87.72 81.80
HI 77.85 69.81 68.24 76.97 67.96 67.26 76.06 63.88 66.46 78.30 69.62 67.08 - - -
UR 73.20 61.74 59.37 72.68 63.47 64.03 71.57 61.04 63.82 - - - 73.47 63.32 64.06
Llama 2 BN 06.39 11.91 91.88 09.30 39.02 45.48 - - - 10.90 35.74 61.08 03.70 28.87 68.19
EN 88.36 68.11 19.79 - - - 76.77 61.38 32.20 82.97 62.59 31.95 80.62 63.16 36.56
HI 11.84 14.38 89.48 0.60 0.77 99.61 18.29 24.40 79.99 00.90 00.66 98.02 - - -
UR 29.61 40.41 36.65 39.02 45.60 22.86 24.32 37.55 51.09 - - - 11.00 12.64 85.42
Gemini BN 70.08 59.88 55.25 69.03 54.74 56.72 - - - 68.98 30.09 58.17 68.63 53.30 58.98
EN 82.05 72.53 65.48 - - - 80.61 74.51 64.14 79.23 42.43 60.53 82.53 73.92 67.72
HI 70.37 57.10 57.68 69.67 56.08 59.34 70.57 61.40 52.43 63.41 25.01 56.15 - - -
UR 53.75 20.95 63.62 67.05 50.56 51.79 68.84 50.56 55.80 - - - 66.62 51.16 51.24
Table 4: Class-wise F1macro score for GPT-4, Llama 2, and Gemini across five prompt settings for the XNLI dataset. Lang.: Language, BN: Bangla, EN: English, HI: Hindi, and UR: Urdu, Cont: contradiction, Ent: Entailment, Neut: Neutral.
Model Class P1 P2 P3 P4 P5
BN EN HI UR BN EN HI UR BN EN HI UR BN EN HI UR BN EN HI UR
GPT-4 Ent. 85.71 80.00 75.00 70.97 82.35 - 81.25 75.00 - 82.35 75.00 66.67 78.79 82.35 78.79 - 80.00 82.35 - 68.75
Geo. 76.47 74.29 71.79 81.08 77.78 - 78.95 68.75 - 74.29 77.78 81.08 80.00 74.29 80.00 - 77.78 74.29 - 52.83
Hel. 85.0 75.68 80.95 77.78 75.68 - 83.72 85.00 - 76.92 82.05 76.92 73.68 82.05 73.68 - 76.92 75.68 - 75.68
Pol. 91.53 95.08 93.33 91.53 91.53 - 94.92 94.92 - 95.08 93.33 91.53 89.66 95.08 89.66 - 91.53 95.08 - 23.53
Sci. 94.34 88.89 88.68 88.50 90.74 - 88.46 93.46 - 91.74 88.68 90.09 90.91 92.59 90.91 - 91.59 90.09 - 86.96
Spr. 89.80 89.80 86.96 89.80 90.20 - 91.67 85.71 - 87.50 93.88 85.71 87.50 89.80 87.50 - 87.50 89.80 - 86.79
Tra. 87.06 86.75 84.34 91.57 86.75 - 88.10 85.39 - 90.24 86.05 89.16 86.05 90.24 86.05 - 88.10 91.36 - 85.71
Llama 2 Ent. 0.00 64.52 08.70 0.00 0.00 - 09.52 0.00 - 77.78 10.53 0.00 0.00 40.00 0.00 - 0.00 72.00 - 0.00
Geo. 17.39 42.55 0.00 0.00 0.00 - 09.52 0.00 - 46.81 0.00 0.00 0.00 34.29 0.00 - 0.00 46.51 - 0.00
Hel. 09.09 56.25 0.00 0.00 0.00 - 25.00 0.00 - 66.67 0.00 0.00 0.00 29.63 08.70 - 0.00 75.00 - 0.00
Pol. 27.03 68.97 0.00 0.00 0.0 - 16.67 0.00 - 80.60 06.25 0.00 0.00 57.14 0.0 - 0.00 80.70 - 0.00
Sci. 39.69 64.71 39.83 36.00 57.14 - 45.28 40.32 - 83.05 41.32 13.64 41.35 53.68 38.14 - 34.11 78.10 - 39.67
Spr. 22.22 66.67 0.00 0.00 0.00 - 07.14 0.00 - 87.50 0.00 0.0 0.00 64.86 0.0 - 0.00 89.47 - 0.00
Tra. 05.13 38.30 08.89 0.00 0.00 - 46.43 0.00 - 30.00 13.64 0.0 0.00 46.15 08.16 - 0.00 25.00 - 0.00
Gemini Ent. 72.73 81.25 76.47 59.26 80.00 - 78.79 72.73 - 82.35 56.25 75.00 73.68 77.42 72.73 - 81.08 81.25 - 60.61
Geo. 76.47 68.97 53.33 58.82 77.78 - 78.79 64.71 - 73.33 64.52 64.71 70.59 68.97 68.97 - 66.67 64.52 - 55.00
Hel. 84.21 75.68 78.95 81.08 76.92 - 87.18 75.68 - 70.27 84.21 87.18 75.00 66.67 81.08 - 76.92 82.05 - 79.07
Pol. 96.67 88.89 90.32 88.52 91.53 - 88.52 87.10 - 87.50 88.89 91.80 93.33 90.32 88.89 - 92.06 86.57 - 27.78
Sci. 83.64 86.44 90.91 85.22 88.29 - 90.27 84.48 - 86.18 86.24 90.91 87.76 84.30 85.71 - 89.72 88.50 - 82.88
Spr. 92.31 90.57 86.79 90.20 86.27 - 90.57 88.46 - 87.27 90.20 90.20 87.50 86.79 92.31 - 88.00 92.31 - 77.97
Tra. 81.48 86.84 81.48 81.93 85.71 - 86.84 81.08 - 85.33 83.33 88.89 77.78 86.84 92.93 - 83.54 86.49 - 86.05
Table 5: Class-wise F1macro score for GPT-4, Llama 2, and Gemini across five prompt settings for the SIB-200 dataset. Lang.: Language, BN: Bangla, EN: English, HI: Hindi, and UR: Urdu, Ent: Entertainment, Geo: Geography, Hel: Health, Sci: Science/Technology, Spr: Sports, Tra: Travel.

Detailed class-wise F1 scores are provided in Table 4 and Table 5. The class-wise results are balanced for both GPT-4 and Gemini Pro on the XNLI dataset. In contrast, Llama 2 is biased towards the Neutral class. This behavior may be due to its inability to fully comprehend the prompt, leading to default predictions of the Neutral class without proper analysis. Similarly, for the SIB-200 dataset, GPT-4 and Gemini Pro produce balanced class-wise results, while Llama 2 is biased toward the "Science/Technology" class. The intensity of the bias is higher on the SIB-200 dataset, where Llama 2 could not generate any prediction for many classes across all settings.

In summary, GPT-4 demonstrated superior performance to Llama 2 and Gemini Pro across all settings and languages, showcasing its accuracy in understanding prompts and providing appropriate responses. Gemini Pro performed better than Llama 2 but lacked support for Urdu on the XNLI dataset and struggled with predicting samples containing harmful content. Conversely, Llama 2 faced challenges in understanding prompts in low-resource languages. These findings suggest that GPT-4's multi-lingual capabilities and precise prompt understanding contribute to its effectiveness across language settings.

4.2 Cross-Lingual Translated Prompts

This section discusses the results of the LLMs across languages in each setting. To make valid comparisons among the different settings (i.e., P1, P2, P3, P4, and P5), we only consider the best F1 score achieved among the three LLMs (i.e., GPT-4, Llama 2, and Gemini Pro) for each language. Tables 2 and 3 show the best scores for each language in each prompt setting (indicated by Blue).

On the XNLI dataset, the Hindi-to-Bangla translated prompt in setting P5 achieved a higher F1 score (70.54%) than the original Bangla prompt in setting P1 (69.05%). Similarly, for Urdu, the translated prompt in setting P5 achieved a higher F1 score (66.95%) than the original Urdu prompt (64.77%). For English, the Bangla-to-English translated prompt in setting P3 achieved a higher F1 score (87.08%) than the original English prompt (86.79%). However, for Hindi, the original prompt achieved the highest F1 score compared to the translated prompts. Similarly, on the SIB-200 dataset, setting P5 has a better F1 score than the original Bangla prompt, and translated prompts in settings P4 and P3 have better F1 scores than the original English and Hindi prompts, respectively. However, the original Urdu prompt has a better F1 score than any of the translated prompts. Our results suggest that translating a prompt from one language to another does not significantly affect the prediction capabilities of LLMs.

4.3 Comparison among LLMs

We compare the performance of the different LLMs on the XNLI and SIB-200 datasets. The results showed that GPT-4 and Llama 2 could make predictions for all data samples, while Gemini Pro did not make any predictions for prompts containing harmful content despite the adjusted safety settings. The number of unpredicted samples was very low (ranging from 1 to 4) and negligible. Additionally, the LLMs returned unknown characters and words alongside class labels, which we addressed during post-processing (described in Section 3.2.3). We assigned an inverse class to samples with invalid labels, which reduces the overall performance of the LLMs in the evaluation metric calculations. See Tables 12 and 13 in the Appendix for the total number of invalid labels returned by the LLMs.

Table 2 and Table 3 demonstrate that GPT-4 consistently outperforms Llama 2 and Gemini Pro across all five settings (indicated by Blue) for both datasets, except for setting P3 on the SIB-200 dataset. Further investigation shows that GPT-4 accurately understood the prompts and provided appropriate responses, with minimal unwanted characters and words accompanying the labels. Interestingly, GPT-4 showed superior performance with English prompts across all settings, suggesting its robustness in understanding English prompts compared to the other languages. Moreover, GPT-4's multi-lingual capabilities improved performance in languages such as Bangla, Hindi, and Urdu.

Conversely, Gemini Pro performed better than Llama 2 but not as well as GPT-4. Gemini Pro produced the best performance for Urdu in setting P3 on the SIB-200 dataset. Despite this, Gemini Pro returned many unwanted characters, words, or invalid labels, which affected its overall performance. Additionally, Gemini Pro struggled with predicting samples containing harmful content and lacked support for the Urdu language, further limiting its performance.

In contrast, Llama 2 exhibited poor performance compared to GPT-4 and Gemini Pro. Trained on 90% English data, Llama 2 faced challenges in understanding prompts in low-resource languages such as Bangla, Hindi, and Urdu. Additionally, Llama 2 returned more unwanted characters, words, or invalid labels, significantly impacting its performance during evaluation metric calculations. Moreover, Llama 2 showed dominance in predicting the neutral class over entailment and contradiction classes, indicating its difficulty in accurately predicting sentences belonging to these classes in the XNLI dataset.

5 Conclusion and Future Work

In this study, we evaluated LLMs, including GPT-4, Llama 2, and Gemini Pro, across English and low-resource languages such as Bangla, Hindi, and Urdu, focusing on the NLI and topic classification tasks due to dataset limitations. Despite recent advancements in LLMs, their performance on low-resource languages, particularly those spoken in South Asian countries, remains underexplored. Using zero-shot prompting in five different settings, our findings reveal that GPT-4 consistently outperforms Llama 2 and Gemini Pro, with Gemini Pro performing better than Llama 2, which particularly struggled with low-resource language prompts. Interestingly, English prompts yield superior responses compared to low-resource language prompts, and translating prompts from one language to another occasionally enhances performance. Future research should prioritize collecting low-resource datasets and developing resources for South Asian languages to improve LLM generalization. Incorporating techniques such as few-shot prompting and developing cross-lingual datasets for various tasks would also facilitate more comprehensive evaluations of low-resource languages.

Limitations

We evaluated LLMs on only two tasks (NLI and topic classification), which may not accurately reflect the capabilities of LLMs across a broader range of tasks and datasets. The scarcity of cross-lingual datasets beyond XNLI and SIB-200 restricted us from conducting our study on a broader scale with more tasks and datasets. Additionally, we only utilized zero-shot prompting and did not explore other prompting techniques (e.g., few-shot prompting) that might enhance performance, which may limit the generalization of our findings. We could not explore other prompting techniques due to resource limitations: conducting experiments with LLMs requires premium access, which comes at a considerable cost.

References

  • Adelani et al. (2024) David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2024. Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. Preprint, arXiv:2309.07445.
  • Aggarwal et al. (2022) Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. Indicxnli: Evaluating multilingual inference for indian languages. arXiv preprint arXiv:2204.08776.
  • Ahuja et al. (2023) Kabir Ahuja, Rishav Hada, Millicent Ochieng, Prachi Jain, Harshita Diddee, Samuel Maina, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, et al. 2023. Mega: Multilingual evaluation of generative ai. arXiv preprint arXiv:2303.12528.
  • Akter et al. (2023) Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, and Graham Neubig. 2023. An in-depth look at gemini’s language abilities. arXiv preprint arXiv:2312.11444.
  • Asai et al. (2023) Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2023. Buffet: Benchmarking large language models for few-shot cross-lingual transfer. arXiv preprint arXiv:2305.14857.
  • Badjatiya et al. (2017) Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th international conference on World Wide Web companion, pages 759–760.
  • Bender (2019) Emily Bender. 2019. The# benderrule: On naming the languages we study and why it matters. The Gradient, 14:34.
  • Bhattacharjee et al. (2021) Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, Md Saiful Islam, M. Sohel Rahman, Anindya Iqbal, and Rifat Shahriyar. 2021. Banglabert: Combating embedding barrier in multilingual models for low-resource language understanding. Preprint, arXiv:2101.00204.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chakravarthy et al. (2020) Sharanya Chakravarthy, Anjana Umapathy, and Alan W Black. 2020. Detecting entailment in code-mixed hindi-english conversations. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 165–170.
  • Chhabra and Vishwakarma (2023) Anusha Chhabra and Dinesh Kumar Vishwakarma. 2023. A literature survey on multimodal and multilingual automatic hate speech identification. Multimedia Systems, pages 1–28.
  • Chung et al. (2023) Willy Chung, Samuel Cahyawijaya, Bryan Wilie, Holy Lovenia, and Pascale Fung. 2023. Instructtods: Large language models for end-to-end task-oriented dialogue systems. arXiv preprint arXiv:2310.08885.
  • Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  • Dou et al. (2019) Zi-Yi Dou, Keyi Yu, and Antonios Anastasopoulos. 2019. Investigating meta-learning algorithms for low-resource natural language understanding tasks. arXiv preprint arXiv:1908.10423.
  • García-Díaz et al. (2023) José Antonio García-Díaz, Ronghao Pan, and Rafael Valencia-García. 2023. Leveraging zero and few-shot learning for enhanced model generality in hate speech detection in spanish and english. Mathematics, 11(24):5004.
  • Giampiccolo et al. (2007) Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and William B Dolan. 2007. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9.
  • Gubelmann et al. (2023) Reto Gubelmann, Aikaterini-Lida Kalouli, Christina Niklaus, and Siegfried Handschuh. 2023. When truth matters-addressing pragmatic categories in natural language inference (nli) by large language models (llms). In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (* SEM 2023), pages 24–39.
  • Hasan et al. (2023a) Md Arid Hasan, Firoj Alam, Anika Anjum, Shudipta Das, and Afiyat Anjum. 2023a. Blp-2023 task 2: Sentiment analysis. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), pages 354–364.
  • Hasan et al. (2023b) Md Arid Hasan, Shudipta Das, Afiyat Anjum, Firoj Alam, Anika Anjum, Avijit Sarker, and Sheak Rashed Haider Noori. 2023b. Zero-and few-shot prompting with llms: A comparative study with fine-tuned models for bangla sentiment analysis. arXiv preprint arXiv:2308.10783.
  • Hasan et al. (2020a) Md Arid Hasan, Jannatul Tajrin, Shammur Absar Chowdhury, and Firoj Alam. 2020a. Sentiment classification in bangla textual content: A comparative study. In 2020 23rd international conference on computer and information technology (ICCIT), pages 1–6. IEEE.
  • Hasan et al. (2020b) Tahmid Hasan, Abhik Bhattacharjee, Kazi Samin, Masum Hasan, Madhusudan Basak, M Sohel Rahman, and Rifat Shahriyar. 2020b. Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for bengali-english machine translation. arXiv preprint arXiv:2009.09359.
  • Hedderich et al. (2020) Michael A Hedderich, Lukas Lange, Heike Adel, Jannik Strötgen, and Dietrich Klakow. 2020. A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint arXiv:2010.12309.
  • Hee et al. (2024) Ming Shan Hee, Shivam Sharma, Rui Cao, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, and Roy Ka-Wei Lee. 2024. Recent advances in hate speech moderation: Multimodality and the role of large models. arXiv preprint arXiv:2401.16727.
  • Huang et al. (2023) Haoyang Huang, Tianyi Tang, Dongdong Zhang, Wayne Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in llms: Improving multilingual capability by cross-lingual-thought prompting. arXiv preprint arXiv:2305.07004.
  • Huang et al. (2020) Yi Huang, Junlan Feng, Shuo Ma, Xiaoyu Du, and Xiaoting Wu. 2020. Towards low-resource semi-supervised dialogue generation with meta-learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4123–4128.
  • Islam et al. (2021) Khondoker Ittehadul Islam, Sudipta Kar, Md Saiful Islam, and Mohammad Ruhul Amin. 2021. Sentnob: A dataset for analysing sentiment on noisy bangla texts. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3265–3271.
  • Jahan and Oussalah (2023) Md Saroar Jahan and Mourad Oussalah. 2023. A systematic review of hate speech automatic detection using natural language processing. Neurocomputing, page 126232.
  • Kabir et al. (2023) Mohsinul Kabir, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, M Saiful Bari, and Enamul Hoque. 2023. Benllmeval: A comprehensive evaluation into the potentials and pitfalls of large language models on bengali nlp. arXiv preprint arXiv:2309.13173.
  • Kew et al. (2023) Tannon Kew, Florian Schottmann, and Rico Sennrich. 2023. Turning english-centric llms into polyglots: How much multilinguality is needed? arXiv preprint arXiv:2312.12683.
  • Khan et al. (2024) Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M Khapra, et al. 2024. Indicllmsuite: A blueprint for creating pre-training and fine-tuning datasets for indian languages. arXiv preprint arXiv:2403.06350.
  • Koto et al. (2024) Fajri Koto, Tilman Beck, Zeerak Talat, Iryna Gurevych, and Timothy Baldwin. 2024. Zero-shot sentiment analysis in low-resource languages using a multilingual sentiment lexicon. arXiv preprint arXiv:2402.02113.
  • Kumar and Albuquerque (2021) Akshi Kumar and Victor Hugo C Albuquerque. 2021. Sentiment analysis using xlm-r transformer and zero-shot transfer learning on resource-poor indian language. Transactions on Asian and Low-Resource Language Information Processing, 20(5):1–13.
  • Liu et al. (2023) Xiaoyi Liu, Mao Teng, Shuangtao Yang, and Bo Fu. 2023. Knowdee at blp-2023 task 2: Improving bangla sentiment analysis using ensembled models with pseudo-labeling. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), pages 273–278.
  • Lu et al. (2023) Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255.
  • Lyu et al. (2023) Chenyang Lyu, Jitao Xu, and Longyue Wang. 2023. New trends in machine translation using large language models: Case examples with chatgpt. arXiv preprint arXiv:2305.01181.
  • Miao et al. (2023) Ning Miao, Yee Whye Teh, and Tom Rainforth. 2023. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning. arXiv preprint arXiv:2308.00436.
  • Muhammad and Burney (2023) Khalid Bin Muhammad and SM Aqil Burney. 2023. Innovations in urdu sentiment analysis using machine and deep learning techniques for two-class classification of symmetric datasets. Symmetry, 15(5):1027.
  • Pahwa and Pahwa (2023) Bhavish Pahwa and Bhavika Pahwa. 2023. Bphigh at semeval-2023 task 7: Can fine-tuned cross-encoders outperform gpt-3.5 in nli tasks on clinical trial data? In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 1936–1944.
  • Patra et al. (2018) Braja Gopal Patra, Dipankar Das, and Amitava Das. 2018. Sentiment analysis of code-mixed indian languages: An overview of sail_code-mixed shared task@ icon-2017. arXiv preprint arXiv:1803.06745.
  • Pu et al. (2023) Xiao Pu, Mingqi Gao, and Xiaojun Wan. 2023. Summarization is (almost) dead. arXiv preprint arXiv:2309.09558.
  • Qin et al. (2023) Libo Qin, Qiguang Chen, Fuxuan Wei, Shijue Huang, and Wanxiang Che. 2023. Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages. arXiv preprint arXiv:2310.14799.
  • Rahman et al. (2017) Yeasin Ar Rahman, Mahtabul Alam Sohan, Khalid Ibn Zinnah, and Mohammed Moshiul Hoque. 2017. A framework for building a natural language interface for bangla. In 2017 international conference on electrical, computer and communication engineering (ecce), pages 935–940. IEEE.
  • Rane (2023) Nitin Rane. 2023. Enhancing mathematical capabilities through chatgpt and similar generative artificial intelligence: Roles and challenges in solving mathematical problems. Available at SSRN 4603237.
  • Robinson et al. (2023) Nathaniel R Robinson, Perez Ogayo, David R Mortensen, and Graham Neubig. 2023. Chatgpt mt: Competitive for high-(but not low-) resource languages. arXiv preprint arXiv:2309.07423.
  • Sitaram et al. (2023) Sunayana Sitaram, Monojit Choudhury, Barun Patra, Vishrav Chaudhary, Kabir Ahuja, and Kalika Bali. 2023. Everything you need to know about multilingual llms: Towards fair, performant and reliable models for languages of the world. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 21–26.
  • Sun et al. (2023) Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, and Guoyin Wang. 2023. Sentiment analysis through llm negotiations. arXiv preprint arXiv:2311.01876.
  • Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  • Tan et al. (2023) Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen, and Guilin Qi. 2023. Can chatgpt replace traditional kbqa models? an in-depth analysis of the question answering performance of the gpt llm family. In International Semantic Web Conference, pages 348–367. Springer.
  • Team et al. (2022) NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. Preprint, arXiv:2207.04672.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • White et al. (2023) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382.
  • Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • Xing (2024) Frank Xing. 2024. Designing heterogeneous llm agents for financial sentiment analysis. arXiv preprint arXiv:2401.05799.
  • Xu et al. (2023) Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023. A paradigm shift in machine translation: Boosting translation performance of large language models. arXiv preprint arXiv:2309.11674.
  • Yang et al. (2023) Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Shaochen Zhong, Bing Yin, and Xia Hu. 2023. Harnessing the power of llms in practice: A survey on chatgpt and beyond. ACM Transactions on Knowledge Discovery from Data.
  • Yu et al. (2018) Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. arXiv preprint arXiv:1805.07513.
  • Zhang et al. (2023a) Haopeng Zhang, Xiao Liu, and Jiawei Zhang. 2023a. Summit: Iterative text summarization via chatgpt. arXiv preprint arXiv:2305.14835.
  • Zhang et al. (2024) Min Zhang, Jianfeng He, Taoran Ji, and Chang-Tien Lu. 2024. Don’t go to extremes: Revealing the excessive sensitivity and calibration limitations of llms in implicit hate speech detection. arXiv preprint arXiv:2402.11406.
  • Zhang et al. (2023b) Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023b. Sentiment analysis in the era of large language models: A reality check. arXiv preprint arXiv:2305.15005.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
  • Zhuang et al. (2023) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. 2023. Toolqa: A dataset for llm question answering with external tools. arXiv preprint arXiv:2306.13304.
  • Zimmerman et al. (2018) Steven Zimmerman, Udo Kruschwitz, and Chris Fox. 2018. Improving hate speech detection with deep learning ensembles. In Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018).

Appendix A Prompt and Safety Settings

In this section, we present the prompts utilized in the study. Table 7 shows the zero-shot prompt design used for the SIB-200 dataset. Consistency was maintained across all settings, with the only variation being the language of the prompt. Tables 8 and 9 show the prompts used in our study for the XNLI and SIB-200 datasets, respectively. Table 6 shows the safety settings used for Gemini Pro across all prompt settings to prevent it from blocking predictions for harmful content.

Category Threshold
HARM_CATEGORY_HARASSMENT BLOCK_NONE
HARM_CATEGORY_HATE_SPEECH BLOCK_NONE
HARM_CATEGORY_SEXUALLY_EXPLICIT BLOCK_NONE
HARM_CATEGORY_DANGEROUS_CONTENT BLOCK_NONE
HARM_CATEGORY_SEXUAL BLOCK_NONE
HARM_CATEGORY_DANGEROUS BLOCK_NONE
Table 6: Safety setting used for Gemini Pro model to prevent blocking the predictions for harmful content.
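The sketch below shows how the Table 6 settings could be applied when querying Gemini Pro through the google.generativeai SDK; exact category support varies by SDK and model version, so the two legacy categories (SEXUAL, DANGEROUS) are omitted here and the client details are assumptions.

# Sketch: querying Gemini Pro with relaxed safety thresholds (cf. Table 6).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

SAFETY_SETTINGS = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]

model = genai.GenerativeModel("gemini-pro", safety_settings=SAFETY_SETTINGS)

def classify_with_gemini(prompt: str) -> str:
    """Return the model's text response, or 'None' if the prompt is still blocked."""
    response = model.generate_content(prompt)
    try:
        return response.text.strip()
    except ValueError:  # raised when all candidates are blocked despite BLOCK_NONE
        return "None"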
Model Prompt Template
GPT-4 [ { 'role': 'user', 'content': "Classify the following 'text' into one of the following classes: 'science/technology', 'travel', 'politics', 'sports', 'health', 'entertainment', or 'geography'. Provide only class as your response." text: [TEXT] label: }, { 'role': 'system', 'content': "You are an expert data annotator and your task is to analyze the text and find the appropriate output that is defined in the user content." } ]
Llama 2 and Gemini Classify the following ’text’ into one of the following classes: ’science/technology’, ’travel’, ’politics’, ’sports’, ’health’, ’entertainment’, or ’geography’. Provide only class as your response. text: [TEXT] label:
Table 7: Prompts used for zero-shot learning on SIB-200 dataset
P1-P5 Settings: [prompt images for BN, EN, HI, and UR]
Table 8: Setting P1 shows the prompts provided by native speakers for the XNLI dataset. Settings P2, P3, P4, and P5 show the prompts translated from English, Bangla, Urdu, and Hindi, respectively. English class names are used as labels for all languages. BN: Bangla, EN: English, HI: Hindi, and UR: Urdu.
P1-P5 Settings: [prompt images for BN, EN, HI, and UR]
Table 9: Setting P1 shows the prompts provided by native speakers for the SIB-200 dataset. Settings P2, P3, P4, and P5 show the prompts translated from English, Bangla, Urdu, and Hindi, respectively. English class names are used as labels for all languages. BN: Bangla, EN: English, HI: Hindi, and UR: Urdu.

Appendix B Detailed Experimental Results

This section presents the detailed experimental results for all three LLMs across the five prompt settings. Tables 10 and 11 show class-wise precision and recall for GPT-4, Llama 2, and Gemini Pro across the different prompt settings. Tables 12 and 13 show the total number of invalid labels returned by the LLMs.

Lang. Class P1 P2 P3 P4 P5
P. R. P. R. P. R. P. R. P. R.
GPT-4
BN Cont 89.45 62.39 89.45 62.39 - - 86.48 69.08 86.19 66.26
Ent 86.13 54.45 86.13 54.45 - - 79.80 62.23 71.69 56.71
Neut 53.50 89.29 53.50 89.29 - - 56.92 80.84 56.60 85.50
EN Cont 92.45 89.40 - - 92.87 89.64 86.56 93.35 93.26 88.68
Ent 88.25 86.88 - - 87.84 87.84 79.26 88.08 88.20 87.25
Neut 80.02 83.90 - - 80.79 83.59 81.23 65.81 79.49 84.25
HI Cont 89.95 68.62 91.08 66.65 90.78 65.45 87.98 70.54 - -
Ent 83.00 60.24 83.12 57.49 84.60 51.32 80.02 61.62 - -
Neut 56.70 85.69 54.96 86.65 53.10 88.80 57.02 81.44 - -
UR Cont 67.54 79.88 82.67 64.85 84.57 62.04 - - 79.22 68.50
Ent 74.89 52.51 76.24 54.37 79.86 49.40 - - 79.86 49.40
Neut 56.28 62.81 53.33 80.12 51.27 84.49 - - 51.27 84.49
Llama 2
BN Cont 06.61 06.20 14.95 06.75 - - 15.28 08.47 04.82 03.01
Ent 11.08 12.88 86.13 54.45 - - 26.80 53.65 22.21 41.20
Neut 97.01 87.27 98.57 29.56 - - 99.17 44.12 99.41 51.90
EN Cont 86.78 90.00 - - 73.25 80.66 85.90 80.24 77.65 83.83
Ent 52.84 95.81 - - 49.05 81.98 48.87 87.01 50.69 83.77
Neut 76.00 11.38 - - 86.84 19.76 71.91 20.54 86.58 23.17
HI Cont 11.88 11.80 0.60 0.60 19.10 17.54 0.89 0.89 - -
Ent 13.22 15.75 0.77 0.78 20.92 29.28 0.66 0.66 - -
Neut 99.56 81.26 100.00 99.22 98.68 67.25 99.45 99.45 - -
UR Cont 33.14 26.77 48.66 32.57 28.73 21.08 - - 10.91 11.08
Ent 30.50 59.88 33.17 72.93 28.56 54.79 - - 11.44 14.13
Neut 98.43 22.51 98.18 12.93 98.97 34.43 - - 99.68 74.73
Gemini
BN Cont 63.32 78.47 64.60 74.11 - - 61.35 78.77 68.01 69.26
Ent 66.25 54.63 67.48 46.05 - - 82.42 18.40 68.56 43.59
Neut 57.12 53.49 52.59 61.54 - - 48.58 72.46 51.41 69.16
EN Cont 76.97 87.84 - - 74.27 88.14 74.36 84.78 82.31 82.75
Ent 73.79 71.32 - - 73.53 75.51 72.78 29.94 88.20 87.25
Neut 69.44 61.95 - - 72.87 57.28 51.16 74.12 79.49 84.25
HI Cont 68.03 72.87 69.40 69.94 63.55 79.34 63.83 62.99 - -
Ent 68.92 48.74 68.24 47.60 64.16 58.86 63.57 15.57 - -
Neut 52.45 64.07 52.59 68.08 57.65 48.08 43.96 77.72 - -
UR Cont 42.93 71.86 59.34 77.07 64.72 73.52 - - 57.93 78.37
Ent 42.06 13.95 63.64 41.94 67.00 40.60 - - 66.70 41.50
Neut 63.80 63.45 50.75 52.88 50.07 63.01 - - 50.61 51.89
Table 10: Detailed Class-wise result for GPT-4, Llama 2, and Gemini Pro for XNLI dataset. Lang: Language, P: Precision, R: Recall, Cont: Contradiction, Ent: Entailment, Neut: Neutral.
Lang. Class P1 P2 P3 P4 P5
P. R. P. R. P. R. P. R. P. R.
GPT-4
BN Entertainment 93.75 78.95 93.33 73.68 - - 92.86 68.42 87.50 73.68
Geography 76.47 76.47 73.68 82.35 - - 77.78 82.35 73.68 82.35
Health 94.44 77.27 93.33 63.64 - - 92.86 86.67 88.24 68.18
Politics 93.10 90.00 93.10 90.00 - - 92.86 86.67 93.10 90.00
Science/Technology 90.91 98.04 85.96 96.08 - - 84.75 98.04 87.50 96.08
Sports 91.67 88.00 88.46 92.00 - - 91.30 84.00 91.30 84.00
Travel 82.22 92.50 83.72 90.00 - - 80.43 92.50 84.09 92.50
EN Entertainment 87.50 73.68 - - 93.33 73.68 93.33 73.68 93.33 73.68
Geography 72.22 76.47 - - 72.22 76.47 72.22 76.47 72.22 76.47
Health 93.33 63.64 - - 88.24 68.18 94.12 72.73 93.33 63.64
Politics 93.55 96.67 - - 93.55 96.67 93.55 96.67 93.55 96.67
Science/Technology 84.21 94.12 - - 86.21 98.04 87.72 98.04 83.33 98.04
Sports 91.67 88.00 - - 91.30 84.00 91.67 88.00 91.67 88.00
Travel 83.72 90.00 - - 88.10 92.50 88.10 88.10 90.24 92.50
HI Entertainment 92.31 63.16 100.00 68.42 92.31 63.16 91.67 57.89 - -
Geography 63.64 82.35 71.43 88.24 73.68 82.35 66.67 94.12 - -
Health 85.00 77.27 85.71 81.82 94.12 72.73 84.21 72.73 - -
Politics 93.33 93.33 96.55 93.33 93.33 93.33 93.10 90.00 - -
Science/Technology 85.45 92.16 86.79 90.20 85.45 92.16 90.57 94.12 - -
Sports 95.24 80.00 95.65 88.00 95.83 92.00 90.91 80.00 - -
Travel 81.40 87.50 84.09 92.50 80.43 92.50 82.22 92.50 - -
UR Entertainment 91.67 57.89 92.31 63.16 90.91 52.63 - - 84.62 57.89
Geography 75.00 88.24 73.33 64.71 75.00 88.24 - - 38.89 82.35
Health 100.00 63.64 94.44 77.27 88.24 68.18 - - 93.33 63.64
Politics 93.10 90.00 96.55 93.33 93.10 90.00 - - 100.00 13.33
Science/Technology 80.65 98.04 89.29 98.04 83.33 98.04 - - 78.12 98.04
Sports 91.67 88.00 87.50 84.00 87.50 84.00 - - 82.14 92.00
Llama 2
BN Entertainment 0.00 0.00 0.00 0.00 - - 0.00 0.00 0.00 0.00
Geography 25.00 13.33 0.00 0.00 - - 0.00 0.00 0.00 0.00
Health 25.00 5.56 0.0 0.00 - - 0.00 0.00 0.00 0.00
Politics 31.25 23.81 0.00 0.00 - - 0.00 0.00 0.00 0.00
Science/Technology 27.08 74.29 04.00 10.00 - - 26.71 91.49 25.29 52.38
Sports 33.33 16.67 0.00 0.00 - - 0.00 0.00
Travel 09.09 03.57 0.00 0.00 - - 0.00 0.00
EN Entertainment 76.92 55.56 - - 77.78 77.78 83.33 26.32 69.23 75.00
Geography 33.33 58.82 - - 36.67 64.71 33.33 35.29 35.71 66.67
Health 81.82 42.86 - - 85.71 54.55 80.00 18.18 92.31 63.16
Politics 68.97 68.97 - - 71.05 93.10 100.00 40.00 71.88 92.00
Science/Technology 51.76 86.27 - - 73.13 96.08 36.69 100.00 74.55 82.00
Sports 92.86 52.00 - - 91.30 84.00 100.00 48.00 94.44 85.00
Travel 69.23 26.47 - - 100.00 17.65 100.00 30.00 57.14 16.00
HI Entertainment 25.00 05.26 33.33 05.56 100.00 05.56 0.00 0.00 - -
Geography 0.00 0.00 25.00 05.88 0.00 0.00 0.00 0.00 - -
Health 0.00 0.00 40.00 18.18 0.00 0.00 100.00 04.55 - -
Politics 0.00 0.00 50.00 100.00 50.00 03.33 24.32 88.24 - -
Science/Technology 25.26 94.12 29.81 94.12 26.18 98.04 24.32 88.24 - -
Sports 0.00 0.00 33.33 04.00 0.00 0.00 0.00 0.00 - -
Travel 40.00 05.00 81.25 32.50 75.00 07.50 22.22 05.00 - -
UR Entertainment 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Geography 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Health 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Politics 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Science/Technology 23.08 81.82 25.38 98.04 24.62 96.00 - - 25.00 96.00
Sports 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Travel 0.00 0.00 0.00 0.00 0.00 0.00 - - 0.00 0.00
Gemini
BN Entertainment 85.71 63.16 87.50 73.68 - - 73.68 73.68 83.33 78.95
Geography 76.47 76.47 73.68 82.35 - - 70.59 70.59 68.75 64.71
Health 100.00 72.73 88.24 68.18 - - 83.33 68.18 88.24 68.18
Politics 96.67 96.67 93.10 90.00 - - 93.33 93.33 87.88 96.67
Science/Technology 77.97 90.20 81.67 96.08 - - 91.49 84.31 85.71 94.12
Sports 88.89 96.00 84.62 88.00 - - 91.30 84.00 88.00 88.00
Travel 80.49 82.50 89.19 82.50 - - 70.00 87.50 84.62 82.50
EN Entertainment 100.00 68.42 - - 93.33 73.68 100.00 63.16 100.00 68.42
Geography 83.33 58.82 - - 84.62 64.71 83.33 58.82 71.43 58.82
Health 93.33 63.64 - - 86.67 59.09 85.71 54.55 94.12 72.73
Politics 84.85 93.33 - - 82.35 93.33 87.50 93.33 78.38 96.67
Science/Technology 76.12 100.00 - - 77.27 100.00 72.86 100.00 80.65 98.04
Sports 85.71 96.00 - - 84.62 88.00 82.14 92.00 88.89 96.00
Travel 91.67 82.50 - - 91.43 80.00 91.67 82.50 94.12 80.00
HI Entertainment 86.67 68.42 92.86 68.42 69.23 47.37 85.71 63.16 - -
Geography 61.54 47.06 81.25 76.47 71.43 58.82 83.33 58.82 - -
Health 93.75 68.18 100.00 77.27 100.00 72.73 100.00 68.18 - -
Politics 87.50 93.33 87.10 90.00 84.85 93.33 84.85 93.33 - -
Science/Technology 84.75 98.04 82.26 100.00 81.03 92.16 78.69 94.12 - -
Sports 82.14 92.00 85.71 96.00 88.46 92.00 88.89 96.00 - -
Travel 80.49 82.50 91.67 82.50 79.55 87.50 80.95 85.00 - -
UR Entertainment 100.00 42.11 85.71 63.16 92.31 63.16 - - 71.43 52.63
Geography 58.82 85.82 64.71 64.71 64.71 64.71 - - 47.83 64.71
Health 100.00 68.18 93.33 63.64 100.00 77.71 - - 80.95 77.27
Politics 87.10 90.00 84.38 63.64 92.32 93.33 - - 83.33 16.67
Science/Technology 76.56 96.08 75.38 96.08 84.75 98.04 - - 76.67 90.20
Sports 88.46 92.00 85.19 92.00 88.46 92.00 - - 67.65 92.00
Travel 79.07 85.00 88.24 75.00 87.80 0.9 - - 80.43 92.50
Table 11: Detailed class-wise results (precision and recall) for GPT-4, Llama 2, and Gemini Pro on the SIB-200 dataset. Lang: Language, P: Precision, R: Recall.
Model Lang. P1 P2 P3 P4 P5
GPT-4 BN 1 13 - 2 48
EN 0 - 0 0 0
HI 9 13 17 7 -
UR 43 53 18 - 130
Llama 2 BN 4147 1678 - 2195 2827
EN 19 - 619 322 588
HI 4010 4966 3306 4923 -
UR 1308 790 1917 - 3872
Gemini BN 142 149 - 68 127
EN 138 - 143 80 129
HI 111 131 138 133 -
UR 1330 105 85 - 77
Table 12: Number of invalid labels returned by LLMs across the settings, models, and languages for the XNLI dataset. Lang.: language.
Model Lang. P1 P2 P3 P4 P5
GPT-4 BN 0 1 - 0 0
EN 0 - 0 2 1
HI 1 0 1 2 -
UR 0 1 1 - 3
Llama 2 BN 12 0 - 0 15
EN 46 - 6 17 95
HI 12 13 7 14 -
UR 27 3 8 - 11
Gemini BN 1 2 - 2 3
EN 4 - 2 8 2
HI 0 1 3 3 -
UR 3 2 3 - 12
Table 13: Number of invalid labels returned by LLMs across the settings, models, and languages for the SIB-200 dataset. Lang.: language.