Changing Answer Order Can Decrease MMLU Accuracy
Abstract
As large language models (LLMs) have grown in prevalence, benchmarks have become essential for evaluating these models and understanding their capabilities. Most commonly, test accuracy averaged across multiple subtasks is used to rank models on leaderboards and to determine which model is best for a given purpose. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
1 Introduction
† This work was done during their internships at FAIR, Meta.

One of the largest outstanding issues with interpreting the results of model evaluation pertains to the robustness of accuracy measurements. NLP model accuracy has been shown to be fairly brittle: for example, accuracy can drop when researchers apply input alterations based on paraphrasing (Gan and Ng, 2019), word order changes (Gauthier and Levy, 2019; Ribeiro et al., 2020; Sinha et al., 2021a, 2022; Allen-Zhu and Li, 2023a, b; Berglund et al., 2023; Golovneva et al., 2024; Kitouni et al., 2024), or other minor, largely meaning-preserving input variations or perturbations (Belinkov and Bisk, 2018; Ebrahimi et al., 2018; Jiang et al., 2020; Gao et al., 2021; Li et al., 2021; Sinha et al., 2021b; Moradi and Samwald, 2021; Papakipos and Bitton, 2022; Qian et al., 2022; Goodarzi et al., 2023; Sinha et al., 2023). If many models fail to be robust on a benchmark, regardless of their initially measured accuracy, we may need to reconsider how we use it as the basis for a leaderboard that ranks models.
While there are many approaches to investigating robustness, our approach relies on the intuition that a test-taker, human or model, should always select the right answer regardless of its label, i.e., whether it is listed as answer ‘A’ or ‘C’. Of course, if the right answer is unknown to the test-taker and they make an uneducated guess, they may still happen upon the right answer by chance; but, in an ideal scenario, a true expert should achieve the same score when tested multiple times on versions of a test in which only the order of the presented answers changes.
In humans, this performance stability, often called test-retest reliability, is an important consideration for determining how to interpret the results of a test (Bland and Altman, 1986). Human test scores can fluctuate over time because they are filtered through irrelevant mental or physical factors that affect measurement (Spearman, 1910; Dunlap, 1933). Such uninformative fluctuations can affect multiple choice tests, for example, when answers are presented in a different order during retest (Krosnick and Fabrigar, 1991; Tellinghuisen and Sulikowski, 2008; Lions et al., 2022). However, as models do not have the biological limitations of humans, we may expect them to exhibit less variation than humans, or possibly even none at all. Thus, we claim that a model should be robust to answer order changes: if it gets the correct answer to a question when that answer is labeled ‘A’, it should also always get the correct answer when it is labeled ‘C’. Put another way, the model should select the same answer for each question, regardless of its label, for every possible version of a benchmark; its accuracy should be static between test and retest.
In our work, we ask whether shuffling the order of the answer label contents, while leaving the order of the labels (A, B, C, D) the same, affects the measurement of accuracy. We focus our investigation on the MMLU dataset, a popular dataset included on the widely used Hugging Face Open LLM Leaderboard (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which runs with the Eleuther LM Evaluation Harness (Gao et al., 2023) as its backend.
Testing top performers on the Open LLM Leaderboard, we find that all ten models are affected by our answer shuffling. This indicates serious non-robustness in benchmarking with MMLU. To better rank models on a leaderboard with the MMLU dataset, we may want to evaluate additional random shuffles of label contents to better understand the extent to which a model can genuinely output the correct answer.
2 Methods
2.1 MMLU
Massive Multitask Language Understanding (MMLU) is a commonly used benchmark for evaluating LLMs (Hendrycks et al., 2021). It is intended to test a model’s world knowledge and problem solving ability, and consists of 57 tasks. Each example in MMLU consists of a question paired with four possible answers, only one of which is correct. Answers are a concatenation of an answer label, denoted as a letter, with answer contents (a string of characters). To test the robustness of models to answer choice ordering, we shuffle the answer contents across labels, with the constraints that the answer contents themselves remain unchanged and that the ordering of the MMLU answer labels (A, B, C, D) is preserved across evaluation runs, for example:
| original | a possible shuffle |
|---|---|
| A. 1 | A. 4 |
| B. 2 | B. 2 |
| C. 3 | C. 1 |
| D. 4 | D. 3 |
We can think of the original order of answer contents in each MMLU example as one of the 24 possible shuffles of that example. Given the size of the MMLU dataset, it is not efficient to run all possible shuffles (each example has 24 possible orderings and there are nearly 14 thousand questions). For a tractable exploration, we instead create two shuffled copies of MMLU, each generated with a different random seed, where each example is assigned one of its 24 possible answer content orders; this yields semantically equivalent versions of MMLU. We utilize the original MMLU implementation (Hendrycks et al., 2021), which uses 5-shot in-context learning during evaluation.
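To make the shuffling procedure concrete, the following minimal sketch generates a shuffled copy of an MMLU-style example under a fixed random seed. The example dictionary layout and field names (`question`, `choices`, `answer`) are illustrative assumptions, not the official dataset schema or our exact evaluation code.

```python
import random

def shuffle_example(example, rng):
    """Return a copy of an MMLU-style example with its answer contents
    shuffled across the fixed labels A-D.

    Assumed (illustrative) format: {'question': str,
    'choices': [str, str, str, str], 'answer': int index of correct choice}.
    """
    order = list(range(len(example["choices"])))
    rng.shuffle(order)  # order[k] = original index of the content now at slot k
    shuffled_choices = [example["choices"][j] for j in order]
    # The correct content itself is unchanged; only the label it sits under moves.
    new_answer = order.index(example["answer"])
    return {"question": example["question"],
            "choices": shuffled_choices,
            "answer": new_answer}

def shuffle_dataset(examples, seed):
    """Create one shuffled copy of the benchmark, controlled by a random seed."""
    rng = random.Random(seed)
    return [shuffle_example(ex, rng) for ex in examples]
```

Evaluating the model on the original order and on two such copies (one per seed) produces the runs that our metric, defined next, compares.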
2.2 Metrics
In essence, we adopt a simplification of the classic formulation of test-retest repeatability from Bland and Altman (1986) to match the ML leaderboard setting: an evaluation (the running of a test on a model) is deemed perfectly stable if and only if the measurement obtained from one run is exactly reproduced when the test is repeated later under the same conditions. We minimally alter the testing conditions when we repeat the test to measure robustness, by changing the order of answer contents, while all other testing parameters remain static. In our setting, the number of test takers is 1.
In simple terms, this metric measures how often the model answers a question correctly in both the original and the shuffled versions. If the model is truly robust, it will select the right answer no matter where that answer appears, since the answer’s meaning does not change when only its label and position in the answer list change. If the model’s accuracy does change in this setting, this suggests that the model is not genuinely competent at the task the test is measuring.
To quantify (non-)robustness to answer order shuffling, we define a new metric that measures how often the model answers the same question correctly in both the original and a shuffled version of MMLU, averaged over all the shuffles performed:

$$\text{Our Metric} = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N} \sum_{i=1}^{N} a_i \cdot a_i^{(s)} \quad (1)$$

where $a_i$ indicates whether the model answers question $i$ correctly in the original MMLU dataset (1 if correct, 0 if incorrect), $a_i^{(s)}$ indicates whether the model answers question $i$ correctly in the $s$-th shuffled version of the answer label contents, $S$ is the total number of shuffles in the scope of the experiment (for us 2), and $N$ is the dataset size.
As formulated, our metric tries to capture the true capabilities of the model by reducing the contribution of questions answered correctly by random chance. Assuming models do not have external memory of earlier queries, requiring the model to correctly identify the answer multiple times (for us, twice: once in the original and once in a shuffled version) noticeably lowers the chance of it happening upon the correct answer by chance.
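As a concrete illustration of Equation 1, the sketch below computes the metric from per-question 0/1 correctness indicators. The function name and the array-based input format are assumptions made for illustration, not our exact evaluation code.

```python
import numpy as np

def retest_metric(original_correct, shuffled_correct_runs):
    """Average, over shuffles, of the fraction of questions answered
    correctly in BOTH the original and the shuffled run (Equation 1).

    original_correct:      0/1 array of shape (N,)   -- a_i
    shuffled_correct_runs: 0/1 array of shape (S, N) -- a_i^(s)
    """
    a = np.asarray(original_correct)
    a_s = np.asarray(shuffled_correct_runs)
    both_correct = a[None, :] * a_s          # 1 only if correct in both runs
    return both_correct.mean(axis=1).mean()  # mean over questions, then over shuffles

# Toy example with S = 2 shuffles and N = 4 questions:
# retest_metric([1, 1, 0, 1], [[1, 0, 0, 1], [1, 1, 0, 0]]) -> 0.5
```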
2.3 Models
In this work, we evaluate 10 state-of-the-art LLMs, ranging in size from 7 billion to 70 billion parameters, most of which have performed very well on the Hugging Face Open LLM leaderboard. The 10 models we use are: Llama3 70B Instruct, Llama3 70B, Llama3 8B Instruct (Meta, 2024), Llama2 70B (Touvron et al., 2023), Yi 34B (AI et al., 2024), Mixtral 8x7B and Mixtral 8x7B Instruct (Jiang et al., 2024), Falcon 40B Instruct (Almazrouei et al., 2023), Mistral 7B Instruct (Jiang et al., 2023), and Gemma 7B Instruct (Team et al., 2024). All models are openly available, which enables the reproducibility of our findings.
3 Results
We found that all tested models performed worse according to our metric after answer content shuffling than on the original version of the dataset, as shown in Table 1. After shuffling, models fail to select the correct answer for every question they originally answered correctly, as shown by our metric in Figure 1.
| Model Name | MMLU | Our Metric | % Drop |
|---|---|---|---|
| Llama-3-70B-it | 80.3 | 75.3 | 6.2 |
| Llama-3-70B | 78.9 | 72.4 | 8.2 |
| Yi-34B | 75.8 | 67.7 | 10.7 |
| Mixtral-8x7B-it | 70.6 | 60.7 | 14.0 |
| Mixtral-8x7B | 70.4 | 60.9 | 13.5 |
| Llama-2-70B | 69.0 | 58.8 | 14.8 |
| Llama-3-8B-it | 66.4 | 58.0 | 12.7 |
| Mistral-7B-it | 59.3 | 46.5 | 21.6 |
| Falcon-40B-it | 54.7 | 39.8 | 27.2 |
| Gemma-7B-it | 51.7 | 38.0 | 26.5 |
We find that some models had higher retest accuracy than others. Models from the Llama-3 family were the most robust, especially Llama-3-70B. Interestingly, the Llama-3-8B model was more robust than larger, generally high-performing models such as Mixtral-8x7B and Llama-2-70B. For Llama-3-70B and Mixtral-8x7B, we also found that the base and instruction-finetuned variants were comparably robust. Smaller models, like Mistral-7B and Gemma-7B, were generally more impacted. This result is consistent with findings in Zhou et al. (2024), who found more inconsistency for smaller models (less than 8B parameters), although in a slightly different setting. Some larger models, such as Falcon-40B-instruct, whose score dropped from 54.7 to 39.8 with our approach, were also strongly impacted.
We also analyzed the performance drop by subdataset in Table 2, and discovered that the models struggled the most with problem-solving subdatasets, such as high school mathematics. For the Gemma-7B and Falcon-40B models, the drop in accuracy on these categories was as high as 40%. As these subdatasets make up a significant portion (over 15%) of the original MMLU dataset, this analysis suggests serious robustness issues affecting accuracy scores on problem-solving categories. Additionally, for the most impacted subdatasets, such as “college mathematics” and “global facts”, we investigated whether the drop may be due to shuffling disrupting the logical ordering of answer options in the original questions. In humans, presenting answer options in logical order (such as 0, 1, 2, 3 or 3, 2, 1, 0) is recommended by test design research, because random order may pose an unnecessary challenge for lower-ability students (Huntley and Welch, 1993; Haladyna et al., 2002). We discovered that more than 95% of the original MMLU dataset was presented in logical order (a minimal sketch of this check is given after Table 2), which indicates that models may be benefiting from logical answer order and perhaps should be seen as lower-ability test takers.
| Model Name | MMLU | Our Metric | % Drop |
|---|---|---|---|
| Llama-3-70B-it | 72.1 | 64.5 | 10.5 |
| Llama-3-70B | 68.7 | 57.7 | 16.0 |
| Yi-34B | 65.6 | 52.9 | 19.4 |
| Mixtral-8x7B-it | 56.9 | 43.4 | 23.7 |
| Mixtral-8x7B | 57.0 | 43.4 | 23.9 |
| Llama-2-70B | 54.6 | 40.4 | 26.0 |
| Llama-3-8B-it | 54.3 | 40.9 | 24.7 |
| Mistral-7B-it | 45.2 | 29.8 | 34.1 |
| Falcon-40B-it | 41.5 | 24.3 | 41.4 |
| Gemma-7B-it | 38.9 | 22.2 | 42.9 |
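To make the logical-order check referenced above concrete, the following minimal sketch flags whether a question’s numeric answer options appear in monotone (ascending or descending) order. The simple parsing heuristic is an illustrative assumption, not necessarily the exact procedure behind the 95% figure reported above.

```python
def is_logical_order(choices):
    """Heuristically flag whether numeric answer options appear in a
    monotone order, e.g. 1, 2, 3, 4 or 4, 3, 2, 1.

    Returns None for questions whose options are not all numeric; this
    simple parser is an illustrative assumption rather than the paper's
    exact analysis code.
    """
    try:
        values = [float(c.strip()) for c in choices]
    except ValueError:
        return None  # not a purely numeric question; skip it
    ascending = all(x <= y for x, y in zip(values, values[1:]))
    descending = all(x >= y for x, y in zip(values, values[1:]))
    return ascending or descending

# is_logical_order(["1", "2", "3", "4"])  -> True
# is_logical_order(["4", "2", "1", "3"])  -> False
```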
4 Discussion & Conclusion
Related Work.
As with any evaluation dataset, validity is an important concern for MMLU. Several recent works have discussed MMLU’s validity (Gema et al., 2024; Zheng et al., 2023; Wang et al., 2024a, b). In particular, Wang et al. (2024b) found trivial and noisy questions in the dataset and proposed an update, MMLU-Pro, which aims to mitigate those issues. Concurrent work on model robustness to question-answering order (Zhou et al., 2024) applies an approach similar to ours, shuffling answer label contents and also exploring other possible modes of interrogating robustness. While they also find non-robustness to question variants, our work differs from theirs in that our metric can account for the multiplicity of potential orderings of answer labels; we also provide further analysis for each category in MMLU in Figure 3 in the appendix.
Conclusion.
This work tested the robustness of the evaluation benchmark pipeline for the popular leaderboard dataset MMLU. To separate out the effect of chance on model answers, we applied a largely meaning-preserving change to the dataset by shuffling answer label contents. We found that this alteration resulted in a decrease in MMLU accuracy for all models, but not to the same degree. We define a new metric that quantifies the effect of chance and suggest that it is important to take it into consideration during evaluation and leaderboard rankings of models.
References
- AI et al. (2024) AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu, Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. 2024. Yi: Open foundation models by 01.ai.
- Allen-Zhu and Li (2023a) Zeyuan Allen-Zhu and Yuanzhi Li. 2023a. Physics of language models: Part 3.1, knowledge storage and extraction. arXiv preprint arXiv:2309.14316.
- Allen-Zhu and Li (2023b) Zeyuan Allen-Zhu and Yuanzhi Li. 2023b. Physics of language models: Part 3.2, knowledge manipulation. arXiv preprint arXiv:2309.14402.
- Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The falcon series of open language models.
- Belinkov and Bisk (2018) Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
- Berglund et al. (2023) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2023. The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288.
- Bland and Altman (1986) J Martin Bland and Douglas G Altman. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476):307–310.
- Dunlap (1933) Jack W Dunlap. 1933. Comparable tests and reliability. Journal of Educational Psychology, 24(6):442.
- Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.
- Gan and Ng (2019) Wee Chung Gan and Hwee Tou Ng. 2019. Improving the robustness of question answering systems to question paraphrasing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6065–6075, Florence, Italy. Association for Computational Linguistics.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
- Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
- Gauthier and Levy (2019) Jon Gauthier and Roger Levy. 2019. Linking artificial and human neural representations of language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 529–539, Hong Kong, China. Association for Computational Linguistics.
- Gema et al. (2024) Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. 2024. Are we done with MMLU? arXiv preprint arXiv:2406.04127.
- Golovneva et al. (2024) Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, and Sainbayar Sukhbaatar. 2024. Reverse training to nurse the reversal curse. arXiv preprint arXiv:2403.13799.
- Goodarzi et al. (2023) Saeed Goodarzi, Nikhil Kagita, Dennis Minn, Shufan Wang, Roberto Dessi, Shubham Toshniwal, Adina Williams, Jack Lanchantin, and Koustuv Sinha. 2023. Robustness of named-entity replacements for in-context learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10914–10931, Singapore. Association for Computational Linguistics.
- Haladyna et al. (2002) Thomas M Haladyna, Steven M Downing, and Michael C Rodriguez. 2002. A review of multiple-choice item-writing guidelines for classroom assessment. Applied measurement in education, 15(3):309–333.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding.
- Huntley and Welch (1993) Renee M Huntley and Catherine J Welch. 1993. Numerical answer options: Logical or random order? In the Annual Meeting of the American Educational Research Association.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
- Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts.
- Jiang et al. (2020) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
- Kitouni et al. (2024) Ouail Kitouni, Niklas Nolte, Diane Bouchacourt, Adina Williams, Mike Rabbat, and Mark Ibrahim. 2024. The factorization curse: Which tokens you predict underlie the reversal curse and more.
- Krosnick and Fabrigar (1991) JA Krosnick and LR Fabrigar. 1991. The handbook of questionnaire design.
- Li et al. (2021) Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Brockett, Ming-Ting Sun, and Bill Dolan. 2021. Contextualized perturbation for textual adversarial attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5053–5069, Online. Association for Computational Linguistics.
- Lions et al. (2022) Séverin Lions, Carlos Monsalve, Pablo Dartnell, María Paz Blanco, Gabriel Ortega, and Julie Lemarié. 2022. Does the response options placement provide clues to the correct answers in multiple-choice tests? a systematic review. Applied Measurement in Education, 35(2):133–152.
- Meta (2024) Meta. 2024. Introducing meta llama 3: The most capable openly available llm to date.
- Moradi and Samwald (2021) Milad Moradi and Matthias Samwald. 2021. Evaluating the robustness of neural language models to input perturbations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1558–1570, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Papakipos and Bitton (2022) Zoe Papakipos and Joanna Bitton. 2022. Augly: Data augmentations for robustness. arXiv preprint arXiv:2201.06494.
- Qian et al. (2022) Rebecca Qian, Candace Ross, Jude Fernandes, Eric Smith, Douwe Kiela, and Adina Williams. 2022. Perturbation augmentation for fairer nlp. arXiv preprint arXiv:2205.12586.
- Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
- Sinha et al. (2023) Koustuv Sinha, Jon Gauthier, Aaron Mueller, Kanishka Misra, Keren Fuentes, Roger Levy, and Adina Williams. 2023. Language model acceptability judgements are not always robust to context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6043–6063, Toronto, Canada. Association for Computational Linguistics.
- Sinha et al. (2021a) Koustuv Sinha, Robin Jia, Dieuwke Hupkes, Joelle Pineau, Adina Williams, and Douwe Kiela. 2021a. Masked language modeling and the distributional hypothesis: Order word matters pre-training for little. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2888–2913, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Sinha et al. (2022) Koustuv Sinha, Amirhossein Kazemnejad, Siva Reddy, Joelle Pineau, Dieuwke Hupkes, and Adina Williams. 2022. The curious case of absolute position embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4449–4472, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Sinha et al. (2021b) Sanchit Sinha, Hanjie Chen, Arshdeep Sekhon, Yangfeng Ji, and Yanjun Qi. 2021b. Perturbing inputs for fragile interpretations in deep natural language processing. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 420–434, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Spearman (1910) Charles Spearman. 1910. Correlation calculated from faulty data. British journal of psychology, 3(3):271.
- Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Open models based on gemini research and technology.
- Tellinghuisen and Sulikowski (2008) Joel Tellinghuisen and Michelle M Sulikowski. 2008. Does the answer order matter on multiple-choice exams? Journal of chemical education, 85(4):572.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
- Wang et al. (2024a) Haochun Wang, Sendong Zhao, Zewen Qiang, Bing Qin, and Ting Liu. 2024a. Beyond the answers: Reviewing the rationality of multiple choice question answering for the evaluation of large language models. arXiv preprint arXiv:2402.01349.
- Wang et al. (2024b) Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024b. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
- Zheng et al. (2023) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations.
- Zhou et al. (2024) Wenjie Zhou, Qiang Wang, Mingzhou Xu, Ming Chen, and Xiangyu Duan. 2024. Revisiting the self-consistency challenges in multi-choice question formats for large language model evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14103–14110, Torino, Italia. ELRA and ICCL.
Appendix A Appendix
A.1 Limitations
While we explore two possible shuffles of the answer label contents, we restricted ourselves to this number to curtail compute costs. We acknowledge that there are many more possible shuffles that might be tested, and more would doubtless lead to a better approximation of the non-robustness.
A.2 Category Wise Analysis
We analyzed how changing the answer order affects each category in the MMLU dataset. We found that some categories are more sensitive to these changes than others. Figure 2 shows the impact of answer order changes on eight randomly selected categories.
MMLU has 57 subcategories, and some are more affected by answer order changes than others. For example, categories such as high school physics, abstract algebra, college mathematics, and moral disputes saw a significant decrease in performance after answer order changes. On the other hand, categories such as high school US history, econometrics, and professional law were less affected. In some cases, the impact was substantial: for instance, the accuracy of the Mistral-7B-instruct model on the moral scenarios category decreased by 77%, from 31.4 to 7.1, after changing the answer order.
The plots in Figure 2 highlight that not all categories are equally affected, suggesting that some parts of the MMLU dataset may be better indicators of model performance than others.
A.3 Computation Resources
For all experiments in this work, we utilized nodes of 8 V100 32GB GPUs. The cumulative computing time required to evaluate all the language models and complete the experiments amounted to approximately 2000 GPU hours.