Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions?
Abstract
Pre-training on large corpora of text enables language models to acquire a vast amount of factual and commonsense knowledge, which allows them to achieve remarkable performance on a variety of language understanding tasks. They typically acquire this knowledge by learning from the pre-training text and capturing certain patterns in it. However, real-world settings often present scenarios that do not abide by these patterns, i.e., scenarios that break the common assumptions. Can state-of-the-art NLP models correctly reason over the contexts of such scenarios?
Addressing the above question, in this paper, we investigate the ability of models to correctly reason over contexts that break the common assumptions. To this end, we first systematically create evaluation data in which each data instance consists of (a) a common assumption, (b) a context that follows the assumption, (c) a context that breaks the assumption, and (d) questions based on the contexts. Then, through evaluations on multiple models including GPT-3 and Flan T5, we show that while doing fairly well on contexts that follow the common assumptions, the models struggle to correctly reason over contexts that break those assumptions. Specifically, the gap between the two settings is as high as 30.88 absolute accuracy points (for GPT-3, averaged across categories). Furthermore, we thoroughly analyze these results, revealing several interesting findings. We believe our work and findings will encourage and facilitate further research in developing more robust models that can also reliably reason over contexts that break the common assumptions. Data is available at https://github.com/nrjvarshney/break_the_common_assumptions.

1 Introduction
Pre-training on large corpora of text enables natural language processing (NLP) models to acquire a vast amount of factual and commonsense knowledge Liu et al. (2019); Petroni et al. (2019); Yogatama et al. (2019); Davison et al. (2019). Due to this knowledge, they are able to achieve remarkable performance on a variety of language understanding tasks. They typically acquire this knowledge by learning from the pre-training text and capturing certain patterns in it. However, in real-world settings, we often encounter scenarios that do not abide by these patterns, i.e., scenarios that break the common assumptions. Consider the context 'John likes to have tomato soup only when it is cold'; it breaks the common assumption that 'people prefer to consume soup when it is hot'. Answering questions based on such contexts requires a model to truly understand the context and override knowledge that it may have acquired (due to the predominant presence of certain patterns in the raw text) during pre-training. How well can state-of-the-art NLP models perform in such scenarios?
Recently, many datasets have been created that test different language understanding skills such as pronoun resolution Sakaguchi et al. (2021); Levesque et al. (2012), commonsense reasoning Talmor et al. (2019), numerical reasoning Dua et al. (2019); Patel et al. (2021); Mishra et al. (2022), qualitative reasoning Tafjord et al. (2019b, a), temporal reasoning Zhou et al. (2019), and feasibility reasoning Gupta et al. (2022). Furthermore, numerous adversarial datasets McCoy et al. (2019); Bartolo et al. (2020); Naik et al. (2018) have also been developed that test the robustness of models. Longpre et al. (2021) study entity-based conflicts between parametric and contextual knowledge. Agarwal et al. (2020) investigate entity-based swapping to test the robustness of models. Prior work has also studied creating counterfactuals using various techniques such as token substitutions and adversarial attacks Ribeiro et al. (2020); Michel et al. (2019); Kaushik et al. (2020). However, evaluating models on the ability to reason over contexts that break the common assumptions (which is different from entity-based conflicts) has remained underexplored, and existing datasets do not contain a sufficient number of such examples.
In this work, we address the above limitations and comprehensively study the models' ability to reason over contexts that break the common assumptions. To this end, we first systematically create binary classification questions whose contexts break the common assumptions and which test the ability to reason over those contexts. Furthermore, for each such context, we also create a corresponding context that 'follows' the common assumption. Specifically, instances in our evaluation data consist of the following: (a) a common assumption, (b) a context that follows the assumption, (c) a context that breaks the assumption, and (d) questions based on the contexts. Figure 1 illustrates examples from our dataset. For binary classification questions, the task is to answer a given question as either 'Yes' or 'No'.
We conduct comprehensive experiments with several NLP models such as Flan T5 Chung et al. (2022), GPT-3 Brown et al. (2020), and UnifiedQA Khashabi et al. (2020). First, we evaluate the models on the scenario where the contexts follow the common assumptions and show that they perform fairly well in this setting. However, on evaluating them on the scenario where the contexts break the common assumptions, we find that the models falter and achieve considerably lower performance. Specifically, on the binary classification questions, Flan T5-xxl achieves an accuracy of just 70.67% in the latter scenario (18.52 absolute points lower than its 89.19% in the former scenario). Furthermore, we show that this performance is considerably and consistently lower than the human performance baseline.
We further conduct a thorough analysis which reveals several interesting findings, such as (a) the models show poor consistency, i.e., they often fail to correctly answer both the (context, question) and (context (Breaking), question) pairs, and (b) explicitly providing the common assumption along with the context improves performance when the context follows the assumption but degrades it when the context breaks the assumption. Overall, we believe our work and findings will encourage and facilitate further research in developing more robust models that can also reliably reason over contexts that break the common assumptions.
Category | Common Assumption | Context (Breaking) | Question (Binary Class.) |
---|---|---|---|
Preferences | Generally, people prefer homes that are spacious and have adequate storage space. | John prefers small homes so that he can manage it properly. | John’s parents are looking for a new bungalow for him, will he like it? No |
Behaviors | Generally, people feel good when they meet an old friend. | Kevin had a traumatic childhood because of which he feels uncomfortable meeting people from his growing up years. | Kevin is invited for his school reunion celebration, will he enjoy the celebration and meeting his school friends? No |
Objects | Generally, a branded watch is more expensive than a regular watch. | A premium brand known for expensive luxury watches has launched a new collection of watches that are available at low prices this month. | Jimmy is looking to buy a watch but has a low budget, should he go for this premium brand? Yes |
Events | Generally, sporting events have an audience. | This year's soccer final is being held without an audience. | The final soccer match will be played between the two most popular teams. Will the stadium be full of supporters of both teams? No |
Others | Animals usually do not enter the gym. | A gym in NY has a high membership fee and thus has no restrictions on the working hours and entry. | Will the gym allow Jim to take his dog with him? Yes |
2 Evaluation Data
In order to comprehensively study a system’s ability to reason over contexts that break the common assumptions, we first systematically create evaluation instances. In this section, we describe the data creation process and provide supporting details.
2.1 Data Creation
Context | Question |
---|---|
Matt always enjoys watching one-sided sports games | Q1: There are two matches tonight. One is a high-intensity close match. The other is a boring one-sided game. Will Matt watch the close match? No |
| Q2: There are two matches tonight. One is a high-intensity close match. The other is a boring one-sided game. Will Matt watch the one-sided match? Yes |
Matt doesn't enjoy watching interesting sports games but likes one-sided games | Q1: There are two matches tonight. One is a high-intensity close match. The other is a boring one-sided game. Will Matt watch the close match? No |
| Q2: There are two matches tonight. One is a high-intensity close match and the other is a boring one-sided game. Will Matt prefer to watch the one-sided match? Yes |
For creating data instances, we first compile a set of common assumptions across various categories, namely assumptions about preferences, behaviors, objects, and events. Table 1 demonstrates examples of common assumptions for each category. Then, we write a context that follows the common assumption and a corresponding context that breaks that assumption. Finally, we create binary classification questions from these contexts. Furthermore, we also create several variants of a (context, question) pair to comprehensively evaluate a system’s ability to correctly and consistently answer questions. Table 2 shows examples of such variants. We note that in this work, our focus is on common assumptions and not on entity-based factual knowledge.
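To make the instance structure concrete, below is a minimal sketch of how a single evaluation instance could be represented. The field names, the JSON-like layout, and the 'following' context shown here are our own illustration (drawn from the Table 1 example), not the released data format.

```python
# Illustrative only: field names, layout, and the "following" context are assumptions.
example_instance = {
    "category": "Preferences",
    "common_assumption": "Generally, people prefer homes that are spacious "
                         "and have adequate storage space.",
    "context_following": "John prefers homes that are spacious and have "
                         "plenty of storage space.",  # illustrative
    "context_breaking": "John prefers small homes so that he can manage it properly.",
    "questions": [
        {
            "question": "John's parents are looking for a new bungalow for him, "
                        "will he like it?",
            "answer_with_following_context": "Yes",
            "answer_with_breaking_context": "No",
        }
    ],
}
```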
Six computer science graduate students contributed to the development of this dataset. The data instances were cross-verified, and instances on which the inter-annotator agreement was low were rejected. We also validate the compiled common assumptions; specifically, for each sentence, we ask human annotators to answer 'Yes' if they think that it is a common assumption and 'No' otherwise. For nearly all the compiled common assumptions, the majority answer is 'Yes', which indicates that they are indeed common assumptions. We provide further details about this step in Section 2.3.
Categories of common assumptions:
We create common assumptions for the following categories:
Assumptions about Preferences: In this category, we include assumptions where a preference (typically of humans) is involved, e.g., “Generally, people prefer to eat fruits when they are fully ripened”, “Generally, busy people prefer to have an assistant who can help them with their tasks”, and “People usually like to go outside when the weather is pleasant”.
Assumptions about Behaviors: Here, we include assumptions about people’s behaviors such as ‘Generally, people feel good when they meet an old friend’, ‘Generally, people like to get free coupons’, and ‘People usually go to work in the morning’.
Assumptions about Objects: This category incorporates assumptions about objects/things such as ‘Generally, hotels are more expensive than a dormitory’, ‘Generally, bigger vehicles have more seating capacity’, and ‘Generally, schools have science laboratories’.
Assumptions about Events: In this category, we include assumptions about events such as ‘Generally, football games have an audience’ and ‘Generally, there are food stalls in a carnival celebration’.
We also include an Others category to incorporate common assumptions that do not fit into the above four categories.
Category (# Assumptions) | # Questions |
---|---|
Preferences (33) | 131 |
Behaviors (64) | 240 |
Objects (17) | 73 |
Events (26) | 95 |
Others (13) | 44 |
2.2 Data Statistics
For binary classification questions, the task is to answer a given question as either ‘Yes’ or ‘No’. To further measure the consistency of a system’s predictions, we evaluate it on pairs of contexts, one that follows the common assumption and a corresponding one that breaks it. We also conduct evaluations on different variations of a (context, question) pair, as shown in Table 2. Table 3 shows the number of binary classification questions in our dataset for each category.
Model | Pref Con | Pref Con (B) | Beh Con | Beh Con (B) | Obj Con | Obj Con (B) | Eve Con | Eve Con (B) | Oth Con | Oth Con (B) | Avg Con | Avg Con (B) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Human | 95.00 | 95.00 | 100.00 | 95.00 | 100.00 | 95.00 | 100.00 | 95.00 | 100.00 | 100.00 | 99.00 | 96.00 |
Flan T5-xxl | 88.17 | 68.70 | 90.21 | 69.79 | 89.04 | 71.91 | 89.47 | 77.89 | 86.36 | 63.64 | 89.19 | 70.67 |
Flan T5-xl | 85.12 | 72.52 | 90.21 | 66.25 | 86.98 | 69.17 | 88.95 | 70.00 | 81.82 | 65.91 | 87.83 | 68.61 |
Flan T5-large | 87.03 | 66.03 | 82.50 | 61.05 | 79.45 | 67.12 | 92.63 | 63.68 | 81.82 | 62.50 | 84.73 | 63.47 |
Flan T5-base | 66.03 | 56.49 | 71.66 | 58.96 | 63.69 | 60.95 | 80.53 | 58.42 | 67.05 | 55.69 | 70.50 | 58.32 |
UnifiedQA | 63.36 | 50.77 | 66.04 | 46.66 | 60.27 | 52.73 | 65.26 | 54.21 | 60.23 | 51.14 | 64.15 | 49.91 |
GPT-3 davinci-003 | 90.84 | 56.49 | 88.75 | 54.17 | 78.08 | 57.53 | 88.42 | 54.74 | 84.09 | 72.73 | 87.48 | 56.60 |
2.3 Data Validation
We note that it is important to validate the quality of the compiled common assumptions. To this end, for each sentence, we ask 3 human annotators to answer ‘Yes’ if they think that the given sentence is a common assumption otherwise answer ‘No’. Then, we use the majority voting aggregation strategy and find that for nearly all the compiled common assumptions, the majority answer is ‘Yes’. This validates the quality of the common assumptions compiled in this work.
In addition to the above validation step, we note that the questions were also cross-verified by the data creators (who are also the authors of this paper) and the instances where the inter-annotator agreement was low were rejected.
3 Experiments
3.1 Experimental Setup
Performance Metrics:
For binary classification questions, the task is to answer a given question as either ‘Yes’ or ‘No’. We calculate accuracy against the gold labels (‘Yes’ and ‘No’) for evaluation. To better evaluate a system’s capability, we also measure its consistency, i.e., whether it correctly answers the question with both the context that follows the common assumption and the corresponding context that breaks it.
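As a concrete illustration, the sketch below computes these two metrics from paired predictions. The function names and the input layout (one record per context pair with gold and predicted labels for 'Con' and 'Con (B)') are our own assumptions, not a released evaluation script.

```python
from typing import Dict, List

def accuracy(preds: List[str], golds: List[str]) -> float:
    """Percentage of questions answered with the correct 'Yes'/'No' label."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return 100.0 * correct / len(golds)

def consistency(pairs: List[Dict[str, str]]) -> float:
    """Percentage of context pairs for which BOTH the 'Con' and the 'Con (B)'
    questions are answered correctly."""
    both_correct = sum(
        p["pred_con"] == p["gold_con"] and p["pred_con_b"] == p["gold_con_b"]
        for p in pairs
    )
    return 100.0 * both_correct / len(pairs)
```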
Models: We evaluate models of different sizes, namely, Flan T5 (base, large, xl, and xxl) Chung et al. (2022), UnifiedQA Khashabi et al. (2020), and GPT-3 (davinci-003) Brown et al. (2020).
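For reference, a minimal zero-shot querying sketch for the Flan T5 models is shown below, assuming the Hugging Face transformers library. The exact prompt wording is an assumption and may differ from the one used in our experiments.

```python
# Minimal sketch: querying a Flan T5 model for a Yes/No answer (prompt wording is an assumption).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"  # any of the Flan T5 variants from Table 4
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "John prefers small homes so that he can manage it properly."
question = "John's parents are looking for a new bungalow for him, will he like it?"
prompt = (
    f"{context}\n"
    "Answer the given question in Yes or No based on the context.\n"
    f"{question}"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # gold answer: No
```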
Human Performance Baseline:
We randomly select 40 context-question pairs (20 for contexts that follow common assumptions and 20 for corresponding contexts that break those assumptions) for each category and ask a total of 3 human annotators to ‘answer the given question in Yes or No based on the context’. We then use the majority voting aggregation method and calculate the human performance baseline.
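A minimal sketch of the majority-voting aggregation used for both the human baseline and the data-validation step is shown below; with 3 annotators and binary labels, a strict majority always exists. The function name is ours.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent annotator answer, e.g. ['Yes', 'No', 'Yes'] -> 'Yes'."""
    return Counter(answers).most_common(1)[0][0]
```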
3.2 Results
Table 4 shows the performance of different models on binary classification questions. Column ‘Con’ and column ‘Con (B)’ correspond to the performance on context-question pairs where contexts follow the common assumptions and where contexts break the assumptions respectively.
High Human Performance Baseline:
The first row in Table 4 shows the human performance baseline for each category of our evaluation data. It demonstrates that humans typically achieve high performance across all the data categories. This shows that humans are typically able to reason well in both scenarios, i.e., where contexts follow the common assumptions and where contexts break those assumptions. On average, the human performance is 99.00% on ‘Con’ and 96.00% on ‘Con (B)’.
Con vs Con (B) Performance:
Comparing the performance on questions for contexts that follow the common assumptions (‘Con’) with that for contexts that break them (‘Con (B)’), we find that the models consistently achieve lower performance on ‘Con (B)’. This behavior is observed for all the models and for all categories of common assumptions. For instance, the Flan T5-xxl model on average achieves 89.19% accuracy on ‘Con’ and just 70.67% on ‘Con (B)’. The table also shows that as the size of the model increases, the performance on both ‘Con’ and ‘Con (B)’ improves; however, the gap between them remains. This highlights that despite performing fairly well at reasoning over contexts that follow the common assumptions, the models struggle to correctly reason over contexts that break those assumptions.
Human vs Model Performance on ‘Con (B)’:
Table 4 shows that the performances of all models are considerably lower than the human performance baseline. Specifically, on ‘Con (B)’ instances, the human performance on average is 25.33 absolute points higher than that of the Flan T5-xxl model (96.00% vs. 70.67%). Furthermore, human performance is only slightly impacted when the contexts break the common assumptions (the ‘Con (B)’ column); however, the models’ performance degrades significantly. This behavior is observed for all the categories.
Models Show Poor Consistency:
Model | Pref | Beh | Obj | Eve | Oth |
---|---|---|---|---|---|
Flan T5-xxl | 56.30 | 58.68 | 58.90 | 68.42 | 50.00 |
Flan T5-xl | 54.81 | 54.96 | 56.16 | 55.79 | 45.45 |
Flan T5-large | 47.41 | 40.91 | 42.47 | 48.42 | 34.09 |
Flan T5-base | 24.44 | 32.23 | 26.03 | 35.79 | 20.45 |
UnifiedQA | 28.89 | 19.42 | 21.92 | 26.32 | 15.91 |
GPT-3 | 49.62 | 47.08 | 41.10 | 46.32 | 56.82 |
Table 5 shows the consistency (correctly answering a question based on both Context and Context (Breaking)) achieved by different models on the binary classification questions. The results show that all the models achieve poor consistency, i.e., they often fail to answer both the (context, question) and (context (Breaking), question) pairs correctly. This is primarily due to the poor performance on the (context (Breaking), question) instances.
Model | Pref Con | Pref Con (B) | Beh Con | Beh Con (B) | Obj Con | Obj Con (B) | Eve Con | Eve Con (B) | Oth Con | Oth Con (B) | Avg Con | Avg Con (B) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Flan T5-xxl | 91.60 | 65.65 | 92.08 | 70.83 | 93.15 | 61.64 | 93.68 | 73.68 | 88.64 | 59.09 | 92.11 | 68.10 |
Flan T5-xl | 87.79 | 64.89 | 92.08 | 63.33 | 89.04 | 63.01 | 90.53 | 60.00 | 88.64 | 54.55 | 90.22 | 62.44 |
Flan T5-large | 84.73 | 66.41 | 87.08 | 60.42 | 80.82 | 46.58 | 86.32 | 58.95 | 84.09 | 56.82 | 85.42 | 59.52 |
Flan T5-base | 64.12 | 54.96 | 75.00 | 52.92 | 65.75 | 60.27 | 78.95 | 48.42 | 65.91 | 45.45 | 71.36 | 53.00 |
UnifiedQA | 66.41 | 38.93 | 68.33 | 44.58 | 57.53 | 50.68 | 65.26 | 47.37 | 54.55 | 47.73 | 65.01 | 44.77 |
Impact of Explicitly Providing the Common Assumption with the Context:
Table 6 shows the impact of explicitly providing the common assumption along with the context. Since the common assumption aligns with the ‘Con’ contexts, providing it slightly improves the performance on ‘Con’; however, it hurts the performance on ‘Con (B)’. This happens because the ‘Con (B)’ contexts break the provided assumption, which further misleads the model and results in a drop in performance.
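A sketch of the two input formats compared in Table 4 and Table 6 is shown below. The exact template is our assumption; the intent is simply that the common assumption is prepended to the context.

```python
# Illustrative input formats; the exact template wording is an assumption.
assumption = "Generally, people prefer homes that are spacious and have adequate storage space."
context = "John prefers small homes so that he can manage it properly."
question = "John's parents are looking for a new bungalow for him, will he like it?"

without_assumption = f"{context}\n{question}"              # setting of Table 4
with_assumption = f"{assumption}\n{context}\n{question}"   # setting of Table 6
```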
Failure Instances:
Context (Breaking) | Question (Answer) | Prediction |
---|---|---|
Ronald never hires a person that is experienced to handle his business. | Joan is an inexperienced candidate applying for the position. Will he be considered for hiring? (Yes) | No
John is content with his small apartment and wants to continue to stay here | His parents offered to help him buy a bigger home, will he decline the offer? (Yes) | No
John enjoys small homes so that he can manage them properly | John’s parents are looking for a new bungalow for him, will he like it? (No) | Yes
Steven has an old car that is even slower than a bicycle | Steven rides his bicycle and car for one hour, will he cover more distance with the bicycle? (Yes) | No
Matt always enjoys watching boring sports games | There are two matches tonight. One is a high-intensity close match. The other is a boring one-sided game. Will Matt watch the one-sided match? (Yes) | No
Table 7 shows examples of instances where the Flan T5-xxl model gave incorrect predictions. On analyzing the failure instances, we find that a large fraction of the mistakes are on instances where the correct answer is ‘Yes’ while the model predicts ‘No’.
Performance on instance variations:
Model | Performance |
---|---|
Flan T5-xxl | 33.99 |
Flan T5-xl | 31.70
Flan T5-large | 21.57 |
Flan T5-base | 15.36 |
Table 8 shows the overall performance of different models on the different variations of (context (Breaking), question) pairs: a model receives a score of 1 for a common assumption only if it predicts all the corresponding variants correctly, and 0 otherwise. Flan T5-xxl achieves a score of just 33.99% on this metric, highlighting that the model is often unable to consistently answer ALL the variants correctly.
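The sketch below illustrates this strict scoring, assuming variants are grouped per common assumption; the function name and input structure are our own.

```python
from typing import Dict, List, Tuple

def all_variants_score(groups: Dict[str, List[Tuple[str, str]]]) -> float:
    """groups maps each common assumption to its list of (prediction, gold) pairs
    over all question variants; an assumption scores 1 only if every variant is correct."""
    per_assumption = [
        int(all(pred == gold for pred, gold in variants))
        for variants in groups.values()
    ]
    return 100.0 * sum(per_assumption) / len(per_assumption)
```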
4 Conclusion
In this paper, we investigated the ability of models to correctly reason over contexts that break the common assumptions. To this end, we first systematically developed evaluation data in which each instance consists of a common assumption, a context that follows that assumption, a context that breaks the assumption, and questions based on the contexts. Then, we evaluated multiple models and showed that while performing fairly well on contexts that follow the common assumptions, the models struggle to correctly reason over contexts that break those assumptions. Furthermore, we conducted a thorough analysis that revealed several interesting findings. In conclusion, we believe our work and findings will encourage and facilitate further research in developing more robust models that can also reliably reason over contexts that break the common assumptions.
Ethical Considerations
The names used in our data are selected from the most common English names. Though the contexts in our dataset break the common assumption, we ensure that all of them indeed describe a realistic scenario. We do not collect any personal information from data creators in the development of the evaluation data for this work.
Acknowledgement
We thank the Research Computing (RC) at Arizona State University (ASU) for providing computing resources for experiments.
References
- Agarwal et al. (2020) Oshin Agarwal, Yinfei Yang, Byron C Wallace, and Ani Nenkova. 2020. Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models. arXiv preprint arXiv:2004.04123.
- Bartolo et al. (2020) Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Davison et al. (2019) Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.
- Gupta et al. (2022) Himanshu Gupta, Neeraj Varshney, Swaroop Mishra, Kuntal Kumar Pal, Saurabh Arjun Sawant, Kevin Scaria, Siddharth Goyal, and Chitta Baral. 2022. “John is 50 years old, can his son be 65?” Evaluating NLP models’ understanding of feasibility. arXiv preprint arXiv:2210.07471.
- Kaushik et al. (2020) Divyansh Kaushik, Eduard Hovy, and Zachary C Lipton. 2020. Learning the difference that makes a difference with counterfactually augmented data. International Conference on Learning Representations (ICLR).
- Khashabi et al. (2020) Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UNIFIEDQA: Crossing format boundaries with a single QA system. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1896–1907, Online. Association for Computational Linguistics.
- Levesque et al. (2012) Hector J. Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, KR’12, pages 552–561. AAAI Press, Rome, Italy.
- Liu et al. (2019) Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094, Minneapolis, Minnesota. Association for Computational Linguistics.
- Longpre et al. (2021) Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- McCoy et al. (2019) Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
- Michel et al. (2019) Paul Michel, Xian Li, Graham Neubig, and Juan Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3103–3114, Minneapolis, Minnesota. Association for Computational Linguistics.
- Mishra et al. (2022) Swaroop Mishra, Arindam Mitra, Neeraj Varshney, Bhavdeep Sachdeva, Peter Clark, Chitta Baral, and Ashwin Kalyan. 2022. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3505–3523, Dublin, Ireland. Association for Computational Linguistics.
- Naik et al. (2018) Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
- Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
- Ribeiro et al. (2020) Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Tafjord et al. (2019a) Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. 2019a. Quarel: A dataset and models for answering questions about qualitative relationships. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7063–7071.
- Tafjord et al. (2019b) Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. 2019b. QuaRTz: An open-domain dataset of qualitative relationship questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5941–5946, Hong Kong, China. Association for Computational Linguistics.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
- Yogatama et al. (2019) Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
- Zhou et al. (2019) Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. 2019. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3363–3369, Hong Kong, China. Association for Computational Linguistics.