Understanding and Mitigating Classification Errors
Through Interpretable Token Patterns
Abstract
State-of-the-art NLP methods achieve human-like performance on many tasks, but still make errors. Characterizing these errors in easily interpretable terms not only gives insight into whether a classifier is prone to making systematic errors, but also provides a way to act on and improve the classifier. We propose to discover those patterns of tokens that distinguish correct from erroneous predictions, so as to obtain global and interpretable descriptions for arbitrary NLP classifiers. We formulate the problem of finding a succinct and non-redundant set of such patterns in terms of the Minimum Description Length principle. Through an extensive set of experiments, we show that our method, Premise, performs well in practice. Unlike existing solutions, it recovers ground truth, even on highly imbalanced data over large vocabularies. In VQA and NER case studies, we confirm that it gives clear and actionable insight into the systematic errors made by NLP classifiers.
Extended Abstract
As much as ‘to err is human,’ NLP models make errors too. Some of these errors are due to noise that is inherent to the process we want to model, and are therefore relatively benign. Systematic errors, on the other hand, e.g. those due to bias or misspecification, are much more serious, as they lead to models that are inherently unreliable. If we know under what conditions a model performs poorly, we can actively intervene, e.g., by augmenting the training data or fixing preprocessing issues, and so improve overall reliability and performance. Before we can do so, we first need to know whether a classifier makes systematic errors, and if so, how to characterize them in easily understandable terms.
Given a dataset with labels that specify which instances were classified correctly or incorrectly, we are interested in finding combinations of features that describe where the classifier’s predictions are incorrect. For an NLP task, the input features are words or tokens. If, for example, the combination of words “how, many” is primarily found in misclassified instances, this can indicate that our classifier struggles with the concept of counting. A toy example is visualized in Figure 1.
| Instance | Correct Prediction? |
|---|---|
| How many ducks are in the picture? | ✗ |
| What are the ducks eating? | ✗ |
| How many roosters are in the puddle? | ✗ |
| Do you see ducks in the puddle? | ✓ |
| Are there many ducks playing? | ✓ |
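To make this setup concrete, the following minimal Python sketch (illustrative only, not the Premise algorithm) represents the toy instances as token sets with a correctness flag and scores how strongly a candidate pattern co-occurs with misclassification; the `error_rate` helper is hypothetical and only serves to illustrate the problem input.

```python
# Toy problem input: each instance is a set of tokens plus a flag that says
# whether the classifier predicted it correctly.
instances = [
    ({"how", "many", "ducks", "are", "in", "the", "picture"}, False),
    ({"what", "are", "the", "ducks", "eating"}, False),
    ({"how", "many", "roosters", "are", "in", "the", "puddle"}, False),
    ({"do", "you", "see", "ducks", "in", "the", "puddle"}, True),
    ({"are", "there", "many", "ducks", "playing"}, True),
]

def error_rate(pattern, data):
    """Fraction of instances containing all pattern tokens that were misclassified."""
    covered = [correct for tokens, correct in data if pattern <= tokens]
    return sum(not c for c in covered) / len(covered) if covered else 0.0

print(error_rate({"how", "many"}, instances))  # 1.0: only occurs in errors
print(error_rate({"ducks"}, instances))        # 0.5: uninformative
```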
Local explanation methods such as LIME Ribeiro et al. (2016) describe the local decision boundary around each individual instance. In contrast, we are interested in an efficient way to obtain a global and non-redundant description of our classifier’s issues on the given input data. To this end, we turn to pattern mining. Here, a combination of features is a pattern, and we look for the set of patterns that best characterizes on which instances the classifier tends to perform poorly. This can be phrased as the more general problem of label description: for data with binary features, we are interested in the associations between the feature data and the labels. We formulate this problem in terms of the Minimum Description Length (MDL) principle. The MDL principle is a formal yet practical instantiation of Kolmogorov complexity and allows us to cast our problem as finding the model, i.e., the set of patterns, that best compresses the data without loss, measured by the description length under our model class.
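As a rough intuition for the MDL formulation, the sketch below scores a single candidate pattern with a simplified two-part code: the model cost names the pattern’s tokens, and the data cost is the Shannon code length of the correct/incorrect labels given the pattern. This is our own simplification for illustration, not the exact encoding used by Premise.

```python
import math

def two_part_dl(pattern, data, vocab_size):
    """Simplified two-part MDL score for one pattern (illustrative only):
    model cost = bits to name the pattern's tokens,
    data cost  = Shannon code length of the correct/incorrect labels,
                 using separate error rates inside and outside the pattern."""
    def bits(p):
        return -math.log2(max(p, 1e-12))

    covered = [c for toks, c in data if pattern <= toks]        # instances the pattern covers
    rest = [c for toks, c in data if not pattern <= toks]       # all remaining instances
    model_cost = len(pattern) * math.log2(vocab_size)           # which tokens the pattern uses
    data_cost = 0.0
    for group in (covered, rest):
        if not group:
            continue
        p_err = sum(not c for c in group) / len(group)          # empirical error rate in group
        data_cost += sum(bits(p_err) if not c else bits(1 - p_err) for c in group)
    return model_cost + data_cost
```

Under such a score, a pattern earns its place only if the bits it saves when encoding the labels outweigh the bits needed to describe the pattern itself, which naturally suppresses spurious patterns on small or noisy data.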
To capture phenomena of textual input, e.g., synonyms, we consider a model class representing a rich pattern language that allows us to express conjunctions, mutual exclusivity, and nested combinations thereof. As the search space is doubly exponential and does not exhibit any easy-to-exploit structure, we propose the efficient Premise algorithm to heuristically discover the premises under which we see the given predictions. To further guide the search, we also leverage classifier-independent word embeddings. A full technical description of our approach can be found in Hedderich et al. (2022).
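The sketch below illustrates how such a nested pattern could be matched against an instance: a pattern is a conjunction whose elements are either single tokens or groups of alternatives (e.g., synonyms), of which at least one must be present. The representation here (tuples for conjunctions, frozensets for alternative groups) is a hypothetical choice for illustration, not Premise’s internal one.

```python
def matches(pattern, tokens):
    """True if every element of the conjunction is satisfied by the token set."""
    for element in pattern:
        if isinstance(element, frozenset):      # group of alternatives (e.g., synonyms)
            if not element & tokens:            # at least one alternative must occur
                return False
        elif element not in tokens:             # plain token in the conjunction
            return False
    return True

color = frozenset({"color", "colors", "colour"})
print(matches(("what", color), {"what", "colour", "is", "the", "bench"}))  # True
print(matches(("what", color), {"what", "is", "on", "the", "table"}))      # False
```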
Experiments & Results
In extensive experiments on synthetic data with known ground truth, we compare Premise against more than ten baseline approaches. Some methods fail outright due to prohibitive running times, not finishing a single run within 12 hours. The remaining seven approaches perform poorly; one common issue is that they report thousands of patterns even when only a few ground-truth patterns exist. Only Premise provides a non-redundant description in the presence of noise and label imbalance, and it easily scales to vocabularies of thousands of tokens.
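As a rough impression of this kind of benchmark, the following sketch generates synthetic instances over a large vocabulary and plants ground-truth patterns into a minority of misclassified instances; all parameter values here are illustrative assumptions, not the exact settings used in the paper.

```python
import random

def synthetic_data(vocab_size=5000, n=10_000, planted=(("how", "many"), ("left", "to")),
                   noise=0.05, error_fraction=0.1, seed=0):
    """Generate (token set, correct?) pairs: planted patterns are injected into the
    minority of misclassified instances, and a small fraction of labels is flipped."""
    rng = random.Random(seed)
    vocab = [f"w{i}" for i in range(vocab_size)]
    data = []
    for _ in range(n):
        tokens = set(rng.sample(vocab, 10))          # random background tokens
        is_error = rng.random() < error_fraction     # imbalanced labels
        if is_error:
            tokens |= set(rng.choice(planted))       # plant a ground-truth pattern
        if rng.random() < noise:                     # label noise
            is_error = not is_error
        data.append((tokens, not is_error))
    return data
```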
Understanding Limitations of VQA Models
Visual Question Answering (VQA) is the popular and challenging task of answering textual questions about a given image. We analyze the misclassifications of Visual7W Zhu et al. (2016) and LXMERT Tan and Bansal (2019), two architectures that were state-of-the-art at their time but performed far from optimally, and thus serve as interesting applications for describing misclassification.
| pattern | example from the dataset |
|---|---|
| UNK | how are the UNK covered |
| (how, many) | how many elephants are there |
| (what, (color, colors, colour)) | what color is the bench |
| (on, top, of) | what is on the top of the cake |
| (left, to) | what can be seen to the left |
| (on, wall, hanging) | what is hanging on the wall |
| (how, does, look) | how does the woman look |
| (what, does, (say, like, think, know, want)) | what does the sign say |
The patterns found by Premise highlight the advantage of the richer pattern language, allowing it to capture related concepts such as (what, (color, colors, colour)). Generally, our discovered patterns highlight different types of wrongly answered questions, including counting questions, identification of objects and their colors, spatial reasoning, and higher-level reasoning tasks like reading signs. These indicate realistic problems; the issue of counting, for instance, has also been identified manually in the past Zhang et al. (2018). We also observe that Visual7W and LXMERT share certain issues, but specific problems, such as identifying colors, do not appear for the latter. This could be an indicator that more recent classifiers have improved capabilities regarding these problems.
Premise also discovers patterns that are biased towards correct classification, which can indicate issues with the dataset. For instance, (who, took, (photo, picture, pic, photos, photograph)), although a difficult question, is nearly always answered by "photographer" in this dataset and is thus easy to learn. Another problematic question type is indicated by the pattern (clock, time), where the answer is usually "UNK", the unknown-word token, due to the limited vocabulary of Visual7W. The pattern hence indicates a setting where the VQA classifier undeservedly gets a good score.
Mitigating NER Classification Errors
Here, we analyze a setting where a Named Entity Recognition classifier performs well during development but much worse when deployed "in the wild." Understanding this difference is important for improving the classifier. With Premise, we can identify issues related to different preprocessing, differing label conventions, and unlabeled data. To empirically validate that the found patterns affect the classifier’s performance, we also fine-tune the classifier on pattern-guided data, subselecting samples that exhibit patterns associated with errors (see the sketch below); this improves the overall performance significantly compared to fine-tuning on a random subset of samples. The patterns discovered by Premise hence provide actionable insights into how a classifier can be improved.
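A minimal sketch of such pattern-guided selection is given below; the function and the commented fine-tuning calls are hypothetical and only indicate how the discovered error patterns could drive data selection.

```python
def select_by_patterns(samples, error_patterns):
    """Keep samples whose token sets contain any error-associated pattern.

    samples: list of (token set, example) pairs
    error_patterns: list of token sets that Premise associated with errors
    """
    return [example for tokens, example in samples
            if any(pattern <= tokens for pattern in error_patterns)]

# Hypothetical usage with your own pipeline:
# guided = select_by_patterns(unlabeled_pool, error_patterns)
# fine_tune(ner_model, annotate(guided))   # annotate and fine-tune as usual
```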
Try It Yourself!
For an in-depth analysis of the experiments, we refer to the full publication Hedderich et al. (2022). To make our approach easy to use, we provide the PyPremise Python library (https://github.com/uds-lsv/PyPremise), which encapsulates our approach and allows NLP or ML practitioners to get explanations for their classifier with a few lines of code.
References
- Hedderich et al. (2022) Michael A. Hedderich, Jonas Fischer, Dietrich Klakow, and Jilles Vreeken. 2022. Label-descriptive patterns and their application to characterizing classification errors. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 8691–8707. PMLR.
- Ribeiro et al. (2016) Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1135–1144. ACM.
- Tan and Bansal (2019) Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5100–5111.
- Zhang et al. (2018) Yan Zhang, Jonathon S. Hare, and Adam Prügel-Bennett. 2018. Learning to count objects in natural images for visual question answering. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
- Zhu et al. (2016) Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).