Capturing Label Distribution: A Case Study in NLI
Abstract
We study estimating the inherent human disagreement (the distribution of annotation labels) in the natural language inference task. Post-hoc smoothing of the predicted label distribution to match the expected label entropy is very effective: this simple manipulation reduces KL divergence by almost half, yet it does not improve majority-label prediction accuracy or genuinely learn the label distributions. To this end, we introduce a small number of examples with multiple references into training. Departing from the standard practice of collecting a single reference per training example, we find that collecting multiple references can achieve better accuracy under a fixed annotation budget. Lastly, we provide rich analyses comparing these two methods for improving label distribution estimation.
1 Introduction
Recent papers (Pavlick and Kwiatkowski, 2019; Nie et al., 2020) have shown that human annotators disagree when solving natural language inference (NLI) tasks (Dagan et al., 2005; Bowman et al., 2015), which ask whether a hypothesis h is true given a premise p. Such disagreement is not an annotation artifact but rather reflects the judgements of annotators with differing interpretations of entailment (Reidsma and op den Akker, 2008). We study how to estimate this distribution of labels for the NLI task, as introduced in the newly proposed evaluation dataset of Nie et al. (2020), which contains 100 human labels per example.
Without changing model architectures (Devlin et al., 2019), we focus on improving the predicted label distribution. We introduce two simple ideas: calibration and training on examples with multiple references (multi-annotated examples). Estimating the label distribution is closely related to calibration (Raftery et al., 2005), which studies how well the predicted probability distribution aligns with the empirical likelihood, in this case the human label distribution. When trained naively with cross-entropy loss on unambiguously annotated data (examples with a single label), models generate an over-confident distribution (Thulasidasan et al., 2019) that puts a strong weight on a single label. Observing that the entropy of the predicted distribution is substantially lower than that of the human label distribution, we calibrate the predictions so that the two distributions have comparable entropy, via temperature scaling (Guo et al., 2018) and label smoothing (Szegedy et al., 2016).
While such calibration shows strong gains, reducing the KL divergence (Kullback and Leibler, 1951) between the predicted and human distributions by roughly half, it does not improve accuracy. Prior work introduces inherent human disagreement into evaluation; we further embrace such ambiguity in training. Almost all NLP datasets (Wang et al., 2019; Rajpurkar et al., 2016) provide a single reference per training example while collecting multiple references for examples in the evaluation set. We show that, under the same annotation budget, adding a small amount of multi-annotated training data, at the cost of decreasing the total number of examples annotated, improves label distribution estimation as well as prediction accuracy. Lastly, we provide rich analyses of the differences between the calibration approach and the multi-annotation training approach. To summarize, our contributions are the following:
• Introduce calibration techniques to improve label distribution estimation in the natural language inference task.

• Present an empirical study showing that collecting multiple references for a small number of training examples is more effective than labeling as many examples as possible, under the same annotation budget.

• Study the pitfalls of using distributional metrics to evaluate human label distribution.
2 Evaluation
Table 1: Results on ChaosSNLI (left block) and ChaosMNLI (right block). JSD and KL: lower is better; acc (old/new): higher is better; entropy is the mean entropy of the predicted label distribution, reported for our subset experiments.

| Model | JSD | KL | acc (old/new) | entropy | JSD | KL | acc (old/new) | entropy |
|---|---|---|---|---|---|---|---|---|
| Best reported (all) | 0.220 | 0.468 | 0.749 / 0.787 | – | 0.305 | 0.665 | 0.674 / 0.635 | – |
| Est. human | 0.061 | 0.041 | 0.775 / 0.940 | – | 0.069 | 0.038 | 0.660 / 0.860 | – |
| RoBERTa (all) | 0.229 | 0.505 | 0.727 / 0.754 | – | 0.307 | 0.781 | 0.639 / 0.592 | – |
| RoBERTa (our reimpl., all) | 0.230 | 0.502 | 0.723 / 0.754 | – | 0.310 | 0.790 | 0.642 / 0.594 | – |
| RoBERTa (our reimpl., subset) | 0.242 | 0.548 | 0.684 / 0.710 | 0.345 | 0.308 | 0.799 | 0.670 / 0.604 | 0.414 |
| + calib (temp. scaling) | 0.202 | 0.281 | 0.684 / 0.710 | 0.569 | 0.233 | 0.324 | 0.670 / 0.604 | 0.720 |
| + calib (pred smoothing) | 0.222 | 0.326 | 0.684 / 0.710 | 0.566 | 0.245 | 0.347 | 0.670 / 0.604 | 0.722 |
| + calib (train smoothing) | 0.221 | 0.338 | 0.688 / 0.710 | 0.537 | 0.252 | 0.372 | 0.680 / 0.602 | 0.701 |
| + multi-annot | 0.183 | 0.203 | 0.690 / 0.740 | 0.649 | 0.190 | 0.179 | 0.646 / 0.690 | 0.889 |
| + pred smoothing & multi-annot | 0.202 | 0.196 | 0.690 / 0.740 | 0.773 | 0.209 | 0.189 | 0.646 / 0.690 | 0.977 |
2.1 Data
We use the training data from the original SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets, containing 592K and 392K instances respectively. We evaluate on the ChaosNLI dataset (Nie et al., 2020), which is collected in the same way as the original SNLI dataset but contains 100 labels per example instead of five. ChaosNLI covers SNLI, MNLI, and αNLI (Bhagavatula et al., 2020); we focus our study on the first two, as they show more disagreement among annotators. We repartition this data to simulate a multi-annotation setting and test whether having more than one label per training example can be helpful. We reserve 500 randomly sampled examples for evaluation and use the rest for training. (The original datasets are split such that a premise does not occur in both the train and evaluation sets. Our random repartition breaks that assumption: a premise can now occur in both training and evaluation with different hypotheses. However, we find that performance on examples with and without an overlapping premise in the training set does not vary significantly.)
2.2 Metrics
Following Nie et al. (2020), we report classification accuracy, Jensen-Shannon divergence (JSD; Endres and Schindelin, 2003), and Kullback-Leibler divergence (KL; Kullback and Leibler, 1951), comparing human label distributions with the softmax outputs of models. Accuracy is computed twice: once against the aggregated gold labels in the original dataset (old), and once against the aggregated label from the 100-way annotated dataset (new). In addition, we report the entropy of the model's predicted label distribution.
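As a rough illustration (this is our sketch, not the official ChaosNLI evaluation script), these metrics can be computed from paired human and predicted label distributions as below; the direction of the KL term and the log base are assumptions and should follow the official evaluation code.

```python
# Illustrative metric computation (not the official ChaosNLI evaluation code).
# Each row of `human` and `pred` is a distribution over the three NLI labels.
import numpy as np
from scipy.spatial.distance import jensenshannon  # returns the JS *distance*
from scipy.stats import entropy                   # entropy / KL divergence

def label_distribution_metrics(human: np.ndarray, pred: np.ndarray) -> dict:
    """human, pred: arrays of shape (num_examples, 3) whose rows sum to 1."""
    eps = 1e-12  # avoid log(0) in the KL term
    # Jensen-Shannon distance (Endres and Schindelin, 2003) per example.
    jsd = float(np.mean([jensenshannon(h, p, base=2) for h, p in zip(human, pred)]))
    # KL(human || model); direction and log base are assumptions here.
    kl = float(np.mean([entropy(h + eps, p + eps) for h, p in zip(human, pred)]))
    # Accuracy of the predicted majority label against the human majority label.
    acc = float(np.mean(human.argmax(axis=1) == pred.argmax(axis=1)))
    # Mean entropy of the predicted label distribution.
    ent = float(np.mean([entropy(p + eps) for p in pred]))
    return {"JSD": jsd, "KL": kl, "acc": acc, "entropy": ent}
```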
Table 2: Results under smaller annotation budgets. "# annot" is the total annotation budget shared by each block of three rows; ChaosSNLI is the left block and ChaosMNLI the right block. JSD and KL: lower is better; acc (old/new): higher is better.

| Model | # annot | JSD | KL | acc (old/new) | entropy | JSD | KL | acc (old/new) | entropy |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa | 150K | 0.250 | 0.547 | 0.676 / 0.688 | 0.363 | 0.312 | 0.753 | 0.628 / 0.578 | 0.444 |
| + pred smoothing | 150K | 0.231 | 0.342 | 0.676 / 0.688 | 0.573 | 0.253 | 0.363 | 0.628 / 0.578 | 0.737 |
| + multi-annot | 150K | 0.186 | 0.219 | 0.684 / 0.732 | 0.643 | 0.195 | 0.183 | 0.616 / 0.684 | 0.910 |
| RoBERTa | 15K | 0.264 | 0.534 | 0.668 / 0.656 | 0.445 | 0.319 | 0.686 | 0.552 / 0.496 | 0.518 |
| + pred smoothing | 15K | 0.256 | 0.412 | 0.668 / 0.656 | 0.594 | 0.276 | 0.424 | 0.552 / 0.496 | 0.752 |
| + multi-annot | 15K | 0.293 | 0.336 | 0.624 / 0.674 | 0.985 | 0.252 | 0.269 | 0.546 / 0.554 | 1.026 |
2.3 Comparison Systems
We use a RoBERTa-based classification model (Liu et al., 2019): the concatenated premise and hypothesis are encoded, and the resulting [CLS] representation is passed through a fully connected layer to predict the label distribution; the model is trained with cross-entropy loss.
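As a concrete illustration of this setup, a minimal sketch is given below. This is not the authors' released code; the checkpoint name, three-way label set, and soft-label loss are assumptions for illustration.

```python
# Minimal sketch of the classifier described above (assumed roberta-base, 3 labels).
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class NLILabelDistributionClassifier(nn.Module):
    def __init__(self, num_labels: int = 3):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # first (<s>/CLS) token representation
        return self.classifier(cls)         # unnormalized logits over the labels

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = NLILabelDistributionClassifier()

batch = tokenizer(["A man is sleeping."], ["A person rests."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])

# Cross entropy against a (possibly soft) target distribution; with soft targets
# this differs from KL divergence only by the constant target entropy.
target = torch.tensor([[0.6, 0.4, 0.0]])    # hypothetical soft label
loss = (-target * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```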
Calibration Methods
We experiment with three calibration methods (Guo et al., 2018; Miller et al., 1996). The first two are post-hoc and do not require re-training the model. For all methods, we tune a single scalar hyperparameter per dataset such that the entropy of the predicted label distribution matches that of the human label distribution. A minimal sketch of the two post-hoc methods appears after the list below.
• temp. scaling: scale the non-normalized logits by a scalar hyperparameter (the temperature) before applying the softmax

• pred smoothing: post-process the softmaxed label distribution by moving probability mass from the label with the highest mass to all labels equally

• train smoothing: smooth the training label distribution by shifting probability mass from the gold label to all labels equally
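The sketch below illustrates the two post-hoc methods referenced above. It is our reading of the descriptions rather than the authors' code; in particular, the redistribution scheme for prediction smoothing (shifting mass off the most probable label) is one interpretation of the bullet above.

```python
# Post-hoc calibration sketches; the hyperparameters T and lam are tuned per
# dataset so that the mean predicted entropy matches the mean human label entropy.
import numpy as np

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Flatten (T > 1) or sharpen (T < 1) the softmax by rescaling the logits."""
    z = logits / T
    z = z - z.max()                     # numerical stability
    p = np.exp(z)
    return p / p.sum()

def prediction_smoothing(probs: np.ndarray, lam: float) -> np.ndarray:
    """One reading of 'pred smoothing': move lam probability mass off the most
    probable label and redistribute it uniformly over all labels."""
    smoothed = probs.astype(float).copy()
    top = smoothed.argmax()
    shift = min(lam, smoothed[top])     # never remove more mass than exists
    smoothed[top] -= shift
    smoothed += shift / len(smoothed)
    return smoothed
```

Training-time smoothing applies the same kind of uniform redistribution to the gold one-hot label before computing the training loss.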
Multi-Annotated Data Training
We compare results under a fixed annotation budget, i.e., the total number of annotations collected. We vary the number of examples annotated and the number of annotations per example. We remove 10K randomly sampled single-annotated examples from the training portion and add 1K 10-way annotated examples from the re-partitioned training portion of the ChaosNLI dataset. For each example, we sample 10 out of the 100 annotations. We first train the model on single-annotated examples and then finetune it on multi-annotated examples. (We find that merging multi-annotation data with single-annotation data does not show improvements.)
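A rough sketch of this budget-matched construction is given below; the field names, label codes, and helper function are ours, not from the paper's code.

```python
# Replace 10K single-annotated examples with 1K examples carrying 10 labels each,
# keeping the total number of annotations fixed (10K removed, 1K x 10 added).
import random

def build_training_sets(single_annotated, chaos_train, drop_n=10_000,
                        multi_n=1_000, labels_per_example=10, seed=0):
    rng = random.Random(seed)
    # Remove drop_n randomly sampled single-annotated examples.
    kept_single = rng.sample(single_annotated, len(single_annotated) - drop_n)
    multi = []
    for ex in rng.sample(chaos_train, multi_n):
        # Sample 10 of the 100 ChaosNLI annotations and turn them into a soft label.
        sampled = rng.sample(ex["labels"], labels_per_example)
        dist = [sampled.count(lbl) / labels_per_example for lbl in ("e", "n", "c")]
        multi.append({"premise": ex["premise"], "hypothesis": ex["hypothesis"],
                      "label_dist": dist})
    return kept_single, multi
```

The model is first trained on the single-annotated set with one-hot labels and then finetuned on the multi-annotated set with the sampled soft label distributions.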
2.4 Results
Table 1 compares the published results from Nie et al. (2020) with our reimplementation and proposed approaches. Since we set aside some 100-way annotated examples for training, our results are reported on a randomly selected subset of the evaluation dataset; as a sanity check, we also report our re-implemented results on the full set, which match the published numbers.
The initial model is over-confident, with lower predicted label entropy (0.345/0.414) than the annotated label distribution entropy (0.563/0.732). All calibration methods improve performance on both distribution metrics (JSD and KL). Temperature scaling yields slightly better results than label smoothing, consistent with the findings of Desai and Durrett (2020), who show that temperature scaling is better for in-domain calibration than label smoothing.
Finetuning on multi-annotated data improves the distribution metrics and accuracy, yet performance remains below the estimated human performance. This approach appears effective at capturing inherent ambiguity rather than learning label noise from the crowdsourced dataset: our method shows larger accuracy gains on the newly 100-way annotated dataset, whose majority label is less noisy, than against the majority label from the original 5-way development set. We rerun the baseline and this model three times with different random seeds to measure variance, which is small. (The standard deviation of KL for all method/dataset pairs is lower than 0.01.) Using both multi-annotated data and calibration shows mixed results, suggesting that calibration is not needed once multi-annotated examples are available.
Table 2 studies scenarios with smaller annotation budgets. A similar trend holds, yet in the smallest annotation budget setting, the gains in distribution metrics from the finetuning approach come at the cost of an accuracy drop, potentially because the model is not exposed to sufficiently diverse examples.
Figure 1: Empirical distribution of examples over entropy bins: (a) human label entropy, (b) RoBERTa prediction entropy, (c) calibrated RoBERTa, (d) multi-annot RoBERTa. [plots omitted]
3 Analysis
Can we estimate the distribution of ambiguous and less ambiguous examples? Figure 1 shows the empirical distribution of examples over entropy bins: the leftmost plot shows the annotated human label entropy over our evaluation set, and the plot next to it shows the prediction entropy of the baseline RoBERTa model. Trained only on single-label training examples, the model is over-confident about its predictions. With label smoothing, the over-confidence problem is alleviated, but the prediction entropy still does not match the ground-truth distribution (see plot (c)). Training with multi-annotated data (plot (d)) makes the prediction distribution similar to the ground truth.
Figure 2: Effect of increasing the prediction smoothing value (0, 0.3, 0.6). [plots omitted]
Are label distribution metrics reliable? Nie et al. (2020) suggest that examples that are ambiguous for humans are also more challenging for models in this dataset. We find that this observation holds for the accuracy measure but not for the distribution metrics. Figure 2 demonstrates that when we further smooth the label distribution, JSD improves on supposedly more challenging examples where humans disagree. We find that smoothing, which improves distribution metrics but not accuracy, can generate unlikely label distributions, e.g., assigning high probability to both the entailment and contradiction labels. We hypothesize that humans often confuse neutral with one of the other two labels, but rarely confuse entailment with contradiction. To quantify this intuition, we compute the average minimal probability assigned to either the contradiction or the entailment label: this value is only 0.03 for human annotations and 0.02 for the baseline model. Label smoothing increases it to 0.06, while finetuning with multi-annotated data increases it only to 0.04.
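A small sketch of this statistic, assuming distributions ordered [entailment, neutral, contradiction]:

```python
import numpy as np

def mean_min_entail_or_contra(dists: np.ndarray) -> float:
    """dists: (N, 3) distributions ordered [entailment, neutral, contradiction].
    Returns the average of min(P(entailment), P(contradiction)); a large value
    means non-trivial mass is placed on both extreme labels at once."""
    return float(np.minimum(dists[:, 0], dists[:, 2]).mean())
```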
We summarize studies not covered in our evaluation section (details in the appendix). Do the results hold for other model architectures or a bigger model? Yes. Our results show the same pattern with the ALBERT model (Lan et al., 2020) and the larger variant of the RoBERTa model. Is the method sensitive to the number of labels per training example? No. We try different labeling strategies (5-way, 10-way, or 20-way), while keeping the total number of annotations fixed, and observe no significant changes. What happens if we keep smoothing the distribution? We choose the label smoothing hyperparameter such that the predicted label distribution entropy matches that of the human annotation distribution; however, we notice that smoothing the model prediction even further brings additional gains on SNLI. Should we carefully select which examples receive multiple annotations? Maybe. We experiment with how to select examples for multiple annotation, using ideas from Swayamdipta et al. (2020): we finetune with the 100 most hard-to-learn, most easy-to-learn, and most ambiguous examples, as well as 100 randomly sampled examples, out of 1K examples. Easy-to-learn examples, which have the lowest label distribution entropy, are the least effective, but the differences are small in our setting.
4 Related Work
Human Disagreement in NLP
Prior work has studied inherent ambiguity in various language interpretation tasks. Aroyo and Welty (2015) argue that it is improper to assume a single ground truth in crowdsourcing. The question answering, summarization, and translation literatures have long collected multiple references per example for evaluation. Most related to our work, Pavlick and Kwiatkowski (2019) carefully examine the distribution behind human references, and Nie et al. (2020) conduct a larger-scale data collection. To capture the subtleties of the NLI task, Glickman et al. (2005), Zhang et al. (2017), and Chen et al. (2020) introduce graded human responses on the subjective likelihood of an inference. In this work, we explicitly focus on improving label distribution estimation on existing benchmarks.
Efficient Labeling
Previous work has explored different data labeling strategies for NLP tasks, from active learning (Fang et al., 2017) and fine-grained rationales (Dua et al., 2020) to model prediction validation approaches (Kratzwald et al., 2020). Recent work (Mishra and Sachdeva, 2020) studies how much annotation is necessary to solve the NLI task. In this work, we study collecting multiple references per training example, which, to our knowledge, has not been explored.
Calibration
Calibration in NLP (Nguyen and O'Connor, 2015; Ott et al., 2018) can make predictions more useful and interpretable. Large-scale pretrained models are not well calibrated (Jiang et al., 2018), and their predictions tend to be over-confident (Malkin and Bilmes, 2009; Thulasidasan et al., 2019). Calibration has been studied for pretrained language models in classification (Desai and Durrett, 2020) and reading comprehension (Kamath et al., 2020), and in machine learning more generally (e.g., Guo et al., 2018; Pleiss et al., 2017). While these works focus on improving robustness to out-of-domain distributions, we study predicting label distributions.
5 Conclusion
We study capturing inherent human disagreement in the NLI task through calibration and through training on a small amount of multi-annotated examples. Annotating fewer examples many times, as opposed to annotating as many examples as possible, can be useful for other language understanding tasks with ambiguity, and for generation tasks where multiple references are valid and desirable (Hashimoto et al., 2019).
Acknowledgements
The authors thank Greg Durrett, Raymond Mooney, and Michael Zhang for helpful comments on the paper draft.
References
- Aroyo and Welty (2015) Lora Aroyo and Chris Welty. 2015. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Mag., 36:15–24.
- Bhagavatula et al. (2020) Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2020. Abductive commonsense reasoning. In ICLR.
- Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
- Chen et al. (2020) Tongfei Chen, Zhengping Jiang, Keisuke Sakaguchi, and Benjamin Van Durme. 2020. Uncertain natural language inference. In ACL.
- Dagan et al. (2005) I. Dagan, Oren Glickman, and B. Magnini. 2005. The pascal recognising textual entailment challenge. In MLCW.
- Desai and Durrett (2020) Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. In EMNLP.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Dodge et al. (2019) Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2185–2194.
- Dua et al. (2020) Dheeru Dua, Sameer Singh, and Matt Gardner. 2020. Benefits of intermediate annotations in reading comprehension. In ACL.
- Endres and Schindelin (2003) Dominik Maria Endres and Johannes E Schindelin. 2003. A new metric for probability distributions. IEEE Transactions on Information theory, 49(7):1858–1860.
- Fang et al. (2017) Meng Fang, Y. Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. ArXiv, abs/1708.02383.
- Glickman et al. (2005) Oren Glickman, I. Dagan, and Moshe Koppel. 2005. A probabilistic classification approach for lexical textual entailment. In AAAI.
- Guo et al. (2018) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2018. On calibration of modern neural networks. ICML.
- Hashimoto et al. (2019) T. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. ArXiv, abs/1904.02792.
- Jiang et al. (2018) Heinrich Jiang, Been Kim, and M. Gupta. 2018. To trust or not to trust a classifier. In NeurIPS.
- Kamath et al. (2020) Amita Kamath, Robin Jia, and Percy Liang. 2020. Selective question answering under domain shift. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5684–5696.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kratzwald et al. (2020) Bernhard Kratzwald, Stefan Feuerriegel, and Huan Sun. 2020. Learning a cost-effective annotation policy for question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3051–3062.
- Kullback and Leibler (1951) Solomon Kullback and Richard A Leibler. 1951. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86.
- Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. Albert: A lite bert for self-supervised learning of language representations. ArXiv, abs/1909.11942.
- Liu et al. (2019) Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692.
- Malkin and Bilmes (2009) J. Malkin and J. Bilmes. 2009. Multi-layer ratio semi-definite classifiers. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4465–4468.
- Miller et al. (1996) David J. Miller, A. Rao, K. Rose, and A. Gersho. 1996. A global optimization technique for statistical classifier design. IEEE Trans. Signal Process., 44:3108–3122.
- Mishra and Sachdeva (2020) Swaroop Mishra and Bhavdeep Singh Sachdeva. 2020. Do we need to create big datasets to learn a task? In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 169–173.
- Nguyen and O’Connor (2015) Khanh Nguyen and Brendan T. O’Connor. 2015. Posterior calibration and exploratory analysis for natural language processing models. In EMNLP.
- Nie et al. (2020) Yixin Nie, Xiang Zhou, and Mohit Bansal. 2020. What can we learn from collective human opinions on natural language inference data? In EMNLP.
- Ott et al. (2018) Myle Ott, M. Auli, David Grangier, and Marc’Aurelio Ranzato. 2018. Analyzing uncertainty in neural machine translation. In ICML.
- Pavlick and Kwiatkowski (2019) Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.
- Pleiss et al. (2017) Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. 2017. On fairness and calibration. In Advances in Neural Information Processing Systems, pages 5680–5689.
- Raftery et al. (2005) A. Raftery, T. Gneiting, F. Balabdaoui, and M. Polakowski. 2005. Using bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133:1155–1174.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100, 000+ questions for machine comprehension of text. In EMNLP.
- Reidsma and op den Akker (2008) D. Reidsma and Rieks op den Akker. 2008. Exploiting ‘subjective’ annotations. In COLING 2008.
- Swayamdipta et al. (2020) Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In EMNLP.
- Szegedy et al. (2016) C Szegedy, V Vanhoucke, S Ioffe, J Shlens, and Z Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.
- Thulasidasan et al. (2019) S. Thulasidasan, Gopinath Chennupati, J. Bilmes, Tanmoy Bhattacharya, and S. Michalak. 2019. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS.
- Wang et al. (2019) Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. In NeurIPS.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT.
- Wolf et al. (2020) Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
- Zhang et al. (2017) Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association for Computational Linguistics, 5:379–395.
Appendix
Hyperparameters And Experimental Settings
Our implementation is based on HuggingFace Transformers (Wolf et al., 2020). We optimize a KL divergence objective with the Adam optimizer (Kingma and Ba, 2014), and the batch size is set to 128 for all experiments. The RoBERTa-base and ALBERT models are trained for 3 epochs on single-annotated data. For the finetuning phase, the model is trained for another 9 epochs. The learning rate is chosen with AllenTune (Dodge et al., 2019). For post-hoc label smoothing, we choose the smoothing value such that the predicted label entropy matches the human label distribution entropy: 0.125 for SNLI and 0.225 for MNLI. For post-hoc temperature scaling, we likewise choose the temperature such that the predicted label entropy matches the human label distribution entropy (1.75 for SNLI and 2 for MNLI). For training label smoothing, we follow the post-hoc label smoothing value.
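A sketch of this entropy-matching selection is shown below; the candidate grid and function names are illustrative, not the values actually searched in the paper.

```python
# Pick the calibration hyperparameter whose mean predicted entropy is closest
# to the mean entropy of the human label distributions.
import numpy as np
from scipy.stats import entropy

def pick_by_entropy_match(pred_dists, human_dists, candidates, calibrate_fn):
    """calibrate_fn(dist, value) returns a calibrated label distribution."""
    target = np.mean([entropy(h) for h in human_dists])
    def entropy_gap(value):
        calibrated = [calibrate_fn(p, value) for p in pred_dists]
        return abs(np.mean([entropy(c) for c in calibrated]) - target)
    return min(candidates, key=entropy_gap)

# e.g., best_lam = pick_by_entropy_match(pred_dists, human_dists,
#                                        np.arange(0.0, 0.5, 0.025),
#                                        prediction_smoothing)  # Sec. 2.3 sketch
```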
Additional Experiments
Table 3 shows ablations over different label collection strategies. While keeping the total number of annotations fixed, we vary the number of multi-annotated examples and the number of annotations per multi-annotated example. We find that the performance improvements do not vary significantly across settings.
Table 3: Ablation over label collection strategies. "# multi" is the number of multi-annotated examples (annotations per example in parentheses) and "# single" is the number of single-annotated examples.

| # multi | # single | JSD | KL | acc (old/new) | entropy |
|---|---|---|---|---|---|
| 0 | 150K | 0.25 | 0.55 | 0.676 / 0.688 | 0.363 |
| 0.5K (20-way) | 130K | 0.20 | 0.22 | 0.676 / 0.726 | 0.695 |
| 1K (10-way) | 140K | 0.19 | 0.22 | 0.684 / 0.732 | 0.643 |
| 5K (5-way) | 145K | 0.19 | 0.22 | 0.676 / 0.732 | 0.701 |
Figure 3 shows how the label smoothing hyperparameter impacts the KL divergence. As the smoothing value increases, the KL divergence first decreases and then increases.
Figure 3: KL divergence as a function of the label smoothing value on (a) SNLI and (b) MNLI. [plots omitted]
Table 4 shows that, with a different model (ALBERT), we obtain consistent results and the same conclusions.
Table 4: Results with ALBERT under a 150K annotation budget (ChaosSNLI left block, ChaosMNLI right block).

| Model | # annot | JSD | KL | acc (old/new) | entropy | JSD | KL | acc (old/new) | entropy |
|---|---|---|---|---|---|---|---|---|---|
| ALBERT | 150K | 0.243 | 0.474 | 0.668 / 0.684 | 0.422 | 0.314 | 0.735 | 0.596 / 0.544 | 0.496 |
| + post-hoc smoothing | 150K | 0.231 | 0.344 | 0.668 / 0.684 | 0.581 | 0.268 | 0.410 | 0.596 / 0.544 | 0.742 |
| + multi-annot | 150K | 0.201 | 0.225 | 0.676 / 0.720 | 0.709 | 0.217 | 0.218 | 0.584 / 0.634 | 0.948 |
Examples
We present examples where our model improves on the baseline predictions in Table 5. The original model produces over-confident predictions (rows 1, 2, and 5). These over-confident predictions are corrected by further finetuning with multi-annotated data. On a few examples, finetuning further improves the accuracy.
Table 5: Examples where finetuning on multi-annotated data improves over the baseline. Distributions are ordered [Entailment, Neutral, Contradiction]; premises and hypotheses are shown verbatim from the dataset.

| Premise | Hypothesis | Human Label Dist | Baseline | Finetuned |
|---|---|---|---|---|
| Six men, all wearing identifying number plaques, are participating in an outdoor race. | A group of marathoners run. | [0.31, 0.68, 0.01] | [0.01, 0.97, 0.02] | [0.27, 0.65, 0.08] |
| A small dog wearing a denim miniskirt. | A dog is having all its hair shaved off. | [0.00, 0.40, 0.60] | [0.00, 0.03, 0.97] | [0.03, 0.29, 0.68] |
| A goalie is watching the action during a soccer game. | The goalie is sitting in the highest bench of the stadium. | [0.01, 0.84, 0.15] | [0.00, 0.39, 0.61] | [0.02, 0.81, 0.17] |
| Two male police officers on patrol, wearing the normal gear and bright green reflective shirts. | The officers have shot an unarmed black man and will not go to prison for it. | [0.00, 0.66, 0.34] | [0.00, 0.15, 0.85] | [0.01, 0.59, 0.40] |
| An african american runs with a basketball as a caucasion tries to take the ball from him. | The white basketball player tries to get the ball from the black player, but he cannot get it. | [0.37, 0.61, 0.02] | [0.04, 0.92, 0.04] | [0.23, 0.70, 0.07] |
| A group of people standing in front of a club. | There are seven people leaving the club. | [0.02, 0.78, 0.20] | [0.02, 0.34, 0.64] | [0.02, 0.66, 0.32] |