Can Language Models Understand Word Semantics as Chatbots? An Empirical Study of Language Model Internal-External Mismatch

Jinman Zhao¹*, Xueyan Zhang²*, Xingyu Yue¹, Weizhe Chen¹, Zifan Qian³, Ruiyu Wang¹
¹University of Toronto, ²Waterloo University, ³University of Alberta
[email protected], [email protected]
*Equal contribution

Abstract

The most common way of interacting with language models today is through full inference. This approach does not necessarily align with a model's internal knowledge: studies have shown discrepancies between prompts and internal representations, but most focus on sentence-level understanding. We study the mismatch between internal and external understanding of word semantics across Encoder-only, Decoder-only, and Encoder-Decoder pre-trained language models.
1 Introduction
Language models (LMs) (Devlin et al., 2019a; Radford et al., 2019; Wang and Komatsuzaki, 2021; Brown et al., 2020) have drawn a wide range of interest across many fields. Their ability to process natural language, encode data into parameters, and generate convincing paragraphs leads many people to treat them as a trusted knowledge source. An LM's truthfulness is therefore a key factor in determining whether it is suitable for many downstream applications; in other words, researchers need to assess the integrity of LMs' claims.
Machine honesty has become an important topic in recent LLM research. Honesty intersects with aspects such as truthfulness (Evans et al., 2021), calibration (Guo et al., 2017; Minderer et al., 2021; Mielke et al., 2022), self-knowledge (Yin et al., 2023; Kadavath et al., 2022), and non-deceptiveness (Azaria and Mitchell, 2023). Several works investigate whether AI models are aware of what they are expressing. A comprehensive analysis of LLM honesty by Kadavath et al. (2022) concludes that LLMs are well calibrated, and Cheng et al. (2024) reaches similar conclusions regarding models' awareness and understanding of what they know and do not know. Other works have demonstrated quirky behaviors and phenomena associated with how models respond to prompts (Khashabi et al., 2022; Webson et al., 2023).
Prior work has repeatedly demonstrated a discrepancy between internal and external representations. Hu and Levy (2023) explored the discrepancy between a model's internal next-token distribution and the distribution obtained using prompts such as "What is the best next word?". Liu et al. (2023) analyzed the model's internal and external inconsistencies from the perspectives of probing (internal) and querying (external). Azaria and Mitchell (2023) investigated how to use the internal state to determine the truthfulness of text generated by language models, thereby also confirming inconsistencies between a model's internal and external outputs.
In this work, external output refers to the results produced by LMs, specifically the distributions over special positional tokens (e.g., the [MASK] token in Encoder-based LMs or the next token in Decoder-based LMs). Research shows that information is stored in the internal hidden representations, so we use hidden representations as the internal information (Wang et al., 2023b). ELMo (Peters et al., 2018) was the first to introduce contextual embeddings by adapting embeddings to word usage in context; before that, word embeddings were static (Mikolov et al., 2013; Pennington et al., 2014). BERT (Devlin et al., 2019a) utilizes the transformer architecture to capture deep contextual nuances, setting new standards for various tasks.
Word embeddings represent the contextual meaning of a word as high-dimensional vectors. In this work, we employ probes and queries to compare language models on three commonly used word embedding evaluation benchmarks. Previous research by Liu et al. (2023) found no significant difference between queries and probes on question-answering tasks, which primarily require sentence-level meaning extraction. Our results diverge markedly from these findings: we observe a substantial gap between probes and queries, highlighting potential limitations of queries in capturing word-level semantics.
2 Method
To investigate LMs' understanding of word semantics, we focus on three distinct tasks spanning the spectrum of LM training streams: word similarity, structured prediction, and analogy. For each task, we first introduce the benchmark, followed by the probing and querying strategies.
We employ linear probing, which is commonly used in recent NLP work (Liu et al., 2023; Marks and Tegmark, 2024). Compared to finetuning, a linear probe has only thousands of parameters, significantly fewer than the LMs themselves with millions to billions of parameters.
2.1 Word Similarity
Benchmark
Word similarity tasks (Finkelstein et al., 2001; Luong et al., 2013) test the semantic similarity between words. We use WiC (Pilehvar and Camacho-Collados, 2019) to test the similarity of contextual embeddings. WiC contains 5428 test instances and 1400 training instances. Each instance contains a pair of sentences that both contain the target word, and the gold label indicates whether the target word has the same contextual meaning in the two sentences.
Probe
Let $w_1, \dots, w_n$ be the tokens that construct the target sentence, and let $h_i, \dots, h_j$ be the hidden vectors of the target word's tokens in the first sentence. We use the average vector $u = \frac{1}{j-i+1}\sum_{k=i}^{j} h_k$ to represent the target word in the first sentence.
Similarly, we use $v$ to represent the target word in the other sentence. We adopt the classification objective of Reimers and Gurevych (2019), which takes $(u, v, |u - v|)$ as input and builds a 2-class logistic regression on top:

$$o = \operatorname{softmax}\left(W\,[u;\, v;\, |u - v|]\right)$$
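Below is a minimal sketch of this probing setup, assuming a Hugging Face `transformers` encoder and scikit-learn for the logistic regression; the model name, the word-localization heuristic, and helper names such as `word_vector` are illustrative assumptions rather than the exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Average the last-layer hidden states of the target word's subword tokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Find the target word's subword span (first occurrence) and average it.
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    return hidden.mean(dim=0)                               # fallback: whole sentence

def probe_features(s1: str, s2: str, word: str) -> torch.Tensor:
    # Concatenate (u, v, |u - v|) as in the classification objective above.
    u, v = word_vector(s1, word), word_vector(s2, word)
    return torch.cat([u, v, (u - v).abs()])

# Training the 2-class probe (labels: 1 = same sense, 0 = different sense):
# from sklearn.linear_model import LogisticRegression
# X = torch.stack([probe_features(s1, s2, w) for s1, s2, w in pairs]).numpy()
# probe = LogisticRegression(max_iter=1000).fit(X, labels)
```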
Query
We use queries that are commonly used in other work (Wei et al., 2022). For example:
{Sentence1}
{Sentence2}
Does the word "{word}" mean the same thing in the above two sentences?
Answer:[MASK]
The prompts we used are listed in Appendix A, and we report the result of the prompt with the highest accuracy. For generative LMs, we ask the model to generate the token at the [MASK] position. After inference, we extract the output logits and compare the probabilities of the expected answer tokens; for example, BERT is expected to output the token 'Yes' or 'No', and a normalized probability over these two tokens is computed.
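As a concrete illustration of the query-side scoring, the following sketch computes the normalized Yes/No probability at the [MASK] position for a BERT-style masked LM; the model name, the example prompt, and the single-token answer assumption are ours, and generative LMs would instead score the first generated token.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def yes_probability(prompt_with_mask: str) -> float:
    """Return P('yes') normalized over {'yes', 'no'} at the [MASK] position."""
    enc = tokenizer(prompt_with_mask, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    yes_id = tokenizer.convert_tokens_to_ids("yes")   # uncased vocabulary
    no_id = tokenizer.convert_tokens_to_ids("no")
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

# Hypothetical WiC-style example:
prompt = (
    "He sat on the bank of the river.\n"
    "The bank raised its interest rates.\n"
    'Does the word "bank" mean the same thing in the above two sentences?\n'
    "Answer:[MASK]"
)
prediction = "same" if yes_probability(prompt) > 0.5 else "different"
```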
2.2 Structured Prediction
Benchmark
The Named Entity Recognition (NER) task (Tjong Kim Sang and De Meulder, 2003; Derczynski et al., 2017) is to identify and classify entities (such as names of persons, organizations, and locations) in a given text. NER is also used to evaluate word embeddings (Pennington et al., 2014). In this work, we use CoNLL2003 (Tjong Kim Sang and De Meulder, 2003), which contains 46,435 tokens in the test set. CoNLL2003 has four entity types: person, location, miscellaneous, and organization. Detailed statistics are listed in Appendix C.
Probe
Similarly, we use $u$ to denote the average of the hidden vectors of all tokens in the target word. We then build a 5-class logistic regression on top:

$$o = \operatorname{softmax}\left(W u\right)$$
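A corresponding sketch of the NER probe is given below; it reuses the `word_vector` helper from the WiC sketch above and assumes (our reading of the 5-class setup) that the fifth class is a "no entity" label alongside the four CoNLL2003 entity types.

```python
import torch
from sklearn.linear_model import LogisticRegression

# Integer labels 0..4: person, location, organization, miscellaneous, no entity (assumed).
# train_items: list of (sentence, word) pairs; train_tags: list of integer labels.
def train_ner_probe(train_items, train_tags):
    X = torch.stack([word_vector(s, w) for s, w in train_items]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, train_tags)
```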
Query
After comparing the accuracy of many prompts, we adopt the following:
{Sentence}. The word {word} in the previous sentence is labelled as [MASK]
We compare the probabilities of "location", "person", "organization", and "miscellaneous" and select the one with the highest score as the output, as sketched below.
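This is a hedged sketch of the label selection, reusing the tokenizer and masked LM from the WiC query sketch and assuming each label is scored by its first subword token.

```python
import torch

LABELS = ["location", "person", "organization", "miscellaneous"]

def predict_entity_label(prompt_with_mask: str) -> str:
    """Pick the label word with the highest probability at the [MASK] position."""
    enc = tokenizer(prompt_with_mask, return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, mask_pos]
    # Score each label by the logit of its first subword token (an approximation).
    label_ids = [tokenizer(lab, add_special_tokens=False)["input_ids"][0] for lab in LABELS]
    return LABELS[int(torch.argmax(logits[label_ids]))]
```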
2.3 Analogy
Benchmark
BATS (Gladkova et al., 2016) is an analogy dataset containing 199 validation instances and 1799 test instances. BATS is commonly used to evaluate the quality of word embeddings by testing their ability to capture semantic and syntactic relationships between words. The benchmark contains multiple-choice questions that provide a stem pair of words a and b and ask to choose, from four candidate pairs, the pair that best fits "a is to b as c is to d". For example, given the stem pair ("einstein", "physicist") and the four candidate pairs ("bee", "larva"), ("schwarzenegger", "napoleon"), ("pascal", "mathematician"), ("locke", "Confucius"), the pair ("pascal", "mathematician") should be chosen since its relation is closest to that of the stem pair.
Probe
We first use GPT-4 to generate 5 sentences for each word in BATS, compute the hidden vectors of each word in each sentence, and average the 5 resulting vectors to obtain the vector representation of each word. For the probe, each question has three negative samples and one positive sample, which makes the training data unbalanced. Following Ushio et al. (2021), for gold analogies we treat both (a, b)-(c, d) and (a, c)-(b, d) as positive samples, which increases the number of positive samples. Let $x_a$ be the vector representation of word $a$, and similarly for $b$, $c$, and $d$. For the analogy question, the offset from $a$ to $b$ should be similar to the offset from $c$ to $d$. We therefore again adopt the classification objective of Reimers and Gurevych (2019):

$$u = x_b - x_a, \qquad v = x_d - x_c \quad (1)$$
$$o = \operatorname{softmax}\left(W\,[u;\, v;\, |u - v|]\right) \quad (2)$$
During the evaluation step, the pair with the highest positive probability will be chosen.
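The evaluation step can be sketched as follows, assuming a trained binary probe `analogy_probe` (e.g., a scikit-learn logistic regression over the features in Eq. 1-2) and a dictionary `word_vec` mapping each BATS word to its averaged contextual vector; both names are illustrative.

```python
import numpy as np

def pick_candidate(stem, candidates, analogy_probe, word_vec):
    """Return the candidate pair (c, d) with the highest positive-class probability."""
    a, b = stem
    u = word_vec[b] - word_vec[a]
    scores = []
    for c, d in candidates:
        v = word_vec[d] - word_vec[c]
        feats = np.concatenate([u, v, np.abs(u - v)]).reshape(1, -1)
        scores.append(analogy_probe.predict_proba(feats)[0, 1])  # P(positive)
    return candidates[int(np.argmax(scores))]

# Example usage with the stem/choices from Section 2.3:
# pick_candidate(("einstein", "physicist"),
#                [("bee", "larva"), ("schwarzenegger", "napoleon"),
#                 ("pascal", "mathematician"), ("locke", "confucius")],
#                analogy_probe, word_vec)
```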
Query
We select the following prompt:
{} is to {} as:
A) {} is to {}
B) {} is to {}
C) {} is to {}
D) {} is to {}
Answer:[MASK]
Other prompts are listed in Appendix B.
3 Results
Model Selection
We selected three fundamentally different model families based on architecture: BERT (Encoder-only), GPT-2 (Decoder-only), and T5 (Encoder-Decoder).
3.1 Main Results
Table 1 shows the accuracy achieved by the representative models on the target benchmarks. We find noticeable differences between probe and query in terms of capturing word semantics. This gap is evident across all models and all benchmarks, highlighting that pretrained language models, when used as chatbots, can exhibit information discrepancies compared to the knowledge stored in their internal neurons.
On the WiC benchmark, the answer to the prompt question is binary (a yes-no question); query accuracy for all models falls within the range of 49% to 53%, close to random guessing (50%). Probe accuracy is considerably higher, reaching up to 65% on judging in-context word semantics. As aforementioned, because probing performs linear classification directly on the word embedding, the above-random accuracy indicates that the internal representation is indeed capable of distinguishing word similarity; however, this knowledge fails to propagate to the model output.
F1 score is a common metric for NER tasks, and here we observe an even more pronounced internal-external discrepancy. Because models with an encoder have a better understanding of the input words, they outperform decoder-only models: BERT embeddings with probing achieve state-of-the-art performance with an F1 score of 96%. GPT-2, on the other hand, has a much lower F1 score, consistent with the observations of Wang et al. (2023a) and Xie et al. (2023) that GPT-3/ChatGPT underperform BERT in both fine-tuned and zero-shot settings. In contrast, the performance of queries is even lower than random guessing.
Given that the Analogy benchmark prompt is a multiple-choice question with four options, BERT models exhibit nearly random-guess query accuracy of around 25%, while their probe accuracy almost doubles. GPT-2 and T5 direct some of their understanding to the output, with query accuracy around 30%. GPT-2 has the lowest probe accuracy at 41%, which may reflect that decoder-based models are better suited for text generation and less effective at extracting word meaning.
Table 1: Probe and query results on WiC, NER (CoNLL2003), and Analogy (BATS). All values are percentages.

| Model | Method | WiC Acc | NER Precision | NER Recall | NER F1 | Analogy Acc |
|---|---|---|---|---|---|---|
| BERT-base | Query | 50 | 7 | 100 | 14 | 25 |
| BERT-base | Probe | 65 | 95 | 96 | 96 | 51 |
| BERT-large | Query | 53 | 3 | 100 | 6 | 26 |
| BERT-large | Probe | 65 | 96 | 95 | 96 | 48 |
| GPT-2 | Query | 49 | 4 | 42 | 8 | 33 |
| GPT-2 | Probe | 58 | 97 | 32 | 48 | 41 |
| T5-small | Query | 49 | 5 | 8 | 6 | 31 |
| T5-small | Probe | 61 | 98 | 94 | 96 | 47 |
| T5-large | Query | 50 | 4 | 6 | 5 | 35 |
| T5-large | Probe | 65 | 99 | 96 | 97 | 48 |
3.2 Instruct Tuning and Finetuning
A mismatch between internal and external representations may indicate an alignment issue: the model's knowledge is not properly propagated to the final output. We therefore investigate whether finetuning alleviates this misalignment.
Flan-T5 is an instruction-finetuned model based on T5, trained on a mixture of tasks (Raffel et al., 2023; Wei et al., 2022); notably, WiC is explicitly included as one of the training datasets. As shown in Table 2, Flan-T5 outperforms T5 in query accuracy, showing that finetuning indeed enhances the model's ability to direct its knowledge to the output. A similar observation can be found in Liu et al. (2023), where the authors finetune GPT2-XL on true question/answer pairs. However, although query accuracy is boosted from 50% to 59%, probing still performs better. The two models appear to have a similar internal understanding of word semantics: Flan-T5 only slightly improves probe accuracy, from 65% to 68%, compared to T5.
Table 2: WiC accuracy of T5-large and Flan-T5-large.

| Model | Method | WiC Acc (%) |
|---|---|---|
| T5-large | Query | 50 |
| T5-large | Probe | 65 |
| Flan-T5-large | Query | 59 |
| Flan-T5-large | Probe | 68 |
3.3 Calibration
A well-calibrated model should exhibit close alignment between confidence and accuracy. We show the confidence and accuracy of three models on the WiC task in Figure 1; probes are better calibrated than queries. Furthermore, models with better WiC performance, such as BERT and T5, are also better calibrated than GPT-2.
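One common way to quantify this confidence-accuracy alignment is expected calibration error (ECE); the sketch below is illustrative, and the equal-width binning scheme is our assumption rather than necessarily the one behind Figure 1.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```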
4 Conclusion
In this paper, we studied the discrepancy between language models' internal and external representations, focusing on the ability to understand word semantics. Probing consistently outperforms querying, indicating that there is potential to improve models' truthfulness: currently, model knowledge is not properly reflected in the generated output. We find that finetuning or calibration helps improve accuracy to some extent, but it is still not on par with probe accuracy. Other factors, such as model size, also contribute to the discrepancy. Improving models' truthfulness will unleash their potential in applications where reliability and robustness are essential.
Limitation
Due to limitations in hardware resources and budget constraints, the number of models included in our study is relatively limited. Although we selected representative models to validate our hypotheses, this limitation might affect the generalizability of our findings. Additionally, with restricted computational capacity, we were unable to explore more complex model architectures, which could have provided deeper insights into specific issues. Future research could expand the scope of model selection and explore more diverse and intricate models by securing additional resources, thus enhancing the comprehensiveness and accuracy of the study.
References
- Azaria and Mitchell (2023) Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Cheng et al. (2024) Qinyuan Cheng, Tianxiang Sun, Xiangyang Liu, Wenwei Zhang, Zhangyue Yin, Shimin Li, Linyang Li, Kai Chen, and Xipeng Qiu. 2024. Can ai assistants know what they don’t know? arXiv preprint arXiv:2401.13275.
- Derczynski et al. (2017) Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147, Copenhagen, Denmark. Association for Computational Linguistics.
- Devlin et al. (2019a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Evans et al. (2021) Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, and William Saunders. 2021. Truthful ai: Developing and governing ai that does not lie. arXiv preprint arXiv:2110.06674.
- Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414.
- Gladkova et al. (2016) Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn’t. In Proceedings of the NAACL-HLT SRW, pages 47–54, San Diego, California. Association for Computational Linguistics.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR.
- Hu and Levy (2023) Jennifer Hu and Roger Levy. 2023. Prompting is not a substitute for probability measurements in large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5040–5060, Singapore. Association for Computational Linguistics.
- Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
- Khashabi et al. (2022) Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2022. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3631–3643, Seattle, United States. Association for Computational Linguistics.
- Liu et al. (2023) Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas. 2023. Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4791–4797, Singapore. Association for Computational Linguistics.
- Luong et al. (2013) Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.
- Marks and Tegmark (2024) Samuel Marks and Max Tegmark. 2024. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. Preprint, arXiv:2310.06824.
- Mielke et al. (2022) Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. 2022. Reducing Conversational Agents’ Overconfidence Through Linguistic Calibration. Transactions of the Association for Computational Linguistics, 10:857–872.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.
- Minderer et al. (2021) Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, and Mario Lucic. 2021. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- Raffel et al. (2023) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint, arXiv:1910.10683.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
- Ushio et al. (2021) Asahi Ushio, Luis Espinosa Anke, Steven Schockaert, and Jose Camacho-Collados. 2021. BERT is to NLP what AlexNet is to CV: Can pre-trained language models identify analogies? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3609–3624, Online. Association for Computational Linguistics.
- Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax.
- Wang et al. (2023a) Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. 2023a. Gpt-ner: Named entity recognition via large language models. Preprint, arXiv:2304.10428.
- Wang et al. (2023b) Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. 2023b. Knowledge editing for large language models: A survey. Preprint, arXiv:2310.16218.
- Webson et al. (2023) Albert Webson, Alyssa Loo, Qinan Yu, and Ellie Pavlick. 2023. Are language models worse than humans at following prompts? it’s complicated. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7662–7686, Singapore. Association for Computational Linguistics.
- Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Xie et al. (2023) Tingyu Xie, Qi Li, Jian Zhang, Yan Zhang, Zuozhu Liu, and Hongwei Wang. 2023. Empirical study of zero-shot NER with ChatGPT. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7935–7956, Singapore. Association for Computational Linguistics.
- Yin et al. (2023) Zhangyue Yin, Qiushi Sun, Qipeng Guo, Jiawen Wu, Xipeng Qiu, and Xuanjing Huang. 2023. Do large language models know what they don’t know? In Findings of the Association for Computational Linguistics: ACL 2023, pages 8653–8665, Toronto, Canada. Association for Computational Linguistics.
Appendix A WiC Prompt
See Table 3 for the list of prompts we use in WiC evaluation.
Table 3: Prompts used for WiC evaluation.

{sentence1}
{sentence2}
Does the word "{word}" mean the same thing in the above two sentences?
Answer:[MASK]

Sentence 1: {sentence1}
Sentence 2: {sentence2}
Does {word} mean the same thing in these two sentences?
Answer:[MASK]

Here is one sentence: {sentence1}
Here is another sentence: {sentence2}
Does the term {word} mean the same thing in both these sentences?
Answer:[MASK]

In these two sentences (1) {sentence1} (2) {sentence2},
does the word {word} mean the same thing?
Answer:[MASK]

Does the word "{word}" have the same meaning in the following two sentences?
{sentence1}
{sentence2}
Answer:[MASK]

Is the word "{word}" used in the same way in the following two sentences?
{sentence1}
{sentence2}
Answer:[MASK]

Does the word "{word}" have the same definition in the next two sentences?
{sentence1}
{sentence2}
Answer:[MASK]

Is {word} used to mean the same thing in the next two sentences?
{sentence1}
{sentence2}
Answer:[MASK]

Does "{word}" mean the same thing in these two sentences?
{sentence1}
{sentence2}
Answer:[MASK]

Does the word "{word}" mean the same thing in "{sentence1}" and "{sentence2}"?
Answer:[MASK]
Appendix B Analogy Question Prompts
See Table 4 for the prompts we use for the analogy questions.
Table 4: Prompts used for the analogy questions.

{} is to {} as:
A) {} is to {}
B) {} is to {}
C) {} is to {}
D) {} is to {}
Answer:[MASK]

Which of the following pairs has the most similar relation with {, }?
A) {, }
B) {, }
C) {, }
D) {, }
Answer:[MASK]
Appendix C CoNLL2003 Statistics
See Table 5 for CoNLL2003 statistics.
Table 5: CoNLL2003 dataset statistics.

| Dataset | Sentences | Tokens | Entities |
|---|---|---|---|
| Train | 14,041 | 203,621 | 23,499 |
| Dev | 3,250 | 51,362 | 5,942 |
| Test | 3,453 | 46,435 | 5,648 |