
VicunaNER: Zero/Few-shot Named Entity Recognition using Vicuna

Bin Ji
National University of Singapore
Abstract

Large Language Models (LLMs, e.g., ChatGPT) have shown impressive zero- and few-shot capabilities in Named Entity Recognition (NER). However, these models can only be accessed via online APIs, which may cause data leakage and non-reproducibility problems. In this paper, we propose VicunaNER, a zero/few-shot NER framework based on the newly released open-source LLM, Vicuna. VicunaNER is a two-phase framework, where each phase leverages multi-turn dialogues with Vicuna to recognize entities from texts. We name the second phase Re-Recognition, as it recognizes those entities missed in the first phase (a.k.a. Recognition). Moreover, we include entity correctness-check dialogues in each phase to filter out wrong entities. We evaluate VicunaNER’s zero-shot capacity on 10 datasets across 5 domains and its few-shot capacity on Few-NERD. Experimental results demonstrate that VicunaNER achieves superior performance in both settings. Additionally, we conduct comprehensive investigations of Vicuna from multiple perspectives.

1 Introduction

Named Entity Recognition (NER) serves as a precondition for many downstream Natural Language Processing (NLP) tasks such as relation extraction. Supervised deep learning NER methods require extensive entity annotations and are hard to transfer across domains. Zero- and few-shot NER targets this scenario: it requires zero or only a few annotated examples and is capable of transferring across domains.

Prototypical networks have been widely investigated for zero/few-shot NER, such as StructShot Yang and Katiyar (2020), CONTaiNER Das et al. (2022), ESD Wang et al. (2022), DecomMetaNER Ma et al. (2022), and EP-Net Ji et al. (2022). However, these networks still require fine-tuning datasets of thousands or tens of thousands of examples.

Brown et al. (2020) demonstrate that scaling up language models significantly improves task-agnostic, few-shot NLP task performance, and they propose GPT-3, the well-known milestone of Large Language Models (LLMs). GPT-3 achieves promising performance on diverse NLP tasks without any gradient updates or fine-tuning. Inspired by GPT-3, numerous LLMs have been pre-trained or fine-tuned, such as InstructGPT Ouyang et al. (2022), Chinchilla Hoffmann et al. (2022), ChatGPT (https://chat.openai.com/chat), PaLM Driess et al. (2023), and GPT-4 OpenAI (2023). Based on these LLMs, zero- and few-shot NER has been comprehensively investigated. For example, Jimenez Gutierrez et al. (2022) explore biomedical few-shot NER with GPT-3. And based on ChatGPT, He et al. (2023) investigate document-level few-shot NER; Hu et al. (2023) conduct research on zero-shot clinical NER; Wei et al. (2023) propose ChatIE to explore zero-shot information extraction including NER. Although these LLM-based studies achieve strong performance, sometimes even reaching competitiveness with the prior best prototypical networks, the underlying LLMs can only be accessed through online APIs, which causes the following problems:

  • 1.

    Data leakage problem. For example, sensitive data from Samsung was leaked to ChatGPT (https://techcrunch.com/2023/05/02/samsung-bans-use-of-generative-ai-tools-like-chatgpt-after-april-internal-data-leak/).

  • 2.

    Non-reproducibility problem. The LLMs are fine-tuned constantly, but the details are not publicly available Tu et al. (2023).

Fortunately, some open-source LLMs are available to the public, such as T5 Raffel et al. (2020), OPT Zhang et al. (2022), GLM Zeng et al. (2023), BLOOM Workshop et al. (2023), and LLaMA Touvron et al. (2023). In particular, LLaMA attracts much research attention because: (1) it can be deployed on local servers; (2) many powerful variants have evolved from it via fine-tuning, such as Alpaca Taori et al. (2023), Baize Xu et al. (2023), Koala Geng et al. (2023), and Vicuna Chiang et al. (2023).

With the goal of exploring unrestricted zero- and few-shot NER approaches, we propose VicunaNER, a Vicuna-based framework that can conduct both zero- and few-shot NER. VicunaNER is composed of two phases, named Recognition and Re-Recognition, respectively.

  • 1.

    Recognition consists of multi-turn dialogues with Vicuna. The first turn prompts Vicuna to recognize entities from texts. For each recognized entity, we use a one-turn dialogue to prompt Vicuna to check its correctness. After doing this, Recognition generates a list of entities for each text (it is also possible that no entity is recognized). However, when analyzing the entity results, we observe that Recognition fails to recognize numerous entities, which motivates us to add the Re-Recognition phase.

  • 2.

    Re-Recognition is also composed of multi-turn dialogues with Vicuna. Given a text and the entities recognized in Recognition, Vicuna is prompted in the first-turn dialogue to recognize the entities missed so far. Vicuna is then prompted to check the correctness of the newly recognized entities in subsequent dialogue turns.

Entities recognized in the two phases are merged as the NER results.

We evaluate VicunaNER’s zero-shot capacity on 10 datasets across 5 domains and its few-shot capacity on Few-NERD Ding et al. (2021). Experimental results show that: (1) under the zero-shot setting, VicunaNER outperforms the ChatGPT-based ChatIE on xxx out of the xxx datasets, even though ChatGPT is more powerful than Vicuna; (2) under the few-shot setting, VicunaNER consistently surpasses the listed baselines, including LLM-based frameworks and prototypical networks. Additionally, we conduct comprehensive investigations to disclose the drawbacks of Vicuna, providing references for fine-tuning it in the future.

Figure 1: The architecture of VicunaNER. It is composed of two phases, namely Recognition and Re-Recognition, and each phase consists of multi-turn dialogues with Vicuna. We use a zero-shot NER example to describe the workflow. Texts in the gray background are prompts; entity lists in the red background hold the entities recognized by the first-turn dialogue in each phase; entity lists in the green background hold the entities recognized in each phase; the entity list in the blue background holds the entities recognized by VicunaNER.

2 Related Work

3 VicunaNER

As shown in Figure 1, VicunaNER is composed of two phases, named Recognition and Re-Recognition, respectively. Re-Recognition is stacked upon Recognition, and both consist of multi-turn dialogues with Vicuna. Recognition conducts the first round of NER, and Re-Recognition conducts another round. The reasons for this stacked design are summarized as follows:

  • 1.

    We find that Vicuna fails to recognize numerous entities in Recognition. Hence the main purpose of Re-Recognition is to recognize those unrecognized entities, which guarantees better model performance.

  • 2.

    Different from LLMs (e.g., ChatGPT) that can only be accessed via online APIs, the open-source Vicuna can be deployed on a local server, which enables us to leverage its generation capabilities without restrictions (note that Vicuna is intended for non-commercial use only).

We illustrate the architecture in §3.1–§3.3 and discuss more details in §3.4.

3.1 Recognition

Recognition consists of multi-turn dialogues with Vicuna. Given a text and pre-defined named entity types, the first-turn dialogue prompts Vicuna to recognize entities included in the given text. For the zero-shot NER task, we combine descriptions of entity types and the given text to obtain a prompt; for the few-shot NER task, we combine descriptions of entity types, support examples, and the given text to obtain a prompt. Next, we feed the prompt to Vicuna.
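The first-turn prompt construction described above can be sketched as follows. This is a minimal illustration in Python: the function name and template wording are our own assumptions, not the paper's exact prompts (those appear in Appendix A).

```python
def build_recognition_prompt(text, type_descriptions, support_examples=None):
    """Build the first-turn NER prompt for Vicuna.

    type_descriptions: dict mapping entity type -> natural-language description.
    support_examples: optional list of (text, entities) pairs; when given,
                      the prompt becomes a few-shot prompt, otherwise zero-shot.
    """
    lines = ["Recognize named entities of the following types:"]
    for etype, desc in type_descriptions.items():
        lines.append(f"- {etype}: {desc}")
    if support_examples:  # few-shot: prepend annotated support examples
        lines.append("Here are some annotated examples:")
        for ex_text, ex_entities in support_examples:
            ents = "; ".join(f"({t}, {span})" for t, span in ex_entities)
            lines.append(f'Text: "{ex_text}" Entities: {ents}')
    lines.append(f'Recognize all entities in the following text: "{text}"')
    return "\n".join(lines)
```

The same builder covers both shot settings: passing `support_examples=None` yields the zero-shot prompt, while passing support examples yields the few-shot prompt.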

When analyzing the recognized entities, we find several kinds of entity prediction errors, including entity boundary errors, entity type errors, and non-entity text spans mistakenly recognized as entities. Figure 1 shows an entity type error: “University of Exeter”, which should be recognized as an Organization entity, is actually predicted as a Location entity. We use subsequent dialogue turns to filter out these mistakenly recognized entities. To be specific, for each predicted entity, we combine its text span, its type, and the text it appears in to obtain a prompt; Figure 1-② shows four prompt examples. We then feed the prompt to Vicuna. For example, we filter out “(Location, University of Exeter)” in Figure 1. Finally, we use a list to manage the remaining entities, as “Entity list 1” in Figure 1 shows.
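The per-entity correctness check can be sketched as below. `ask_vicuna` is a hypothetical stand-in for one dialogue turn with a locally deployed Vicuna model; the exact question wording is illustrative, not the paper's actual prompt.

```python
def check_entities(text, entities, ask_vicuna):
    """Keep only the entities Vicuna confirms.

    entities:   list of (type, span) pairs predicted in the first turn.
    ask_vicuna: callable sending one prompt to Vicuna and returning its reply
                (hypothetical; stands in for a real model call).
    """
    kept = []
    for etype, span in entities:
        prompt = (f'In the text "{text}", is "{span}" a named entity '
                  f'of type {etype}? Answer yes or no.')
        # One correctness-check dialogue turn per predicted entity.
        if ask_vicuna(prompt).strip().lower().startswith("yes"):
            kept.append((etype, span))
    return kept
```

Because each entity gets its own turn, the number of dialogue turns in a phase grows linearly with the number of predicted entities.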

For better comprehension, we report several real-world prompt examples of this phase in Appendix A.

3.2 Re-Recognition

Recognition alone completes a whole round of zero/few-shot NER, and we could terminate the NER process after obtaining its entity list. However, when analyzing the entities recognized in Recognition, we find that Vicuna fails to recognize numerous entities. Hence, we design the Re-Recognition phase to recognize those missed entities.

As shown in Figure 1, Re-Recognition consists of multi-turn dialogues with Vicuna, similar to Recognition. The only difference lies in the prompts used in the first-turn dialogue of the two phases. To be specific, the prompt used in this phase additionally describes the entities already recognized in Recognition, as Figure 1-③ shows. The purpose is to guide Vicuna to recognize only those entities that have not yet been recognized. We also use a list to manage the entities recognized in Re-Recognition, as “Entity list 2” in Figure 1 shows.
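The only change relative to the Recognition prompt, listing the already-recognized entities, can be sketched as follows; again, the wording and function name are illustrative assumptions rather than the paper's exact prompt.

```python
def build_rerecognition_prompt(text, type_descriptions, found_entities):
    """Build the first-turn prompt of the Re-Recognition phase.

    found_entities: list of (type, span) pairs recognized in the
                    Recognition phase; they are listed so Vicuna is
                    steered toward the entities missed so far.
    """
    lines = ["Recognize named entities of the following types:"]
    for etype, desc in type_descriptions.items():
        lines.append(f"- {etype}: {desc}")
    found = "; ".join(f"({t}, {span})" for t, span in found_entities)
    lines.append(f"The following entities have already been recognized: {found}.")
    lines.append("Recognize only the entities NOT listed above "
                 f'in the following text: "{text}"')
    return "\n".join(lines)
```

As the paper notes, this instruction is not always obeyed: Vicuna may still re-emit entities that are already in the list, which is why a merging step with de-duplication is needed afterwards.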

Although the prompt asks Vicuna to recognize only the previously unrecognized entities, we find that Vicuna still recognizes some already recognized entities, such as “(Location, Yate)” in Figure 1. We attribute this to the fact that Vicuna has limitations in ensuring the factual accuracy of its outputs Chiang et al. (2023).

For better comprehension, we report real-world prompt examples of this phase in Appendix A.

3.3 Entity Merging

As aforementioned, we obtain one entity list in each of the two phases, but there may be entities that overlap between the two entity lists. Hence, we remove these overlapping entities when merging the two lists to obtain the NER results, as shown in Figure 1.
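The merging step can be sketched as a simple order-preserving union of the two lists, treating two entries as overlapping when both type and span match. This is our own minimal reading of the step, assuming entities are (type, span) pairs without positions (see §3.4.3).

```python
def merge_entity_lists(list1, list2):
    """Merge the Recognition and Re-Recognition entity lists,
    dropping entities that appear in both (same type and span)."""
    merged, seen = [], set()
    for entity in list1 + list2:
        if entity not in seen:
            seen.add(entity)
            merged.append(entity)
    return merged
```

For the Figure 1 example, the duplicate “(Location, Yate)” produced by Re-Recognition would be dropped, leaving each entity exactly once in the final result.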

3.4 Discussion

3.4.1 Comparison of VicunaNER and ChatIE

Concurrent with our work, ChatIE is a ChatGPT-based framework that conducts zero-shot NER, and it also adopts a two-phase architecture. However, our VicunaNER differs from ChatIE in the following aspects:

  • 1.

    Our VicunaNER depends on the open-source Vicuna, while ChatIE is built upon the more powerful but restricted ChatGPT API.

  • 2.

    Our VicunaNER conducts a whole round of NER in each of its two phases, while ChatIE solely extracts entity types in its first phase and recognizes entities according to the extracted types in its second phase.

  • 3.

    Our VicunaNER can conduct both zero- and few-shot NER tasks, while ChatIE is only designed to perform the zero-shot NER task.

3.4.2 Are More Re-Recognition Phases Necessary?

It seems that adding more Re-Recognition phases could yield better zero/few-shot NER performance. However, we demonstrate that adding more than one Re-Recognition phase brings only marginal performance improvements while greatly increasing model inference time. We experimentally investigate the number of Re-Recognition phases in § xxx.

3.4.3 Entity Form

Following the established line of work Wei et al. (2023), we do not prompt Vicuna to output entity locations, because it is hard for LLMs to output exact locations. This may cause confusion when an entity occurs more than once in a given text but VicunaNER recognizes only some of the occurrences.

References

  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Das et al. (2022) Sarkar Snigdha Sarathi Das, Arzoo Katiyar, Rebecca Passonneau, and Rui Zhang. 2022. CONTaiNER: Few-shot named entity recognition via contrastive learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6338–6353, Dublin, Ireland. Association for Computational Linguistics.
  • Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. Few-NERD: A few-shot named entity recognition dataset. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3198–3213, Online. Association for Computational Linguistics.
  • Driess et al. (2023) Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. Palm-e: An embodied multimodal language model.
  • Geng et al. (2023) Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research. Blog post.
  • He et al. (2023) Jiabang He, Lei Wang, Yi Hu, Ning Liu, Hui Liu, Xing Xu, and Heng Tao Shen. 2023. Icl-d3ie: In-context learning with diverse demonstrations updating for document information extraction.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laurent Sifre. 2022. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems.
  • Hu et al. (2023) Yan Hu, Iqra Ameer, Xu Zuo, Xueqing Peng, Yujia Zhou, Zehan Li, Yiming Li, Jianfu Li, Xiaoqian Jiang, and Hua Xu. 2023. Zero-shot clinical entity recognition using chatgpt.
  • Ji et al. (2022) Bin Ji, Shasha Li, Shaoduo Gan, Jie Yu, Jun Ma, Huijun Liu, and Jing Yang. 2022. Few-shot named entity recognition with entity-level prototypical network enhanced by dispersedly distributed prototypes. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1842–1854, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Jimenez Gutierrez et al. (2022) Bernal Jimenez Gutierrez, Nikolas McNeal, Clayton Washington, You Chen, Lang Li, Huan Sun, and Yu Su. 2022. Thinking about GPT-3 in-context learning for biomedical IE? think again. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 4497–4512, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Ma et al. (2022) Tingting Ma, Huiqiang Jiang, Qianhui Wu, Tiejun Zhao, and Chin-Yew Lin. 2022. Decomposed meta-learning for few-shot named entity recognition. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1584–1596, Dublin, Ireland. Association for Computational Linguistics.
  • OpenAI (2023) OpenAI. 2023. Gpt-4 technical report.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
  • Tu et al. (2023) Shangqing Tu, Chunyang Li, Jifan Yu, Xiaozhi Wang, Lei Hou, and Juanzi Li. 2023. Chatlog: Recording and analyzing chatgpt across time.
  • Wang et al. (2022) Peiyi Wang, Runxin Xu, Tianyu Liu, Qingyu Zhou, Yunbo Cao, Baobao Chang, and Zhifang Sui. 2022. An enhanced span-based decomposition method for few-shot sequence labeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5012–5024, Seattle, United States. Association for Computational Linguistics.
  • Wei et al. (2023) Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, et al. 2023. Zero-shot information extraction via chatting with chatgpt. arXiv preprint arXiv:2302.10205.
  • Workshop et al. (2023) BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model.
  • Xu et al. (2023) Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. 2023. Baize: An open-source chat model with parameter-efficient tuning on self-chat data.
  • Yang and Katiyar (2020) Yi Yang and Arzoo Katiyar. 2020. Simple and effective few-shot named entity recognition with structured nearest neighbor learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6365–6375, Online. Association for Computational Linguistics.
  • Zeng et al. (2023) Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations.
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models.

Appendix A Prompt Examples

This appendix reports real-world prompt examples. In our prompts, we use the full names of named entity types rather than short names.