Alignment Bias Hurts Large Language Models in Information Extraction
Abstract
Large language models (LLMs) follow human instructions well, yet their performance on information extraction (IE) remains subpar. We attribute this gap to definition bias: systematic differences in how concepts, entities, and relations are defined, both among IE datasets and between IE and instruction fine-tuning (IFT) datasets. Through a series of probing experiments, we show that definition bias harms the transferability of fully-supervised extractors, that unified information extraction (UIE) models yield inconsistent results under different source prompts, and that in-context learning alone does not allow LLMs to overcome the bias. We then propose a framework to alleviate definition bias, consisting of definition bias measurement with Fleiss's kappa, bias-aware fine-tuning, and task-specific bias mitigation with LoRA. Comparisons with state-of-the-art universal information extraction methods demonstrate the framework's effectiveness in reducing definition bias.
1 Introduction
Bias in machine learning refers to systematic errors in predictions introduced during the machine learning process, such as annotator bias and measurement bias hellström2020bias. In the era of large language models (LLMs), these issues are addressed by filtering low-quality corpora kojima2022large and training with human preferences ouyang2022training. However, performance remains subpar on information extraction (IE) tasks wadhwa2023revisiting, which we believe is due to definition bias.
Definition bias in IE refers to the tendency of an information extraction system to favor certain interpretations of data over others, often due to the way concepts, entities, or relationships are defined within the system. With the rapid development of Unified Information Extraction (UIE) lu-etal-2022-unified and Large Language Models (LLMs) openai2022chatgpt; openai2023gpt4; geminiteam2023gemini in recent years, two novel types of definition bias have emerged: bias among IE datasets and bias between IE and instruction fine-tuning (IFT) datasets. Bias among IE datasets refers to definitional differences between datasets that share the same annotation schema. As illustrated in Figure 1, different datasets annotate the same input differently for both Named Entity Recognition (NER) and Relation Extraction (RE). Bias between IE and instruction fine-tuning datasets highlights the mismatch between information extraction tasks and general instruction-following tasks. As depicted in Figure 1, although GPT-4 openai2023gpt4 can extract entities or relational triples according to a specified task description without extra examples, its predictions differ from the annotations in existing datasets.
[Figure 1: Figure/intro.pdf — examples of definition bias: different datasets annotate the same input differently (NER and RE), and GPT-4 predictions differ from dataset annotations.]
[Figure 2: Figure/graphic.pdf — overview of the probing experiments: (a) cross-validation across datasets, (b) source prompt tuning and inference, (c) zero-/few-shot probing with LLMs.]
To systematically investigate definition bias in IE, we devise a series of probing experiments. First, we analyze whether definition bias exists and how it varies among datasets sharing the same task. By conducting cross-validation experiments across NER and RE datasets, we observe a significant decrease in performance, indicating that definition bias harms the transferability of fully-supervised models. An intuitive way to alleviate definition bias is unified information extraction, which is trained across multiple IE datasets. We therefore ask: does definition bias still exist in the unified information extraction setting? By introducing source prompts li2022learning that supply true or fake source names to UIE models, we find that UIE produces inconsistent extraction results, indicating that it still suffers from definition bias among IE datasets. Another way to mitigate definition bias is to use LLMs, which can understand a wide range of human instructions. We therefore ask: can LLMs address the challenge of definition bias? By conducting few-shot in-context learning experiments on NER and RE, we find that LLMs without parameter updates struggle to attain satisfactory performance, indicating that they still suffer from definition bias between IE and instruction fine-tuning datasets.
According to our probing experiments, it is imperative to address definition bias with a universal solution for IE tasks. However, mitigating definition bias is non-trivial, primarily owing to three challenges: (1) enhancing the capacity of LLMs on general information extraction tasks, so as to reduce the definition bias between information extraction datasets and instruction tuning datasets; (2) mitigating the definition bias that arises when tuning LLMs with different IE datasets; and (3) learning from new data over time, adapting to new tasks while maintaining good performance on existing ones.
To address these challenges, we propose a framework to alleviate definition bias, which consists of definition bias measurement, bias-aware fine-tuning, and task-specific bias mitigation. Using Fleiss's kappa fleiss1971measuring, we measure the two types of definition bias described above. We then conduct bias-aware fine-tuning with multiple information extraction instructions to enhance extraction capabilities while reducing definition bias. Finally, we perform task-specific bias mitigation with low-rank adaptation (LoRA) hu2021lora on specific information extraction tasks to further align the LLM with the target annotations.
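To make the measurement step concrete, the sketch below shows one way to compute Fleiss's kappa over a shared pool of candidate extractions, treating each dataset-specific annotation guideline (or each model's predictions) as one "rater": the lower the kappa, the stronger the definition bias. The function and the toy counts are illustrative rather than the exact implementation used in our framework.
\begin{verbatim}
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss's kappa. counts[i, j] is the number of raters that assigned
    item i to category j; every item is rated by the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)        # category marginals
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_exp = p_i.mean(), (p_j ** 2).sum()
    return float((p_bar - p_exp) / (1 - p_exp))

# Toy example: three "raters" (e.g., annotations induced from three NER datasets)
# label five candidate spans as PERSON, ORGANIZATION, or NOT-AN-ENTITY.
counts = np.array([
    [3, 0, 0],
    [2, 1, 0],
    [1, 0, 2],
    [0, 3, 0],
    [1, 1, 1],
])
print(f"kappa = {fleiss_kappa(counts):.3f}")  # lower kappa -> stronger definition bias
\end{verbatim}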
Our paper is organized as follows. In Section LABEL:sec:probing_setting, we present three probing experiments designed to explore the presence of definition bias and to assess the ability of existing frameworks to address it in IE. Section LABEL:sec:probing_exp details the results and analysis of these experiments, concluding that frameworks based on either one-stage processing or parameter-free updates are insufficient to tackle definition bias in IE. Consequently, we propose a novel framework featuring two-stage fine-tuning, specifically developed to mitigate the identified definition bias, introduced in Section LABEL:sec:framework. Finally, in Section LABEL:sec:exp, we compare the performance of our framework with state-of-the-art methods in universal information extraction, demonstrating its effectiveness in reducing definition bias.
2 Related Work
2.1 LLMs for information extraction
Large language models have shown remarkable performance in instruction following openai2023gpt4. To better align structured extraction with the natural-language instructions seen during pre-training and instruction tuning, wei2023zero; wadhwa2023revisiting; zhang2023aligning convert the structured information extraction task into natural instruction tasks such as question answering and multiple choice, while li2023codeie; guo2023retrieval recast the structured output as code to better leverage code LLMs for handling complex structures. Although LLMs achieve impressive performance on various information extraction tasks with carefully designed instructions, they still fail to address definition bias without further tuning.
2.2 Universal information extraction
Unified Information Extraction, proposed by lu-etal-2022-unified, uniformly encodes various information extraction tasks with a predefined structured extraction language (SEL) and enhances common IE abilities via a large-scale pre-trained generation model. lou2023universal further introduce USM to model different IE tasks, while wang2023instructuie unify tasks into natural language instructions. GoLLIE converts the IE schema into code-style structural descriptions and adds guidelines to improve zero-shot results sainz2023gollie. However, these works mainly focus on how to encode different extraction tasks into a uniform structure and fail to notice and measure the definition bias among datasets.
3 Methodology
We initially propose an experiment employing cross-validation to investigate the presence of definition bias in IE tasks. Subsequently, we design two specific detection tasks, source prompt detection and few-shot prompting with LLMs, to examine two categories of definition bias: bias among IE datasets and bias between IE and instruction fine-tuning datasets. These experiments explore the effectiveness of the UIE and LLM frameworks in addressing the definition bias issue.
3.1 Does definition bias exist?
To better illustrate the definition bias among different information extraction datasets, we design a cross-extraction task. As shown in Figure 2(a), we train multiple fully-supervised models on different datasets of the same task (NER or RE) and test each model on the other datasets to evaluate whether definition bias exists.
We first introduce the two BERT-based extraction frameworks used to handle the NER and RE tasks, respectively.
Named Entity Recognition
We adopt GlobalPointer su2022global, an efficient span-based approach that models beginning and end positions with a two-dimensional scoring matrix to predict entities. By incorporating an extended softmax and cross-entropy loss, GlobalPointer is better equipped to learn in scenarios with class imbalance.
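For intuition, the two-dimensional scoring idea can be sketched as follows; this is a minimal illustration that omits GlobalPointer's rotary position embeddings, per-type scoring heads, and the extended softmax/cross-entropy loss.
\begin{verbatim}
import torch

def span_score_matrix(hidden, w_q, w_k):
    """hidden: (seq_len, d_model) token representations from a BERT encoder;
    w_q, w_k: (d_model, d_head) projections for one entity type.
    Entry (i, j) scores the span starting at token i and ending at token j."""
    q = hidden @ w_q                       # start representations
    k = hidden @ w_k                       # end representations
    scores = q @ k.T                       # score for every (start, end) pair
    valid = torch.triu(torch.ones_like(scores, dtype=torch.bool))
    return scores.masked_fill(~valid, float("-inf"))  # span must end at or after its start

# Spans whose score exceeds a decision threshold (e.g., 0) are predicted as entities.
hidden = torch.randn(12, 768)
w_q, w_k = torch.randn(768, 64), torch.randn(768, 64)
predicted_spans = (span_score_matrix(hidden, w_q, w_k) > 0).nonzero()
\end{verbatim}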
Relation Extraction
We adopt ReRe xie-etal-2021-revisiting as the basic model for relation extraction. ReRe is a pipeline approach that first performs sentence-level relation detection, followed by subject/object extraction. Specifically, the ReRe model treats the former as a multi-class classification task and the latter as a span detection task.
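Schematically, the pipeline looks like the sketch below, where detect_relations and extract_arguments stand in for ReRe's two components (the sentence-level relation classifier and the subject/object span extractor); both names are placeholders rather than the released implementation.
\begin{verbatim}
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)

def rere_pipeline(sentence: str,
                  detect_relations: Callable[[str], List[str]],
                  extract_arguments: Callable[[str, str], List[Tuple[str, str]]],
                  ) -> List[Triple]:
    """Two-stage relation extraction in the spirit of ReRe:
    stage 1 detects which relations a sentence expresses;
    stage 2 extracts subject/object spans conditioned on each detected relation."""
    triples: List[Triple] = []
    for relation in detect_relations(sentence):                      # stage 1
        for subject, obj in extract_arguments(sentence, relation):   # stage 2
            triples.append((subject, relation, obj))
    return triples
\end{verbatim}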
During the cross-validation process, we encounter label type biases across different datasets. For instance, the ACE 2004 dataset requires the extraction of weapon entities, which the CoNLL 2003 dataset does not. Consequently, we focus exclusively on the label types (entity types in NER and relation types in RE) that are annotated in both the training and testing datasets; for example, the person label is common to both ACE 2004 and CoNLL 2003.
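This filtering reduces to intersecting the two datasets' label inventories before scoring; the sketch below uses indicative (not exhaustive) type sets for ACE 2004 and CoNLL 2003.
\begin{verbatim}
# Indicative entity-type inventories; only the shared types are scored
# when training on one dataset and testing on the other.
ace2004_types = {"PER", "ORG", "LOC", "GPE", "FAC", "VEH", "WEA"}
conll2003_types = {"PER", "ORG", "LOC", "MISC"}
shared_types = ace2004_types & conll2003_types      # {"PER", "ORG", "LOC"}

def keep_shared(spans, shared=frozenset(shared_types)):
    """spans: iterable of (start, end, label) tuples (predictions or gold)."""
    return [span for span in spans if span[2] in shared]
\end{verbatim}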
To mitigate the impact of text distribution shift on the experimental results, we sample a subset of sentences with similar semantics as the cross-validation set. Specifically, we measure the semantic similarity between two sentences by the cosine similarity of their sentence embeddings and define the semantic similarity $\mathrm{sim}(s_i, D)$ of a sentence $s_i$ to a dataset $D$ as below. Finally, we filter out all sentences whose similarity falls below $\mathrm{threshold}(D)$.
\[
\mathrm{sim}(s_i, D) = \max_{r_j \in D} \mathrm{cosine}\big(V_{s_i}, V_{r_j}\big)
\]
\[
\mathrm{threshold}(D) = \sigma \cdot \frac{1}{|D|} \sum_{s_i \in D} \mathrm{sim}\big(s_i, D \setminus \{s_i\}\big)
\]
where $V_{s_i}$ denotes the embedding vector of sentence $s_i$ encoded by a sentence embedding model (we adopt MPNet song2020mpnet, which is commonly used for retrieval), and $\sigma$ is a hyperparameter that adjusts the threshold, set empirically.
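A minimal sketch of this filtering step is shown below; it assumes the sentence-transformers MPNet checkpoint all-mpnet-base-v2 (an assumption on our part) and leaves the exact value of sigma unspecified.
\begin{verbatim}
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-mpnet-base-v2")   # assumed MPNet checkpoint

def filter_by_similarity(candidates, reference, sigma=1.0):
    """Keep candidate sentences whose similarity to the reference dataset
    reaches sigma times the reference's average leave-one-out similarity."""
    cand = encoder.encode(candidates, normalize_embeddings=True)
    ref = encoder.encode(reference, normalize_embeddings=True)

    # sim(s_i, D): maximum cosine similarity to any reference sentence
    cand_sim = (cand @ ref.T).max(axis=1)

    # threshold(D): sigma * average of sim(s_i, D \ {s_i}) over the reference set
    ref_sims = ref @ ref.T
    np.fill_diagonal(ref_sims, -np.inf)               # exclude self-similarity
    threshold = sigma * ref_sims.max(axis=1).mean()

    return [s for s, v in zip(candidates, cand_sim) if v >= threshold]
\end{verbatim}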
3.2 Can UIE address definition bias?
Unified information extraction, which employs a predefined structured extraction language to encode different extraction structures, can accurately follow extraction instructions. Inspired by li2022learning, who introduce a prompt-based method for transferable text generation, we adopt a source prompt setting for probing. In our experimental setting, a source is the name of the dataset an instance comes from (e.g., ACE 2004). By presenting UIE with various sources, indicating which dataset the instance is from, we can guide it to yield different extraction results. This approach allows us to assess whether it maintains consistent results under different source prompts.
As Figure 2(b) shows, the probing experiment consists of two parts: source prompt tuning and source prompt inference. Initially, we undertake a source prompt tuning process to enhance the UIE model’s ability to recognize different sources. Subsequently, we examine the definition bias within the UIE model by introducing various sources.
Source Prompt Tuning
The source prompt process can be regarded as a general multi-task learning framework. First, we define a set of source information extraction tasks $\{S_1, \dots, S_n\}$, where the $k$-th task $S_k$ contains tuples of an input text $x$ and its corresponding output text $y$. For a target information extraction task $T$, the goal of multi-task learning is to leverage the task-specific knowledge previously learned from the source tasks to improve prediction of the extraction result. Unlike the traditional multi-task fine-tuning scenario, in source prompt tuning we learn an independent source prompt $p_k$ for each source information extraction task $S_k$, where $p_k$ consists of the extraction task source name $n_k$, the information extraction task description $d_k$, and the input sentence $x$. For example, the single instance "Here's a dataset from ACE 2004, please list all 'person' entity words in the text. Input sentence: Xinhua News Agency, Beijing, September 1st, by reporter Jingcai Wu." contains all of the components described above.
To clearly demonstrate that UIE with instruction tuning can implicitly learn dataset-specific definitions through source prompts, we assign a nickname $\tilde{n}_k$ to every dataset and randomly replace the true source name $n_k$ with $\tilde{n}_k$. For simplicity, we merely reverse the order of the original dataset name, thereby generating a non-natural-language nickname; for example, the dataset name "ACE 2004" is replaced with "4002 ECA". This procedure is designed to eliminate the influence of differences in how the UIE model learns various source names and to ensure that discrepancies between the true and fake settings are solely due to dataset definition bias.
Specifically, we adopt Llama-v2-13B touvron2023llama and FlanT5-11B chung2022scaling as our backbone models in the source prompt tuning setting because of their strong instruction understanding and instruction-following capabilities. For multiple NER and RE datasets, we add a source prompt to every extraction instance to indicate the dataset to which it belongs. Further details on source prompt tuning are given in Appendix LABEL:apd:source_prompt.
Source Prompt Inference
At inference time, we provide different source prompts with the same extraction instance to the UIE models fine-tuned with source prompts. To probe the definition bias in universal generative information extraction, UIE predicts the extraction result under a True source (the original source name), a Nickname source (the nickname of the original source name), and a Fake source (a fake source name). With different source names, UIE generates different extraction results, following the different definitions learned during source prompt tuning.
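The three inference settings can be constructed roughly as below; the prompt wording mirrors the example in the tuning subsection, the nickname is produced by reversing the dataset name, and the particular fake source name is an illustrative choice.
\begin{verbatim}
def make_nickname(dataset_name: str) -> str:
    """Reverse the dataset name, e.g. 'ACE 2004' -> '4002 ECA'."""
    return " ".join(token[::-1] for token in reversed(dataset_name.split()))

def build_prompt(source_name: str, task_description: str, sentence: str) -> str:
    return (f"Here's a dataset from {source_name}, {task_description} "
            f"Input sentence: {sentence}")

dataset = "ACE 2004"
description = "please list all 'person' entity words in the text."
sentence = "Xinhua News Agency, Beijing, September 1st, by reporter Jingcai Wu."

prompts = {
    "true": build_prompt(dataset, description, sentence),
    "nickname": build_prompt(make_nickname(dataset), description, sentence),
    "fake": build_prompt("CoNLL 2003", description, sentence),  # illustrative fake source
}
\end{verbatim}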
3.3 Can LLMs address definition bias?
Large language models exhibit remarkable instruction understanding capabilities, which help them achieve extraordinary performance on various tasks. However, due to the definition bias between IE and IFT datasets, LLMs show a significant performance gap on information extraction tasks wadhwa2023revisiting. In-context learning, where LLMs make predictions based solely on contexts augmented with a few examples, is a training-free framework that enables models to adapt to new tasks dong2023survey, and is considered a potential solution to the definition bias between IE datasets and instruction tuning datasets.
As shown in Figure 2(c), we conduct the probing experiment with multiple LLMs in both zero-shot and few-shot settings.
Specifically, we utilize the open-source LLM Llama-v2-chat-70B touvron2023llama and the closed-source LLMs GPT-3.5-Turbo openai2022chatgpt and GPT-4 openai2023gpt4 as our backbone models. In the zero-shot setting, we prompt the LLMs with a task description, which probes the definition bias between IE and IFT datasets. In the few-shot setting, we prompt them with the task description plus four cases randomly sampled from the corresponding training set, to examine whether in-context learning can address the definition bias. For a fair comparison, we sample 200 cases from each dataset and test them in both settings.
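The zero-shot and few-shot prompts can be assembled roughly as follows; the wording, the toy demonstrations, and the chat-message format are illustrative choices rather than the exact prompts used in the experiments.
\begin{verbatim}
import random

def build_messages(task_description, sentence, demonstrations=None):
    """demonstrations: optional list of (input_sentence, gold_output) pairs;
    an empty list yields the zero-shot prompt."""
    prompt = task_description + "\n\n"
    for demo_input, demo_output in (demonstrations or []):
        prompt += f"Input: {demo_input}\nOutput: {demo_output}\n\n"
    prompt += f"Input: {sentence}\nOutput:"
    return [{"role": "user", "content": prompt}]

# Toy demonstrations; the probing experiments instead sample four cases
# at random from the corresponding training set.
train_pairs = [
    ("Barack Obama visited Berlin last week.", "person: Barack Obama"),
    ("The report was written by Marie Curie.", "person: Marie Curie"),
    ("Angela Merkel met Emmanuel Macron in Paris.", "person: Angela Merkel; Emmanuel Macron"),
    ("No entities appear in this sentence.", "person: none"),
]
demos = random.sample(train_pairs, k=4)

messages = build_messages(
    "List all 'person' entity words in the following sentence.",
    "Xinhua News Agency, Beijing, September 1st, by reporter Jingcai Wu.",
    demonstrations=demos,
)
# `messages` can then be sent to GPT-3.5-Turbo / GPT-4 or a local Llama-2-chat model.
\end{verbatim}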