Let the Rule Speak: Enhancing In-context Learning Debiasing with Interpretability

Ruixi Lin   Yang You
Department of Computer Science
National University of Singapore
{ruixi,youy}@comp.nus.edu.sg
Abstract

In-context learning (ICL), which allows large language models to perform diverse tasks with a few demonstrations, is found to have imbalanced per-class prediction accuracy on multi-class text classification. Although notable output correction methods have been developed to tackle the issue and simultaneously improve downstream prediction accuracy, they may fail to answer the core interpretability challenges: why and which classes need corrections and, more importantly, how to tailor a correction to each sample's per-class probability. To address such interpretability gaps, we first find that the imbalance arises from certain classes consistently receiving high ICL output probabilities, whereas others receive lower or mixed ranges, so the former are more frequently chosen, resulting in higher accuracy; more crucially, we find that these ranges have significantly varying degrees of influence on the accuracy bias, highlighting the need for precise, interpretable probability corrections by range. Motivated by this, we propose FuRud, a Fuzzy Rule Optimization based Debiasing method, that (1) detects which classes need corrections, and (2) for each correction-needed class, detects its probability ranges and applies asymmetric amplifications or reductions to correct them interpretably. Notably, across seven benchmark datasets, FuRud reduces the pairwise class accuracy bias (COBias) by more than half (56%), while achieving a relative increase of 21% in accuracy, outperforming state-of-the-art debiasing methods. Moreover, FuRud can optimize downstream tasks with as few as 10 optimization examples. Furthermore, FuRud can work for prompt formats that lead to highly skewed predictions. For example, FuRud greatly improves ICL outputs which use letter options, with a 44% relative accuracy increase and a 54% relative COBias reduction.

1 Introduction

The classification outputs by in-context learning (ICL) are described as biased when they exhibit imbalanced per-class prediction accuracy. Addressing such imbalances while improving overall accuracy is seen as a category of debiasing. Concretely, the skewness in the output space can be alleviated by targeted corrections on output logits or probabilities, with or without explicitly modeling the per-class accuracy differences, i.e., COBias [1]. However, while effective, prior methods could lack straightforward explanations on why and which certain classes need corrections. What’s more challenging is to have a tailored per-sample, per-class correction.

A direct cause of COBias is that ICL tends to assign specific ranges of output probabilities to each class. When some classes always receive high probabilities for any input example, others may have lower or mixed probability ranges. The consequence is that the latter classes are less frequently predicted than the former, resulting in consistently lower accuracies and calling for probability corrections. In addition, among all examples of a class A, the subset of examples whose in-context learned probability of answer A is relatively low often receives a lower test accuracy than the subset whose class A probability is higher, suggesting that different probability ranges within a class need different corrections.

Taking these overlooked aspects into account, a correction should be tailored for each class and for each sample. To achieve this, a helpful correction should be able to asymmetrically amplify or reduce different ranges of a class’s probabilities. In this paper, we address the pressing need for enhanced understandings in how biased ICL predictions happen, and propose two research questions about a main concern, yet a potential direction, in interpretable ICL output corrections.

RQ1: What is the interpretability challenge in correcting in-context learned representations? Given an $N$-class classification dataset, let us denote its $m$-th example’s input prompt and label as $(x_{m},y_{m})$, where $x_{m}$ consists of a task instruction, few-shot demonstrative examples, and the input example’s question. The LLM in-context learns the class probabilities $\boldsymbol{p}_{m}=(p_{m1},\dots,p_{mN})$ (normalized over the $N$ classes), and the prediction $\hat{y}_{m}$ is $\operatorname*{arg\,max}_{i}p_{mi}$. The probabilities $\boldsymbol{p}_{m}$ may need corrections given the debiasing objective of reducing COBias. Therefore, our task is to correct certain dimensions of $\boldsymbol{p}_{m}$ towards reducing COBias and improving overall accuracy. The interpretability challenges raised in this process can be specified as (1) detecting which classes need corrections, and (2) for each correction-needed class, applying range-specific amplifications/reductions.

RQ2: How can we improve interpretability with fuzzy rules? We leverage membership functions from the field of fuzzy rule based systems for debiasing. As background, a membership function is a curve that defines a mapping from a crisp input value to a fuzzy value between 0 and 1 [2]. Based on this, given class probabilities as input attributes, membership functions transform the probabilities to fuzzy values, which can be viewed as corrected probabilities under certain debiasing optimization objectives.

The key intuition here is that a membership function can asymmetrically amplify or reduce different ranges of inputs. Therefore, a fuzzy rule based debiaser for class probability $p_{mi}$ can be written as $f_{A_{i}}(p_{mi})$, where $A_{i}$ is a fuzzy set for class $i$, and its membership function $f_{A_{i}}$ maps the probability to a corrected $p^{\prime}_{mi}:=f_{A_{i}}(p_{mi})$. Then $\boldsymbol{p}^{\prime}_{m}$ consists of corrected per-class probabilities.

Alternatively, the debiaser can be viewed as a single rule:

$$\text{If}\underbrace{\text{ class $1$ is $A_{1}$ and ... and class $N$ is $A_{N}$}}_{\text{Antecedent}}\text{ then }\underbrace{\text{predict $\operatorname*{arg\,max}_{j}f_{A_{j}}(p_{mj})$}}_{\text{Consequent}} \tag{1}$$

Our goal is to optimize the rule, i.e., select fuzzy sets/membership functions for every class in the antecedent, towards mitigating COBias and improving overall prediction accuracy. In particular, we include a Don’t Change membership function that keeps a class unchanged, suggesting that the LLM in-context learns an accurate probability for the class. When a correction is needed, the membership function detects the probability range that a class’s probability belongs to, and updates it with the returned function value. The problem becomes jointly selecting one membership function per class to improve multiple objectives based on COBias and accuracy.

To this end, we propose a Fuzzy Rule Optimization based Debiasing method, FuRud. Extensive experiments (Section 4) and discussions (Section 5) demonstrate that it achieves good improvements in accuracy and COBias while providing sample-level interpretability.

In a nutshell, FuRud uses an optimization set of samples for membership function selection. The optimization set’s questions are prompted in a 1-shot manner, and probabilities are measured across answer classes for each question. These probabilities and ground-truth answers across all questions are aggregated in the multi-objective model, to jointly learn an optimal membership function for each class. At inference, a test example’s class probabilities are obtained similarly. Then we apply the learned membership functions to perform tailored corrections on each class’s probability in the given test sample. An overview of FuRud is shown in Figure 1, illustrating desired corrections and performance improvements.

To highlight, the membership functions learned by FuRud enable sample-level interpretability. FuRud enables us to know whether the LLM in-context learns an accurate probability for a class within a given sample. This is achieved by learning a correction function (membership function) for each class, towards the multi-objectives of reducing COBias and enhancing accuracy. If the Don’t Change function is learned for a class, it means the LLM in-context learns an accurate probability for the class; otherwise, a tailored correction is performed by the membership function. The source code will be released upon paper publication. In summary, our messages are:

  • We propose an interpretable fuzzy rule optimization based debiasing method (FuRud), to account for both inter-class surface biases and intra-class range-wise influences.

  • We formulate a multi-objective programming model to jointly optimize a set of triangular membership functions for each class. The functions are human-readable, which can asymmetrically correct probabilities of different ranges that are misrepresented.

  • Across seven benchmarks, FuRud demonstrates its effectiveness for improved overall accuracy, reduced per-class accuracy imbalance, and enhanced interpretability. For example, it improves ICL accuracy by a relative increase of 21% and reduces COBias by a relative decrease of 56%; it achieves higher accuracy (avg. accuracy reaching 72.0%) and competitive COBias (avg. COBias dropping to 17.8%) over state-of-the-art debiasing methods.

Figure 1: An overview of how FuRud optimizes and transforms each class of a dataset with interpretability; the input to the FuRud model is the $N$-dimensional probability vectors of the optimization set of a dataset, and the output is the membership functions selected for each class; the selected functions are directly plugged in to test examples at inference. This is for illustration purposes only; actual range changes and improvements vary across datasets.

2 Related Work

Language Model Bias Mitigation. At the heart of debiasing is detecting biased patterns that arise in a large language model (LLM)’s outputs. Prior work has found various prediction biases in ICL, and addresses the biased patterns by methods of contextual prompt engineering and output adjustment [3, 4, 5]. Particularly, on classification tasks, researchers have found that LLMs’ outputs are sensitive to ICL formatting, such as prompt templates, demonstrations, and verbalizers [6, 7, 8]; besides, LLMs tend to output common tokens in the pre-training data [5]. These bias factors lead to majority label bias [5], COBias (pairwise class accuracy differences) [1], etc., causing imbalanced per-class accuracies, and researchers address these biases by making output distribution calibrations [5, 9, 10], or by class probability re-weighting [1]. For example, Zhao et al. [5] calibrate the output distribution with content-free/dummy test prompts. Zhou et al. [10] calibrate the output distribution in a test-time manner, estimating a contextual correction term of each class on a batch of test examples; the proposed Batch Calibration (BC) method outperforms previous calibration methods [5, 9] on a range of text classification tasks. Lin and You [1] re-weight output probabilities by a set of class-specific weight coefficients; the proposed Debiasing as Nonlinear Integer Programming method (DNIP) achieves much lower COBias with higher accuracy than the ICL baseline. Though these debiasing methods effectively adjust ICL outputs, they do not emphasize interpretable bias handling. For example, a calibration method may not explicitly explain why a class needs corrections, or users may not fathom how a re-weighting method performs the exact corrections a class needs.

Fuzzy Rule Techniques for Interpretable Machine Learning. Interpretable machine learning often needs a human-readable subset of features to generate the target [11, 12]. Fuzzy rules are intrinsically interpretable and are widely studied for interpretable machine learning [13, 14, 15]. In classical fuzzy rule classification systems, input attributes are assigned to fuzzy sets to generate rules for pattern classification [16, 17, 18, 19, 20]. A fuzzy classification system thus contains multiple human-readable rules, which can be as simple as “1. If attribute Bare Nuclei is Small then consequent class Benign. 2. ... 3. If attribute Uniformity of Cell Size is not Small then consequent class Malignant.” [20]. Here, Small and not Small are fuzzy sets, with corresponding membership functions. Membership functions provide the core interpretability of the fuzzy systems. In this work, we extend fuzzy membership functions to help with debiasing.

3 FuRud: Fuzzy Rule Optimization Based Debiasing

The core idea is to handle the imbalanced per-class accuracy issue with fuzzy membership functions. In the fuzzy rule setting, for $N$ classes, each class selects a fuzzy set $A_{i}$, or equivalently, a membership function $f_{A_{i}}$, from a family of $K$ fixed fuzzy sets. We let $F=\{f_{1},...,f_{k},...,f_{K}\}$ denote the family of membership functions. The membership function selection problem can be solved using combinatorial optimization. To this end, we introduce FuRud, a Fuzzy Rule Optimization Based debiasing method. The FuRud optimization is performed on a set of labeled examples, and the selected membership functions are directly applied to transform test-time class probabilities.

Figure 2: The family of membership functions.

Membership Functions. We first introduce the triangular membership functions to select from. Triangular membership functions are popular for fuzzy rule-based classification [17]. The main benefits of triangular functions are: the speed of changes is easily controlled by the slope, and the linearity is computationally efficient and easy to understand. Since we do not know an appropriate fuzzy partition for each class in downstream datasets, we simultaneously employ four fuzzy partitions, resulting in membership functions of different granularities.

Figure 2 shows 19 triangular membership functions of four fuzzy partitions, including the Don’t Change membership function - the identity function (slope=1). Other than Don’t Change, each membership function represents a sharp or smooth transformation of the input variable. Details of the functions are discussed in Appendix A. The general form of a triangular membership function $f_{k}(\cdot)$ can be written as:

$$f_{k}(p_{mi};a_{k},b_{k},c_{k})=\begin{cases}0, & \text{if}\ p_{mi}\leq a_{k}\\ \dfrac{p_{mi}-a_{k}}{b_{k}-a_{k}}, & a_{k}\leq p_{mi}\leq b_{k}\\ \dfrac{c_{k}-p_{mi}}{c_{k}-b_{k}}, & b_{k}\leq p_{mi}\leq c_{k}\\ 0, & \text{otherwise}\end{cases} \tag{2}$$

where $a_{k},b_{k},c_{k}$ are the left endpoint, the input value where the peak is reached, and the right endpoint of $f_{k}$. For example, for $f_{11}$, the $a_{k},b_{k},c_{k}$ values are 0.125, 0.25, 0.375 respectively.
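As an illustration of Eq. 2, the following is a minimal Python sketch of a triangular membership function, evaluated with the $f_{11}$ parameters quoted above. The function name `triangular` and the handling of degenerate endpoints (`a == b` or `b == c`, treated as a one-sided "shoulder") are assumptions made for this sketch, not details taken from the paper.

```python
def triangular(p, a, b, c):
    """Triangular membership value of p, with endpoints a <= b <= c (Eq. 2).

    Assumption of this sketch: when a == b or b == c the missing edge is
    skipped, so the division by zero in Eq. 2 never occurs.
    """
    if a < p <= b:                 # rising edge (skipped when a == b)
        return (p - a) / (b - a)
    if b <= p < c:                 # falling edge (skipped when b == c)
        return (c - p) / (c - b)
    return 0.0                     # outside the support of the triangle


# f_11 from the text: left endpoint 0.125, peak at 0.25, right endpoint 0.375.
F11 = (0.125, 0.25, 0.375)
peak_value = triangular(0.25, *F11)        # value at the peak
rising_value = triangular(0.1875, *F11)    # halfway up the rising edge
outside_value = triangular(0.5, *F11)      # outside the triangle
```

With these parameters, inputs near 0.25 are mapped to values close to 1, while inputs outside $[0.125, 0.375]$ are mapped to 0.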

Then, we compute the updated probability $p^{\prime}_{mi}$ by:

$$p^{\prime}_{mi}=\begin{cases}p_{mi}, & \text{if }\sum_{i=1}^{N}p^{\prime}_{mi}=0\\ \sum\limits_{k}f_{k}(p_{mi})\mathbbm{1}(\kappa_{i}=k), & \text{otherwise}\end{cases} \tag{3}$$

where $\kappa_{i}$ is the integer selection variable for class $i$. $\mathbbm{1}(\cdot)$ evaluates to 1 if the condition inside is satisfied, otherwise 0. Furthermore, in case $p^{\prime}_{mi}=0$ for all classes, we reset each to be its original probability in $\boldsymbol{p}_{m}$. Therefore, $\hat{y}_{m}=\operatorname*{arg\,max}_{i}p^{\prime}_{mi}$.
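The per-sample correction of Eq. 3, including the fallback to the original probabilities when every transformed value is 0, can be sketched as follows. The helper names, the representation of a selected function as an `(a, b, c)` triple, the expression of the identity as `(0, 1, 1)`, and the 2-class numbers are all hypothetical choices for illustration.

```python
def triangular(p, a, b, c):
    """Eq. 2 with endpoints a <= b <= c; a == b or b == c gives a one-sided shape (assumption)."""
    if a < p <= b:
        return (p - a) / (b - a)
    if b <= p < c:
        return (c - p) / (c - b)
    return 0.0


def correct(probs, selected):
    """Apply one selected (a, b, c) triple per class to one sample's class probabilities."""
    corrected = [triangular(p, *abc) for p, abc in zip(probs, selected)]
    if sum(corrected) == 0:        # special case of Eq. 3: keep the ICL output
        corrected = list(probs)
    prediction = max(range(len(corrected)), key=lambda i: corrected[i])
    return corrected, prediction


# Hypothetical 2-class selection: class 0 keeps its probability (identity on
# (0, 1]), class 1 uses a triangle peaking at 0.5.
selected = [(0.0, 1.0, 1.0), (0.25, 0.5, 0.75)]
corrected, prediction = correct([0.55, 0.45], selected)
```

In this toy case the uncorrected prediction would be class 0 (0.55 vs. 0.45), whereas the corrected values are roughly 0.55 and 0.8, so the prediction flips to class 1.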

Multi-Objective Programming and Energy Function. Let $\boldsymbol{\kappa}=(\kappa_{1},\dots,\kappa_{N})$ be the integer selection variables for classes $1,...,N$, where $\kappa_{i}$ is chosen from the given set of membership functions, and $\kappa_{i}=k$ means $f_{k}$ is chosen. Our goal is to learn $\boldsymbol{\kappa}$ that improves ICL classifications under two main evaluation metrics, accuracy and COBias [1]. To this end, we adopt multi-objective programming to simultaneously achieve better accuracy and lower COBias.

The first objective is to improve overall accuracy:

$$\max Z^{\text{Acc}}=\frac{1}{|S^{\text{Opt}}|}\sum\nolimits_{m\in S^{\text{Opt}}}\mathbbm{1}\{\hat{y}_{m}=y_{m}\} \tag{4}$$

where $S^{\text{Opt}}$ is the set of indices of examples used for optimization.

Furthermore, we balance the class accuracy difference by explicitly modeling COBias, which accounts for an overall difference between pairwise per-class accuracies. Minimizing COBias helps address low-accuracy classes from ICL outputs. Therefore, the second objective is:

$$\min Z^{\text{COBias}}=\frac{1}{{}_{N}C_{2}}\sum\nolimits_{i=1}^{N-1}\sum_{j=i+1}^{N}\bigl|\text{Acc}_{i}-\text{Acc}_{j}\bigr| \tag{5}$$

where ${}_{N}C_{2}=N(N-1)/2$, and $\text{Acc}_{i}$ is the accuracy score for optimization examples in class $i$.
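To make Eq. 5 concrete, here is a small Python sketch that computes per-class accuracies and their average pairwise absolute difference. The function names and the toy labels are illustrative, and assigning accuracy 0 to a class with no examples is an assumption of this sketch rather than something the paper specifies.

```python
from itertools import combinations


def per_class_accuracy(labels, preds, num_classes):
    """Accuracy restricted to the examples of each class."""
    accs = []
    for c in range(num_classes):
        hits = [p == c for y, p in zip(labels, preds) if y == c]
        # Assumption: a class with no examples gets accuracy 0.
        accs.append(sum(hits) / len(hits) if hits else 0.0)
    return accs


def cobias(accs):
    """Average absolute difference over all N(N-1)/2 class pairs (Eq. 5)."""
    pairs = list(combinations(accs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Toy 3-class example: class 0 is always right, class 2 is never right.
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 0, 0, 0]
accs = per_class_accuracy(labels, preds, num_classes=3)   # [1.0, 0.5, 0.0]
bias = cobias(accs)                                       # (0.5 + 1.0 + 0.5) / 3
```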

To further handle extreme cases of low class accuracies, we penalize classes that fail to reach an accuracy threshold, and minimize the loss between the threshold and per-class accuracy (cut off at 0). The third objective is:

$$\min Z^{\text{Extreme}}=\sum\nolimits_{i=1}^{N}\max\{0,\lambda-\text{Acc}_{i}\} \tag{6}$$

where $\lambda$ is a fixed threshold value.
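Eq. 6 is a hinge-style penalty on per-class accuracies. A minimal sketch, assuming accuracies are given as fractions in $[0, 1]$ and using a hypothetical function name `extreme_penalty`, is:

```python
def extreme_penalty(class_accs, lam):
    """Sum of shortfalls below the threshold lam; classes at or above lam contribute 0 (Eq. 6)."""
    return sum(max(0.0, lam - acc) for acc in class_accs)


# Toy per-class accuracies: only the last two classes fall below lam = 0.5.
penalty = extreme_penalty([1.0, 0.5, 0.25, 0.0], lam=0.5)   # 0 + 0 + 0.25 + 0.5
```

Only classes below the threshold are penalized, so the term targets the worst-performing classes without rewarding already accurate ones.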

The above objective functions are a mix of minimization and maximization, and the resulting multi-objective programming model requires integer variables. Each of them alone corresponds to an integer programming problem, which is NP-complete [21]. Classic solutions for integer programming use operational research techniques, such as Branch-and-Bound, often used for linear integer programming problems. It could be difficult for such methods to handle nonlinear integer programming models which contain non-differentiable functions. Consequently, a series of metaheuristic algorithms have emerged, such as Simulated Annealing (SA), and each metaheuristic has its own strengths and limitations. We use one of the metaheuristics, SA, to tackle the proposed mathematical model. The SA implementation follows [1]. Since it is difficult to solve each objective as an individual optimization problem and force an optimal solution, our strategy is instead to compute a weighted sum of $1-Z^{\text{Acc}},Z^{\text{COBias}},Z^{\text{Extreme}}$ as a single energy function $E$ to be optimized using SA. Hence, the multiple objectives are combined into a total minimization objective:

$$\min_{\kappa}E(\kappa;\lambda,\boldsymbol{p}^{\prime}) \tag{7}$$

where $E(\kappa;\lambda,\boldsymbol{p}^{\prime})=\omega+\sum_{h\in S^{\text{Obj}}}\gamma^{h}Z^{h}$, $S^{\text{Obj}}$ is the set of names of the penalty functions corresponding to the individual objectives, and $\omega$ and the $\gamma^{h}$ are penalty parameters. Therefore, the SA algorithm optimizes $E$ to obtain an optimal set of membership functions.

In summary, the class corrections aim at reducing COBias and improving accuracy. Each of Eqs. 4 to 6 targets one of these two goals. In detail, Eq. 4 targets maximizing overall accuracy, Eq. 5 targets minimizing COBias, and Eq. 6 targets maximizing per-class accuracy by pushing each class to meet a threshold; Eq. 7 combines the three objectives into a single multi-objective function. Details on how Eq. 7 is optimized are described in the experimental setups (Section 4.1).

4 Experiments

4.1 Experimental Setups

Evaluation Tasks and Evaluation Metrics. The proposed method is evaluated on a diverse range of text classification datasets, including AGNews [22], a 4-class news topic classification; DBpedia [23], a 14-class ontology classification dataset derived from Wikipedia; SST-5 [24], a 5-class sentiment classification dataset; TREC [25, 26], a 6-class question classification dataset; RTE [27], a binary entailment recognition dataset; and two biomedical domain-specific datasets, including DDI [28], a 5-class drug-drug interaction relation extraction dataset; PubMedQA [29], a 3-class biomedical question answering dataset. Each evaluation dataset is split into optimization/development/test sets. We follow [1] to preprocess the datasets. Evaluation metrics are accuracy and COBias.

FuRud Setups. The 19 triangular membership functions in Figure 2 form the base of selections for FuRud. To obtain the per-class probabilities from ICL, we prompt Llama-2-13B (13B parameters) in a 1-shot manner. The output softmax probabilities normalized over all classes are used as attributes. The energy function we used in the experiments is a special form of Equation 7 with $\omega=1,\gamma^{\text{Acc}}=-1,\gamma^{\text{COBias}}=\alpha,\gamma^{\text{Extreme}}=\beta$. In other words, the final multi-objective optimization function is $\min_{\kappa}Z=1-Z^{\text{Acc}}+\alpha Z^{\text{COBias}}+\beta Z^{\text{Extreme}}$, where we learn $\kappa_{i}$ for class $i=1,\dots,N$ on an optimization set of samples, which is the full training set or a subset of it. Each $\kappa_{i}$ is selected from the given set of membership functions, and $\kappa_{i}=k$ means membership function $f_{k}$ is selected. At inference time, for a test sample, let $p=(p_{1},\dots,p_{i},\dots,p_{N})$ be its in-context learned output class probabilities; these probabilities are then transformed by their learned membership functions, according to Eq. 3. The corrected prediction is $\hat{y}=\operatorname*{arg\,max}_{i}f_{\kappa_{i}}(p_{i})$.
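The combined objective $Z=1-Z^{\text{Acc}}+\alpha Z^{\text{COBias}}+\beta Z^{\text{Extreme}}$ can be sketched in Python as below, taking the labels and the corrected predictions of the optimization set as input. The function name, the toy data, and the values $\alpha=\beta=1$ are hypothetical; in the paper $\alpha$ and $\beta$ are tuned on the development set. Giving an empty class accuracy 0 is also an assumption of this sketch.

```python
from itertools import combinations


def energy(labels, preds, num_classes, alpha, beta, lam=0.5):
    """Z = (1 - accuracy) + alpha * COBias + beta * Extreme on one set of predictions."""
    accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
    class_accs = []
    for c in range(num_classes):
        hits = [p == c for y, p in zip(labels, preds) if y == c]
        class_accs.append(sum(hits) / len(hits) if hits else 0.0)  # assumption for empty classes
    pairs = list(combinations(class_accs, 2))
    z_cobias = sum(abs(a - b) for a, b in pairs) / len(pairs)
    z_extreme = sum(max(0.0, lam - a) for a in class_accs)
    return (1.0 - accuracy) + alpha * z_cobias + beta * z_extreme


# Two prediction sets with the same overall accuracy (50%) on a toy binary task:
labels = [0, 0, 1, 1]
skewed = energy(labels, [0, 0, 0, 0], num_classes=2, alpha=1.0, beta=1.0)    # always class 0
balanced = energy(labels, [0, 1, 1, 0], num_classes=2, alpha=1.0, beta=1.0)  # errors spread evenly
```

Both toy prediction sets have 50% accuracy, but the skewed one incurs additional COBias and threshold penalties, which is the behavior the combined objective is meant to discourage.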

The above model $Z$ is optimized using the SA metaheuristic. The core step of SA is to sample a new solution $\kappa=(\kappa_{1},\dots,\kappa_{N})$, e.g., $(16,\dots,8)$, and evaluate it on $Z$. If $Z$ is smaller, the algorithm accepts the new solution; otherwise, it accepts the new solution with an acceptance probability $\exp(-\Delta Z/T)$, where $T$ is the temperature at the step. The values of $\alpha,\beta$ are tuned on the development set. Since we do not know an estimate for the expected threshold value $\lambda$ in downstream tasks, we set it to 0.5 for simplicity. Prompting is done on an 80G A100 GPU. The simulated annealing algorithm executes on an AMD EPYC 7742 CPU with execution time in minutes.
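The acceptance rule described above can be sketched as a generic simulated annealing loop over integer selection vectors. This is not the authors' implementation (which follows [1]): the geometric cooling schedule, the single-coordinate proposal, the default hyperparameters, and the 0-based function indices are all assumptions of this sketch, and `energy` stands for any callable that scores a candidate selection.

```python
import math
import random


def simulated_annealing(energy, num_classes, num_functions,
                        t_start=1.0, t_end=1e-3, cooling=0.95,
                        steps_per_temp=50, seed=0):
    """Minimize energy(kappa) over kappa in {0, ..., num_functions - 1}^num_classes."""
    rng = random.Random(seed)
    kappa = [rng.randrange(num_functions) for _ in range(num_classes)]
    current = energy(kappa)
    best, best_energy = list(kappa), current
    t = t_start
    while t > t_end:
        for _ in range(steps_per_temp):
            # Propose a neighbor: re-sample the function of one random class.
            candidate = list(kappa)
            candidate[rng.randrange(num_classes)] = rng.randrange(num_functions)
            cand_energy = energy(candidate)
            delta = cand_energy - current
            # Accept improvements; accept worse moves with probability exp(-delta / T).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                kappa, current = candidate, cand_energy
                if current < best_energy:
                    best, best_energy = list(kappa), current
        t *= cooling   # geometric cooling (assumed schedule)
    return best, best_energy
```

In FuRud's setting, `energy` would apply the candidate membership functions to the optimization set's probabilities and return $Z$; any other integer-valued toy objective can be plugged in to exercise the loop.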

We compare FuRud with the ICL baseline and two state-of-the-art ICL debiasing methods, including DNIP [1] and BC [10]. For fair comparisons, for each dataset, we prompt with three different 1-shot demonstrations and obtain three sets of initial probabilities. The demonstration is randomly sampled from optimization examples. The average test accuracy and COBias over the three runs are reported.

4.2 Main Results

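| Dataset | Acc. ↑ (ICL) | Acc. ↑ (BC) | Acc. ↑ (DNIP) | Acc. ↑ (FuRud) | COBias ↓ (ICL) | COBias ↓ (BC) | COBias ↓ (DNIP) | COBias ↓ (FuRud) |
|---|---|---|---|---|---|---|---|---|
| AGNews | $79.9_{7.0}$ | $82.5_{5.0}$ | $87.9_{0.7}$ | $85.7_{3.4}$ | $28.3_{16.1}$ | $23.1_{12.1}$ | $6.3_{0.6}$ | $6.9_{1.6}$ |
| DBpedia | $88.6_{1.7}$ | $89.1_{1.5}$ | $93.4_{0.6}$ | $92.2_{0.4}$ | $16.2_{3.7}$ | $15.4_{3.3}$ | $7.7_{0.6}$ | $9.2_{0.6}$ |
| SST-5 | $44.9_{4.3}$ | $47.6_{2.3}$ | $48.3_{1.9}$ | $48.8_{3.8}$ | $53.1_{5.0}$ | $49.8_{10.7}$ | $18.7_{10.1}$ | $22.2_{8.4}$ |
| TREC | $68.5_{10.8}$ | $72.9_{4.4}$ | $77.1_{2.0}$ | $77.3_{3.9}$ | $35.9_{6.5}$ | $31.9_{5.1}$ | $14.2_{1.3}$ | $18.5_{1.4}$ |
| RTE | $71.5_{2.2}$ | $76.1_{0.6}$ | $74.3_{0.8}$ | $74.5_{1.8}$ | $43.4_{7.0}$ | $16.4_{1.9}$ | $4.3_{3.3}$ | $7.1_{5.0}$ |
| DDI | $7.2_{0.9}$ | $14.4_{2.5}$ | $40.4_{6.0}$ | $69.3_{6.3}$ | $45.6_{5.9}$ | $32.6_{7.6}$ | $7.5_{3.2}$ | $36.8_{4.6}$ |
| PubMedQA | $55.1_{2.9}$ | $55.5_{1.3}$ | $63.1_{14.0}$ | $55.9_{5.4}$ | $61.2_{1.9}$ | $26.2_{3.2}$ | $41.1_{29.6}$ | $24.0_{8.4}$ |
| Avg. | 59.4 | 62.6 | 69.2 | 72.0 | 40.5 | 27.9 | 14.3 | 17.8 |

Table 1: Test accuracy and COBias (%); average scores over three runs are reported. FuRud outperforms previous methods in accuracy, and is on par with DNIP in COBias.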

Table 1 shows the test accuracy and COBias of ICL, BC, DNIP, and FuRud. Comparing FuRud to the ICL baseline, the average relative accuracy increase is 21%, and the average relative COBias reduction is 56%. The average test accuracy of FuRud over seven benchmarks is 72.0%, which outperforms the accuracy of BC and DNIP; the average test COBias of FuRud is 17.8%, which is comparable to DNIP, which obtains the lowest COBias (14.3%) among the methods compared. Note that FuRud uses the full optimization set to make a fair comparison to DNIP. However, FuRud can also work in a few-shot optimization manner, as discussed in Section 5.2. On top of that, FuRud provides enhanced interpretability, as visualized in the following section.
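The headline relative changes appear to follow from the Avg. row of Table 1. The short check below assumes that the relative figures are computed from the averaged scores (rather than per dataset and then averaged); the helper name `relative_change` is illustrative.

```python
def relative_change(before, after):
    """Signed change of `after` relative to `before`."""
    return (after - before) / before


# Avg. row of Table 1 (ICL vs. FuRud), in percent.
acc_gain = relative_change(59.4, 72.0)       # about +0.212, i.e. a 21% relative increase
cobias_drop = -relative_change(40.5, 17.8)   # about 0.560, i.e. a 56% relative reduction
```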

4.3 Interpretability Analysis

Figure 3: Class probability changes before and after applying FuRud. There was a stark accuracy difference of 37% for RTE’s True and False before FuRud, manifesting the model (ICL)’s tendency to assign higher probabilities to True. FuRud addresses this accuracy bias by amplifying the medium range of False and simultaneously reducing the relatively high range of True.
Figure 4: Zooming in on transformations applied to class Business from AGNews, whose accuracy increases from 80% (ICL) to 86%. The special case returns the original class probability of an example when transformed probabilities sum to 0 (Eq. 3).

We visualize the class-wise probability changes before and after applying FuRud in Figure 3. AGNews and RTE are taken as examples (other datasets’ results are similar). The run with seed 1 out of all three runs is used for illustrating the membership functions. For both AGNews and RTE, around half of the classes have an increased or unchanged accuracy. More importantly, on both datasets, the worst-performing class by ICL significantly improves. In detail, the relatively low to medium probability ranges of the worst-performing class get amplified, whereas the relatively high probability ranges of other classes get slightly reduced. This shows FuRud’s effective amplifications or reductions in the most correction-needed probability ranges of a class.

To further see this, Figure 4 illustrates the detailed transformation of different probability ranges of class Business of AGNews. For the 1,204 test examples with label Business, we divide their ICL output probabilities at the position of class Business into 5 different ranges, from $[0.0,0.2]$ to $[0.8,1.0]$. The top row shows that examples in the first two ranges, or $[0.0,0.4]$, have relatively low accuracies (0 and 9%). These probabilities need corrections most, and they are effectively transformed by the membership function $f_{11}$, selected by FuRud for class Business. The red color highlights activated parts for the transformations, resulting in new probability ranges of the examples and improved accuracies (9% and 66%). This further demonstrates the improved interpretability and higher accuracy obtained by FuRud, especially for a lower-performing class.
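A range-wise analysis of this kind can be sketched as follows: examples of one class are grouped by the probability assigned to that class, and accuracy is computed per group. The function name, the left-closed bin boundaries, and the toy inputs are assumptions of this sketch; the paper does not state how boundary values are assigned.

```python
def accuracy_by_range(class_probs, correct, num_bins=5):
    """Per-range accuracy for examples of one class.

    class_probs: probability given to the true class for each example.
    correct:     whether each example was predicted correctly.
    Bins are [0, 0.2), [0.2, 0.4), ..., [0.8, 1.0] for num_bins=5 (assumed convention).
    """
    totals = [0] * num_bins
    hits = [0] * num_bins
    for p, ok in zip(class_probs, correct):
        b = min(int(p * num_bins), num_bins - 1)   # p == 1.0 goes to the last bin
        totals[b] += 1
        hits[b] += int(ok)
    # None marks an empty range, for which accuracy is undefined.
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy data (not from the paper): low-probability examples are mostly wrong.
probs = [0.05, 0.15, 0.3, 0.7, 0.95, 1.0]
is_correct = [False, False, True, True, True, True]
range_accs = accuracy_by_range(probs, is_correct)
```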

5 Discussion

| Method | Acc. | COBias |
|---|---|---|
| ICL (letter) | $36.9_{13.6}$ | $47.2_{15.6}$ |
| FuRud (letter) | $53.1_{10.5}$ | $21.6_{8.2}$ |

Table 2: Test scores (%) of FuRud on letter-based ICL outputs, averaged over the seven datasets.

5.1 FuRud Greatly Improves Highly Skewed Letter Based ICL Outputs, by 44% Relative Accuracy Increase and 54% Relative COBias Reduction

In this section, we show the effectiveness of FuRud under a different set of prompt output choices - the letter options, which could lead to a more serious shallow matching issue than label token options. When letter options are used in a prompt, a model is expected to output a single letter choice of “A”, “B”, etc., mapping to a class label. Output choices significantly contribute to prompt sensitivity. In fact, LLMs have been shown to have a tendency to select a certain letter option regardless of the content, where for instance a model could over-predict the letter “A” [30], suggesting moderate to high COBias. This surface pattern matching issue of letter options is also obvious on the datasets we evaluated, where it can lead to over 90% accuracy in the biased class and much lower accuracy in some other classes. For example, on AGNews, the model is biased to predict “B” (class label: Sports), leading to an average of 99% accuracy in Sports and 12% accuracy in Business over three runs.

We apply FuRud to the highly distorted letter-based ICL outputs. Table 2 shows the test accuracy and COBias for ICL and FuRud, averaged over seven benchmark datasets, where FuRud improves accuracy by a relative 44% and achieves a significant COBias reduction of a relative 54% over ICL. Besides the tabled results, on the aforementioned AGNews dataset, overall test accuracy improves to 66% from 45%, and COBias reduces to 10% from 54%. The per-class accuracy changes from ICL to FuRud are: World, 40% $\rightarrow$ 69%; Sports, 99% $\rightarrow$ 70%; Business, 12% $\rightarrow$ 66%; Technology, 27% $\rightarrow$ 59%. These results demonstrate the effectiveness of FuRud on debiasing highly skewed ICL outputs, suggesting that FuRud remains effective even when the prompt format yields heavily biased predictions.
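As a quick check, the relative figures quoted above are consistent with the averaged scores in Table 2, assuming they are computed from those averages:

```latex
\[
\frac{53.1-36.9}{36.9}\approx 0.439 \;(\text{about } 44\%\text{ relative accuracy increase}),
\qquad
\frac{47.2-21.6}{47.2}\approx 0.542 \;(\text{about } 54\%\text{ relative COBias reduction}).
\]
```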

5.2 Few-shot Optimization

Figure 5: Few-shot optimization.

FuRud can optimize a downstream task with as few as 10 examples. Figure 5 shows test accuracy and COBias of FuRud (in mint green color) when used in a few-shot optimization manner, starting with 10 few-shot examples and growing to 100 and 500 examples. TREC and SST-5 are shown to illustrate that FuRud can achieve an average of 9% accuracy improvement with 18% COBias reduction over the ICL baseline at 10 few-shot optimization examples. At 10 examples, FuRud obtains an 11% and a 6% relative increase in accuracy over the ICL baseline on TREC and SST-5 respectively; at the same time, it reduces COBias by a relative 20% and 16% on each dataset. The accuracy and COBias performances gradually improve as the number of examples increases to 500. Compared to existing methods, FuRud outperforms BC in few-shot scenarios, and performs better than (TREC) or on par with (SST-5) DNIP while being interpretable. Similar findings apply to the other five datasets, as shown in Appendix B.

5.3 Effect of Membership Function Granularities

Figure 6: Accuracy-COBias tradeoff with 5 combinations of fuzzy partitions.

We experiment with different combinations of the four fuzzy partitions in Figure 2, in addition to the main results using all partitions. The partitions are characterized by different rates of change, i.e., different absolute values of slopes of the rising/falling edges. A larger slope indicates a higher granularity. The slopes for the top left, top right, bottom left, and bottom right partitions are $\pm 1,\pm 2,\pm 4,\pm 8$ respectively. Specifically, the bottom right partition has the Don’t Change function $y=x$ and its symmetric function $y=1-x$, which will be referred to as the DC partition. Since the Don’t Change function plays a vital role in keeping some classes unchanged, we experiment with five combinations, including DC, and DC with each partition of slope $\pm 2,\pm 4,\pm 8$. The accuracy and COBias scores of the five combinations are shown in Figure 6. Average scores over the seven datasets are reported, and for each dataset, the average accuracy and COBias over three runs are taken. COBias reduces with higher granularities and accuracy slightly decreases. DC can reach 74% accuracy, being 15% higher than ICL accuracy, but the improvement is mainly from DDI, suggesting that DC alone is not enough to transform the biased probabilities. The optimal accuracy and COBias are achieved with mixed partitions.
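The exact family of membership functions is defined in Appendix A, which is not reproduced here. As a rough illustration only, the sketch below assumes that each partition places its triangle peaks uniformly on $[0,1]$ and that edge slopes of $\pm 1,\pm 2,\pm 4,\pm 8$ correspond to 2, 3, 5, and 9 peaks. Under that assumption the four partitions contain 19 triangles in total, and the triple (0.125, 0.25, 0.375) quoted for $f_{11}$ appears among the slope-8 triangles. The function name and the ordering of the resulting list are hypothetical and do not reproduce the paper's indexing.

```python
def uniform_partition(num_peaks):
    """(a, b, c) triples for triangles whose peaks are evenly spaced on [0, 1].

    Supports are clipped to [0, 1], so the first and last triangles are one-sided.
    """
    step = 1.0 / (num_peaks - 1)          # peak spacing; edge slope is 1 / step
    triangles = []
    for j in range(num_peaks):
        peak = j * step
        triangles.append((max(0.0, peak - step), peak, min(1.0, peak + step)))
    return triangles


# Assumed layout: 2, 3, 5 and 9 peaks give edge slopes of 1, 2, 4 and 8.
family = [t for n in (2, 3, 5, 9) for t in uniform_partition(n)]
family_size = len(family)
```

Under this construction the two-peak partition reduces to the pair $y=1-x$ and $y=x$, matching the description of the DC partition.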

In addition, the Don’t Change function is essential in debiasing. We perform an ablation analysis with the partition $\pm 8$ only, and find that, while achieving similar accuracies, its COBias is 6% higher than using DC with partition $\pm 8$. Moreover, for example, 4 out of 14 classes on DBpedia are optimized with Don’t Change, suggesting that keeping certain classes unchanged is necessary for jointly optimizing overall accuracy and COBias. This demonstrates that a dedicated Don’t Change function is needed in the multi-objective optimization.

In summary, higher membership function granularity benefits COBias reduction. However, although it is tempting to include as many membership functions as possible to reduce COBias, there is an accuracy-COBias tradeoff: too many membership functions may not further boost accuracy and could incur higher computational costs.
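
To make the two axes of this tradeoff concrete, the following is a minimal sketch of how per-class accuracy and COBias could be computed. It assumes COBias is the mean absolute difference between the accuracies of every pair of classes, in line with its description as a pairwise class accuracy bias; the function names `per_class_accuracy` and `cobias` are illustrative and not taken from the paper.

```python
from itertools import combinations


def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy computed separately over the samples of each true class."""
    accs = []
    for c in range(num_classes):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        # Assumption: a class with no samples contributes an accuracy of 0.
        accs.append(hits / len(idx) if idx else 0.0)
    return accs


def cobias(class_accs):
    """Assumed definition: mean |acc_i - acc_j| over all class pairs i < j."""
    pairs = list(combinations(class_accs, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Toy example: class 0 absorbs most predictions, so accuracies are imbalanced.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 0, 0]
accs = per_class_accuracy(y_true, y_pred, num_classes=3)
print(accs, cobias(accs))
```

Under this definition, a classifier with identical per-class accuracies has a COBias of zero regardless of how high or low its overall accuracy is, which is why the two metrics have to be examined jointly.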

5.4 More Discussions

FuRud’s Performances on More LLMs. Across additional LLMs of varied sizes and families, FuRud consistently improves both overall accuracy and COBias, as shown by the additional experimental results on Llama-2-7B and GPT2-XL in Appendix C.

FuRud’s Performances under More ICL Demonstration Selection Strategies. To further see how demonstrations in the prompt affect performance, we additionally prompt Llama-2-13B with another demonstration selection strategy, k-shot prompting, where k is the number of classes: one demonstrative example from each class is randomly selected from the optimization set, and these examples are concatenated to form the demonstration. FuRud significantly improves accuracy and COBias in this setting, as detailed in Appendix D.

Computational Costs. FuRud optimization takes on the order of minutes, from several minutes to around 30 minutes, depending on the dataset (e.g., number of classes and optimization set size). For DNIP, the computational time is similarly on the order of minutes. The calibration method Batch Calibration (BC) applies an analytical calculation to all samples’ ICL probabilities, introducing only a small computational overhead.

Interpretability compared: DNIP and FuRud. The DNIP method shows good debiasing performance, but it applies indiscriminate reduction (or relative amplification) to the probabilities, making it difficult to capture the varying degrees of influence of different probability ranges on the accuracy bias, which potentially limits its interpretability. The use of fuzzy membership functions overcomes this issue, and this is a main innovation of our paper.

Can we use the traditional fuzzy rule based systems for debiasing? That would require maintaining multiple candidate rules like "$R_q$: If the probability of class 1 is $A_{q1}$ and … and the probability of class $N$ is $A_{qN}$, then predict $Y_q$," where $Y_q$ is the consequent/target class. Training such rules is computationally expensive, and inference time for a winning rule grows with the number of candidates. Additionally, calculating the product of membership values could cause issues such as overflow, and achieving high accuracy might demand an overwhelming number of rules, making the system inefficient. In contrast, FuRud eliminates the need for learning multiple rules, as its transformations could implicitly capture many rules found in traditional fuzzy classification systems.
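
For contrast, the kind of rule-based inference described above can be sketched as follows. This is a textbook-style single-winner scheme written under our own assumptions (triangular antecedent fuzzy sets given as $(a, b, c)$ triples, product compatibility, no rule weights); the names `tri`, `compatibility`, and `single_winner_predict` are hypothetical, and the two rules are made up for illustration.

```python
from math import prod


def tri(x, a, b, c):
    """Triangular membership value of x for feet a, c and peak b."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def compatibility(probs, antecedents):
    """Product of per-class membership values for one candidate rule."""
    return prod(tri(p, *abc) for p, abc in zip(probs, antecedents))


def single_winner_predict(probs, rules):
    """Return the target class of the rule with the highest compatibility."""
    best = max(rules, key=lambda r: compatibility(probs, r["antecedents"]))
    return best["target"]


# Two hypothetical rules over a 2-class probability vector.
rules = [
    {"antecedents": [(0.5, 1.0, 1.0), (0.0, 0.0, 0.5)], "target": 0},
    {"antecedents": [(0.0, 0.0, 0.5), (0.5, 1.0, 1.0)], "target": 1},
]
print(single_winner_predict([0.8, 0.2], rules))
```

Every prediction scans all candidate rules and multiplies one membership value per class, so the cost grows with both the number of rules and the number of classes, whereas a per-class transformation needs only one function evaluation per class.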

We have a different motivation from traditional post-hoc corrections. Some may argue that ensuring equitable accuracies across all classes is a well-studied problem in standard machine learning classifiers. It is worth emphasizing that the per-class prediction accuracy imbalance should be treated within its particular context. The accuracy bias in ICL outputs stems from completely different causes than the unequal class accuracies observed in potentially overfitted traditional classifiers: the former is rooted in the prompts and the LLMs, whereas the latter arises from class imbalance in supervised training data. That is why our method is applied specifically to ICL’s output token class probabilities, pinpointing specific patterns and applying precise, targeted corrections.

6 Conclusion and Future Work

In this work, we present a fuzzy rule optimization based debiasing method to enhance ICL output class representations with interpretability. FuRud learns a per-class correction function, i.e., a membership function, which decides if and how a class’s probability needs correction for each sample. If correction is needed, the corrected class probability will be tailored by the membership function, which is a main innovation of this paper. On a diverse set of text classification benchmarks, FuRud greatly improves the average test accuracy and test COBias over ICL, by a relative increase of 21% and a relative reduction of 56%, outperforming state-of-the-art methods. Moreover, FuRud can work for prompt formats that may lead to highly skewed predictions, e.g., letter options. Furthermore, FuRud can optimize a downstream task with as few as 10 optimization examples.

In the future, more versatile rules can be explored, and we may also examine the tradeoff between the accuracy and rule complexity. Simpler rules are easier to understand, but the transformations may fail to catch the intricate interactions between class predictions. More complex rules may have better modeling capabilities, but they are harder to read. In addition, this work focuses on evaluating text classification, and we will extend interpretable ICL debiasing to more language tasks, modalities, and model architectures.

References

  • Lin and You [2024] Ruixi Lin and Yang You. COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via Nonlinear Integer Programming, 2024. URL https://arxiv.org/abs/2405.07623.
  • Zadeh [1965] L.A. Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, 1965. URL https://www.sciencedirect.com/science/article/pii/S001999586590241X.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Schick et al. [2021] Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Transactions of the Association for Computational Linguistics, 9:1408–1424, 12 2021. URL https://doi.org/10.1162/tacl_a_00434.
  • Zhao et al. [2021] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697–12706, 2021. URL https://proceedings.mlr.press/v139/zhao21c.html.
  • Min et al. [2022] Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, 2022. URL https://aclanthology.org/2022.emnlp-main.759.
  • Holtzman et al. [2021] Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, 2021. URL https://aclanthology.org/2021.emnlp-main.564.
  • Schick and Schütze [2021] Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online, April 2021. URL https://aclanthology.org/2021.eacl-main.20.
  • Fei et al. [2023] Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. Mitigating label biases for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14014–14031, July 2023. URL https://aclanthology.org/2023.acl-long.783.
  • Zhou et al. [2024] Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine A Heller, and Subhrajit Roy. Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=L3FHMoKZcS.
  • Jethani et al. [2021] Neil Jethani, Mukund Sudarshan, Yindalon Aphinyanaphongs, and Rajesh Ranganath. Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations. Proceedings of Machine Learning Research, 130:1459–1467, 2021. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8096519.
  • Carvalho et al. [2019] Diogo V. Carvalho, Eduardo M. Pereira, and Jaime S. Cardoso. Machine learning interpretability: A survey on methods and metrics. Electronics, 8(8), 2019. URL https://www.mdpi.com/2079-9292/8/8/832.
  • Vernon et al. [2024] Eric M. Vernon, Naoki Masuyama, and Yusuke Nojima. Integrating white and black box techniques for interpretable machine learning. In Xin-She Yang, Simon Sherratt, Nilanjan Dey, and Amit Joshi, editors, Proceedings of Ninth International Congress on Information and Communication Technology, pages 639–649, 2024.
  • Vilone and Longo [2020] Giulia Vilone and Luca Longo. Explainable Artificial Intelligence: a Systematic Review, 2020. URL https://arxiv.org/abs/2006.00093.
  • Ishibuchi and Nojima [2007] Hisao Ishibuchi and Yusuke Nojima. Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. International Journal of Approximate Reasoning, 44(1):4–31, 2007. URL https://www.sciencedirect.com/science/article/pii/S0888613X06000405.
  • Ishibuchi et al. [1999] Hisao Ishibuchi, Tomoharu Nakashima, and Tadahiko Murata. Performance Evaluation of Fuzzy Classifier Systems for Multidimensional Pattern Classification Problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(5):601–618, 1999. doi: 10.1109/3477.790443. URL https://ieeexplore.ieee.org/document/790443.
  • Ishibuchi et al. [2005] Hisao Ishibuchi, Takashi Yamamoto, and Tomoharu Nakashima. Hybridization of Fuzzy GBML Approaches for Pattern Classification Problems. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(2):359–365, 2005. URL https://ieeexplore.ieee.org/abstract/document/1408064.
  • Nojima and Ishibuchi [2016] Yusuke Nojima and Hisao Ishibuchi. Multiobjective Fuzzy Genetics-based Machine Learning with a Reject Option. In 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pages 1405–1412, 2016.
  • Rudziński [2016] Filip Rudziński. A multi-objective genetic optimization of interpretability-oriented fuzzy rule-based classifiers. Applied Soft Computing, 38:118–133, 2016. URL https://www.sciencedirect.com/science/article/abs/pii/S1568494615006109.
  • Gorzałczany and Rudziński [2017] Marian B. Gorzałczany and Filip Rudziński. Interpretable and accurate medical data classification – a multi-objective genetic-fuzzy optimization approach. Expert Systems with Applications, 71:26–39, 2017. doi: https://doi.org/10.1016/j.eswa.2016.11.017. URL https://www.sciencedirect.com/science/article/pii/S0957417416306467.
  • Garey and Johnson [1979] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. Mathematical Sciences Series. Freeman, 1979. ISBN 9780716710448. URL https://books.google.com.sg/books?id=fjxGAQAAIAAJ.
  • Zhang et al. [2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. In Advances in Neural Information Processing Systems, 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf.
  • Auer et al. [2007] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A Nucleus for A Web of Open Data. In Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, pages 722–735, 2007.
  • Socher et al. [2013] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013. URL https://aclanthology.org/D13-1170.
  • Voorhees and Tice [2000] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207, 2000. URL https://doi.org/10.1145/345508.345577.
  • Li and Roth [2002] Xin Li and Dan Roth. Learning Question Classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002. URL https://aclanthology.org/C02-1150.
  • Dagan et al. [2006] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL Recognising Textual Entailment Challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, 2006.
  • Segura-Bedmar et al. [2013] Isabel Segura-Bedmar, Paloma Martínez, and María Herrero-Zazo. SemEval-2013 Task 9 : Extraction of Drug-Drug Interactions from Biomedical Texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 341–350, 2013. URL https://aclanthology.org/S13-2056.pdf.
  • Jin et al. [2019] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A Dataset for Biomedical Research Question Answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019. URL https://aclanthology.org/D19-1259.
  • Bentham et al. [2024] Oliver Bentham, Nathan Stringham, and Ana Marasović. Chain-of-Thought Unfaithfulness as Disguised Accuracy, 2024. URL https://arxiv.org/abs/2402.14897.

Appendix A Details on Membership Functions

Table 3 lists the details about the membership functions used in this work.

| Function | Parameters | Name | Short Form | Meaning |
|---|---|---|---|---|
| $f_1$ | 0, 0, 0.5 | Low-2 | L-2 | Low-range transformation, smooth change with slope $-2$, peak at 0 |
| $f_2$ | 0, 0.5, 1 | Medium-2 | M-2 | Medium-range transformation, smooth change with slope $\pm 2$, peak at 0.5 |
| $f_3$ | 0.5, 1, 1 | High-2 | H-2 | High-range transformation, smooth change with slope $2$, peak at 1 |
| $f_4$ | 0, 0, 0.25 | Low-4 | L-4 | Low-range transformation, sharp change with slope $-4$, peak at 0 |
| $f_5$ | 0, 0.25, 0.5 | Medium Low-4 | ML-4 | Low-to-medium-range transformation, sharp change with slope $\pm 4$, peak at 0.25 |
| $f_6$ | 0.25, 0.5, 0.75 | Medium-4 | M-4 | Medium-range transformation, sharp change with slope $\pm 4$, peak at 0.5 |
| $f_7$ | 0.5, 0.75, 1 | Medium High-4 | MH-4 | Medium-to-high-range transformation, sharp change with slope $\pm 4$, peak at 0.75 |
| $f_8$ | 0.75, 1, 1 | High-4 | H-4 | High-range transformation, sharp change with slope $4$, peak at 1 |
| $f_9$ | 0, 0, 0.125 | Very Very Low-8 | VVL-8 | Very-very-low-range transformation, very sharp change with slope $-8$, peak at 0 |
| $f_{10}$ | 0, 0.125, 0.25 | Very Low-8 | VL-8 | Very-low-range transformation, very sharp change with slope $\pm 8$, peak at 0.125 |
| $f_{11}$ | 0.125, 0.25, 0.375 | Low-8 | L-8 | Low-range transformation, very sharp change with slope $\pm 8$, peak at 0.25 |
| $f_{12}$ | 0.25, 0.375, 0.5 | Medium Low-8 | ML-8 | Low-to-medium-range transformation, very sharp change with slope $\pm 8$, peak at 0.375 |
| $f_{13}$ | 0.375, 0.5, 0.625 | Medium-8 | M-8 | Medium-range transformation, very sharp change with slope $\pm 8$, peak at 0.5 |
| $f_{14}$ | 0.5, 0.625, 0.75 | Medium High-8 | MH-8 | Medium-to-high-range transformation, very sharp change with slope $\pm 8$, peak at 0.625 |
| $f_{15}$ | 0.625, 0.75, 0.875 | High-8 | H-8 | High-range transformation, very sharp change with slope $\pm 8$, peak at 0.75 |
| $f_{16}$ | 0.75, 0.875, 1 | Very High-8 | VH-8 | Very-high-range transformation, very sharp change with slope $\pm 8$, peak at 0.875 |
| $f_{17}$ | 0.875, 1, 1 | Very Very High-8 | VVH-8 | Very-very-high-range transformation, very sharp change with slope $8$, peak at 1 |
| $f_{18}$ | 0, 0, 1 | Full-1 | F-1 | Full-range transformation, very smooth change with slope $-1$, peak at 0 |
| $f_{19}$ | 0, 1, 1 | Don’t Change | Don’t Change | Identity function |

Table 3: Names, parameters $(a, b, c)$, short forms, and meanings for membership functions.
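
The parameters in Table 3 are consistent with triangular membership functions that rise from $a$ to a peak at $b$ and fall to $c$. The sketch below evaluates such a function and applies one selected function per class to a probability vector. It is an illustration under our own assumptions, not the authors' implementation: the helper names `triangular`, `correct`, and `predict` are hypothetical, only a few of the 19 functions are listed, and taking the argmax over corrected values is assumed to be the decision rule.

```python
# Hypothetical sketch: triangular membership functions keyed by the short
# forms in Table 3. Only a few of the 19 functions are listed here.
PARAMS = {
    "L-2": (0.0, 0.0, 0.5),
    "M-2": (0.0, 0.5, 1.0),
    "H-2": (0.5, 1.0, 1.0),
    "F-1": (0.0, 0.0, 1.0),
    "Don't Change": (0.0, 1.0, 1.0),
}


def triangular(x, a, b, c):
    """Triangular membership value of x for feet a, c and peak b (a <= b <= c)."""
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge, slope 1 / (b - a)
    return (c - x) / (c - b)      # falling edge, slope -1 / (c - b)


def correct(probs, chosen):
    """Apply the membership function chosen for each class to its probability."""
    return [triangular(p, *PARAMS[name]) for p, name in zip(probs, chosen)]


def predict(probs, chosen):
    """Assumed decision rule: pick the class with the largest corrected value."""
    corrected = correct(probs, chosen)
    return max(range(len(corrected)), key=corrected.__getitem__)


probs = [0.45, 0.35, 0.20]
chosen = ["Don't Change", "M-2", "Don't Change"]
print(correct(probs, chosen))
print(predict(probs, chosen))
```

In this toy case the uncorrected argmax is class 0; amplifying the mid-range probability of class 1 with M-2 while leaving the other two classes unchanged moves the prediction to class 1. With parameters (0, 1, 1) the function reduces to the identity, matching the Don’t Change row.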

Appendix B Additional Few-shot Optimization Results

Figure 7 shows additional few-shot optimization results. In the few-shot optimization setting, FuRud achieves results better than or comparable to DNIP, and better than BC and the ICL baseline, while providing enhanced interpretability.

Figure 7: Additional few-shot optimization results.

Appendix C FuRud’s Performances on More LLMs

We ran FuRud experiments on two additional models, Llama-2-7B and GPT2-XL. Results are shown in Table 4. For example, on Llama-2-7B, FuRud improves accuracy by a relative 22% and reduces COBias by a relative 63% over the ICL baseline, demonstrating that FuRud yields consistent performance improvements across models. Admittedly, our current evaluations focus on relatively small LLMs, but our approach can also work for larger models, as long as class probabilities are available and the imbalanced per-class accuracy issue exists.
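
As a check on the arithmetic, the relative changes quoted above follow from the Llama-2-7B average column of Table 4 (accuracy 56.8 to 69.3, COBias 37.2 to 13.8), assuming both are computed against the ICL baseline:

\[
\frac{69.3 - 56.8}{56.8} = \frac{12.5}{56.8} \approx 22.0\%, \qquad
\frac{37.2 - 13.8}{37.2} = \frac{23.4}{37.2} \approx 62.9\%.
\]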

| Model | Method | Metric | AGNews | DBpedia | SST-5 | TREC | RTE | DDI | PubMedQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7B | ICL | Acc | $86.4_{2.5}$ | $88.9_{2.0}$ | $42.1_{11.1}$ | $66.7_{6.6}$ | $66.3_{4.3}$ | $6.7_{0.4}$ | $40.3_{6.7}$ | 56.8 |
| Llama-2-7B | ICL | COBias | $14.0_{6.5}$ | $13.5_{2.1}$ | $55.6_{1.5}$ | $33.2_{10.0}$ | $61.6_{10.5}$ | $41.4_{1.7}$ | $40.9_{16.1}$ | 37.2 |
| Llama-2-7B | FuRud | Acc | $\boldsymbol{88.5_{0.5}}$ | $\boldsymbol{91.5_{0.5}}$ | $\boldsymbol{49.5_{0.7}}$ | $\boldsymbol{73.1_{3.9}}$ | $\boldsymbol{72.7_{1.0}}$ | $\boldsymbol{54.4_{6.4}}$ | $\boldsymbol{55.7_{7.6}}$ | $\boldsymbol{69.3}$ |
| Llama-2-7B | FuRud | COBias | $\boldsymbol{7.4_{2.5}}$ | $\boldsymbol{8.4_{0.6}}$ | $\boldsymbol{24.0_{1.2}}$ | $\boldsymbol{14.1_{1.9}}$ | $\boldsymbol{4.2_{2.7}}$ | $\boldsymbol{16.9_{5.0}}$ | $\boldsymbol{21.8_{16.6}}$ | $\boldsymbol{13.8}$ |
| GPT2-XL | ICL | Acc | $52.1_{5.4}$ | $31.8_{9.9}$ | $34.9_{13.7}$ | $27.4_{10.5}$ | $55.4_{1.9}$ | $14.5_{4.4}$ | $55.2_{0.0}$ | 38.8 |
| GPT2-XL | ICL | COBias | $35.5_{11.5}$ | $40.0_{3.6}$ | $48.7_{5.4}$ | $45.6_{8.7}$ | $82.4_{24.5}$ | $40.7_{5.9}$ | $59.4_{12.6}$ | 50.3 |
| GPT2-XL | FuRud | Acc | $\boldsymbol{69.0_{0.5}}$ | $\boldsymbol{67.7_{11.8}}$ | $\boldsymbol{43.4_{3.1}}$ | $\boldsymbol{41.7_{2.7}}$ | $\boldsymbol{51.2_{3.7}}$ | $\boldsymbol{53.2_{17.0}}$ | $\boldsymbol{48.4_{0.3}}$ | $\boldsymbol{53.5}$ |
| GPT2-XL | FuRud | COBias | $\boldsymbol{7.4_{2.9}}$ | $\boldsymbol{23.0_{6.5}}$ | $\boldsymbol{25.4_{1.4}}$ | $\boldsymbol{30.2_{7.0}}$ | $\boldsymbol{8.9_{3.6}}$ | $\boldsymbol{23.1_{6.5}}$ | $\boldsymbol{17.6_{4.6}}$ | $\boldsymbol{19.4}$ |

Table 4: Test accuracy and COBias comparisons on more LLMs.

Appendix D FuRud’s Performances under More ICL Demonstration Selection Strategies

| Demonstration Selection | Metric | AGNews | DBpedia | SST-5 | TREC | RTE | DDI | PubMedQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| k-shot, ICL | Acc | $83.5_{1.5}$ | $95.2_{1.2}$ | $50.3_{2.3}$ | $67.0_{12.7}$ | $75.0_{0.8}$ | $9.7_{1.0}$ | $52.3_{5.3}$ | 61.9 |
| k-shot, ICL | COBias | $14.9_{5.1}$ | $7.0_{2.2}$ | $36.3_{7.2}$ | $38.2_{5.1}$ | $22.5_{13.2}$ | $39.7_{3.5}$ | $20.9_{4.2}$ | 25.6 |
| k-shot, FuRud | Acc | $\boldsymbol{88.1_{0.6}}$ | $\boldsymbol{96.6_{0.4}}$ | $\boldsymbol{54.3_{1.3}}$ | $\boldsymbol{77.9_{6.0}}$ | $\boldsymbol{75.9_{4.6}}$ | $\boldsymbol{62.3_{2.1}}$ | $\boldsymbol{59.2_{5.9}}$ | $\boldsymbol{73.5}$ |
| k-shot, FuRud | COBias | $\boldsymbol{7.7_{2.5}}$ | $\boldsymbol{4.4_{0.7}}$ | $\boldsymbol{13.8_{4.1}}$ | $\boldsymbol{11.6_{3.3}}$ | $\boldsymbol{5.0_{1.4}}$ | $\boldsymbol{27.0_{2.2}}$ | $\boldsymbol{21.3_{8.7}}$ | $\boldsymbol{13.0}$ |

Table 5: Test accuracy and COBias under the k-shot demonstration selection strategy.

We additionally prompt Llama-2-13B with the following demonstration selection strategy: k-shot prompting, where k is the number of classes. A demonstrative example from each class is randomly selected from the optimization set and represented in the prompt. FuRud significantly improves accuracy and COBias over ICL baselines, as shown in Table 5.
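
The selection strategy described above can be sketched as follows. The prompt template (`Input:` / `Label:` lines separated by blank lines), the function name `build_k_shot_prompt`, and the toy optimization set are assumptions made for illustration; the paper's actual prompt format may differ.

```python
import random


def build_k_shot_prompt(optimization_set, test_text, labels, seed=0):
    """Pick one random demonstration per class and concatenate them.

    optimization_set: list of (text, label) pairs.
    labels: the class names, so k = len(labels).
    """
    rng = random.Random(seed)
    blocks = []
    for label in labels:
        # Candidates for this class; one is drawn uniformly at random.
        candidates = [text for text, y in optimization_set if y == label]
        blocks.append(f"Input: {rng.choice(candidates)}\nLabel: {label}")
    # The test instance goes last, with its label left for the model to fill.
    blocks.append(f"Input: {test_text}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical two-class optimization set.
optimization_set = [
    ("The team won the final.", "Sports"),
    ("Shares fell sharply today.", "Business"),
    ("The striker scored twice.", "Sports"),
]
print(build_k_shot_prompt(optimization_set, "Oil prices rose.", ["Sports", "Business"]))
```

Because each class contributes exactly one demonstration, the prompt length grows linearly with the number of classes, which matters for datasets such as DBpedia with 14 classes.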

Compared to the 1-shot strategy (Table 1), the k-shot strategy provides a different starting point for FuRud. For example, the average ICL accuracy with k-shot prompting (61.9%) is slightly higher than that obtained with 1-shot prompting (59.4%), and the average COBias (25.6%) is lower than with 1-shot prompting (40.5%). FuRud boosts average accuracy to 73.5% and reduces COBias to 13.0%. In conclusion, different example selection strategies provide different starting points for optimization, on which FuRud consistently improves.