
Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding

Yunchang Zhu (CAS Key Laboratory of AI Security, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, Beijing, China), Liang Pang (CAS Key Laboratory of AI Security, Institute of Computing Technology, CAS, Beijing, China), Kangxi Wu (CAS Key Laboratory of AI Security, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, Beijing, China), Yanyan Lan (Institute for AI Industry Research, Tsinghua University, Beijing, China), Huawei Shen (CAS Key Laboratory of AI Security, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, Beijing, China), and Xueqi Cheng (CAS Key Laboratory of AI Security, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, Beijing, China)
Abstract.

Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 5 widely used pretrained language models and find it particularly superior for models with few parameters or long input.

natural language understanding, question answering, pseudo-relevance feedback, loss function
This work was supported by the National Key R&D Program of China (Grants 2022YFB3103700 and 2022YFB3103704), the National Natural Science Foundation of China (NSFC) under Grants No. 62276248 and U21B2046, and the Youth Innovation Promotion Association CAS under Grant No. 2023111.

CCS Concepts: Information systems → Information retrieval query processing; Retrieval models and ranking; Question answering; Clustering and classification.

This paper is an extension of the SIGIR 2022 conference paper (Zhu et al., 2022). The earlier conference paper is limited to solving the query drift problem in pseudo-relevance feedback by comparing the retrieval loss using feedback sets of different sizes. However, the comparison principle and comparative loss are actually general and task-agnostic. Moreover, in addition to comparing the input of the model, we can also compare the parameters of the model. Thus, in this work, we (1) provide a more general and complete formulation of the comparison principle and comparative loss, (2) directly use a unified comparative loss as the final loss being optimized, eliminating the need to set a weighting coefficient between the comparative regularization term and the task-specific losses, (3) improve the previous comparison method that compares inputs with different context sizes and propose an alternative dropout-based comparison method to improve the utility of the parameters to the model, and (4) apply the comparative loss to more tasks and models and empirically demonstrate its universal effectiveness.

1. Introduction

Natural language understanding (NLU) has been pushed a remarkable step forward by deep neural models. To further enhance the performance of deep models, enlarging the model size (Brutzkus and Globerson, 2019; Kaplan et al., 2020; Brown et al., 2020; Raffel et al., 2020) and the input context (Wang et al., 2022; Borgeaud et al., 2022; Izacard et al., 2022) are two conventional and effective ways, where the former introduces more hidden neurons and the latter brings more input neurons. Although neural models with more hidden or input neurons have higher accuracy on average, large-scale models do not always beat small models. On one hand, many network pruning methods have shown that compressed models with significantly reduced parameters (neuron connections) can maintain accuracy (LeCun et al., 1989; Han et al., 2015; Liu et al., 2017) and even improve generalization (Bartoldson et al., 2020); Meyes et al. (2019) find that ablation of neurons can consistently improve performance on some specific classes; and Zhong et al. (2021) empirically demonstrate that larger language models indeed perform worse on a non-negligible fraction of instances. These phenomena indicate that some hidden neurons in currently trained models are dispensable or even obstructive. On the other hand, much of the work on question answering (Clark and Gardner, 2018; Yang et al., 2018) and query understanding (Mitra et al., 1998; Zighelnic and Kurland, 2008; Collins-Thompson, 2009) has noted that feeding more contextual information is more likely to distract the model and hurt performance. This is not surprising, as more input neurons not only mean more relevant features but are also likely to introduce more noise that interferes with the model. Similar to network pruning, which cuts out inefficient parameters through post-processing, many context selection methods (Min et al., 2018; Tu et al., 2020; Zheng et al., 2020; Dua et al., 2021) trim off noisy segments from the input context through pre-processing. In essence, both network pruning and context selection reduce inefficient hidden or input neurons through additional processing. However, apart from extrinsically reducing inefficient neurons, can we intrinsically improve the utility of neurons during model training?

Imagine an ideal neural network in which all neurons are able to cooperate efficiently to maximize the utility of each neuron. If a fraction of the input or hidden neurons in this network are ablated (disabling part of the input context or model parameters; the output value of an ablated neuron is set to 0, which is equivalent to setting all connection weights to and from this neuron to 0), the ablated submodel is not supposed to perform better, even if the ablated neurons are noisy. This is because an efficient model (in this work, “efficient” refers specifically to the high utility of neurons) should have already suppressed this noise. Following this intuition, we can roughly state a comparison principle between the original full model and its ablated models: the fewer neurons are ablated in the model, the better the model should perform. During training, we can use task-specific losses as a proxy for model performance on training samples, with lower task-specific losses implying better performance. For example, the task-specific loss of the efficient full model (a) in Fig. 1 is supposed to be minimal, and if the ablated model (b) is also efficient with respect to its restricted parameter space, the task-specific loss of the ablated model (d) is supposed to be no less than that of (b), because (d) ablates one more input neuron than (b).

Figure 1. An illustration of a full neural model (a) and its ablated models (b, c, and d), where a hidden neuron is ablated in (b), an input neuron is ablated in (c), and (d) additionally ablates another input neuron on top of (b). According to the comparison principle, if the full model (a) is an efficient model, the comparative relation between the task-specific losses obtained by these models should be (a) $\leq$ (b), (c), (d). If the ablated model (b) is also efficient in its parameter space, then their comparative relation can be further written as (a) $\leq$ (b) $\leq$ (d). Note that (b, c) and (c, d) are two non-comparable model pairs, because the ablated model (c) is not a submodel of (b) or (d), and vice versa.

Noting the gap between the ideal model and reality (Min et al., 2018; Zhong et al., 2021), we aim to ensure this necessity (comparison principle) during the training to improve the model’s utilization of neurons. Based on the natural comparison principle between models, we propose a cross-model comparative loss to train models without additional manual supervision. In general, the comparative loss is a ranking loss on top of multiple task-specific losses. First, these task-specific losses are derived from the full neural model and several comparable ablated models whose neurons are ablated to varying degrees. Next, the ranking loss is a pairwise hinge loss that penalizes models that have fewer ablated neurons but larger task-specific losses. Concretely, if a model with fewer ablated neurons acquires a larger task-specific loss than another model with more ablated neurons, then the difference between the task-specific losses of the pair will be taken into account in the final comparative loss; otherwise the pair complies with the comparison principle and does not incur any training loss. In this way, the comparative loss can drive the order of task-specific losses to be consistent with the order of the ablation degrees. Through theoretical derivation, we also show that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses, enabling adaptive assignment of weights depending on the performance of the full/ablated models.

The comparability among multiple ablated models is a fundamental prerequisite for comparative loss. As a counterexample, although the ablated model (c) in Fig. 1 ablates fewer neurons than (d), they are not comparable, so no comparative loss can be applied. To make the ablated models comparable with each other, we progressively ablate the models. The first ablated model is obtained by performing one ablation on the basis of the full model. If more ablated models are needed, in each subsequent ablation step we construct a new ablated model by performing a further ablation on top of the ablated model from the previous step, which makes the newly ablated model certainly a comparable submodel of the previous ones. We provide two alternative controlled ablation methods for each ablation step, called CmpDrop and CmpCrop. CmpDrop ablates hidden neurons using the dropout (Hinton et al., 2012) technique and is theoretically applicable to all dropout-compatible models, while CmpCrop ablates input neurons by cropping out extraneous context segments and is theoretically applicable to all tasks whose input context contains extraneous content.

We apply comparative loss with CmpDrop and/or CmpCrop on 14 datasets from 3 NLU tasks (text classification, question answering and query understanding) with distinct prediction types (classification, extraction and ranking) on top of 5 widely used pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019a; Clark et al., 2020; Lan et al., 2020; Beltagy et al., 2020). The empirical results demonstrate the effectiveness of comparative loss over state-of-the-art baselines, as well as the enhanced utility of parameters and context. Our analysis also confirms that comparative loss can indeed weight multiple task-specific losses more appropriately, as indicated by our derivation. By exploring different comparison strategies, we observe that comparing the models ablated by first CmpCrop and then CmpDrop brings the greatest improvement. Interestingly, we find that comparative loss is particularly effective for models with few parameters or long inputs. This may imply that comparative loss helps models with lower capacity fit larger or longer training samples better, whereas models with higher capacity can already fit the data with ease, so comparative loss helps them less. Moreover, we discover that different ablation methods have different effects on training, with CmpDrop helping the task-specific loss decrease faster to lower levels and CmpCrop alleviating overfitting to some extent.

The main contributions can be summarized as follows:

  • We propose comparative loss, a cross-model loss function based on the comparison principle between the full model and its ablated models, to improve the neuronal utility without additional human supervision.

  • We progressively ablate the models to make multiple ablated models comparable and present two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models.

  • We theoretically show how comparative loss works and empirically demonstrate its effectiveness through experiments on 3 distinct natural language understanding tasks. We release the code and processed data at https://github.com/zycdev/CmpLoss.

2. Preliminaries

Before introducing our cross-model comparative loss, we review some concepts and notations needed later. We first introduce the typical training method for a model, followed by formalizations of network pruning and context selection, two methods that can further improve model performance by removing inefficient input or hidden neurons. Finally, we elaborate on the concept of ablation, which recurs throughout the paper.

2.1. Conventional Training

Given a training dataset $\mathcal{D}$ for a specified task and a neural network $f$ parameterized by $\bm{\theta}\in\mathbb{R}^{|\bm{\theta}|}$, the training objective for each sample $(x,y)\in\mathcal{D}$ is to minimize the empirical risk

(1) $\mathcal{L}_{\mathrm{emp}}(x,y,\bm{\theta})=L(y,f(x;\bm{\theta})),$

where $x$ is the input context, $y$ in output space $\mathcal{Y}$ is the label, and $L:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{\geq 0}$ is the task-specific loss function, with $\mathbb{R}_{\geq 0}$ denoting the set of non-negative real numbers. In NLU tasks, $x$ is typically a sequence of words, while $y$ can be a single category label for classification (Pang et al., 2016; Wang et al., 2019; Xu et al., 2022), a pair of start and end boundaries for extraction (Rajpurkar et al., 2016; Yang et al., 2018; Pang et al., 2019), or a sequence of relevance levels for ranking (Niu et al., 2012; Nguyen et al., 2016; Zhu et al., 2020; Pang et al., 2020).

2.2. Network Pruning

After training a neural model $f(x;\bm{\theta})$, to reduce memory and computation requirements at test time, network pruning (Blalock et al., 2020) entails producing a smaller model $f(x;\bm{m}\odot\bm{\theta}^{\prime})$ with similar accuracy through post-hoc processing. Here $\bm{m}\in\{0,1\}^{|\bm{\theta}|}$ is a binary mask that fixes certain pruned parameters to 0 through the elementwise product $\odot$, and the parameter vector $\bm{\theta}^{\prime}$ may differ from $\bm{\theta}$ because $\bm{m}\odot\bm{\theta}^{\prime}$ is usually retrained from $\bm{m}\odot\bm{\theta}$ to fit the pruned network structure.

Although pruning is often viewed as a way to compress models, it has also been motivated by the desire to prevent overfitting. Pruning systematically removes redundant parameters and neurons that do not significantly contribute to performance and thus have much less prediction variance, which is reminiscent of dropout (Labach et al., 2019), another widely used technique to avoid overfitting. Similarly, dropout also uses a mask to disable a fraction (say $p\%$) of parameters or neurons. The significant difference, though, is that the mask $\bm{m}$ in dropout is randomly sampled from a $\mathrm{Bernoulli}(1-p\%)$ distribution, rather than deterministically defined by a criterion (e.g., the bottom $p\%$ of parameters in magnitude should be masked) as in pruning. This in turn brings convenience: a model trained with dropout does not need to be retrained for a specific mask, because the model's neurons have already learned to adapt to the absence of some neurons during previous training.
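To make this contrast concrete, here is a small PyTorch sketch (a toy example on a flat parameter vector, not tied to any specific pruning library) of the two kinds of masks:

```python
import torch

p = 0.1                       # fraction of neurons/parameters to disable
theta = torch.randn(20)       # toy parameter vector

# Pruning: a deterministic mask derived from a criterion
# (here, the p% smallest-magnitude parameters are masked).
k = max(1, int(p * theta.numel()))
threshold = theta.abs().kthvalue(k).values
prune_mask = (theta.abs() > threshold).float()

# Dropout: a stochastic mask resampled from Bernoulli(1 - p) at every training step.
drop_mask = torch.bernoulli(torch.full_like(theta, 1 - p))

pruned_theta = prune_mask * theta    # fixed mask; the pruned model is usually retrained
dropped_theta = drop_mask * theta    # the model learns to tolerate many random masks
```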

2.3. Context Selection

To eliminate the noisy content in the input context $x$ and further improve model performance, context selection selectively crops out a condensed context $x^{\prime}\sqsubseteq x$ to produce the final model prediction. In general, the model requires specialized training to fit the selected context. Therefore, context selection is pre-hoc processing relative to training, requiring the noise to be removed from the training samples in advance. With a slight abuse of notation, here we use $x^{\prime}\sqsubseteq x$ to denote that $x^{\prime}$ is a condensed subsequence (possibly equal) of $x$. In general, $x^{\prime}$ is an ordered combination of segments of $x$, where the segments are usually at the sentence (Min et al., 2018; Pang et al., 2021), chunk (Zheng et al., 2020), paragraph (Clark and Gardner, 2018), or document (Tu et al., 2020; Dua et al., 2021) granularity. It is worth noting that the selector for segment selection generally requires additional supervised training and needs to be run before prediction, which introduces extra computation overhead.

2.4. Ablation

To assess the contribution of certain components to the overall model, ablation studies investigate model behavior by removing or replacing these components in a controlled setting (Cohen and Howe, 1988). Here, in the context of machine learning, “ablation” refers to the removal of components of the model, which is an analogy to ablative brain surgery (removal of components of an organism) in biology (Meyes et al., 2019). We refer to the model after component removal as the “ablated model”, which should continue to work. However, if the removed components are responsible for performance improvement, the performance of the ablated model is expected to be worse (Frank et al., 2021).

In this paper, we use “ablation” to refer specifically to the removal of some neurons of a neural model, i.e., to set the output of some specific neurons to zero. From such a neuronal perspective, network pruning and context selection can be viewed as two kinds of ablation, the former removing some low-contributing hidden neurons after training and the latter removing some low-information input neurons before training. However, in contrast to ablation studies that aim to investigate the role of the ablated neurons, we aim to learn to improve the utility of the ablated neurons.

3. Methodology

The primary motivation of this work is to inherently improve the utility of neurons in NLU models through a cross-model training objective, rather than post-hoc network pruning or pre-hoc context selection to eliminate inefficient neurons. In the following, we first describe a comparison principle. Then, we propose a novel comparative loss based on the corollary of the comparison principle and present how to train models with comparative loss by two controlled ablation methods. Finally, we discuss how comparative loss works.

3.1. Comparison Principle

For an efficient model, we believe that all its neurons should be able to work together efficiently to maximize the utility of each neuron. This means that each neuron should contribute to the overall model, or at least be harmless, because the cooperation of neurons is supposed to eliminate the negative effects of noise that may be produced by individual neurons. Thus, if we ablate some neurons, even those that produce noise, then due to the missing contribution of the ablated neurons, the ablated submodel should perform no better than the original full model; in other words, its task-specific loss should be no smaller than the original one.

Formally, we define a neural model as an efficient model if and only if it performs no worse than any of its ablated models, and we formalize the comparison principle between an efficient model and its ablated models as follows.

Comparison Principle.

Suppose $f(x;\bm{\theta})$ is an efficient neural model for the input $x$ with respect to the parameter space $\mathbb{R}^{|\bm{\theta}|}$. Let $x^{\prime}\sqsubset x$ be the ablated input and $\bm{\theta}^{\prime}=\bm{m}\odot\bm{\theta}$ be the ablated parameters. Then, for any subsequence $x^{\prime}$ of $x$ whose label is still $y$, the input-ablated model $f(x^{\prime};\bm{\theta})$ should not perform better than the full model $f(x;\bm{\theta})$, and for any parameters $\bm{\theta}^{\prime}$ masked by an arbitrary $\bm{m}$, the parameter-ablated model $f(x;\bm{\theta}^{\prime})$ should not perform better than $f(x;\bm{\theta})$, i.e.,

(2) $L(y,f(x;\bm{\theta}))\leq L(y,f(x^{\prime};\bm{\theta})),\ \forall x^{\prime}\sqsubset x \text{ with } g(x^{\prime})=y,$
(3) $L(y,f(x;\bm{\theta}))\leq L(y,f(x;\bm{\theta}^{\prime})),\ \forall \bm{\theta}^{\prime}=\bm{m}\odot\bm{\theta} \text{ with } \bm{m}\in\{0,1\}^{|\bm{\theta}|},$

where $g(x^{\prime})$ denotes the ground-truth output of $x^{\prime}$.

In the above definition, we consider that an efficient neural model should be both input-efficient and parameter-efficient. In particular, the input-efficient property refers to the model's ability to efficiently utilize the input neurons (words): if the model $f(x;\bm{\theta})$ satisfies Eq. (2), we say that $f(\cdot;\bm{\theta})$ is input-efficient for the input $x$. The parameter-efficient property refers to the model's ability to utilize the hidden neurons efficiently: if the model $f(x;\bm{\theta})$ satisfies Eq. (3), we say $f(x;\bm{\theta})$ is parameter-efficient for the input $x$ with respect to the parameter space $\mathbb{R}^{|\bm{\theta}|}$. According to Eq. (3), we can definitely find at least one vector $\bm{\theta}$ that is parameter-efficient for the input $x$, e.g., the zero vector or the optimal parameter vector that minimizes the empirical risk. If the parameter space is large enough, from those vectors that are parameter-efficient for $x$, we can find some parameters that simultaneously satisfy Eq. (2), i.e., the input-efficient property. That is, if the parameter space $\mathbb{R}^{|\bm{\theta}|}$ is large enough, there exists at least one parameter vector $\bm{\theta}$ that makes the model $f(x;\bm{\theta})$ efficient for $x$. In particular, if all activation functions in the neural model $f$ output zero for zero inputs, then $f(x^{\prime};\bm{0})=f(x;\bm{0})$ for all $x^{\prime}\sqsubset x$, and hence the parameter vector $\bm{\theta}=\bm{0}$ is efficient for any $x$.

Notably, we restrict the ablated input $x^{\prime}$ for comparison to only those subsequences whose ground-truth output $g(x^{\prime})$ remains unchanged, i.e., $g(x^{\prime})=g(x)=y$. This is because ablation may remove some key information from the original input $x$, such as the trigger words in classification, resulting in an unknown change in the label $y$. In this unusual case, $L(y,f(x^{\prime};\bm{\theta}))$ will no longer be a reasonable proxy for the performance of the input-ablated model, so it makes no sense to compare it to the task-specific loss of the original model. For example, in binary classification ($y\in\{0,1\}$), suppose that for the original input $x$, $f(x;\bm{\theta})$ predicts the correct category $y$ with low confidence, whereas for the ablated input $x^{\prime}$ whose category label changes to $g(x^{\prime})=1-y$, $f(x^{\prime};\bm{\theta})$ predicts $1-y$ with high confidence. Even though $L(y,f(x;\bm{\theta}))\leq L(y,f(x^{\prime};\bm{\theta}))$, the input-ablated model $f(x^{\prime};\bm{\theta})$ actually outperforms the original $f(x;\bm{\theta})$, i.e., $L(y,f(x;\bm{\theta}))\geq L(1-y,f(x^{\prime};\bm{\theta}))$, so we cannot consider $f(\cdot;\bm{\theta})$ to be input-efficient for $x$. Although we could use $L(g(x^{\prime}),f(x^{\prime};\bm{\theta}))$ as a performance proxy for the input-ablated model in Eq. (2), in practice it is difficult to know how the labels of the ablated inputs will change, so we try to avoid such label-changing scenarios. For the sake of concision, from here on we assume by default that ablation of the input context does not change the output label unless otherwise specified, i.e., $g(x^{\prime})=y$.

Further, we can ablate a full model $f(x^{(0)};\bm{\theta}^{(0)})$ multiple ($c$) times, but are these ablated models $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=1}^{c}$ comparable to each other? The comparison principle only points out the comparative relation between an efficient model and any of its ablated models, and cannot be directly applied to multiple ablated models. However, if we assume that these ablated models are constructed step by step, i.e., each ablated model $f(x^{(j)};\bm{\theta}^{(j)})$ is obtained by progressively ablating the input ($x^{(j)}\sqsubset x^{(j-1)}$) or parameters ($\bm{\theta}^{(j)}=\bm{m}^{(j)}\odot\bm{\theta}^{(j-1)}$) based on its previous model $f(x^{(j-1)};\bm{\theta}^{(j-1)})$, then $f(x^{(j)};\bm{\theta}^{(j)})$ can be considered as an ablated model of all its ancestor models $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=0}^{j-1}$. For simplicity, we abbreviate their task-specific losses as $l^{(i)}=L(y,f(x^{(i)};\bm{\theta}^{(i)}))$. Furthermore, if all $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=0}^{c-1}$ are simultaneously assumed to be efficient with respect to their parameter spaces $\mathbb{R}^{\|\bm{m}^{(i)}\|_{0}}$ (the number of non-zeros, i.e., the L0 norm, of the mask determines the number of available parameters), we can apply the comparison principle to compare the task-specific losses of any two models, i.e., $l^{(i)}\leq l^{(j)},\ \forall i<j$.
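As a tiny illustration of the progressive (nested) parameter ablation just described, the following toy PyTorch snippet (toy tensors, not real model weights) shows that masking the survivors of the previous step makes the later parameters an ablation of every ancestor:

```python
import torch

# Progressive parameter ablation: each step masks the survivors of the previous
# step, so theta^(2) is an ablated version of both theta^(1) and theta^(0).
theta0 = torch.randn(6)
m1 = torch.tensor([1., 1., 0., 1., 1., 1.])   # first ablation step
m2 = torch.tensor([1., 0., 1., 1., 1., 0.])   # applied on top of m1's survivors
theta1 = m1 * theta0
theta2 = m2 * theta1                          # == (m2 * m1) * theta0
assert torch.allclose(theta2, (m2 * m1) * theta0)
```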

Formally, we define an efficient model to be hereditarily efficient if and only if its ablated models are all efficient. Similarly, if the parameter-ablated models of a parameter-efficient model are all parameter-efficient, we call this parameter-efficient model hereditarily parameter-efficient; the hereditarily input-efficient model is defined in the same way. In particular, the parameter vector $\bm{\theta}=\bm{0}$ is hereditarily parameter-efficient, and $f(x;\bm{0})$ is also hereditarily efficient for any $x$ if all activation functions of $f$ output zero for zero inputs.

Based on the definition of the hereditarily efficient model and the comparison principle, we can draw the following corollary.

Corollary 3.1.

Suppose $f(x^{(0)};\bm{\theta}^{(0)})$ is a hereditarily efficient neural model for the input $x^{(0)}$ with respect to the parameter space $\mathbb{R}^{|\bm{\theta}^{(0)}|}$, and let $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=1}^{c}$ be its multiple progressively ablated models, where $x^{(i)}\sqsubset x^{(i-1)}$ or $\bm{\theta}^{(i)}=\bm{m}^{(i)}\odot\bm{\theta}^{(i-1)}$. Then, all $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=1}^{c-1}$ are also efficient models with respect to their parameter spaces $\mathbb{R}^{\|\bm{m}^{(i)}\|_{0}}$, and their task-specific losses should be monotonically non-decreasing with the degree of ablation, i.e.,

(4) $l^{(0)}\leq l^{(1)}\leq\cdots\leq l^{(i)}\leq\cdots\leq l^{(c)}.$

In brief, the corollary describes a desirable transitive comparison between a hereditarily efficient neural model and its ablated models, i.e., the less ablation, the better the performance. Unfortunately, this natural property has been largely ignored before, which motivates us to exploit it to train models that utilize neurons more efficiently.

3.2. Comparative Loss

Figure 2. The Venn diagram for some of the concepts in this paper. The empirical risk minimized (ERM) region refers to models that minimize Eq. (1), which is a subset of the parameter-efficient region (satisfying Eq. (3)). The efficient model (intersecting purple region) in the comparison principle, in addition to being parameter-efficient, also needs to be input-efficient (satisfying Eq. (2)). The hereditarily efficient model requires not only the full model to be efficient, but also any of its ablated models to be efficient, i.e., satisfying Eq. (4) in Corollary 3.1. The training objective of the comparative loss Eq. (5) is both hereditarily efficient and ERM, i.e., the central overlapping grid region.

Based on Corollary 3.1, we can train a hereditarily efficient model with the objective of the ordered comparative relation in Eq. (4). To measure the deviation from this desirable order, we can use a pairwise hinge loss (Herbrich et al., 2000) to evaluate the ranking of the task-specific losses of the full model and its ablated models, e.g., $\sum_{i=0}^{c-1}\sum_{j=i+1}^{c}\max(0,l^{(i)}-l^{(j)})$. However, optimizing this ranking loss alone cannot guarantee that these task-specific losses are minimized, i.e., the full/ablated models may not be empirical risk minimized (ERM) (Vapnik, 1991) with respect to their parameter spaces. To push these models toward ERM, we introduce a special scalar $b$ as the baseline value of the task-specific loss and assume that it is derived from a dummy ablated model $f(x^{(c+1)};\bm{\theta}^{(c+1)})$. The dummy model is set to have the highest degree of ablation, and in principle its task-specific loss $l^{(c+1)}$ should be the highest. However, to push the task-specific losses of the real models $\{f(x^{(i)};\bm{\theta}^{(i)})\}_{i=0}^{c}$ down, we usually set $l^{(c+1)}=b$ to a small value (e.g., 0) and expect all $\{l^{(i)}\}_{i=0}^{c}$ to be reduced toward this target. In this way, our comparative loss can still be written as a pairwise ranking loss, except that it is defined on top of the $c+2$ task-specific losses,

(5) $\mathcal{L}_{\mathrm{cmp}}(x,y,\bm{\theta})=\sum_{i=0}^{c}\sum_{j=i+1}^{c+1}\max\big(0,\,l^{(i)}-l^{(j)}\big).$

Fig. 2 visualizes the localization (central grid region) of the ideal model of comparative loss, which is both ERM and hereditarily efficient. The hereditarily efficient models are a subset of the efficient models, and the efficient models are the intersection of the input-efficient and parameter-efficient ones. In this light, comparative loss sets a stricter training objective than ERM. When we set $c$ and $b$ to 0, Eq. (5) degenerates to Eq. (1). Further, the comparative loss is equivalent to

$\sum_{l^{(i)}>b}l^{(i)}+\sum_{i=0}^{c-1}\sum_{j=i+1}^{c}\max\big(0,\,l^{(i)}-l^{(j)}\big),$

where the first term is to minimize the empirical risk of those not reaching the target bb, and the second term constrains the comparative relation to pursue the full model being hereditarily efficient.
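To make Eq. (5) concrete, here is a minimal PyTorch-style sketch of the comparative loss, assuming the task-specific losses of the full and progressively ablated models are already computed as scalar tensors (this is an illustrative sketch, not the authors' released implementation):

```python
import torch

def comparative_loss(task_losses, b=0.0):
    """Pairwise hinge loss over the task-specific losses of the full model
    (index 0) and its progressively ablated models (indices 1..c), with a
    dummy baseline loss b appended as index c+1, as in Eq. (5)."""
    losses = list(task_losses) + [task_losses[0].new_tensor(b)]
    total = 0.0
    for i in range(len(losses) - 1):
        for j in range(i + 1, len(losses)):
            # Penalize a less-ablated model whose loss exceeds a more-ablated one.
            total = total + torch.clamp(losses[i] - losses[j], min=0)
    return total
```

With $c=0$ and $b=0$, the double sum reduces to $\max(0, l^{(0)})=l^{(0)}$, matching the degeneration to Eq. (1) noted above.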

Figure 3. The overview of comparative loss (best viewed in color). Given a data sample $(x,y)$, conventional training typically feeds the input context $x$ into the neural model to obtain the prediction $y^{(0)}$ and then just minimizes the task-specific loss $l^{(0)}$. In contrast, comparative loss not only progressively ablates the original model to minimize multiple task-specific losses $\{l^{(i)}\}_{i=0}^{c}$, but also constrains their comparative relation with a pairwise hinge loss.

To train using comparative loss, we first need to obtain several comparable ablated models and their task-specific losses. As shown in Fig. 3, we consider the original model with the entire context as input to be the full model $f(x^{(0)};\bm{\theta}^{(0)})$. According to Corollary 3.1, we progressively perform $c$-step ablation based on the full model. At the $i$-th ($1\leq i\leq c$) ablation step, we use CmpCrop or CmpDrop to ablate a small portion of the input or hidden neurons of the model $f(x^{(i-1)};\bm{\theta}^{(i-1)})$ from the previous step, which makes the newly ablated model $f(x^{(i)};\bm{\theta}^{(i)})$ comparable to all its ancestor models. After all these models have made a prediction, we have $c+1$ comparable task-specific losses. Together with $l^{(c+1)}=b$ from the dummy ablated model, we can calculate the final loss using Eq. (5). Using stochastic gradient descent optimization as an example, Algorithm 1 illustrates the training process more formally.

Algorithm 1 Training with Comparative Loss
Require: Training dataset $\mathcal{D}$, number of ablation steps $c$, dropout rate $p$, baseline value of the task-specific loss $b$, learning rate $\eta$.
Ensure: model parameters $\bm{\theta}$.
1: Randomly initialize model parameters $\bm{\theta}$
2: while not converged do
3:   randomly sample a data pair $(x,y)\sim\mathcal{D}$
4:   $x^{(0)}\leftarrow x$, $\bm{\theta}^{(0)}\leftarrow\bm{\theta}$
5:   $l^{(0)}\leftarrow L(y,f(x^{(0)};\bm{\theta}^{(0)}))$
6:   for $i\leftarrow 1$ to $c$ do
7:     if ablate hidden neurons then ▷ ablate model parameters
8:       $\bm{\theta}^{(i)}\leftarrow\mathrm{CmpDrop}(\bm{\theta}^{(i-1)},p)$
9:       $x^{(i)}\leftarrow x^{(i-1)}$
10:    else ▷ ablate input context
11:      $x^{(i)}\leftarrow\mathrm{CmpCrop}(x^{(i-1)})$
12:      $\bm{\theta}^{(i)}\leftarrow\bm{\theta}^{(i-1)}$
13:    end if
14:    calculate the task-specific loss: $l^{(i)}\leftarrow L(y,f(x^{(i)};\bm{\theta}^{(i)}))$
15:  end for
16:  set the dummy model's task-specific loss: $l^{(c+1)}\leftarrow b$
17:  calculate the comparative loss $\mathcal{L}_{\mathrm{cmp}}(x,y,\bm{\theta})$ by Eq. (5)
18:  update parameters: $\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla_{\bm{\theta}}\mathcal{L}_{\mathrm{cmp}}(x,y,\bm{\theta})$
19: end while
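As a complement to Algorithm 1, below is a minimal PyTorch-style sketch of one training step. Here `cmp_drop` and `cmp_crop` are hypothetical helpers standing in for CmpDrop (§3.2.1) and CmpCrop (§3.2.2), and `comparative_loss` is the pairwise-hinge sketch shown earlier; the real ablation methods are described in the following subsections.

```python
def training_step(model, x, y, task_loss_fn, c=2, p=0.1, b=0.0,
                  ablate_hidden=(True, True)):
    """One step of Algorithm 1 (sketch). `cmp_drop` and `cmp_crop` are
    hypothetical helpers for CmpDrop and CmpCrop."""
    losses = [task_loss_fn(y, model(x))]      # l^(0): full model on full context
    x_i, n_drop = x, 0
    for i in range(c):
        if ablate_hidden[i]:                  # CmpDrop: ablate model parameters
            n_drop += 1
            # same seed each step -> effective dropout rate 1 - (1 - p)^n
            pred = cmp_drop(model, x_i, p=p, n=n_drop)
        else:                                 # CmpCrop: ablate input context
            x_i = cmp_crop(x_i)
            pred = model(x_i)
        losses.append(task_loss_fn(y, pred))  # l^(i)
    return comparative_loss(losses, b=b)      # l^(c+1) = b is appended inside
```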

CmpDrop and CmpCrop in Algorithm 1 are the two alternative ablation methods we present for each ablation step, the former ablating the parameters (hidden neurons) and the latter ablating the input context (input neurons). Both randomly ablate neurons in a controlled manner on top of the previous model, which allows coverage of all potential ablated models without retraining each ablated model. This is because the randomly ablated models are jointly trained and adapt to the absence of some neurons during the training process. Which one to use at each ablation step can be chosen according to the model and task dataset. Ideally, CmpDrop can be used as long as the model is dropout-compatible, and CmpCrop can be used as long as the input context of the task contains dispensable segments. Below we introduce CmpDrop and CmpCrop in detail.

3.2.1. CmpDrop: Ablate Parameters by Dropout

Dropout randomly disables each neuron with probability $p$, which coincides with our need to randomly ablate hidden neurons. To obtain a model $f(\cdot;\bm{\theta}^{(i)})$ with more ablated parameters, instead of simply applying a larger dropout rate to the original model $f(\cdot;\bm{\theta}^{(0)})$, we consistently ablate the surviving neurons from the previous ablated model $f(\cdot;\bm{\theta}^{(i-1)})$ with probability $p$. Specifically, the output values of the dropped neurons are set to zero, and the output values of the surviving neurons are scaled by $1/(1-p)$ to ensure consistency with the expected output value of a neuron across all full/ablated models (Labach et al., 2019). This is equivalent to applying a mask with scaling $\bm{m}^{(i)}\in\{0,1,1/(1-p)\}^{|\bm{\theta}|}$ to the previous parameters $\bm{\theta}^{(i-1)}$ to obtain the ablated parameters $\bm{\theta}^{(i)}=\bm{m}^{(i)}\odot\bm{\theta}^{(i-1)}$. (Slightly different from the binary mask in the comparison principle, we incorporate the scaling factors into the mask so that the parameter ablation can still be expressed concisely as $\bm{\theta}^{(i)}=\bm{m}^{(i)}\odot\bm{\theta}^{(i-1)}$.) Each element in $\bm{m}^{(i)}$ corresponds to the scaling factor of each parameter,

$$m_{k}^{(i)}=\begin{cases}1,&\text{if no dropout in $\theta_{k}$'s layer;}\\ \frac{1}{1-p},&\text{if the neurons at both ends of $\theta_{k}$ survive;}\\ 0,&\text{otherwise.}\end{cases}$$

For the third case in the above equation, once a neuron is newly ablated, all connection parameters from and to it are set to zero; in addition, parameters that have been ablated by $\bm{m}^{(i-1)}$ are still set to zero.

In practice, we can leverage the existing dropout to implement it. However, for the comparability of the ablated models, we must use the same random seed and the same state of the random number generator in each CmpDrop. In this way, assuming that the current ablation step is the $n$-th execution of CmpDrop, we can simply run the model with a dropout rate of $1-(1-p)^{n}$.
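For illustration, here is a minimal PyTorch-style sketch of this trick; `set_dropout_rate` is a small helper written for this sketch, and the nesting of masks across steps relies on the same-seed construction described above rather than on any guarantee of a specific dropout implementation.

```python
import torch
import torch.nn as nn

def set_dropout_rate(model: nn.Module, rate: float) -> None:
    # Update every dropout module so the whole network uses the same rate.
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = rate

def cmp_drop_forward(model: nn.Module, inputs: dict, p: float, n: int, seed: int):
    """n-th CmpDrop step (sketch): reset the generator state to the same seed
    and run the model with the effective dropout rate 1 - (1 - p)**n, so that
    each step reuses the same underlying random draws as the previous ones."""
    torch.manual_seed(seed)
    set_dropout_rate(model, 1 - (1 - p) ** n)
    model.train()  # dropout must stay active
    return model(**inputs)
```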

3.2.2. CmpCrop: Ablate Input by Cropping

Given an input context $x$, CmpCrop aims to crop out a condensed context $x^{\prime}$ that does not change the original ground-truth output, i.e., $x^{\prime}\sqsubset x$ and $g(x^{\prime})=g(x)=y$. Assume that we know the minimum support context $x^{\star}$ of $x$ at training time, i.e., $g(x^{\prime})=g(x^{\star})$ for all $x^{\prime}\sqsupseteq x^{\star}$ and $g(x^{\prime})\neq g(x^{\star})$ for all $x^{\prime}\sqsubset x^{\star}$. Then, CmpCrop can produce a streamlined context by randomly cropping out several insignificant segments from the non-support context $x\setminus x^{\star}$. In this way, the trimmed streamlined context is guaranteed to contain the minimum support context, so the ground-truth output does not change.

In practice, to use CmpCrop, we must ensure that enough insignificant segments are set aside in the original context $x^{(0)}$ for cropping. The segments can be of document, paragraph, or sentence granularity. For example, in question answering, an insignificant segment can be any retrieved paragraph that does not affect the answer to the question. If the dataset does not annotate the minimum support context, we can manually inject a few extraneous noise segments into $x^{(0)}$.
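A minimal sketch of this cropping step, assuming the context is already split into segments and the indices of the minimum support context are known (the function name and signature are illustrative, not the authors' code):

```python
import random

def cmp_crop(segments, support_ids, max_remove=1):
    """CmpCrop sketch: randomly drop a few segments that are not part of the
    minimum support context, so the ground-truth label cannot change.
    `segments` is a list of context segments (e.g., paragraphs) and
    `support_ids` contains the indices of the minimum support context."""
    removable = [i for i in range(len(segments)) if i not in support_ids]
    if not removable:
        return segments  # nothing can be cropped safely
    n_remove = random.randint(1, min(max_remove, len(removable)))
    removed = set(random.sample(removable, n_remove))
    return [seg for i, seg in enumerate(segments) if i not in removed]
```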

3.3. Discussion

Further deriving Eq. (5), we find that comparative loss can be viewed as a dynamic weighting of multiple task-specific losses. In particular, the loss can be rewritten as follows,

(6)
$\mathcal{L}_{\mathrm{cmp}} = \sum_{i=0}^{c}\sum_{j=i+1}^{c+1}\max\big(0,\,l^{(i)}-l^{(j)}\big)$
$= \sum_{i=0}^{c}\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(i)} - \sum_{i=0}^{c}\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(j)}$
$= \sum_{i=0}^{c+1}\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(i)} - \sum_{j=0}^{c}\sum_{i=j+1}^{c+1}\mathbbm{1}_{l^{(j)}>l^{(i)}}\cdot l^{(i)}$
$= \sum_{i=0}^{c+1}\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(i)} - \sum_{i=1}^{c+1}\sum_{j=0}^{i-1}\mathbbm{1}_{l^{(j)}>l^{(i)}}\cdot l^{(i)}$
$= \sum_{i=0}^{c+1}\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(i)} - \sum_{i=0}^{c+1}\sum_{j=0}^{i-1}\mathbbm{1}_{l^{(j)}>l^{(i)}}\cdot l^{(i)}$
$= \sum_{i=0}^{c+1}\Big[\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}}\cdot l^{(i)} - \sum_{j=0}^{i-1}\mathbbm{1}_{l^{(j)}>l^{(i)}}\cdot l^{(i)}\Big]$
$= \sum_{i=0}^{c+1}l^{(i)}\cdot\Big[\sum_{j=i+1}^{c+1}\mathbbm{1}_{l^{(i)}>l^{(j)}} - \sum_{j=0}^{i-1}\mathbbm{1}_{l^{(i)}<l^{(j)}}\Big]$
$= \sum_{i=0}^{c+1}\sum_{j\neq i}\mathrm{CMP}(i,j,l^{(i)},l^{(j)})\cdot l^{(i)},$

where $\mathbbm{1}_{C}$ is an indicator function equal to 1 if condition $C$ is true and 0 otherwise, and the $\mathrm{CMP}$ function determines whether model $f(x^{(i)};\bm{\theta}^{(i)})$ complies with the comparison principle compared to $f(x^{(j)};\bm{\theta}^{(j)})$ and adjusts the weight of $l^{(i)}$ accordingly. There are two cases of non-compliance: when $f(x^{(i)};\bm{\theta}^{(i)})$ is less ablated ($i<j$) but obtains a larger loss, we increase the weight of $l^{(i)}$; when $f(x^{(i)};\bm{\theta}^{(i)})$ is more ablated ($i>j$) but obtains a smaller loss, we decrease the weight of $l^{(i)}$. Formally, the $\mathrm{CMP}$ function can be written as

$$\mathrm{CMP}(i,j,l^{(i)},l^{(j)})=\begin{cases}1,&\text{if $i<j$ and $l^{(i)}>l^{(j)}$;}\\ -1,&\text{if $i>j$ and $l^{(i)}<l^{(j)}$;}\\ 0,&\text{otherwise.}\end{cases}$$

Here we can notice that for a pair of models that do not conform to the comparison principle, we increase (+1) the weight of the task-specific loss of the model that is ablated less and equally decrease (-1) the weight of the loss of the model that is ablated more. Thus, letting $\alpha^{(i)}=\sum_{j\neq i}\mathrm{CMP}(i,j,l^{(i)},l^{(j)})$ denote the weight of $l^{(i)}$, the sum of the weights of all task-specific losses (including the dummy one) is 0, i.e., $\sum_{i=0}^{c}\alpha^{(i)}=-\alpha^{(c+1)}=\sum_{i=0}^{c}\mathbbm{1}_{l^{(i)}>b}$. Since $l^{(c+1)}=b$ is a constant, Eq. (6) is also equivalent to $\sum_{i=0}^{c}\alpha^{(i)}l^{(i)}$, i.e., the total weight equals the number of task-specific losses worse than the virtual baseline $b$, and is adaptively assigned to the $c+1$ losses according to their performance. In this way, poorly performing full/ablated models will be optimized more heavily. We empirically compare other heuristic weighting strategies in §5.1.
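To make the dynamic-weighting view concrete, the following toy check (a sketch derived from Eqs. (5) and (6), not the authors' code) computes the weights $\alpha^{(i)}$ from the CMP function and verifies numerically that the weighted sum of losses matches the pairwise-hinge form:

```python
def cmp_weights(losses, b=0.0):
    """Weights alpha^(i) from the CMP function; the dummy loss b is appended
    internally as index c+1."""
    l = list(losses) + [b]
    n = len(l)
    alpha = [0] * n
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            if i < j and l[i] > l[j]:
                alpha[i] += 1   # less ablated but larger loss: up-weight
            elif i > j and l[i] < l[j]:
                alpha[i] -= 1   # more ablated but smaller loss: down-weight
    return alpha

# Toy check against the pairwise-hinge form of Eq. (5)/(6):
l = [0.7, 0.5, 0.9]            # l^(0), l^(1), l^(2)
full = l + [0.0]               # append the dummy loss b = 0
alpha = cmp_weights(l, b=0.0)
hinge = sum(max(0.0, full[i] - full[j])
            for i in range(len(full)) for j in range(i + 1, len(full)))
weighted = sum(a * v for a, v in zip(alpha, full))
assert abs(hinge - weighted) < 1e-9
```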

For parameter ablation, in addition to weighting each task-specific loss differentially, the comparative loss with CmpDrop can also differentially calculate the gradients of the parameters in different parts. According to Eq. (6), the comparative loss is equal to the sum of all differences of task-specific loss pairs that violate the comparison principle, i.e., $\sum_{i<j\,\land\,l^{(i)}>l^{(j)}}\big(l^{(i)}-l^{(j)}\big)$, so we can analyze the gradient of the comparative loss from the difference of each task-specific loss pair. For ease of illustration, we take the original model $f(x;\bm{\theta})$ and a model $f(x^{\prime};\bm{\theta}^{\prime})$ whose parameters have been ablated $n$ times as an example; other model pairs with different parameters are similar. Assume the original parameters $\bm{\theta}=(\bm{u},\bm{v},\bm{w})$ and the ablated parameters $\bm{\theta}^{\prime}=(\bm{u}^{\prime},\bm{v}^{\prime},\bm{w}^{\prime})=(\bm{u},\bm{v}/(1-p)^{n},\bm{0})$, where $\bm{u}$ is the parameters from the layers without dropout, $\bm{w}$ is the parameters ablated by $n$ applications of CmpDrop, and $\bm{v}^{\prime}$ is the scaled parameters surviving $n$ rounds of dropout. Then, if their task-specific losses violate the comparison principle, i.e., $l>l^{\prime}$, the gradient of the comparative loss contributed by this model pair is

$$\nabla_{\bm{\theta}}(l-l^{\prime})=\big(\nabla_{\bm{u}}(l-l^{\prime}),\ \nabla_{\bm{v}}(l-l^{\prime}),\ \nabla_{\bm{w}}l\big).$$

We can see that the gradient of the comparative loss with respect to $\bm{w}$ is the full task-specific gradient $\nabla_{\bm{w}}l$, rather than a difference of gradients as for the other parameters. This is intuitive: the model performs better after ablating $\bm{w}$, indicating that the current $\bm{w}$ is inefficient, so we need to focus on updating $\bm{w}$.

In addition to the dynamic weighting perspective, comparative loss can also be considered as an “inverse ablation study” during training. This is because, in contrast to ablation studies that determine the contribution of removed components during validation, comparative loss believes that the ablated neurons should contribute and optimizes parameters with this objective.

For training complexity, given a generally small number of comparisons $c$ (i.e., the number of ablation steps), the overhead of computing the final comparative loss is negligibly small, and the increased computation overhead per update step comes mainly from the multiple forward and backward propagations of the models. Specifically, the overhead of a training step using comparative loss is $1+c$ times that of conventional training for the same batch size. For inference complexity, however, models trained using comparative loss are the same as conventionally trained models at test time.

4. Experiments

To evaluate the effectiveness and generalizability of our approach for natural language understanding, we conduct experiments on 3 tasks with representative output types, including classification (8 datasets), extraction (2 datasets), and ranking (4 datasets). Among them, the classification task requires predicting a single category for a piece of text or a text pair, the extraction task requires predicting a pair of boundary positions to extract the span between the start and end boundaries, and the ranking task requires predicting a list of relevance levels to rank candidates. Specifically, the three distinct tasks are text classification (see §4.1), reading comprehension (extraction, see §4.2), and pseudo-relevance feedback (ranking, see §4.3), respectively. We evaluate the comparative loss with just CmpDrop on text classification and reading comprehension, the comparative loss with just CmpCrop on reading comprehension and pseudo-relevance feedback, and the comparative loss with both CmpDrop and CmpCrop on reading comprehension. For each task, we first introduce the datasets used, then present the implementation of our models as well as the baselines, and finally show the experimental results.

Before we start each experiment, we explain some common experimental settings. For the baseline value $b$ of the task-specific loss in Algorithm 1, we provide two setting options. One is to simply set $b=0$, which is equivalent to setting an unreachable target value for all full/ablated models and thus pushing their task-specific losses to decrease. However, this results in the exposure of all training data to the full model and may aggravate overfitting. Therefore, to reduce the number of times the full model is optimized, our second option is to set the baseline value to the task-specific loss of the full model, i.e., $b=l^{(0)}$. In this way, the full model is optimized only when it performs worse than one of its ablated models. In practice, we prefer setting $b=0$, and switch to $b=l^{(0)}$ if we find that the model is prone to overfitting on the dataset. For the dropout rate $p$ in each CmpDrop, we use the same setting as the baseline models, which is 0.1 in all our experiments. For other conventional training hyperparameters, such as batch size and learning rate, we also keep the same settings as the carefully tuned baseline models unless otherwise specified. We implement our models and the baseline models in PyTorch with HuggingFace Transformers (Wolf et al., 2020). All models are trained on Tesla V100 GPUs. In the text classification and reading comprehension tasks, we train each model with 5 random seeds; in the pseudo-relevance feedback task, we train all models with a fixed random seed of 42. In the tables, results presented as mean$_{\pm\text{standard deviation}}$ are computed over the evaluation results of the five random seeds; otherwise, we report the performance of the model trained with random seed 42. For convenience, in the tables we use ‘Cmp’ to represent the comparative loss and use ‘Drop’ and ‘Crop’ in parentheses to refer to CmpDrop and CmpCrop, respectively.

4.1. Classification: Application to Text Classification

Text classification is a fundamental task in natural language understanding, which aims to assign a predefined category to a piece or a group of text. In many text classification datasets, all segments of the input context seem to play an important role in determining the text category and there is almost no annotation of the minimal support context, so it is difficult for us to construct an input-ablated model by directly cropping the original input without changing the classification label. That is, doing so is likely to violate the constraint in the comparison principle that the label of the ablated input remains unchanged, and thus we cannot apply CmpCrop to this task. However, many current neural classification models use dropout during training, so in this task we only validate the comparative loss that uses just CmpDrop.

4.1.1. Datasets

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) is a collection of diverse natural language understanding tasks. Following (Devlin et al., 2019), we exclude the problematic WNLI set and conduct experiments on 8 datasets: (1) Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018) is a sentence pair classification task that aims to predict whether the second sentence is an entailment, contradiction, or neutral to the first one. (2) Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005) aims to predict whether two sentences in the pair are semantically equivalent. (3) Question Natural Language Inference (QNLI) (Wang et al., 2019) is a binary sentence pair classification task that aims to predict whether a sentence contains the correct answer to a question. (4) Quora Question Pairs (QQP) (Chen et al., 2018) is a binary sentence pair classification task that aims to predict whether two questions asked on Quora are semantically equivalent. (5) Recognizing Textual Entailment (RTE) (Bentivogli et al., 2009) is a binary entailment task similar to MNLI, but with much fewer training samples. (6) Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) is a binary sentence sentiment classification task consisting of sentences extracted from movie reviews. (7) The Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017) is a sentence pair classification task that aims to determine how semantically similar two sentences are. (8) The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) is a binary sentence classification task aimed at judging whether a single English sentence is linguistically acceptable.

4.1.2. Models & Training

Following R-Drop (liang et al., 2021), another state-of-the-art training method leveraging dropout, we validate our comparative loss on popular classification models based on PLMs (see the text classification example in HuggingFace Transformers: https://github.com/huggingface/transformers/blob/v4.19.2/examples/pytorch/text-classification/README.md). Specifically, we take BERTbase (Devlin et al., 2019), RoBERTabase (Liu et al., 2019a) and ALBERTbase (Lan et al., 2020) as our backbones for finetuning. The task-specific loss is mean squared error (MSE) for STS-B and cross-entropy for the other datasets. We use different training hyperparameters for each dataset. For the baseline models and our models trained with comparative loss, we independently select the learning rate within {1e-5, 2e-5, 3e-5, 4e-5}, the warmup rate within {0, 0.1}, the batch size within {16, 24, 32}, and the number of epochs from 2 to 5. For our models, we tune the number of ablation steps $c$ (i.e., the number of CmpDrop applications) from 1 to 4. Following the hyperparameter setup in R-Drop (liang et al., 2021), we also implement R-Drop for all backbone models to serve as a competitor, which performs dropout multiple times as CmpDrop does.

4.1.3. Results

Table 1. Classification performance on the development sets of GLUE language understanding benchmark.
Model MNLI MRPC QNLI QQP RTE SST-2 STS-B CoLA Average
BERTbase (Devlin et al., 2019) 84.2±0.3 85.9±0.5 91.0±0.1 91.0±0.1 68.2±1.7 92.2±0.2 88.9±1.0 61.9±1.1 82.92±0.14
+ R-Drop (liang et al., 2021) 84.3±0.2 86.5±0.5 91.6±0.1 91.4±0.1 68.8±1.2 92.2±0.3 89.4±0.8 62.9±0.7 83.38±0.14
+ Cmp ($c$ Drop) 84.8±0.2 87.1±1.0 91.9±0.2 91.5±0.1 70.5±1.2 93.1±0.4 89.7±0.6 63.4±1.1 83.96±0.31
RoBERTabase (Liu et al., 2019a) 87.4±0.3 89.9±0.6 92.8±0.2 91.4±0.1 76.5±0.9 95.1±0.2 90.8±0.1 62.7±0.6 85.82±0.13
+ R-Drop (liang et al., 2021) 87.6±0.1 90.0±0.6 92.9±0.1 91.5±0.0 78.5±1.5 95.4±0.2 91.1±0.1 64.0±0.5 86.37±0.26
+ Cmp ($c$ Drop) 88.0±0.1 90.5±0.5 93.3±0.1 91.9±0.1 79.3±1.2 95.5±0.2 91.3±0.1 65.4±0.8 86.91±0.20
ALBERTbase (Lan et al., 2020) 85.0±0.3 88.4±0.6 92.1±0.3 90.4±0.1 77.8±0.8 93.0±0.4 90.9±0.2 59.8±1.7 84.68±0.18
+ R-Drop (liang et al., 2021) 85.4±0.4 88.7±0.8 92.1±0.2 90.5±0.1 78.0±1.4 93.3±0.3 90.9±0.2 59.7±0.4 84.83±0.14
+ Cmp ($c$ Drop) 85.7±0.1 89.5±0.4 92.3±0.1 91.0±0.1 78.3±1.4 93.5±0.2 91.0±0.2 60.3±0.9 85.19±0.24

We present classification performance in Table 1, where the evaluation metrics are Pearson correlation for STS-B, Matthews correlation for CoLA, and accuracy for the others. For models based on BERTbase, we can see that our model (+ Cmp) comprehensively outperforms the well-tuned BERTbase baseline and achieves an average improvement of 1.04 points, which proves the effectiveness of comparative loss on classification tasks. Moreover, our model trained with comparative loss also outperforms the model trained with the state-of-the-art R-Drop by 0.58 points on average, which demonstrates the superiority of comparative loss. For models based on the more advanced RoBERTabase and ALBERTbase, we find consistent improvements. In addition, since ALBERT reuses parameters across multiple layers, it has the least room for improving parameter utilization, which is consistent with our observation that comparative loss brings the smallest boost to ALBERT.

4.2. Extraction: Application to Reading Comprehension

Extractive reading comprehension (RC) (Rajpurkar et al., 2016; Liu et al., 2019b) is an essential technical branch of question answering (QA) (Hirschman and Gaizauskas, 2001; Lin, 2002; Chen, 2018; Rodriguez and Boyd-Graber, 2021; Zhu et al., 2021). Given a question and a context, extractive RC aims to extract a span from the context as the predicted answer. Current dominant RC models basically use pretrained Transformer (Vaswani et al., 2017) architectures, which employ dropout in many layers during finetuning. This allows us to use CmpDrop to improve the utility of the model parameters. Additionally, the given context is usually lengthy and contains many distracting noise segments, which also allows us to use CmpCrop to improve the model’s utilization of the context by randomly deleting the labeled distracting paragraphs. Therefore, we intend to verify the effectiveness of comparative loss using CmpDrop or/and CmpCrop in this task.

4.2.1. Datasets

We evaluate the comparative loss using only CmpDrop on SQuAD (Rajpurkar et al., 2016), which contains 100K single-hop questions with 9,832 for validation, and HotpotQA (Yang et al., 2018), which contains 113K multi-hop questions with 7,405 for validation. For HotpotQA, we consider the distractor setting, where the context of each question contains 10 paragraphs, but only 2 of them are useful for answering the question; the remaining 8 are retrieved distracting paragraphs that are relevant but do not support the answer. This allows us to evaluate the comparative loss with CmpCrop on the HotpotQA distractor setting.

4.2.2. Models & Training

We follow simple but effective RC models based on PLMs (Devlin et al., 2019; Liu et al., 2019a; Clark et al., 2020; Lan et al., 2020; Beltagy et al., 2020), which take as input a concatenation of the question and the context and use a linear layer to predict the start and end positions of the answer. We use the cross-entropy of answer boundaries as the task-specific loss function following (Devlin et al., 2019) and use a learning rate warmup over the first 10% of steps. For SQuAD, we use the popular BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019a), ELECTRA (Clark et al., 2020) and ALBERT (Lan et al., 2020) with a maximum sequence length of 512 as backbones, all of which have successively achieved top rankings in multiple QA benchmarks (Rajpurkar et al., 2016, 2018; Yang et al., 2018). We first tune the learning rate in {1e-5, 3e-5, 5e-5, 8e-5, 1e-4, 2e-4}, the batch size in {8, 12, 32}, and the number of epochs in {1, 2, 3} for the baseline models. Then, setting $c=2$, we carry over these hyperparameters and train our models using the comparative loss with two CmpDrop steps. For HotpotQA, we use the state-of-the-art Longformer (Beltagy et al., 2020) with a maximum sequence length of 2048 as the backbone, which is fed with the format "<s> [YES] [NO] [Q] question </s> [T] title1 [P] paragraph1 ... [T] title10 [P] paragraph10 </s>". The special tokens [YES]/[NO], [Q], [T], and [P] represent yes/no answers and the beginning of questions, titles, and paragraphs, respectively. Similarly, we select the learning rate in {1e-5, 3e-5}, the batch size in {6, 9, 12}, and the number of epochs in {3, 5, 8} for the baseline model. We then train our models with three comparative losses respectively: the first two apply one CmpDrop or one CmpCrop ($c=1$), while the third applies one CmpCrop followed by one CmpDrop ($c=2$). Besides, inheriting the common hyperparameters and searching for the coefficient weight $\alpha$ in {0.1, 0.5, 1, 1.5}, we also implement R-Drop (liang et al., 2021) as a competitor to CmpDrop.

4.2.3. Results

Table 2. Question answering performance on the development sets of SQuAD and HotpotQA distractor. Results marked with † were obtained from the authors of the corresponding paper.
Model EM F1
SQuAD
BERTbase (Devlin et al., 2019) 80.8 88.5
BERTbase (our implementation) 81.3±0.2 88.5±0.1
+ R-Drop (liang et al., 2021) 82.2±0.1 89.1±0.1
+ Cmp (2 Drop) 82.3±0.2 89.3±0.1
RoBERTabase (Liu et al., 2019a) (our implementation) 85.8±0.1 92.2±0.1
+ R-Drop (liang et al., 2021) 86.4±0.1 92.3±0.1
+ Cmp (2 Drop) 86.5±0.2 92.6±0.1
ELECTRAbase (Clark et al., 2020) 84.5 90.8
ELECTRAbase (our implementation) 85.9±0.3 92.3±0.2
+ R-Drop (liang et al., 2021) 86.5±0.1 92.3±0.1
+ Cmp (2 Drop) 86.6±0.1 92.7±0.1
ALBERTbase (Lan et al., 2020) 82.3 89.3
ALBERTbase (our implementation) 83.6±0.2 90.6±0.1
+ R-Drop (liang et al., 2021) 83.7±0.2 90.7±0.2
+ Cmp (2 Drop) 84.4±0.1 91.0±0.1
HotpotQA
Longformerbase† (Beltagy et al., 2020) 60.3 74.3
Longformerbase (our implementation) 61.9±0.4 75.6±0.3
+ R-Drop (liang et al., 2021) 62.0±0.2 76.0±0.2
+ Cmp (1 Drop) 63.0±0.4 77.0±0.4
+ Cmp (1 Crop) 62.6±0.2 76.4±0.2
+ Cmp (1 Crop 1 Drop) 63.5±0.3 77.2±0.3

Since we focus on extraction here, we only measure the extracted answers using EM (exact match) and F1, which differs slightly from the official HotpotQA setting that additionally evaluates the identification of supporting facts. From Table 2 we can see that our implemented baseline models, trained directly with the task-specific loss in Eq. (1), largely achieve better results than those reported in their original papers. When trained with the comparative loss in Eq. (5) instead, our models still significantly outperform these well-tuned baseline models even without re-searching the training hyperparameters, demonstrating the effectiveness of comparative loss on the extraction task. Also, the consistent improvement across the different PLMs demonstrates the model-agnostic nature of comparative loss. Furthermore, from the results on HotpotQA we find that although both CmpDrop and CmpCrop deliver significant improvements, CmpCrop + CmpDrop achieves the best results, suggesting that CmpDrop and CmpCrop may bring different benefits to the trained models.

4.3. Ranking: Application to Pseudo-Relevance Feedback

Pseudo-relevance feedback (PRF) (Attar and Fraenkel, 1977) is an effective query understanding (Chang and Deng, 2020) technique to improve ranking accuracy, which aims to alleviate the mismatch of linguistic expressions between a query and its potentially relevant documents. Given an original query q and a document collection C, a base ranking model returns a ranked list D=(d_1,d_2,\cdots,d_{|D|}). Let D_{\leq k} denote the feedback set containing the top k documents, where k is usually referred to as the PRF depth. The goal of PRF is to reformulate the original query q into a new representation q^{(k)} using the query-relevant information in D_{\leq k}, i.e., q^{(k)}=f((q,D_{\leq k});\bm{\theta}), where q^{(k)} is expected to yield better ranking results. Although PRF methods do usually improve ranking performance on average (Clinchant and Gaussier, 2013), individual reformulated queries inevitably suffer from query drift (Mitra et al., 1998; Zighelnic and Kurland, 2008) due to the noise present in the feedback set, causing them to be inferior to the original ones. Therefore, we can use the comparative loss with CmpCrop to train PRF models to suppress this extra noise by comparing the effect of queries reformulated using feedback sets with different PRF depths.

4.3.1. Datasets

We conduct experiments on the MS MARCO passage (Nguyen et al., 2016) collection, which consists of 8.8M English passages collected from the search results of 1M real-world Bing queries. The Train set of MS MARCO contains 530K queries (about 1.1 relevant passages per query on average), the Dev set contains 6980 queries, and the online Eval set contains 6837 queries. Apart from these, we also consider TREC DL 2019 (Craswell et al., 2020), TREC DL 2020 (Craswell et al., 2021), and DL-HARD (Mackie et al., 2021), three offline evaluation benchmarks based on the MS MARCO passage collection, which contain 43, 54, and 50 queries with fine-grained relevance labels (grades from 0 to 3), respectively. Among them, DL-HARD (Mackie et al., 2021) is a recent evaluation benchmark focusing on complex queries. We use the MS MARCO Train set to train models, and evaluate trained models on the MS MARCO Dev set to tune hyperparameters and select model checkpoints. The selected models are finally evaluated on the online MS MARCO Eval (https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/) and the three other offline benchmarks.

4.3.2. Models & Training

We carry out PRF experiments on two base retrieval models, ANCE (Xiong et al., 2020) (dense retrieval) and uniCOIL (Lin and Ma, 2021) (sparse retrieval). For their PRF models, we do not explicitly modify the query text, but directly generate a new query vector for retrieval following the current state-of-the-art method ANCE-PRF (Yu et al., 2021). This allows us to directly optimize the retrieval of reformulated queries end-to-end with the negative log-likelihood of the positive document (Karpukhin et al., 2020) as the task-specific loss:

L(\bm{q}^{(k)})=-\log\frac{e^{\mathrm{sim}(\bm{q}^{(k)},\bm{d}^{+})}}{e^{\mathrm{sim}(\bm{q}^{(k)},\bm{d}^{+})}+\sum_{d^{-}\in D^{-}}e^{\mathrm{sim}(\bm{q}^{(k)},\bm{d}^{-})}},

where \bm{d}^{+} is the vector of a sampled document relevant to q and \bm{q}^{(k)}, \mathrm{sim}(\cdot,\cdot) is the dot product of two vectors, and D^{-} is the collection of negative documents for them. Since only the query vectors are updated (the fixed document vectors are restored from the index pre-built by https://github.com/castorini/pyserini), we mine a lite collection (5.3M for dense retrieval and 3.7M for sparse retrieval) containing the positive and hard negative documents of all training queries. In this way, for each query, all documents in the lite collection except its positive documents can be used as its D^{-}. In general, our PRF model consists of an encoder, a vector projector, and a pooler. First, the original query q and the feedback documents in D_{\leq k} are concatenated in order with [SEP] as the separator and fed to the encoder to obtain the contextual embedding of each token. Then, the projector maps the contextual embeddings to vectors with the same dimension as the document vectors. Finally, all token vectors are pooled into a single query vector. For dense retrieval, the encoder is initialized from ANCEFirstP (https://github.com/microsoft/ANCE), the projector is a linear layer, and the pooler applies a layer normalization on the first vector ([CLS]) in the sequence, as in previous work (Yu et al., 2021). For sparse retrieval, the encoder and projector are initialized from BERTbase with the masked language model head, where the projector is an MLP with GeLU (Hendrycks and Gimpel, 2016) activation and layer normalization, and the pooler is composed of a max pooling operation and an L2 normalization (we find that the L2 normalization helps the model train stably and does not change the relevance ranking of documents to a query). We finetune the PRF baseline models for up to 12 epochs with a batch size of 96, a learning rate selected from {2e-5, 1e-5, 5e-6}, and a PRF depth k randomly sampled from 0 to 5 for each query. We then finetune our PRF models using the comparative loss with one CmpCrop (c=1) for up to 6 epochs with a batch size of 48. In this way, the maximum number of training steps for our models remains the same as for the baseline models, i.e., up to 12 optimizations per original query. Due to the large cost of training with multiple random seeds, we use a paired t-test to assess the significance of differences in retrieval performance.
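As a reference implementation of this task-specific loss, the sketch below computes the negative log-likelihood of the positive document with dot-product similarity over a set of negatives. It is a generic rendering of the formula above rather than our exact training code, and the function and variable names are our own.

```python
import torch
import torch.nn.functional as F

def retrieval_nll_loss(query_vec, pos_doc_vec, neg_doc_vecs):
    """Negative log-likelihood of the positive document.

    query_vec:    (dim,)          reformulated query vector q^(k)
    pos_doc_vec:  (dim,)          vector of a sampled relevant document d+
    neg_doc_vecs: (num_neg, dim)  vectors of the negative documents D-
    Similarity is the dot product, matching sim(., .) in the formula.
    """
    pos_score = torch.dot(query_vec, pos_doc_vec).unsqueeze(0)  # (1,)
    neg_scores = neg_doc_vecs @ query_vec                       # (num_neg,)
    scores = torch.cat([pos_score, neg_scores])                 # (1 + num_neg,)
    # Negative log softmax probability of the positive document (index 0).
    return -F.log_softmax(scores, dim=0)[0]

# Toy usage with random vectors.
q, d_pos, d_negs = torch.randn(768), torch.randn(768), torch.randn(16, 768)
print(retrieval_nll_loss(q, d_pos, d_negs).item())
```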

4.3.3. Results

Table 3. Retrieval performance on benchmarks built on the MS MARCO passage collection. ANCE and uniCOIL are base retrieval models, + PRF denotes the PRF baseline model, + Cmp denotes our PRF model trained with the comparative loss of 1 CmpCrop, and the superscript (k) represents the PRF depth used during testing. Statistically significant improvements over the corresponding PRF baseline model (p \leq 0.1) are marked with a superscript.
Model MARCO Dev (NDCG@10 / MRR@10 / R@1K) MARCO Eval (MRR@10) TREC DL 2019 (NDCG@10 / R@1K) TREC DL 2020 (NDCG@10 / R@1K) DL-HARD (NDCG@10 / R@1K)
ANCE (Xiong et al., 2020) 38.76 33.01 95.84 31.70 64.76 75.70 64.58 77.64 33.39 76.65
+ PRF(3) (Yu et al., 2021) 40.10 34.40 95.90 33.00 68.10 79.10 69.50 81.50 36.50 76.10
   + Cmp(3) (1 Crop) 40.68 34.84 96.94 - 68.42 80.10 69.58 81.77 35.61 79.39
   + Cmp(5) (1 Crop) 41.01 35.14 97.03 34.17 69.58 80.81 70.44 82.77 37.44 79.55
uniCOIL (Lin and Ma, 2021) 41.21 35.13 95.81 34.42 70.09 82.83 67.35 84.42 35.96 76.85
+ PRF(3) 41.76 35.48 96.85 - 69.42 83.32 69.25 84.44 36.53 77.48
   + Cmp(3) (1 Crop) 42.02 35.75 96.91 35.14 70.10 83.58 69.70 84.51 36.90 77.67

We report the official metrics (MRR@10 for MARCO and NDCG@10 for the others) and Recall@1K of the models on multiple benchmarks in Table 3. In addition to reporting results for the best-performing PRF depths (numbers in superscript brackets), for a fair comparison with ANCE-PRF(3) (second row), we also present the results of ANCE-PRF + Cmp(3), both of which use the first 3 documents as feedback. We can see that the PRF baseline models (+ PRF) indeed generally outperform their base retrieval models, except that uniCOIL-PRF degrades by 0.67 percentage points in NDCG@10 on TREC DL 2019, which reflects the presence of query drift. Our PRF models (+ Cmp) trained with comparative loss, however, outperform their base retrieval models across the board. Under the same use of 3 feedback documents, our ANCE-PRF + Cmp also outperforms the published state-of-the-art ANCE-PRF (Yu et al., 2021) on all metrics except NDCG@10 on DL-HARD. Moreover, when 5 feedback documents are used, ANCE-PRF + Cmp pulls ahead of ANCE-PRF on NDCG@10 of DL-HARD as well. For sparse retrieval, our PRF model (+ Cmp) trained with comparative loss also surpasses the strong baseline uniCOIL-PRF implemented following ANCE-PRF. All of the above results demonstrate the effectiveness of comparative loss on the ranking task.

5. Analysis

In this section, we conduct several further experiments for a more thorough analysis. First, from the dynamic weighting perspective found in §3.3, we examine whether the adaptive weighting of comparative loss is more effective than other weighting strategies (§5.1). Next, we try several other comparison strategies to gain practical guidance on choosing the number of ablation steps and the ablation methods (§5.2). Then, to confirm the enhancement of comparative loss on the utility of hidden and input neurons, we investigate the performance of models with different numbers of parameters (§5.3) and context lengths (§5.4). Furthermore, we visualize the loss curves to examine the impact of the comparative losses with different ablation methods on the task-specific loss (§5.5). Finally, we detail the actual training overhead of comparative loss (§5.6).

5.1. Effect of Weighting Strategy

Table 4. QA performance on the development set of HotpotQA distractor with different weighting strategies. Cmp refers to Longformer + CmpCrop + CmpDrop that adaptively weights multiple task-specific losses through comparative loss. The others are heuristics, where AVERAGE assigns the same weights to all task-specific losses, FIRST and LAST assign weight only to the first or last, and MAX dynamically assigns weight only to the largest one.
Weighting Method EM F1
Cmp 63.5±0.3 77.2±0.3
AVERAGE 63.2±0.3 76.7±0.3
FIRST 62.1±0.1 75.8±0.1
LAST 61.9±0.4 75.6±0.3
MAX 63.1±0.3 76.7±0.3

To verify the role of comparative loss from the dynamic weighting perspective, we keep all the training settings of Longformer + CmpCrop + CmpDrop from the last row of Table 2 unchanged and replace only the weighting strategy over the task-specific losses with some heuristics. Table 4 shows their performance on the HotpotQA development set. AVERAGE, FIRST and LAST are three static weighting strategies. AVERAGE assigns equal weights to all task-specific losses, while FIRST and LAST assign weight to only the first or the last task-specific loss, respectively, i.e., FIRST optimizes l^{(0)} of the full model without dropout and LAST optimizes l^{(2)} of the model with the regular dropout rate p (equivalent to the baseline Longformer in Table 2). MAX is another dynamic weighting strategy that assigns weight to only the largest task-specific loss. We can see that the dynamic weighting in comparative loss is significantly better than these heuristic weighting strategies, which shows that comparative loss can assign weights more appropriately. In addition, AVERAGE is better than the latter three strategies that consider only one task-specific loss, indicating that it is beneficial to take multiple task-specific losses into account. Moreover, although the latter three all assign weight to only one task-specific loss, MAX is better than the other two, which indicates that dynamic assignment is better than static assignment.
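To make these strategies concrete, the sketch below shows how each heuristic collapses the ordered task-specific losses [l^{(0)}, ..., l^{(c)}] into a single objective; the comparative loss itself (Eq. (5), not reproduced here) instead derives the weights adaptively. The function name and toy values are our own illustration.

```python
import torch

def combine_losses(task_losses, strategy="AVERAGE"):
    """Combine the task-specific losses of the full and ablated models.

    task_losses: list [l0, l1, ..., lc] ordered from the full model (l0)
                 to the most-ablated model (lc), each a scalar tensor.
    The heuristics below mirror Table 4; comparative loss assigns
    these weights adaptively instead of using a fixed rule.
    """
    losses = torch.stack(task_losses)
    if strategy == "AVERAGE":   # equal weight on every task-specific loss
        return losses.mean()
    if strategy == "FIRST":     # optimize only the full (dropout-free) model
        return losses[0]
    if strategy == "LAST":      # optimize only the most-ablated model
        return losses[-1]
    if strategy == "MAX":       # dynamically optimize only the largest loss
        return losses.max()
    raise ValueError(f"unknown strategy: {strategy}")

# Toy usage with c = 2 ablation steps.
l0, l1, l2 = torch.tensor(0.8), torch.tensor(1.0), torch.tensor(1.3)
print(combine_losses([l0, l1, l2], strategy="MAX"))
```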

Notably, FIRST, which directly optimizes the full model, outperforms LAST, which is trained with dropout, suggesting that the inconsistency of dropout between the training and inference stages (Zolna et al., 2017) may indeed lead to underfitting of the full model. The fact that Cmp far outperforms FIRST and LAST indicates that comparative loss can automatically strike a balance between ensuring training-inference consistency and preventing overfitting.

5.2. Effect of Comparison Strategy

Table 5. QA performance on the development set of HotpotQA distractor with different comparison strategies. c is the number of ablation steps, ×2 indicates that an ablation method is repeated twice, and A + B means that A is applied followed by B.
cc Ablation Order EM F1
1 CmpDrop 63.1 77.0
CmpCrop 63.1 76.8
2 CmpDrop ×2 63.4 77.1
CmpCrop ×2 63.0 76.7
CmpDrop + CmpCrop 63.2 76.8
CmpCrop + CmpDrop 63.6 77.4

To study the impact of comparison strategies, i.e., how many ablation steps to use and which ablation method to choose at each step, we try a variety of comparison strategies on HotpotQA with different numbers of ablation steps and ablation orders. As shown in Table 5, the results are not significantly further improved when we repeat CmpDrop/CmpCrop twice, but they are further improved when we apply CmpCrop first and then CmpDrop. This indicates that comparing multiple models ablated by the same method, i.e., encouraging the model to be either hereditarily input-efficient or hereditarily parameter-efficient, seems to have little effect on the performance of the full model, whereas the successive use of two different ablation methods, i.e., encouraging the model to be efficient in both input and parameters, is helpful. However, applying CmpDrop followed by CmpCrop does not perform as well as applying CmpDrop only, suggesting that the order of the ablation methods matters and that ablation should perhaps follow the order of information flow in the model, i.e., cropping the input before dropping parameters.

Figure 4. Average results on eight GLUE datasets as the number of ablation steps changes.

To further examine the influence of the number of ablation steps c, we show in Fig. 4 the relationship between the model's Average metric over the eight GLUE datasets and the number of ablation steps. We find little difference in the average performance of the models trained with different numbers of CmpDrop steps; the model trained with one CmpDrop performs best mainly because its large advantage on two of the datasets pulls up the average. Therefore, unless there is an extreme demand for performance, we usually do not need to tune the hyperparameter c.

5.3. Effect of Model Parameters

Table 6. Evaluation results of baselines with different model sizes and initializations on the SQuAD development set (EM/F1), and relative gains of our models trained using comparative loss with CmpDrop over baselines.
# Parameter BERT ELECTRA RoBERTa
Baseline Gain (%) Baseline Gain (%) Baseline Gain (%)
Tiny: 4M 41.5±0.5/54.2±0.3 2.2±1.4/1.7±1.0 - - - -
Small: 14M 74.2±0.6/82.7±0.5 2.2±0.8/1.6±0.6 78.1±0.2/85.9±0.1 1.6±0.5/1.2±0.3 - -
Medium: 42M 78.0±0.4/85.8±0.2 1.6±0.7/1.3±0.4 - - - -
Base: 110M 81.3±0.2/88.5±0.1 1.2±0.2/0.9±0.2 85.9±0.3/92.3±0.2 0.8±0.4/0.5±0.3 85.8±0.1/92.2±0.1 0.8±0.2/0.5±0.1
Large: 335M 83.9±0.2/90.8±0.1 1.2±0.2/0.7±0.1 89.0±0.1/94.7±0.0 0.7±0.2/0.3±0.0 89.0±0.3/94.7±0.0 0.7±0.3/0.3±0.1

To investigate the impact of the number of model parameters, we apply the comparative loss with CmpDrop to different-sized versions of BERT, RoBERTa and ELECTRA. From Table 6 we can see that the comparative loss with CmpDrop achieves a consistent improvement over the baselines based on these backbone models, which indicates that the comparative loss can improve model performance by increasing parameter utility without increasing the number of parameters. Moreover, except for the one outlier of BERTMedium, we can roughly observe that the fewer parameters a model has, the greater the relative gain from comparative loss. This is reasonable because the individual hidden neurons in a model with lower capacity play a larger role, so the improvement in the utility of hidden neurons is more directly reflected in the final performance. For a model of higher capacity, in contrast, the limited training data is easier to fit, i.e., its task-specific loss is already low, so the comparative loss has less room to reduce it further. In addition, we observe that the boost from the comparative loss with CmpDrop is generally larger for BERT than for RoBERTa and ELECTRA, which have more sophisticated pretraining, suggesting that the comparative loss helps the model escape from local optima induced by parameter initialization.

5.4. Effect of Input Context

Figure 5. Performance curves using different context sizes: (a) PRF models on MARCO Dev, where the horizontal dotted line represents the base retrieval model; (b) RC models on HotpotQA Dev.

To review the utility of the input context (i.e., input neurons) to the models, we plot in Fig. 5 the performance trends of the models with different context sizes. First, on both datasets, our models trained with comparative loss consistently outperform the baseline models for all context sizes, indicating that our models are able to utilize input neurons more efficiently given equal amounts of input context. Second, this implies that our comparative loss can further improve model performance even after the input has been streamlined by context selection. In addition, we notice that our ANCE-PRF + CmpCrop in Fig. 5 improves retrieval performance as expected as the number of feedback documents increases, while ANCE-PRF reaches peak performance at 4 feedback documents and then suffers performance degradation, implying that our model is more robust and able to mine and exploit relevant information in the added feedback documents. In contrast to PRF, for HotpotQA in Fig. 5, the performance of all RC models decreases as the number of paragraphs increases. This is understandable, since only 2 paragraphs in HotpotQA are supporting facts and the remaining 8 mostly serve as distraction, so the ideal performance curve would simply be a horizontal line that does not drop as the number of paragraphs increases. Interestingly, we find that the degradation of Longformer + CmpDrop (2.7%) and Longformer + CmpCrop + CmpDrop (3.0%) from the oracle setting (2 gold paragraphs) to the distractor setting (10 paragraphs) is lower than that of the baseline Longformer (3.4%). This suggests that comparative loss can help the models suppress the noisy information in the added context. Although Longformer + CmpCrop (3.7%) degrades more than Longformer, we believe this is because Longformer + CmpCrop has to be optimized for various numbers of paragraphs, unlike the other models without CmpCrop that focus on learning for one input form (i.e., always ten paragraphs). However, this variety of input forms makes Longformer + CmpCrop perform better than Longformer + CmpDrop when the number of paragraphs is small (\leq 5).

To further quantify how comparative loss improves the robustness of the PRF model to context size, we report in Table 7 the robustness indexes (Collins-Thompson, 2009) of ANCE-PRF + CmpCrop and ANCE-PRF at different numbers of feedback documents. The robustness index is defined as \frac{N_{+}-N_{-}}{|Q|}, where |Q| is the total number of evaluated queries and N_{+} and N_{-} are the numbers of queries that the PRF model improves or degrades when one more feedback document is used. The robustness index lies in [-1, 1], and a model with a higher robustness index is more robust. We can see that the PRF model trained using the comparative loss with CmpCrop is significantly more robust than the baseline model. Besides, from the gaps in their robustness indexes (only 0.03 or 0.02 with 1 or 2 feedback documents, but 0.05 with more), we find that the comparative loss is more helpful for longer inputs.

Table 7. The robustness index of \bm{q}^{(k)} with respect to \bm{q}^{(k-1)} on MARCO Dev at each PRF depth k, where \bm{q}^{(k)} and \bm{q}^{(k-1)} are query vectors reformulated by the PRF model, the latter having one less document in its input context than the former.
k 1 2 3 4 5
ANCE-PRF 0.51 0.54 0.58 0.58 0.61
ANCE-PRF + Cmp (1 Crop) 0.54 0.56 0.63 0.63 0.66
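For clarity, the robustness index above is straightforward to compute; the sketch below uses hypothetical per-query metric values purely for illustration.

```python
def robustness_index(metric_with_k, metric_with_k_minus_1):
    """Robustness index of q^(k) with respect to q^(k-1).

    Both arguments are per-query metric values (e.g., MRR@10), aligned by query.
    Returns (N+ - N-) / |Q|, where N+/N- count the queries that improve/degrade
    when one more feedback document is used.
    """
    assert len(metric_with_k) == len(metric_with_k_minus_1)
    n_plus = sum(a > b for a, b in zip(metric_with_k, metric_with_k_minus_1))
    n_minus = sum(a < b for a, b in zip(metric_with_k, metric_with_k_minus_1))
    return (n_plus - n_minus) / len(metric_with_k)

# Toy usage with five hypothetical queries: 3 improve, 1 degrades -> (3 - 1) / 5 = 0.4.
print(robustness_index([0.5, 0.3, 0.9, 0.2, 0.7], [0.4, 0.3, 1.0, 0.1, 0.6]))
```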

5.5. Loss Visualization

Figure 6. Task-specific loss curves for the full model: (a) answer extraction loss on HotpotQA Train, (b) answer extraction loss on HotpotQA Dev, (c) retrieval loss on MARCO Train, (d) retrieval loss on MARCO Dev.

To figure out the impact of comparative loss on the task-specific loss, we plot the task-specific loss curves of the full model (i.e., l^{(0)}) in Fig. 6. From Fig. 6(a) and Fig. 6(b) we can see that, with the same batch size, the comparative loss helps our models fit better compared to the baseline Longformer. Comparing Longformer + CmpDrop and Longformer + CmpCrop, we find that the training loss of the former is significantly smaller, which indicates that the comparative loss with CmpDrop helps the model fit the training data better. In contrast, the evaluation loss of Longformer + CmpCrop rises less in the later stage, which indicates that the comparative loss with CmpCrop can mitigate overfitting to some extent. Since the number of task-specific losses per sample optimized by comparative loss is 1+c times that of conventional training, we also plot the task-specific loss curves of the PRF models in Fig. 6(c) and Fig. 6(d), where the batch size of our uniCOIL-PRF + CmpCrop is 1/(1+c) of that of the baseline uniCOIL-PRF. In this way, the number of task-specific losses optimized in one batch is the same for our model and the baseline, which helps to further clarify the role of the comparative loss with CmpCrop. We can see that while the training loss of our model in Fig. 6(c) does not drop as low as the baseline's, its evaluation loss in Fig. 6(d) drops to a lower level, significantly mitigating overfitting.

5.6. Training Efficiency

Table 8. Specific settings for the number of ablation steps of BERT + Cmp on each GLUE dataset, as well as the performance gain and increase in training computation overhead compared to BERT.
MNLI MRPC QNLI QQP RTE SST-2 STS-B CoLA
c 3 1 4 2 1 2 4 4
Performance (%) ↑ +1.3 +3.8 +0.9 +0.5 +4.1 +0.4 +0.6 +1.6
FLOPs ↑ ×3.5 ×1.6 ×3.5 ×0.9 ×2.1 ×0.7 ×4.8 ×3.9

We present in Table 8 the performance gain and the relative change in training FLOPs of BERTbase + Cmp compared to BERTbase, as well as the specific number of comparisons (i.e., the number of ablation steps c) chosen for each dataset. We find that the actual overhead of training with comparative loss is usually less than 1+c times that of conventional training, and sometimes even less than that of conventional training (e.g., on QQP). This is because models trained with comparative loss tend to converge earlier than the baselines. Combined with the insensitivity of comparative loss to the number of comparisons observed in Fig. 4, we believe that setting c to 1 or 2 leads to effective and fast training when data is sufficient.

6. Related Work

In this section, we introduce and discuss work that has different motivations but is technically related to ours, starting with contrastive learning (Le-Khac et al., 2020), which also learns by comparing, followed by recent training methods that likewise apply dropout multiple times.

6.1. Contrastive Learning

Contrastive learning has recently achieved significant success in representation learning in computer vision and natural language processing. At its core, contrastive learning aims to learn effective representations by pulling semantically similar neighbors together and pushing apart non-neighbors (Hadsell et al., 2006). Instead of learning a signal from individual data samples one at a time, it learns by comparing different samples (Le-Khac et al., 2020). The comparison is performed between positive pairs of similar samples and negative pairs of dissimilar samples. A positive pair must consist of two similar samples, which can be constructed either from supervised similarity annotations or by self-supervision. In self-supervised contrastive learning, a positive pair can consist of an original sample and its data augmentation. For example, SimCLR (Chen et al., 2020) in computer vision uses a crop, flip, distortion or rotation of an original image as its similar view, and SimCSE (Gao et al., 2021) in natural language processing applies two dropout masks to an input sentence to create two slightly different sentence embeddings that are then used as a positive pair. To share more computation and save cost, negative pairs usually consist of two dissimilar samples within the same training batch. Although both learn through comparison, contrastive learning pursues alignment and uniformity (Wang and Isola, 2020) of representations, while our comparative loss pursues orderliness of the task-specific losses of the full model and its ablated models. Moreover, as the lexical meaning suggests, contrastive learning only classifies the relationship (i.e., similar or dissimilar) between two data samples in a binary manner, whereas our comparative loss compares multiple full/ablated models by ranking. However, the two are not in conflict, and our comparative loss can be applied on top of contrastive losses that serve as task-specific losses.

6.2. Dropout-based Comparison

Dropout is a family of stochastic techniques used in neural network training or inference that has attracted extensive research interest and is widely used in practice. Standard dropout (Hinton et al., 2012) aims to avoid overfitting by reducing the co-adaptation of neurons, in which the output of an individual neuron is useful only in combination with the outputs of other neurons. Subsequently, a line of research has focused on improving standard dropout by employing other strategies for dropping neurons, such as dropconnect (Wan et al., 2013) and variational dropout (Kingma et al., 2015).

A line of research that is particularly relevant to us applies dropout multiple times during training. SimCSE (Gao et al., 2021) forwards the model twice with different dropout masks of the same rate and uses a contrastive loss to constrain the distribution of model outputs in the representation space. A possible side effect of dropout revealed in the existing literature (Ma et al., 2016; Zolna et al., 2017) is the non-negligible inconsistency between the training and inference stages of the model, i.e., submodels are optimized during training, but the full model without dropout is used during inference. To address this inconsistency, R-Drop (liang et al., 2021) runs the model forward multiple times with different dropout masks to obtain multiple predicted probability distributions and applies a KL-divergence term to constrain their consistency. Unlike their dropout masks, which are sampled independently, the dropout rates in our CmpDrop are increasing and the masks are progressive, with each subsequent mask obtained by further randomly discarding elements that survived the previous one. In addition, we impose constraints on the task-specific losses at the end rather than on the representations and probabilities upstream. Notably, the full model is also optimized in due course when trained using the comparative loss with CmpDrop, which we argue is important to mitigate the inconsistency between training and inference. This is because, while dropout avoids the co-adaptation of neurons, it also weakens the cooperation between neurons (§5.1 gives some empirical support). In particular, in cases where all neurons are involved, a full model trained only with dropout has not been taught how to make them work together efficiently and thus cannot be fully exploited during testing. Surprisingly, our comparative loss with CmpDrop can strike a balance between promoting the cooperation of neurons and preventing their co-adaptation.
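To illustrate this progressive masking, the sketch below samples nested dropout masks with increasing rates, where each mask is derived from the previous one rather than drawn independently. The rates, tensor shape, and function name are illustrative assumptions, not our exact training implementation.

```python
import torch

def progressive_dropout_masks(shape, rates, generator=None):
    """Sample nested {0, 1} dropout masks with non-decreasing dropout rates.

    Each subsequent mask is obtained by further zeroing elements that survived
    the previous mask, so the kept sets are nested, unlike the independently
    sampled masks used by R-Drop.
    """
    masks, keep, prev_rate = [], torch.ones(shape), 0.0
    for rate in rates:
        assert rate >= prev_rate, "dropout rates must be non-decreasing"
        if rate > prev_rate:
            # Extra drop probability on currently kept elements so that the
            # overall keep probability becomes (1 - rate).
            extra_drop = (rate - prev_rate) / (1.0 - prev_rate)
            drop = torch.rand(shape, generator=generator) < extra_drop
            keep = keep * (~drop).float()
        masks.append(keep.clone())
        prev_rate = rate
    return masks

# Toy usage: three nested masks over a 1 x 10 activation; kept counts never increase.
m0, m1, m2 = progressive_dropout_masks((1, 10), rates=[0.0, 0.1, 0.2])
print(m0.sum().item(), m1.sum().item(), m2.sum().item())
```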

7. Conclusion

In this paper, we propose cross-model comparative loss, a simple task-agnostic loss function, to improve the utility of neurons in NLU models. Comparative loss is essentially a ranking loss based on the comparison principle between the full model and its ablated models, with the expectation that the less ablation there is, the smaller the task-specific loss. To ensure comparability among multiple ablated models, we progressively ablate the models and provide two controlled ablation methods based on dropout and context cropping, applicable to a wide range of tasks and models. We show theoretically how comparative loss works, suggesting that it can adaptively assign weights to multiple task-specific losses. Extensive experiments and analysis on 14 datasets from 3 distinct NLU tasks demonstrate the universal effectiveness of comparative loss. Interestingly, our analysis confirms that comparative loss can indeed assign weights more appropriately, and finds that comparative loss is particularly effective for models with few parameters or long input.

In the future, we would like to apply comparative loss in other domains, such as natural language generation and computer vision, and explore its applications on model architectures beyond the Transformer. It could also be interesting to explore the application of comparative loss on top of self-supervised losses (e.g., contrastive loss) during pretraining. Regarding training cost, reducing the overhead by reusing more shared computation is a direction worth exploring. Further, more advanced ablation methods in training, such as dropconnect (Wan et al., 2013) instead of standard dropout and adversarial instead of stochastic ablation, may deserve future research effort.

Acknowledgements.
This work was supported by the National Key R&D Program of China (Grants No. 2022YFB3103700 and 2022YFB3103704), the National Natural Science Foundation of China (NSFC) under Grants No. 62276248 and U21B2046, and the Youth Innovation Promotion Association CAS under Grant No. 2023111.

References

  • Attar and Fraenkel (1977) R. Attar and A. S. Fraenkel. 1977. Local Feedback in Full-Text Retrieval Systems. J. ACM 24, 3 (July 1977), 397–417. https://doi.org/10.1145/322017.322021
  • Bartoldson et al. (2020) Brian Bartoldson, Ari Morcos, Adrian Barbu, and Gordon Erlebacher. 2020. The Generalization-Stability Tradeoff In Neural Network Pruning. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 20852–20864. https://proceedings.neurips.cc/paper/2020/hash/ef2ee09ea9551de88bc11fd7eeea93b0-Abstract.html
  • Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. https://doi.org/10.48550/arXiv.2004.05150 arXiv:2004.05150
  • Bentivogli et al. (2009) Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge.. In TAC.
  • Blalock et al. (2020) Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. 2020. What Is the State of Neural Network Pruning? Proceedings of Machine Learning and Systems 2 (March 2020), 129–146. https://proceedings.mlsys.org/paper/2020/hash/d2ddea18f00665ce8623e36bd4e3c7c5-Abstract.html
  • Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego De Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack Rae, Erich Elsen, and Laurent Sifre. 2022. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the 39th International Conference on Machine Learning. PMLR, 2206–2240. https://proceedings.mlr.press/v162/borgeaud22a.html
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., Red Hook, NY, USA, 1877–1901. https://papers.nips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
  • Brutzkus and Globerson (2019) Alon Brutzkus and Amir Globerson. 2019. Why Do Larger Models Generalize Better? A Theoretical Perspective via the XOR Problem. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 822–830. https://proceedings.mlr.press/v97/brutzkus19b.html
  • Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, 1–14. https://doi.org/10.18653/v1/S17-2001
  • Chang and Deng (2020) Yi Chang and Hongbo Deng. 2020. Query understanding for search engines. Springer.
  • Chen (2018) Danqi Chen. 2018. Neural Reading Comprehension and Beyond. Stanford University.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 1597–1607. https://proceedings.mlr.press/v119/chen20j.html
  • Chen et al. (2018) Zihan Chen, Hongbo Zhang, Xiaoji Zhang, and Leqi Zhao. 2018. Quora question pairs. https://www.kaggle.com/c/quora-question-pairs
  • Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and Effective Multi-Paragraph Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 845–855. https://doi.org/10.18653/v1/P18-1078
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In International Conference on Learning Representations. https://openreview.net/forum?id=r1xMH1BtvB
  • Clinchant and Gaussier (2013) Stéphane Clinchant and Eric Gaussier. 2013. A Theoretical Analysis of Pseudo-Relevance Feedback Models. In Proceedings of the 2013 Conference on the Theory of Information Retrieval (ICTIR ’13). Association for Computing Machinery, New York, NY, USA, 6–13. https://doi.org/10.1145/2499178.2499179
  • Cohen and Howe (1988) Paul R. Cohen and Adele E. Howe. 1988. How Evaluation Guides AI Research: The Message Still Counts More than the Medium. AI Magazine 9, 4 (Dec. 1988), 35–35. https://doi.org/10.1609/aimag.v9i4.952
  • Collins-Thompson (2009) Kevyn Collins-Thompson. 2009. Reducing the Risk of Query Expansion via Robust Constrained Optimization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM ’09). Association for Computing Machinery, New York, NY, USA, 837–846. https://doi.org/10.1145/1645953.1646059
  • Craswell et al. (2021) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, and Daniel Campos. 2021. Overview of the TREC 2020 Deep Learning Track. arXiv:2102.07662 [cs] (Feb. 2021). arXiv:2102.07662 http://arxiv.org/abs/2102.07662
  • Craswell et al. (2020) Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 Deep Learning Track. arXiv:2003.07820 [cs] (March 2020). arXiv:2003.07820 http://arxiv.org/abs/2003.07820
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • Dolan and Brockett (2005) William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005). https://aclanthology.org/I05-5002
  • Dua et al. (2021) Dheeru Dua, Cicero Nogueira dos Santos, Patrick Ng, Ben Athiwaratkun, Bing Xiang, Matt Gardner, and Sameer Singh. 2021. Generative Context Pair Selection for Multi-hop Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 7009–7015. https://doi.org/10.18653/v1/2021.emnlp-main.561
  • Frank et al. (2021) Stella Frank, Emanuele Bugliarello, and Desmond Elliott. 2021. Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9847–9857. https://doi.org/10.18653/v1/2021.emnlp-main.775
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 6894–6910. https://doi.org/10.18653/v1/2021.emnlp-main.552
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2. IEEE, 1735–1742.
  • Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning Both Weights and Connections for Efficient Neural Network. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://papers.nips.cc/paper/2015/hash/ae0eb3eed39d2bcef4622b2499a05fe6-Abstract.html
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016).
  • Herbrich et al. (2000) Ralf Herbrich, Thore Graepel, and Klaus Obermayer. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in large margin classifiers 88, 2 (2000), 115–132.
  • Hinton et al. (2012) Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors. https://doi.org/10.48550/arXiv.1207.0580 arXiv:1207.0580
  • Hirschman and Gaizauskas (2001) L. Hirschman and R. Gaizauskas. 2001. Natural Language Question Answering: The View from Here. Natural Language Engineering 7, 4 (Dec. 2001), 275–300. https://doi.org/10.1017/S1351324901002807
  • Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Few-Shot Learning with Retrieval Augmented Language Models. https://doi.org/10.48550/arXiv.2208.03299 arXiv:2208.03299
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. https://doi.org/10.48550/arXiv.2001.08361 arXiv:2001.08361
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. 2015. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems, Vol. 28. Curran Associates, Inc. https://papers.nips.cc/paper/2015/hash/bc7316929fe1545bf0b98d114ee3ecb8-Abstract.html
  • Labach et al. (2019) Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee. 2019. Survey of Dropout Methods for Deep Neural Networks. https://doi.org/10.48550/arXiv.1904.13310 arXiv:1904.13310
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations. https://openreview.net/forum?id=H1eA7AEtvS
  • Le-Khac et al. (2020) Phuc H. Le-Khac, Graham Healy, and Alan F. Smeaton. 2020. Contrastive Representation Learning: A Framework and Review. IEEE Access 8 (2020), 193907–193934. https://doi.org/10.1109/ACCESS.2020.3031549
  • LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. 1989. Optimal Brain Damage. In Advances in Neural Information Processing Systems, Vol. 2. Morgan-Kaufmann. https://papers.nips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html
  • liang et al. (2021) xiaobo liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. 2021. R-Drop: Regularized Dropout for Neural Networks. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 10890–10905. https://proceedings.neurips.cc/paper/2021/hash/5a66b9200f29ac3fa0ae244cc2a51b39-Abstract.html
  • Lin (2002) Jimmy Lin. 2002. The Web as a Resource for Question Answering: Perspectives and Challenges. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02). European Language Resources Association (ELRA), Las Palmas, Canary Islands - Spain. http://www.lrec-conf.org/proceedings/lrec2002/pdf/85.pdf
  • Lin and Ma (2021) Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807 [cs] (June 2021). arXiv:2106.14807 http://arxiv.org/abs/2106.14807
  • Liu et al. (2019b) Shanshan Liu, Xin Zhang, Sheng Zhang, Hui Wang, and Weiming Zhang. 2019b. Neural Machine Reading Comprehension: Methods and Trends. Applied Sciences 9, 18 (Jan. 2019), 3698. https://doi.org/10.3390/app9183698
  • Liu et al. (2019a) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019a. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs] (July 2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
  • Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning Efficient Convolutional Networks through Network Slimming. In 2017 IEEE International Conference on Computer Vision (ICCV). 2755–2763. https://doi.org/10.1109/ICCV.2017.298
  • Ma et al. (2016) Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy. 2016. Dropout with expectation-linear regularization. arXiv preprint arXiv:1609.08017 (2016).
  • Mackie et al. (2021) Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How Deep Is Your Learning: The DL-HARD Annotated Deep Learning Dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA, 2335–2341. https://doi.org/10.1145/3404835.3463262
  • Meyes et al. (2019) Richard Meyes, Melanie Lu, Constantin Waubert de Puiseau, and Tobias Meisen. 2019. Ablation Studies in Artificial Neural Networks. https://doi.org/10.48550/arXiv.1901.08644 arXiv:1901.08644
  • Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and Robust Question Answering from Minimal Context over Documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 1725–1735. https://doi.org/10.18653/v1/P18-1160
  • Mitra et al. (1998) Mandar Mitra, Amit Singhal, and Chris Buckley. 1998. Improving Automatic Query Expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’98). Association for Computing Machinery, New York, NY, USA, 206–214. https://doi.org/10.1145/290941.290995
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated Machine Reading Comprehension Dataset. In CoCo@ NIPS.
  • Niu et al. (2012) Shuzi Niu, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2012. Top-k learning to rank: labeling, ranking and evaluation. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (Portland, Oregon, USA) (SIGIR ’12). Association for Computing Machinery, New York, NY, USA, 751–760. https://doi.org/10.1145/2348283.2348384
  • Pang et al. (2021) Liang Pang, Yanyan Lan, and Xueqi Cheng. 2021. Match-Ignition: Plugging PageRank into Transformer for Long-form Text Matching. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 1396–1405. https://doi.org/10.1145/3459637.3482450
  • Pang et al. (2019) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Lixin Su, and Xueqi Cheng. 2019. HAS-QA: Hierarchical Answer Spans Model for Open-Domain Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence 33, 01 (July 2019), 6875–6882. https://doi.org/10.1609/aaai.v33i01.33016875
  • Pang et al. (2016) Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text matching as image recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI’16). AAAI Press, 2793–2799.
  • Pang et al. (2020) Liang Pang, Jun Xu, Qingyao Ai, Yanyan Lan, Xueqi Cheng, and Jirong Wen. 2020. SetRank: Learning a Permutation-Invariant Ranking Model for Information Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 499–508. https://doi.org/10.1145/3397271.3401104
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Melbourne, Australia, 784–789. https://doi.org/10.18653/v1/P18-2124
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
  • Rodriguez and Boyd-Graber (2021) Pedro Rodriguez and Jordan Boyd-Graber. 2021. Evaluation Paradigms in Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9630–9642. https://doi.org/10.18653/v1/2021.emnlp-main.758
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing. 1631–1642.
  • Tu et al. (2020) Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. 2020. Select, Answer and Explain: Interpretable Multi-Hop Reading Comprehension over Multiple Documents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9073–9080. https://doi.org/10.1609/aaai.v34i05.6441
  • Vapnik (1991) Vladimir Vapnik. 1991. Principles of risk minimization for learning theory. Advances in neural information processing systems 4 (1991).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  • Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. 2013. Regularization of Neural Networks Using DropConnect. In Proceedings of the 30th International Conference on Machine Learning. PMLR, 1058–1066. https://proceedings.mlr.press/v28/wan13.html
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In International Conference on Learning Representations. https://openreview.net/forum?id=rJ4km2R5t7
  • Wang et al. (2022) Shuohang Wang, Yichong Xu, Yuwei Fang, Yang Liu, Siqi Sun, Ruochen Xu, Chenguang Zhu, and Michael Zeng. 2022. Training Data Is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 3170–3179. https://doi.org/10.18653/v1/2022.acl-long.226
  • Wang and Isola (2020) Tongzhou Wang and Phillip Isola. 2020. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 9929–9939. https://proceedings.mlr.press/v119/wang20k.html
  • Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7 (2019), 625–641.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 1112–1122. https://doi.org/10.18653/v1/N18-1101
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
  • Xiong et al. (2020) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In International Conference on Learning Representations. https://openreview.net/forum?id=zeFrfgyZln
  • Xu et al. (2022) Shicheng Xu, Liang Pang, Huawei Shen, and Xueqi Cheng. 2022. Match-Prompt: Improving Multi-task Generalization Ability for Neural Text Matching via Prompt Learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (Atlanta, GA, USA) (CIKM ’22). Association for Computing Machinery, New York, NY, USA, 2290–2300. https://doi.org/10.1145/3511808.3557388
  • Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2369–2380. https://doi.org/10.18653/v1/D18-1259
  • Yu et al. (2021) HongChien Yu, Chenyan Xiong, and Jamie Callan. 2021. Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 3592–3596. https://doi.org/10.1145/3459637.3482124
  • Zheng et al. (2020) Zhi Zheng, Kai Hui, Ben He, Xianpei Han, Le Sun, and Andrew Yates. 2020. BERT-QE: Contextualized Query Expansion for Document Re-ranking. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4718–4728. https://doi.org/10.18653/v1/2020.findings-emnlp.424
  • Zhong et al. (2021) Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt. 2021. Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Online, 3813–3827. https://doi.org/10.18653/v1/2021.findings-acl.334
  • Zhu et al. (2020) Yunchang Zhu, Liang Pang, Yanyan Lan, and Xueqi Cheng. 2020. L2R2: Leveraging Ranking for Abductive Reasoning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. Association for Computing Machinery, New York, NY, USA, 1961–1964. https://doi.org/10.1145/3397271.3401332
  • Zhu et al. (2021) Yunchang Zhu, Liang Pang, Yanyan Lan, Huawei Shen, and Xueqi Cheng. 2021. Adaptive Information Seeking for Open-Domain Question Answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3615–3626. https://doi.org/10.18653/v1/2021.emnlp-main.293
  • Zhu et al. (2022) Yunchang Zhu, Liang Pang, Yanyan Lan, Huawei Shen, and Xueqi Cheng. 2022. LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 825–836. https://doi.org/10.1145/3477495.3532017
  • Zighelnic and Kurland (2008) Liron Zighelnic and Oren Kurland. 2008. Query-Drift Prevention for Robust Query Expansion. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’08). Association for Computing Machinery, New York, NY, USA, 825–826. https://doi.org/10.1145/1390334.1390524
  • Zolna et al. (2017) Konrad Zolna, Devansh Arpit, Dendi Suhubdy, and Yoshua Bengio. 2017. Fraternal dropout. arXiv preprint arXiv:1711.00066 (2017).