
Adaptive Knowledge-Enhanced Bayesian Meta-Learning for Few-shot Event Detection

Shirong Shen1, Tongtong Wu1, Guilin Qi1*, Yuan-Fang Li2, Gholamreza Haffari2, Sheng Bi1
1School of Computer Science and Engineering, Southeast University, China
2Faculty of Information Technology, Monash University, Melbourne, Australia
{ssr, wutong8023, gqi}@seu.edu.cn, [email protected],
[email protected], [email protected]
Abstract

Event detection (ED) aims at detecting event trigger words in sentences and classifying them into specific event types. In real-world applications, ED typically does not have sufficient labelled data and thus can be formulated as a few-shot learning problem. To tackle the issue of low sample diversity in few-shot ED, we propose a novel knowledge-based few-shot event detection method that uses a definition-based encoder to introduce external event knowledge as the knowledge prior of event types. Furthermore, as external knowledge typically provides limited and imperfect coverage of event types, we introduce an adaptive knowledge-enhanced Bayesian meta-learning method to dynamically adjust the knowledge prior of event types. Experiments show our method consistently and substantially outperforms a number of baselines by at least 15 absolute $F_1$ points under the same few-shot settings.

* Corresponding author.

1 Introduction

Event detection is an important task in information extraction, aiming at detecting event triggers from text and then classifying them into event types  Chen et al. (2015). For example, in “The police arrested Harry on charges of manslaughter”, the trigger word is arrested, indicating an “Arrest” event. Event detection has been widely applied in Twitter analysis Zhou et al. (2017), legal case extraction de Araujo et al. (2017), and financial event extraction Zheng et al. (2019), to name a few.

Typical approaches to event detection Chen et al. (2015); McClosky et al. (2011); Liu et al. (2019) generally rely on large-scale annotated datasets for training. Yet in real-world applications, adequate labeled data is usually unavailable. Hence, methods that generalize effectively with small quantities of labeled samples and adapt quickly to new event types are highly desirable for event detection.

Figure 1: A 3-way 3-shot event detection example, in which the model uses the support set to predict the event types of query samples.

Various approaches have been proposed to enable learning from only a few samples, i.e., few-shot learning Finn et al. (2017); Snell et al. (2017); Zhang et al. (2018a). Yet few-shot event detection (FSED) has been less studied until recently Lai et al. (2020a); Deng et al. (2020). Although these methods achieve encouraging progress in the typical $N$-way-$M$-shot setting (Figure 1), the performance remains unsatisfactory as the diversity of examples in the support set is usually limited.

Intuitively, introducing high-quality semantic knowledge, such as FrameNet Baker et al. (1998), is a potential solution to the insufficient diversity issue Qu et al. (2020); Tong et al. (2020); Liu et al. (2016, 2020). However, as shown in Figure 2, such knowledge-enhanced methods also suffer from two major issues: (1) the incomplete coverage by the knowledge base and (2) the uncertainty caused by the inexact alignment between predefined knowledge and diverse applications.

Figure 2: An example from FrameNet. Left side: the relation between the frame 'Chatting' and its sub-frame. Right side: the definitions and LUs (Lexical Units) of the frames Chatting and Discussion. The blue words represent the mentions of arguments in the definition. It can be seen that, in FrameNet, the definition of a sub-frame is similar to that of its super-frame. An external knowledge base can provide rich semantic information, yet it is typically incomplete; for example, a desired frame "online-chat" is missing.

To tackle the above issues, in this paper, we propose an Adaptive Knowledge-Enhanced Bayesian Meta-Learning (AKE-BML) framework. More specifically, (1) we align the event types between the support set and FrameNet via heuristic rules (for event types that cannot be accurately aligned to FrameNet, we match the nearest super-ordinate frame). (2) We propose encoders for encoding the samples and the knowledge base in the same semantic space. (3) We propose a learnable offset for revising the aligned knowledge representations to build the knowledge prior distribution for event types and generate the posterior distribution for event type prototype representations. (4) In the prediction phase, we adopt the learned posterior distribution over prototype representations to classify query instances into event types.

We conduct comprehensive experiments on the aggregated benchmark dataset of few-shot event detection Deng et al. (2020). The experimental results show that our method consistently and substantially outperforms state-of-the-art methods. In all six $N$-way-$M$-shot settings, our model achieves an $F_1$ improvement of at least 15 absolute points.

2 Related Work

Event Detection.

Recent event detection methods based on neural networks have achieved good performance Chen et al. (2015); Sha et al. (2016); Nguyen et al. (2016); Lou et al. (2021). These methods use neural networks to construct context features of candidate trigger words to classify events. Pre-trained language models such as BERT Devlin et al. (2019) have also become an indispensable component of event detection models Yang et al. (2019); Wadden et al. (2019); Shen et al. (2020). However, neural models rely on large-scale labeled event datasets and fail to predict the labels of new event types. A recent study utilized a basic metric-based few-shot learning method for event detection Lai et al. (2020b). Deng et al. (2020) tackle few-shot learning for event classification with a dynamic memory network. Ontology embeddings have also been used in ED to incorporate background knowledge Deng et al. (2021). These methods have achieved encouraging results in the few-shot learning setting. However, they do not address the problem of insufficient sample diversity in the support set. Our method leverages the knowledge in FrameNet to augment the support set for event detection.

Few-shot Learning and Meta-learning.

Few-shot learning trains a model with only a few labeled samples in a support set and predicts the labels of unlabeled samples in the query set. Various approaches have been proposed to solve the few-shot learning problem, which mainly fall into three categories: (1) metric-based methods Vinyals et al. (2016); Snell et al. (2017); Garcia and Bruna (2012); Sung et al. (2018), (2) optimization-based methods Finn et al. (2017); Nichol et al. (2018); Ravi and Larochelle (2016), and (3) model-based methods Yan et al. (2015); Zhang et al. (2018b); Sukhbaatar et al. (2015); Zhang et al. (2018a). However, these methods rely heavily on the support set and suffer from poor robustness caused by insufficient sample diversity of the support set.

Bayesian meta-learning Ravi and Larochelle (2016); Yoon et al. (2018) can construct the posterior distribution of the prototype vector using information outside the support set. The effectiveness of this approach has been shown on the few-shot relation extraction task Qu et al. (2020). This inspires us to address the problem of insufficient sample diversity in few-shot event detection by introducing external knowledge. However, this approach ignores the semantic deviation between knowledge and target types. Specifically, a knowledge base may provide incomplete coverage of the target types in a given support set, which leads to inaccurate matching between a target type and the knowledge.

3 Problem Definition

In this paper, the Few-Shot Event Detection (FSED) problem is defined as a typical $N$-way-$M$-shot problem. Specifically, a tiny labeled support set $S$ is provided for model training. $S$ contains $N$ distinct event types and each event type has only $M$ labeled samples, where $M$ is typically small (e.g. $M = 5, 10, 15$). More precisely, in each FSED task we are given a small support set $S = \{(x_s, y_s)\}$. Let $X_S = \{x_s\}_{s \in S}$ represent the samples in the support set $S$, i.e. $x_s = (I_s, tt_s)$, where $I_s$ is the sentence of the sample $x_s$ and $tt_s$ is the candidate trigger word of $x_s$. We denote by $Y_S$ an ordered list of event types, i.e. $Y_S = \{y_s\}_{s \in S}$, where each $y_s$ is the ground-truth event type of sample $x_s$. For each support set $S$, we only consider a subset of event types $T_S$ from the entire set of event types $T$. Hence, in the $N$-way-$M$-shot setting, $|T_S| = N$ and $|X_S| = |Y_S| = N \cdot M$.

Moreover, we assume an external knowledge base $\mathscr{F}$ that contains a number of frames. Each frame $F_t \in \mathscr{F}$ consists of three parts: $F_t = (D_t, A_t, L_t)$, where $D_t$, $A_t$ and $L_t$ are the definition, arguments, and linguistic units (LUs) of the frame, respectively. Please see Appendix A for details of FrameNet.

For each support set $S$, we are also given a query set $Q$ composed of unlabeled samples $X_Q = \{x_q\}_{q \in Q}$, where $x_q = (I_q, tt_q)$, $I_q$ is the sentence of sample $x_q$, and $tt_q$ is the candidate trigger word of $x_q$. Our goal is to learn a neural classifier for these event types by using the external knowledge and the support set. We will apply the classifier to predict the labels of the query samples in $Q$, i.e., $Y_Q = \{y_q\}_{q \in Q}$ with each $y_q \in T_S$. We do this by learning $p(Y_Q | X_Q, X_S, Y_S, \mathscr{F})$.
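To make the episodic setup concrete, the following minimal Python sketch builds one $N$-way-$M$-shot FSED task. The toy corpus, the function name, and the query-set size are illustrative assumptions of ours, not artifacts from the paper.

```python
# A minimal sketch of constructing one N-way-M-shot FSED episode.
# EVENT_SAMPLES, sample_episode, and n_query are illustrative assumptions.
import random

# toy corpus: event type -> list of (sentence, candidate_trigger) pairs
EVENT_SAMPLES = {
    "Arrest": [("The police arrested Harry on charges of manslaughter.", "arrested")] * 20,
    "Attack": [("Rebels attacked the convoy at dawn.", "attacked")] * 20,
    "Chatting": [("They chatted online for hours.", "chatted")] * 20,
}

def sample_episode(event_samples, n_way=3, m_shot=3, n_query=2, seed=None):
    """Draw N event types, then M support and a few query samples per type."""
    rng = random.Random(seed)
    types = rng.sample(sorted(event_samples), n_way)
    support, query = [], []
    for t in types:
        picks = rng.sample(event_samples[t], m_shot + n_query)
        support += [(x, t) for x in picks[:m_shot]]   # (x_s, y_s) pairs
        query += [(x, t) for x in picks[m_shot:]]     # labels used for eval only
    return types, support, query

types, support, query = sample_episode(EVENT_SAMPLES, seed=0)
print(types, len(support), len(query))  # 3 types, 9 support, 6 query samples
```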

4 Adaptive Knowledge-Enhanced Bayesian Meta-Learning

Figure 3: Framework overview. Our method combines both the external knowledge and the support set into a prior distribution over event prototypes. We customize two encoders to generate sample representations and knowledge representations. We then utilize the support set to generate a learnable offset that revises the aligned knowledge representations, producing the prior distribution for prototype representations. Finally, we use Monte Carlo sampling and stochastic gradient Langevin dynamics to draw samples of prototypes for prediction.

We now present our adaptive knowledge-enhanced few-shot event detection approach. The overall structure of our method is shown in Figure 3. Our method represents each event type $t$ with a prototype vector $\textbf{v}_t$, which is then used to classify the query sentences. We use $\textbf{V}_{T_S} = \{\textbf{v}_t\}_{t \in T_S}$ to represent the collection of prototype vectors for all event types in $T_S$. Then the conditional distribution $p(Y_Q | X_Q, X_S, Y_S, \mathscr{F})$ can be represented as:

$$\int p(Y_Q | X_Q, \textbf{V}_{T_S})\, p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F})\, d\textbf{V}_{T_S}. \qquad (1)$$

To calculate Eq. 1, we first introduce the sample encoder and knowledge encoder to give the vector representations of samples and the knowledge of event types. Then we use the sample representations and knowledge representations to construct the adaptive knowledge-enhanced posterior distribution $p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F})$ of $\textbf{V}_{T_S}$, and define the likelihood $p(Y_Q | X_Q, \textbf{V}_{T_S})$ in terms of $\textbf{V}_{T_S}$ and the sample representations. Finally, we leverage Monte Carlo sampling to approximate the posterior distribution and draw each prototype sample by stochastic gradient Langevin dynamics Welling and Teh (2011) to optimize model parameters in an end-to-end fashion. We now explain the framework in more detail.

4.1 Sample and Knowledge Encoder

The purpose of encoding knowledge is to make up for the lack of diversity and coverage in the support set. Thus we align the knowledge and sample encodings and map them into the same semantic space. Intuitively, the trigger and arguments are the main factors for event detection. Hence, to align the trigger and arguments from samples and external knowledge, we design two encoders for the knowledge and samples, generating the final knowledge encoding $\textbf{h}_t$ and the sample encoding $\bm{\mathcal{E}}(x)$ with the same dimensions.

Knowledge Encoder. Given a knowledge frame $F_t = (D_t, A_t, L_t)$ for the event type $t$, we encode it into a real-valued vector to represent the semantics of $t$. As shown in Figure 2, for a frame $F_t$, the linguistic units $L_t$ can represent the features of the trigger words, the arguments $A_t$ can represent the context of the trigger words in samples, and $D_t$ describes the semantic relationship between $A_t$ and $t$.

For each event type $t$, the proposed knowledge encoder uses BERT to generate the text encodings $\textbf{E}_{D_t}$ and $\textbf{E}_{L_t}$ from the description $D_t$ and the LUs $L_t$ respectively. Moreover, the argument encoding $\textbf{E}_{A_t}$ is a sequence of $\textbf{e}_{A_t}^{(i)}$, i.e., the average token encoding of the $i$-th argument mention in $D_t$, which ensures that the encoding of $A_t$ fully contains the semantics of the event type $t$. Then, as shown in Figure 3, the trigger word prior encoding and argument prior encoding are generated as follows:

  • Trigger word prior encoding. We use attention to obtain a weighted sum of the words in $L_t$ as the trigger word prior encoding $\textbf{e}^*_{L_t}$. The query of the attention is $\textbf{E}_{D_t}$; the key and value are both $\textbf{E}_{L_t}$.

  • Argument prior encoding. An attention mechanism is used to aggregate the argument information into $\textbf{e}^*_{A_t}$, where the query of the attention is $\textbf{e}^*_{L_t}$; the key and value are both $\textbf{E}_{A_t}$.

Finally, we concatenate the trigger word prior encoding $\textbf{e}^*_{L_t}$ and the argument prior encoding $\textbf{e}^*_{A_t}$, and use a feed-forward network $f^h$ to generate the knowledge encoding vector $\textbf{h}_t$ of event type $t$,

$$\textbf{h}_t = f^h\left(\left[\textbf{e}^*_{A_t}; \textbf{e}^*_{L_t}\right]\right). \qquad (2)$$
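For concreteness, here is a minimal sketch of the two attention-pooling steps behind Eq. 2, assuming BERT token encodings are already computed. Pooling the sequence-valued query $\textbf{E}_{D_t}$ into a single vector and using scaled dot-product attention are simplifying assumptions of ours; the paper does not pin down these details.

```python
# A minimal sketch of the knowledge encoder (Eq. 2), assuming precomputed
# BERT token encodings. Shapes and the single-query attention are our own
# simplifications, not the authors' released code.
import torch
import torch.nn.functional as F

def attend(query, keys, values):
    """Single-query scaled dot-product attention: (d,), (n,d), (n,d) -> (d,)."""
    scores = keys @ query / keys.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ values

d = 768                              # BERT hidden size
E_D = torch.randn(12, d)             # token encodings of the definition D_t
E_L = torch.randn(5, d)              # token encodings of the LUs L_t
E_A = torch.randn(3, d)              # per-argument mention encodings from D_t

# Trigger word prior: query = (pooled) definition, key/value = LU encodings.
e_L = attend(E_D.mean(dim=0), E_L, E_L)
# Argument prior: query = trigger prior, key/value = argument encodings.
e_A = attend(e_L, E_A, E_A)

f_h = torch.nn.Linear(2 * d, d)      # feed-forward network f^h
h_t = f_h(torch.cat([e_A, e_L]))     # knowledge encoding h_t (Eq. 2)
```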

Sample encoder. We follow the same strategy to build a sample encoder. Given a sample $x = (I, tt)$, i.e., a candidate trigger word $tt$ and its context $I$, we first utilize BERT to encode $x$ and select the encoding of $tt$ as the trigger representation $\textbf{e}^*_{tt}$. As arguments are not explicitly given in $x$, we use an attention mechanism to aggregate the implicit argument information for the current trigger $tt$, in which the query is $\textbf{e}^*_{tt}$ and the key and value are both the token encodings generated from $I$. We denote the argument encoding as $\textbf{e}^*_a$.

Finally, we concatenate the trigger word encoding $\textbf{e}^*_{tt}$ and the argument encoding $\textbf{e}^*_a$, and use a feed-forward network $f^{\mathcal{E}}$ to generate the sample encoding vector $\bm{\mathcal{E}}(x)$,

$$\bm{\mathcal{E}}(x) = f^{\mathcal{E}}\left(\left[\textbf{e}^*_a; \textbf{e}^*_{tt}\right]\right). \qquad (3)$$
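A companion sketch of the sample encoder (Eq. 3) follows, continuing the previous snippet (it reuses `attend` and `d`). The sentence length and trigger position are made-up stand-ins.

```python
# Sketch of the sample encoder (Eq. 3); `attend` and `d` come from the
# previous snippet. Sentence length and trigger index are illustrative.
import torch

E_I = torch.randn(20, d)             # token encodings of the sentence I (BERT)
tt_idx = 4                           # position of the candidate trigger word tt

e_tt = E_I[tt_idx]                   # trigger representation e*_tt
e_a = attend(e_tt, E_I, E_I)         # implicit argument encoding e*_a

f_E = torch.nn.Linear(2 * d, d)      # feed-forward network f^E
enc_x = f_E(torch.cat([e_a, e_tt]))  # sample encoding E(x), same space as h_t
```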

4.2 Adaptive Knowledge-Enhanced Posterior

The posterior distribution can be factorized into a prior distribution (given the event knowledge) and a likelihood on the support set Qu et al. (2020):

$$p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F}) \propto p(Y_S | X_S, \textbf{V}_{T_S})\, p(\textbf{V}_{T_S} | \mathscr{F}), \qquad (4)$$

where $p(Y_S | X_S, \textbf{V}_{T_S})$ is the likelihood on the support set, and $p(\textbf{V}_{T_S} | \mathscr{F})$ is the adaptive knowledge-based prior for the prototype vectors. We describe these two components in detail as follows:

Adaptive Knowledge-based Prior. As discussed in Section 1, an event type $t$ may not have an exact match in the knowledge base $\mathscr{F}$. In such situations, we resort to finding the super-ordinate frame of $t$, which is semantically closest to $t$. For example, as shown in Figures 1 and 2, the event type 'online-chat' in the support set is matched against the frame 'Chatting' in FrameNet, a super-ordinate frame. In order to enable the knowledge encoding to accurately reflect the characteristics of the corresponding event type, we add a learnable knowledge offset to $\textbf{h}_t$. We denote the knowledge offset between the event type $t$ and its knowledge encoding $\textbf{h}_t$ by $\Delta\textbf{h}_t$. Recall that the knowledge in $\textbf{h}_t$ is encoded from the exactly-matched frame or the super-ordinate frame. $\Delta\textbf{h}_t$ is defined as follows:

$$\Delta\textbf{h}_t = \bm{\lambda}_t \odot (\textbf{m}_t - \textbf{h}_t), \qquad (5)$$

where $\odot$ is the element-wise product, and $\textbf{m}_t$ is the mean of the encodings $\bm{\mathcal{E}}(x)$ of all the samples $x$ of type $t$ in the support set. $\bm{\lambda}_t \in [0, 1]^{|\textbf{h}_t|}$ is the adaptive weight (gate), which is obtained from the sample encoding $\textbf{m}_t$ and the knowledge encoding $\textbf{h}_t$:

$$\bm{\lambda}_t = \sigma(\textbf{W}_\lambda [\textbf{m}_t; \textbf{m}_t - \textbf{h}_t; \textbf{h}_t] + \textbf{b}_\lambda), \qquad (6)$$

where $\sigma$ is the sigmoid function, and $\textbf{W}_\lambda$ and $\textbf{b}_\lambda$ are trainable parameters.
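The snippet below sketches Eqs. 5-7 end to end: the gate $\bm{\lambda}_t$, the offset $\Delta\textbf{h}_t$, and one draw from the resulting prior. Packing $\textbf{W}_\lambda$ and $\textbf{b}_\lambda$ into a single linear layer and the random stand-in inputs are our assumptions.

```python
# Sketch of the adaptive knowledge offset (Eqs. 5-6) and the prior mean
# (Eq. 7). Inputs are random stand-ins; `gate` packs W_lambda and b_lambda.
import torch

d = 768
h_t = torch.randn(d)                  # knowledge encoding of event type t
m_t = torch.randn(d)                  # mean support-sample encoding for type t

gate = torch.nn.Linear(3 * d, d)      # trainable W_lambda and b_lambda
lam = torch.sigmoid(gate(torch.cat([m_t, m_t - h_t, h_t])))  # Eq. 6
delta_h = lam * (m_t - h_t)                                  # Eq. 5

prior_mean = h_t + delta_h            # prior: v_t ~ N(h_t + delta_h, I), Eq. 7
v_t = prior_mean + torch.randn(d)     # one draw from the prior
```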

Putting it all together, the knowledge prior distribution has the following form,

$$p(\textbf{V}_{T_S} | \mathscr{F}) = \prod_{t \in T_S} p(\textbf{v}_t | \textbf{h}_t, \Delta\textbf{h}_t) = \prod_{t \in T_S} \mathcal{N}(\textbf{v}_t | \textbf{h}_t + \Delta\textbf{h}_t, \bm{I}), \qquad (7)$$

where $\mathcal{N}(\textbf{v}_t | \textbf{h}_t + \Delta\textbf{h}_t, \bm{I})$ is a multivariate Gaussian with mean $\textbf{h}_t + \Delta\textbf{h}_t$ and covariance $\bm{I}$ (the identity matrix). Thus, each prototype vector has a prior distribution containing knowledge from FrameNet, adaptively adjusted according to the support set.

Likelihood. With the prototype vectors $\textbf{V}_{T_S}$ distributed according to $p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F})$, the likelihood of the support samples is defined as,

$$p(Y_S | X_S, \textbf{V}_{T_S}) = \prod_{s \in S} p(y_s | x_s, \textbf{V}_{T_S}), \qquad p(y_s = t | x_s, \textbf{V}_{T_S}) := \frac{\exp(\bm{\mathcal{E}}(x_s) \cdot \textbf{v}_t)}{\sum_{t' \in T_S} \exp(\bm{\mathcal{E}}(x_s) \cdot \textbf{v}_{t'})}. \qquad (8)$$

The dot product of the sample encoding $\bm{\mathcal{E}}(x_s)$ and the event type prototype vector $\textbf{v}_t$ estimates their similarity. We use softmax to normalize the result to the probability of $x_s$ belonging to event type $t$.
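In code, Eq. 8 is a softmax over prototype similarities; here is a minimal sketch with assumed shapes and random stand-ins:

```python
# Sketch of the likelihood in Eq. 8: dot-product similarity between a sample
# encoding and each prototype, normalized with softmax. Shapes are assumed.
import torch
import torch.nn.functional as F

N, d = 5, 768
V = torch.randn(N, d)          # prototype vectors v_t, one per event type
enc_x = torch.randn(d)         # sample encoding E(x_s)

logits = V @ enc_x             # similarity to each prototype
p = F.softmax(logits, dim=-1)  # p(y_s = t | x_s, V_{T_S}) for each t
```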

4.3 Optimization and Prediction

For prediction, the model computes and maximizes the log-probability $\log p(Y_Q | X_Q, X_S, Y_S, \mathscr{F})$. However, according to Eq. 1, the log-probability relies on an integration over prototype vectors, which is difficult to compute. Hence, we estimate it with Monte Carlo sampling Qu et al. (2020),

$$p(Y_Q | X_Q, X_S, Y_S, \mathscr{F}) = \mathbb{E}_{p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F})}\left[p(Y_Q | X_Q, \textbf{V}_{T_S})\right] \approx \frac{1}{N_s} \sum_{i=1}^{N_s} p(Y_Q | X_Q, \textbf{V}_{T_S}^{(i)}), \qquad (9)$$

where $N_s$ is the number of Monte Carlo samples, and each $\textbf{V}_{T_S}^{(i)}$ is drawn from the posterior distribution, i.e. $\textbf{V}_{T_S}^{(i)} \sim p(\textbf{V}_{T_S} | X_S, Y_S, \mathscr{F})$. $p(Y_Q | X_Q, \textbf{V}_{T_S}^{(i)})$ is the likelihood for query samples, which has the same form as Eq. 8. To sample from the posterior, we use stochastic gradient Langevin dynamics Welling and Teh (2011) with multiple stochastic updates. Formally, we initialize the sample $\hat{\textbf{V}}_{T_S}$ and iteratively update it as,

$$\hat{\textbf{V}}_{T_S} \leftarrow \hat{\textbf{V}}_{T_S} + \sqrt{\epsilon}\,\bm{z} + \frac{\epsilon}{2} \nabla_{\hat{\textbf{V}}_{T_S}} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p(\hat{\textbf{V}}_{T_S} | \mathscr{F}), \qquad (10)$$

where $\bm{z} \sim \mathcal{N}(\bm{0}, \bm{I})$, and $\epsilon$ is a small real number representing the update step size. The gradient $\nabla_{\hat{\textbf{V}}_{T_S}} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p(\hat{\textbf{V}}_{T_S} | \mathscr{F})$ in Eq. 10 balances the effect of the knowledge and the support set on the prototype vectors. Please see Appendix B for derivation details and an intuitive explanation of its influence.

The Langevin dynamics requires a burn-in period. To speed up the convergence, we follow the previous method Qu et al. (2020) and initialize the sample as follows,

$$\hat{\textbf{V}}_{T_S} \leftarrow \{\hat{\textbf{v}}_t\}_{t \in T_S}, \qquad \hat{\textbf{v}}_t \leftarrow \textbf{m}_t + \textbf{h}_t + \Delta\textbf{h}_t - \textbf{m}, \qquad (11)$$

where $\textbf{m}$ is the mean encoding of all the samples in the support set.
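The sketch below puts Eqs. 9-11 together: each Monte Carlo chain is initialized with Eq. 11, updated a few times with Eq. 10, and the resulting prototype samples are kept for averaging the query likelihood (Eq. 9). The `log_joint` helper stands in for the support log-likelihood plus the Gaussian log-prior (up to constants); the step counts mirror Section 5.1, while the random encodings are illustrative stand-ins.

```python
# Minimal sketch of SGLD sampling of prototypes (Eqs. 9-11); all encodings
# are random stand-ins and log_joint drops additive constants.
import torch

def log_joint(V, enc_S, y_S, prior_mean):
    """Support log-likelihood (Eq. 8) plus Gaussian log-prior (Eq. 7)."""
    logits = enc_S @ V.T                                  # (N*M, N)
    ll = torch.distributions.Categorical(logits=logits).log_prob(y_S).sum()
    log_prior = -0.5 * ((V - prior_mean) ** 2).sum()
    return ll + log_prior

N, M, d = 5, 5, 768
eps, n_steps, n_mc = 0.01, 5, 10
enc_S = torch.randn(N * M, d)                 # support encodings E(x_s)
y_S = torch.arange(N).repeat_interleave(M)    # support labels
prior_mean = torch.randn(N, d)                # h_t + delta_h_t for each type
m_t = torch.randn(N, d)                       # per-type mean encodings m_t
m = enc_S.mean(dim=0)                         # global mean encoding m

samples = []
for _ in range(n_mc):
    V = (m_t + prior_mean - m).clone().requires_grad_(True)   # Eq. 11 init
    for _ in range(n_steps):                                  # Eq. 10 updates
        g, = torch.autograd.grad(log_joint(V, enc_S, y_S, prior_mean), V)
        V = (V + 0.5 * eps * g
               + eps ** 0.5 * torch.randn(N, d)).detach().requires_grad_(True)
    samples.append(V.detach())
# Query probabilities p(Y_Q | X_Q, V) are then averaged over `samples` (Eq. 9).
```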

After obtaining prototype samples from the posterior, $\log p(Y_Q | X_Q, X_S, Y_S, \mathscr{F})$ is approximated end-to-end according to Eq. 9. During the training stage, we optimize the log-likelihood of the query set and update the model parameters by gradient descent. In the prediction stage, the log-likelihood determines the probability that a query sample belongs to each event type. The training process is shown in Algorithm 1.

Input: Event type set $T$
Input: Event knowledge from FrameNet

while not converged do:
  1. Sample a subset $T_S$ from $T$ to build a FSED task
  2. Sample disjoint support and query sets for $T_S$
  3. Compute the sample encodings (Eq. 3)
  4. Compute $\textbf{m}_t$ for each $t \in T_S$
  5. Compute the knowledge encodings $\{\textbf{h}_t\}_{t \in T_S}$ (Eq. 2)
  6. Compute the knowledge offsets $\{\Delta\textbf{h}_t\}_{t \in T_S}$ (Eq. 5)
  7. Initialize the prototype vectors $\{\textbf{V}_{T_S}^{(i)}\}_{i=1}^{N_s}$ (Eq. 11)
  8. Update the prototype vectors iteratively (Eq. 10)
  9. Compute and maximize the log-likelihood of the query set (Eq. 9)

Algorithm 1: Training Process

5 Experiments

We conduct evaluation with the following goals: (1) to compare our adaptive knowledge-enhanced Bayesian meta-learning method with existing few-shot event detection methods and few-shot learning baselines; (2) to assess the effectiveness of introducing external knowledge in different $N$-way-$M$-shot settings; and (3) to provide empirical evidence that our adaptive knowledge offset can flexibly adjust the impact of the support set and prior knowledge on event prototypes, making the model more accurate and generalizable.

5.1 Experimental Settings

We evaluate our method on FewEvent (https://github.com/231sm/Low_Resource_KBP), an aggregated few-shot event detection dataset Deng et al. (2020). FewEvent combines two widely-used event detection datasets, the ACE-2005 corpus (http://projects.ldc.upenn.edu/ace/) and the TAC-KBP-2017 Event Track Data (https://tac.nist.gov/2017/KBP/Event/index.html), and adds external event types in specific domains including music, film, sports and education Deng et al. (2020). As a result, FewEvent contains 70,852 samples for 19 event types that are further divided into 100 event subtypes.

In order to match the few-shot settings, we use 88 event types covering a total of 15,681 samples to construct the experimental data. 68 event types are selected for training, 10 for validation, and the remaining 10 for testing. Note that there are no overlapping types between the training, validation and testing sets. To obtain convincing results, we perform 5 random splits of the event types into training and testing, and report the average over these splits as the final result.

The comparisons with our AKE-BML are performed along two dimensions: the sample encoder and the few-shot learner. We combine different encoders and few-shot learners to obtain different baseline models. We consider four sample encoders including CNN Kim (2014), Bi-LSTM Huang et al. (2015), DMN Kumar et al. (2016) and our trigger-attention-based sample encoder TA. For few-shot learners, we consider Matching Networks (MN) Vinyals et al. (2016) and Prototypical Networks (PN) Snell et al. (2017). We also compare to the SOTA few-shot event detection method DMN-MPN Deng et al. (2020), which uses a dynamic memory network (DMN) as the sample encoder and a memory-based prototypical network as the few-shot learner. In addition, to verify the effectiveness of our proposed method, we perform an ablation study on our model, which evaluates the model without external knowledge and without dynamic knowledge adaptation.

As a result, the following methods are compared in our experiments:

  • AKE-BML, our adaptive knowledge-enhanced Bayesian meta-learning method, which uses the TA encoder as the sample encoder.

  • KB-BML, a variant of AKE-BML without dynamic knowledge adaptation.

  • TA-BML, a variant of AKE-BML using our TA encoder but without external knowledge.

  • DMN-MPN, dynamic-memory-based prototypical network Deng et al. (2020).

  • Encoder+Learner, combinations of various sample encoders and event type learners (e.g. CNN+MN and TA+PN).

We use stochastic gradient descent Bottou (2012) as the optimizer, with a learning rate of $1 \times 10^{-5}$. The number of Monte Carlo samples $N_s$ and the update step size $\epsilon$ are set to 10 and 0.01 respectively. The number of stochastic gradient Langevin dynamics updates is set to 5. We use dropout after the sample encoder and the knowledge encoder to avoid over-fitting; the dropout rate is set to 0.5. We evaluate the performance of event detection with $F_1$ and Accuracy scores.

5.2 Main Results

| Model | 5-Way-5-Shot | 5-Way-10-Shot | 5-Way-15-Shot | 10-Way-5-Shot | 10-Way-10-Shot | 10-Way-15-Shot |
|---|---|---|---|---|---|---|
| Bi-LSTM+MN§ | 58.19/58.48 | 61.26/61.45 | 65.55/66.04 | 46.43/47.62 | 51.97/52.60 | 56.27/56.47 |
| CNN+MN§ | 59.30/60.04 | 64.81/65.15 | 68.35/68.58 | 44.85/45.80 | 50.14/50.67 | 54.13/54.49 |
| DMN+MN§ | 66.09/67.18 | 68.92/69.33 | 70.88/71.17 | 52.81/54.12 | 58.04/58.38 | 61.63/62.01 |
| TA+MN | 66.83/67.55 | 69.12/69.64 | 71.13/71.59 | 53.49/55.47 | 59.58/60.01 | 62.41/63.11 |
| Bi-LSTM+PN§ | 62.42/62.72 | 64.65/64.71 | 68.23/68.39 | 53.15/53.59 | 55.87/56.19 | 60.34/60.87 |
| CNN+PN§ | 63.69/64.89 | 69.64/69.74 | 70.42/70.52 | 51.12/51.51 | 53.80/54.01 | 57.89/58.28 |
| DMN+PN§ | 72.08/72.43 | 72.47/73.38 | 73.91/74.68 | 59.95/60.07 | 61.48/62.13 | 65.84/66.31 |
| TA+PN | 73.66/73.92 | 73.81/74.63 | 75.69/76.31 | 61.25/61.88 | 63.89/64.31 | 66.21/67.59 |
| DMN-MPN§ | 73.59/73.86 | 73.99/74.82 | 76.03/76.57 | 60.98/62.44 | 63.69/64.43 | 67.84/68.35 |
| TA-BML | 73.37/73.59 | 74.02/74.63 | 75.52/75.83 | 61.43/62.59 | 63.28/63.96 | 66.27/67.49 |
| KB-BML | 74.63/75.07 | 75.06/75.63 | 80.69/81.12 | 65.99/66.82 | 67.47/68.08 | 73.89/74.06 |
| AKE-BML | 88.99/89.36 | 90.10/91.48 | 91.40/92.34 | 84.55/84.94 | 86.03/87.73 | 87.13/87.45 |

Table 1: $F_1$ and Accuracy scores of all compared methods; each cell shows $F_1$/Accuracy. § denotes results taken directly from the original paper Deng et al. (2020), due to the unavailability of the source code.

As shown in Table 1, we compare methods on $F_1$ and Accuracy scores. We make the following observations:

  • Our full model AKE-BML outperforms all other methods on both Accuracy and $F_1$ scores across all settings. Compared with the SOTA method DMN-MPN, AKE-BML achieves a substantial improvement of 15–23 absolute $F_1$ points in all $N$-way-$M$-shot settings. This shows that our adaptive knowledge-enhanced Bayesian meta-learning method can effectively utilize external knowledge and adjust it according to the support set, thus building better prototypes of event types. Please see Appendices C and D for a detailed performance analysis over various $N$-way and $M$-shot settings.

  • With the sample encoders (Bi-LSTM, CNN, DMN and TA) fixed, prototypical networks (PN) consistently outperform matching networks (MN). DMN-MPN performs better than the PN-based methods, because the dynamic memory network can extract key information from the support set through multiple iterations. However, DMN-MPN only considers the information of a few samples in each support set, hence suffering from insufficient sample diversity, similar to the PN- and MN-based methods.

  • TA-BML performs similarly to DMN-MPN under the $N$-way-5-shot and $N$-way-10-shot settings, but slightly worse under the $N$-way-15-shot setting. One possible explanation is that when the number of samples in the support set is larger, MPN can generate higher-quality prototypes. In addition, the performance of TA-BML is not as good as KB-BML, which shows the importance of introducing external knowledge.

  • Compared with KB-BML, our full model AKE-BML can effectively correct the deviation between knowledge and event types and generate event prototypes that generalize better. Compared with TA-BML, which does not incorporate external knowledge, AKE-BML achieves an even larger performance advantage, which further demonstrates the effectiveness of external knowledge.

5.3 Case Study

We present a case study on the dynamic knowledge adaptation between the support set and the corresponding event knowledge to demonstrate our model’s ability to learn robust event prototypes.

Table 2: A case study of three event types. Words in bold indicate candidate trigger words.
Table 3: $\lambda_t$ values corresponding to different event types. The larger the $\lambda_t$, the greater the dependence of the prototype vector on the support set.

5.3.1 Predictions for Specific Cases

We select three event types as target categories to illustrate the contributions of each main component of our model. The event types are Music.Compose, Music.Sing and Film.Film_Production. The sample contexts of Music.Compose and Music.Sing are similar, while Music.Compose and Film.Film_Production share the same frame, Behind_the_scenes.

As shown in Table 2, only AKE-BML predicts correctly on all samples. TA-BML, the model without knowledge, wrongly predicts the second sample of Music.Compose to be Music.Sing, due to their similar contexts. By introducing knowledge, both KB-BML and AKE-BML avoid this error, indicating that external knowledge can enrich event information beyond the support set. For KB-BML, as Music.Compose and Film.Film_Production share the same super-ordinate frame, the prototype of Music.Compose cannot distinguish between Music.Compose and Film.Film_Production, so it wrongly classifies the third sample as Music.Compose. With our adaptive knowledge offset, AKE-BML can deal with the sample similarity and knowledge deviation issues at the same time, and thus correctly classifies all samples.

Figure 4: Visualization of event prototypes, prior knowledge, and event samples learned by AKE-BML in the 5-way-5-shot setting. The large, solid shapes denote event prototypes, the large shapes with circle outlines denote the prior knowledge, and the small shapes denote samples. Samples are marked by the color of their corresponding event types. The arrows indicate the adaptation of prior knowledge to the prototype. Note that Music.Compose and Film.Film-Production share the same frame Behind_the_scenes.

5.3.2 Visualization of Prototypes

We use Latent Dirichlet Allocation (LDA) Blei et al. (2003) to reduce the dimensionality of the prototypes, sample encodings and prior knowledge encodings. Figure 4 visualizes five event type prototypes (large solid shapes), their aligned frames in FrameNet (large shapes with circle outlines), and some corresponding samples (small shapes). Each event type and its samples are coded with the same color.

In general, the samples and prototypes belonging to one event type are close in the space, and different event types are far away from each other. Prior knowledge is distributed across different regions of the space, which roughly determines the distribution of event prototypes. For example, the samples of Life.Pregnancy and Sports.Fair-Play are close to their respective event prototypes. Meanwhile, the distance between their prior knowledge encodings is large, making their prototypes easily distinguishable.

It can also be seen that the event prototypes are closer to their samples than to the prior knowledge, which reflects the benefits of our proposed learnable knowledge offset. The visualization demonstrates the effectiveness of introducing external knowledge and our adaptive knowledge offset’s ability to balance the impact of the support set and prior knowledge on the event prototypes.

5.3.3 $\lambda_t$ of Different Event Types

As shown in Eq. 5, we use the learnable parameter $\bm{\lambda}_t$ to generate knowledge offsets. $\bm{\lambda}_t$ accounts for the deviation of the prior knowledge (i.e. a frame) from the event type it represents, and adaptively corrects this deviation using information from the support set. When the frame corresponding to the event type accurately expresses its semantics, the $\lambda_t$ value should be small. When the knowledge is the super-ordinate frame of the event type (i.e., the frame cannot accurately describe the event semantics), the $\lambda_t$ value should be large, so that the support set can be used to modify the prior knowledge and ensure that the prototype precisely represents the current event type.

Table 3 shows four different event types, their corresponding frames and $\lambda_t$ values. The $\lambda_t$ of Conflict.Attack is small, at 0.132, as the event type Conflict.Attack closely matches the frame Attack. The event type Contact.Letter-Communication matches the frame Communication. Communication does not contain the semantics of "by writing letters", but its core semantics is the same as that of Contact.Letter-Communication. Therefore, its $\lambda_t$ is also small, at 0.228, though still larger than that of Conflict.Attack. The event types Film.Film-Production and Music.Compose share the same super-ordinate frame Behind_the_scenes as prior knowledge, but the semantics of Behind_the_scenes is too abstract for them. Accordingly, the $\lambda_t$ values corresponding to these two event types are relatively large: 0.386 for Film.Film-Production and 0.421 for Music.Compose.

The above cases demonstrate that our model is able to balance the influence of the support set and the knowledge on event prototypes through $\lambda_t$, and consequently obtain highly accurate and generalizable prototypes.

6 Conclusion

In this paper, we proposed an Adaptive Knowledge-enhanced Bayesian Meta-Learning (AKE-BML) method for few-shot event detection. We alleviate the insufficient sample diversity problem in few-shot learning by leveraging the external knowledge base FrameNet to learn prototype representations for event types. We further tackle the uncertainty and incompleteness issues in knowledge coverage with a novel knowledge adaptation mechanism.

The comprehensive experimental results demonstrate that our proposed method substantially outperforms state-of-the-art methods, achieving an improvement of at least 15 absolute $F_1$ points. In the future, we plan to extend our proposed AKE-BML method to the few-shot event extraction task, which considers both event detection and argument extraction. We also plan to explore the zero-shot and incremental event extraction scenarios.

7 Acknowledgement

Research in this paper was partially supported by the Fundamental Research Funds for the Central Universities (2242021k10011).

References

Appendix

Appendix A FrameNet

An important problem in the few-shot event detection task is the insufficient diversity of support-set samples. There are only a few labeled samples in the support set, which leaves the model unable to construct high-quality prototype features of event types. To address this problem, we introduce FrameNet Baker et al. (1998) as an external knowledge base of event types. FrameNet is a linguistic resource storing information about lexical and predicate-argument semantics. Each frame in FrameNet can be taken as the semantic frame of an event type Liu et al. (2016), which can be used as background knowledge to assist event detection Liu et al. (2016); Fillmore et al. (2006). Figure 2 shows an example frame defining Attack. We can see the arguments involved in an Attack event and their roles. The linguistic units (LUs) of the frame Attack are the possible trigger words for the corresponding event. The frame is an important complementary source of knowledge to the support set. We match a frame in FrameNet to each event type, based on the event name, as its knowledge. In practice, FrameNet does not provide complete coverage of all event types, nor does every event type have an exactly matching frame in FrameNet. For event types that cannot be exactly matched, we assign the frame corresponding to their super-ordinate event. For example, there is no corresponding frame for Contact.Online-Chat, so we assign it the frame Chatting, which corresponds to the event type Contact.Chat.

Appendix B Gradient of posterior distribution

To show how the prototype vector changes after adding the knowledge shift, we expand the gradient $\nabla_{\hat{\textbf{V}}_{T_S}} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p(\hat{\textbf{V}}_{T_S} | \mathscr{F})$ in the iteration

$$\hat{\textbf{V}}_{T_S} \leftarrow \hat{\textbf{V}}_{T_S} + \sqrt{\epsilon}\,\bm{z} + \frac{\epsilon}{2} \nabla_{\hat{\textbf{V}}_{T_S}} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p(\hat{\textbf{V}}_{T_S} | \mathscr{F}).$$

For ease of explanation, we only calculate the gradient of the prototype vector $\hat{\textbf{v}}_t$. We denote the gradient of the original posterior distribution by $\textbf{g}^o_{\hat{\textbf{v}}_t}$ and the gradient of the knowledge-shifted posterior distribution by $\textbf{g}^s_{\hat{\textbf{v}}_t}$. We first calculate $\textbf{g}^o_{\hat{\textbf{v}}_t}$:

$$\begin{aligned} \textbf{g}^o_{\hat{\textbf{v}}_t} &= \nabla_{\hat{\textbf{v}}_t} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p^o(\hat{\textbf{V}}_{T_S} | \mathscr{F}) \\ &= \nabla_{\hat{\textbf{v}}_t} \left[ \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S}) + \log p^o(\hat{\textbf{V}}_{T_S} | \mathscr{F}) \right] \\ &= \nabla_{\hat{\textbf{v}}_t} \log \prod_{s \in S, y_s = t} p(y_s | x_s, \hat{\textbf{v}}_t) + \nabla_{\hat{\textbf{v}}_t} \log p^o(\hat{\textbf{v}}_t | \mathscr{F}) \\ &= \textbf{g}^{o,l}_{\hat{\textbf{v}}_t} + \textbf{g}^{o,p}_{\hat{\textbf{v}}_t}, \end{aligned} \qquad (13)$$

where $\textbf{g}^{o,l}_{\hat{\textbf{v}}_t} = \nabla_{\hat{\textbf{v}}_t} \log \prod_{s \in S, y_s = t} p(y_s | x_s, \hat{\textbf{v}}_t)$ and $\textbf{g}^{o,p}_{\hat{\textbf{v}}_t} = \nabla_{\hat{\textbf{v}}_t} \log p^o(\hat{\textbf{v}}_t | \mathscr{F})$. The prior distribution is

$$p^o(\hat{\textbf{v}}_t | \mathscr{F}) = \mathcal{N}(\hat{\textbf{v}}_t | \textbf{h}_t, \bm{I}) = (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2}(\hat{\textbf{v}}_t - \textbf{h}_t)^2}, \qquad (14)$$

and the gradient of the logarithm of the prior distribution with respect to $\hat{\textbf{v}}_t$ is:

$$\textbf{g}^{o,p}_{\hat{\textbf{v}}_t} = \left(\log (2\pi)^{-\frac{d}{2}}\right)(\textbf{h}_t - \hat{\textbf{v}}_t) = C(\textbf{h}_t - \hat{\textbf{v}}_t), \qquad (15)$$

where $C = \log(2\pi)^{-\frac{d}{2}}$ is a constant and $d$ is the dimension of the prototype. The log-likelihood on the support set is

$$\log \prod_{s \in S, y_s = t} p(y_s | x_s, \hat{\textbf{v}}_t) = \sum_{s \in S, y_s = t} \log \frac{\exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_t)}{\sum_{t' \in T_S} \exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_{t'})}. \qquad (16)$$

The gradient of the log-likelihood with respect to $\hat{\textbf{v}}_t$ is

$$\begin{aligned} \textbf{g}^{o,l}_{\hat{\textbf{v}}_t} &= \nabla_{\hat{\textbf{v}}_t} \sum_{s \in S, y_s = t} \log \frac{\exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_t)}{\sum_{t' \in T_S} \exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_{t'})} \\ &= \sum_{s \in S, y_s = t} \left[ \nabla_{\hat{\textbf{v}}_t} (\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_t) - \nabla_{\hat{\textbf{v}}_t} \log \sum_{t' \in T_S} \exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_{t'}) \right] \\ &= \sum_{s \in S, y_s = t} \left[ \bm{\mathcal{E}}(x_s) - \frac{\bm{\mathcal{E}}(x_s) \exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_t)}{\sum_{t' \in T_S} \exp(\bm{\mathcal{E}}(x_s) \cdot \hat{\textbf{v}}_{t'})} \right] \\ &= \sum_{s \in S, y_s = t} (1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s), \end{aligned} \qquad (17)$$

where $p_s^{(t)} = p(y_s | x_s, \hat{\textbf{v}}_t)$ is the probability of correctly classifying sample $s$ in the support set. Then we get

$$\textbf{g}^o_{\hat{\textbf{v}}_t} = \sum_{s \in S, y_s = t} (1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s) + C(\textbf{h}_t - \hat{\textbf{v}}_t). \qquad (18)$$
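As a sanity check on Eq. 17 (our addition, not part of the paper), the snippet below compares the closed-form gradient with the one computed by autograd for the restricted log-likelihood over samples of type $t$:

```python
# Numerical check that autograd reproduces Eq. 17:
# grad_{v_t} sum_{s: y_s=t} log p(y_s|x_s, V) = sum_s (1 - p_s^(t)) E(x_s).
import torch

N, M, d = 4, 3, 16
enc = torch.randn(N * M, d)                    # support encodings E(x_s)
y = torch.arange(N).repeat_interleave(M)       # support labels
V = torch.randn(N, d, requires_grad=True)      # prototype vectors

t = 0
mask = y == t
logp = torch.log_softmax(enc @ V.T, dim=-1)    # log p(y_s = t' | x_s, V)
logp[mask][:, t].sum().backward()              # restricted log-likelihood

p_t = torch.softmax(enc @ V.T, dim=-1)[mask][:, t].detach()    # p_s^{(t)}
closed_form = ((1 - p_t).unsqueeze(1) * enc[mask]).sum(dim=0)  # Eq. 17
print(torch.allclose(V.grad[t], closed_form, atol=1e-5))       # True
```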

Next we calculate $\textbf{g}^s_{\hat{\textbf{v}}_t}$. The only difference from the calculation of $\textbf{g}^o_{\hat{\textbf{v}}_t}$ is that $\textbf{g}^s_{\hat{\textbf{v}}_t}$ uses the knowledge-shifted prior distribution

$$p(\hat{\textbf{v}}_t | \mathscr{F}) = \mathcal{N}(\hat{\textbf{v}}_t | \textbf{h}_t + \Delta\textbf{h}_t, \bm{I}) = (2\pi)^{-\frac{d}{2}} e^{-\frac{1}{2}(\hat{\textbf{v}}_t - \textbf{h}_t - \Delta\textbf{h}_t)^2}. \qquad (19)$$

As with the original posterior gradient, we have

$$\textbf{g}^s_{\hat{\textbf{v}}_t} = \nabla_{\hat{\textbf{v}}_t} \log \prod_{s \in S, y_s = t} p(y_s | x_s, \hat{\textbf{v}}_t) + \nabla_{\hat{\textbf{v}}_t} \log p(\hat{\textbf{v}}_t | \mathscr{F}) = \textbf{g}^{s,l}_{\hat{\textbf{v}}_t} + \textbf{g}^{s,p}_{\hat{\textbf{v}}_t}, \qquad (20)$$

where $\textbf{g}^{s,l}_{\hat{\textbf{v}}_t} = \textbf{g}^{o,l}_{\hat{\textbf{v}}_t} = \sum_{s \in S, y_s = t} (1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s)$. The gradient of the logarithm of the knowledge-shifted prior distribution with respect to $\hat{\textbf{v}}_t$ is:

$$\begin{aligned} \textbf{g}^{s,p}_{\hat{\textbf{v}}_t} &= \left(\log (2\pi)^{-\frac{d}{2}}\right)(\textbf{h}_t + \Delta\textbf{h}_t - \hat{\textbf{v}}_t) \\ &= C\left(\textbf{h}_t + \bm{\lambda}_t \odot (\textbf{m}_t - \textbf{h}_t) - \hat{\textbf{v}}_t\right) \\ &= C\left((\mathds{1} - \bm{\lambda}_t) \odot \textbf{h}_t - \hat{\textbf{v}}_t\right) + C\,\bm{\lambda}_t \odot \textbf{m}_t, \end{aligned} \qquad (21)$$

where $\mathds{1}$ is a $|\textbf{h}_t|$-dimensional vector whose elements are all 1. Then we get

$$\textbf{g}^s_{\hat{\textbf{v}}_t} = \textbf{g}^{s,l}_{\hat{\textbf{v}}_t} + \textbf{g}^{s,p}_{\hat{\textbf{v}}_t} = \sum_{s \in S, y_s = t} (1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s) + C\left((\mathds{1} - \bm{\lambda}_t) \odot \textbf{h}_t - \hat{\textbf{v}}_t\right) + C\,\bm{\lambda}_t \odot \textbf{m}_t. \qquad (22)$$

Substituting $\textbf{m}_t = \frac{1}{M} \sum_{s \in S, y_s = t} \bm{\mathcal{E}}(x_s)$ into the above formula, we get

$$\textbf{g}^s_{\hat{\textbf{v}}_t} = \sum_{s \in S, y_s = t} \left[(1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s) + \frac{C\,\bm{\lambda}_t}{M} \odot \bm{\mathcal{E}}(x_s)\right] + C\left((\mathds{1} - \bm{\lambda}_t) \odot \textbf{h}_t - \hat{\textbf{v}}_t\right). \qquad (23)$$

Note that when knowledge adaptation is not used, the prior knowledge distribution of the prototype takes the following form,

$$p^o(\textbf{V}_{T_S} | \mathscr{F}) = \prod_{t \in T_S} p^o(\textbf{v}_t | \textbf{h}_t) = \prod_{t \in T_S} \mathcal{N}(\textbf{v}_t | \textbf{h}_t, \bm{I}). \qquad (24)$$

To intuitively show the influence of the knowledge-adapted posterior distribution on the prototype vector, we expand the gradient $\nabla_{\hat{\textbf{V}}_{T_S}} \log p(Y_S | X_S, \hat{\textbf{V}}_{T_S})\, p(\hat{\textbf{V}}_{T_S} | \mathscr{F})$ in Eq. 10. For ease of explanation, we only calculate the gradient of the prototype vector $\hat{\textbf{v}}_t$. Denote the gradient of the original posterior distribution without knowledge adaptation by $\textbf{g}^o_{\hat{\textbf{v}}_t}$,

$$\textbf{g}^o_{\hat{\textbf{v}}_t} = \sum_{s \in S, y_s = t} (1 - p_s^{(t)})\, \bm{\mathcal{E}}(x_s) + C(\textbf{h}_t - \hat{\textbf{v}}_t), \qquad (25)$$

and the gradient of the knowledge-adapted posterior distribution by $\textbf{g}^s_{\hat{\textbf{v}}_t}$ from Eq. 23.

Comparing Eq. 23 and Eq. 25, it can be seen that the posterior distribution without knowledge adaptation cannot dynamically balance the influence of the knowledge and the support set on the prototype vector, whereas the knowledge-adapted posterior distribution can adjust their contributions to the prototype vector through $\bm{\lambda}_t$. The parameters in Eq. 6 are updated via the log-likelihood on the query set. This allows the model to reasonably weight the knowledge and the support set, and obtain prototype vectors that generalize better.

Appendix C $M$-shot Evaluation

In this section, we illustrate the effectiveness of adaptive knowledge-enhanced Bayesian meta-learning under different $M$-shot settings: $N$-way-5-shot, $N$-way-10-shot and $N$-way-15-shot. As shown in Table 1 in the main paper, as $M$ increases, the performance of all models improves, which shows that increasing the number of samples in the support set provides more pertinent event type-related features. At the same time, it can be seen that, going from 15-shot to 5-shot, the previous methods suffer a significantly larger performance degradation than AKE-BML. This observation shows our model's strong robustness against low sample diversity, owing to the incorporation of external knowledge.

The performance of KB-BML is close to that of DMN-MPN in the $N$-way-5-shot and $N$-way-10-shot settings, and better in the $N$-way-15-shot setting. This can be attributed to two factors: (1) the introduction of knowledge improves the generalization of event prototypes; and (2) increasing the number of samples reduces the impact of the deviation between knowledge and event types. When the support set is sufficiently large, the samples in the support set can compensate for the deviation between knowledge and event types, and the knowledge can also improve the generalization of the prototype vectors. However, when $M$ is small, the deviation between knowledge and event types affects the quality of the prototype vectors.

AKE-BML can well balance the effects of samples and knowledge on the event type prototypes. It can be seen that when $M$ is small, the performance of AKE-BML does not decline as quickly as that of other models, which also proves the effectiveness of knowledge in dealing with the problem of insufficient diversity of the support set. At the same time, compared with KB-BML, our adaptive knowledge offset can effectively use the information in the support set to correct the knowledge deviation.

Appendix D $N$-Way Evaluation

Figure 5: $N$-way evaluation ($N = 2, \ldots, 10$) with fixed shot numbers. (a) $N$-way-5-shot. (b) $N$-way-15-shot.

Figure 5 illustrates model performance with respect to different way values (i.e. $N$), while fixing the shot values. It can be seen from the figure that when $N$ increases, the performance of previous models decreases faster than that of AKE-BML, which shows that those models, relying only on the support set, cannot generate sufficiently distinguishable event prototypes. The performance of KB-BML also declines significantly as $N$ increases. This is because many event types can only be partially aligned in FrameNet, to a super-ordinate frame, which makes the event prototypes indistinguishable for similar event types.

On the contrary, the performance of AKE-BML does not decrease significantly when $N$ increases, which shows that our adaptive knowledge-enhanced Bayesian meta-learning method can enhance the distinguishability of the prototype vectors through the learnable knowledge offset. These results indicate that our method is more robust to changes in the number of ways.