SHINE: Syntax-augmented Hierarchical Interactive Encoder for
Zero-shot Cross-lingual Information Extraction

1 Introduction

2 Related Work

3 Methodology

4 Downstream Tasks

5 Experiments

6 Conclusion

Limitations

References

Appendix A Appendices

3.1 Problem Definition

3.2 Basic Encoder

3.3 Multi-level Interaction Network

3.4 Constituency Structure Modeling

4.1 Named Entity Recognition

4.2 Relation Extraction

4.3 Event Extraction

5.1 Datasets

5.2 Evaluation Metrics

5.3 Baselines and Implementation Details

5.4 Results and Comparison

5.5 Analysis

A.1 Datasets

A.2 Evaluation Metrics

A.3 Baseline Models

A.4 Implementation Details

A.5 Case Study

Span Representation Construction

Sub-span Importance Construction

Ablation Study

Case Study

CoNLL

WikiAnn

ACE 2005

Abstract

Jun-Yu Ma¹, Jia-Chen Gu¹, Zhen-Hua Ling¹, Quan Liu²,³, Cong Liu¹,³, Guoping Hu²,³
¹National Engineering Research Center of Speech and Language Information Processing,
University of Science and Technology of China, Hefei, China
²State Key Laboratory of Cognitive Intelligence ³iFLYTEK Research, Hefei, China
{mjy1999}@mail.ustc.edu.cn, {gujc,zhling}@ustc.edu.cn,
{quanliu,congliu2,gphu}@iflytek.com

Zero-shot cross-lingual information extraction (IE) aims at constructing an IE model for some low-resource target languages, given annotations exclusively in some rich-resource languages. Recent studies based on language-universal features have shown their effectiveness and are attracting increasing attention. However, prior work has neither explored the potential of establishing interactions between language-universal features and contextual representations nor incorporated features that can effectively model constituent span attributes and relationships between multiple spans. In this study, a syntax-augmented hierarchical interactive encoder (SHINE) is proposed to transfer cross-lingual IE knowledge. The proposed encoder is capable of interactively capturing complementary information between features and contextual information, to derive language-agnostic representations for various IE tasks. Concretely, a multi-level interaction network is designed to hierarchically interact the complementary information to strengthen domain adaptability. Besides, in addition to the well-studied syntax features of part-of-speech and dependency relation, a new syntax feature of constituency structure is introduced to model the constituent span information which is crucial for IE. Experiments across seven languages on three IE tasks and four benchmarks verify the effectiveness and generalization ability of the proposed method.

Information Extraction (IE) aims at extracting structured information from unstructured texts to classify and reconstruct massive amounts of content automatically. It covers a great variety of tasks, such as named entity recognition (NER) Li et al. (2020); Ma et al. (2023), relation extraction Abad et al. (2017); Bekoulis et al. (2018) and event extraction Zhou et al. (2016); Chen et al. (2017). Recently, thanks to the breakthroughs of deep learning, neural network models have achieved significant improvements in various IE tasks Luan et al. (2019); Lin et al. (2020); Chen et al. (2022). However, it is too expensive to annotate a large amount of data in low-resource languages for supervised IE training. Therefore, cross-lingual IE under low-resource settings Cabral et al. (2020); Yarmohammadi et al. (2021); Agirre (2022) has attracted considerable attention. In this paper, we study zero-shot cross-lingual IE, where annotated training data is available only in the source language, but not in the target language.

Figure 1: An example of a constituency tree, including: (1) an event trigger (*deployed*) and its argument (*17000 U.S. Army soldiers*), and (2) two entities corresponding to subject (*17000 U.S. Army soldiers*) and object (*Persian Gulf*) respectively.

Existing methods on reducing the need for annotated data in cross-lingual IE tasks can be generally categorized into shared representation space-based Tsai et al. (2016); Wu and Dredze (2019); M’hamdi et al. (2019), translation-based Mayhew et al. (2017); Yarmohammadi et al. (2021); Lou et al. (2022) and language-universal features-based methods Huang et al. (2018); Subburathinam et al. (2019); Ahmad et al. (2021). In this work, we study the last one. Subburathinam et al. (2019) have verified that language-universal symbolic and distributional representations are complementary for cross-lingual structure transfer. Language-universal features-based methods aggregate universal and complementary information to obtain schema consistency across languages for effective knowledge transfer Liu et al. (2019); Subburathinam et al. (2019); Ahmad et al. (2021). For example, Ahmad et al. (2021) utilized part-of-speech (POS), entity type and dependency features to construct a cross-lingual transfer framework.

Although language-universal features-based methods have achieved competitive performance on zero-shot cross-lingual IE, the features used in previous work are still incapable of capturing sufficient contextualized semantics. Existing methods have never explored the potential of establishing interactions between these features and contextual representations, which leads to a representation gap between them and information loss. Besides, previous studies focus on employing dependency structures for various IE tasks Subburathinam et al. (2019); Ahmad et al. (2021), describing only the dependency relationships between two words. Essentially, most IE tasks aim at identifying and categorizing phrase spans, thus the span-level information such as the constituent span attributes and the relationships between multiple spans are crucial. However, capturing this information is well beyond the scope of features studied extensively in prior work Ahmad et al. (2021). Here, we give an example to illustrate the importance of constituent span information as shown in Figure 1. For event extraction, given an event trigger deployed and two argument candidates U.S. Army and 17000 U.S. Army soldiers, the former candidate may be recognized as an argument without any guidance. However, the latter candidate is a constituent span which shares the same parent node labeled VP with deployed, then it can be recognized correctly as an argument under this signal. This property is common in many languages, so it is beneficial to be able to model this universal information when transferred across languages.

On account of the above issues, a syntax-augmented hierarchical interactive encoder (SHINE) is proposed in this paper to transfer cross-lingual IE knowledge. It is capable of interactively capturing complementary information between language-universal features and contextual information, so as to derive language-agnostic contextualized representations for various IE tasks. On the one hand, a multi-level interaction network is designed to encourage the distributions of features and contextual representations to approximate each other at three levels via an interactive loss mechanism. Specifically, the global-level interaction operates on the entire sentence, the local-level one operates on sub-spans in a sentence, and the task-level one operates on task-related mentions. In this way, the model can hierarchically interact the complementary information to strengthen its domain adaptability. Features of POS, dependency relation and entity type used in previous studies are adopted to verify the effectiveness of the proposed method (although entity type is not formally a syntax feature, it is still adopted in this work following Ahmad et al. (2021)). Additionally, a new syntax feature of constituency structure is introduced to explicitly utilize the span-level information in the text. Considering the overlap between constituent spans, these spans are first converted into word-level embeddings. Then a frequency matrix is designed where each element represents the number of occurrences of a sub-span in all constituent spans. These not only enrich the attribute information of constituent spans, but also model the importance of each sub-span, so that task-related spans can be accurately captured for effective cross-language transfer.

To measure the effectiveness of the proposed method and to test its generalization ability, SHINE is evaluated on three IE tasks including NER, relation extraction and event extraction. Experiments across seven languages on four benchmarks are conducted for evaluation. Results show that SHINE achieves highly promising performance, verifying the importance of interactions between universal features and contextual representations, as well as the effectiveness of constituency structure for IE. To facilitate others to reproduce our results, we will publish all source code later.

In summary, our contributions in this paper are three-fold: (1) A multi-level interaction framework is proposed to interact the complementary structured information contained in language-universal features and contextual representations. (2) The constituency feature is first introduced to explicitly utilize the constituent span information for cross-lingual IE. (3) Experiments across seven languages on three tasks and four benchmarks verify the effectiveness and generalization ability of SHINE.

Many researchers have investigated shared representation space-based, translation-based and language-universal features-based methods for zero-shot cross-lingual IE tasks. Shared representation space-based models capture features of labeled source-language data using multilingual pre-trained models, and then they are applied to the target languages directly Tsai et al. (2016); Wu and Dredze (2019); M’hamdi et al. (2019). However, this type of methods can transfer only superficial information due to representation discrepancy between source and target languages. Besides, translation-based methods translate texts from the source language to the target languages, and then project annotations accordingly to create silver annotations Mayhew et al. (2017); Lou et al. (2022). But noise from translation and projection might degrade performance. Language-universal features-based methods are effective in cross-lingual IE by utilizing universal and complementary information to learn multi-lingual common space representations Liu et al. (2019); Subburathinam et al. (2019); Ahmad et al. (2021). Liu et al. (2019) utilized GCN Kipf and Welling (2017) to learn representations based on universal dependency parses to improve cross-lingual transfer for IE. Subburathinam et al. (2019) exploited other features such as POS and entity type for embedding. Ahmad et al. (2021) employed transformer Vaswani et al. (2017) to fuse structural information to learn the dependencies between words with different syntactic distances.

Compared with Subburathinam et al. (2019) and Ahmad et al. (2021) that are the most relevant to this work, the main differences are highlighted. These methods have never explored the potential of establishing interactions between language-universal features and contextual representations. Besides, the well-studied features focus on word attributes and the relationship between words. They cannot model span-level information such as the constituent span attributes and the relationships between multiple spans which are crucial for cross-lingual IE. To the best of our knowledge, this paper makes the first attempt to interactively capture complementary information between universal features and contextual information, and to introduce constituency structure to explicitly utilize span-level information.

In this section, we present the detailed framework of the proposed SHINE. For one thing, a multi-level interaction network is introduced. Specifically, three classes of adaptation methods are adopted to interact language-universal with contextual information via an interactive loss mechanism. In order to verify the effectiveness of the proposed framework, features such as POS, dependency relation and entity type used in previous studies are adopted. Furthermore, a new syntax feature of constituency structure is introduced to make up for the deficiencies of the existing work in explicitly modeling and utilizing the constituent span information.

Denote one sentence as $\boldsymbol{x}=\{x_{i}\}_{i=1}^{L}$ with its language-universal features $\boldsymbol{x^{l}}$ and annotations $\boldsymbol{y}$, where $x_{i}$ denotes a word and $L$ denotes the length of the sentence. An IE model generates predictions $\boldsymbol{\bar{y}}$. The labeled training data $\mathcal{D}_{\text{train}}^{S}=\{(\boldsymbol{x},\boldsymbol{x^{l}},\boldsymbol{y})\}$ is available for the source language, while only labeled testing data $\mathcal{D}_{\text{test}}^{T}=\{(\boldsymbol{x},\boldsymbol{x^{l}},\boldsymbol{y})\}$ is available for the target language. Formally, zero-shot cross-lingual IE aims at achieving good performance on $\mathcal{D}_{\text{test}}^{T}$ by leveraging $\mathcal{D}_{\text{train}}^{S}$.

The basic encoder in this paper consists of an mBERT Devlin et al. (2019) and an L-layer Transformer denoted as $\operatorname{Trans-L}$ . They are utilized to extract the contextual representations $\boldsymbol{h^{c}}$ and language-universal representations $\boldsymbol{h^{l}}$ respectively.

Following Ahmad et al. (2021), given a sentence $\boldsymbol{x}$ of length L from source language data $\mathcal{D}_{\text{train}}^{S}$ , three well-studied language-universal features are denoted as one-hot vectors $\boldsymbol{x^{p}}$ for POS, $\boldsymbol{x^{d}}$ for dependency relation, and $\boldsymbol{x^{e}}$ for entity type. Besides, the proposed constituent type vector $\boldsymbol{x^{c}}$ is not a one-hot vector and its construction is described in Section 3.4. These calculations are formulated as:

$$\boldsymbol{x^{l}}=[\boldsymbol{x^{p}};\boldsymbol{x^{d}};\boldsymbol{x^{e}};\boldsymbol{x^{c}}], \tag{1}$$
$$\boldsymbol{h^{c}}=\operatorname{mBERT}(\boldsymbol{x}), \tag{2}$$
$$\boldsymbol{h^{l}}=\operatorname{Trans-L}(\boldsymbol{x^{l}}), \tag{3}$$

where $\boldsymbol{x^{l}}$ is the concatenation of four language-universal features. $\boldsymbol{h^{c}}=\{\boldsymbol{h}_{i}^{\boldsymbol{c}}\}_{i=1}^{L}$ and $\boldsymbol{h}^{\boldsymbol{l}}=\{\boldsymbol{h}_{i}^{\boldsymbol{l}}\}_{i=1}^{L}$ . $\boldsymbol{h}_{i}^{\boldsymbol{c}}$ and $\boldsymbol{h}_{i}^{\boldsymbol{l}}$ are representations of $x_{i}$ and $\boldsymbol{x}_{i}^{\boldsymbol{l}}$ respectively.

Each type of language-universal features emphasizes its respective property during cross-language transfer, thus versatile abilities are exhibited when various features are available. For example, POS can help augment the connection between words of the same part of speech across different languages, and dependency models the relationship between words to mitigate the word-order problem Liu et al. (2019). However, there is usually a semantic gap between these features and the contextual representations, leading to information loss since no explicit interactions are established between them when fusing these features.

In this section, a multi-level interaction framework is designed to hierarchically interact the contextual representations $\boldsymbol{h^{c}}$ and the language-universal representations $\boldsymbol{h^{l}}$ at the global-, local- and task-level respectively, via an interactive loss mechanism. As Figure 2 depicts, the global-level interaction operates over the entire sentence to enhance the interaction between $\boldsymbol{h^{c}}$ and $\boldsymbol{h^{l}}$ as a whole. Besides, the local-level divides a sentence into fixed-length sub-spans $\boldsymbol{X^{s}}=\{\boldsymbol{X}_{i}^{\boldsymbol{s}}\}_{i=1}^{T}$ where $\boldsymbol{X_{i}^{s}}=\{x_{j}\}_{j=i}^{i+P-1}$ , enhancing attention to span information. Here P is the length of each span and T is the number of all sub-spans. Lastly, the task-level interaction utilizes task-related mentions (such as entities, event triggers, etc.) to strengthen the model adaptation ability to different tasks. In this way, various syntax features and contextual information can directly interact with each other at different levels, so that the representation capability can be enhanced at both word-level and span-level. To measure the distribution discrepancy of two different random variables and effectively enable information sharing, a symmetrized Kullback–Leibler (KL) divergence Pérez-Cruz (2008) is employed following Jiang et al. (2020) and Wang et al. (2021) at each level of interaction. ${\rm KL}(P\|Q)=\sum\nolimits_{k}p_{k}\log\left(p_{k}/q_{k}\right)$ denotes the KL-divergence of two discrete distributions $P$ and $Q$ with the associated parameters of $p_{k}$ and $q_{k}$ , respectively. These calculations are formulated as:

$$\mathcal{L}_{g}={\rm KL}(\boldsymbol{h^{c}}\|\boldsymbol{h^{l}})+{\rm KL}(\boldsymbol{h^{l}}\|\boldsymbol{h^{c}}), \tag{4}$$
$$\mathcal{L}_{l}=\frac{1}{T}\sum_{i=1}^{T}\left[{\rm KL}(\boldsymbol{H_{i}^{c}}\|\boldsymbol{H_{i}^{l}})+{\rm KL}(\boldsymbol{H_{i}^{l}}\|\boldsymbol{H_{i}^{c}})\right], \tag{5}$$
$$\mathcal{L}_{t}={\rm KL}(\boldsymbol{H_{t}^{c}}\|\boldsymbol{H_{t}^{l}})+{\rm KL}(\boldsymbol{H_{t}^{l}}\|\boldsymbol{H_{t}^{c}}), \tag{6}$$

where $\boldsymbol{H_{i}^{c}}=\{\boldsymbol{h}_{j}^{\boldsymbol{c}}\}_{j=i}^{i+P-1}$ and $\boldsymbol{H_{i}^{l}}=\{\boldsymbol{h}_{j}^{\boldsymbol{l}}\}_{j=i}^{i+P-1}$ . $\boldsymbol{H_{t}^{c}}$ and $\boldsymbol{H_{t}^{l}}$ are the task-related mention representations from mBERT and Transformer respectively.
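The three interaction losses in Eqs. (4)-(6) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not specify how the representations are turned into distributions, so a softmax over the hidden dimension and an average over tokens are assumed here, and the sub-spans are taken as stride-1 windows so that $T=L-P+1$. The names `sym_kl`, `interaction_losses`, `span_len` and `mention` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(P||Q) = sum_k p_k log(p_k / q_k), averaged over tokens (assumption)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def sym_kl(h_c, h_l):
    """Symmetrized KL between two (n_tokens, d) representation blocks."""
    p, q = softmax(h_c), softmax(h_l)  # assumption: softmax over the hidden dimension
    return kl(p, q) + kl(q, p)

def interaction_losses(h_c, h_l, span_len, mention):
    """Return (L_g, L_l, L_t) for one sentence.

    h_c, h_l : (L, d) contextual and language-universal representations
    span_len : P, the fixed sub-span length
    mention  : inclusive (start, end) token offsets of a task-related mention
    """
    L = h_c.shape[0]
    loss_g = sym_kl(h_c, h_l)                      # Eq. (4): whole sentence
    T = L - span_len + 1                           # assumption: stride-1 windows
    loss_l = sum(sym_kl(h_c[i:i + span_len], h_l[i:i + span_len])
                 for i in range(T)) / T            # Eq. (5): average over sub-spans
    s, e = mention
    loss_t = sym_kl(h_c[s:e + 1], h_l[s:e + 1])    # Eq. (6): task-related mention
    return loss_g, loss_l, loss_t
```

Each loss is zero when the two representations coincide and grows as their distributions diverge, which is the property the interactive loss relies on.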

As aforementioned, previous studies focus on encoding dependency structures, which describe only the dependency relationships between two words. However, the attributes of a constituent span in text and the relationships between multiple spans are crucial to many IE tasks. To this end, we introduce the constituency structure to construct span representation and to model the importance of each sub-span, to explicitly utilize span-level information to accurately find task-related spans for effective cross-language transfer.

Since the constituent spans overlap each other, we first convert each span into a series of word-level one-hot vectors with BIO annotations (such as NP $\rightarrow$ B-NP, I-NP) Sang (2002). Then, these vectors are summed to derive $\boldsymbol{x^{c}}\in\mathbb{R}^{L\times C}$ mentioned in Section 3.2, where $L$ is the length of the sentence and $C$ is the number of the constituent types converted to BIO form. For example, given a sentence “they have received deployment orders”, we can derive the annotated constituent spans “(0,0,NP), (1,4,VP), (2,4,VP), (3,4,NP), (0,4,S)” with the Stanford CoreNLP toolkit Manning et al. (2014), where (1,4,VP) denotes that the span constructed with the 1st to 4th words in the sentence is a VP, i.e., a verb phrase. Then “deployment orders” is represented as “B-NP I-NP”. The details are shown in Table 1. The count-based representation can reflect the depth of words in the constituency tree to a certain extent, and deeper words have larger counts. In this way, the relationship between each word and each constituent type can be constructed to enrich representation information.

Table 1: An example of the span representation construction for constituent spans in one sentence. Each element denotes the number of times a word receives a given BIO constituent tag across all spans. For instance, (1,4,VP) and (2,4,VP) both contain “orders” as a non-initial word, so the entry (orders, I-VP) is 2.
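The count-based construction described above can be sketched as follows, using the example sentence and spans from the text. The function name `constituent_counts` is hypothetical, and for brevity the BIO tag inventory is derived from the spans of this one sentence, whereas in practice $C$ would cover the full set of constituent types.

```python
import numpy as np

def constituent_counts(words, spans):
    """Sum word-level BIO one-hot vectors over all constituent spans.

    spans: (start, end, label) triples with inclusive, 0-based offsets.
    Returns the list of BIO tags and an (L, C) integer count matrix.
    """
    labels = sorted({lab for _, _, lab in spans})
    tags = [f"{prefix}-{lab}" for lab in labels for prefix in ("B", "I")]
    col = {tag: k for k, tag in enumerate(tags)}
    x_c = np.zeros((len(words), len(tags)), dtype=int)
    for start, end, lab in spans:
        x_c[start, col[f"B-{lab}"]] += 1          # first word of the span
        for i in range(start + 1, end + 1):
            x_c[i, col[f"I-{lab}"]] += 1          # remaining words of the span
    return tags, x_c

words = "they have received deployment orders".split()
spans = [(0, 0, "NP"), (1, 4, "VP"), (2, 4, "VP"), (3, 4, "NP"), (0, 4, "S")]
tags, x_c = constituent_counts(words, spans)
for word, row in zip(words, x_c):
    print(word, dict(zip(tags, row.tolist())))
```

For the word “orders” the I-VP entry is 2, since both VP spans contain it as a non-initial word, which agrees with the example given for Table 1.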

Intuitively, a task-related phrase is more likely to be a constituent span or a sub-span of a constituent span. Although the constituent type information of each word is modeled, the importance of each sub-span in the text also matters. Constituent spans, and spans contained within constituent spans, should be assigned higher importance to help find task-related phrases. In this work, the importance of each sub-span in the text is modeled based on existing constituent spans.

Words	he	bought	apple	juice	here
he	1	1	1	1	1
bought	1	1	2	2	1
apple	1	2	1	3	1
juice	1	2	3	1	1
here	1	1	1	1	1

Table 2: The frequency matrix shows the number of constituent spans that contain each sub-span of the text. The annotated constituent spans are: (0,0,NP), (1,3,VP), (2,3,NP), (0,4,S). Considering that the attention between two words is bidirectional, we simply symmetrize the matrix. For example, the span (1,2) is a sub-span of both (1,3) and (0,4), so its frequency is 2. Since a non-constituent word is usually not task-related (e.g. not an entity), we set it to 1.
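The matrix in Table 2 can be reproduced with a short sketch. Two choices here are assumptions read off the table rather than stated rules: all single-word (diagonal) entries are kept at 1, and counts are floored at 1 so that no pair would later receive zero attention weight. The function name `frequency_matrix` is hypothetical.

```python
import numpy as np

def frequency_matrix(length, spans):
    """F[i, j] = number of constituent spans containing sub-span (i, j), symmetrized.

    spans: (start, end, label) triples with inclusive, 0-based offsets.
    """
    F = np.ones((length, length), dtype=int)      # assumption: diagonal entries stay 1
    for i in range(length):
        for j in range(i + 1, length):
            n = sum(1 for s, e, _ in spans if s <= i and j <= e)
            F[i, j] = F[j, i] = max(n, 1)         # assumption: floor the count at 1
    return F

# "he bought apple juice here" with the spans listed in Table 2
spans = [(0, 0, "NP"), (1, 3, "VP"), (2, 3, "NP"), (0, 4, "S")]
F = frequency_matrix(5, spans)
print(F)
```

The largest entry is the pair (apple, juice), which is contained in three constituent spans.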

In detail, a frequency matrix denoted as $\boldsymbol{F}\in\mathbb{R}^{L\times L}$ is constructed, where $F_{ij}$ represents the number of constituent spans that contain the span $(i,j)$. An example is illustrated in Table 2. A variant of Transformer $\operatorname{RTrans-N}(E,F)$ shown in Figure 2 is designed by modifying the self-attention mechanism as:

$$\operatorname{Attn-F}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=G\left(\operatorname{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\boldsymbol{T}}}{\sqrt{d_{k}}}\right)\right)\boldsymbol{V}. \tag{7}$$

Here, the $\operatorname{softmax}$ function produces an attention matrix $\boldsymbol{A}\in\mathbb{R}^{L\times L}$ where $A_{ij}$ denotes the attention weight that the $i$-th token pays to the $j$-th token in the sentence. $G$ is a function that integrates the frequency matrix to obtain modified attention weights. The $(i,j)$-th element of the original attention matrix $\boldsymbol{A}$ is modified as:

$$G(\boldsymbol{A})_{ij}=\frac{F_{ij}A_{ij}}{\sum_{j'}F_{ij'}A_{ij'}}. \tag{8}$$

Thus, the importance of each sub-span of a sentence is modeled according to the frequency it occurs across all constituent spans, with larger frequency values indicating higher importance. In this way, the model assigns more attention to sub-spans with higher importance. Taking Table 2 as an example, the entity “apple juice” has the highest importance, so the model could find it accurately. The final representation of the sentence $\boldsymbol{h^{f}}$ is formulated as:

$$\boldsymbol{h^{w}}=\operatorname{Linear}([\boldsymbol{h^{c}};\boldsymbol{h^{l}}]), \tag{9}$$
$$\boldsymbol{h^{f}}=\operatorname{RTrans-N}(\boldsymbol{h^{w}},\boldsymbol{F}), \tag{10}$$

where $\operatorname{Linear}$ is a linear transformation applied to the concatenation of the contextual representations $\boldsymbol{h^{c}}$ and the language-universal representations $\boldsymbol{h^{l}}$, yielding $\boldsymbol{h^{w}}$.
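A single-head sketch of the frequency-weighted attention may help make Eqs. (7) and (8) concrete. It applies $G$ to the softmax attention weights, as Eq. (8) defines it, and then multiplies by the values. The function name `attn_f` is hypothetical, the projections producing `Q`, `K` and `V` are omitted, and the random inputs are placeholders rather than real model activations.

```python
import numpy as np

def attn_f(Q, K, V, F):
    """Scaled dot-product attention re-weighted by the frequency matrix F.

    Q, K, V : (L, d_k) query, key and value matrices
    F       : (L, L) frequency matrix with positive entries
    Returns the attended values (L, d_k) and the modified weights (L, L).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # softmax attention matrix A
    W = F * A                                      # numerator of Eq. (8)
    W /= W.sum(axis=-1, keepdims=True)             # row-wise renormalization
    return W @ V, W

# Frequency matrix of Table 2 ("he bought apple juice here")
F = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 2, 2, 1],
    [1, 2, 1, 3, 1],
    [1, 2, 3, 1, 1],
    [1, 1, 1, 1, 1],
], dtype=float)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out, W = attn_f(Q, K, V, F)
```

Because each row is renormalized, the modified weights still sum to one; entries with larger frequency simply take a larger share of the attention in their row.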

Three downstream tasks are employed to evaluate the effectiveness of the proposed SHINE encoder as comprehensively as possible, including the tasks of NER, relation extraction and event extraction.

This task aims to locate and classify named entities in a text sequence. Denote one sentence as $\boldsymbol{x}=\{x_{i}\}_{i=1}^{L}$ with its labels $\boldsymbol{y}=\{y_{i}\}_{i=1}^{L}$ and representation $\boldsymbol{h_{x}^{f}}$ from Eq. (10), where $y_{i}$ denotes the label of its corresponding word $x_{i}$ and $L$ denotes the length of the sentence. It is worth noting that this task aims at extracting entities, so the entity type $\boldsymbol{x^{e}}$ is removed from Eq. (1). Following Wu et al. (2020a), $\boldsymbol{h_{x}^{f}}$ is fed into a softmax classification layer to calculate the probability of each word denoted as $\boldsymbol{p}(x_{i})$ and cross-entropy loss is utilized to optimize the model:

$$\mathcal{L}_{e}=\frac{1}{L}\sum_{i=1}^{L}\mathcal{L}_{\text{CE}}\left(\boldsymbol{p}(x_{i}),y_{i}\right). \tag{11}$$

Additionally, previous studies on NER also focus on distillation methods Wu et al. (2020a); Chen et al. (2021); Li et al. (2022). In this work, the distillation method used by Wu et al. (2020a) is adopted in this task for a fair and comprehensive comparison. The source language model is denoted as the teacher model $\Theta_{tea}$ . During distillation, a student model $\Theta_{stu}$ with the same structure as $\Theta_{tea}$ is distilled based on the unlabeled target language data $\mathcal{D}_{\text{train}}^{T}$ , which is fed into $\Theta_{tea}$ to obtain its soft labels. Given a sentence $\boldsymbol{x^{\prime}}$ of length L from $\mathcal{D}_{\text{train}}^{T}$ , the objective for distillation Hinton et al. (2015) is to minimize the mean squared error (MSE) loss as:

$$\mathcal{L}_{\text{KD}}=\frac{1}{L}\sum_{i=1}^{L}\operatorname{MSE}\left(\boldsymbol{p_{tea}}(x_{i}^{\prime}),\boldsymbol{p_{stu}}(x_{i}^{\prime})\right). \tag{12}$$
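The two NER objectives, Eq. (11) and Eq. (12), can be sketched on precomputed label distributions as below. The function names `ner_loss` and `kd_loss` are hypothetical, and the MSE is assumed to be averaged over the label dimension for each token before averaging over the sentence.

```python
import numpy as np

def ner_loss(p, y):
    """Eq. (11): mean token-level cross-entropy.

    p : (L, n_labels) predicted label distributions
    y : (L,) integer gold labels
    """
    return float(-np.mean(np.log(p[np.arange(len(y)), y])))

def kd_loss(p_tea, p_stu):
    """Eq. (12): mean over tokens of the MSE between teacher and student distributions."""
    per_token = np.mean((p_tea - p_stu) ** 2, axis=-1)  # assumption: mean over labels
    return float(np.mean(per_token))

p_tea = np.array([[0.5, 0.5], [0.25, 0.75]])
p_stu = np.array([[1.0, 0.0], [0.25, 0.75]])
print(ner_loss(p_tea, np.array([0, 1])), kd_loss(p_tea, p_stu))
```

In the distillation setting the teacher distributions act as soft labels, so no gold annotations from the target language are needed for `kd_loss`.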

This task aims at predicting the relationship label of a pair of subject and object entities in a sentence. Given two entity mentions $\boldsymbol{x_{m}}$ and $\boldsymbol{x_{n}}$, their representations $\boldsymbol{h_{m}^{f}}$ and $\boldsymbol{h_{n}^{f}}$ are derived from Eq. (10). Following Ahmad et al. (2021), a sentence representation $\boldsymbol{h_{s}^{f}}$ is also obtained. Max-pooling is applied over each of these three representations to derive $\boldsymbol{\hat{h}_{m}^{f}},\boldsymbol{\hat{h}_{n}^{f}}$, and $\boldsymbol{\hat{h}_{s}^{f}}$. Then the concatenation of the three vectors is fed into a softmax classification layer to predict the label as follows:

$$\boldsymbol{p}_{mn}=\operatorname{softmax}\left(\boldsymbol{W}_{r}[\boldsymbol{\hat{h}_{m}^{f}};\boldsymbol{\hat{h}_{n}^{f}};\boldsymbol{\hat{h}_{s}^{f}}]+\boldsymbol{b}_{r}\right), \tag{13}$$

where $\boldsymbol{W}_{r}\in\mathbb{R}^{3d_{\text{model}}\times r}$ and $\boldsymbol{b}_{r}\in\mathbb{R}^{r}$ are trainable parameters, and $r$ is the total number of relation types. The objective of this task is to minimize the cross-entropy loss as:

$$\mathcal{L}_{r}=\sum_{m=1}^{M}\sum_{n=1}^{R_{i}}\mathcal{L}_{\text{CE}}\left(\boldsymbol{p}_{mn},y_{mn}\right), \tag{14}$$

where $M$ is the number of entity mentions, $R_{i}$ is the number of entity candidates for $i$ -th entity mention and $y_{mn}$ denotes the ground truth relation type between $\boldsymbol{x_{m}}$ and $\boldsymbol{x_{n}}$ .
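The classification step of Eq. (13) can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `relation_probs` is hypothetical, mentions are given as inclusive token offsets, and the pooled vector is multiplied on the left of $\boldsymbol{W}_{r}$ so that the stated shape $3d_{\text{model}}\times r$ applies directly.

```python
import numpy as np

def relation_probs(h_f, mention_m, mention_n, W_r, b_r):
    """Predict a relation distribution for one pair of entity mentions.

    h_f       : (L, d) final token representations from Eq. (10)
    mention_* : inclusive (start, end) token offsets of each entity mention
    W_r, b_r  : (3d, r) weight matrix and (r,) bias
    """
    def pool(block):
        return block.max(axis=0)                   # max-pooling over tokens
    h_m = pool(h_f[mention_m[0]:mention_m[1] + 1])
    h_n = pool(h_f[mention_n[0]:mention_n[1] + 1])
    h_s = pool(h_f)                                # sentence representation
    z = np.concatenate([h_m, h_n, h_s]) @ W_r + b_r
    z -= z.max()                                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
h_f = rng.normal(size=(6, 4))                      # placeholder representations
W_r, b_r = rng.normal(size=(12, 3)), np.zeros(3)
p_mn = relation_probs(h_f, (0, 1), (3, 5), W_r, b_r)
```

The same function applies to event argument role labeling by passing the trigger and argument offsets in place of the two entity mentions.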

This task can be decomposed into two sub-tasks of Event Detection and Event Argument Role Labeling (EARL). Event detection aims at identifying event triggers and their types. EARL predicts the argument candidate of an event trigger and assigns a role label to each argument from a pre-defined set of labels. In this paper, we focus on EARL and assume event triggers of the input sentence are provided following Ahmad et al. (2021).

Given an event trigger $\boldsymbol{x_{t}}$ and an argument mention $\boldsymbol{x_{a}}$ with representations $\boldsymbol{h_{t}^{f}}$ and $\boldsymbol{h_{a}^{f}}$ respectively, the concatenation of the three pooled vectors $\boldsymbol{\hat{h}_{t}^{f}},\boldsymbol{\hat{h}_{a}^{f}}$, and $\boldsymbol{\hat{h}_{s}^{f}}$ is fed into a softmax classification layer to calculate the probability of the argument role label $\boldsymbol{p}_{ta}$, following Eq. (13). The objective of this task $\mathcal{L}_{a}$ is to minimize the cross-entropy loss following Eq. (14), where $M$ and $R_{i}$ are replaced with $N$ and $E_{i}$ respectively: $N$ is the number of event triggers and $E_{i}$ is the number of argument candidates for the $i$-th event trigger. Likewise, $y_{mn}$ is replaced with $y_{ta}$, denoting the ground truth argument role type between $\boldsymbol{x_{t}}$ and $\boldsymbol{x_{a}}$.

Finally, the loss for SHINE is as follows:

$$\mathcal{L}_{f}=\mathcal{L}_{task}+\alpha(\mathcal{L}_{g}+\mathcal{L}_{l}+\mathcal{L}_{t}), \tag{15}$$

where $\alpha$ is a manually set hyperparameter and $\mathcal{L}_{task}$ is the task loss $\mathcal{L}_{e}$ , $\mathcal{L}_{r}$ or $\mathcal{L}_{a}$ .

Method	de	es	nl	Avg
BWET Xie et al. (2018)	57.76	72.37	71.25	67.13
BS Wu and Dredze (2019)	69.59	74.96	77.57	73.57
TSL Wu et al. (2020a)	73.16	76.75	80.44	76.78
Unitrans Wu et al. (2020b)	74.82	79.31	82.90	79.01
AdvPicker Chen et al. (2021)	75.01	79.00	82.90	78.97
AdvPicker w/o. KD	73.89	76.92	80.62	77.14
RIKD Liang et al. (2021)	75.48	77.84	82.46	78.59
RIKD w/o. KD	70.64	76.79	79.88	75.77
TOF Zhang et al. (2021)	76.57	80.35	82.79	79.90
MTMT Li et al. (2022)	76.80	81.82	83.41	80.67
MTMT w/o. KD	72.70	76.54	80.31	76.52
SHINE	74.12	77.91	81.81	77.94
SHINE w. Distill	$\textbf{77.32}^{\dagger}$	80.86	$\textbf{84.00}^{\dagger}$	80.73
SHINE w/o. interaction	73.68	77.42	81.01	77.37
SHINE w/o. frequency	73.75	77.52	81.23	77.50
SHINE w/o. constituency	73.39	76.78	80.55	76.91
SHINE w/o. all	73.13	76.39	80.38	76.63

Table 3: Evaluation results (%) of entity-level F1-score on the test set of the CoNLL datasets Sang (2002); Sang and Meulder (2003). Except for MTMT w/o. KD, which was implemented by us, the results were cited from the published literature. For a fair comparison, scores for the RIKD (mBERT) version were listed, and scores for some distillation-based methods without distillation were also listed. Numbers marked with $\dagger$ denote that the improvement over the best performing baseline was statistically significant (t-test with $p$-value < 0.05).

Method	ar	hi	zh	Avg
BS Wu and Dredze (2019)	42.30	67.60	52.90	54.27
TSL Wu et al. (2020a)	43.12	69.54	48.12	53.59
RIKD Liang et al. (2021)	45.96	70.28	50.40	55.55
MTMT Li et al. (2022)	52.77	70.76	52.26	58.60
SHINE	51.61	68.67	53.47	57.91
SHINE w. Distill	$\textbf{55.28}^{\dagger}$	71.10	$\textbf{57.65}^{\dagger}$	$\textbf{61.34}^{\dagger}$
SHINE w/o. interaction	49.74	68.00	52.83	56.86
SHINE w/o. frequency	50.73	68.28	51.67	56.89
SHINE w/o. constituency	49.81	68.85	50.31	56.32
SHINE w/o. all	49.22	68.42	49.73	55.79

Table 4: Evaluation results (%) of entity-level F1-score on the test set of the WikiAnn dataset Pan et al. (2017). Results except ours were cited from the published literature. For a fair comparison, scores of the RIKD (mBERT) version were listed. Numbers marked with $\dagger$ denote that the improvement over the best performing baseline was statistically significant (t-test with $p$-value < 0.05).

We adopted CoNLL-2002 Sang (2002), CoNLL-2003 Sang and Meulder (2003) and WikiAnn Pan et al. (2017) datasets for NER, Automatic Content Extraction (ACE) 2005 dataset Walker et al. (2006) for relation extraction and EARL. Readers can refer to Appendix A.1 for the details of these datasets.

For data preprocessing, UDPipe (https://lindat.mff.cuni.cz/services/udpipe/) Straka and Straková (2017) was applied for POS tagging and dependency parsing for the CoNLL and WikiAnn datasets. We followed Subburathinam et al. (2019) for the ACE 2005 dataset. Finally, for all datasets, the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/) Manning et al. (2014) was utilized for constituency parsing.

For NER, following previous work Wu et al. (2020a), English was employed as the source language and the other languages were employed as the target languages. Unlabeled target language data in the training set and its language-universal features were utilized for distillation in NER. As for relation extraction and EARL, all models were individually trained on one source language, and directly evaluated on the other two target languages following Ahmad et al. (2021).

For NER, an entity mention was correct if its offsets and type matched a reference entity. Following Sang (2002), entity-level F1-score was used as the evaluation metric. The relation-level F1 score was employed for relation extraction Lin et al. (2020); Ahmad et al. (2021). A relation mention was correct if its predicted type and the offsets of the two associated entity mentions were correct. The argument-level F1 score was considered for EARL Subburathinam et al. (2019); Ahmad et al. (2021). An event argument role label was correct if its event type, offsets, and argument role type matched any of the reference argument mentions. Readers can refer to Appendix A.2 for the details of the metric calculations.

To evaluate SHINE on NER, the following previously proposed approaches were chosen as baselines: (1) distillation-based methods: TSL Wu et al. (2020a), Unitrans Wu et al. (2020b), AdvPicker Chen et al. (2021), RIKD Liang et al. (2021), and MTMT Li et al. (2022), and (2) non-distillation-based methods: BWET Xie et al. (2018), BS Wu and Dredze (2019) and TOF Zhang et al. (2021).

As for relation extraction and EARL tasks, this method was mainly compared with the following: CL-GCN Subburathinam et al. (2019), CL-RNN Ni and Florian (2019), OneIE Lin et al. (2020), GATE Ahmad et al. (2021) and CLEAE Lou et al. (2022). Readers can refer to Appendix A.3 and Appendix A.4 for the implementation details of the baseline models and the proposed SHINE respectively.

Method	Relation Extraction							Event Argument Role Labeling
Method	$\begin{gathered}\text{En}\\ \Downarrow\\ \text{Zh}\\ \end{gathered}$	$\begin{gathered}\text{ En }\\ \Downarrow\\ \text{ Ar }\\ \end{gathered}$	$\begin{gathered}\text{ Zh }\\ \Downarrow\\ \text{ En }\\ \end{gathered}$	$\begin{gathered}\text{ Zh }\\ \Downarrow\\ \text{ Ar }\\ \end{gathered}$	$\begin{gathered}\text{ Ar }\\ \Downarrow\\ \text{ En }\\ \end{gathered}$	$\begin{gathered}\text{ Ar }\\ \Downarrow\\ \text{ Zh }\\ \end{gathered}$	Avg	$\begin{gathered}\text{En}\\ \Downarrow\\ \text{Zh}\\ \end{gathered}$	$\begin{gathered}\text{ En }\\ \Downarrow\\ \text{ Ar }\\ \end{gathered}$	$\begin{gathered}\text{ Zh }\\ \Downarrow\\ \text{ En }\\ \end{gathered}$	$\begin{gathered}\text{ Zh }\\ \Downarrow\\ \text{ Ar }\\ \end{gathered}$	$\begin{gathered}\text{ Ar }\\ \Downarrow\\ \text{ En }\\ \end{gathered}$	$\begin{gathered}\text{ Ar }\\ \Downarrow\\ \text{ Zh }\\ \end{gathered}$	Avg
CL-GCN Subburathinam et al. (2019)	41.16	42.74	46.60	39.90	39.12	36.45	40.99	38.89	39.70	36.58	36.99	35.57	37.06	37.46
CL-RNN Ni and Florian (2019)	47.01	45.60	48.18	39.80	40.31	42.26	43.86	44.97	39.05	40.69	40.16	35.34	37.22	39.57
OneIE Lin et al. (2020)	55.23	46.50	53.71	41.80	43.59	42.51	47.22	51.80	43.53	46.17	41.39	39.04	41.93	43.98
GATE Ahmad et al. (2021)	53.52	50.77	52.25	45.36	41.67	44.14	47.95	48.61	50.18	45.99	45.04	42.52	38.39	45.12
CLEAE Lou et al. (2022)	54.63	46.91	55.87	44.52	42.23	46.03	48.36	53.96	51.12	47.83	45.91	45.14	43.36	47.89
SHINE	$\textbf{59.39}^{\dagger}$	47.62	$\textbf{56.41}^{\dagger}$	$\textbf{46.66}^{\dagger}$	$\textbf{44.56}^{\dagger}$	$\textbf{47.22}^{\dagger}$	$\textbf{50.31}^{\dagger}$	$\textbf{57.92}^{\dagger}$	51.60	$\textbf{51.35}^{\dagger}$	$\textbf{48.11}^{\dagger}$	41.86	$\textbf{45.49}^{\dagger}$	$\textbf{49.38}^{\dagger}$
SHINE w/o. interaction	58.26	47.03	56.30	45.15	42.89	46.07	49.28	56.86	51.01	51.30	46.94	41.10	42.04	48.21
SHINE w/o. frequency	58.87	46.96	56.23	45.31	43.36	45.68	49.40	56.99	50.90	50.92	47.31	40.79	43.67	48.43
SHINE w/o. constituency	56.79	46.52	55.56	44.64	42.01	45.58	48.51	56.60	49.77	50.65	46.42	39.37	40.81	47.27
SHINE w/o. all	56.05	45.98	55.03	42.83	40.76	44.63	47.54	56.12	48.18	50.54	44.67	38.29	37.18	45.83

Table 5: Evaluation results (%) of F1-score on the test set of the ACE 2005 dataset Walker et al. (2006) for relation extraction and EARL. Results except ours were obtained by implementing the source code of the baseline models provided by the authors. Languages on top and bottom of $\Downarrow$ denoted the source and target languages respectively. Numbers marked with $\dagger$ denoted that the improvement over the best performing baseline was statistically significant (t-test with $p$-value < 0.05).

Tables 3, 4 and 5 reported the zero-shot cross-lingual IE results of different methods on 3 tasks and 4 benchmarks, covering 7 target languages. For the NER task, the proposed SHINE w. Distill method outperformed MTMT (previous SOTA) on average by absolute margins of 0.06% and 2.74% on the CoNLL and WikiAnn datasets respectively. Besides, the proposed SHINE also significantly outperformed the baseline distillation method TSL on average by absolute margins of 1.16% and 4.32% on the CoNLL and WikiAnn datasets respectively, demonstrating the effectiveness of the language-universal features. For a fair comparison, we also compared SHINE against the versions of AdvPicker, RIKD and MTMT without knowledge distillation (w/o. KD), and the results showed that SHINE significantly outperformed them. Our results also demonstrated the compatibility and scalability of combining SHINE with distillation.

As for relation extraction and EARL, SHINE outperformed CLEAE (previous SOTA) in most transfer directions, with average improvements of 1.95% and 1.49% on the relation extraction and EARL tasks respectively. It is worth noting that GATE used mBERT as the contextual representation extractor without fine-tuning. This might have left the cross-lingual information in mBERT not fully exploited, subsequently degrading the performance of the model. These results demonstrated that SHINE was effective and generalized well across languages and tasks.
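As an illustrative check of the comparison above, the following Python sketch recomputes, from the scores listed in Table 5, the transfer directions in which SHINE exceeds CLEAE and the average gain on each task. The dictionary layout, the direction labels and the helper name `summarize` are hypothetical and introduced only for illustration.

```python
# Transfer directions in the column order of Table 5 (source-target).
DIRECTIONS = ["En-Zh", "En-Ar", "Zh-En", "Zh-Ar", "Ar-En", "Ar-Zh"]

# Per-direction F1 (%) copied from Table 5.
CLEAE = {
    "RE": [54.63, 46.91, 55.87, 44.52, 42.23, 46.03],
    "EARL": [53.96, 51.12, 47.83, 45.91, 45.14, 43.36],
}
SHINE = {
    "RE": [59.39, 47.62, 56.41, 46.66, 44.56, 47.22],
    "EARL": [57.92, 51.60, 51.35, 48.11, 41.86, 45.49],
}

# Values of the "Avg" columns as reported in Table 5.
REPORTED_AVG = {
    "CLEAE": {"RE": 48.36, "EARL": 47.89},
    "SHINE": {"RE": 50.31, "EARL": 49.38},
}


def summarize(task):
    """Return (directions where SHINE > CLEAE, gain in reported average F1)."""
    wins = [d for d, s, c in zip(DIRECTIONS, SHINE[task], CLEAE[task]) if s > c]
    gain = round(REPORTED_AVG["SHINE"][task] - REPORTED_AVG["CLEAE"][task], 2)
    return wins, gain


if __name__ == "__main__":
    for task in ("RE", "EARL"):
        wins, gain = summarize(task)
        print(task, f"{len(wins)}/{len(DIRECTIONS)} directions", f"avg gain {gain}")
```

With these numbers, SHINE is ahead of CLEAE in all six relation extraction directions and in five of the six EARL directions (all except Ar to En), which matches the average gains of 1.95 and 1.49 quoted above.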

To validate the contributions of different components in SHINE, the following variants were evaluated in the ablation study: (1) SHINE w/o. interaction, which removed the hierarchical interaction framework while still using the constituency feature during training. (2) SHINE w/o. frequency, which removed the frequency matrix while still using the constituent span embeddings constructed in Section 3.4 and the interaction framework. (3) SHINE w/o. constituency, which removed both the constituent span embeddings and the frequency matrix while still using the interaction framework. (4) SHINE w/o. all, which removed all the components mentioned above and was thus the base structure of the proposed SHINE.

Results of the ablation experiments were shown in the bottom four rows of Tables 3, 4 and 5. Further analysis revealed that: (1) removing the interaction network (SHINE vs. SHINE w/o. interaction) caused a significant performance drop, demonstrating the importance of establishing interaction between language-universal features and contextual information, and (2) removing constituency features (SHINE vs. SHINE w/o. frequency and SHINE w/o. constituency) also caused significant performance drops, indicating that both constituent type embeddings and sub-span importance information were useful.
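The size of each ablation drop can be read off the Avg columns. The sketch below uses the Avg values of Tables 4 and 5 only and subtracts each variant from the full model; the benchmark keys and the helper name `ablation_drops` are hypothetical labels chosen for illustration.

```python
# Average F1 (%) copied from the Avg columns of Tables 4 and 5.
AVG_F1 = {
    "WikiAnn NER": {
        "SHINE": 57.91, "w/o interaction": 56.86, "w/o frequency": 56.89,
        "w/o constituency": 56.32, "w/o all": 55.79,
    },
    "ACE05 RE": {
        "SHINE": 50.31, "w/o interaction": 49.28, "w/o frequency": 49.40,
        "w/o constituency": 48.51, "w/o all": 47.54,
    },
    "ACE05 EARL": {
        "SHINE": 49.38, "w/o interaction": 48.21, "w/o frequency": 48.43,
        "w/o constituency": 47.27, "w/o all": 45.83,
    },
}


def ablation_drops(scores):
    """Absolute drop in average F1 of every variant relative to full SHINE."""
    full = scores["SHINE"]
    return {name: round(full - f1, 2) for name, f1 in scores.items() if name != "SHINE"}


DROPS = {bench: ablation_drops(scores) for bench, scores in AVG_F1.items()}

if __name__ == "__main__":
    for bench, drops in DROPS.items():
        print(bench, drops)
```

In all three settings, removing the constituency features gives the largest drop among the single-component removals, and removing all components gives the largest drop overall.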

The ablation study validated the effectiveness of all components. Moreover, integrating these modules achieved highly promising performance. Hierarchical interaction should be established to capture the complementary information of features and contextual information, and constituency structure should also be modeled for effective cross-lingual transfer.

To further illustrate the effectiveness of SHINE and to explore how language-universal features play a role in cross-lingual transfer, the embedding distributions of three models were shown in Appendix A.5. The distribution discrepancy within SHINE was significantly smaller than that of the base model. This showed that, with the guidance of language-universal features, the proposed encoder could capture complementary information to alleviate the discrepancy between languages and derive language-agnostic contextualized representations.

In this paper, we propose a syntax-augmented hierarchical interactive encoder for zero-shot cross-lingual IE. A multi-level interaction framework is designed to model the interaction between the complementary structured information contained in language-universal features and contextual representations. Besides, the constituency structure is introduced for the first time to explicitly model and utilize the constituent span information. Experiments show that the proposed method achieves highly promising performance across seven languages on three tasks and four benchmarks. In the future, we will extend this method to more languages and tasks. We will also explore other sources of language-universal features, as well as the relationships between features, to augment representation capability.

Although the proposed method has shown strong performance for zero-shot cross-lingual IE, it can still be further improved. For example, relationships between different language-universal features should be considered because there is an implicit mutual influence between them. Explicitly modeling this influence could make these features form an organic whole, so that the model can use this information more efficiently; this will be part of our future work. In addition, some languages are not supported by existing language tools. Although toolkits for similar languages can be utilized for annotation, denoising these annotations is worth studying.

Abad et al. (2017) Azad Abad, Moin Nabi, and Alessandro Moschitti. 2017. Self-crowdsourcing training for relation extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 518–523. Association for Computational Linguistics.
Agirre (2022) Eneko Agirre. 2022. Few-shot information extraction is here: Pre-train, prompt and entail. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, page 2. ACM.
Ahmad et al. (2021) Wasi Uddin Ahmad, Nanyun Peng, and Kai-Wei Chang. 2021. GATE: graph attention transformer encoder for cross-lingual relation and event extraction. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 12462–12470. AAAI Press.
Bekoulis et al. (2018) Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Adversarial training for multi-context joint entity and relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2830–2836. Association for Computational Linguistics.
Cabral et al. (2020) Bruno Souza Cabral, Rafael Glauber, Marlo Souza, and Daniela Barreiro Claro. 2020. Crossoie: Cross-lingual classifier for open information extraction. In Computational Processing of the Portuguese Language - 14th International Conference, PROPOR 2020, Evora, Portugal, March 2-4, 2020, Proceedings, volume 12037 of Lecture Notes in Computer Science, pages 368–378. Springer.
Chen et al. (2022) Beiduo Chen, Jun-Yu Ma, Jiajun Qi, Wu Guo, Zhen-Hua Ling, and Quan Liu. 2022. USTC-NELSLIP at semeval-2022 task 11: Gazetteer-adapted integration network for multilingual complex named entity recognition. In Proceedings of the 16th International Workshop on Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States, July 14-15, 2022, pages 1613–1622. Association for Computational Linguistics.
Chen et al. (2021) Weile Chen, Huiqiang Jiang, Qianhui Wu, Börje Karlsson, and Yi Guan. 2021. Advpicker: Effectively leveraging unlabeled data via adversarial discriminator for cross-lingual NER. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 743–753. Association for Computational Linguistics.
Chen et al. (2017) Yubo Chen, Shulin Liu, Xiang Zhang, Kang Liu, and Jun Zhao. 2017. Automatically labeled data generation for large scale event extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 409–419. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
Huang et al. (2018) Lifu Huang, Heng Ji, Kyunghyun Cho, Ido Dagan, Sebastian Riedel, and Clare R. Voss. 2018. Zero-shot transfer learning for event extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2160–2170. Association for Computational Linguistics.
Jiang et al. (2020) Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Tuo Zhao. 2020. SMART: robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 2177–2190. Association for Computational Linguistics.
Keung et al. (2019) Phillip Keung, Yichao Lu, and Vikas Bhardwaj. 2019. Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1355–1360. Association for Computational Linguistics.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
Li et al. (2020) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2020. A unified MRC framework for named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 5849–5859. Association for Computational Linguistics.
Li et al. (2022) Zhuoran Li, Chunming Hu, Xiaohui Guo, Junfan Chen, Wenyi Qin, and Richong Zhang. 2022. An unsupervised multiple-task and multiple-teacher model for cross-lingual named entity recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 170–179. Association for Computational Linguistics.
Liang et al. (2021) Shining Liang, Ming Gong, Jian Pei, Linjun Shou, Wanli Zuo, Xianglin Zuo, and Daxin Jiang. 2021. Reinforced iterative knowledge distillation for cross-lingual named entity recognition. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 3231–3239. ACM.
Lin et al. (2020) Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7999–8009. Association for Computational Linguistics.
Liu et al. (2019) Jian Liu, Yubo Chen, Kang Liu, and Jun Zhao. 2019. Neural cross-lingual event detection with minimal parallel resources. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 738–748. Association for Computational Linguistics.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
Lou et al. (2022) Chenwei Lou, Jun Gao, Changlong Yu, Wei Wang, Huan Zhao, Weiwei Tu, and Ruifeng Xu. 2022. Translation-based implicit annotation projection for zero-shot cross-lingual event argument extraction. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, pages 2076–2081. ACM.
Luan et al. (2019) Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 3036–3046. Association for Computational Linguistics.
Ma et al. (2023) Jun-Yu Ma, Jia-Chen Gu, Jiajun Qi, Zhen-Hua Ling, Quan Liu, and Xiaoyi Zhao. 2023. USTC-NELSLIP at semeval-2023 task 2: Statistical construction and dual adaptation of gazetteer for multilingual complex NER. In Proceedings of the 17th International Workshop on Semantic Evaluation, SemEval@ACL 2023, Toronto, Canada, July 13-14, 2023. Association for Computational Linguistics.
Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, System Demonstrations, pages 55–60. The Association for Computer Linguistics.
Mayhew et al. (2017) Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2536–2545. Association for Computational Linguistics.
M’hamdi et al. (2019) Meryem M’hamdi, Marjorie Freedman, and Jonathan May. 2019. Contextualized cross-lingual event trigger extraction with minimal resources. In Proceedings of the 23rd Conference on Computational Natural Language Learning, CoNLL 2019, Hong Kong, China, November 3-4, 2019, pages 656–665. Association for Computational Linguistics.
Ni and Florian (2019) Jian Ni and Radu Florian. 2019. Neural cross-lingual relation extraction based on bilingual word embedding mapping. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 399–409. Association for Computational Linguistics.
Pan et al. (2017) Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1946–1958. Association for Computational Linguistics.
Pérez-Cruz (2008) Fernando Pérez-Cruz. 2008. Kullback-leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory, ISIT 2008, Toronto, ON, Canada, July 6-11, 2008, pages 1666–1670. IEEE.
Sang (2002) Erik F. Tjong Kim Sang. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning, CoNLL 2002, Held in cooperation with COLING 2002, Taipei, Taiwan, 2002. ACL.
Sang and Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning, CoNLL 2003, Held in cooperation with HLT-NAACL 2003, Edmonton, Canada, May 31 - June 1, 2003, pages 142–147. ACL.
Straka and Straková (2017) Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with udpipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada, August 3-4, 2017, pages 88–99. Association for Computational Linguistics.
Subburathinam et al. (2019) Ananya Subburathinam, Di Lu, Heng Ji, Jonathan May, Shih-Fu Chang, Avirup Sil, and Clare R. Voss. 2019. Cross-lingual structure transfer for relation and event extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 313–325. Association for Computational Linguistics.
Tsai et al. (2016) Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. 2016. Cross-lingual named entity recognition via wikification. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 219–228. ACL.
Van der Maaten and Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
Walker et al. (2006) Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda. 2006. Ace 2005 multilingual training corpus. In Linguistic Data Consortium, Philadelphia 57.
Wang et al. (2021) Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021. Improving named entity recognition by external context retrieving and cooperative learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1800–1812. Association for Computational Linguistics.
Wu et al. (2020a) Qianhui Wu, Zijia Lin, Börje Karlsson, Jianguang Lou, and Biqing Huang. 2020a. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6505–6514. Association for Computational Linguistics.
Wu et al. (2020b) Qianhui Wu, Zijia Lin, Börje F. Karlsson, Biqing Huang, and Jianguang Lou. 2020b. Unitrans : Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 3926–3932. ijcai.org.
Wu and Dredze (2019) Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 833–844. Association for Computational Linguistics.
Xie et al. (2018) Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime G. Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 369–379. Association for Computational Linguistics.
Yarmohammadi et al. (2021) Mahsa Yarmohammadi, Shijie Wu, Marc Marone, Haoran Xu, Seth Ebner, Guanghui Qin, Yunmo Chen, Jialiang Guo, Craig Harman, Kenton Murray, Aaron Steven White, Mark Dredze, and Benjamin Van Durme. 2021. Everything is all it takes: A multipronged strategy for zero-shot cross-lingual information extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 1950–1967. Association for Computational Linguistics.
Zhang et al. (2021) Ying Zhang, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. 2021. Target-oriented fine-tuning for zero-resource named entity recognition. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 1603–1615. Association for Computational Linguistics.
Zhou et al. (2016) Deyu Zhou, Tianmeng Gao, and Yulan He. 2016. Jointly event extraction and visualization on twitter via probabilistic modelling. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.

We adopted CoNLL (http://www.cnts.ua.ac.be/conll2003) Sang (2002); Sang and Meulder (2003) and WikiAnn (http://nlp.cs.rpi.edu/wikiann) Pan et al. (2017) datasets for NER, and the Automatic Content Extraction (ACE) 2005 dataset (https://catalog.ldc.upenn.edu/LDC2006T06) Walker et al. (2006) for relation extraction and EARL.

CoNLL included two datasets: (1) CoNLL-2002 (Spanish, Dutch); (2) CoNLL-2003 (English, German). They were annotated with 4 entity types: LOC, MISC, ORG, and PER.

WikiAnn included English, Arabic, Hindi, and Chinese. It was annotated with 3 entity types: LOC, ORG, and PER. CoNLL and WikiAnn datasets were annotated with the BIO entity labelling scheme and were divided into the training, development and testing sets. Table 6 shows the statistics of these datasets.

ACE 2005 included three languages (English, Chinese and Arabic). It defined an ontology that included 7 entity types, 18 relation subtypes, and 33 event subtypes. We added a class label None to denote that two entity mentions, or a pair of an event mention and an argument candidate under consideration, did not have a relationship belonging to the target ontology. Since the official release did not provide training, development and testing splits, we adopted the same dataset-splitting strategy as Subburathinam et al. (2019). Besides, we re-implemented the baseline models based on the code provided by the authors using default settings, as described in Appendix A.3. Table 7 shows the statistics of the dataset.

Language	Type	Train	Dev	Test
CoNLL dataset Sang (2002); Sang and Meulder (2003)
English-en	Sentence	14,987	3,466	3,684
(CoNLL-2003)	Entity	23,499	5,942	5,648
German-de	Sentence	12,705	3,068	3,160
(CoNLL-2003)	Entity	11,851	4,833	3,673
Spanish-es	Sentence	8,323	1,915	1,517
(CoNLL-2002)	Entity	18,798	4,351	3,558
Dutch-nl	Sentence	15,806	2,895	5,195
(CoNLL-2002)	Entity	13,344	2,616	3,941
WikiAnn dataset Pan et al. (2017)
English-en	Sentence	20,000	10,000	10,000
English-en	Entity	27,931	14,146	13,958
Arabic-ar	Sentence	20,000	10,000	10,000
Arabic-ar	Entity	22,500	11,266	11,259
Hindi-hi	Sentence	5,000	1,000	1,000
Hindi-hi	Entity	6,124	1,226	1,228
Chinese-zh	Sentence	20,000	10,000	10,000
Chinese-zh	Entity	25,031	12,493	12,532

Table 6: The statistics of the CoNLL Sang (2002); Sang and Meulder (2003) and WikiAnn Pan et al. (2017) datasets.

Language	Type	Train	Dev	Test
English-en	Sentence	14,671	873	711
	Event	4,317	492	422
	Argument	7,814	933	892
	Relation	5,247	550	509
Chinese-zh	Sentence	5,847	715	733
	Event	2,610	325	336
	Argument	6,281	727	793
	Relation	5,392	807	664
Arabic-ar	Sentence	2,367	203	210
	Event	921	141	127
	Argument	2,523	328	367
	Relation	2,389	266	268

Table 7: The statistics of the ACE 2005 dataset Walker et al. (2006).

For NER, following Sang (2002), entity-level F1-score was used as the evaluation metric. Let A denote the number of all entities predicted by the model, B the number of correctly predicted entities, and E the number of all reference entities. The precision (P), recall (R), and entity-level F1-score (E-F1) of the model were:

$$\mathrm{P}=\frac{B}{A},\quad \mathrm{R}=\frac{B}{E},\quad \text{E-F1}=\frac{2\times \mathrm{P}\times \mathrm{R}}{\mathrm{P}+\mathrm{R}}. \tag{16}$$
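Eq. (16) can be illustrated with a minimal Python sketch that decodes BIO tag sequences into typed spans and counts exact matches on offsets and type. The function names `bio_to_spans` and `entity_f1` are hypothetical, E is taken to be the number of reference entities, and a stray I- tag without a matching open span is simply ignored; the official CoNLL evaluation script may treat such cases differently.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Decode one BIO tag sequence into a set of typed spans."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    etype: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # Assumption: an I- tag with no open span of the same type is ignored.
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (P, R, E-F1) with A = predicted, B = correct, E = reference entities."""
    A = B = E = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        A += len(p)
        B += len(g & p)  # correct only if offsets and type both match
        E += len(g)
    precision = B / A if A else 0.0
    recall = B / E if E else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "B-ORG"]]
    print(entity_f1(gold, pred))  # one of two predicted entities is correct
```

In the toy example, the PER span matches exactly while the last token receives the wrong type, so A = 2, B = 1 and E = 2.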

Following previous works Lin et al. (2020); Ahmad et al. (2021), the relation-level F1 score was considered for relation extraction. A relation mention was correct if its predicted type and the offsets of the two associated entity mentions were correct. Let A denote the number of all relation mentions predicted by the model, B the number of correctly predicted relation mentions, and E the number of all reference relation mentions. The precision (P), recall (R), and relation-level F1-score (R-F1) of the model were:

$$\mathrm{P}=\frac{B}{A},\quad \mathrm{R}=\frac{B}{E},\quad \text{R-F1}=\frac{2\times \mathrm{P}\times \mathrm{R}}{\mathrm{P}+\mathrm{R}}. \tag{17}$$

An event argument role label was correct if its event type, offsets, and argument role type matched any of the reference argument mentions Ahmad et al. (2021). Let A denote the number of all arguments predicted by the model, B the number of correctly predicted arguments, and E the number of all reference arguments. The argument-level F1 score (A-F1) was defined analogously to the relation-level F1 score:

$$\mathrm{P}=\frac{B}{A},\quad \mathrm{R}=\frac{B}{E},\quad \text{A-F1}=\frac{2\times \mathrm{P}\times \mathrm{R}}{\mathrm{P}+\mathrm{R}}. \tag{18}$$
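The correctness criteria for relation and argument mentions amount to exact matching of tuples. The sketch below is one possible way to express this, assuming a relation is keyed by its type and the offsets of its two entity mentions, and an argument by its event type, offsets and role. The helper name `micro_prf`, the tuple layouts and the label strings are hypothetical; in practice a sentence or document identifier would also be part of each key.

```python
from typing import Hashable, Iterable, Tuple


def micro_prf(predicted: Iterable[Hashable], reference: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Exact-match P, R and F1 with A = |predicted|, B = |predicted & reference|, E = |reference|."""
    pred, ref = set(predicted), set(reference)
    A, B, E = len(pred), len(pred & ref), len(ref)
    precision = B / A if A else 0.0
    recall = B / E if E else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Relation key: (relation_type, head_offsets, tail_offsets); labels are illustrative.
ref_relations = {("PART-WHOLE", (0, 2), (5, 6)), ("PHYS", (3, 4), (5, 6))}
pred_relations = {("PART-WHOLE", (0, 2), (5, 6)), ("PHYS", (3, 4), (7, 8))}

# Argument key: (event_type, argument_offsets, role); labels are illustrative.
ref_arguments = {("Attack", (4, 5), "Attacker"), ("Attack", (9, 11), "Target")}
pred_arguments = {("Attack", (4, 5), "Attacker")}

if __name__ == "__main__":
    print("R-F1:", micro_prf(pred_relations, ref_relations))
    print("A-F1:", micro_prf(pred_arguments, ref_arguments))
```

The second predicted relation has the right type but wrong tail offsets, so it counts as incorrect. The single predicted argument is correct, giving perfect precision but only half recall.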

The baseline models for NER are described as follows:

TSL Wu et al. (2020a) proposed a teacher-student learning model, via using source-language models as teachers to train a student model on unlabeled data in the target language for cross-lingual NER. The number of parameters of this model was about 110M.

Unitrans Wu et al. (2020b) unified both model transfer and data transfer based on their complementarity via enhanced knowledge distillation on unlabeled target-language data. The number of parameters of this model was about 331M.

AdvPicker Chen et al. (2021) proposed a novel approach to combine the feature-based method and pseudo labeling via language adversarial learning for cross-lingual NER. The number of parameters of this model was about 178M.

RIKD Liang et al. (2021) proposed a reinforced knowledge distillation framework. The number of parameters of this model was about 111M.

MTMT Li et al. (2022) proposed an unsupervised multiple-task and multiple-teacher model for cross-lingual NER. The number of parameters of this model was about 220M.

In addition, BWET Xie et al. (2018), BS Wu and Dredze (2019) and TOF Zhang et al. (2021) were non-distillation-based methods. The numbers of parameters of these models were about 12M, 111M and 114M respectively.

Furthermore, baselines for relation extraction and EARL are shown below:

CL-GCN Subburathinam et al. (2019) used GCN Kipf and Welling (2017) to embed the language-universal features and context to learn structured space representations. We referred to the code released by Ahmad et al. (2021) (https://github.com/wasiahmad/GATE) to re-implement it.

CL-RNN Ni and Florian (2019) utilized BiLSTM to embed the language-universal features and contextual representations to learn structured space representations. We referred to the code released by Ahmad et al. (2021) to re-implement this method.

OneIE Lin et al. (2020) was a joint model trained with multitasking to capture cross-subtask and cross-instance inter-dependencies. We used their provided code (http://blender.cs.illinois.edu/software/oneie/) to train the model with default settings. The number of parameters of this model was about 115M.

GATE Ahmad et al. (2021) proposed a Transformer layer with syntactic distance to encode the language-universal features. The number of parameters of this model was about 118M. We utilized the official release code to re-implement this method.

CLEAE Lou et al. (2022) proposed a translation-based method with implicit annotation. We extended this method to relation extraction, as that task is similar to EARL. The number of parameters of this model was about 112M. The officially released code (https://github.com/HLT-HITSZ/CLEAE) was utilized to re-implement it.

All code was implemented in the PyTorch framework (https://pytorch.org/). All of the context encoders mentioned in this paper employed the pre-trained cased mBERT Devlin et al. (2019) from HuggingFace’s Transformers (https://huggingface.co/transformers), where the number of transformer blocks was 12, the hidden layer size was 768, and the number of self-attention heads was 12. Following Keung et al. (2019), some hyperparameters were tuned on each target language. All language-universal feature encoders mentioned in this paper employed a 2-layer Transformer Vaswani et al. (2017), in which the hidden layer size was 768 and the number of self-attention heads was 8. For the variant transformer mentioned in Eq. (10), the number of layers was set to 1, the hidden size was 768, and the number of self-attention heads was 8. The $\alpha$ in Eq. (15) was set to 10 and the span length P utilized in Eq. (5) was set to 4.
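The architecture settings above can be collected into a small configuration sketch. The class name `EncoderConfig`, the constant names, and the derived `head_dim` property are hypothetical and introduced only for illustration; the numeric values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the Transformer sizes stated in the text."""
    num_layers: int
    hidden_size: int
    num_heads: int

    @property
    def head_dim(self):
        # Multi-head attention splits the hidden size evenly across heads.
        assert self.hidden_size % self.num_heads == 0
        return self.hidden_size // self.num_heads


# Context encoder: pre-trained cased mBERT (12 blocks, 768 hidden, 12 heads).
CONTEXT_ENCODER = EncoderConfig(num_layers=12, hidden_size=768, num_heads=12)
# Language-universal feature encoder: 2-layer Transformer.
FEATURE_ENCODER = EncoderConfig(num_layers=2, hidden_size=768, num_heads=8)
# Variant transformer of Eq. (10): a single layer.
VARIANT_TRANSFORMER = EncoderConfig(num_layers=1, hidden_size=768, num_heads=8)

ALPHA = 10         # the alpha of Eq. (15)
SPAN_LENGTH_P = 4  # the span length P of Eq. (5)

for name, cfg in [("context", CONTEXT_ENCODER),
                  ("feature", FEATURE_ENCODER),
                  ("variant", VARIANT_TRANSFORMER)]:
    print(name, cfg.num_layers, cfg.hidden_size, cfg.num_heads, cfg.head_dim)
```

One consequence made explicit by this sketch is that the 8-head encoders use a per-head dimension of 96, whereas the 12-head mBERT uses 64.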

For NER, following Wu et al. (2020a), each batch contained 32 examples, with a maximum encoding length of 128. The dropout rate was set to 0.1, and AdamW Loshchilov and Hutter (2019) with WarmupLinearSchedule in the Transformers Library was used as the optimizer. The learning rate was set to 5e-5 for the teacher models and 2e-5 for the student models. Following Wu and Dredze (2019), the parameters of the embedding layer and the bottom three layers of the mBERT used in the teacher model and the student model were frozen. All models were trained for 10 epochs, and the best checkpoint was chosen on the target dev set. For relation extraction and EARL, the other parameters were set following Lin et al. (2020). We optimized our model with Adam Kingma and Ba (2015) for 80 epochs with a learning rate of 5e-5 and a dropout rate of 0.1. Furthermore, each experiment was conducted 5 times and the mean F1-score was reported.
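Two of the NER training choices above, freezing the embedding layer together with the bottom three mBERT layers and using a linearly warmed-up learning rate, can be sketched as follows. This is a minimal illustration, not the released training code: the helper names `is_frozen` and `warmup_linear` are hypothetical, the parameter-name patterns assume the naming used by HuggingFace BERT models (for example `embeddings.word_embeddings.weight` and `encoder.layer.0.attention.self.query.weight`), and the warmup length and step counts are illustrative because the text does not state them.

```python
import re

# Assumed HuggingFace-style BERT parameter names, optionally prefixed
# (e.g. "bert.") when the encoder is wrapped inside a task model.
_EMBED_RE = re.compile(r"(?:^|\.)embeddings\.")
_LAYER_RE = re.compile(r"(?:^|\.)encoder\.layer\.(\d+)\.")


def is_frozen(param_name, num_frozen_layers=3):
    """True for embedding parameters and for encoder layers 0..num_frozen_layers-1."""
    if _EMBED_RE.search(param_name):
        return True
    match = _LAYER_RE.search(param_name)
    return match is not None and int(match.group(1)) < num_frozen_layers


def warmup_linear(step, total_steps, warmup_steps):
    """Learning-rate multiplier: linear rise to 1, then linear decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# With a PyTorch model, the freezing rule would typically be applied as:
#     for name, param in model.named_parameters():
#         param.requires_grad = not is_frozen(name)
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.2.output.dense.bias",
    "bert.encoder.layer.3.attention.self.query.weight",
    "classifier.weight",
]
frozen = [n for n in names if is_frozen(n)]
print(frozen)

# Hypothetical schedule: 100 total steps with 10 warmup steps, base rate 5e-5.
lrs = [5e-5 * warmup_linear(s, total_steps=100, warmup_steps=10) for s in (0, 10, 55, 100)]
print(lrs)
```

Matching on parameter names keeps the freezing rule independent of how the encoder is wrapped, at the cost of relying on the assumed naming scheme.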

The number of parameters of our model was about 130M. The whole training of SHINE was run on one GeForce RTX 3090 and took about 3 hours for NER and 8 hours for relation extraction and EARL.

Figure 3 shows the representation distribution for the two languages across the three models.
