
SegMix: A Simple Structure-Aware Data Augmentation Method

Yuxin Pei (Carnegie Mellon University), Pushkar Bhuse, Zhengzhong Liu, Eric Xing
Abstract

Interpolation-based Data Augmentation (DA) methods (Mixup) linearly interpolate the inputs and labels of two or more training examples. Mixup has recently been adapted to Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such direct adoption yields mixed or unstable improvements over the baseline models. We argue that direct-adoption methods do not account for structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. At the core of the algorithm, we apply interpolation methods on task-specific meaningful segments, in contrast to applying them on whole sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, this method is easy to implement and adds negligible training overhead.




Figure 1: Example of SegMix vs. Whole-sequence Mixup for NER. Each colored block is an entity.
Figure 2: Four variations of SegMix (MMix, TMix, SMix, and RMix). On the left is the original training sequence. The colored blocks are the segments to be mixed. The segments on the right are randomly sampled from the predefined Segment Pool. The Mention Pool, Token Pool, and Relation Pair Pool are constructed from the training data, while the Synonym-token Pool is constructed with WordNet (Miller, 1995a) and returns a synonym of the chosen token. The segment embeddings and one-hot encodings of labels are mixed with ratio α.

1 Introduction

Initially proposed as Mixup for computer vision tasks, interpolation-based Data Augmentation (DA) (Zhang et al., 2018) linearly interpolates the inputs and labels of two or more training examples. Inspired by Mixup, several attempts have been made to apply interpolation-based DA to NLP, mainly in sequence labeling tasks (Guo et al., 2020). However, the proposed embedding-mix solution does not extend well to tasks with structured labels. For example, mixing two sentences with different structures usually generates nonsensical output. As demonstrated in Fig. 1, when working with entity spans, Whole-sequence Mixup (we refer to the method of Guo et al. 2020 as Whole-sequence Mixup to avoid confusion with the SeqMix of Zhang et al. 2020) produces nonsensical entity labels, such as a mixture of non-entity and entity ([O/B-PER]) and consecutive beginning labels ([O/B-PER], [B-LOC/I-PER]). Such noisy augmented data tend to mislead the model, especially in data-scarce settings. As shown in Chen et al. (2020a), without additional constraints on the augmented data, applying Whole-sequence Mixup results in performance worse than the baseline.

Instead of using extra heuristic constraints to filter out low-quality augmented data, it may be more efficient and effective to bring structure awareness into the mixing process from the beginning. To this end, we propose Segment Mix (SegMix), a DA method that performs linear interpolations on meaningful, task-specific segments. Virtual training examples are created by replacing the original segments with interpolations of pairs of segment embeddings. As in Fig. 1, the embedding of a location entity ("New York City") is mixed with the embedding of a person entity ("Marcello Cuttitta"). We exploit the benefit of linear interpolation while keeping the target structure sensible.

Furthermore, SegMix imposes few restrictions on the original tasks, mixing pairs, or generated examples. On the one hand, this potentially allows one to explore a much larger data space; for example, it allows mixing training samples with different sentence lengths and structures. On the other hand, it means that SegMix can be applied to NLP tasks beyond sequence labeling.

This paper tests SegMix on Named Entity Recognition (NER) and Relation Extraction (RE), two typical Information Extraction tasks involving text segments. We show that SegMix improves upon the baselines under data-scarce settings and demonstrate its robustness under different hyperparameter settings, which is not the case for simple sequence-based Mixup methods. SegMix is easy to implement (we will release the experiment code base) and adds little computational overhead to training and inference.

2 Related Work

Many NLP tasks deal with structured data; structured prediction is a popular example. These tasks often involve extracting a predefined target structure from the input data (Lafferty et al., 2001; Collins, 2002; Ma and Hovy, 2016). NER aims to locate and classify the named entities mentioned in unstructured text. There have been several attempts to apply Mixup-like algorithms to sequence labeling tasks such as NER (Chen et al., 2020a; Zhang et al., 2020); these tasks have linear structures that allow for simple sequence-level mixing methods. RE aims to detect the semantic relationship between a pair of nominals. Unlike NER, RE models typically do not use a linear encoding scheme such as BIO, making sequence-level mixing non-trivial. To the best of our knowledge, interpolation-based DA methods have not been applied to such tasks.

Rule-based DA

Rule-based DA specifies rules for inserting, deleting, or replacing parts of text (van Dyk and Meng, 2001). Easy Data Augmentation (EDA) (Wei and Zou, 2019) proposed a set of token-level random perturbation operations (insertion, deletion, and swap). SwitchOut (Wang et al., 2018) randomly replaces tokens in the sentence with random words. WordDrop (Sennrich et al., 2016) drops tokens randomly. Existing work also brings structure awareness into DA. Substructure Substitution (SUB) (Shi et al., 2021) generates new examples by replacing substructures (e.g., subtrees or subsequences) with ones bearing the same label; SUB applies to POS tagging, parsing, and token classification. A similar idea is proposed for NER (Dai and Adel, 2020): Mention Replacement (MR) and Label-wise Token Replacement (LwTR) substitute entity mentions and tokens with ones carrying the same label, and Synonym Replacement (SR) replaces tokens with synonyms retrieved from WordNet (Miller, 1995b). Xu et al. 2016 reverse dependency sub-paths and their corresponding relationships in relation classification. Şahin and Steedman 2018 crop and rotate dependency trees for POS tagging. Su et al. 2021 present a contrastive pre-training method to create more generalized representations for RE tasks; it introduces a DA technique in which the text contained in the shortest dependency path is kept constant and other tokens are replaced. Generally, these methods explore the vicinity of a data point and assume that the generated examples share the same label.

Interpolation-based DA

Originally proposed for image classification tasks, Mixup (Zhang et al., 2018) performs convex combinations between a pair of data points and their labels. Mixup improves the performance of image classification by regularizing the neural network to favor simple linear behavior between training examples (Zhang et al., 2018). Several adaptations of Mixup have been made for NLP tasks. TMix (Chen et al., 2020b) performs an interpolation of text in a hidden space for text classification. Snippext (Miao et al., 2020) mixes BERT encodings and passes them through a classification layer for sentiment analysis. AdvAug (Cheng et al., 2020) mixes adversarial examples as an adversarial augmentation method for Neural Machine Translation.
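To make the core operation concrete, vanilla Mixup is a two-line convex combination of inputs and one-hot labels. Below is a minimal PyTorch sketch (variable and function names are ours, not from any released implementation):

```python
import torch

def mixup(x_i, x_j, y_i, y_j, alpha=8.0):
    """Vanilla Mixup (Zhang et al., 2018): convex combination of two
    training inputs (e.g., embeddings) and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return x_mix, y_mix
```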

However, direct application of Whole-sequence Mixup yields limited improvement on tasks involving structured data. As empirically shown by LADA (Chen et al., 2020a) on NER, directly mixing two sentences changes both the local token representations and the context embeddings required to identify an entity mention. As also demonstrated in Fig. 1, the generated data can sometimes be too noisy to help with model training. In fact, LADA has to add constraints, mixing sequences only with their k-nearest neighbors, to reduce the noise (Chen et al., 2020a). Similarly, SeqMix (Zhang et al., 2020) scans both sequences with a fixed-length sliding window and mixes the subsequences within the windows. However, this approach does not eliminate the problem of generating low-quality data: extra constraints are still used to ensure the quality of generated data. These constraints limit the explorable data space to a region close to the training data. Moreover, they complicate the algorithms and add non-negligible computational overhead.

3 Method

We propose SegMix and implement four variants: MentionMix (MMix), TokenMix (TMix), SynonymMix (SMix), and RelationMix (RMix). As shown in Fig. 2, after defining the task-dependent segments, we create a new training sample by replacing a segment of the original sample with a mixed embedding of the segment itself and another randomly drawn segment. These mixed embeddings are then fed into the encoder. Algorithm 1 presents the SegMix generation process.

Algorithm 1 SegMix generation algorithm
1:  Input: D, P^k, r
2:  D_A ← {}, D_S ← sample(D, len(D) · r)
3:  for (X_i, Y_i) in D_S do
4:      E_i, O_i ← Emb(X_i), OHE(Y_i)
5:      λ ← Beta(α, α)
6:      S_a, L_a ← k segment tuples in X_i, Y_i
7:      S_b, L_b ← k segment tuples in P^k
8:      X_i′, Y_i′ ← X_i.copy(), Y_i.copy()
9:      for s_a^j, s_b^j in S_a, S_b do
10:         e_a^j, e_b^j ← Emb(s_a^j), Emb(s_b^j)
11:         start, end ← index range of s_a^j in X_i
12:         ẽ_a^j, ẽ_b^j ← pad_to_longer(e_a^j, e_b^j)
13:         E_i[start:end] ← ẽ_a^j · λ + ẽ_b^j · (1 − λ)
14:     end for
15:     for l_a^j, l_b^j in L_a, L_b do
16:         o_a^j, o_b^j ← OHE(l_a^j), OHE(l_b^j)
17:         start, end ← index range of l_a^j in Y_i
18:         õ_a^j, õ_b^j ← pad_to_longer(o_a^j, o_b^j)
19:         O_i[start:end] ← õ_a^j · λ + õ_b^j · (1 − λ)
20:     end for
21:     D_A.add((E_i, O_i))
22: end for
23: Output: D_A
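The pseudocode translates almost line for line into Python. Below is a minimal sketch of Algorithm 1 under simplifying assumptions (k = 1, NumPy arrays, caller-supplied `emb` and `one_hot` functions, and a `pick_segment` helper for NER-style entity spans); it is an illustration, not the released code:

```python
import random
import numpy as np

def pick_segment(labels):
    """Pick a random entity span (start, end) from BIO labels;
    fall back to the first token if the sentence has no entity."""
    spans, start = [], None
    for i, tag in enumerate(list(labels) + ["O"]):   # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i))
            start = i if tag.startswith("B-") else None
    return random.choice(spans) if spans else (0, 1)

def pad_to_longer(a, b):
    """Zero-pad the shorter of two [length, dim] arrays to the longer length."""
    n = max(len(a), len(b))
    pad = lambda x: np.pad(x, ((0, n - len(x)), (0, 0)))
    return pad(a), pad(b)

def segmix(data, pool, r, emb, one_hot, alpha=8.0):
    """Algorithm 1 with k = 1. data: list of (tokens, labels);
    pool: list of (segment_tokens, segment_labels);
    emb(tokens) -> [len, D] array; one_hot(labels) -> [len, C] array."""
    augmented = []
    for tokens, labels in random.sample(data, int(len(data) * r)):
        E, O = emb(tokens), one_hot(labels)          # line 4
        lam = np.random.beta(alpha, alpha)           # line 5
        start, end = pick_segment(labels)            # line 6
        seg_b, lab_b = random.choice(pool)           # line 7
        e_a, e_b = pad_to_longer(E[start:end], emb(seg_b))      # lines 10-12
        o_a, o_b = pad_to_longer(O[start:end], one_hot(lab_b))  # lines 16-18
        E = np.concatenate([E[:start], lam * e_a + (1 - lam) * e_b, E[end:]])
        O = np.concatenate([O[:start], lam * o_a + (1 - lam) * o_b, O[end:]])
        augmented.append((E, O))                     # line 21
    return augmented
```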

Formally, consider a training dataset $\mathcal{D} = \{(X_i, Y_i) \mid 1 \le i \le N\}$ of size $N$, where each input $X_i$ is a sequence of tokens $X_i = (X_i^1, X_i^2, \dotsc)$ paired with a task-dependent structured output $Y_i$. A structured prediction algorithm generally encodes the output $Y_i$ using a task-dependent scheme: for example, NER labels are often encoded with the BIO scheme, while RE labels are associated with a pair of nominal phrases. SegMix adapts to different encoding schemes by designing task-dependent segments.

A segment $s(u, v)$ is a contiguous sequence of tokens $(X_i^u, X_i^{u+1}, \dotsc, X_i^v)$ in sample $X_i$; a segment tuple $S = [s_1(u_1, v_1), \dotsc]$ is a $k$-ary tuple of segments contained in the sequence. We choose a segment tuple relevant to the task and associate it with an appropriate label list $L = [l_1, \dotsc]$. For example, in RE, segment tuples have length 2 and contain the pair of nominals in a relation.

A Segment Pool of size $M$, $\mathcal{P}^k = \{(S_i, L_i) \mid 1 \le i \le M\}$, is generated by collecting segment tuples $S_i$ from the training data or an external resource (e.g., WordNet). Here, $k$ is a constant for a specific task; for example, in RE, the pool holds binary segment tuples, each containing a pair of nominals.

Given the training data set $\mathcal{D}$, the Segment Pool $\mathcal{P}^k$, and the mix rate $r$, SegMix$(\mathcal{D}, \mathcal{P}^k, r)$ returns an augmented data set $\mathcal{D}_A$ of size $r \cdot N$. A set $\mathcal{D}_S$ of size $r \cdot N$ is first drawn from the training data $\mathcal{D}$ as candidates for augmentation. For each data point $(X_i, Y_i)$ drawn from $\mathcal{D}_S$, we randomly pick a segment tuple $S_a$ and the corresponding label list $L_a$ from the sequence $X_i$. The mix for candidate $X_i$, $(S_b, L_b)$, is then drawn from the Segment Pool.

Let $\mathbf{Emb}$ be an embedding function $\mathbb{R}^V \mapsto \mathbb{R}^D$, where $V$ is the size of the vocabulary and $D$ is the embedding dimension. Let $\mathbf{OHE}$ be a function that returns the one-hot encoding of a label.

For all $s_a, s_b = S_a[i], S_b[i]$ with $1 \le i \le \mathrm{len}(S_a)$, and $l_a, l_b = L_a[j], L_b[j]$ with $1 \le j \le \mathrm{len}(L_a)$, define $e_a, e_b = \mathbf{Emb}(s_a), \mathbf{Emb}(s_b)$ and $o_a, o_b = \mathbf{OHE}(l_a), \mathbf{OHE}(l_b)$.

The embeddings and one-hot encodings are then padded according to sequence length (lines 12, 18). Let $\tilde{e}_a, \tilde{e}_b, \tilde{o}_a, \tilde{o}_b$ be the padded versions of the embeddings and one-hot encodings. Finally, in lines 13 and 19, we perform a linear interpolation between $\tilde{e}_a, \tilde{e}_b$ and between $\tilde{o}_a, \tilde{o}_b$ with a mix rate $\lambda$ chosen randomly from a Beta distribution (see specifications in Sec. 4.1):

$e_a' \leftarrow \tilde{e}_a \cdot \lambda + \tilde{e}_b \cdot (1 - \lambda)$    (1)
$o_a' \leftarrow \tilde{o}_a \cdot \lambda + \tilde{o}_b \cdot (1 - \lambda)$

In Eq. 1, $\cdot$ is scalar multiplication and $+$ is element-wise vector addition. When $\lambda = 1$, the augmented data falls back to the original; when $\lambda = 0$, the segments are completely replaced by those drawn from the pool, which is equivalent to replacement-based DA techniques.
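As a concrete illustration (numbers chosen for exposition only): with a three-class label set (B-PER, B-LOC, O) and $\lambda = 0.6$, mixing a B-PER label with a B-LOC label gives

$o' = 0.6 \cdot (1, 0, 0) + 0.4 \cdot (0, 1, 0) = (0.6, 0.4, 0),$

a soft target that credits both classes and is consumed directly by the soft cross-entropy loss (Sec. 4.1).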

Finally, the augmented data point is generated by copying the original data and replacing the chosen segments and labels with the mixed versions. We present three variations of SegMix for NER and one for RE, each with a different type of Segment Pool $\mathcal{P}^k$.

MentionMix

Inspired by MR, MMix performs linear interpolations at the mention level (a contiguous segment of tokens with the same entity label). A Mention Pool $\mathcal{P}^1$ is constructed by scanning the training data set and extracting all mention segments and their corresponding labels. Thus, each segment tuple is composed of a single mention and a list of entity labels encoded with the BIO scheme. This method can also be viewed as a generalization of SUB (Shi et al., 2021) that performs a soft mix of substructures of varying lengths.
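Building the Mention Pool amounts to a standard BIO span scan over the training data; a minimal sketch (helper and variable names are ours):

```python
def build_mention_pool(data):
    """Collect every entity mention (token span + BIO label span)
    from BIO-tagged training data into a flat pool."""
    pool = []
    for tokens, labels in data:
        start = None
        for i, tag in enumerate(list(labels) + ["O"]):  # sentinel flushes the last span
            if tag.startswith("B-") or tag == "O":
                if start is not None:                   # close the open mention
                    # e.g. (["New", "York"], ["B-LOC", "I-LOC"]) lands in the pool
                    pool.append((tokens[start:i], labels[start:i]))
                start = i if tag.startswith("B-") else None
    return pool
```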

TokenMix

Inspired by LwTR, TMix performs linear interpolations at the token level. We use tokens with entity labels in the BIO scheme from the training data as the Token Pool $\mathcal{P}^1$. Each segment tuple is composed of a single token and its label.

SynonymMix

Inspired by SR, SMix draws from a Synonym Pool $\mathcal{P}^1$ that returns a synonym of the token in the original sequence based on WordNet (Miller, 1995b). We assume that the two synonyms share the same label, so interpolation happens only on the input side.
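Such a synonym lookup can be sketched with NLTK's WordNet interface; the uniform sampling policy below is our assumption, not a detail fixed by the paper:

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") once

def sample_synonym(token):
    """Return a random WordNet synonym of `token`, or the token itself
    if WordNet offers none. Multi-word lemmas use underscores in WordNet."""
    lemmas = {l.name().replace("_", " ")
              for syn in wordnet.synsets(token) for l in syn.lemmas()}
    lemmas.discard(token)
    return random.choice(sorted(lemmas)) if lemmas else token
```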

RelationMix

Since each relation is composed of two possibly non-adjacent nominals in a sentence, we construct a pool $\mathcal{P}^2$ with groups of two nominals and a relation label (the direction of the relation is implied by the labels; for example, the label list contains both producer-product(e1,e2) and producer-product(e2,e1)). During the mixing phase, the two nominals and their corresponding relation labels are mixed with a pair of nominals drawn from $\mathcal{P}^2$.
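Constructing the binary pool $\mathcal{P}^2$ can be sketched as below; the per-instance format (gold nominal spans plus a directed relation label) is an illustrative assumption:

```python
def build_relation_pool(data):
    """Collect (head nominal, tail nominal) token spans and the directed
    relation label from each RE training instance into a pool of 2-ary tuples."""
    pool = []
    for ex in data:
        # assumed format: {"tokens": [...], "e1": (u1, v1), "e2": (u2, v2),
        #                  "label": "Producer-Product(e1,e2)"}
        u1, v1 = ex["e1"]
        u2, v2 = ex["e2"]
        pool.append(((ex["tokens"][u1:v1], ex["tokens"][u2:v2]), ex["label"]))
    return pool
```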

4 Experiments

Datasets

Dataset    | Language    | Task | # Instances
CoNLL-03   | English     | NER  | 14987
Kin        | Kinyarwanda | NER  | 626
Sin        | Sinhala     | NER  | 753
SemEval    | English     | RE   | 8000
DDI        | English     | RE   | 22233
ChemProt   | English     | RE   | 18035
Table 1: Dataset statistics.

We conduct SegMix experiments on 3 datasets for NER and 3 for RE, covering a variety of domains and languages. The NER task is to recognize mentions in text belonging to predefined semantic types, such as person, location, and organization. The RE task requires one to classify the relation type between two pre-labeled nominals in a sentence. Basic dataset statistics are included in Table 1 (since no down-sampling settings are used for LORELEI Kin and Sin, we report those results as single values).

1. CoNLL-03 (Sang and Meulder, 2003), an English corpus for NER containing entity labels such as person, location, organization, etc. (We also conduct experiments on GermEval, a German NER dataset; the results and trends are similar to those on CoNLL-03 and are presented in Appendix A.1.)

2. LORELEI (Strassel and Tracey, 2016), which contains NER annotations for text in Kinyarwanda (Kin) and Sinhala (Sin).

3. SemEval-2010 Task 8 (Hendrickx et al., 2010), an English corpus for RE containing 9 relation types, including cause-effect, product-producer, instrument-agency, etc.

4. DDI (Herrero-Zazo et al., 2013), a biomedical dataset manually annotated with drug-drug interactions, containing 4 relationship types.

5. ChemProt (Krallinger et al., 2017), a biomedical dataset annotated with chemical-protein interactions, containing 4 interaction types.

Method                    | CoNLL-03 200 | CoNLL-03 400 | CoNLL-03 800 | Kin (626) | Sin (753)
BERT                      | 76.03 ± 0.57 | 81.20 ± 0.29 | 84.34 ± 0.33 | 82.29     | 75.02
BERT + LADA               | 70.46 ± 0.84 | 81.98 ± 0.16 | 84.53 ± 0.09 | 76.02     | 60.43
BERT + SeqMix             | 77.10 ± 1.04 | 81.55 ± 0.66 | 84.89 ± 0.27 | 83.13     | 78.93
BERT + Whole-seq Mix      | 75.11 ± 0.62 | 81.94 ± 0.14 | 84.61 ± 0.18 | 82.35     | 79.17
BERT + MR                 | 77.86 ± 0.36 | 81.49 ± 0.17 | 84.21 ± 0.29 | 83.46     | 78.62
BERT + LwTR               | 76.69 ± 0.49 | 81.13 ± 0.36 | 84.56 ± 0.37 | 82.42     | 78.17
BERT + SR                 | 77.35 ± 0.29 | 81.33 ± 0.32 | 85.10 ± 0.11 | 82.51     | 78.38
BERT + MMix               | 78.51 ± 0.34 | 82.98 ± 0.61 | 85.37 ± 0.59 | 83.37     | 79.50
BERT + TMix               | 78.75 ± 0.49 | 82.28 ± 0.30 | 85.51 ± 0.21 | 83.85     | 78.63
BERT + SMix               | 77.95 ± 0.38 | 82.51 ± 0.36 | 85.33 ± 0.19 | 83.31     | 79.38
BERT + MMix + SMix        | 78.45 ± 0.26 | 82.39 ± 0.21 | 85.66 ± 0.25 | 82.81     | 79.83
BERT + MMix + TMix        | 78.46 ± 0.26 | 82.39 ± 0.24 | 85.82 ± 0.21 | 82.75     | 80.31
BERT + MMix + SMix + TMix | 78.21 ± 0.28 | 82.36 ± 0.34 | 85.26 ± 0.27 | 82.83     | 78.05
RoBERTa †                 | 74.08 ± 0.27 | 78.89 ± 0.59 | 82.28 ± 0.23 | -         | -
RoBERTa + MMix            | 75.31 ± 0.52 | 80.09 ± 0.49 | 83.37 ± 0.54 | -         | -
RoBERTa + TMix            | 74.55 ± 0.37 | 79.44 ± 0.35 | 83.22 ± 0.80 | -         | -
RoBERTa + SMix            | 75.18 ± 0.42 | 79.80 ± 0.45 | 83.49 ± 0.39 | -         | -
Table 2: F1 scores for NER in data-scarce settings (down-sampled CoNLL-03 and LORELEI Kin and Sin) using SegMix, compared with interpolation- and replacement-based DA methods. For down-sampled datasets, we use 5 different random seeds and report the averaged performance and standard deviation as μ ± σ. For LORELEI, we report the 10-fold cross-validation result. Although no single SegMix variant performs best in all settings, every SegMix variant outperforms the baseline in all settings and the other DA techniques in most settings. † denotes our methods.

Data Sampling

For the true low-resource languages Kinyarwanda and Sinhala (the data sizes of LORELEI-Sin and LORELEI-Kin are less than 5% of the CoNLL-03 English dataset), we use all available data. To create different data-scarce settings for CoNLL-03, we subsample a range of sizes (200, 400, 800, 1600, 3200, 6400, 12800) from the original training data as the training set. The augmentation algorithm can only access the down-sampled training set. We use 5 different random seeds to subsample the training set of each size and report both mean and standard deviation as μ ± σ. The validation and test datasets are unchanged. For LORELEI, we deleted all data samples consisting only of the character "–"; therefore, there are some discrepancies between our reported data sizes and the original paper. For RE, we subsample (100, 200, 400, 800, 1600, 6400) instances from the original training data as the training set. We do not run experiments at larger sizes since the improvement from DA diminishes.

Settings

For each data split, we conduct experiments on 12 settings for NER: 2 interpolation-based DA methods (Inter+Intra LADA, using the implementation available at https://github.com/GT-SALT/LADA, and Whole-sequence Mixup, implemented by setting segments to whole sequences), 3 replacement-based DA methods (MR, SR, LwTR, implemented as SegMix with mix rate 1), and 6 variations of SegMix (MMix, TMix, SMix, and their combinations MMix + SMix, MMix + TMix, MMix + TMix + SMix), all with a fixed 0.2 augmentation rate. We use the BIO tagging scheme (Màrquez et al., 2005) to assign labels to each token in NER tasks. In RE tasks, we compare RMix with Relation Replacement. Gold-standard nominal pairs are used.

All methods are evaluated with F1 scores. For Kin and Sin, we report the average F1 score over 10-fold cross-validation, consistent with Rijhwani et al. 2020.

4.1 Implementation Details

For our experiments, we adopt pretrained BERT and RoBERTa models (model choices are listed in Appendix A.2) as the encoder, plus a linear layer to make predictions, trained with a soft cross-entropy loss. A language-specific pretrained BERT model is adopted for each language, whereas, due to computational expense, we adopt the pretrained RoBERTa model only for experiments on the CoNLL-03 dataset. For pseudo-data-scarce settings (CoNLL-03, DDI, ChemProt, and SemEval), we train all models for 100 epochs with early stopping and take the checkpoint with the maximum validation score on the development dataset as the final model. For Kin and Sin, under each data split, we train the model for 100 epochs and report the F1 score. The initial weight decay is 0.1 and α is 8 for both models. Learning rates for all settings are set to 5e-5 for the BERT model and 1e-4 for the RoBERTa model.
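Since mixed targets are soft label distributions rather than hard class indices, the soft cross-entropy loss mentioned above reduces to a dot product with log-probabilities. A minimal sketch:

```python
import torch.nn.functional as F

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against (possibly mixed) label distributions.
    logits: [batch, seq, C]; soft_targets: [batch, seq, C], rows summing to 1."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```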

4.2 Results and Analysis

NER

The results for the three NER datasets under data-scarce settings with BERT and RoBERTa are shown in Table 2, and Fig. 3 includes the results for CoNLL-03 under all data settings with BERT. Under all settings, SegMix or a combination of SegMix variants achieves the best result among the interpolation- and replacement-based methods. For BERT, the best-performing SegMix improves the baseline by 2.7 F1 on CoNLL-03 in the 200-sample setting, 1.5 F1 on Kin, and 5 F1 on Sin. As for RoBERTa, SegMix and its variants outperform the baseline RoBERTa model in all simulated data-scarce scenarios on CoNLL-03; for example, the best-performing SegMix variant with RoBERTa improves the baseline by 1.2 F1 under the 200-sample setting. SegMix thus proves effective under both down-sampled and true low-resource settings. These results are consistent with our hypothesis that a "soft" mix of data points over structure-aware segments yields better results than "hard" replacement or mixing over whole sequences. In comparison, LADA is unstable under data-scarce settings: it produces worse results than the baseline on CoNLL-03 with 200 samples and on both low-resource languages Kin and Sin, while SegMix shows consistent improvements.

One notable trend is that most DA methods provide a larger improvement on Sin than on Kin. Notice that even with the same model architecture, the baseline performance on Sin is considerably lower than that on Kin and on English datasets of similar size. This could be because multilingual BERT transfers better between languages that share more word order features (Pires et al., 2019): while both Kinyarwanda-BERT and Sinhala-BERT are transferred from M-BERT, the number of grammatical ordering WALS features (Dryer and Haspelmath, 2013; features 81A, 85A, 86A, 87A, 88A, and 89A) shared with English is 3 for Kinyarwanda but only 1 for Sinhala. Given the lower baseline, many DA methods provide larger improvements on Sin than on Kin, and our SegMix variants score around 80 F1. This shows that DA methods are generally very valuable for low-resource and understudied languages.

Figure 3: Average F1 score on CoNLL-03, DDI, ChemProt, and SemEval-2010 under different down-sampled data settings. The y-axis shows the average F1 score, and the x-axis shows the number and percentage of instances used as the training set. For each dataset, we calculate the average F1 score on increasing data sub-samples until the performance of our SegMix variant either plateaus or equals that of the baseline. SegMix works best in settings with fewer than approximately 1000 training instances.

RE

For RE, we compare RMix with the baseline and with Relation Replacement (replacing nominal pairs). The results are presented in Fig. 3. We find that simple replacement sometimes worsens the baseline performance, while RMix consistently improves it. We analyze performance on increasing percentages of training data to simulate pseudo-data-scarce settings as well as settings with ample training data. We observe a consistent performance improvement of RMix over replacement-based methods, and at least comparable performance with the baselines. SegMix performs well in data-scarce settings, specifically in scenarios with fewer than approximately 1000 training examples; for example, on the DDI dataset, SegMix performs at least 2 F1 points better than the baseline in these scenarios.

Robustness with respect to augmentation rate

From previous results on sequence-level Mixup (Zhang et al., 2020; Chen et al., 2020a), we observe that model performance tends to drop below the baseline once the augmentation rate increases above a certain value. Furthermore, the optimal augmentation rate varies across initial data settings: a good augmentation rate for the 200-sample setting might not be good for the 800-sample setting. With BERT, for example, a 0.2 augmentation rate improves upon the baseline under the 200-sample setting but produces worse results than the baseline under the 800-sample setting. This creates an extra burden in hyperparameter tuning. Through experiments with varying augmentation rates under 3 different data-scarcity settings, we show that MMix consistently improves the baseline performance in all settings, making it more applicable in practical contexts. As presented in Fig. 4, MMix improves upon the baseline for every augmentation rate we tested, and the best performance is consistently achieved at 0.1. TMix and SMix show a similar trend; the specific scores are presented in Appendix A.1.

Figure 4: Average F1 score for varying augmentation rates of MMix and SeqMix on CoNLL-03 with 200, 400, and 800 down-sampled data. The colored line represents the baseline performance. MMix consistently outperforms the baseline.

Computation Time

SegMix is easy to implement and adds little computational overhead. We compare the time required to generate the mixed data and to train using LADA, MMix, and SeqMix in Table 3. Without extra constraints on the augmentation process, MMix (and the other variants) takes less than 1 second on average to generate the augmented dataset, while SeqMix takes more than 2 minutes due to its filtering process. Both SeqMix and SegMix pass mixed embeddings into the encoder directly, so no extra computation is required per epoch. However, we observe that SegMix converges faster than SeqMix, thus requiring less training time on average. Since LADA mixes hidden representations during training, no augmented dataset is explicitly generated; this leads to almost twice the training time of SegMix.

Method | Mixing time (s) | Training time (s)
SeqMix | 138.90 ± 15.46  | 1094.99 ± 108.28
MMix † | 0.81 ± 0.22     | 609.61 ± 66.39
LADA   | -               | 1120.78 ± 103.13
Table 3: Comparison of the mixing time (time taken to generate the augmented data) and the training time (time taken to train the model to convergence) of LADA, SeqMix, and MMix on CoNLL-03 with 200 down-sampled data. We experimented with 5 different random seeds and report the average time and standard deviation. † denotes our method.

4.3 Discussion

We argue that SegMix keeps the syntactic and output structure of the training data intact. We choose some sample sequences from CoNLL-03 and visualize them in Fig. 5 by mapping the mixed embeddings to the nearest word in the vocabulary.

Original: Swedish [MISC] options and derivatives exchange OM Gruppen AB [ORG] said on Thursday it would open an electronic bourse for forest industry products in London [LOC] in the first half of 1997.
MMix: Swedish [MISC] options and derivatives exchange Javier Gomez de [PER/ORG] said on Thursday it would open an electronic bourse for forest industry products in London [LOC] in the first half of 1997.
Whole-Sequence Mix: Sweden [MISC/ORG] option [O/ORG] but [unused33] transfer . . [unused10] [O/ORG] saying to Friday them might closed his electronics . with woods companies Products of Paris [O/LOC] of a second three in 1995.

Figure 5: Mixed sentence samples recovered by mapping embeddings to the nearest token (L2 distance). [A/B] represents the linear interpolation of the one-hot encodings of the two labels A and B.
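The nearest-token recovery used for Fig. 5 can be sketched as a brute-force L2 search over the embedding matrix (function and variable names are ours):

```python
import torch

def nearest_tokens(mixed_embs, emb_matrix, vocab):
    """Map each mixed embedding to the vocabulary token whose embedding
    is closest in L2 distance.
    mixed_embs: [seq, D]; emb_matrix: [V, D]; vocab: list of V strings."""
    dists = torch.cdist(mixed_embs, emb_matrix)  # [seq, V] pairwise L2
    return [vocab[i] for i in dists.argmin(dim=1).tolist()]
```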

MMix preserves the syntactic and entity structures while achieving linear interpolation between mentions. Due to the high proportion of non-entity phrases in the dataset, SeqMix tends to mix entity mentions with non-entity segments (label [O]). The resulting sentences often contain meaningless spans (e.g., "option" and ". . [unused10]") that are nevertheless labeled as entities (with a non-[O] label). The non-entity phrases in the sentence are also mixed, producing semantically incorrect context phrases like "second three in 1995".

Unlike other interpolation-based DA methods, SegMix imposes few constraints on the mixing candidate and mixed examples. All training data pairs can potentially be used as mixing candidates and no filtering process is required after the augmented sample is generated. This not only potentially expands the explorable space of our augmentation algorithm but also saves computational time.

When analyzing the improvement for each entity class on CoNLL-03, there is an overall improvement in accuracy for every class, especially for PER and ORG (the confusion matrix is included in Appendix A.1). Before SegMix, the model tends to mistakenly predict [LOC] for [ORG] (27% → 19%) and [O] for [PER] (19% → 8%). This may be because MMix introduces more variations of meaningful entities into the training process, preventing the model from simply predicting the majority label.

We also analyze cases that are improved in different tasks; specifics can be found in Appendix A.3. In one example, the baseline model correctly detects the entity span "British Universities" but falsely classifies it as [MISC], whereas SegMix correctly labels it as [ORG]. In another example, the baseline model detects a wrong entity span ("Minor Counties" instead of "Minor Counties XI") with a wrong entity class, while SegMix gives the same wrong span but the correct entity class. We hypothesize that SegMix mainly helps the model distinguish between ambiguous types rather than improving span detection. To validate this claim, we convert all mentions to [B] and [I] during the inference phase and find little difference between the models in terms of span accuracy (both around 98%), confirming our hypothesis. Similarly, for RE, we conduct evaluation in two settings: evaluating only the relation type and only the relation direction. The accuracy scores for the two metrics both increase by around 2%. Thus, RMix helps to identify both the correct type and the correct direction of relations. Specific cases and examples can be found in Appendix A.3.

Limitations

In this paper, we analyze the efficacy of SegMix on tasks with clear task-related segments (NER and RE). SegMix works best in such settings, but we do not validate it on tasks like syntactic parsing. Second, we only test the performance of SegMix on a few transformer-based models (BERT and RoBERTa); we have not applied it to newer paradigms such as question-answering-based and generation-based information extraction techniques (He et al., 2015; Josifoski et al., 2022). Lastly, although SegMix works best on small datasets (≈1000 examples), we recognize that its improvement diminishes as data size increases. Thus, we recommend using SegMix in data-scarce situations.

5 Conclusion

This paper proposes SegMix, a simple DA technique that adapts to task-specific data structures, extending the application range of Mixup in NLP tasks. We demonstrate its robustness by evaluating model performance under both true low-resource and down-sampled settings on multiple NER and RE datasets. SegMix consistently improves model performance and is more stable than other mixing methods. By combining rule-based and interpolation-based DA in a computationally inexpensive and straightforward method, SegMix opens up several interesting directions for further exploration.

Ethics Statement

Our research does not present any new datasets; rather, it presents a new general method that can be used to improve the performance of existing NLP applications and is intended for data-scarce situations. As a result, we anticipate no direct harm from the intended usage, although we recognize that this depends on the NLP models and applications to which users apply the method.

Our research does not involve attributing any form of characteristics to any individual. In fact, we strive to boost the performance of NLP applications on low-resource languages. Our proposed method is easy to implement and adds negligible overhead to computation time compared to similar methods. Because we conducted experiments over extensive hyperparameter and data settings, we used around 5000 GPU-hours on Tesla T4 GPUs.

References

Appendix A Appendix

A.1 Additional results

We conduct experiments on the GermEval dataset; the results are included in Table 4. We report the results of experiments with varying augmentation rates for MMix, SMix, and TMix in Table 6.

GermEval
Method        | 5%    | 10%   | 30%
BERT          | 70.28 | 75.64 | 79.63
BERT + MR     | 74.51 | 75.98 | 80.83
BERT + SR     | 73.77 | 73.26 | 75.52
BERT + LR     | 73.26 | 79.49 | 79.20
BERT + MMix † | 76.06 | 80.32 | 83.48
BERT + SMix † | 75.07 | 78.64 | 80.89
BERT + TMix † | 74.48 | 77.07 | 80.99
Table 4: F1 scores on down-sampled GermEval compared with replacement-based augmentation methods. † denotes our methods.

To better understand the improvement made by SegMix, we compare the confusion matrices of the baseline model and MMix for each class on 5% of the CoNLL-03 data in Fig. 6.

Language    | Model   | Reference
English     | BERT    | Devlin et al. 2018
English     | RoBERTa | Liu et al. 2019
Kinyarwanda | Kin     | Adelani et al. 2021
Sinhala     | Sin     | Wang et al. 2020
Table 5: Pre-trained models.
         | Aug Rate | 200          | 400          | 800          | Average
Baseline | 0        | 76.02 ± 0.56 | 81.20 ± 0.29 | 84.34 ± 0.33 | -
MMix     | 0.1      | 78.76 ± 0.49 | 82.28 ± 0.31 | 85.51 ± 0.21 | +(1.66 ± 0.55)
         | 0.2      | 77.71 ± 0.29 | 82.10 ± 0.09 | 84.77 ± 0.23 | +(1.01 ± 0.47)
         | 0.3      | 77.88 ± 0.20 | 82.10 ± 0.19 | 84.72 ± 0.28 | +(1.05 ± 0.47)
         | 0.4      | 77.13 ± 0.23 | 81.89 ± 0.13 | 84.59 ± 0.24 | +(0.68 ± 0.46)
         | 0.5      | 77.38 ± 0.32 | 81.32 ± 0.07 | 84.66 ± 0.07 | +(0.60 ± 0.47)
         | Average  | 78.16 ± 0.44 | 82.32 ± 0.26 | 85.12 ± 0.17 | +(1.00 ± 0.48)
TMix     | 0.1      | 78.70 ± 0.47 | 82.98 ± 0.27 | 85.37 ± 0.26 | +(1.83 ± 0.54)
         | 0.2      | 78.51 ± 0.34 | 82.35 ± 0.12 | 85.26 ± 0.23 | +(1.52 ± 0.48)
         | 0.3      | 78.24 ± 0.39 | 82.21 ± 0.15 | 85.07 ± 0.12 | +(1.32 ± 0.48)
         | 0.4      | 77.56 ± 0.49 | 82.11 ± 0.33 | 85.22 ± 0.06 | +(1.11 ± 0.54)
         | 0.5      | 77.78 ± 0.60 | 81.97 ± 0.17 | 84.68 ± 0.25 | +(0.96 ± 0.57)
         | Average  | 78.16 ± 0.44 | 82.32 ± 0.26 | 85.12 ± 0.17 | +(1.35 ± 0.51)
SMix     | 0.1      | 77.95 ± 0.39 | 82.52 ± 0.36 | 85.33 ± 0.19 | +(1.40 ± 0.52)
         | 0.2      | 77.75 ± 0.46 | 82.42 ± 0.35 | 85.05 ± 0.18 | +(1.22 ± 0.54)
         | 0.3      | 77.24 ± 0.44 | 82.11 ± 0.07 | 84.90 ± 0.16 | +(0.89 ± 0.49)
         | 0.4      | 77.23 ± 0.59 | 81.75 ± 0.29 | 84.76 ± 0.15 | +(0.73 ± 0.57)
         | 0.5      | 77.78 ± 0.49 | 81.42 ± 0.35 | 84.98 ± 0.21 | +(0.54 ± 0.55)
         | Average  | 77.39 ± 0.50 | 82.04 ± 0.29 | 85.01 ± 0.17 | +(0.96 ± 0.54)
Table 6: F1 scores of MMix, TMix, and SMix on CoNLL-03 with varying augmentation rates (# of augmented data / # of training data) under different initial data sizes. SegMix consistently improves over the baseline, demonstrating its stability and robustness across augmentation rates. The "Average" rows give the average score for each initial data size over augmentation rates; the last column gives the average improvement over the baseline for each augmentation rate across initial data sizes.

A.2 Variants of BERT Models

As mentioned in Sec. 4.1, we adopt language-specific BERT models as the pre-trained models for all tasks. Each has 12 layers (transformer blocks), 12 attention heads, and 110 million parameters (Devlin et al., 2018). The model references are included in Table 5. For Kinyarwanda, bert-base-multilingual-cased-finetuned-kinyarwanda is obtained by fine-tuning Multilingual BERT (MBERT) on the Kinyarwanda datasets JW300, KIRNEWS, and BBC Gahuza (Adelani et al., 2021). EMBERT-Sin is obtained by applying EXTEND (Wang et al., 2020) to MBERT for Sinhala: EMBERT-Sin first incorporates the target language Sinhala by expanding the vocabulary, and then continues pre-training on LORELEI with a batch size of 32 and a learning rate of 2e-5 for 500K iterations.

Figure 6: Confusion matrix on CoNLL-03 with and without SegMix, trained with 200 training samples.
Pred. 1 | Baseline: English [MISC] county sides and another against British Universities [MISC]
        | MMix: English [MISC] county sides and another against British Universities [ORG]
Pred. 2 | Baseline: May 22 First one-day international at Headingley [ORG]
        | MMix: May 22 First one-day international at Headingley [LOC]
Pred. 3 | Baseline: July 9 v Minor Counties [MISC] XI
        | MMix: July 9 v Minor Counties [ORG] XI
Table 7: Examples of cases predicted by the baseline model and MMix from the validation dataset. The colored segments represent entity mentions; blue marks a correctly classified mention, and red marks a misclassified one.
Ex. 4 | the complete [statue]_e1 topped by an imposing [head]_e2 was originally nearly five metres high
      | True: Other | Baseline: Component-Whole(e2,e1) | RMix: Other
Ex. 5 | the [slide]_e1 which was triggered by an avalanche-control [crew]_e2 damaged one home and blocked the road for most of the day
      | True: Cause-Effect(e2,e1) | Baseline: Product-Producer(e1,e2) | RMix: Cause-Effect(e1,e2)
Table 8: Examples of cases correctly classified after RMix. The bold segment tuple represents a nominal pair, and the blue label represents a misclassified relation. The true label is shown first.

A.3 Case Analysis

We list some improved cases in Table 7. Ex. 1 and 2 are cases where confusions involving ORG are corrected, while Ex. 3 is a case where the entity label is corrected but the mention range remains incomplete (both models predict "Minor Counties" as the mention instead of "Minor Counties XI"). In Table 8, we list some improved cases for RMix on RE. Ex. 4 and 5 are both cases where the relation type is corrected; in Ex. 5, RMix helps the model classify the correct relation type but not the correct direction.