
SHAPE: Shifted Absolute Position Embedding for Transformers

Shun Kiyono♠,♡    Sosuke Kobayashi♡,♢    Jun Suzuki♡,♠    Kentaro Inui♡,♠
RIKEN    Tohoku University    Preferred Networks, Inc.   
[email protected],   [email protected],
{jun.suzuki, inui}@tohoku.ac.jp
Abstract

Position representation is crucial for building position-aware representations in Transformers. Existing position representations suffer from a lack of generalization to test data with unseen lengths or from high computational cost. We investigate shifted absolute position embedding (SHAPE) to address both issues. The basic idea of SHAPE is to achieve shift invariance, a key property of recent successful position representations, by randomly shifting absolute positions during training. We demonstrate that SHAPE is empirically comparable to its counterpart while being simpler and faster. The code is available at https://github.com/butsugiri/shape.

1 Introduction

Position representation plays a critical role in self-attention-based encoder-decoder models (Transformers) (Vaswani et al., 2017), enabling the self-attention mechanism to recognize the order of input sequences. Position representations fall into two categories (Dufter et al., 2021): absolute position embedding (APE) (Gehring et al., 2017; Vaswani et al., 2017) and relative position embedding (RPE) (Shaw et al., 2018). With APE, each position is represented by a unique embedding, which is added to the inputs. RPE represents positions in terms of the relative distance between two tokens in the self-attention mechanism.

RPE outperforms APE on sequence-to-sequence tasks (Narang et al., 2021; Neishi and Yoshinaga, 2019) owing to its extrapolation ability, i.e., the ability to generalize to sequences longer than those observed during training (Newman et al., 2020). Wang et al. (2021) reported that one of the key properties contributing to RPE's superior performance is shift invariance (also known as translation invariance), the property of a function whose output does not change when its input is shifted. However, unlike APE, RPE's formulation strongly depends on the self-attention mechanism. This motivated us to explore a way to incorporate the benefit of shift invariance into APE.

Refer to caption
Figure 1: Overview of position representations. (a) APE and (c) SHAPE consider absolute positions in the input layer, whereas (b) RPE considers the relative position of a given token pair in the self-attention mechanism.

A promising approach to achieving shift invariance while using absolute positions is to randomly shift positions during training. A similar idea can be seen in several contexts, e.g., computer vision (Goodfellow et al., 2016) and question-answering in NLP (Geva et al., 2020). APE is no exception; a random shift should force Transformer to capture the relative positional information from absolute positions. However, the effectiveness of a random shift for incorporating shift invariance in APE is yet to be demonstrated. Thus, we formulate APE with a random shift as a variant of position representation, namely, Shifted Absolute Position Embedding (SHAPE; Figure 1c), and conduct a thorough investigation. In our experiments, we first confirm that Transformer with SHAPE learns to be shift-invariant. We then demonstrate that SHAPE achieves a performance comparable to RPE in machine translation. Finally, we reveal that Transformer equipped with shift invariance shows not only better extrapolation ability but also better interpolation ability, i.e., it can better predict rare words at positions observed during the training.

2 Position Representations

Figure 1 gives an overview of the position representations compared in this paper. We denote a source sequence of $I$ tokens as $\bm{X}=(x_{1},\dots,x_{I})$. Similarly, let $\bm{Y}=(y_{1},\dots,y_{J})$ represent a target sequence of $J$ tokens.

2.1 Absolute Position Embedding (APE)

APE provides each position with a unique embedding (Figure 1a). Transformer with APE computes the input representation as the sum of the word embedding and the position embedding for each token $x_{i}\in\bm{X}$ and $y_{j}\in\bm{Y}$.

Sinusoidal positional encoding (Vaswani et al., 2017) is a deterministic function of the position and the de facto standard APE for Transformer (learned position embedding (Gehring et al., 2017) is yet another variant of APE; however, we exclusively focus on sinusoidal positional encoding as its performance is comparable (Vaswani et al., 2017)). Specifically, for the $i$-th token, the $m$-th element of the position embedding $\mathrm{PE}(i,m)$ is defined as

\mathrm{PE}(i,m)=\begin{cases}\sin\left(\frac{i}{10000^{2m/D}}\right)&m\ \text{is even}\\ \cos\left(\frac{i}{10000^{2m/D}}\right)&m\ \text{is odd,}\end{cases} \qquad (1)

where $D$ denotes the model dimension.
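
For illustration, the following is a minimal NumPy sketch that implements Equation 1 literally; the function name and the vectorized formulation are ours and are not taken from the released code.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) array whose (i, m) entry is PE(i, m)."""
    i = np.arange(num_positions)[:, None]   # token positions
    m = np.arange(d_model)[None, :]         # embedding dimensions
    angles = i / np.power(10000.0, 2.0 * m / d_model)
    # sin for even dimensions, cos for odd dimensions, as in Equation 1
    return np.where(m % 2 == 0, np.sin(angles), np.cos(angles))
```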

2.2 Relative Position Embedding (RPE)

RPE (Shaw et al., 2018) incorporates position information by considering the relative distance between two tokens in the self-attention mechanism (Figure 1b). For example, Shaw et al. (2018) represent the relative distance between the $i$-th and $j$-th tokens with relative position embeddings $\bm{a}_{i-j}^{\mathrm{Key}},\bm{a}_{i-j}^{\mathrm{Value}}\in\mathbb{R}^{D}$. These embeddings are then added to the key and value representations, respectively.
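
To make this concrete, here is a minimal single-head PyTorch sketch of the Shaw et al. (2018) formulation; the standalone function, tensor layout, and default clipping distance of 16 are illustrative assumptions, not the OpenNMT-py implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, w_q, w_k, w_v, rel_key_emb, rel_val_emb, max_dist=16):
    """x: (seq_len, d); rel_*_emb: (2 * max_dist + 1, d) relative-position tables."""
    seq_len, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Relative distances, clipped to [-max_dist, max_dist] and shifted to
    # non-negative values so they can index the embedding tables.
    pos = torch.arange(seq_len)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
    a_key = rel_key_emb[rel]   # (seq_len, seq_len, d)
    a_val = rel_val_emb[rel]   # (seq_len, seq_len, d)

    # Attention logits: content term plus a relative-position term on the keys.
    logits = (q @ k.T + torch.einsum("id,ijd->ij", q, a_key)) / d ** 0.5
    attn = F.softmax(logits, dim=-1)

    # Output: weighted values plus a relative-position term on the values.
    return attn @ v + torch.einsum("ij,ijd->id", attn, a_val)
```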

RPE outperforms APE on out-of-distribution data in terms of sequence length owing to its innate shift invariance (Rosendahl et al., 2019; Neishi and Yoshinaga, 2019; Narang et al., 2021; Wang et al., 2021). However, the self-attention mechanism of RPE involves more computation than that of APE; Narang et al. (2021) reported that Transformer with RPE is up to 25% slower than that with APE. More importantly, RPE requires a modification of the architecture, while APE does not. Specifically, RPE strongly depends on the self-attention mechanism; thus, it is not necessarily compatible with studies that attempt to replace self-attention with a more lightweight alternative (Kitaev et al., 2020; Choromanski et al., 2021; Tay et al., 2020).

RPE, originally proposed by Shaw et al. (2018), has many variants in the literature (Dai et al., 2019; Raffel et al., 2020; Huang et al., 2020; Wang et al., 2021; Wu et al., 2021). These variants aim to improve the empirical performance or the computational speed of the original RPE. However, the original RPE remains a strong method in terms of performance: Narang et al. (2021) conducted a thorough comparison on multiple sequence-to-sequence tasks and reported that the original RPE is comparable to, and sometimes better than, its variants. Thus, we exclusively use the original RPE in our experiments.

2.3 Shifted Absolute Position Embedding (SHAPE)

Given the drawbacks of RPE, we investigate SHAPE (Figure 1c) as a way to equip Transformer with shift invariance without any architectural modification or computational overhead over APE. During training, SHAPE shifts every position index of APE by a random offset. This prevents the model from using absolute positions to learn the task and instead encourages the use of relative positions, which we expect to eventually lead to learning shift invariance.

Let $k$ represent an offset drawn from a discrete uniform distribution $\mathcal{U}\{0,K\}$ for each sequence and for every iteration during training, where $K\in\mathbb{N}$ is the maximum shift. SHAPE only replaces $\mathrm{PE}(i,m)$ of APE in Equation 1 with

\mathrm{PE}(i+k,m). \qquad (2)

We sample $k$ independently for the source and target sequences. SHAPE can thus be incorporated into any model using APE with virtually no computational overhead since only the input is modified. Note that SHAPE is equivalent to the original APE if we set $K=0$; in fact, we set $K=0$ during inference. Thus, SHAPE can be seen as a natural extension of APE that incorporates shift invariance.
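
A minimal sketch of SHAPE follows, mirroring the sinusoidal sketch above; how the training flag is wired into a model is an assumption for illustration.

```python
import numpy as np

def shape_position_encoding(seq_len: int, d_model: int,
                            max_shift: int = 500, training: bool = True) -> np.ndarray:
    """Sinusoidal encoding whose positions are shifted by a random offset k (Equation 2)."""
    # k ~ U{0, K} is drawn per sequence at every training iteration; k = 0 at inference.
    # In a sequence-to-sequence model, k is sampled independently for source and target.
    k = np.random.randint(0, max_shift + 1) if training else 0
    i = np.arange(seq_len)[:, None] + k      # shifted positions i + k
    m = np.arange(d_model)[None, :]
    angles = i / np.power(10000.0, 2.0 * m / d_model)
    return np.where(m % 2 == 0, np.sin(angles), np.cos(angles))
```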

SHAPE can be interpreted from multiple viewpoints. For example, SHAPE can be seen as a regularizer that prevents Transformer from overfitting to absolute positions; such overfitting is undesirable not only for extrapolation (Neishi and Yoshinaga, 2019) but also for APE with length constraints (Takase and Okazaki, 2019; Oka et al., 2020, 2021). In addition, SHAPE can be seen as a data augmentation method because the randomly sampled $k$ shifts each instance into a different subspace during training.

3 Experiments

Using machine translation benchmark data, we first confirmed that Transformer trained with SHAPE learns shift invariance (Section 3.2). Then, we compared SHAPE with APE and RPE to investigate its effectiveness (Section 3.3).

3.1 Experimental Configuration

Dataset    We used the WMT 2016 English-German dataset for training and followed Ott et al. (2018) for tokenization and subword segmentation (Sennrich et al., 2016). We used newstest2010-2013 and newstest2014-2016 as the validation and test sets, respectively.

Our experiments consist of the following three distinct dataset settings:

(i) Vanilla:    Identical to previous studies (Vaswani et al., 2017; Ott et al., 2018).

(ii) Extrapolate:    Shift-invariant models are typically evaluated in terms of extrapolation ability (Wang et al., 2021; Newman et al., 2020). We replicated the settings of Neishi and Yoshinaga (2019); the training set excludes pairs whose source or target sequence exceeds 50 subwords, while the validation and test sets are identical to Vanilla.

(iii) Interpolate:    We also evaluate the models from the viewpoint of interpolation, which we define as the ability to generate tokens at positions observed during training. Specifically, we evaluate interpolation using long sequences for two reasons: first, the generation of long sequences is an important research topic in NLP (Zaheer et al., 2020; Maruf et al., 2021); second, in datasets with long sequences, the position distribution of each token becomes increasingly sparse. In other words, tokens in the validation and test sets become unlikely to have been observed in the training set at the corresponding positions; we expect shift invariance to be crucial for addressing such position sparsity.

In this study, we artificially generate long sequences by simply concatenating independent sentences in a parallel corpus. Specifically, given ten neighboring sentences of Vanilla, i.e., $\bm{X}_{1},\dots,\bm{X}_{10}$ and $\bm{Y}_{1},\dots,\bm{Y}_{10}$, we concatenate the sentences with a unique token $\langle\textit{sep}\rangle$. We apply the same operation to the validation and test sets.
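
A small sketch of this concatenation step; the literal separator string and the function name are illustrative assumptions.

```python
def concatenate_sentences(sentences, group_size=10, sep_token="<sep>"):
    """Join every group_size neighboring sentences into one long sequence."""
    return [
        f" {sep_token} ".join(sentences[i:i + group_size])
        for i in range(0, len(sentences), group_size)
    ]

# Applied in parallel to both sides of the corpus, e.g.:
# long_sources = concatenate_sentences(source_sentences)
# long_targets = concatenate_sentences(target_sentences)
```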

Evaluation    We evaluate the performance with sacreBLEU (Post, 2018). Throughout the experiments, we apply the Moses detokenizer to the system output and then compute detokenized BLEU (details of the datasets and evaluation are in Appendix A).
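
A minimal sketch of this evaluation step using the sacremoses and sacrebleu packages; the file names and the tokenized input format are illustrative assumptions.

```python
import sacrebleu
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="de")

# System outputs with subwords already merged back into tokens (one line per sentence).
with open("system_output.tok.txt") as f:
    hypotheses = [detok.detokenize(line.split()) for line in f]
with open("reference.detok.txt") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
print(sacrebleu.corpus_bleu(hypotheses, [references]).score)
```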

Models    We adopt transformer-base (Vaswani et al., 2017) with APE, SHAPE, or RPE. Our implementations are based on OpenNMT-py (Klein et al., 2017). Unless otherwise stated, we use a fixed maximum shift of $K=500$ for SHAPE to demonstrate that SHAPE is robust to the choice of $K$. We set the relative distance limit of RPE to 16 following Shaw et al. (2018) and Neishi and Yoshinaga (2019) (see Appendix B for a full list of hyperparameters).

3.2 Experiment 1: Shift Invariance

We confirmed that SHAPE learns shift invariance by comparing APE and SHAPE trained on Interpolate.

Quantitative Evaluation: BLEU on Training Data    We first evaluated whether the model is robust to the order of sentences in each sequence. We used sub-sampled training data (10k pairs) of Interpolate to eliminate the effect of unseen sentences; in this way, we can isolate the effect of sentence order. Given a sequence in the original order (Original), $\bm{X}_{1},\dots,\bm{X}_{10}$, we generated a swapped sequence (Swapped) by moving the first sentence to the end, i.e., $\bm{X}_{2},\dots,\bm{X}_{10},\bm{X}_{1}$. The model then generates the two output sequences $\bm{Y}_{1}^{\prime},\dots,\bm{Y}_{10}^{\prime}$ and $\bm{Y}_{2}^{\prime},\dots,\bm{Y}_{10}^{\prime},\bm{Y}_{1}^{\prime}$. Finally, we evaluated the BLEU score of $\bm{Y}_{1}^{\prime}$. The results are shown in Table 1: SHAPE shows a much smaller performance drop than APE when evaluated on the swapped ordering, which indicates that SHAPE is shift-invariant.

Model Original Swapped Performance Drop
APE 28.81 20.74 8.07
SHAPE 28.51 27.06 1.45
Table 1: BLEU score on the sub-sampled training data of Interpolate (10,000 pairs). In Original and Swapped, the order of the input sequence is $\bm{X}_{1},\dots,\bm{X}_{10}$ and $\bm{X}_{2},\dots,\bm{X}_{10},\bm{X}_{1}$, respectively.
Refer to caption
Figure 2: Cosine similarities of the encoder hidden states with different offsets $k\in\{0,100,250,500\}$. Only the representation of SHAPE is invariant to $k$.

Qualitative Evaluation: Similarities of Representations    We also qualitatively confirmed the shift invariance, as shown in Figure 2. The figure illustrates how the offset $k$ changes the encoder representations of the trained APE and SHAPE models. Given the two models and an input sequence $\bm{X}$, we computed the encoder hidden states of the input sequence for each $k\in\{0,100,250,500\}$. For each position $i$, we computed the cosine similarity ($\mathrm{sim}$) of the hidden states obtained with two offsets, i.e., $\bm{h}_{i}^{k_{1}},\bm{h}_{i}^{k_{2}}\in\mathbb{R}^{D}$, and averaged it across positions as

\frac{1}{I}\sum_{i=1}^{I}\mathrm{sim}(\bm{h}_{i}^{k_{1}},\bm{h}_{i}^{k_{2}}). \qquad (3)
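
A minimal sketch of Equation 3, assuming the encoder hidden states for the two offsets are given as arrays of shape (I, D); names are illustrative.

```python
import numpy as np

def mean_cosine_similarity(h_k1: np.ndarray, h_k2: np.ndarray) -> float:
    """Average, over positions i, of sim(h_i^{k1}, h_i^{k2}); inputs are (I, D) arrays."""
    numerator = np.sum(h_k1 * h_k2, axis=-1)
    denominator = np.linalg.norm(h_k1, axis=-1) * np.linalg.norm(h_k2, axis=-1)
    return float(np.mean(numerator / denominator))
```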

As shown in Figure 2, SHAPE builds a shift-invariant representation; regardless of the offset $k$, the cosine similarity is almost always 1.0. Such invariance is nontrivial because the representation of APE does not show this property (additional figures are available in Appendix C).

Dataset Model Valid Test Speed
Vanilla APE 23.61 30.46 x1.00
RPE 23.67 30.54 x0.91
SHAPE 23.63 30.49 x1.01
Extrapolate APE 22.18 29.22 x1.00
RPE 22.97 29.86 x0.91
SHAPE 22.96 29.80 x0.99
Interpolate APE 31.40 38.23 x1.00
RPE - - -
SHAPE 32.50 39.09 x0.99
Table 2: BLEU scores on newstest2010-2016. Valid is the average of newstest2010-2013. Test is the average of newstest2014-2016. The scores for the individual newstests are available in Appendix D. †: the values are averages of five distinct trials with five different random seeds. -: not available as the implementation was very slow. Speed is the relative speed with respect to APE (larger is faster).

3.3 Experiment 2: Performance Comparison

We compared the overall performance of the position representations on the validation and test sets, as shown in Table 2. Figure 3 shows the BLEU improvement of RPE and SHAPE over APE with respect to the source sequence length (the same graph with absolute BLEU scores is in Appendix D).

On Vanilla, the three models show comparable results. APE being comparable to RPE is inconsistent with the result reported by Shaw et al. (2018); we assume that this is due to a difference in implementation. In fact, Narang et al. (2021) have recently reported that improvements in Transformer often do not transfer across implementations.

Refer to caption
Figure 3: BLEU score improvement over APE on the validation and test sets with respect to the source sequence length. Gray indicates length ranges with no training data.

On Extrapolate, RPE (29.86) outperforms APE (29.22) by approximately 0.6 BLEU points on the test set; this is consistent with the result reported by Neishi and Yoshinaga (2019). Moreover, SHAPE achieves test performance comparable to RPE (29.80). According to Figure 3a, both RPE and SHAPE have improved extrapolation ability, i.e., better BLEU scores on sequences longer than those observed during training. In addition, Figure 3a shows the performance of SHAPE with the maximum shift $K=40$, which was chosen on the basis of the BLEU score on the validation set. This model outperforms RPE, achieving BLEU scores of 23.12 and 29.86 on the validation and test sets, respectively. These results indicate that SHAPE can be a better alternative to RPE.

On Interpolate, we were unable to train RPE because its training was prohibitively slow: a single gradient step of RPE took about 5 seconds, which was 20 times longer than that of APE and SHAPE; we assume that the RPE implementation available in OpenNMT-py has difficulty dealing with long sequences. As on Extrapolate, SHAPE (39.09) outperforms APE (38.23) on the test set. Figure 3b shows that SHAPE consistently outperforms APE for every sequence length. From this result, we find that shift invariance also improves the interpolation ability of Transformer.

Refer to caption
(a) Extrapolate dataset
Refer to caption
(b) Interpolate dataset
Figure 4: Tokenwise analysis on gold references: the value in each cell represents the ratio of cases in which SHAPE assigns a higher score to the gold token than APE does.

4 Analysis

This section provides a deeper analysis of how shift invariance improves the performance. We hereinafter focus exclusively on APE and SHAPE because SHAPE achieves performance comparable to RPE, and we were unable to train RPE on the Interpolate dataset, as explained in Section 3.3.

As discussed in Section 3.3, Figure 3 demonstrates that SHAPE outperforms APE in terms of BLEU score. However, BLEU evaluates two concepts simultaneously, namely, the token precision via n-gram matching and the output length via the brevity penalty (Papineni et al., 2002). Thus, the actual source of the improvement remains unclear. We therefore analyze the precision of token prediction in isolation. Specifically, we computed tokenwise scores assigned to gold references and compared them across the models; given a sequence pair $(\bm{X},\bm{Y})$ and a trained model, we computed a score (i.e., log probability) $s_{j}$ for each token $y_{j}$ in a teacher-forcing manner. A higher score for the gold token indicates better model performance. We used the validation set for this comparison.
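
A hedged sketch of this tokenwise scoring, assuming a sequence-to-sequence model that returns a logit matrix over the target vocabulary under teacher forcing; the model interface shown here is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tokenwise_gold_scores(model, src_ids, tgt_ids):
    """Return the log probability s_j assigned to each gold token y_j."""
    # Teacher forcing: feed the gold prefix y_0..y_{J-1} and score the next tokens y_1..y_J.
    logits = model(src_ids, tgt_ids[:-1])          # assumed shape: (J, vocab_size)
    log_probs = F.log_softmax(logits, dim=-1)
    gold = tgt_ids[1:].unsqueeze(-1)               # gold next tokens, shape (J, 1)
    return log_probs.gather(-1, gold).squeeze(-1)  # s_1, ..., s_J
```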

Figure 4 shows, for each decoder position, the ratio of cases in which SHAPE assigns a higher score to the gold token than APE does.

Better extrapolation means better token precision    Figure 4(a) shows that SHAPE outperforms APE, especially in the right part of the heat map. This area corresponds to sequences longer than those observed during training. This result indicates that better extrapolation in terms of BLEU score means better token precision.

Interpolation is particularly effective for rare tokens    As shown in Figure 4(b), SHAPE consistently outperforms APE, and the performance gap is especially large in the low-frequency region (bottom part). This indicates that SHAPE predicts rare words better than APE. One plausible explanation for this observation is that SHAPE carries out data augmentation in the sense that, in each epoch, the same sequence pair is assigned different positions depending on the offset $k$. Rare words typically have sparse position distributions in the training data and thus benefit from the extra position assignments during training.

5 Conclusion

We investigated SHAPE, a simple variant of APE with shift invariance. We demonstrated that SHAPE is empirically comparable to RPE yet imposes almost no computational overhead over APE. Our analysis revealed that SHAPE is effective both at extrapolating to unseen lengths and at interpolating, particularly for rare words. SHAPE can be incorporated into an existing codebase with a few lines of code and no risk of a performance drop from APE; thus, we expect SHAPE to be used as a drop-in replacement for APE and RPE.

Acknowledgements

We thank the anonymous reviewers for their insightful comments. We thank Sho Takase for valuable discussions. We thank Ana Brassard, Benjamin Heinzerling, Reina Akama, and Yuta Matsumoto for their valuable feedback. The work of Jun Suzuki was supported by JST Moonshot R&D Grant Number JPMJMS2011 (fundamental research) and JSPS KAKENHI Grant Number 19H04162 (empirical evaluation).

References

Appendix A Summary of Datasets

Table 3 summarizes the statistics, preprocessing, and evaluation metrics of the datasets used in our experiments. The length distributions are shown in Figure 5.

Dataset Name Training Data # of Sent. Pairs in Training Data Validation Test Evaluation Metric
Vanilla WMT 2016 English-German 4.5M newstest2010-2013 newstest2014-2016 detokenized BLEU via sacreBLEU
Extrapolate WMT 2016 English-German. We removed sequence pairs if the length of the source or target sentence exceeds 50 subwords. 3.9M newstest2010-2013 newstest2014-2016 detokenized BLEU via sacreBLEU
Interpolate WMT 2016 English-German. Given ten neighboring sentences of Vanilla, i.e., $\bm{X}_{1},\dots,\bm{X}_{10}$ and $\bm{Y}_{1},\dots,\bm{Y}_{10}$, we concatenated the sentences with a special token $\langle\textit{sep}\rangle$. 450K newstest2010-2013, with sentences concatenated as in the training data. newstest2014-2016, with sentences concatenated as in the training data. detokenized BLEU via sacreBLEU
Table 3: Summary of the statistics, preprocessing, and evaluation metrics of the datasets used in our experiments.
Refer to caption
(a) Vanilla dataset
Refer to caption
(b) Extrapolate dataset
Refer to caption
(c) Interpolate dataset
Figure 5: Distribution of source sequence length of each dataset.

Appendix B Hyperparameters

We present the list of hyperparameters used in our experiments in Table 4. The hyperparameters for training Transformer follow the recipe available on the official OpenNMT-py documentation page (https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model).

Configurations Selected Value
Encoder-Decoder Architecture transformer-base (Vaswani et al., 2017)
Optimizer Adam ($\beta_{1}=0.9, \beta_{2}=0.98, \epsilon=1\times10^{-8}$)
Learning Rate Schedule “Noam” scheduler described in (Vaswani et al., 2017)
Warmup Steps 8,000
Learning Rate Scaling Factor 2
Dropout 0.1
Gradient Clipping None
Beam Search Width 4
Label Smoothing $\epsilon_{ls}=0.1$ (Szegedy et al., 2016)
Mini-batch Size 112k tokens
Number of Gradient Steps 200,000
Averaging Save checkpoint for every 5,000 steps and take an average of last 10 checkpoints
Maximum Offset $K$ (for SHAPE) We set $K=500$ for most of the experiments. We manually tuned $K$ on validation BLEU for Extrapolate over the range {10, 20, 30, 40, 100, 500}, and report the score of $K=40$ in addition to $K=500$. We used a single random seed for the tuning.
Relative Distance Limit (for RPE) 16 following (Neishi and Yoshinaga, 2019)
GPU Hardware Used DGX-1 and DGX-2
Table 4: List of hyperparameters. †: this corresponds to the “learning rate” variable defined in the OpenNMT-py framework.

Appendix C Similarities of Representations

In Section 3.2, we presented Figure 2 to qualitatively demonstrate that the representation of SHAPE is shift-invariant. Figure 6 presents ten additional figures created from ten additional instances. Their characteristics are consistent with those observed in Figure 2; the representation of SHAPE is shift-invariant, whereas the representation of APE is not.

Refer to caption
(a) Sequence ID: #1
Refer to caption
(b) Sequence ID: #2
Refer to caption
(c) Sequence ID: #3
Refer to caption
(d) Sequence ID: #4
Refer to caption
(e) Sequence ID: #5
Refer to caption
(f) Sequence ID: #6
Refer to caption
(g) Sequence ID: #7
Refer to caption
(h) Sequence ID: #8
Refer to caption
(i) Sequence ID: #9
Refer to caption
(j) Sequence ID: #10
Figure 6: Cosine similarities of encoder hidden states with different offsets $k\in\{0,100,250,500\}$. Only the representation of SHAPE is invariant to $k$.

Appendix D Detailed BLEU Scores

We report the BLEU score on each of newstest2010-2016 in Table 5. The sacreBLEU hash for Vanilla and Extrapolate is BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt{10,11,12,13,14/full,15,16}+tok.13a+version.1.5.0, and that for Interpolate is BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.0. In addition, we report the performance of APE, RPE, and SHAPE with respect to the source sequence length in Figure 7.

Model 2010 2011 2012 2013 2014 2015 2016 Average Speed
Dataset: Vanilla
APE 24.22 21.98 22.20 26.06 26.95 29.98 34.46 26.55 x1.00
RPE 24.29 22.05 22.22 26.13 27.00 30.00 34.61 26.61 x0.91
SHAPE 24.18 22.01 22.23 26.08 26.89 30.12 34.48 26.57 x1.01
Dataset: Extrapolate
APE 22.69 20.36 20.72 24.94 26.24 28.79 32.62 25.19 x1.00
RPE 23.46 21.19 21.69 25.54 26.80 29.43 33.34 25.92 x0.91
SHAPE 23.60 21.24 21.53 25.45 26.54 29.22 33.63 25.89 x0.99
Dataset: Interpolate
APE 31.41 29.71 29.79 34.69 35.36 38.00 41.32 34.33 x1.00
RPE - - - - - - - - -
SHAPE 32.71 30.77 30.96 35.54 35.72 39.18 42.37 35.32 x0.99
Table 5: BLEU scores on newstest2010-2016. The Average column shows the macro average over all newstests. †: the values are averages of five distinct trials with five different random seeds. -: not available as the implementation was very slow. Speed is the relative speed with respect to APE (larger is faster).
Refer to caption
Figure 7: BLEU score on the validation and test sets with respect to the source sequence length. Gray indicates length ranges with no training data.

Appendix E Learning Curve of Each Model

We present the learning curve of each model (APE, RPE, SHAPE) trained on each dataset (Vanilla, Extrapolate, Interpolate). Figures 8 and 9 plot the validation perplexity against the number of gradient steps and the wall-clock time, respectively. From these figures, we make the following observations:

First, according to Figure 8, the speed of convergence is similar across the models in terms of the number of gradient steps. In other words, in our experiments (Section 3), we never compare models at different degrees of convergence.

Second, Figure 9 demonstrates that RPE requires more time to complete training than APE and SHAPE do. As explained in Section 2.2, RPE incurs computational overhead because it needs to compute attention over the relative position embeddings. The time required to complete training is presented in Table 6.

Refer to caption
(a) Vanilla dataset
Refer to caption
(b) Extrapolate dataset
Refer to caption
(c) Interpolate dataset
Figure 8: Learning curves for each position representation and dataset. We compare the speed of convergence in terms of the number of gradient steps.
Refer to caption
(a) Vanilla dataset
Refer to caption
(b) Extrapolate dataset
Refer to caption
(c) Interpolate dataset
Figure 9: Learning curves for each position representation and dataset. We compare the speed of convergence in terms of wall-clock time.
Model Dataset Hardware Training Time (sec) Number of Parameters
APE Vanilla DGX-1 97,073 61M
RPE Vanilla DGX-1 107,089 61M
SHAPE Vanilla DGX-1 96,439 61M
APE Extrapolate DGX-1 101,469 61M
RPE Extrapolate DGX-1 111,246 61M
SHAPE Extrapolate DGX-1 102,535 61M
APE Interpolate DGX-2 69,148 61M
SHAPE Interpolate DGX-2 69,529 61M
Table 6: Training time required to complete 200,000 gradient steps. RPE requires more time than APE and SHAPE do. Figure 9 illustrates the corresponding learning curve.

Appendix F Sanity Check of the Baseline Performance

Building a strong baseline is essential for trustworthy results (Denkowski and Neubig, 2017). To confirm that our baseline model (i.e., Transformer with APE) trained with OpenNMT-py (Klein et al., 2017) is strong enough, we compared its performance with that of a model trained with Fairseq (Ott et al., 2019). Fairseq is another state-of-the-art framework, used by winning teams of the WMT shared tasks (Ng et al., 2019). For training on Fairseq, we used the official recipe available in its documentation (https://github.com/pytorch/fairseq/tree/master/examples/scaling_nmt). The results are presented in Table 7; they are averages of five distinct trials with different random seeds. From the table, we confirm that the two models achieve comparable results.

Model Implementation 2010 2011 2012 2013 2014 2015 2016 Average
APE Fairseq 24.24 22.10 22.40 26.38 27.11 29.58 34.34 26.59
APE OpenNMT-py 24.22 21.98 22.20 26.06 26.95 29.98 34.46 26.55
Table 7: BLEU scores on newstest2010-2016. We report the average of five distinct trials with different random seeds.