Transformer Conformal Prediction for Time Series

Junghwan Lee    Chen Xu    Yao Xie
Abstract

We present a conformal prediction method for time series using the Transformer architecture to capture long-memory and long-range dependencies. Specifically, we use the Transformer decoder as a conditional quantile estimator to predict the quantiles of prediction residuals, which are used to estimate the prediction interval. We hypothesize that the Transformer decoder benefits the estimation of the prediction interval by learning temporal dependencies across past prediction residuals. Our comprehensive experiments using simulated and real data empirically demonstrate the superiority of the proposed method compared to the existing state-of-the-art conformal prediction methods.

Conformal Prediction, Uncertainty Quantification, Machine Learning, Deep Learning, Transformer

1 Introduction

Uncertainty quantification has become crucial in many scientific domains where black-box machine learning models are often used (angelopoulos2021gentle). Conformal prediction has emerged as a popular and modern technique for uncertainty quantification by providing valid predictive inference for those black-box models (shafer2008tutorial; barber2023conformal).

Time series prediction aims to forecast future values based on a sequence of observations that are sequentially ordered in time (box2015time). With recent advances in machine learning, numerous models have been proposed and adopted for various time series prediction tasks. The increased use of black-box machine learning models necessitates uncertainty quantification, particularly in high-stakes time series prediction tasks such as medical event prediction, stock prediction, and weather forecasting.

While conformal prediction can provide valid predictive inference for uncertainty quantification, applying it to time series is challenging since time series data often violate the exchangeability assumption. Additionally, real-world time series data typically exhibit significant stochastic variations and strong temporal correlations. Many efforts have been made to develop valid and effective conformal prediction methods for time series (xu2023conformal). Sequential Predictive Conformal Inference (SPCI), a recently proposed conformal prediction framework for time series, has shown state-of-the-art performance by using Quantile Random Forest (meinshausen2006quantile) as a conditional quantile estimator to predict the quantiles of future prediction residuals, which are used to estimate the prediction interval (xu2023sequential).

In this study, we employ a Transformer decoder (vaswani2017attention; radford2018improving) as the conditional quantile estimator in the SPCI framework. Specifically, the Transformer decoder takes a sequence of past residuals and features as input to predict the quantiles of future residuals. Given that the decoder-only Transformer architecture has already shown impressive performance in many sequential modeling tasks, we hypothesize that utilizing it in the SPCI framework benefits the estimation of the prediction interval by learning temporal dependencies between the residuals. We empirically demonstrate the superiority of the proposed method through experiments with simulated and real data, comparing it to state-of-the-art conformal prediction methods.

Figure 1: Visual description of our proposed method. Prediction residuals ($\hat{\epsilon}_{t}$) and corresponding features ($X_{t}$) are used as input to a stack of Transformer decoder layers to predict the quantiles of future residual(s), which are used to estimate prediction intervals.

Figure 2: Prediction intervals estimated by (a) SPCI-T, (b) SPCI, (c) EnbPI, and (d) NexCP on the non-stationary simulated dataset.

2 Problem setup

Consider a sequence of observations $\{(X_{t},Y_{t}):t=1,2,\ldots\}$, where $X_{t}\in\mathbb{R}^{d}$ denotes a $d$-dimensional feature vector and $Y_{t}\in\mathbb{R}$ denotes a continuous scalar outcome. Assume that the first $T$ samples $\{(X_{t},Y_{t})\}_{t=1}^{T}$ are the data for training and validation. We also assume that we have a point prediction method $\hat{f}$ that provides a point prediction $\hat{Y}_{t}$ for $Y_{t}$ as $\hat{Y}_{t}=\hat{f}(X_{t})$.

Our goal is to sequentially construct a prediction interval $\hat{C}_{t-1}(X_{t})$, starting from time $T+1$, that contains the true outcome $Y_{t}$ with probability at least $1-\alpha$, where the significance level $\alpha$ is pre-defined. $\hat{C}_{t-1}(X_{t})$ is constructed using the past observations and predictions up to time $t-1$.

We use the prediction residual (i.e., prediction error) as a non-conformity score, which is defined as:

$$\hat{\epsilon}_{t}=Y_{t}-\hat{Y}_{t}. \tag{1}$$

There are two types of coverage guarantees that prediction intervals should satisfy: marginal coverage and conditional coverage. Marginal coverage is defined as follows:

$$\mathbb{P}\left(Y_{t}\in\hat{C}_{t-1}(X_{t})\right)\geq 1-\alpha,\quad\forall t. \tag{2}$$

Conditional coverage is a stronger guarantee: given each $X_{t}$, the true observation $Y_{t}$ is included in $\hat{C}_{t-1}(X_{t})$ with probability at least $1-\alpha$, which can be defined as follows:

$$\mathbb{P}\left(Y_{t}\in\hat{C}_{t-1}(X_{t})\,\middle|\,X_{t}\right)\geq 1-\alpha,\quad\forall t. \tag{3}$$

If $\hat{C}_{t-1}(X_{t})$ satisfies eq (2) or eq (3), it is called marginally valid or conditionally valid, respectively. The aim is to construct a $\hat{C}_{t-1}(X_{t})$ that is both marginally and conditionally valid.

While infinite-width prediction intervals always satisfy both coverage guarantees, such intervals are pointless since they do not carry any information to quantify uncertainty. Therefore, we aim to minimize the width of prediction intervals while satisfying coverage.
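As a concrete reading of the two quantities traded off here, the following sketch computes empirical coverage (taken to be the fraction of test points whose outcome falls inside its interval) and the mean interval width. The function name `coverage_and_width` and the toy numbers are illustrative assumptions, not part of the released code.

```python
from typing import Sequence, Tuple


def coverage_and_width(
    y_true: Sequence[float],
    intervals: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (empirical coverage, mean interval width)."""
    if len(y_true) != len(intervals) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # A point is covered when it lies inside its own interval.
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return hits / len(y_true), mean_width


# Toy usage with made-up outcomes and intervals.
outcomes = [1.0, 2.0, 3.0, 4.0]
toy_intervals = [(0.5, 1.5), (2.5, 3.5), (2.0, 4.0), (3.0, 5.0)]
coverage, width = coverage_and_width(outcomes, toy_intervals)
print(f"coverage={coverage:.2f}, mean width={width:.2f}")
```

An infinitely wide interval would drive the first number to 1 and the second to infinity, which is why both are reported together.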

3 Method

3.1 Sequential Predictive Conformal Inference

Sequential Predictive Conformal Inference (SPCI) is a conformal prediction framework for time series proposed by xu2023sequential. SPCI adopts Quantile Random Forest (meinshausen2006quantile) as a conditional quantile estimator to estimate the quantiles of $\hat{\epsilon}_{t}$ sequentially, leveraging the dependencies among the past residuals. Specifically, SPCI estimates prediction intervals as follows:

$$\left[\hat{f}(X_{t})+\hat{Q}_{t}(\hat{\beta}),\ \hat{f}(X_{t})+\hat{Q}_{t}(1-\alpha+\hat{\beta})\right], \tag{4}$$

where $\hat{Q}$ denotes a conditional quantile estimator and $\hat{Q}_{t}(p)$ is the estimate of the true $p$-th quantile of $\hat{\epsilon}_{t}$. $\hat{\beta}$ denotes the value that minimizes the width of the prediction interval:

$$\hat{\beta}=\arg\min_{\beta\in[0,\alpha]}\left(\hat{Q}_{t}(1-\alpha+\beta)-\hat{Q}_{t}(\beta)\right).$$
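The interval in eq (4) and the search for $\hat{\beta}$ can be sketched as below. This is a minimal illustration under stated assumptions: the empirical quantile of past residuals stands in for the conditional quantile estimator $\hat{Q}_{t}$ (SPCI itself uses Quantile Random Forest), the minimization over $\beta$ is done on a uniform grid, and the names `empirical_quantile`, `spci_interval`, and `n_grid` are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def empirical_quantile(sample: Sequence[float], p: float) -> float:
    """Linear-interpolation quantile of `sample` at level p in [0, 1]."""
    xs = sorted(sample)
    p = min(max(p, 0.0), 1.0)  # guard against floating-point overshoot
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def spci_interval(
    point_pred: float,
    quantile_fn: Callable[[float], float],
    alpha: float,
    n_grid: int = 20,
) -> Tuple[float, float, float]:
    """Return (lower, upper, beta_hat) for the narrowest interval on the grid."""
    best = None
    for i in range(n_grid + 1):
        beta = alpha * i / n_grid  # candidate beta in [0, alpha]
        lower_q = quantile_fn(beta)
        upper_q = quantile_fn(1.0 - alpha + beta)
        if best is None or upper_q - lower_q < best[1] - best[0]:
            best = (lower_q, upper_q, beta)
    lower_q, upper_q, beta_hat = best
    # Shift the residual quantiles by the point prediction, as in eq (4).
    return point_pred + lower_q, point_pred + upper_q, beta_hat


# Toy usage: right-skewed residuals (made-up values).
past_residuals = [-0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4, 1.0, 3.0, 9.0]
lower, upper, beta_hat = spci_interval(
    10.0, lambda p: empirical_quantile(past_residuals, p), alpha=0.1
)
print(lower, upper, beta_hat)
```

For skewed residuals the selected $\hat{\beta}$ moves away from $\alpha/2$, so the interval is never wider than the equal-tailed one evaluated on the same grid.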

Table 1: Empirical coverage and interval width of SPCI-T and baselines on the simulated datasets. The target coverage is set to 0.9, and the past window is set to 100. We report the average value with standard deviation calculated based on three independent trials with different random seeds.

| Method | Non-stationary: Coverage | Non-stationary: Width | Heteroskedasticity: Coverage | Heteroskedasticity: Width |
| --- | --- | --- | --- | --- |
| SPCI-T | $0.91_{\pm.008}$ | $\textbf{52.38}_{\pm.584}$ | $0.88_{\pm.002}$ | $\textbf{9.56}_{\pm.082}$ |
| SPCI | $0.96_{\pm.004}$ | $76.35_{\pm.527}$ | $0.89_{\pm.006}$ | $10.03_{\pm.001}$ |
| EnbPI | $0.86_{\pm.002}$ | $134.7_{\pm.743}$ | $0.90_{\pm.006}$ | $10.72_{\pm.008}$ |
| NexCP | $0.91_{\pm.002}$ | $156.4_{\pm 2.73}$ | $0.92_{\pm.004}$ | $11.73_{\pm.198}$ |

Table 2: Empirical coverage and interval width of multi-step prediction using SPCI-T on the simulated datasets. The target coverage is set to 0.9, and the past window is set to 100.

| Step | Non-stationary: Coverage | Non-stationary: Width | Heteroskedasticity: Coverage | Heteroskedasticity: Width |
| --- | --- | --- | --- | --- |
| $s=2$ | $0.88_{\pm.015}$ | $51.54_{\pm.314}$ | $0.84_{\pm.003}$ | $9.67_{\pm.134}$ |
| $s=3$ | $0.86_{\pm.008}$ | $51.46_{\pm.183}$ | $0.80_{\pm.003}$ | $9.72_{\pm.151}$ |
| $s=4$ | $0.84_{\pm.010}$ | $51.49_{\pm.256}$ | $0.80_{\pm.010}$ | $9.76_{\pm.140}$ |

Table 3: Empirical coverage and interval width of the proposed method and baselines. The target coverage is 0.9, and the past window is set to 50 for all experiments. We report the average value with standard deviation calculated based on three independent trials with different random seeds.

| Method | Wind: Coverage | Wind: Width | Electricity: Coverage | Electricity: Width | Solar: Coverage | Solar: Width |
| --- | --- | --- | --- | --- | --- | --- |
| SPCI-T | $0.93_{\pm.006}$ | $\textbf{2.08}_{\pm.072}$ | $0.92_{\pm.009}$ | $\textbf{0.18}_{\pm.005}$ | $0.93_{\pm.005}$ | $\textbf{50.70}_{\pm 3.84}$ |
| SPCI | $0.96_{\pm.016}$ | $2.41_{\pm.016}$ | $0.92_{\pm.004}$ | $0.22_{\pm.001}$ | $0.91_{\pm.006}$ | $88.76_{\pm.245}$ |
| EnbPI | $0.48_{\pm.006}$ | $4.10_{\pm.009}$ | $0.79_{\pm.002}$ | $0.22_{\pm.001}$ | $0.88_{\pm.002}$ | $86.91_{\pm.363}$ |
| NexCP | $0.92_{\pm.016}$ | $6.27_{\pm.145}$ | $0.89_{\pm.001}$ | $0.46_{\pm.001}$ | $0.86_{\pm.002}$ | $114.98_{\pm.201}$ |

Table 4: Empirical coverage and interval width of the proposed method and baselines. The target coverage is 0.9, and the past window is set to 100 for all experiments. Note that the performance of NexCP is identical to the performance in Table 3 since it does not use the past window.

| Method | Wind: Coverage | Wind: Width | Electricity: Coverage | Electricity: Width | Solar: Coverage | Solar: Width |
| --- | --- | --- | --- | --- | --- | --- |
| SPCI-T | $0.91_{\pm.011}$ | $\textbf{1.96}_{\pm.094}$ | $0.92_{\pm.013}$ | $\textbf{0.17}_{\pm.014}$ | $0.90_{\pm.006}$ | $\textbf{45.35}_{\pm 1.67}$ |
| SPCI | $0.94_{\pm.006}$ | $2.39_{\pm.030}$ | $0.93_{\pm.005}$ | $0.22_{\pm.002}$ | $0.93_{\pm.002}$ | $88.21_{\pm.434}$ |
| EnbPI | $0.74_{\pm.038}$ | $4.65_{\pm.026}$ | $0.85_{\pm.001}$ | $0.26_{\pm.001}$ | $0.88_{\pm.002}$ | $86.64_{\pm.138}$ |
| NexCP | $0.92_{\pm.016}$ | $6.27_{\pm.145}$ | $0.89_{\pm.001}$ | $0.46_{\pm.001}$ | $0.86_{\pm.002}$ | $114.98_{\pm.201}$ |

3.2 Conformal Prediction for Time Series with Transformer

In this study, we employ a Transformer (vaswani2017attention) as the conditional quantile estimator $\hat{Q}$ to estimate the true quantiles of $\hat{\epsilon}_{t}$ within the SPCI framework. Specifically, we use a decoder-only architecture (radford2018improving) since it can generalize to variable lengths of sequences without strictly partitioning the sequences for encoding and decoding. Throughout the paper, we refer to the proposed method as Sequential Predictive Conformal Inference with Transformer (SPCI-T).

The past $w$ residuals and features are used as input for the model to predict the quantiles of the future unobserved residuals. Specifically, $\{Z_{t}\}_{t-(w+1)}^{t-1}$ is used to predict $\hat{Q}_{t}(p)$, where $Z_{t}:=[X_{t},\hat{\epsilon}_{t}]$. A fully connected layer without activation converts $Z_{t}$ to the input representation of the model dimension. Note that $Z_{t}$ can include other features that are useful for prediction besides $X_{t}$. Figure 1 visually describes the model architecture.
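To make the input construction concrete, the sketch below concatenates each feature vector with its residual and slices one window per target time step. It is an illustration only: the helper name `build_windows` is hypothetical, and the window is assumed to hold the $w$ most recent rows, following the "past $w$ residuals and features" wording.

```python
from typing import List, Sequence, Tuple


def build_windows(
    features: Sequence[Sequence[float]],
    residuals: Sequence[float],
    w: int,
) -> Tuple[List[List[List[float]]], List[float]]:
    """Pair every target index t >= w with the w rows that precede it."""
    if len(features) != len(residuals):
        raise ValueError("features and residuals must have the same length")
    # Each row is the concatenation [X_t, residual_t].
    z = [list(x) + [e] for x, e in zip(features, residuals)]
    inputs = [z[t - w:t] for t in range(w, len(z))]
    targets = [residuals[t] for t in range(w, len(z))]
    return inputs, targets


# Toy usage: four time steps, one feature, window of two.
toy_features = [[0.0], [1.0], [2.0], [3.0]]
toy_residuals = [0.1, 0.2, 0.3, 0.4]
windows, targets = build_windows(toy_features, toy_residuals, w=2)
print(windows[0], targets[0])
```

Each window has shape (w, d+1) and would be mapped to the model dimension by the fully connected input layer before entering the decoder stack.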

A fully connected layer without activation transforms the output representation into the prediction of the quantile of $\hat{\epsilon}_{t}$. Training is done by sequentially minimizing the quantile loss as follows:

$$\mathcal{L}(\hat{\epsilon},\hat{\epsilon}^{\prime},p)=\begin{cases}p(\hat{\epsilon}-\hat{\epsilon}^{\prime})&\text{if }\hat{\epsilon}-\hat{\epsilon}^{\prime}\geq 0,\\ (1-p)(\hat{\epsilon}^{\prime}-\hat{\epsilon})&\text{if }\hat{\epsilon}^{\prime}-\hat{\epsilon}\geq 0,\end{cases} \tag{5}$$

where $p$ is the target quantile and $\hat{\epsilon}^{\prime}$ is the predicted value of $\hat{\epsilon}$ corresponding to the target quantile.
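The quantile loss in eq (5) can be written directly as a small function. The sketch below is illustrative (the name `quantile_loss` and the toy sample are assumptions) and also shows why the loss is suited to quantile estimation: over a sample, the candidate that minimizes the average loss at $p=0.5$ is the sample median.

```python
from typing import Sequence


def quantile_loss(residual: float, predicted: float, p: float) -> float:
    """Quantile (pinball) loss of eq (5) for a single observation."""
    diff = residual - predicted
    if diff >= 0:
        return p * diff  # under-prediction is weighted by p
    return (1.0 - p) * (-diff)  # over-prediction is weighted by 1 - p


def mean_quantile_loss(sample: Sequence[float], predicted: float, p: float) -> float:
    """Average quantile loss of one predicted value over a sample."""
    return sum(quantile_loss(e, predicted, p) for e in sample) / len(sample)


# Toy usage: search the sample itself for the best constant prediction.
toy_sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
best = min(toy_sample, key=lambda c: mean_quantile_loss(toy_sample, c, 0.5))
print(best)
```

With $p$ close to 1 the loss penalizes under-prediction much more heavily, which pushes the prediction toward the upper tail of the residual distribution.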

We hypothesize that using the Transformer decoder as a conditional quantile estimator can offer the following advantages:

  • Transformer decoder can effectively learn temporal dependencies, including long-term dependencies, across residuals.

  • We can incorporate additional features, such as $X_{t}$, for conditional quantile estimation, allowing the Transformer to learn potential dependencies between these additional features and the residuals.

  • Transformer decoder can perform multi-step predictions using known features through generative inference without needing explicit residuals as input for prediction.

4 Experiments

We evaluate SPCI-T and baselines on simulated and real data. The code for all experiments is available at anonymousurl. Hyperparameters and implementation details are available in Appendix A.

4.1 Setup

We first obtain point predictions for all $Y_{t}$ and the corresponding residuals $\hat{\epsilon}_{t}$ by using leave-one-out (LOO) point predictors in all experiments. We use an ensemble of 25 random forests as the LOO point predictor.

We use state-of-the-art conformal prediction methods as baselines, which include SPCI (xu2023sequential), EnbPI (xu2021conformal), and NexCP (barber2023conformal). We evaluate SPCI-T and the baselines in terms of interval coverage and width. We also evaluate the multi-step prediction of SPCI-T. In the multi-step prediction setup, we aim to estimate the prediction intervals $s$ steps ahead, assuming that only the known features ($X_{t}$) are available for multi-step prediction.

Table 5: Empirical coverage and interval width of multi-step prediction using SPCI-T on real datasets. The target coverage is 0.9, and the past window is set to 100.

| Step | Electricity: Coverage | Electricity: Width | Solar: Coverage | Solar: Width |
| --- | --- | --- | --- | --- |
| $s=2$ | $0.87_{\pm.004}$ | $0.18_{\pm.007}$ | $0.88_{\pm.005}$ | $42.70_{\pm 1.20}$ |
| $s=3$ | $0.82_{\pm.004}$ | $0.18_{\pm.007}$ | $0.86_{\pm.010}$ | $44.56_{\pm 1.82}$ |
| $s=4$ | $0.78_{\pm.007}$ | $0.18_{\pm.007}$ | $0.84_{\pm.020}$ | $46.83_{\pm 2.08}$ |

4.2 Simulation

Dataset

We generate two simulated time series datasets. The data generating process follows $Y_{t}=f(X_{t})+\epsilon_{t}$. The first dataset is a non-stationary time series, which contains periodicity and autoregressive $\epsilon_{t}$. The second dataset is a time series with heteroskedastic errors, where the variance of $\epsilon_{t}$ depends on $X_{t}$. Details on the simulated data are provided in Appendix A.

Results

Table 1 shows the empirical coverage and interval width of SPCI-T and the baselines on the two simulated datasets. SPCI-T achieves the narrowest intervals on both datasets while keeping coverage close to the 0.9 target (0.91 on the non-stationary dataset and 0.88 on the heteroskedastic dataset). Figure 2 displays the prediction intervals estimated by SPCI-T and the baselines on the non-stationary dataset, confirming that SPCI-T obtains markedly narrower intervals than the baselines. Table 2 presents multi-step prediction results of SPCI-T. While SPCI-T maintains its prediction interval width, it loses coverage in multi-step prediction as $s$ grows.

4.3 Real data Experiments

Dataset

We use three real-world time series datasets: solar, wind, and electricity. The solar dataset (zhang2021solar) contains solar radiation information in downtown Atlanta measured in terms of diffuse horizontal irradiance, provided by the United States National Solar Radiation Database. The wind dataset consists of wind speed records measured by the Midcontinent Independent System Operator every 15 minutes over a one-week period in September 2020 (zhu2021multi). The electricity dataset contains electricity usage and pricing in the states of New South Wales and Victoria in Australia, observed between 1996 and 1999 (harries1999splice). All three datasets are widely adopted benchmarks in the conformal prediction literature. Details of the three datasets are provided in Appendix A.

Results

Table 3 and Table 4 show the empirical coverage and interval width of SPCI-T and the baselines on the three real datasets with $w=50$ and $w=100$, respectively. SPCI-T yields the narrowest intervals on all three datasets for both window sizes while keeping coverage at or above the 0.9 target. We also observe that the intervals of SPCI-T become narrower with the longer window, which suggests that SPCI-T utilizes the long history of residuals by learning dependencies. Table 5 shows multi-step prediction results of SPCI-T on the electricity and solar datasets. We exclude the results on the wind dataset since the coverage deteriorated significantly with $s>2$. As in the simulation, SPCI-T loses coverage as $s$ increases.

We add time features to each $X_{t}$ in the solar dataset to see how these additional features influence the performance of SPCI-T. Table 6 presents the empirical coverage and interval width of SPCI-T and the baselines on the solar dataset with additional features. SPCI-T again yields the narrowest intervals, and its interval width drops substantially compared to the experiment without the additional features (from 50.70 to 33.51 with $w=50$ and from 45.35 to 28.14 with $w=100$), while the widths of the baselines remain largely unchanged. These results empirically demonstrate the advantage of SPCI-T in utilizing features together with the residuals for conditional quantile estimation.

Table 6: Empirical coverage and interval width of SPCI-T and baselines on the solar dataset with additional features. The target coverage is 0.9, and the past window is set to 50 or 100. Note that NexCP shows identical performance regardless of $w$ since NexCP does not use past windows.

| Method | $w=50$: Coverage | $w=50$: Width | $w=100$: Coverage | $w=100$: Width |
| --- | --- | --- | --- | --- |
| SPCI-T | $0.93_{\pm.013}$ | $\textbf{33.51}_{\pm 1.99}$ | $0.91_{\pm.006}$ | $\textbf{28.14}_{\pm 1.60}$ |
| SPCI | $0.92_{\pm.002}$ | $89.63_{\pm.651}$ | $0.92_{\pm.005}$ | $87.28_{\pm.782}$ |
| EnbPI | $0.88_{\pm.005}$ | $86.94_{\pm.181}$ | $0.89_{\pm.000}$ | $85.73_{\pm.459}$ |
| NexCP | $0.90_{\pm.002}$ | $100.92_{\pm 6.64}$ | $0.90_{\pm.002}$ | $100.92_{\pm 6.64}$ |

5 Conclusion

In this study, we propose SPCI-T, which incorporates a Transformer decoder into the recently developed SPCI framework. SPCI-T uses a Transformer decoder to learn temporal dependencies between the residuals and features to predict the conditional quantile of future residuals, which are then used to estimate the prediction interval. Our simulated and real data experiments empirically demonstrate the superiority of SPCI-T. Future directions include tailoring the model architecture for the time series conformal prediction task and conducting more comprehensive evaluations with state-of-the-art methods.

Accessibility

Software and Data

Datasets used in simulation and real data experiments are available at anonymousurl.

Acknowledgements

Anonymous acknowledgments.

Appendix A Additional Details of Experiments

A.1 Details of Simulated Data

Non-stationary time series

We set $T=2000$, which means we have $\{(X_{t},Y_{t})\}_{t=1}^{2000}$. We first generate each dimension of $X_{t}\in\mathbb{R}^{10}$ from $\text{Unif}(0,e^{0.01\,\text{mod}(t,100)})$ for all $t$. Then, $f(X_{t})$ is obtained as follows:

$$f(X_{t})=g(t)h(X_{t}),$$

where

$$g(t)=\log(t^{\prime})\sin(2\pi t^{\prime}/100),\quad t^{\prime}=\text{mod}(t,100),$$

and

$$h(X_{t})=\left(|\beta^{\top}X_{t}|+(\beta^{\top}X_{t})^{2}+|\beta^{\top}X_{t}|^{3}\right)^{1/4}.$$

We sample $\epsilon_{t}$ from AR(1) as $\epsilon_{t}=\rho\epsilon_{t-1}+e_{t}$, where $\rho=0.6$ and $e_{t}$ is generated i.i.d. from a normal distribution with zero mean and unit variance. $\beta\in\mathbb{R}^{10}$ is a sparse vector with only 20% non-zero elements, where the non-zero elements are generated from $\text{Unif}(0,1)$.
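The generating process above can be sketched with the standard library alone. Several choices are assumptions made only for illustration: the function name `simulate_nonstationary`, the seed handling, starting the AR(1) recursion from zero, and setting $g(t)=0$ when $\text{mod}(t,100)=0$, where $\log(t^{\prime})$ is undefined.

```python
import math
import random
from typing import List, Tuple


def simulate_nonstationary(
    T: int = 2000, d: int = 10, rho: float = 0.6, seed: int = 0
) -> Tuple[List[List[float]], List[float], List[float]]:
    """Return (X, Y, beta) following Y_t = g(t) h(X_t) + eps_t."""
    rng = random.Random(seed)
    # Sparse beta: 20% of the entries are non-zero, drawn from Unif(0, 1).
    beta = [0.0] * d
    for j in rng.sample(range(d), max(1, round(0.2 * d))):
        beta[j] = rng.uniform(0.0, 1.0)

    X, Y = [], []
    eps = 0.0  # assumption: AR(1) noise starts at zero
    for t in range(1, T + 1):
        t_mod = t % 100
        x = [rng.uniform(0.0, math.exp(0.01 * t_mod)) for _ in range(d)]
        # Assumption: g(t) is set to 0 where log(t') is undefined (t' = 0).
        g = math.log(t_mod) * math.sin(2 * math.pi * t_mod / 100) if t_mod else 0.0
        bx = sum(b * xi for b, xi in zip(beta, x))
        h = (abs(bx) + bx ** 2 + abs(bx) ** 3) ** 0.25
        eps = rho * eps + rng.gauss(0.0, 1.0)  # AR(1) with unit-variance shocks
        X.append(x)
        Y.append(g * h + eps)
    return X, Y, beta


X, Y, beta = simulate_nonstationary()
print(len(X), len(X[0]), sum(b != 0.0 for b in beta))
```

The periodic upper bound of the uniform distribution and the periodic factor $g(t)$ are what make the series non-stationary, while the AR(1) noise induces dependence between consecutive residuals.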

Time series with heteroskedastic error

We set $T=2000$, then generate $X_{t}\in\mathbb{R}^{10}$ in the same way as for the non-stationary time series for all $t$. Then, $f(X_{t})$ is obtained as follows:

$$f(X_{t})=\left(|\beta^{\top}X_{t}|+(\beta^{\top}X_{t})^{2}+|\beta^{\top}X_{t}|^{3}\right)^{1/4},$$

where $\beta$ is generated in the same way as in the non-stationary time series. We also sample $\epsilon_{t}$ from AR(1) as in the non-stationary time series, but with the variance of $\epsilon_{t}$ dependent on $X_{t}$ as follows:

$$\text{Var}(\epsilon_{t})=\sigma(X_{t})^{2},\quad\sigma(X_{t})=\mathbf{1}^{\top}X_{t}.$$

A.2 Details of Real Data

Solar data

The solar dataset contains solar radiation information recorded every 30 minutes in 2018 in downtown Atlanta (zhang2021solar), provided by the United States National Solar Radiation Database (sengupta2018national). We used seven covariates: Direct Normal Irradiance (DNI), dew point, surface albedo, wind speed, relative humidity, temperature, and pressure. The outcome variable is Diffuse Horizontal Irradiance (DHI), which reflects radiation levels. For additional time features, we converted the 24-hour period into 24 hourly one-hot encoded features, adding 24 additional features.
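The hourly one-hot encoding can be illustrated as follows. The sketch assumes that observations are indexed consecutively at 30-minute intervals starting at midnight, so that the hour of day can be derived from the step index; the name `hour_one_hot` and the toy covariate values are hypothetical.

```python
from typing import List


def hour_one_hot(step_index: int, steps_per_hour: int = 2) -> List[float]:
    """24-dimensional one-hot encoding of the hour of day for one time step."""
    # Assumption: step 0 is midnight and steps are evenly spaced.
    hour = (step_index // steps_per_hour) % 24
    return [1.0 if h == hour else 0.0 for h in range(24)]


# Toy usage: append the time features to seven made-up covariates.
covariates = [0.0, 12.5, 0.15, 3.2, 60.0, 18.0, 1010.0]
augmented = covariates + hour_one_hot(step_index=3)
print(len(augmented))
```

Appending the encoding to the seven covariates gives a 31-dimensional feature vector per time step.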

Wind data

The wind dataset contains wind speed data (measured in m/s) from wind farms operated by the Midcontinent Independent System Operator (MISO) in the United States (zhu2021multi). This 10-dimensional dataset recorded wind speed every 15 minutes over a one-week period in September 2020.

Electricity data

The electricity dataset records electricity usage and pricing in the states of New South Wales and Victoria in Australia every 30 minutes over a 2.5-year period between 1996 and 1999 (harries1999splice). We used four covariates: nswprice and vicprice, the price of electricity in each of the two states; and nswdemand and vicdemand, the usage demand in each of the two states. The outcome variable is transfer, which is the quantity of electricity transferred between the two states.

A.3 Hyperparameters and Implementation Details

Table 7 shows the hyperparameters chosen for SPCI-T. We conducted a grid search using the training and validation sets to find the optimal hyperparameters. The model dimension, number of heads, and number of layers were chosen at the point where performance plateaued. Since SPCI-T requires a validation set to select the best model during training, we split the datasets into training, validation, and test sets with an 8:1:1 ratio for SPCI-T. For all other baselines, the datasets were split into training and test sets with a 9:1 ratio. For a fair comparison in terms of data usage, we additionally trained SPCI-T on the validation set after it was initially trained on the training set; this additional training ran for 10% of the number of epochs used for the initial training.
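The two splitting schemes can be sketched as index ranges. The sketch assumes that the splits are chronological (earlier observations for training, the most recent ones for testing), which is a natural choice for time series but is an assumption here; the helper name `chronological_split` is hypothetical.

```python
from typing import Sequence, Tuple


def chronological_split(n: int, ratios: Sequence[int]) -> Tuple[range, ...]:
    """Split indices 0..n-1 into consecutive blocks with the given ratios."""
    total = sum(ratios)
    bounds = [0]
    running = 0
    for r in ratios[:-1]:
        running += r
        bounds.append(n * running // total)
    bounds.append(n)  # the last block absorbs any rounding remainder
    return tuple(range(a, b) for a, b in zip(bounds[:-1], bounds[1:]))


# 8:1:1 split for SPCI-T and 9:1 split for the baselines on 2000 samples.
train, val, test = chronological_split(2000, (8, 1, 1))
base_train, base_test = chronological_split(2000, (9, 1))
print(len(train), len(val), len(test), len(base_train), len(base_test))
```

Under this reading the test block is the same for SPCI-T and the baselines, and the additional training on the validation block lets SPCI-T see the same data as the baselines before testing.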

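Table 7: Hyperparameters chosen for SPCI-T in experiments using simulated and real data.

| Hyperparameter | Simulation: Non-stationary | Simulation: Heteroskedastic | Real data: Wind | Real data: Electricity | Real data: Solar | Real data: Solar w/ add. features |
| --- | --- | --- | --- | --- | --- | --- |
| batch size | 4 | 4 | 4 | 4 | 4 | 4 |
| learning rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0005 | 0.0005 |
| model dimension | 16 | 16 | 32 | 16 | 16 | 32 |
| number of heads | 4 | 4 | 4 | 4 | 4 | 4 |
| number of layers | 4 | 4 | 4 | 4 | 4 | 4 |
| dropout | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.2 |
| additional training | N | N | Y | Y | Y | Y |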