DEED: Dynamic Early Exit on Decoder
for Accelerating Encoder-Decoder Transformer Models
Abstract
Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder takes up most of the latency because of auto-regressive decoding. To accelerate inference, we propose Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including a shared generation head and adaptation modules, to maintain accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence at the current layer at each individual decoding step. Considering that different numbers of decoder layers may be used at different decoding steps, we compute deeper-layer decoder features of previous decoding steps just-in-time, which ensures that the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks. We show that our approach reduces overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
1 Introduction

Vision-Language (VL) tasks, e.g., Visual Question Answering (VQA) (Goyal et al. 2017; Biten et al. 2019; Mathew, Karatzas, and Jawahar 2021; Mishra et al. 2019; Singh et al. 2019) and referring expression comprehension (Mao et al. 2016; Yu et al. 2016), have drawn increasing attention in recent years. These tasks involve reasoning about images and text at the same time. Among the many successful models (Appalaraju et al. 2021; Chen et al. 2020; Wang et al. 2022; Zhang et al. 2021; Biten et al. 2022; Lu et al. 2022; Chen et al. 2022; Zhou et al. 2020a) tackling these tasks, encoder-decoder transformer models (Wang et al. 2022; Biten et al. 2022; Chen et al. 2022) usually show the best accuracy thanks to the strong generative ability of the decoder.
Nevertheless, encoder-decoder models rely on auto-regressive decoding to bring their ability into full play at inference. With auto-regressive decoding, each output token is generated conditioned on the previous tokens. Therefore, the model has to generate tokens one after another, repeating the feed-forward computation of every decoder layer at each step. This mechanism leads to high inference latency in the decoder and makes the decoder take up most of the total inference latency, as shown in Figure 1. Interestingly, even when using only one decoder layer, an encoder-decoder model can still achieve decent prediction accuracy (see Figure 1), which means that samples answered correctly by one decoder layer do not need the excessive computation of the deeper decoder layers.
Inspired by these facts, we propose an approach to dynamically allocate an adequate amount of computation at each decoding step in order to speed up inference without sacrificing accuracy. Specifically, we build Dynamic Early Exit on Decoder (DEED), a multi-exit model with an early exit strategy that lets the model decide whether or not to exit at a specific decoder layer at each decoding step dynamically. Following existing work (Xin et al. 2020; Liu et al. 2021, 2020; Zhang et al. 2022; Geng et al. 2021; Xin et al. 2021; Zhou et al. 2020b), we employ confidence-based dynamic early exit, where the decoder may decide to exit when it is confident about its prediction. Unlike encoder acceleration, dynamic early exit for an auto-regressive decoder is more challenging. The challenge is two-fold:
• Multi-exit model, i.e., a model that can exit / make a prediction at each layer. To get accurate predictions out of dynamic early exit, we must build and train a strong multi-exit encoder-decoder model, where each of the decoder layers has strong generative ability.
• Semantic misalignment at inference. Tokens can be generated at different decoder layers at different decoding steps. But auto-regressive decoding requires the layer-$l$ features from all the previous steps if the current step is inferring at layer $l$, and these features will not be available if the previous decoding steps exited at shallower layers. This semantic misalignment between different layers imposes difficulties when applying a naive early exit strategy, leading to degraded accuracy.
Previous approaches address the first challenge by using different prediction heads after each transformer layer (Schwartz et al. 2020; Xin et al. 2020; Geng et al. 2021; Xin et al. 2021). In contrast, we build our multi-exit model by sharing the generation head among different decoder layers and training with deep supervision. The generation head generates the output sequence prediction, e.g., the answer text for visual question answering (Biten et al. 2022; Chen et al. 2022; Alayrac et al. 2022; Lu et al. 2022) or box coordinates for referring expression comprehension (Wang et al. 2022; Lu et al. 2022). In addition, we insert adaptation modules between the decoder layers and the generation head. This design helps to strengthen the generative ability of shallow decoder layers by sharing the common generation knowledge among different decoder layers. Moreover, to maintain the generative ability when exiting at the final layer, we propose a loss function that emphasizes the learning of the final layer. These simple yet effective techniques improve the accuracy of shallow decoder layers without sacrificing the accuracy at the final decoder layer.
To address the second challenge, we propose a novel algorithm that dynamically computes the required deeper-layer features of previous decoding steps just-in-time. This algorithm effectively resolves the semantic misalignment among different layers at different generation steps. In contrast, the existing work Depth-Adaptive Transformer (DAT) (Elbayad et al. 2019), which directly uses the shallow-layer features as the deeper-layer features for later decoding steps, fails to mitigate this semantic misalignment and thus substantially undermines the generative ability of the model.
Our contributions are summarized as follows:
• We propose DEED, a multi-exit model with step-level dynamic early exit on decoder, to speed up inference without sacrificing accuracy for encoder-decoder transformer models.
• We apply our approach to two state-of-the-art encoder-decoder transformer models and evaluate on various VL datasets. Our approach reduces overall inference latency by 30%-60% with comparable or even higher accuracy compared to baseline models and other dynamic early exit approaches.
• Our approach provides a trade-off between accuracy and latency by using a variable confidence threshold.
2 Related Work
Encoder-Decoder Models for Vision-Language Tasks Encoder-decoder transformer models have recently pushed the state of the art on Vision-Language (VL) tasks (Alayrac et al. 2022; Wang et al. 2022; Biten et al. 2022; Lu et al. 2022; Chen et al. 2022) because of the strong representation ability of the encoder and the generative ability of the decoder. For example, Flamingo (Alayrac et al. 2022) uses a vision encoder to encode input images and a text decoder to generate text predictions for various VL tasks. LaTr (Biten et al. 2022) utilizes the sequence generation ability of the decoder and layout information in multi-modality learning, achieving state-of-the-art accuracy on text-based VQA tasks. OFA (Wang et al. 2022) proposes a unified sequence-to-sequence learning framework to incorporate various VL tasks into the encoder-decoder scheme. Our work focuses on accelerating decoder inference for this type of encoder-decoder transformer model.
Dynamic Early Exit Using Dynamic Early Exit (DEE) is a popular strategy to reduce the inference latency of transformer models (Xin et al. 2020; Liao et al. 2021; Liu et al. 2021, 2020; Zhang et al. 2022; Geng et al. 2021; Xin et al. 2021; Zhou et al. 2020b; Li et al. 2021). For example, DeeBERT (Xin et al. 2020) and RomeBERT (Geng et al. 2021) apply DEE to BERT (Kenton and Toutanova 2019) based on classification confidence scores from different encoder layers. BERxiT (Xin et al. 2021) learns a policy for dynamic early exit. TOKEE (Li et al. 2021) introduces a token-level early exit approach for sequence labelling. However, these encoder-focused approaches cannot be applied to transformer decoders directly, due to the challenges imposed by the auto-regressive mechanism in decoder models.
DAT (Elbayad et al. 2019) is one approach tackling decoder early exit. It introduces a halt-and-copy approach, which halts the computation at a layer if the prediction is confident, and copies the features from shallow decoder layers to deeper layers in later decoding steps when needed. CALM (Schuster et al. 2022) follows the same halt-and-copy approach for decoder early exit. However, this approach suffers from severe semantic misalignment because the semantic information from different decoder layers is not compatible. Thus, later decoding steps at deeper layers cannot obtain meaningful features from previous steps, leading to significant accuracy drops. In contrast, our approach dynamically computes the deeper-layer features of earlier steps just-in-time to resolve the semantic misalignment, achieving higher accuracy than DAT.
Multi-exit Models The most straightforward way of building multi-exit models is adding deep supervision to each layer (Lee et al. 2015; Teerapittayanon, McDanel, and Kung 2016; Schwartz et al. 2020). Nonetheless, it often degrades the accuracy of the final prediction layer. To preserve the final-layer accuracy, DeeBERT (Xin et al. 2020) proposes a two-stage training strategy, in which the final prediction layer and the backbone are trained first, and the other prediction layers are trained afterwards with the rest of the model frozen. However, this two-stage training strategy leads to reduced accuracy of shallow layers. RomeBERT (Geng et al. 2021) increases the accuracy of shallow layers using self-distillation and gradient regularization. BERxiT (Xin et al. 2021) uses an alternating training scheme to improve the accuracy of shallow layers: it alternates between two training objectives, the loss of the final layer only and the loss of all layers. Unlike previous work, we build the multi-exit model by sharing the prediction head among all layers and inserting adaptation modules to align the feature spaces. Our approach shows the best trade-off between final-layer accuracy and shallow-layer accuracy.
Other Directions for Latency Reduction Apart from dynamic early exit, there are attempts in other directions to reduce latency for transformers. For example, knowledge distillation (Hinton et al. 2015; Jiao et al. 2020; Lin et al. 2022; Sanh et al. 2019) is applied to reduce the model size and latency by distilling information from a large teacher model to a small student model. Model pruning (Gordon, Duh, and Andrews 2020; Michel, Levy, and Neubig 2019) reduces model size by removing redundant parameters. Non-autoregressive generation (Gu et al. 2018; Qian et al. 2021) avoids the time-consuming step-by-step generation by decoding the predictions in parallel. These directions are orthogonal to dynamic early exit, hence they are not our focus.
3 Approach
We propose DEED, a dynamic early exit on decoder approach to accelerate encoder-decoder transformer models for VL tasks. Specifically, we leverage confidence-based step-level dynamic early exit to decide which decoder layer to exit from based on how confident the model is at each decoding step. At training time, we train our multi-exit model with deep supervision (Lee et al. 2015), where the output features of each decoder layer are fed to a shared generation head and supervised using the ground truth. At inference time, we apply dynamic early exit on the auto-regressive decoder. At each decoding step, the model decides how many decoder layers to use based on its confidence about the output token, hence a different number of layers may be used at different decoding steps. In the following sections, we first introduce the auto-regressive decoding process and the challenge of semantic misalignment in dynamic early exit on decoder in Section 3.1. Then we describe our multi-exit model architecture and training strategy in Section 3.2. Finally, we show how we resolve the semantic misalignment problem with just-in-time computation of decoder features in Section 3.3.
3.1 Background
Auto-Regressive Decoding
At inference, the decoder typically generates the prediction auto-regressively, i.e., it generates tokens step by step and the token generated at each step is conditioned on the previously generated tokens. In principle, all the previous tokens are supposed to be input to the decoder to generate the current token, which would cause redundant computation for the previous tokens since their features have already been computed at previous decoding steps. In common practice, to reduce this redundant computation, the key-value features in the multi-head self-attention layers are saved and provided to later steps. This practice decreases computational complexity and reduces inference latency by avoiding the re-computation of key-value features of earlier decoding steps at later steps.
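To make the key-value caching concrete, below is a minimal sketch of a greedy auto-regressive loop with a cache. The decoder interface (a callable returning logits and an updated cache) is an assumption for illustration, not the exact API of any specific library.

```python
import torch

def greedy_decode_with_kv_cache(decoder, encoder_states, bos_id, eos_id, max_len=20):
    """Minimal greedy auto-regressive loop with a key-value cache.

    `decoder` is assumed to accept the cached key-value features from earlier
    steps and to return (logits, updated_cache), so each step only runs the
    feed-forward for the newest token instead of the whole prefix.
    """
    tokens = [bos_id]
    cache = None  # nothing cached before the first step
    for _ in range(max_len):
        last_token = torch.tensor([[tokens[-1]]])
        logits, cache = decoder(
            input_ids=last_token,
            encoder_hidden_states=encoder_states,
            past_key_values=cache,  # reuse key-value features from earlier steps
        )
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```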
Semantic Misalignment In step-level dynamic early exit, each decoding step can use a different number of decoder layers. As a result, the past key-value features may not always be available for every layer. This misalignment makes it difficult to implement step-level dynamic early exit, as the current step cannot retrieve the cached key-value features from previous steps when it uses deeper layers. One option is to copy shallower-layer key-value features to deeper layers (Elbayad et al. 2019). However, the deeper-layer features encode higher-level semantics compared to shallower-layer features, and mixing them across decoding steps causes semantic misalignment and undermines the generative ability of the model. An easy workaround is to constrain the model to always exit at the same decoder layer, but this would ignore our observation that some tokens are harder to generate than others. In our experiments, we show that this constrained approach is not desirable in terms of accuracy and latency. While we have to stick to step-level dynamic early exit and resolve the semantic misalignment, pre-computing the deeper-layer key-value features is not efficient because we do not know how many layers the following steps will use. To address this issue, we perform step-level dynamic early exit with just-in-time computation, see Section 3.3.

3.2 Multi-exit Model
To perform step-level dynamic early exit, it is crucial to have a multi-exit model to ensure each decoder layer is capable of generating plausible predictions. So we introduce our multi-exit model here before moving on to how we do step-level dynamic early exit.
Model Architecture In our multi-exit encoder-decoder transformer model, we have a generation head that maps decoder features into tokens. In contrast to existing work (Geng et al. 2021; Xin et al. 2020, 2021), we share the generation head across decoder layers so that the common generation knowledge is shared among them, which strengthens the generative ability of shallow decoder layers. In addition, we insert separate adaptation modules between the shallow decoder layers and the generation head to adapt the features of shallow decoder layers to the semantic space of the features of the final decoder layer (see Figure 2). Specifically, each adaptation module is composed of a linear layer followed by layer normalization.
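Below is a minimal sketch of this multi-exit head design: a single generation head shared by all exits, plus a per-layer adaptation module (linear layer + layer normalization) for the shallow layers. Class and argument names are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class MultiExitHead(nn.Module):
    """One generation head shared by all decoder layers; shallow layers first
    pass through a small adaptation module that maps their features into the
    semantic space of the final decoder layer."""

    def __init__(self, num_layers: int, hidden_size: int, vocab_size: int):
        super().__init__()
        # one adaptation module per non-final decoder layer
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.LayerNorm(hidden_size))
            for _ in range(num_layers - 1)
        ])
        # single generation head shared across all exits
        self.generation_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, layer_idx: int) -> torch.Tensor:
        if layer_idx < len(self.adapters):  # shallow layer: adapt features first
            hidden_states = self.adapters[layer_idx](hidden_states)
        return self.generation_head(hidden_states)  # logits over the vocabulary
```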
Model Training To train the multi-exit model, the most straightforward way is to add deep supervision (Lee et al. 2015) to the output of each decoder layer as follows:
$$\mathcal{L}_{avg} = \frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_i \qquad (1)$$

where $N$, $\mathcal{L}_i$, and $\mathcal{L}_{avg}$ correspond to the total number of decoder layers, the loss for the $i$-th decoder layer, and the average loss across all decoder layers, respectively. However, this approach does not optimize the model for the final decoder layer specifically. As a result, the model suffers from degraded accuracy at the final decoder layer, which caps the accuracy of our approach. To address this issue, we emphasize the loss of the final layer so as to maintain high accuracy for the final decoder layer. To this end, we add the final decoder layer loss $\mathcal{L}_N$ to the training objective as follows:

$$\mathcal{L} = \mathcal{L}_{avg} + \mathcal{L}_N \qquad (2)$$
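A short sketch of how this objective could be computed, assuming the model exposes one logits tensor per decoder-layer exit (the function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(per_layer_logits, targets, pad_id=-100):
    """Average the cross-entropy loss over all decoder-layer exits (Eq. 1),
    then add the final-layer loss once more to emphasize it (Eq. 2).
    `per_layer_logits` is a list of [batch, seq, vocab] tensors, one per layer."""
    layer_losses = [
        F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=pad_id)
        for logits in per_layer_logits
    ]
    avg_loss = torch.stack(layer_losses).mean()  # Eq. (1): deep supervision
    return avg_loss + layer_losses[-1]           # Eq. (2): emphasize the final layer
```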
3.3 Step-Level Dynamic Early Exit with Just-in-Time Computation
We perform step-level dynamic early exit at inference on top of the multi-exit model. To avoid semantic misalignment and improve efficiency, we design an algorithm that computes the past key-value features just-in-time. For step $t$ and decoder layer $l$, we denote the decoder layer as $D_l$, the past key-value features as $KV^l_t$, the decoded token output as $y^l_t$, the corresponding confidence score as $c^l_t$, and the confidence score threshold as $\alpha$. We use colon-separated numbers to denote intervals, e.g., $1:t$ denotes the decoding steps from $1$ to $t$ (inclusive). Apart from $KV^l_t$, we also save the output hidden states $H^l_t$ of $D_l$ at step $t$ if they are computed.

| Method | DocVQA ANLS | DocVQA Dec. Latency (ms) | DocVQA Tot. Latency (ms) | OCR-VQA Accuracy | OCR-VQA Dec. Latency (ms) | OCR-VQA Tot. Latency (ms) |
|---|---|---|---|---|---|---|
| Original-b | 81.5 | 104.3 | 124.6 | 68.4 | 109.7 | 125.6 |
| DAT-b (Elbayad et al. 2019) | 71.4 | 90.9 | 111.6 | 60.3 | 105.3 | 121.4 |
| SLEX-b | 81.4 | 90.1 | 111.4 | 68.3 | 109.0 | 124.7 |
| FTEX-b | 81.2 | 47.1 | 67.4 | 67.1 | 53.8 | 70.3 |
| DEED-b | 81.9 (+0.4) | 46.1 (-55.8%) | 66.5 (-48.6%) | 68.1 (-0.3) | 52.4 (-52.2%) | 68.5 (-45.5%) |
| Original-L | 83.5 | 181.5 | 216.3 | 70.1 | 202.5 | 229.6 |
| DAT-L (Elbayad et al. 2019) | 74.3 | 134.0 | 169.9 | 63.0 | 166.0 | 194.2 |
| SLEX-L | 83.7 | 154.3 | 190.6 | 69.6 | 111.5 | 139.8 |
| FTEX-L | 83.1 | 58.6 | 91.5 | 68.6 | 79.6 | 108.1 |
| DEED-L | 83.8 (+0.3) | 49.2 (-72.9%) | 82.8 (-61.7%) | 69.7 (-0.4) | 79.2 (-60.9%) | 107.5 (-53.2%) |
As shown in Algorithm 1, for each decoding step, we go through the decoder one layer per iteration. At decoding step $t$, we first prepare the saved past key-value features $KV^l_{1:s}$ and the hidden states $H^{l-1}_{s+1:t}$ (for the decoding steps where the layer-$l$ key-value features are absent), where $s$ ($s \leq t-1$) corresponds to the sequence length of the saved past key-value features for $D_l$, see line 2 in Algorithm 1. Next we feed $KV^l_{1:s}$ and $H^{l-1}_{s+1:t}$ into $D_l$ to compute the key-value features $KV^l_{s+1:t}$, the hidden states $H^l_{s+1:t}$, the decoded token output $y^l_t$, and the corresponding confidence score $c^l_t$, see line 3 in Algorithm 1. We save these newly computed $KV^l_{s+1:t}$ and $H^l_{s+1:t}$ for future use, see line 4 in Algorithm 1. Taking the decoding process in Figure 3 as an example, at decoding step 3 when the model is about to enter layer 2, the past key-value features $KV^2_1$ are available but $KV^2_{2:3}$ are absent, so $KV^2_1$ along with the saved hidden states $H^1_{2:3}$ will be fed into decoder layer 2. We repeat the same process for every decoder layer until the predicted confidence score $c^l_t$ is larger than the threshold $\alpha$, where $c^l_t$ is computed from the classification scores after softmax, see lines 5-6 in Algorithm 1. Note that although the deeper-layer features are computed for the previous decoding steps, the previous token outputs will not be updated with those features, because each token is supposed to depend on the past and any change in the previous tokens would break this dependency.
One may notice that our approach assumes the hidden states $H^{l-1}_{s+1:t}$ are available at decoding step $t$. This is assured by our per-layer traversal: the hidden states are always computed and saved at the previous decoder layer.
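The following is a minimal sketch of one decoding step of this just-in-time procedure, roughly mirroring Algorithm 1. The caches, the decoder-layer call signature, and the variable names are assumptions made for illustration; cross-attention over the encoder states is omitted for brevity.

```python
import torch

@torch.no_grad()
def deed_decode_step(layers, adapters, gen_head, token_emb, kv_cache, hid_cache, t, alpha):
    """One decoding step of step-level dynamic early exit with just-in-time
    computation. `kv_cache[l]` holds per-step self-attention key-values already
    computed at layer l; `hid_cache[l]` holds per-step hidden states that serve
    as *input* to layer l (hid_cache[0] stores the token embeddings), so
    kv_cache and hid_cache have len(layers) and len(layers) + 1 entries."""
    hid_cache[0].append(token_emb)                 # layer-0 input for step t
    for l, layer in enumerate(layers):
        s = len(kv_cache[l])                       # steps whose layer-l KV are cached
        # just-in-time: inputs for steps s..t whose layer-l features are missing
        x = torch.cat(hid_cache[l][s:t + 1], dim=1)
        h, new_kv = layer(x, past_key_values=kv_cache[l])  # assumed layer interface
        kv_cache[l] += list(new_kv)                # layer-l KV now cover all steps up to t
        hid_cache[l + 1] += list(h.split(1, dim=1))        # inputs for layer l+1
        feat = adapters[l](h[:, -1:]) if l < len(adapters) else h[:, -1:]
        probs = gen_head(feat).softmax(dim=-1)
        conf, token = probs[0, -1].max(dim=-1)
        if conf.item() > alpha:                    # confident enough: exit early
            break
    # previously generated tokens are never revised; only the caches are updated
    return int(token)
```

If no layer is confident enough, the loop simply falls through to the final layer, whose token is used, matching the behavior described above.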
4 Experiments
| Method | ST-VQA ANLS | ST-VQA Dec. Latency (ms) | ST-VQA Tot. Latency (ms) | TextVQA Accuracy | TextVQA Dec. Latency (ms) | TextVQA Tot. Latency (ms) |
|---|---|---|---|---|---|---|
| Original-b | 69.7 | 71.9 | 88.6 | 61.1 | 71.7 | 89.0 |
| DAT-b (Elbayad et al. 2019) | 62.8 | 57.3 | 74.4 | 53.3 | 64.4 | 83.3 |
| SLEX-b | 69.8 | 60.4 | 77.3 | 59.6 | 51.9 | 68.9 |
| FTEX-b | 69.5 | 41.7 | 59.0 | 60.0 | 45.5 | 62.4 |
| DEED-b | 69.9 (+0.2) | 33.5 (-53.4%) | 50.1 (-43.5%) | 61.0 (-0.1) | 43.5 (-39.3%) | 61.4 (-31.0%) |
| Original-L | 70.3 | 136.5 | 164.2 | 63.1 | 136.5 | 165.5 |
| DAT-L (Elbayad et al. 2019) | 62.8 | 85.5 | 112.4 | 54.6 | 111.1 | 140.5 |
| SLEX-L | 70.2 | 85.0 | 114.5 | 61.3 | 93.5 | 122.9 |
| FTEX-L | 70.4 | 65.9 | 96.4 | 61.8 | 79.3 | 108.9 |
| DEED-L | 71.5 (+1.2) | 50.0 (-63.4%) | 78.5 (-52.2%) | 63.6 (+0.5) | 72.1 (-47.2%) | 102.7 (-37.9%) |
We evaluate DEED on two state-of-the-art encoder-decoder models: LaTr++ (Biten et al. 2022) and OFA (Wang et al. 2022) with various vision-language tasks. We do auto-regressive prediction for all the tasks.
4.1 DEED on LaTr++
LaTr (Biten et al. 2022) is the state-of-the-art approach for text-based visual question answering (text-VQA). LaTr uses multi-modal encoder-decoder transformer models with OCR text, layout, and visual features as inputs. We improve LaTr by using a better vision backbone and adding better unsupervised pre-training tasks, see Section A.2 for more details. We refer to the improved LaTr as LaTr++ here. Following LaTr, we focus on the text-VQA task.
| Method | VQA test-dev | VQA test-std | Dec. Lat. (ms) | Tot. Lat. (ms) | RefCOCO val | RefCOCO testA | RefCOCO testB | Dec. Lat. (ms) | Tot. Lat. (ms) |
|---|---|---|---|---|---|---|---|---|---|
| OFA | 79.3 | 79.4 | 753.5 | 811.3 | 90.6 | 92.5 | 85.9 | 132.8 | 187.1 |
| DEED | 79.0 | 79.1 | 480.7 | 538.5 | 90.2 | 92.4 | 85.1 | 79.5 | 133.8 |

| Method | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | Dec. Lat. (ms) | Tot. Lat. (ms) | RefCOCOg val-u | RefCOCOg test-u | Dec. Lat. (ms) | Tot. Lat. (ms) |
|---|---|---|---|---|---|---|---|---|---|
| OFA | 85.7 | 89.9 | 78.6 | 142.7 | 197.9 | 87.2 | 87.6 | 132.5 | 187.6 |
| DEED | 85.3 | 89.6 | 77.9 | 61.6 | 117.9 | 87.0 | 87.4 | 83.0 | 138.1 |
Settings
We evaluate on four text-VQA datasets: DocVQA (Mathew, Karatzas, and Jawahar 2021), OCR-VQA (Mishra et al. 2019), ST-VQA (Biten et al. 2019), and TextVQA (Singh et al. 2019), using accuracy and latency as the metrics. For accuracy, we follow the standard protocol of each dataset, i.e., Average Normalized Levenshtein Similarity (ANLS) (Biten et al. 2019; Mathew, Karatzas, and Jawahar 2021) for DocVQA and ST-VQA, and exact text match between ground truth and prediction for OCR-VQA and TextVQA. For latency, we report both the total inference latency and the decoder-only latency, as our approach only affects decoder inference. Latency is measured as wall-clock time on the same machine with one Nvidia A100 GPU (40GB memory). All approaches are implemented in PyTorch (Paszke et al. 2017) with Huggingface (Wolf et al. 2019). To measure per-sample latency as accurately as possible, we use batch size 1 in inference to avoid unnecessary padding. See Section A.3 for more details.
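As a reference, the kind of per-sample wall-clock measurement described above can be sketched as follows; the helper name and arguments are our own, and the GPU synchronization is what makes the timing meaningful on CUDA devices.

```python
import time
import torch

def measure_latency_ms(run_inference, sample, warmup=5, runs=20):
    """Rough per-sample wall-clock latency in milliseconds (batch size 1).
    `run_inference` is any callable performing one full forward/decoding pass;
    we synchronize the GPU around the timed region so asynchronous CUDA
    kernels are fully accounted for."""
    for _ in range(warmup):          # warm up kernels and caches
        run_inference(sample)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        run_inference(sample)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```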
Baselines We compare DEED to the original model, the state-of-the-art approach DAT (Elbayad et al. 2019), and two strong baselines that we build ourselves, SLEX and FTEX:
• Original: the vanilla LaTr++ model, on which no early exit or deep supervision is applied.
• DAT (Elbayad et al. 2019): DAT is the state-of-the-art decoder speed-up algorithm. At the step when the model exits at a deeper layer, it simply copies the features of the shallow layer from previous steps to all deeper layers.
• Sequence-level early exit (SLEX): the decoder always exits at layer $l^*$ at every decoding step, where $l^*$ is chosen by the accumulated confidence score of the entire sequence. More precisely, for each decoder layer, SLEX needs to infer all decoding steps to get the accumulated confidence score, which makes this baseline impractical.
• First-token early exit (FTEX): the decoder always exits at layer $l^*$ at every decoding step. Unlike SLEX, $l^*$ is chosen based on the confidence score of the first token, which makes FTEX more practical than SLEX because FTEX only needs to infer the first decoding step to make the decision.
In our experiments, for fair comparison, SLEX, FTEX, and DAT use the same multi-exit model trained for DEED, which improves the accuracy of the shallow layers.
Implementation Details We evaluate DEED on both the base (-b) and large (-L) variations of LaTr++. The base version has 12 encoder and 12 decoder layers, and the large version has 24 encoder and 24 decoder layers. We follow LaTr (Biten et al. 2022) to do pre-training first and fine-tuning later. We add the deep supervision loss in Eq. (2) in both pre-training and fine-tuning. The confidence score threshold $\alpha$ is selected using cross-validation, specifically, 0.99 on DocVQA and 0.95 for the other three datasets. See Section A.4 for more details.
Results
Table 1 and Table 2 show the comparisons of accuracy and latency among DEED and baseline approaches.



Our approach shows excellent performance compared to the original model. It consistently reduces the inference latency for both the base and large variations, while maintaining the evaluation accuracy on all benchmark datasets. The decoder latency reduction is between 40% and 73% across all model and dataset combinations. Specifically, on DocVQA, DEED reduces the decoder latency of the large variation from 181.5ms to 49.2ms, a 72.9% reduction, while its ANLS is 0.3 higher than the original LaTr++. Compared to the other baseline approaches, DEED always outperforms them with clear margins. DAT (Elbayad et al. 2019) reduces the decoder latency slightly, but it suffers from major accuracy degradation due to the semantic misalignment introduced by the copy mechanism. SLEX can maintain high accuracy as it makes the decision based on the entire sequence, but its latency improvement is minor compared to DEED. FTEX can reach significant inference acceleration as it can decide to use a shallow layer after the first decoding step. However, it often sacrifices more accuracy because the layer with the maximum first-token confidence might not have the best generation for the entire sequence. In contrast, our approach makes exit decisions at each layer and each step, and recomputes the deeper features when necessary, which helps it achieve the best accuracy and latency compared to all other approaches.
Notice that the latency reduction is usually larger on the large model, because the large model has more decoder layers while early exit still happens at very shallow layers instead of going deeper. In addition, our approach can improve the accuracy of the vanilla LaTr++ in most cases, because our multi-exit model with deep-supervision pre-training significantly improves the accuracy of shallow layers (see Section 4.3), and DEED often chooses the layer with the best generative ability based on the confidence scores. In fact, shallow layers can have better predictions than deeper layers on certain examples. If we could choose which layer makes the prediction according to the ground truth, the base model would obtain an ANLS of 85.0 on DocVQA.
4.2 DEED on OFA
OFA (Wang et al. 2022) is a sequence-to-sequence framework that unifies multiple modalities. It can address multiple VL tasks with a single paradigm. The encoder and decoder are chosen based on the task.
Experimental Setup Following (Wang et al. 2022), we evaluate DEED with OFA on various multi-modal downstream tasks, specifically, VQAv2 (Goyal et al. 2017) for VQA, and RefCOCO/RefCOCO+/RefCOCOg (Yu et al. 2016; Mao et al. 2016) for referring expression comprehension. We also report the accuracy, the total inference latency, and the decoder-only latency and compare DEED to the baseline model, as described in Section 4.1. For each dataset, the latency is averaged on samples from all splits. Here we only compare to the original OFA model. We use the original pre-trained OFA model and the same fine-tuning procedure as in (Wang et al. 2022) to reproduce the OFA results and train DEED. We do not pre-train the model with deep supervision due to its overwhelming computational costs. We use the large size OFA model and the threshold is chosen via cross-validation, i.e., 0.96 for VQA and 0.1 for RefCOCO/RefCOCO+/RefCOCOg.
Results The accuracy and latency of the original OFA and DEED are shown in Table 3. The OFA results are reproduced using the official code and are very close to the reported numbers. Again, DEED consistently reduces the decoder inference latency with marginal accuracy drops. Specifically, it achieves an average of 36.2% and 44% decoder latency reduction on the VQA task and the referring expression comprehension task, respectively. In addition, even without deep-supervision pre-training, DEED obtains accuracy comparable to the original OFA. The accuracy of DEED would likely be further boosted if we did deep-supervision pre-training for OFA as well. These results demonstrate that our approach generalizes to different encoder-decoder transformer models and various VL tasks.

4.3 Ablation Study
We study the contribution of each component in DEED. All ablation studies are conducted on DocVQA with LaTr++ base variation.
Model Architecture We inspect the effect of the shared generation head (SH) and the adaptation module (AM) for the multi-exit model. We compare the models using SH only, AM only, and both (SH + AM), to the baseline model trained with unshared generation heads and without the adaptation module. We use the loss $\mathcal{L}_{avg}$ in Eq. (1) for training and do not do deep-supervision pre-training. Figure 4(a) shows the results of the different approaches. By using both the shared generation head and the adaptation module, the model achieves consistently better accuracy than the baseline, except for the 9-th layer. Notice that the improvement on the first layer is the greatest (1%), which greatly contributes to the overall latency reduction as more examples can exit at layer 1 without sacrificing accuracy. However, without the adaptation module, the shared generation head has inferior performance due to the misalignment between the generation head and the intermediate features used for generation.
Training Objective In Figure 4(b), we visualize the ANLS of models trained with the vanilla deep supervision in Eq. (1), alternating training (AT) (Xin et al. 2021), and the additional final-layer loss (FL) in Eq. (2). We can see that both AT and FL improve the accuracy of the deep layers ($\geq$ 8), which helps DEED achieve the same or even better accuracy compared to the original model. FL gives better accuracy than AT for most layers, which confirms the effectiveness of our proposed training objective.

Pre-training In our experiments, we found that pre-training the model with deep supervision can significantly improve the accuracy of the shallow layers, as shown in Figure 4(c). The magenta curve is the model pre-trained with the deep supervision while the brown curve is the one without. Pre-training with deep supervision increases the accuracy of the first layer by 3%. It also consistently increases the ANLS between layer 2 and layer 8. We argue that deep supervision during the pre-training stage helps the model learn strong generative ability in the shallow layers.
Threshold DEED can realize different trade-offs between accuracy and latency by tuning the confidence score threshold $\alpha$, to fit different use cases without retraining the model. In Figure 5, we visualize the decoder latency and ANLS of DEED w.r.t. different thresholds ([0.5, 0.99]). We compare DEED to the original LaTr++ trained with 2, 4, 8, and 12 decoder layers. When using the threshold 0.99, our approach reaches the highest ANLS score of 81.9, which exceeds the vanilla 12-layer LaTr++, while achieving a 2.26X decoder speed-up. At the other end of the spectrum, DEED can reduce the decoder latency to 25.9ms (a 4.03X speed-up vs. 12-layer LaTr++) with an ANLS score of 81.2. In contrast, to reduce the latency to 26ms, the original model can only use 2 decoder layers, resulting in a significant 3.2 ANLS drop compared to DEED.
The Distribution of Tokens Exiting at Each Layer We visualize the distribution of tokens exiting at each layer on the four text-VQA datasets in Figure 6. The majority of the tokens exit at the first layer, followed by the last layer; only a small number of tokens exit at the middle layers. This shows that the model exits at shallow layers for easy predictions, which aligns with our observation that most samples do not need all decoder layers during inference. In addition, as suggested by Figure 4, the second-largest group consists of hard samples that are only predicted correctly by the final decoder layer, or on which even the final decoder layer fails, so the second peak of the histogram appears at the final decoder layer (i.e., decoder layer 12).
5 Conclusions
We propose DEED, a multi-exit model with step-level dynamic early exit on decoder for encoder-decoder transformer model acceleration. DEED leverages confidence-based step-level dynamic early exit to reduce the computation at each decoding step. To improve the accuracy when exiting at shallow layers, we build a multi-exit model leveraging multiple techniques including deep supervision, shared generation head, adaptation modules, and emphasizing the learning of the final decoder layer. We apply our approach to two state-of-the-art encoder-decoder transformer models. Results on various vision-language tasks and datasets show that our approach significantly reduces the inference latency with comparable or even higher accuracy compared to baselines.
References
- Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- Appalaraju et al. (2021) Appalaraju, S.; Jasani, B.; Kota, B. U.; Xie, Y.; and Manmatha, R. 2021. Docformer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 993–1003.
- Appalaraju et al. (2023) Appalaraju, S.; Tang, P.; Dong, Q.; Sankaran, N.; Zhou, Y.; and Manmatha, R. 2023. DocFormerv2: Local Features for Document Understanding. arXiv preprint arXiv:2306.01733.
- Biten et al. (2022) Biten, A. F.; Litman, R.; Xie, Y.; Appalaraju, S.; and Manmatha, R. 2022. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16548–16558.
- Biten et al. (2019) Biten, A. F.; Tito, R.; Mafla, A.; Gomez, L.; Rusinol, M.; Valveny, E.; Jawahar, C.; and Karatzas, D. 2019. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, 4291–4301.
- Borisyuk, Gordo, and Sivakumar (2018) Borisyuk, F.; Gordo, A.; and Sivakumar, V. 2018. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 71–79.
- Chen et al. (2022) Chen, X.; Wang, X.; Changpinyo, S.; Piergiovanni, A.; Padlewski, P.; Salz, D.; Goodman, S.; Grycner, A.; Mustafa, B.; Beyer, L.; et al. 2022. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
- Chen et al. (2020) Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, 104–120. Springer.
- Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
- Elbayad et al. (2019) Elbayad, M.; Gu, J.; Grave, E.; and Auli, M. 2019. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073.
- Geng et al. (2021) Geng, S.; Gao, P.; Fu, Z.; and Zhang, Y. 2021. Romebert: Robust training of multi-exit bert. arXiv preprint arXiv:2101.09755.
- Gordon, Duh, and Andrews (2020) Gordon, M. A.; Duh, K.; and Andrews, N. 2020. Compressing bert: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307.
- Goyal et al. (2017) Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Gu et al. (2018) Gu, J.; Bradbury, J.; Xiong, C.; Li, V. O.; and Socher, R. 2018. Non-Autoregressive Neural Machine Translation. In International Conference on Learning Representations.
- Hinton et al. (2015) Hinton, G.; Vinyals, O.; Dean, J.; et al. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7).
- Jiao et al. (2020) Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; and Liu, Q. 2020. TinyBERT: Distilling BERT for Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, 4163–4174.
- Kenton and Toutanova (2019) Kenton, J. D. M.-W. C.; and Toutanova, L. K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, 4171–4186.
- Lee et al. (2015) Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; and Tu, Z. 2015. Deeply-supervised nets. In Artificial intelligence and statistics, 562–570. PMLR.
- Li et al. (2021) Li, X.; Shao, Y.; Sun, T.; Yan, H.; Qiu, X.; and Huang, X.-J. 2021. Accelerating BERT Inference for Sequence Labeling via Early-Exit. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 189–199.
- Liao et al. (2021) Liao, K.; Zhang, Y.; Ren, X.; Su, Q.; Sun, X.; and He, B. 2021. A global past-future early exit method for accelerating inference of pre-trained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013–2023.
- Lin et al. (2022) Lin, S.; Xie, H.; Wang, B.; Yu, K.; Chang, X.; Liang, X.; and Wang, G. 2022. Knowledge Distillation via the Target-aware Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10915–10924.
- Liu et al. (2020) Liu, W.; Zhou, P.; Wang, Z.; Zhao, Z.; Deng, H.; and Ju, Q. 2020. FastBERT: a Self-distilling BERT with Adaptive Inference Time. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6035–6044.
- Liu et al. (2021) Liu, Y.; Meng, F.; Zhou, J.; Chen, Y.; and Xu, J. 2021. Faster Depth-Adaptive Transformers. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15): 13424–13432.
- Lu et al. (2022) Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; and Kembhavi, A. 2022. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
- Mao et al. (2016) Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A. L.; and Murphy, K. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 11–20.
- Mathew, Karatzas, and Jawahar (2021) Mathew, M.; Karatzas, D.; and Jawahar, C. 2021. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2200–2209.
- Michel, Levy, and Neubig (2019) Michel, P.; Levy, O.; and Neubig, G. 2019. Are sixteen heads really better than one? Advances in neural information processing systems, 32.
- Mishra et al. (2019) Mishra, A.; Shekhar, S.; Singh, A. K.; and Chakraborty, A. 2019. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), 947–952. IEEE.
- Paszke et al. (2017) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NIPS-W.
- Powalski et al. (2021) Powalski, R.; Borchmann, Ł.; Jurkiewicz, D.; Dwojak, T.; Pietruszka, M.; and Pałka, G. 2021. Going full-tilt boogie on document understanding with text-image-layout transformer. In International Conference on Document Analysis and Recognition, 732–747.
- Qian et al. (2021) Qian, L.; Zhou, H.; Bao, Y.; Wang, M.; Qiu, L.; Zhang, W.; Yu, Y.; and Li, L. 2021. Glancing Transformer for Non-Autoregressive Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1993–2003.
- Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P. J.; et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140): 1–67.
- Sanh et al. (2019) Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Schuster et al. (2022) Schuster, T.; Fisch, A.; Gupta, J.; Dehghani, M.; Bahri, D.; Tran, V.; Tay, Y.; and Metzler, D. 2022. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35: 17456–17472.
- Schwartz et al. (2020) Schwartz, R.; Stanovsky, G.; Swayamdipta, S.; Dodge, J.; and Smith, N. A. 2020. The Right Tool for the Job: Matching Model and Instance Complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6640–6651.
- Singh et al. (2019) Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; and Rohrbach, M. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8317–8326.
- Teerapittayanon, McDanel, and Kung (2016) Teerapittayanon, S.; McDanel, B.; and Kung, H.-T. 2016. Branchynet: Fast inference via early exiting from deep neural networks. In 2016 23rd international conference on pattern recognition (ICPR), 2464–2469.
- Wang et al. (2022) Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; and Yang, H. 2022. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, 23318–23340. PMLR.
- Wolf et al. (2019) Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Xin et al. (2020) Xin, J.; Tang, R.; Lee, J.; Yu, Y.; and Lin, J. 2020. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993.
- Xin et al. (2021) Xin, J.; Tang, R.; Yu, Y.; and Lin, J. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th conference of the European chapter of the association for computational linguistics: Main Volume, 91–104.
- Yu et al. (2016) Yu, L.; Poirson, P.; Yang, S.; Berg, A. C.; and Berg, T. L. 2016. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, 69–85. Springer.
- Zhang et al. (2021) Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; and Gao, J. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5579–5588.
- Zhang et al. (2022) Zhang, Z.; Zhu, W.; Zhang, J.; Wang, P.; Jin, R.; and Chung, T.-S. 2022. PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting. In Findings of the Association for Computational Linguistics: NAACL 2022, 327–338. Seattle, United States: Association for Computational Linguistics.
- Zhou et al. (2020a) Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; and Gao, J. 2020a. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 13041–13049.
- Zhou et al. (2020b) Zhou, W.; Xu, C.; Ge, T.; McAuley, J.; Xu, K.; and Wei, F. 2020b. Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33: 18330–18341.
This is the appendix for the main DEED paper. Here we discuss the architecture and results of LaTr++, and the results of our reproduced OFA vs. the original OFA results.
Appendix A LaTr++
LaTr (Biten et al. 2022) obtains state-of-the-art results on the text-based visual question answering (text-VQA) task. LaTr uses multi-modal encoder-decoder transformer models that take OCR text, layout, and visual features as inputs. We improve LaTr by replacing the ViT-based vision backbone (Dosovitskiy et al. 2020) with simple multi-layer perceptrons and adding more unsupervised pre-training tasks, following DocFormerv2 (Appalaraju et al. 2023). We refer to the improved LaTr as LaTr++. See Figure 7 for the architecture of LaTr++ and more details below.
A.1 Architecture
In LaTr++, given an input image, we resize the image to 500x384 and split it into 196 patches of size 32x32. Instead of using ViT (Dosovitskiy et al. 2020), we simply use a linear projection layer to generate a visual token embedding for each patch, yielding 196 visual tokens. We further use one more linear layer to compress the 196 visual tokens to 128 visual tokens. These visual tokens are then concatenated with the word embeddings; from there the architecture is identical to LaTr (Biten et al. 2022) and T5 (Raffel et al. 2020). Arguably, our LaTr++ architecture is much simpler than LaTr (Biten et al. 2022) as we do not have a pre-trained ViT as a dependency, hence our model has fewer parameters at equal model size compared to LaTr (Biten et al. 2022).
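A minimal sketch of this ViT-free visual branch is given below. The module and parameter names, and the choice to compress along the token axis with a linear layer, are our assumptions based on the description above.

```python
import torch
import torch.nn as nn

class SimpleVisualEmbedding(nn.Module):
    """Linear patch projection in place of a ViT: each flattened 32x32 RGB
    patch is projected to the model dimension, then a second linear layer
    compresses the 196 patch tokens to 128 visual tokens."""

    def __init__(self, patch_size=32, num_patches=196, num_tokens=128, d_model=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_size * patch_size * 3, d_model)
        self.compress = nn.Linear(num_patches, num_tokens)  # acts along the token axis

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: [batch, 196, 32*32*3] flattened RGB patches of the resized image
        tokens = self.patch_proj(patches)               # [batch, 196, d_model]
        tokens = self.compress(tokens.transpose(1, 2))  # [batch, d_model, 128]
        return tokens.transpose(1, 2)                   # [batch, 128, d_model]
```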

| Model | ST-VQA (ANLS) | TextVQA (Accuracy) | OCR-VQA (Accuracy) |
|---|---|---|---|
| LaTr (base) | 68.3 | 59.5 | 67.5 |
| LaTr++ (base) | 69.7 | 61.1 | 68.4 |
| LaTr (large) | 70.2 | 61.1 | - |
| LaTr++ (large) | 70.3 | 63.1 | 70.1 |
| Model | VQA test-dev | VQA test-std | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-u | RefCOCOg test-u |
|---|---|---|---|---|---|---|---|---|---|---|
| OFA (original) | 79.4 | 79.5 | 90.1 | 92.9 | 85.3 | 85.8 | 89.9 | 79.2 | 85.9 | 86.6 |
| OFA (reproduced) | 79.3 | 79.4 | 90.6 | 92.5 | 85.9 | 85.7 | 89.9 | 78.6 | 87.2 | 87.6 |
A.2 Pre-training
We use the IDL dataset (https://www.industrydocuments.ucsf.edu/) described in the main paper to pre-train the LaTr++ models. We use the standard T5 denoising pre-training task (Raffel et al. 2020) as in the original LaTr paper (Biten et al. 2022). In addition, to make LaTr++ a more competitive baseline, we add two more unsupervised pre-training tasks at the encoder: a) Line prediction task - to teach the model the relative position semantics between text tokens, we randomly pick two text tokens and ask the model to predict how many lines are between them. There are only three labels: 0, 1, and 2. Any text token pair with more than 2 lines between them is assigned label 2, because distant text tokens are not related and the model does not need the precise number of lines between them. b) Token-to-grid task - to utilize global information, we create a virtual 3x3 grid and ask the network to predict which grid cell each text token falls in. The losses of all three tasks, i.e., standard denoising, line prediction, and token-to-grid, are added to form the final pre-training loss for LaTr++.
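The two auxiliary labels can be constructed roughly as follows; the exact sampling scheme and whether the line distance counts line indices or intervening lines are not specified above, so the helpers below are illustrative assumptions.

```python
import random

def line_prediction_label(token_lines):
    """Pick two OCR tokens and produce the line-prediction label: the number
    of lines between them, clamped to 2 (assumed here to be the absolute
    difference of their line indices). `token_lines[k]` is the line index of
    token k."""
    i, j = random.sample(range(len(token_lines)), 2)
    distance = abs(token_lines[i] - token_lines[j])
    return i, j, min(distance, 2)       # labels are 0, 1, or 2

def token_to_grid_label(x_center, y_center, page_width, page_height):
    """Assign a token to a cell of a virtual 3x3 grid over the page based on
    the center of its bounding box (assumed input)."""
    col = min(int(3 * x_center / page_width), 2)
    row = min(int(3 * y_center / page_height), 2)
    return row * 3 + col                # grid-cell index in [0, 8]
```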
A.3 Settings
We evaluate on four text-VQA datasets: DocVQA (Mathew, Karatzas, and Jawahar 2021), OCR-VQA (Mishra et al. 2019), ST-VQA (Biten et al. 2019), and TextVQA (Singh et al. 2019). DocVQA is a VQA dataset dedicated to document text understanding, and OCR-VQA focuses on question answering on book covers. ST-VQA and TextVQA contain natural images of everyday scenes with textual information and require understanding the text in the image to answer the question. Following (Biten et al. 2022), we use Amazon Textract (https://aws.amazon.com/textract/) for DocVQA, Amazon Text-in-Image (https://docs.aws.amazon.com/rekognition/latest/dg/text-detecting-text-procedure.html) for ST-VQA and TextVQA, and Rosetta (Borisyuk, Gordo, and Sivakumar 2018) for OCR-VQA, to extract text information from images.
A.4 Implementation Details
We pre-train our models on the Industrial Document Library (IDL) dataset (https://www.industrydocuments.ucsf.edu/), using the tasks described in Section A.2. We add the deep-supervision loss of the T5 denoising task on all decoder layers for DEED, because we found it considerably improves the generative ability of shallow layers, as discussed in the main paper. We pre-train the base version with deep supervision on 5M IDL data for 30 epochs. For the large variation, to reduce the computational cost while achieving competitive performance, we first pre-train the model on 64M IDL data for 1.5 epochs without deep supervision, and then pre-train it on 64M IDL data with deep supervision using batch size 18 for 60k steps. The models are then fine-tuned on each dataset following the same settings as in (Powalski et al. 2021; Biten et al. 2022). We follow the convention of fine-tuning on the combination of the ST-VQA and TextVQA training sets when evaluating on these two datasets (Biten et al. 2022). The confidence score threshold $\alpha$ is selected using cross-validation, specifically, 0.99 on DocVQA and 0.95 for the other three datasets.
A.5 Results
Here we compare LaTr++ to LaTr on three text-VQA datasets: ST-VQA (Biten et al. 2019), TextVQA (Singh et al. 2019), and OCR-VQA (Mishra et al. 2019). For ST-VQA and TextVQA, we train LaTr++ on the combination of the ST-VQA and TextVQA training sets, following LaTr (Biten et al. 2022). For OCR-VQA, we train LaTr++ on the OCR-VQA training set only. All results are reported on the validation sets of the three datasets. As shown in Table 4, LaTr++ obtains better results than the state-of-the-art LaTr on the text-VQA task.
A.6 Analyses on the Number of Parameters
| Model | # Parameters (Baseline) | # Parameters (Ours) | # Parameters (Previous) |
|---|---|---|---|
| LaTr++ (-b) | 232M | 239M | 503M |
| LaTr++ (-L) | 750M | 774M | 1507M |
Compared to previous multi-exit models, one advantage of our multi-exit model is its smaller number of parameters. For LaTr++, the base version (-b) has 232M parameters with hidden size $d = 768$, 12 encoder layers, and 12 decoder layers. The large version (-L) has 750M parameters with $d = 1024$, 24 encoder layers, and 24 decoder layers. Each adaptation module consists of a linear layer ($d \times d + d$ parameters) followed by layer normalization ($2d$ parameters). Therefore, each adaptation module only has 0.6M parameters (-b) or 1.05M parameters (-L). The adaptation modules of all layers only increase the full model parameters by about 3%. In contrast, a generation head has 24.7M parameters (-b) or 32.9M parameters (-L) due to the output vocabulary size (32,128). Using unshared generation heads increases the total number of parameters by more than 100%. So our adaptation modules have far fewer parameters than the unshared generation heads in previous approaches. See Table 6 for more details.
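For reference, the adaptation-module parameter counts above can be checked with a couple of lines (hidden sizes taken from the text; the small helper function is ours):

```python
def adapter_params(d):
    """Parameters of one adaptation module: a d-by-d linear layer with bias
    (d*d + d) plus a layer norm with scale and shift (2*d)."""
    return d * d + d + 2 * d

print(adapter_params(768))   # 592128  -> ~0.6M for the base model (d = 768)
print(adapter_params(1024))  # 1051648 -> ~1.05M for the large model (d = 1024)
```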
Appendix B OFA Results
We use the original pre-trained OFA model and the same fine-tuning procedure as in (Wang et al. 2022) to reproduce the OFA results and train DEED, using the official code (https://github.com/OFA-Sys/OFA). The only exception is RefCOCOg: we fine-tune our model on top of the RefCOCO fine-tuned model, because RefCOCOg has fewer training samples than RefCOCO and RefCOCO+, which makes the accuracy on RefCOCOg inferior if we fine-tune on RefCOCOg directly. Here we compare our reproduced OFA results to the original OFA results on VQAv2 (Goyal et al. 2017) for VQA and RefCOCO/RefCOCO+/RefCOCOg (Yu et al. 2016; Mao et al. 2016) for referring expression comprehension. As we can see in Table 5, our reproduced results are close to the reported numbers.