Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning
This work has received funding from the European Union's Horizon 2020 programme dAIEDGE (G.A. No 101120726).
Abstract
We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence. By doing so, the model can learn more effectively than traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, that achieves strong results on the MS-COCO 2014 Image Captioning challenge and the State of the Art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 All-CIDEr on the nocaps validation set. Additionally, we introduce an End to End training algorithm up to 2.8 times faster than established alternatives.
Index Terms:
Captioning, COCO, Sequence, Expansion
I Introduction
Image Captioning is the problem of describing images without human intervention. It is a challenging multi-modal task that requires both language comprehension and visual understanding. Early approaches relied on statistical and graph-based methods [1, 2], but since the advent of Neural Networks most Image Captioning systems have adopted an encoder-decoder structure [3, 4, 5]. The former component is responsible for extracting visual features from the image, whereas the latter serves the purpose of generating the description. Early works [4, 3, 5] relied on Convolutional Neural Network (CNN) backbones [6] combined with Recurrent Neural Networks (RNNs) [7, 8] to further refine the visual inputs and to generate the text. In contrast, modern Image Captioning systems adopt Attention-based [9, 10] architectures for the sequence modelling part and, in recent works [11, 12, 13], also during the image feature extraction. Currently, fully attentive models are the de facto standard architecture in many NLP and Vision research fields, and their ubiquity has led to many refinements and improvements of the formulation across multiple fields [14, 12, 15, 16, 17, 18, 11, 19, 20]. However, one of the purposes behind the development of the Attention mechanism [9, 21] was to spread the input sequence content along the whole collection of the encoder's hidden vectors instead of a single state, overcoming a significant performance bottleneck in RNNs. To do so, as the name suggests, the Attention mechanism enhances the values of a few elements and inhibits the others by means of the Softmax function.
Recently, many studies [22, 23, 24, 25, 26, 27] deepened the understanding of the attention approach, suggesting that there is little difference between the original formulation and alternative solutions such as Gaussian distributions [25], MLPs [27], and the Fourier Transform [26], and that the effectiveness of these methods depends mainly on their capability to form high-quality compositions out of the input. Motivated by these observations, our work investigates the possibility that the fixed number of elements provided by the input (the sequence length) represents a performance bottleneck for stateless architectures and limits their potential to form higher-quality compositions, in the particular field of Image Captioning. To this end, we propose the Expansion mechanism, a method that distributes and processes the sequence content using an increased or arbitrary number of elements and retrieves the original length in the complementary backward operation. We then introduce ExpansionNet v2 (depicted in Fig. 1), which to our knowledge is the first model that learns to exploit arbitrary sequence lengths in Image Captioning and achieves very competitive results without relying on the Attention's characteristic function.
The overall contributions of this work are the following: (i) we introduce a new method called the Expansion Mechanism that distributes the input content over an arbitrary or increased number of elements during the forward step and retrieves the original length in the complementary backward operation. To support both bidirectional and auto-regressive processing, we introduce two methods, called Static Expansion and Dynamic Expansion. Efficiency is addressed in their design and, as a result, the computational impact is negligible for small configurations; (ii) with the aforementioned methods, we design a novel architecture called ExpansionNet v2 that achieves strong results on MS-COCO 2014, outperforming similar models trained on the same dataset; (iii) given the positive results of our architecture, we find that traditional architectures in Image Captioning are indeed penalized by the fixed number of elements provided by the input; (iv) in contrast to the general trend, our model achieves strong results despite the removal of Attention from most components. Finally, we also propose a fast End-to-End training strategy that significantly lowers the training cost of our model compared to popular approaches.
II Related Works
Image Captioning models benefited greatly from Deep Learning methods. Starting from hand-crafted sentences combined with object detection [28, 29], the field moved to modern systems consisting of a neural encoder that extracts meaningful visual representations from the image and a decoder responsible for the description generation. In the early formulations, the decoder consisted of RNNs [8, 7], whereas the encoder consisted of a convolutional backbone [3, 4] that represented the entire image with a single feature vector. The backbone was later replaced by an object detector [5] that extracted a collection of salient regions of the image. This enabled the adoption of sequence modelling architectures in both encoding and decoding [30, 5, 4, 3] on top of the backbones. Most modern Image Captioning systems are currently based on the Transformer architecture [10], and many works focused on improving its formulation or structure [31, 17, 32, 14, 19, 33, 34]. For example, the work of [17] introduced geometrical awareness in the Self-Attention formulation. [31] modified the attentive layer with a gate that served the purpose of mitigating the contribution of irrelevant queries. [14] exploited bilinear pooling to enable a higher order of interactions across the input elements. Other works such as [18, 35, 12] focused on structural changes and on exploiting the visual input more effectively. Overall, all these methods follow the main components of the formulas introduced in [9, 21, 10]. Our Expansion mechanism is based on the adoption of embedding vectors. The effectiveness of integrating additional learnable parameters in the sequence was first observed in [33] for Machine Translation. Later, in Image Captioning, the concept was also deployed by [36] and [37]. In contrast to these works, our method is the only one that distributes the input into an arbitrary number of hidden vectors.
Another trend consists of pre-training the model on a huge amount of training data and fine-tuning it on the Image Captioning task [38, 16, 39, 40]. In particular, OFA [38] and GIT [16] currently represent the State-of-the-Art Image Captioning systems and outperform non-generative models by a significant margin. However, their model size poses an obstacle to deployment on memory-limited devices, and their training data are tens to hundreds of times larger than the popular MS-COCO 2014 [41]. For this reason, these works are considered orthogonal to ours, which could instead be integrated with them to potentially achieve better performances. In general, we only consider works that are trained exclusively on MS-COCO 2014; for this reason, the works of [12, 15, 16, 39, 40] are omitted during evaluation, since our model does not leverage additional data.

III Method
III-A Static and Dynamic Expansion
The Expansion mechanism is broken down into several steps. First, it distributes the sequence content into an arbitrary or increased number of elements (Section III-A1) using a "Forward Expansion", which is described in Section III-A2 and allows the network to process the sequence unconstrained by the fixed input length. Then, it retrieves the original length using the complementary operation, the "Backward Expansion", described in Section III-A3. Depending on the operations involved, we define two implementations of the idea: Static Expansion and Dynamic Expansion. The latter is designed to support both auto-regressive and bidirectional processing, in contrast to the former, which only supports the bidirectional case.
III-A1 Expansion coefficient
In both Static and Dynamic Expansion, the expansion coefficient defines a collection of learnable parameters, namely the expansion queries and biases. In the Static Expansion, the coefficient defines exactly the size of the expanded sequences, regardless of the input length, and the expansion queries and biases contain one learnable vector per expanded element. In contrast, in the Dynamic Expansion, the expanded sequence size is the product of the input length and the expansion coefficient, and the expansion queries and biases are calculated with the BroadSum operator, defined in the two cases as:

(1)

where the operator combines a linear projection of the input with the learnable expansion parameters, replicated by means of the column-wise concatenation of identity matrices. An example of the input and output of the BroadSum operation is depicted in the bottom left of Fig. 2, where the bias vectors are omitted for simplicity.
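As an illustration only, the following is a minimal sketch of one plausible reading of the BroadSum operator consistent with the description above: the learnable expansion parameters are replicated once per input position (the role played by the concatenated identity matrices) and summed with a linear projection of the input, yielding an expanded set of dynamic queries. The function name and exact definition are assumptions and may differ from the original implementation.

```python
import torch
import torch.nn as nn

def broad_sum_sketch(x_proj: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
    """Hypothetical BroadSum: x_proj is (batch, N, d), a linear projection of the
    input; params is (N_E, d), the learnable expansion queries or biases.
    Returns (batch, N * N_E, d): params replicated per position plus the input."""
    b, n, d = x_proj.shape
    n_exp = params.size(0)
    tiled = params.unsqueeze(0).unsqueeze(0).expand(b, n, n_exp, d)  # replicate per position
    summed = tiled + x_proj.unsqueeze(2)                             # broadcast sum
    return summed.reshape(b, n * n_exp, d)

# Toy example: 7 input elements expanded with a coefficient of 4 -> 28 elements.
x = torch.randn(2, 7, 16)
queries = nn.Parameter(torch.randn(4, 16))
print(broad_sum_sketch(nn.Linear(16, 16)(x), queries).shape)  # torch.Size([2, 28, 16])
```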
III-A2 Forward Expansion
The forward expansion generates the expanded sequences and involves three linear projections of the input. First of all, the "Length Transformation Matrix" is computed as the dot-product similarity between the first projection and the expansion queries:

(2)

The result is fed into the following operations:

(3)

where the second operation is the row-wise normalization function defined as:

(4)

in which a small coefficient ensures the feasibility of the operation. Then, the expanded sequences are calculated as follows:

(5)
III-A3 Backward expansion
In the backward step, the original sequence length is retrieved by transposing the length transformation matrix in Equation 2 and applying the same operations of Equation 3:
(6)
This time, the matrices are multiplied with the expanded sequences of Equation 5:
(7)
Finally, the two resulting streams are combined by means of a sigmoid gate:

(8)

where the gate is computed from a linear projection of the input.

The backward operation completes the computation performed by the Static and Dynamic Expansion. It can be noted that all operations in the forward expansion (3), (5) and in the backward expansion (6), (7) are duplicated into two operation streams, which differ mainly in the sign used in the computation of the Length Transformation Matrix in (2). This decision was made to mitigate the remote possibility of the matrix being populated only by zeros. It does not affect the results compared to a single stream but slightly increases the computational cost.
In the case of the Dynamic Expansion, masking is applied when calculating the results in (5) and (7) to preserve the auto-regressive property. The operating principle of the Static and Dynamic Expansion is illustrated in Fig. 2, which, for simplicity, depicts only a single operation stream and omits the biases and the output sigmoid gate.
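To make the flow of Sections III-A2 and III-A3 concrete, the following is a minimal PyTorch sketch of a Static Expansion layer with both sign streams. It is a hypothetical re-implementation based only on the description above: the scaled dot-product similarity, the rectification before the row-wise normalization (suggested by the two sign streams and the zero-matrix caveat), the placement of the biases, and all names are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class StaticExpansionSketch(nn.Module):
    """Illustrative Static Expansion: forward/backward expansion plus output gate."""
    def __init__(self, d_model: int, n_exp: int, eps: float = 1e-4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_exp, d_model) * 0.02)  # expansion queries
        self.bias = nn.Parameter(torch.zeros(n_exp, d_model))            # expansion biases
        self.proj_k = nn.Linear(d_model, d_model)    # projection used for the similarity
        self.proj_v1 = nn.Linear(d_model, d_model)   # value projection, "+" stream
        self.proj_v2 = nn.Linear(d_model, d_model)   # value projection, "-" stream
        self.proj_gate = nn.Linear(d_model, d_model)
        self.eps = eps

    def _norm_rows(self, m):
        # rectification + row-wise normalization; eps keeps all-zero rows well defined
        m = torch.relu(m) + self.eps
        return m / m.sum(dim=-1, keepdim=True)

    def forward(self, x):                                             # x: (batch, N, d_model)
        d = x.size(-1)
        k = self.proj_k(x)
        sim = self.queries @ k.transpose(1, 2) / d ** 0.5             # length transf. matrix, cf. (2)
        fw_pos, fw_neg = self._norm_rows(sim), self._norm_rows(-sim)  # forward expansion, cf. (3)
        exp_pos = fw_pos @ self.proj_v1(x) + self.bias                # expanded sequences, cf. (5)
        exp_neg = fw_neg @ self.proj_v2(x) + self.bias
        bw_pos = self._norm_rows(sim.transpose(1, 2))                 # backward expansion, cf. (6)
        bw_neg = self._norm_rows(-sim.transpose(1, 2))
        out = bw_pos @ exp_pos + bw_neg @ exp_neg                     # back to length N, cf. (7)
        return torch.sigmoid(self.proj_gate(x)) * out                 # output gate, cf. (8)
```

A Dynamic Expansion sketch would additionally build the queries per input position (e.g. via a BroadSum-like operator) and mask the matrices in (5) and (7) so that each position only attends to its own prefix.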
III-A4 Block Static Expansion
To increase the effectiveness of the Static Expansion, we perform the Forward and Backward operations over a collection of target lengths instead of a single one. We call this operation Block Static Expansion. From a formulation perspective, all operations are repeated over a group of expansion coefficients and can be implemented in such a way that both forward and backward steps are performed over all targets at the same time. All expansion group queries and biases can be combined into a single one:

(9)

and the computational efficiency of the previous formulation is preserved. During the backward stage, the length transformation matrix is scaled by the inverse of the number of elements in the group.
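A minimal sketch of the concatenation trick described above, under the same assumptions as the earlier StaticExpansionSketch: the queries and biases of every coefficient in the group are stacked into a single pair of tensors, so one forward/backward pass covers all target lengths, and the backward result is scaled by the inverse of the group size. The helper name is hypothetical.

```python
import torch
import torch.nn as nn

def build_block_expansion_params(group, d_model):
    """Combine the expansion queries/biases of every coefficient in `group`
    (e.g. G = [32, 64, 128, 256, 512]) into a single pair of tensors, so the
    forward and backward expansions run over all target lengths at once."""
    queries = nn.Parameter(torch.randn(sum(group), d_model) * 0.02)
    biases = nn.Parameter(torch.zeros(sum(group), d_model))
    backward_scale = 1.0 / len(group)   # scale applied during the backward stage
    return queries, biases, backward_scale

q, b, scale = build_block_expansion_params([32, 64, 128, 256, 512], d_model=512)
print(q.shape, b.shape, scale)  # torch.Size([992, 512]) torch.Size([992, 512]) 0.2
```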
III-B Architecture
Our model consists of the standard encoder-decoder structure implemented on top of the Swin-Transformer backbone, whose details are provided in [13]. The image is first fed into the backbone:

(10)

which generates the initial set of processed visual features. The result is fed into the encoder, which is made of Static Expansion + FeedForward blocks. Here skip connections and pre-layer normalization [42] are adopted, and the following formulas describe each encoder layer:

(11)

Similarly, given a generic input sequence (at the training stage, so we can omit the time axis), the decoder is made of Dynamic Expansion + Cross-Attention + FeedForward blocks, where skip connections and normalization are applied to each component. Each decoder layer is described by the following equations:

(12)
All layers are summed through a linear projection and the final output is fed to the classification layer. Fig. 3 depicts the main structure.
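The block composition described by (11) and (12) can be summarized by the sketch below, assuming pre-layer normalization and residual connections around every sub-layer (the text states this explicitly for the encoder; extending it to the decoder is an assumption). `StaticExpansionSketch` refers to the hypothetical layer sketched in Section III-A, while the dynamic/masked expansion, cross-attention, and feed-forward modules are placeholders passed in by the caller.

```python
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Static Expansion + FeedForward block with pre-LN and skip connections, cf. (11)."""
    def __init__(self, d_model, static_expansion, feed_forward):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.static_expansion = static_expansion
        self.feed_forward = feed_forward

    def forward(self, x):
        x = x + self.static_expansion(self.norm1(x))   # first sub-layer
        return x + self.feed_forward(self.norm2(x))    # second sub-layer

class DecoderLayerSketch(nn.Module):
    """Dynamic Expansion + Cross-Attention + FeedForward block, cf. (12)."""
    def __init__(self, d_model, dynamic_expansion, cross_attention, feed_forward):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dynamic_expansion = dynamic_expansion
        self.cross_attention = cross_attention
        self.feed_forward = feed_forward

    def forward(self, y, enc_out):
        y = y + self.dynamic_expansion(self.norm1(y))         # masked, auto-regressive
        y = y + self.cross_attention(self.norm2(y), enc_out)  # attends to the encoder output
        return y + self.feed_forward(self.norm3(y))
```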
Encoder | Decoder | B1 | B2 | B3 | B4 | M | R | C | S |
Baseline | Baseline | 75.3 | 59.2 | 45.4 | 34.6 | 28.4 | 57.0 | 115.8 | 21.6 |
Stc. Exp. G= | Baseline | 76.4 | 60.6 | 46.6 | 35.5 | 28.6 | 57.2 | 117.8 | 21.9 |
Stc. Exp. G= | Baseline | 75.9 | 59.9 | 46.1 | 35.2 | 28.9 | 57.1 | 117.9 | 22.3 |
Stc. Exp. G= | Baseline | 76.3 | 60.4 | 46.4 | 35.5 | 28.8 | 57.1 | 117.7 | 22.0 |
Baseline | Dyn. Exp.=4 | 77.2 | 61.4 | 47.4 | 36.2 | 28.9 | 57.7 | 119.7 | 22.3 |
Baseline | Dyn. Exp.=8 | 76.9 | 61.5 | 47.9 | 37.1 | 29.1 | 57.8 | 120.8 | 22.3 |
Baseline | Dyn. Exp.=16 | 76.7 | 61.4 | 47.8 | 36.8 | 29.0 | 57.7 | 121.2 | 22.2 |
Stc. Exp. G= | Dyn. Exp.=16 | 77.4 | 61.9 | 48.2 | 37.3 | 29.2 | 58.0 | 122.2 | 22.3 |
Stc. Exp. G= | Dyn. Exp.=16 | 77.8 | 62.3 | 48.3 | 37.2 | 29.3 | 58.3 | 122.8 | 22.5 |
Stc. Exp. G= | Dyn. Exp.=16 | 77.4 | 62.0 | 48.2 | 37.2 | 29.2 | 58.0 | 122.5 | 22.2 |
Stc. Exp. G= | Dyn. Exp.=16 | 77.3 | 61.7 | 47.9 | 37.0 | 29.3 | 58.0 | 122.7 | 22.4 |
Stc. Exp. G= | Dyn. Exp.=16 | 77.6 | 62.0 | 48.2 | 37.2 | 29.4 | 58.1 | 123.5 | 22.5 |
III-C Training objectives
The model is first pre-trained using the Cross-Entropy loss:

(13)

where the probability of each target word is the one assigned by the model parameters given the image and the previous words. Additionally, the CIDEr-D score is optimized using SCST [43], which minimizes the negative expected reward and whose gradient can be approximated as follows:

(14)

where the baseline is computed according to [44] and the reward is the CIDEr-D score assigned to the sampled sequence.
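The two objectives can be sketched as follows. The cross-entropy step is standard; the SCST step uses the mean reward over the sampled captions as baseline, a simplification of the variant in [44] (which uses the mean of the other samples). Sampling captions and computing CIDEr-D rewards are assumed to happen outside these helpers, whose names are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_entropy_step(logits, targets, pad_idx=0):
    """Cf. (13): negative log-likelihood of each target word given image and prefix.
    logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    return F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=pad_idx)

def scst_step(sample_log_probs, sample_rewards):
    """Cf. (14): policy-gradient estimate of the negative expected CIDEr-D reward.
    sample_log_probs: (batch, n_samples) summed log-probabilities of sampled captions;
    sample_rewards:   (batch, n_samples) CIDEr-D rewards of the same captions."""
    baseline = sample_rewards.mean(dim=1, keepdim=True)     # simplified mean baseline
    advantage = sample_rewards - baseline
    return -(advantage.detach() * sample_log_probs).mean()  # gradients flow through log-probs
```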
Although we optimize the model with two loss functions, for each of them the training stage is further split into two steps, which allows a broader range of computational resources to reproduce this work.
IV Results
IV-A Experimental Setup
IV-A1 Dataset
The training dataset consists of the popular MS-COCO benchmark [41] split according to [45], resulting in 113287 image-description pairs for training, 5000 for validation, and 5000 for testing. Each reference caption is pre-processed by a simple pipeline consisting of lowercasing, removing punctuation, and filtering out words that do not occur at least 5 times (vocabulary of size 10000). Additionally, the final model is evaluated on the validation set of the Novel Object Captioning at Scale (nocaps) dataset [46], which consists of three classes of images, called in-domain, near-domain, and out-of-domain, according to the familiarity of their object classes with respect to those contained in the training set. This dataset is subject to the same pre-processing as MS-COCO and serves the purpose of further challenging the model in unfavourable conditions.
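A minimal sketch of the caption pipeline described above (lowercasing, punctuation removal, and a vocabulary keeping only words occurring at least 5 times). Whitespace tokenization and mapping rare words to a special `<unk>` token are assumptions; the text only states that rare words are filtered out.

```python
import re
from collections import Counter

def preprocess_captions(captions, min_freq=5):
    """Lowercase, strip punctuation, and build a minimum-frequency vocabulary."""
    tokenized = [re.sub(r"[^a-z0-9 ]", " ", c.lower()).split() for c in captions]
    counts = Counter(w for toks in tokenized for w in toks)
    vocab = {w for w, n in counts.items() if n >= min_freq}
    # words outside the vocabulary are mapped to <unk> (assumed handling)
    return [[w if w in vocab else "<unk>" for w in toks] for toks in tokenized], vocab

# Toy example (min_freq lowered so the two captions keep their words):
tokens, vocab = preprocess_captions(
    ["A man riding a wave on a surfboard.", "Two dogs play in the grass."], min_freq=1)
```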
IV-A2 Model details
Two models are implemented for the experimental setup: the baseline, which is the Base Transformer, and our main model, referred to as "ExpansionNet v2". In the latter, the Dynamic expansion coefficient is set to 16 and the Static expansion coefficients consist of the group {32, 64, 128, 256, 512} (more details in Section IV-B). Both rely on the same backbone, the Swin-Transformer in the Large configuration [13] pre-trained on ImageNet [47]. All images are subject to minimal pre-processing: first, they are resized to a fixed resolution, then RGB values are converted into a normalized range and further standardized with fixed per-channel mean and standard deviation. The source code of the experiments is available at https://github.com/jchenghu/ExpansionNet_v2.
IV-A3 Training algorithm
It can be observed that the Swin-Transformer backbone is the most computationally expensive part of the system. For this reason, inspired by [48] and in order to make the End to End training accessible to a broader range of computational resources, our training is divided into four steps: each phase (in both the cross-entropy training and the reinforcement stage) consists of an initial training step in which the backbone's weights are frozen, followed by a fine-tuning step during which gradients flow through the whole system:
Step A) Cross-Entropy – Frozen backbone. The model is trained using batch size 48, an initial learning rate of 2e-4 with a warmup of 10000, annealed by a factor of 0.8 every 2 epochs, for 8 epochs (see the scheduler sketch after this list);
Step B) Cross-Entropy – End to End. The whole system is trained for 2 additional epochs, using batch size 48 and an initial learning rate of 3e-5 annealed by 0.55 every epoch;
Step C) CIDEr-D optimization – Frozen backbone. The reinforcement phase adopts a batch size of 48 and an initial learning rate of 1e-4 with no warmup, annealed by 0.8 every epoch, for 9 epochs;
Step D) CIDEr-D optimization – End to End. The whole system is fine-tuned for a few more iterations, up to one additional epoch, using a batch size of 20 and a fixed learning rate of 2e-6. This step is optional since it only slightly contributes to the final performances and can be skipped if no improvements are observed. All CIDEr-D optimization steps are implemented according to the Standard configuration (SacreEOS signature [49]):
STANDARD_wInit+Cider-D[n4,s6.0]+average[nspi5]+1.0.0.
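As an illustration, a scheduler reproducing the Step A recipe could be sketched as below. The linear warmup shape and the per-iteration granularity are assumptions (the text specifies only the warmup length, the 0.8 factor, and the 2-epoch period), and `steps_per_epoch` depends on dataset size and batch size.

```python
import torch

def make_step_a_scheduler(optimizer, steps_per_epoch, warmup_steps=10000,
                          anneal_factor=0.8, anneal_every_epochs=2):
    """Warmup followed by epoch-wise step annealing, mirroring Step A above."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                 # assumed linear warmup shape
        epochs_done = step // steps_per_epoch
        return anneal_factor ** (epochs_done // anneal_every_epochs)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Example usage with the RAdam optimizer mentioned in Section IV-A3:
model = torch.nn.Linear(8, 8)                                # stand-in model
opt = torch.optim.RAdam(model.parameters(), lr=2e-4)
sched = make_step_a_scheduler(opt, steps_per_epoch=2360)     # ~113287 images / batch size 48
```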
Cross-Entropy | CIDEr-D optimization | |||||||||||
Model | B1 | B4 | M | R | C | S | B1 | B4 | M | R | C | S |
Up-Down [5] | 77.2 | 36.2 | 27.0 | 56.4 | 113.5 | 20.3 | 79.8 | 36.3 | 27.7 | 56.9 | 120.1 | 21.4 |
GCN-LSTM [50] | 77.3 | 36.8 | 27.9 | 57.0 | 116.3 | 20.9 | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 | 22.0 |
SGAE [51] | - | - | - | - | - | - | 80.8 | 38.4 | 28.4 | 58.6 | 127.8 | 22.1 |
AoANet [31] | 77.4 | 37.2 | 28.4 | 57.5 | 119.8 | 21.3 | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
X-Transformer [14] | 77.3 | 37.0 | 28.7 | 57.5 | 120.0 | 21.8 | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
GET [35] | - | - | - | - | - | - | 81.5 | 39.5 | 29.3 | 58.9 | 131.6 | 22.8 |
DLCT [18] | - | - | - | - | - | - | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
RSTNet [52] | - | - | - | - | - | - | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
PureT [11] | - | - | - | - | - | - | 82.1 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2 |
ExpansionNet v2 | 78.1 | 38.1 | 30.1 | 58.9 | 128.2 | 23.5 | 82.8 | 41.5 | 30.3 | 60.5 | 140.4 | 24.5 |
Cross-Entropy | CIDEr-D optimization | |||||||||||
Model | B1 | B4 | M | R | C | S | B1 | B4 | M | R | C | S |
GCN-LSTM [50] | 77.4 | 37.1 | 28.1 | 57.2 | 117.1 | 21.1 | 80.9 | 38.3 | 28.6 | 58.5 | 128.7 | 22.1 |
SGAE [51] | - | - | - | - | - | - | 81.0 | 39.0 | 28.4 | 58.9 | 129.1 | 22.2 |
AoANet [31] | 78.7 | 38.1 | 28.5 | 58.2 | 122.7 | 21.7 | 81.6 | 40.2 | 29.3 | 59.4 | 132.0 | 22.8 |
X-Transformer [14] | 77.8 | 37.7 | 29.0 | 58.0 | 122.1 | 21.9 | 81.7 | 40.7 | 29.9 | 59.7 | 135.3 | 23.8 |
GET [35] | - | - | - | - | - | - | 82.1 | 40.6 | 29.8 | 59.6 | 135.1 | 23.8 |
DLCT [18] | - | - | - | - | - | - | 82.2 | 40.8 | 29.9 | 59.8 | 137.5 | 23.3 |
PureT [11] | - | - | - | - | - | - | 83.4 | 42.1 | 30.4 | 60.8 | 141.0 | 24.3 |
ExpansionNet v2 | 78.5 | 38.5 | 29.9 | 58.8 | 128.7 | 23.6 | 83.5 | 42.7 | 30.6 | 61.1 | 143.7 | 24.7 |
Despite its apparent complexity, this procedure is much more computationally friendly than the standard practice of training with a small batch size of 10 for 30 epochs in both optimization stages. As a matter of fact, only a much smaller number of training epochs is dedicated to fine-tuning the whole system. Thus, the cost of computing the backbone's gradients is often avoided, and the cost of the forward pass can be drastically reduced as well. In particular, in our implementation, during Steps A and C the backbone's forward pass is performed only once for each image in the dataset, so its cost is replaced by a memory read and copy. All steps are trained using the RAdam optimizer [53].
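The frozen-backbone steps can be served by a simple feature cache, as the sketch below illustrates. It assumes a loader yielding (image id, image tensor) pairs and features that fit in host memory; the actual storage strategy (RAM, disk, memory-mapped files) is an implementation choice not prescribed here.

```python
import torch

@torch.no_grad()
def cache_backbone_features(backbone, image_loader, device="cuda"):
    """Run the frozen backbone once per image and reuse the features afterwards."""
    backbone.eval()
    cache = {}
    for image_ids, images in image_loader:
        feats = backbone(images.to(device))         # visual features, computed once per image
        for i, img_id in enumerate(image_ids):
            cache[int(img_id)] = feats[i].cpu()     # assumes integer image ids
    return cache

# During Steps A and C, the captioning head reads features from `cache`
# instead of invoking the backbone, replacing its cost with a memory read and copy.
```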
IV-B Ablation Study
To study the effectiveness of our method, we replace the encoder and decoder of the baseline with our layers and evaluate several settings of expansion coefficients. It can be observed from Table I that the impact of the static expansion layer in the single-group configuration is limited. In fact, it only slightly improves the baseline, regardless of the choice of the coefficient. Conversely, the dynamic expansion layer showcases a more significant improvement, obtaining the best result for a coefficient of 16. When the two expansion methods are combined, the model outperforms the baseline across all metrics with a margin of at least 6.0 CIDEr-D, 2.0 BLEU, 0.5 SPICE, 1.0 ROUGE, and 0.8 METEOR. Analyzing several configurations of length groups in the static expansion, it appears that introducing more expansion vectors does not necessarily lead to better performances, since for the groups {128, 128, 128, 128, 128}, {384, 384, 384, 384, 384}, and {512, 512, 512, 512, 512} the model yields similar results. However, the model seems to benefit from a diverse selection of coefficients, such as the group {32, 64, 128, 256, 512}, which is adopted in the remaining experiments. Ultimately, all instances outperform the baseline across all metrics.
B1 | B2 | B3 | B4 | METEOR | ROUGE-L | CIDEr-D | ||||||||
Model | c5 | c40 | c5 | c40 | c5 | c40 | c5 | c40 | c5 | c40 | c5 | c40 | c5 | c40 |
Up-Down [5] | 80.2 | 95.2 | 64.1 | 88.8 | 49.1 | 79.4 | 36.9 | 68.5 | 27.6 | 36.7 | 57.1 | 72.4 | 117.9 | 120.5 |
GCN-LSTM [50] | - | - | 65.5 | 89.3 | 50.8 | 80.3 | 38.7 | 69.7 | 28.5 | 37.6 | 58.5 | 73.4 | 125.3 | 126.5 |
SGAE [51] | 81.0 | 95.3 | 65.6 | 89.5 | 50.7 | 80.4 | 38.5 | 69.7 | 28.2 | 37.2 | 58.6 | 73.6 | 123.8 | 126.5 |
AoANet [31] | 81.0 | 95.0 | 65.8 | 89.6 | 51.4 | 81.3 | 39.4 | 71.2 | 29.1 | 38.5 | 58.9 | 74.5 | 126.9 | 129.6 |
X-Transformer [14] | 81.9 | 95.7 | 66.9 | 90.5 | 52.4 | 82.5 | 40.3 | 72.4 | 29.6 | 39.2 | 59.5 | 75.0 | 131.1 | 133.5 |
RSTNet [52] | 82.1 | 96.4 | 67.0 | 91.3 | 52.2 | 83.0 | 40.0 | 73.1 | 29.6 | 39.1 | 59.5 | 74.6 | 131.9 | 134.0 |
GET [35] | 81.6 | 96.1 | 66.5 | 90.9 | 51.9 | 82.8 | 39.7 | 72.9 | 29.4 | 38.8 | 59.1 | 74.4 | 130.3 | 132.5 |
DLCT [18] | 82.4 | 96.6 | 67.4 | 91.7 | 52.8 | 83.8 | 40.6 | 74.0 | 29.8 | 39.6 | 59.8 | 75.3 | 133.3 | 135.4 |
PureT [11] | 82.8 | 96.5 | 68.1 | 91.8 | 53.6 | 83.9 | 41.4 | 74.1 | 30.1 | 39.9 | 60.4 | 75.9 | 136.0 | 138.3 |
OFA [15] | 84.5 | 98.1 | 70.1 | 94.4 | 55.9 | 87.8 | 43.6 | 78.7 | 32.1 | 42.7 | 62.5 | 79.0 | 147.2 | 149.6 |
GIT [16] | 84.3 | 98.1 | 70.0 | 94.4 | 55.7 | 87.6 | 43.2 | 78.3 | 31.9 | 42.1 | 62.0 | 78.4 | 146.4 | 149.8 |
ExpansionNet v2 | 83.3 | 96.9 | 68.8 | 92.6 | 54.4 | 85.0 | 42.1 | 75.3 | 30.4 | 40.1 | 60.8 | 76.4 | 138.5 | 140.8
IV-C Performance Comparison
IV-C1 COCO Offline Evaluation
Table II and Table III report the score comparison between ExpansionNet v2 and the best-performing models of recent years. Up-Down [5] introduced the idea of extracting a collection of features from the images using an object detector like Faster-RCNN [6], in contrast to the classification backbone of [4]. The idea was adopted in most of the following architectures as well, for instance in the case of GCN-LSTM [50] and SGAE [51], which additionally implemented a convolutional graph network on top of it to exploit the information provided by a scene graph. AoANet [31] adopted the Transformer and improved the attentive components with two gates serving the purpose of simulating an additional level of attention over the inputs, and augmented the language modelling part with an LSTM. On the other hand, X-Transformer [14] adopted a fully attentive architecture and further refined the attentive blocks by means of bilinear pooling techniques. The most recent and best-performing architectures focused on finding more effective ways to feed visual information into the sequence modelling network. For instance, RSTNet [52] showcased the effectiveness of grid features over regions, GET [35] processed the images using a global representation in conjunction with the local ones, and DLCT [18] exploited the advantages of both region and grid visual features. Finally, PureT [11] implemented the first end-to-end Transformer architecture, applying the Window / Shifted-Window MHA [13] in both the encoder and decoder. ExpansionNet v2 outperforms PureT by a margin of 0.7 BLEU1, 0.6 BLEU4, 0.1 METEOR, 0.4 ROUGE, 2.2 CIDEr-D and 0.3 SPICE in the single model case, and by 0.1 BLEU1, 0.6 BLEU4, 0.3 ROUGE, 2.7 CIDEr-D and 0.4 SPICE in the ensemble configuration.
IV-C2 COCO Online Evaluation
We evaluate ExpansionNet v2 in the ensemble configuration, adopting the standard Beam Search (beam size 5) over the official testing set of 40775 images and submitting the predictions to the online testing server. Results are reported in Table IV, where c5 and c40 denote the scores computed with 5 and 40 reference captions (unknown to the user), respectively.
Our model achieves State-of-the-Art performance (as of 2 July 2022) among non-generative models trained on MS-COCO 2014, outperforming the previous best [11] by a margin of 1.2 BLEU4 (c40), 0.2 METEOR (c40), 0.5 ROUGE-L (c40) and 2.5 CIDEr-D in both the c5 and c40 instances. However, it is ultimately outperformed, by a significant margin, by generative models [15, 16], which we consider orthogonal to our work since they focus more on the training method and data quality than on architecture design.
IV-C3 Nocaps Evaluation
We evaluate ExpansionNet v2 on the nocaps validation set. In particular, we adopt a single model trained exclusively with the Cross-Entropy loss, using no additional pre-training datasets. The predictions are generated by the standard Beam Search algorithm (beam size 3), in contrast to CBS [54]. A limited comparison is reported in Table V, which showcases that our model achieves very competitive results among the architectures trained in similar configurations, with an overall lead of 17.6 CIDEr and 1.4 SPICE over the Up-Down model [5]. It is still ultimately outperformed by recent V+L pre-training-based works.
IV-D Training and Inference Cost
The efficiency aspect was addressed in the design of the Expansion mechanism. For instance, it can be observed from Table VI that doubling the expansion coefficients does not double the FLOPS, which would be the case if the input sequence length were actually doubled. In particular, for small coefficients, our model is comparable to the Transformer in terms of computational cost. In contrast, the full ExpansionNet v2 is 1.63× slower than the baseline because of its abundant selection of expansion coefficients.
In Table VII, we compare our training time with the ones reported by other works. In particular, time entries are estimated assuming that all models have the same computational cost as ExpansionNet v2, which is a generous approximation in the case of generative models whose sizes are tens of times larger. Despite such a premise, and the fact that we also perform end-to-end training, it can be seen that our model can be trained up to 2.8× faster than other non-generative models and up to 46.8× faster than generative ones. Recalling the results in Table IV, performance-wise our model achieves 93.9% of the performance of the State-of-the-Art model GIT [16], while using about 7080× less data and being about 129× smaller.
Encoder | Decoder | FLOPS |
Baseline | Baseline | |
Stc. Exp. G= | Baseline | |
Stc. Exp. G= | Baseline | |
Stc. Exp. G= | Baseline | |
Baseline | Dyn. Exp.=4 | |
Baseline | Dyn. Exp.=8 | |
Baseline | Dyn. Exp.=16 | |
Stc. Exp. G= | Dyn. Exp.=16 | |
Stc. Exp. G= | Dyn. Exp.=16 | |
Stc. Exp. G= | Dyn. Exp.=16 | |
Stc. Exp. G= | Dyn. Exp.=16 | |
ExpansionNet v2 | ExpansionNet v2 |
Source | Params. (ratio) | Datasets, total num. images (ratio) | Training Description | Train. time (ratio)
Obj. Transf. [17], AoANet [31], PureT [11] | 33M (0.86), 87M (2.28), 34M (0.89) | MS-COCO 2014 113K (1.00) | Cross-Entropy: 30 epochs and batch size 10. Reinforcement: 30 epochs and batch size 10. | 7 days (2.80)
X-Transformer [14] | 141M (3.71) | MS-COCO 2014 113K (1.00) | Cross-Entropy: 70 epochs and batch size 40. Reinforcement: 35 epochs and batch size 32. | 5 days (2.00) |
GIT [16] | 4.9B (128.94) | MS-COCO, CC3M, CC12M, VG, SBU, ALT200M + 0.6B ≈ 0.8B (7079.64). | 2 epochs and we assume batch size 48 in the estimation. | 117 days (46.80)
OFA [15] | 871M (22.92) | MS-COCO, CC3M, CC12M, VG, SBU 15M (132.74). | 40 epochs and we assume batch size 48 in the estimation. | 44 days (17.60)
ExpansionNet v2 | 38M (1.00) | MS-COCO 2014 113K (1.00) | See Section IV-A. | 2.5 days (1.00) |
Image | Captions |
[image]
Baseline: A man holding a tennis ball on a tennis court. ExpansionNet v2: A man jumping in the air to hit a tennis ball. Gt: {A tennis player jumps and swats at the ball.; A tennis player hitting a tennis ball on a court.; Professional tennis player immediately after returning a shot.} |
[image]
Baseline: A little girl brushing her hair with a table. ExpansionNet v2: A little girl brushing her hair with a pink brush. Gt: {A young girl tries to comb her own hair.; A young child brushing her hair with a big pink brush.; A young girl is trying to brush her hair with a pink brush.} |
IV-E Qualitative Analysis
Table VIII provides some examples of captions. Regardless of the image complexity, ExpansionNet v2 is not only able to correctly describe the subjects depicted in the scenes but also showcases a good level of semantic understanding by describing goals and interactions. Unfortunately, our model seems to struggle with out-of-domain objects, as showcased in Table IX where, due to objects and terms unknown to the model, predictions are either imprecise (2nd image) or incorrect (1st image). Nonetheless, it still provides a roughly correct description of the overall scene. We showcase an example of attention visualization in Fig. 4, where the scattered focus correctly outlines the main subjects despite the absence of an object detector.
V Conclusion
In this work, we addressed the question of whether the fixed number of elements of the input represents a performance bottleneck in modern image-captioning systems. To this end, we presented the idea of the Expansion mechanism and provided two concrete implementations, called Static Expansion and Dynamic Expansion, that process the input using sequences whose length differs from the one provided in the input. On top of these layers, we designed a new architecture called ExpansionNet v2 and trained it on the MS-COCO 2014 dataset using a fast End to End training approach. Extensive experiments conducted on the test set showcase that our method achieves better performances than the baseline, which answers positively the initial research question of whether the input length can represent a bottleneck for sequence processing in the case of Image Captioning. Additionally, ExpansionNet v2 achieved strong performances on both the offline (143.7 CIDEr-D) and online (140.8 CIDEr-D) test splits and is outperformed mainly by V+L pre-training models, which we consider orthogonal to our work due to the differences in model size and additional training data. Future works will further develop the methods and ideas presented here, motivated by the fact that they can be easily integrated into other approaches (such as V+L pre-training) and other research fields.
Image | Captions |
[image]
ExpansionNet v2: A close-up of a fish in a body of water.
3 gts: { A seahorse in an aquarium full of water with some plants growing in the background.; A blue seahorse is swimming near sea plants on back.; A very small seahorse is in the water along with other pieces. } |
[image]
ExpansionNet v2: Three pictures of a blender with red liquid in it.
3 gts: { A picture of three blenders with a strawberry looking beverage inside.; A white mixer in the process of making a smoothie.; The steps of making a smoothie in a blender are shown. } |

References
- [1] M. Mitchell et al., “Midge: Generating image descriptions from computer vision detections,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 747–756.
- [2] G. Kulkarni et al., “Babytalk: Understanding and generating simple image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
- [3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
- [4] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
- [5] P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
- [6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
- [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [8] K. Cho et al., “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
- [9] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
- [10] A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
- [11] Y. Wang, J. Xu, and Y. Sun, “End-to-end transformer based model for image captioning,” arXiv preprint arXiv:2203.15350, 2022.
- [12] V.-Q. Nguyen, M. Suganuma, and T. Okatani, “Grit: Faster and better image captioning transformer using dual visual features,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI. Springer, 2022, pp. 167–184.
- [13] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
- [14] Y. Pan, T. Yao, Y. Li, and T. Mei, “X-linear attention networks for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 971–10 980.
- [15] P. Wang et al., “Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in International Conference on Machine Learning. PMLR, 2022, pp. 23 318–23 340.
- [16] J. Wang et al., “Git: A generative image-to-text transformer for vision and language,” arXiv preprint arXiv:2205.14100, 2022.
- [17] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” arXiv preprint arXiv:1906.05963, 2019.
- [18] Y. Luo et al., “Dual-level collaborative transformer for image captioning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2286–2293.
- [19] Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang, “Star-transformer,” arXiv preprint arXiv:1902.09113, 2019.
- [20] J. Hao, X. Wang, B. Yang, L. Wang, J. Zhang, and Z. Tu, “Modeling recurrence for transformer,” arXiv preprint arXiv:1904.03092, 2019.
- [21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
- [22] A. Raganato, Y. Scherrer, and J. Tiedemann, “Fixed encoder self-attention patterns in transformer-based machine translation,” arXiv preprint arXiv:2002.10260, 2020.
- [23] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in International conference on machine learning. PMLR, 2021, pp. 10 183–10 192.
- [24] H. Ramsauer et al., “Hopfield networks is all you need,” arXiv preprint arXiv:2008.02217, 2020.
- [25] W. You, S. Sun, and M. Iyyer, “Hard-coded gaussian attention for neural machine translation,” arXiv preprint arXiv:2005.00742, 2020.
- [26] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, “Fnet: Mixing tokens with fourier transforms,” arXiv preprint arXiv:2105.03824, 2021.
- [27] I. O. Tolstikhin et al., “Mlp-mixer: An all-mlp architecture for vision,” Advances in neural information processing systems, vol. 34, pp. 24 261–24 272, 2021.
- [28] R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 966–973.
- [29] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, “I2t: Image parsing to text description,” Proceedings of the IEEE, vol. 98, no. 8, pp. 1485–1508, 2010.
- [30] L. Wang, Z. Bai, Y. Zhang, and H. Lu, “Show, recall, and tell: Image captioning with recall mechanism,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 176–12 183.
- [31] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4634–4643.
- [32] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” arXiv preprint arXiv:1805.07932, 2018.
- [33] S. Sukhbaatar, E. Grave, G. Lample, H. Jegou, and A. Joulin, “Augmenting self-attention with persistent memory,” arXiv preprint arXiv:1907.01470, 2019.
- [34] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
- [35] J. Ji et al., “Improving image captioning by leveraging intra-and inter-layer global representation in transformer network,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 2, 2021, pp. 1655–1663.
- [36] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 578–10 587.
- [37] P. Zeng, H. Zhang, J. Song, and L. Gao, “S2 transformer for image captioning,” in Proceedings of the International Joint Conferences on Artificial Intelligence, vol. 5, 2022.
- [38] P. Wang et al., “Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” arXiv preprint arXiv:2202.03052, 2022.
- [39] X. Hu et al., “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 980–17 989.
- [40] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” arXiv preprint arXiv:2201.12086, 2022.
- [41] T.-Y. Lin et al., “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [42] R. Xiong et al., “On layer normalization in the transformer architecture,” in International Conference on Machine Learning. PMLR, 2020, pp. 10 524–10 533.
- [43] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
- [44] R. Luo, “A better variant of self-critical sequence training,” arXiv preprint arXiv:2003.09971, 2020.
- [45] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137.
- [46] H. Agrawal et al., “Nocaps: Novel object captioning at scale,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8948–8957.
- [47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
- [48] J. C. Hu, R. Cavicchioli, and A. Capotondi, “Exploring the sequence length bottleneck in the transformer for image captioning,” 2022.
- [49] J. Hu, R. Cavicchioli, and A. Capotondi, "A request for clarity over the end of sequence token in the self-critical sequence training," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14233 LNCS, pp. 39–50, 2023.
- [50] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 684–699.
- [51] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 685–10 694.
- [52] X. Zhang et al., “Rstnet: Captioning with adaptive attention on visual and non-visual words,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 465–15 474.
- [53] L. Liu et al., “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
- [54] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Guided open vocabulary image captioning with constrained beam search,” arXiv preprint arXiv:1612.00576, 2016.
- [55] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568.
- [56] P. Zhang et al., “Vinvl: Revisiting visual representations in vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5579–5588.