Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

This work has received funding from the European Union’s Horizon 2020 programme dAIEDGE (G.A. No 101120726).

Jia Cheng Hu (ORCID 0009-0008-1611-966X)
FIM Dept., Univ. of Modena and Reggio Emilia
Modena, Italy
[email protected]

Roberto Cavicchioli (ORCID 0000-0003-0166-0898)
DCE Dept., Univ. of Modena and Reggio Emilia
Reggio Emilia, Italy
[email protected]

Alessandro Capotondi (ORCID 0000-0001-8705-0761)
FIM Dept., Univ. of Modena and Reggio Emilia
Modena, Italy
[email protected]
Abstract

We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence. By doing so, the model can learn more effectively compared to traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, that achieves strong results on the MS-COCO 2014 Image Captioning challenge and the State of the Art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 all-domain CIDEr on the nocaps validation set. Additionally, we introduce an End to End training algorithm that is up to 2.8 times faster than established alternatives.

Index Terms:
Captioning, COCO, Sequence, Expansion

I Introduction

Image Captioning is the problem of describing images without human intervention. It is a challenging multi-modal task that requires both language comprehension and visual understanding. Early approaches relied on statistical and graph-based methods [1, 2], but since the advent of Neural Networks most Image Captioning systems have adopted an encoder-decoder structure [3, 4, 5]. The first component is responsible for extracting visual features from the image, whereas the second generates the description. Early works [4, 3, 5] relied on Convolutional Neural Network (CNN) backbones [6] combined with Recurrent Neural Networks (RNNs) [7, 8] to further refine the visual inputs and to generate text. In contrast, modern Image Captioning systems adopt Attention-based [9, 10] architectures for the sequence modelling part and, in recent works [11, 12, 13], also for the image feature extraction. Currently, fully attentive models are the de facto standard architecture in many NLP and Vision research fields, and their ubiquity has led to many refinements and improvements of the formulation across multiple fields [14, 12, 15, 16, 17, 18, 11, 19, 20]. However, one of the purposes behind the development of the Attention mechanism [9, 21] was to spread the input sequence content along the whole collection of the encoder’s hidden vectors instead of a single state, overcoming a significant performance bottleneck in RNNs. To do so, as the name suggests, the Attention mechanism enhances the values of a few elements and inhibits the others by

Figure 1: The Expansion mechanism distributes the input data into another sequence featuring a different length during the forward phase and performs the reverse operation in the backward pass. In this way, the network is enabled to process the sequence unconstrained by the number of elements.

means of the Softmax function. Recently, many studies [22, 23, 24, 25, 26, 27] have deepened the understanding of the attention approach, suggesting that there is little difference between it and alternative solutions such as Gaussian distributions [25], MLPs [27] and the Fourier Transform [26], and that the effectiveness of these methods depends mainly on their capability to form high-quality compositions out of the input. Motivated by these observations, our work investigates the possibility that the fixed number of elements provided by the input (the sequence length) represents a performance bottleneck for stateless architectures and limits their potential to form higher-quality compositions, in the particular field of Image Captioning. To this end, we propose the Expansion mechanism, a method that distributes and processes the sequence content using an increased or arbitrary number of elements and retrieves the original length in the complementary backward operation. We then introduce ExpansionNet v2 (depicted in Fig. 1), which to our knowledge is the first model that learns to exploit arbitrary sequence lengths in Image Captioning and achieves very competitive results without relying on the Attention’s characteristic function.

The overall contributions of this work are the following: (i) we introduce a new method called the Expansion Mechanism that distributes the input content over an arbitrary or increased number of elements during the forward step, and retrieves the original length in the complementary backward operation. To support both bidirectional and auto-regressive processing, we introduce two methods, called Static Expansion and Dynamic Expansion. Efficiency is addressed in their design and, as a result, the computational impact is negligible for small configurations; (ii) with the aforementioned methods, we design a novel architecture called ExpansionNet v2 that achieves strong results on MS-COCO 2014, outperforming similar models trained on the same dataset; (iii) given the positive results of our architecture, we find that traditional architectures in Image Captioning are indeed penalized by the fixed number of elements provided by the input; (iv) in contrast to the general trend, our model achieves strong results despite the removal of Attention in most components. Finally, we also propose a fast End-to-End training strategy that significantly lowers the training cost of our model compared to popular approaches.

II Related Works

Image Captioning models have benefited greatly from Deep Learning methods. From early systems based on hand-crafted sentences combined with object detection [28, 29], the field moved to modern systems consisting of a neural encoder that extracts meaningful visual representations from the image and a decoder responsible for the description generation. In the early formulations, the decoder consisted of RNNs [8, 7], whereas the encoder consisted of a convolutional backbone [3, 4] that represented the entire image with a single feature vector. It was later replaced by an object detector [5] that extracted a collection of salient regions of the image. This enabled the adoption of sequence modelling architectures in both encoding and decoding [30, 5, 4, 3] on top of the backbones. Most modern Image Captioning systems are currently based on the Transformer architecture [10], and many works have focused on improving its formulation or structure [31, 17, 32, 14, 19, 33, 34]. For example, the work of [17] introduced geometrical awareness in the Self-Attention formulation. [31] modified the attentive layer with a gate that mitigates the contribution of irrelevant queries. [14] exploited bilinear pooling to enable a higher order of interactions across the input elements. Other works such as [18, 35, 12] focused on structural changes and on exploiting the visual input more effectively. Overall, all these methods follow the main components of the formulas introduced in [9, 21, 10]. Our Expansion mechanism is based on the adoption of embedding vectors. The effectiveness of integrating additional learnable parameters in the sequence was first observed in [33] for Machine Translation. Later, in Image Captioning, the concept was also deployed by [36] and [37]. In contrast to these works, our method is the only one that distributes the input into an arbitrary number of hidden vectors.

Another trend consists of pre-training the model on a huge amount of training data and fine-tuning it on the Image Captioning task [38, 16, 39, 40]. In particular, OFA [38] and GIT [16] currently represent the State-of-the-Art Image Captioning systems and outperform non-generative models by a significant margin. However, their model size poses an obstacle to deployment on memory-limited devices, and their training data are tens to hundreds of times larger than the popular MS-COCO 2014 [41]. For this reason, these works are considered orthogonal to ours, which could instead be integrated with them to potentially achieve better performances. In general, we only compare against works that are trained exclusively on MS-COCO 2014; for this reason, the works of [12, 15, 16, 39, 40] are omitted during evaluation since our model does not leverage additional data.

Figure 2: Static Expansion and Auto-regressive Dynamic Expansion scheme and example, assuming an input length of $L=3$. In the Static Expansion setting, an expansion coefficient of $N_E=5$ leads to an expanded sequence of length $5$. In contrast, in the Dynamic Expansion, an expansion coefficient of $N_E=3$ generates an expanded sequence of length $L \cdot N_E = 9$. For the sake of simplicity, the double operation stream, the expansion biases and the gated result combination are omitted in the illustration. The difference between the Auto-regressive Dynamic Expansion and the bidirectional one lies in the Masked Matrix Multiplication.

III Method

III-A Static and Dynamic Expansion

The Expansion mechanism is broken down into several steps. First, it distributes the sequence content into an arbitrary or increased number of elements (Section III-A1) using a “Forward Expansion”, which is described in Section III-A2 and allows the network to process the sequence unconstrained by the fixed input length. Then, it retrieves the original length using the complementary operation, the “Backward Expansion”, described in Section III-A3. Depending on the operations, we define two implementations of the idea: Static Expansion and Dynamic Expansion. The latter is designed to support both auto-regressive and bidirectional processing, in contrast to the former, which only supports the bidirectional case.

III-A1 Expansion coefficient

In both Static and Dynamic Expansion, the expansion coefficient $N_E$ defines a collection of learnable parameters $E_Q, E_B \in \mathbb{R}^{N_E \times d_m}$. However, in the Static Expansion, $N_E$ defines exactly the size of the expanded sequence regardless of the input length $L$; in particular, the expansion queries $Q_E$ and biases $B_E$ are equal to $E_Q$ and $E_B$ respectively. In contrast, in the Dynamic Expansion, the expanded sequence has size $N_E \cdot L$, and the expansion queries $Q_E$ and biases $B_E$ are computed with the BroadSum operator, defined in the two cases as:

$$Q_E = (C^\top \mathbb{H}_E)^\top + (E_Q^\top \mathbb{I}_E)^\top \qquad (1)$$
$$B_E = (C^\top \mathbb{H}_E)^\top + (E_B^\top \mathbb{I}_E)^\top$$

where $C \in \mathbb{R}^{L \times d_m}$ denotes a linear projection of the input and $\mathbb{H}_E \in \mathbb{R}^{L \times (L \cdot N_E)}$ is defined as:

$$\mathbb{H}_E = \begin{bmatrix} \mathds{1} & \mathbf{0} & \dots & \mathbf{0} \\ \mathbf{0} & \mathds{1} & \dots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \dots & \mathds{1} \end{bmatrix}, \qquad \mathds{1}, \mathbf{0} \in \mathbb{R}^{1 \times N_E}$$

whereas $\mathbb{I}_E \in \mathbb{R}^{N_E \times (L \cdot N_E)}$ is defined by the column-wise concatenation of $L$ identity matrices of size $N_E \times N_E$:

$$\mathbb{I}_E = \begin{bmatrix} \textnormal{I}_L & \textnormal{I}_L & \dots & \textnormal{I}_L \end{bmatrix}, \qquad \textnormal{I}_L \in \mathbb{R}^{N_E \times N_E}.$$

An example of the input and output of the BroadSum operation is depicted in the bottom left of Figure 2, where the bias vectors are omitted for simplicity.
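For reference, the BroadSum operator of Eq. (1) can be expressed compactly with tensor operations. The sketch below is one possible PyTorch rendering under our reading of the formulas; the function and variable names (broad_sum, c, e_q) are ours and do not come from the released code.

```python
import torch

def broad_sum(c, e_q):
    """Possible implementation of the BroadSum operator of Eq. (1).

    c   : (L, d_m)   linear projection C of the input sequence
    e_q : (N_E, d_m) learnable expansion queries E_Q
    Returns Q_E of shape (L * N_E, d_m): each input element is repeated
    N_E consecutive times and summed with the tiled expansion queries.
    """
    L, d_m = c.shape
    n_e = e_q.shape[0]
    c_rep = c.repeat_interleave(n_e, dim=0)   # (C^T H_E)^T: repeat each row N_E times
    e_tiled = e_q.repeat(L, 1)                # (E_Q^T I_E)^T: tile the queries L times
    return c_rep + e_tiled

# toy example matching Fig. 2: L = 3 input elements, N_E = 3
c = torch.randn(3, 8)
e_q = torch.randn(3, 8)
print(broad_sum(c, e_q).shape)  # torch.Size([9, 8])
```

The same call with the bias parameters $E_B$ in place of $E_Q$ yields $B_E$.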

III-A2 Forward Expansion

The forward expansion generates the expanded sequences and involves three linear projections of the input, denoted as $K, V_1, V_2 \in \mathbb{R}^{L \times d_m}$. First of all, the “Length Transformation Matrix”, denoted as $M$, is computed as the dot-product similarity between $K$ and the expansion queries $Q_E$:

$$M = \frac{Q_E K^\top}{\sqrt{d_m}}. \qquad (2)$$

The result is fed into the following operations:

$$R^{fw}_i = \Psi(\mathrm{ReLU}((-1)^i M), \epsilon), \qquad i \in \{1, 2\} \qquad (3)$$

where $\Psi : (X, \epsilon) \rightarrow Y$, with $X, Y \in \mathbb{R}^{N_1 \times N_2}$ and $\epsilon \in \mathbb{R}^+ \setminus \{0\}$, is the row-wise normalization function defined as:

$$\Psi(X, \epsilon)_{ij} = \frac{x_{ij}}{\sum_{z=1}^{N_2} x_{iz} + \epsilon} \qquad (4)$$

where the coefficient $\epsilon$ ensures the feasibility of the operation (it avoids a division by zero when a row sums to zero). Then, the expanded sequences are calculated as follows:

$$F^{fw}_i = R^{fw}_i V_i + B_E, \qquad i \in \{1, 2\} \qquad (5)$$
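A minimal PyTorch sketch of Eqs. (2)-(5) could look as follows; tensor names mirror the symbols above, the dual-stream structure ($i \in \{1, 2\}$) is kept explicit, and the code is only an illustration of the formulas, not the released implementation.

```python
import torch
import torch.nn.functional as F

def psi(x, eps=1e-9):
    # Row-wise normalization of Eq. (4): divide each row by its sum plus eps.
    return x / (x.sum(dim=-1, keepdim=True) + eps)

def forward_expansion(q_e, k, v1, v2, b_e, d_m):
    """Sketch of the Forward Expansion, Eqs. (2)-(5).

    q_e, b_e  : (N_exp, d_m) expansion queries / biases (static or BroadSum)
    k, v1, v2 : (L, d_m)     linear projections of the input
    Returns the length transformation matrix M and the two
    expanded sequences F_1, F_2 of shape (N_exp, d_m).
    """
    m = q_e @ k.transpose(-2, -1) / d_m ** 0.5   # Eq. (2)
    r1 = psi(F.relu(-m))                          # Eq. (3), i = 1
    r2 = psi(F.relu(m))                           # Eq. (3), i = 2
    f1 = r1 @ v1 + b_e                            # Eq. (5)
    f2 = r2 @ v2 + b_e
    return m, f1, f2
```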

III-A3 Backward expansion

In the backward step, the original sequence length is retrieved by transposing the length transformation matrix of Equation 2 and applying the same operations as in Equation 3:

$$R^{bw}_i = \Psi(\mathrm{ReLU}((-1)^i M^\top), \epsilon), \qquad i \in \{1, 2\} \qquad (6)$$

This time, the matrices $R^{bw}_i$ are multiplied with the expanded sequences of Equation 5:

$$B^{bw}_i = R^{bw}_i F^{fw}_i, \qquad i \in \{1, 2\} \qquad (7)$$

Finally, the results $B^{bw}_1$ and $B^{bw}_2$ are combined by means of a sigmoid gate:

$$out = \sigma(S) \odot B^{bw}_1 + (1 - \sigma(S)) \odot B^{bw}_2 \qquad (8)$$

where $S \in \mathbb{R}^L$ is a linear projection of the input.
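A complementary sketch of Eqs. (6)-(8), reusing psi and the imports from the forward-expansion sketch above (again an illustration under our naming, not the released code):

```python
import torch
import torch.nn.functional as F

def backward_expansion(m, f1, f2, s, eps=1e-9):
    """Sketch of the Backward Expansion and gating, Eqs. (6)-(8).

    m      : (N_exp, L)   length transformation matrix from Eq. (2)
    f1, f2 : (N_exp, d_m) expanded sequences from Eq. (5)
    s      : (L, 1)       linear projection of the input, used as gate logit
    Returns the output sequence, back to the original length L.
    """
    mt = m.transpose(-2, -1)
    r1 = psi(F.relu(-mt), eps)          # Eq. (6), i = 1 (psi from the sketch above)
    r2 = psi(F.relu(mt), eps)           # Eq. (6), i = 2
    b1 = r1 @ f1                        # Eq. (7)
    b2 = r2 @ f2
    gate = torch.sigmoid(s)             # Eq. (8): element-wise sigmoid gate
    return gate * b1 + (1.0 - gate) * b2
```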

Figure 3: ExpansionNet v2 architecture.

The backward operation completes the computation performed in the Static and Dynamic Expansion. It can be noted that all operations in the forward (3), (5) and backward (6), (7) expansion are duplicated in two operation streams, for i=1 and i=2, differing mainly in the sign used in the computation of the Length Transformation Matrix in (2). This decision was made to mitigate the remote possibility of one of the matrices being populated only by zeros. It does not affect the results compared to a single path but slightly increases the computational cost.

In the case of Dynamic Expansion, masking is applied when calculating the results in (5) and (7) to preserve the auto-regressive property. The operating principle of the Static and Dynamic Expansion is illustrated in Fig. 2, which, for simplicity, depicts only a single operation stream and omits the biases and the output sigmoid gate.

III-A4 Block Static Expansion

To increase the effectiveness of the Static Expansion, we perform the Forward and Backward operations on a collection of target lengths instead of a single one. We call this operation Block Static Expansion. From a formulation perspective, all operations are repeated over a group of expansion coefficients $G = \{N^1_E, N^2_E, \ldots, N^{N_G}_E\}$ and can be implemented in such a way that both forward and backward steps are performed over all targets at the same time. All expansion group queries and biases can be combined into a single one:

$$E^G_Q = \{(E^1_Q)^\top, (E^2_Q)^\top, \ldots, (E^{N_G}_Q)^\top\}^\top \qquad (9)$$
$$E^G_B = \{(E^1_B)^\top, (E^2_B)^\top, \ldots, (E^{N_G}_B)^\top\}^\top$$

and the computational efficiency of the previous formulation is preserved. During the backward stage, the length transformation matrix is scaled by the inverse of the number of elements in the group $G$.
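In practice, the group of coefficients can be handled by simply stacking the per-coefficient parameters along the expansion axis, as in the short sketch below (one possible reading of Eq. (9); class and attribute names are ours, and initialization details are illustrative):

```python
import torch

class BlockStaticExpansionParams(torch.nn.Module):
    """Sketch: learnable queries/biases for a group of expansion coefficients.

    For G = {32, 64, 128, 256, 512}, the per-coefficient queries and biases
    are concatenated into single (sum(G), d_m) tensors, so the forward and
    backward expansions over every target length run in one pass.
    """

    def __init__(self, group, d_m):
        super().__init__()
        self.group = tuple(group)
        self.e_q = torch.nn.Parameter(torch.randn(sum(group), d_m) * 0.02)
        self.e_b = torch.nn.Parameter(torch.randn(sum(group), d_m) * 0.02)
        # the backward length transformation matrix is scaled by 1 / |G|
        self.backward_scale = 1.0 / len(self.group)
```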

III-B Architecture

Our model consists of the standard encoder-decoder structure implemented on top of the Swin-Transformer, whose details are provided in [13]. The image $A$ is first fed into the backbone:

$$X_0 = \textnormal{Swin-Transf}(A) \qquad (10)$$

which generates the initial set of processed visual features $X_0 = \{x^0_1, x^0_2, \ldots, x^0_N\}$, $x^0_i \in \mathbb{R}^{d_m}$. The result is fed into the encoder, which is made of $N_{enc}$ Static Expansion $\rightarrow$ FeedForward blocks. Skip connections and pre-layer normalization [42] are adopted, and the following formulas describe each encoder layer for $n \in \{1, \ldots, N_{enc}\}$:

$$E_n = X_{n-1} + \textnormal{StaticExp}_n(\textnormal{Norm}^{SE}_n(X_{n-1})) \qquad (11)$$
$$X_n = E_n + \textnormal{FF}_n(\textnormal{Norm}^{FF}_n(E_n))$$

Similarly, given a generic input sequence $Y_0 = \{y^0_1, y^0_2, \ldots, y^0_M\}$, $y^0_i \in \mathbb{R}^{d_m}$ (at training time, so the time axis can be omitted), the decoder is made of $N_{dec}$ Dynamic Expansion $\rightarrow$ Cross-Attention $\rightarrow$ FeedForward blocks, where skip connections and normalization are applied to each component. Each decoder layer is described by the following equations:

$$B_n = Y_{n-1} + \textnormal{DynamicExp}_n(\textnormal{Norm}^{DE}_n(Y_{n-1})) \qquad (12)$$
$$W_n = B_n + \textnormal{Attention}_n(\textnormal{Norm}^{CA}_n(B_n), X_{N_{enc}})$$
$$Y_n = W_n + \textnormal{FF}_n(\textnormal{Norm}^{FF}_n(W_n))$$

The outputs of all layers are summed through a linear projection and the final output is fed to the classification layer. Fig. 3 depicts the main structure.
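For clarity, the pre-norm residual structure of Eqs. (11)-(12) can be summarized by the following pseudo-layers, a simplified sketch in which static_exp, dynamic_exp, cross_attn, ff and the norm arguments stand for the corresponding sub-modules; dropout and other implementation details are omitted, and the names are ours.

```python
def encoder_layer(x, static_exp, ff, norm_se, norm_ff):
    # Eq. (11): pre-layer normalization with residual connections
    e = x + static_exp(norm_se(x))
    return e + ff(norm_ff(e))

def decoder_layer(y, enc_out, dynamic_exp, cross_attn, ff,
                  norm_de, norm_ca, norm_ff):
    # Eq. (12): Dynamic Expansion -> Cross-Attention -> FeedForward
    b = y + dynamic_exp(norm_de(y))
    w = b + cross_attn(norm_ca(b), enc_out)
    return w + ff(norm_ff(w))
```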

TABLE I: Ablation study in the first stage of Cross-Entropy training using beam size 3 over the Karpathy validation split. B=BLEU. M=METEOR. R=ROUGE. C=CIDEr-D. S=SPICE.
Encoder Decoder B1 B2 B3 B4 M R C S
Baseline Baseline 75.3 59.2 45.4 34.6 28.4 57.0 115.8 21.6
Stc. Exp. G={16} Baseline 76.4 60.6 46.6 35.5 28.6 57.2 117.8 21.9
Stc. Exp. G={32} Baseline 75.9 59.9 46.1 35.2 28.9 57.1 117.9 22.3
Stc. Exp. G={64} Baseline 76.3 60.4 46.4 35.5 28.8 57.1 117.7 22.0
Baseline Dyn. Exp. N_E=4 77.2 61.4 47.4 36.2 28.9 57.7 119.7 22.3
Baseline Dyn. Exp. N_E=8 76.9 61.5 47.9 37.1 29.1 57.8 120.8 22.3
Baseline Dyn. Exp. N_E=16 76.7 61.4 47.8 36.8 29.0 57.7 121.2 22.2
Stc. Exp. G={64} Dyn. Exp. N_E=16 77.4 61.9 48.2 37.3 29.2 58.0 122.2 22.3
Stc. Exp. G={128,128,128,128,128} Dyn. Exp. N_E=16 77.8 62.3 48.3 37.2 29.3 58.3 122.8 22.5
Stc. Exp. G={256,256,256,256,256} Dyn. Exp. N_E=16 77.4 62.0 48.2 37.2 29.2 58.0 122.5 22.2
Stc. Exp. G={512,512,512,512,512} Dyn. Exp. N_E=16 77.3 61.7 47.9 37.0 29.3 58.0 122.7 22.4
Stc. Exp. G={32,64,128,256,512} Dyn. Exp. N_E=16 77.6 62.0 48.2 37.2 29.4 58.1 123.5 22.5

III-C Training objectives

The model is first pre-trained using the Cross-Entropy loss $L_{XE}$:

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log\big(p_\theta(y^*_t \,|\, y^*_{1:t-1}, I)\big) \qquad (13)$$

where $p_\theta(y^*_t | y^*_{1:t-1}, I)$ is the probability assigned by the model with parameters $\theta$ to the target word $y^*_t$ given the image $I$ and the previous words $y^*_{1:t-1}$. Additionally, the CIDEr-D score is optimized using SCST [43], which minimizes the negative expected reward $L_R(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}[r(y_{1:T})]$, whose gradient can be approximated as follows:

$$\nabla_\theta L_R(\theta) \approx -(r(y^s_{1:T}) - b)\, \nabla_\theta \log p_\theta(y^s_{1:T}) \qquad (14)$$

where $b$ is the baseline computed according to [44] and $r(y^s_{1:T})$ is the CIDEr-D reward assigned to the sampled sequence $y^s_{1:T}$.

Although we optimize the model with two loss functions, the training stage for each of them is further split into two steps so that a broader range of computational resources can reproduce this work.
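The two objectives translate into standard loss computations; the sketch below shows how Eq. (13) and the surrogate of Eq. (14) are typically written in PyTorch. Reward and baseline computation are left abstract, and the function and variable names are ours, not taken from the released training script.

```python
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    # Eq. (13): token-level cross-entropy against the ground-truth caption.
    # logits: (B, T, V), targets: (B, T); padding positions are ignored.
    return F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)

def scst_loss(sample_logprobs, sample_reward, baseline_reward):
    # Eq. (14): policy-gradient surrogate with reward r(y^s) and baseline b.
    # sample_logprobs: (B, T) log-probabilities of the sampled caption,
    # sample_reward, baseline_reward: (B,) CIDEr-D scores.
    advantage = (sample_reward - baseline_reward).unsqueeze(-1)
    return -(advantage * sample_logprobs).mean()
```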

IV Results

IV-A Experimental Setup

IV-A1 Dataset

The training dataset consists of the popular MS-COCO benchmark [41] split according to [45], resulting in 113,287 images for training, 5,000 for validation, and 5,000 for testing. Each reference caption is pre-processed by a simple pipeline consisting of lowercasing, removing punctuation, and filtering out words that do not occur at least 5 times (resulting in a vocabulary of size 10,000). Additionally, the final model is evaluated on the validation set of the Novel Object Captioning at Scale (nocaps) dataset [46], which consists of three classes of images, called in-domain, near-domain, and out-of-domain, according to how familiar their object classes are with respect to those contained in the training set. This dataset is subject to the same pre-processing as MS-COCO and serves the purpose of further challenging the model in unfavourable conditions.

IV-A2 Model details

Two models are implemented for the experimental setup. The baseline, which is the Base Transformer, and our main model, referred to as “ExpansionNet v2”, are implemented with the following configuration: $d_m$=512, $d_{ff}$=2048, $N_{enc}$=$N_{dec}$=3. In the latter, the Dynamic Expansion coefficient is set to 16 and the Static Expansion coefficients consist of $G$={32, 64, 128, 256, 512} (more details in Section IV-B). Both rely on the same backbone, the Swin-Transformer in the Large configuration [13] pre-trained on ImageNet [47]. All images undergo a minimal pre-processing: first, they are resized into a 3×384×384 tensor, then the RGB values are converted into the [0, 1] range and further normalized using mean=(0.485, 0.456, 0.406) and std=(0.229, 0.224, 0.225). The source code of the experiments is available at https://github.com/jchenghu/ExpansionNet_v2.
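The image pre-processing described above corresponds to a short torchvision pipeline, sketched below under the stated values (384×384 resize and the listed normalization statistics); the example file name is illustrative.

```python
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((384, 384)),                 # 3 x 384 x 384 tensor
    transforms.ToTensor(),                         # RGB values scaled to [0, 1]
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

image = preprocess(Image.open("example.jpg").convert("RGB"))
```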

IV-A3 Training algorithm

It can be observed that the Swin-Transformer backbone is the most computationally expensive part of the system. For this reason, inspired by [48] and in order to make End to End training accessible to a broader range of computational architectures, our training is divided into four steps: each phase (both the cross-entropy training and the reinforcement stage) consists of an initial training step in which the backbone’s weights are frozen, followed by a fine-tuning step during which gradients flow throughout the whole system:

Step A) Cross-Entropy – Frozen backbone. The model is trained for 8 epochs using batch size 48 and an initial learning rate of 2e-4 with a warmup of 10,000 iterations, annealed by 0.8 every 2 epochs;

Step B) Cross-Entropy – End to End. The whole system is trained for 2 additional epochs, using batch size 48 and an initial learning rate of 3e-5 annealed by 0.55 every epoch;

Step C) CIDEr-D optimization – Frozen backbone. The reinforcement phase adopts a batch size of 48 and an initial learning rate of 1e-4, with no warmup, annealed by 0.8 every epoch for 9 epochs;

Step D) CIDEr-D optimization – End to End. The whole system is fine-tuned for a few more iterations, up to one additional epoch, using a batch size of 20 and a fixed learning rate of 2e-6. This step is optional since it only slightly contributes to the final performances and can be skipped if no improvements are observed. All CIDEr-D optimization steps are implemented according to the Standard configuration (SacreEOS signature [49]: STANDARD_wInit+Cider-D[n4,s6.0]+average[nspi5]+1.0.0).
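As an illustration, the schedule of Step A (linear warmup followed by step-wise annealing) can be expressed with a standard LambdaLR factor; the sketch below is our own rendering of the description, not the exact released schedule, and the iterations-per-epoch value is only indicative (roughly 113,287 images / batch size 48).

```python
import torch

def step_a_lr_lambda(it, warmup_iters=10000, iters_per_epoch=2360,
                     base_decay=0.8, decay_every_epochs=2):
    # Linear warmup to the initial learning rate (2e-4), then the rate is
    # multiplied by 0.8 every 2 epochs, as described for Step A.
    if it < warmup_iters:
        return (it + 1) / warmup_iters
    epoch = it // iters_per_epoch
    return base_decay ** (epoch // decay_every_epochs)

# usage sketch (optimizer created elsewhere with lr=2e-4):
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, step_a_lr_lambda)
# scheduler.step() is then called once per training iteration
```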

TABLE II: Offline comparison of State-of-the-Art single models over the Karpathy test split. B=BLEU. M=METEOR. R=ROUGE. C=CIDEr-D. S=SPICE.
Cross-Entropy CIDEr-D optimization
Model B1 B4 M R C S B1 B4 M R C S
Up-Down [5] 77.2 36.2 27.0 56.4 113.5 20.3 79.8 36.3 27.7 56.9 120.1 21.4
GCN-LSTM [50] 77.3 36.8 27.9 57.0 116.3 20.9 80.5 38.2 28.5 58.3 127.6 22.0
SGAE [51] - - - - - - 80.8 38.4 28.4 58.6 127.8 22.1
AoANet [31] 77.4 37.2 28.4 57.5 119.8 21.3 80.2 38.9 29.2 58.8 129.8 22.4
X-Transformer [14] 77.3 37.0 28.7 57.5 120.0 21.8 80.9 39.7 29.5 59.1 132.8 23.4
GET [35] - - - - - - 81.5 39.5 29.3 58.9 131.6 22.8
DLCT [18] - - - - - - 81.4 39.8 29.5 59.1 133.8 23.0
RSTNet [52] - - - - - - 81.8 40.1 29.8 59.5 135.6 23.3
PureT [11] - - - - - - 82.1 40.9 30.2 60.1 138.2 24.2
ExpansionNet v2 78.1 38.1 30.1 58.9 128.2 23.5 82.8 41.5 30.3 60.5 140.4 24.5
TABLE III: Offline comparison of State-of-the-Art ensemble models over the Karpathy test split. B=BLEU. M=Meteor. R=Rouge. C=CIDEr-D. S=SPICE.
Cross-Entropy CIDEr-D optimization
Model B1 B4 M R C S B1 B4 M R C S
GCN-LSTM [50] 77.4 37.1 28.1 57.2 117.1 21.1 80.9 38.3 28.6 58.5 128.7 22.1
SGAE [51] - - - - - - 81.0 39.0 28.4 58.9 129.1 22.2
AoANet [31] 78.7 38.1 28.5 58.2 122.7 21.7 81.6 40.2 29.3 59.4 132.0 22.8
X-Transformer [14] 77.8 37.7 29.0 58.0 122.1 21.9 81.7 40.7 29.9 59.7 135.3 23.8
GET [35] - - - - - - 82.1 40.6 29.8 59.6 135.1 23.8
DLCT [18] - - - - - - 82.2 40.8 29.9 59.8 137.5 23.3
PureT [11] - - - - - - 83.4 42.1 30.4 60.8 141.0 24.3
ExpansionNet v2 78.5 38.5 29.9 58.8 128.7 23.6 83.5 42.7 30.6 61.1 143.7 24.7

Despite its apparent complexity, this scheme is much more computationally friendly than the standard method consisting of a small batch size of 10 for 30 epochs in both optimization stages. As a matter of fact, only a much smaller number of training epochs is dedicated to fine-tuning the whole system. Thus, the cost of computing the backbone’s gradients is avoided most of the time, and the cost of its forward pass can be drastically reduced as well. In particular, in our implementation, during steps A and C the backbone’s forward pass is performed only once for each image in the data set; therefore, its cost is replaced by a memory read and copy. All steps are trained using the RAdam optimizer [53] ($\beta_1$=0.9, $\beta_2$=0.98).
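One possible way to realize the frozen-backbone phases is sketched below: the backbone’s parameters are frozen and its outputs are computed once per image and cached, so later epochs only read the stored features. This is illustrative code under our own names and data-loader convention (batches of (image_ids, images)), not the released training script.

```python
import torch

def freeze(module):
    # Steps A and C: no gradients flow through the backbone.
    for p in module.parameters():
        p.requires_grad = False

@torch.no_grad()
def cache_backbone_features(backbone, dataloader, device="cuda"):
    # Run the backbone forward pass once per image and store the result,
    # replacing its per-epoch cost with a memory read and copy.
    backbone.eval()
    cache = {}
    for image_ids, images in dataloader:
        feats = backbone(images.to(device))
        for img_id, f in zip(image_ids, feats):
            cache[int(img_id)] = f.cpu()
    return cache
```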

IV-B Ablation Study

To study the effectiveness of our method, we replace the encoder and decoder of the baseline with our methods and evaluate several settings of expansion coefficients. It can be observed from Table I that the impact of the Static Expansion layer in the single-group configuration is limited: it only slightly improves the baseline, regardless of the choice of $N_E$. Conversely, the Dynamic Expansion layer showcases a more significant improvement, obtaining the best result for $N_E$=16. When the two expansion methods are combined, the model outperforms the baseline across all metrics with a margin of at least 6.0 CIDEr-D, 2.0 BLEU, 0.5 SPICE, 1.0 ROUGE, and 0.8 METEOR. Analyzing several configurations of length groups in the Static Expansion, it appears that introducing more expansion vectors does not necessarily lead to better performances, since for G={128, 128, 128, 128, 128}, G={256, 256, 256, 256, 256} and G={512, 512, 512, 512, 512} the model yields similar results. However, the model seems to benefit from a diverse selection of coefficients, as in the case of G={32, 64, 128, 256, 512}, which is adopted in the remaining experiments. Ultimately, all instances outperform the baseline across all metrics.

TABLE IV: Online server results on the MS-COCO 2014 test set, whose ground truth is unknown. B=BLEU. M=METEOR. R=ROUGE. C=CIDEr-D. S=SPICE.
B1 B2 B3 B4 METEOR ROUGE-L CIDEr-D
Model c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40 c5 c40
Up-Down [5] 80.2 95.2 64.1 88.8 49.1 79.4 36.9 68.5 27.6 36.7 57.1 72.4 117.9 120.5
GCN-LSTM [50] - - 65.5 89.3 50.8 80.3 38.7 69.7 28.5 37.6 58.5 73.4 125.3 126.5
SGAE [51] 81.0 95.3 65.6 89.5 50.7 80.4 38.5 69.7 28.2 37.2 58.6 73.6 123.8 126.5
AoANet [31] 81.0 95.0 65.8 89.6 51.4 81.3 39.4 71.2 29.1 38.5 58.9 74.5 126.9 129.6
X-Transformer [14] 81.9 95.7 66.9 90.5 52.4 82.5 40.3 72.4 29.6 39.2 59.5 75.0 131.1 133.5
RSTNet [52] 82.1 96.4 67.0 91.3 52.2 83.0 40.0 73.1 29.6 39.1 59.5 74.6 131.9 134.0
GET [35] 81.6 96.1 66.5 90.9 51.9 82.8 39.7 72.9 29.4 38.8 59.1 74.4 130.3 132.5
DLCT [18] 82.4 96.6 67.4 91.7 52.8 83.8 40.6 74.0 29.8 39.6 59.8 75.3 133.3 135.4
PureT [11] 82.8 96.5 68.1 91.8 53.6 83.9 41.4 74.1 30.1 39.9 60.4 75.9 136.0 138.3
OFA [15] 84.5 98.1 70.1 94.4 55.9 87.8 43.6 78.7 32.1 42.7 62.5 79.0 147.2 149.6
GIT [16] 84.3 98.1 70.0 94.4 55.7 87.6 43.2 78.3 31.9 42.1 62.0 78.4 146.4 149.8
ExpansionNet v2 83.3 96.9 68.8 92.6 54.4 85.0 42.1 75.3 30.4 40.1 60.8 76.4 138.5 140.8

IV-C Performance Comparison

IV-C1 COCO Offline Evaluation

Table II and Table III report the score comparison between ExpansionNet v2 and the best-performing models of recent years. Up-Down [5] introduced the idea of extracting a collection of features from the images using an object detector such as Faster-RCNN [6], in contrast to a classification backbone [4]. The idea was adopted in most of the following architectures as well, for instance in GCN-LSTM [50] and SGAE [51], which additionally implemented a graph convolutional network on top of it to exploit the information provided by a scene graph. AoANet [31] adopted the Transformer, improved the attentive components with two gates simulating an additional level of attention over the inputs, and augmented the language modelling part with an LSTM. On the other hand, X-Transformer [14] adopted a fully attentive architecture and further refined the attentive blocks by means of bilinear pooling techniques. The most recent and best-performing architectures focused on more effective ways to feed visual information into the sequence modelling network. For instance, RSTNet [52] showcased the effectiveness of grid features over regions, GET [35] processed the images using a global representation in conjunction with the local ones, and DLCT [18] exploited the advantages of both region and grid visual features. Finally, PureT [11] implemented the first end-to-end Transformer architecture, applying the Window / Shifted-Window MHA [13] in both the encoder and decoder. ExpansionNet v2 outperforms PureT by a margin of 0.7 BLEU1, 0.6 BLEU4, 0.1 METEOR, 0.4 ROUGE, 2.2 CIDEr-D and 0.3 SPICE in the single-model case, and by 0.1 BLEU1, 0.6 BLEU4, 0.3 ROUGE, 2.7 CIDEr-D and 0.4 SPICE in the ensemble configuration.

IV-C2 COCO Online Evaluation

We evaluate ExpansionNet v2 in the ensemble configuration, adopting the standard Beam Search (beam size 5) on the official testing set of 40,775 images and submitting the predictions to the online testing server. Results are reported in Table IV. The c5 and c40 columns report the scores computed with 5 and 40 reference captions (unknown to the user), respectively.

Our model achieves State-of-the-Art performance (as of 2 July 2022) among non-generative models trained on MS-COCO 2014, outperforming the previous best [11] by a margin of 1.2 BLEU4 (c40), 0.2 METEOR (c40), 0.5 ROUGE-L (c40) and 2.5 CIDEr-D in both the c5 and c40 instances. However, it is ultimately outperformed, by a significant margin, by generative models [15, 16], which we consider orthogonal to our work since they focus more on the training method and data quality than on architecture design.

IV-C3 Nocaps Evaluation

We evaluate ExpansionNet v2 on the nocaps validation set. In particular, we adopt a single model trained with the Cross-Entropy loss only, using no additional pre-training data. The predictions are generated by the standard Beam Search algorithm (beam size 3) instead of CBS [54]. A limited comparison is reported in Table V, which showcases that our model achieves very competitive results among the architectures trained in similar configurations, with an overall lead of 17.6 CIDEr and 1.4 SPICE over the Up-Down model [5]. It is still ultimately outperformed by recent V+L pre-training based works

TABLE V: Performances on nocaps validation set. C and S denote the CIDEr-D and SPICE scores respectively.
Domain Metric Enc-Dec[55] Up-Down[5] Ours
In C 72.8 78.1 83.8
S 11.1 11.6 12.6
Near C 57.1 57.7 79.2
S 10.2 10.3 12.4
Out C 34.1 31.3 54.0
S 8.3 8.3 9.3
All C 54.7 55.3 72.9
S 10.0 10.1 11.4

such as [39, 56, 40].

IV-D Training and Inference Cost

Efficiency was addressed in the design of the Expansion mechanism. For instance, it can be observed from Table VI that doubling the expansion coefficients does not double the FLOPS, which would be the case if the input sequence length were actually doubled. In particular, for small parameters, our model is comparable to the Transformer in terms of computational cost. In contrast, ExpansionNet v2 is 1.63× slower than the baseline because of its abundant selection of expansion coefficients.

In Table VII, we compare our training time with the ones reported by other works. In particular, the time entries are estimated assuming all models have the same computational cost as ExpansionNet v2, which is a generous approximation in the case of generative models, whose sizes are tens of times larger. Despite this premise and the fact that we also perform end-to-end training, it can be seen that our model can be trained up to 2.8× faster than other non-generative models and up to 46.8× faster than generative ones. Recalling the results in Table IV, performance-wise our model achieves 93.9% of the performance of the State-of-the-Art model GIT [16] while using 7080× less data and being 129× smaller.

TABLE VI: Inference cost comparison of ablation models on the MS-COCO 2014 validation set (5000 images).
Encoder Decoder FLOPS
Baseline Baseline 9.28×10^12
Stc. Exp. G={16} Baseline 9.62×10^12
Stc. Exp. G={32} Baseline 9.70×10^12
Stc. Exp. G={64} Baseline 9.88×10^12
Baseline Dyn. Exp. N_E=4 9.40×10^12
Baseline Dyn. Exp. N_E=8 9.43×10^12
Baseline Dyn. Exp. N_E=16 9.48×10^12
Stc. Exp. G={64} Dyn. Exp. N_E=16 10.08×10^12
Stc. Exp. G={128}×5 Dyn. Exp. N_E=16 13.26×10^12
Stc. Exp. G={256}×5 Dyn. Exp. N_E=16 16.80×10^12
Stc. Exp. G={512}×5 Dyn. Exp. N_E=16 23.88×10^12
ExpansionNet v2 ExpansionNet v2 15.21×10^12
TABLE VII: Training time comparison of State-of-the-Art works against our solution. “Time” represents the estimated time required to train the models on a single NVIDIA A100 using the described strategy. γ, θ and σ denote the number of parameters, the number of training images and the training cost, normalized with respect to our proposal. The “★” symbol denotes generative models, typically pre-trained on multiple tasks and on images from various sources; we simplify the matter by using the cost of Cross-Entropy training on MS-COCO 2014, and the downstream task learning cost is ignored since it is negligible compared to the pre-training phase.
Source | Params. (γ) | Datasets → total num. images (θ) | Training Description | Train. time (σ)
Obj. Transf. [17], AoANet [31], PureT [11] | 33M (0.86), 87M (2.28), 34M (0.89) | MS-COCO 2014 → 113K (1.00) | Cross-Entropy: ~30 epochs and batch size 10. Reinforcement: 30 epochs and batch size 10. | 7 days (2.80)
X-Transformer [14] | 141M (3.71) | MS-COCO 2014 → 113K (1.00) | Cross-Entropy: 70 epochs and batch size 40. Reinforcement: 35 epochs and batch size 32. | 5 days (2.00)
GIT [16] ★ | 4.9B (128.94) | MS-COCO, CC3M, CC12M, VG, SBU, ALT200M + 0.6B → 0.8B (7079.64) | 2 epochs; batch size 48 assumed in the estimation. | 117 days (46.80)
OFA [15] ★ | 871M (22.92) | MS-COCO, CC3M, CC12M, VG, SBU → 15M (132.74) | 40 epochs; batch size 48 assumed in the estimation. | 44 days (17.60)
ExpansionNet v2 | 38M (1.00) | MS-COCO 2014 → 113K (1.00) | See Section IV-A. | 2.5 days (1.00)
TABLE VIII: Examples of captions.
Image Captions
[Uncaptioned image]
Baseline: A man holding a tennis ball on a tennis court.
ExpansionNet v2: A man jumping in the air to hit a tennis ball.
Gt: {A tennis player jumps and swats at the ball.; A tennis player hitting a tennis ball on a court.; Professional tennis player immediately after returning a shot.}
[Uncaptioned image]
Baseline: A little girl brushing her hair with a table.
ExpansionNet v2: A little girl brushing her hair with a pink brush.
Gt: {A young girl tries to comb her own hair.; A young child brushing her hair with a big pink brush.; A young girl is trying to brush her hair with a pink brush.}

IV-E Qualitative Analysis

Table VIII provides some examples of captions. Regardless of the image complexity, ExpansionNet v2 is not only able to correctly describe the subjects depicted in the scenes but also showcases a good level of semantic understanding by describing goals and interactions. Unfortunately, our model seems to struggle with out-of-domain objects, as showcased in Table IX where, due to objects and terms unknown to the model, predictions are either imprecise (second image) or incorrect (first image). Nonetheless, it appears to provide a roughly correct description of the image. We showcase an example of attention visualization in Fig. 4, where the scattered focus correctly outlines the main subjects despite the absence of an object detector.

V Conclusion

In this work, we addressed the question of whether the fixed number of input elements represents a performance bottleneck in modern image-captioning systems. To this end, we presented the idea of an Expansion mechanism and provided two concrete implementations, called Static Expansion and Dynamic Expansion, that process the input using sequences whose length differs from the one provided in the input. Upon these layers, we designed a new architecture called ExpansionNet v2 and trained it on the MS-COCO 2014 dataset using a fast End to End training approach. Extensive experiments conducted on the test set showcase that our method achieves better performances than the baseline, which answers positively the initial research question of whether the input length can represent a bottleneck to sequence processing. Additionally, ExpansionNet v2 achieved strong performances on both the offline (143.7 CIDEr-D) and online (140.8 CIDEr-D) test splits and is outperformed mainly by V+L pre-training models, which we consider orthogonal to our work due to the differences in model size and additional training data. Future works will further develop the methods and ideas presented here, motivated by the fact that they can be easily integrated into other solution approaches (such as V+L pre-training) and other research fields.

TABLE IX: Examples on nocaps out-of-domain images.
Image Captions
[Uncaptioned image] ExpansionNet v2: A close-up of a fish in a body of water.
3 gts: { A seahorse in an aquarium full of water with some plants growing in the background.; A blue seahorse is swimming near sea plants on back.; A very small seahorse is in the water along with other pieces. }
[Uncaptioned image] ExpansionNet v2: Three pictures of a blender with red liquid in it.
3 gts: { A picture of three blenders with a strawberry looking beverage inside.; A white mixer in the process of making a smoothie.; The steps of making a smoothie in a blender are shown. }
Figure 4: Attention visualization of a single decoder head in ExpansionNet v2.

References

  • [1] M. Mitchell et al., “Midge: Generating image descriptions from computer vision detections,” in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 747–756.
  • [2] G. Kulkarni et al., “Babytalk: Understanding and generating simple image descriptions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
  • [3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
  • [4] K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
  • [5] P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
  • [6] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
  • [7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [8] K. Cho et al., “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [9] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [10] A. Vaswani et al., “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [11] Y. Wang, J. Xu, and Y. Sun, “End-to-end transformer based model for image captioning,” arXiv preprint arXiv:2203.15350, 2022.
  • [12] V.-Q. Nguyen, M. Suganuma, and T. Okatani, “Grit: Faster and better image captioning transformer using dual visual features,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI.   Springer, 2022, pp. 167–184.
  • [13] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [14] Y. Pan, T. Yao, Y. Li, and T. Mei, “X-linear attention networks for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 971–10 980.
  • [15] P. Wang et al., “Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in International Conference on Machine Learning.   PMLR, 2022, pp. 23 318–23 340.
  • [16] J. Wang et al., “Git: A generative image-to-text transformer for vision and language,” arXiv preprint arXiv:2205.14100, 2022.
  • [17] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” arXiv preprint arXiv:1906.05963, 2019.
  • [18] Y. Luo et al., “Dual-level collaborative transformer for image captioning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2286–2293.
  • [19] Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang, “Star-transformer,” arXiv preprint arXiv:1902.09113, 2019.
  • [20] J. Hao, X. Wang, B. Yang, L. Wang, J. Zhang, and Z. Tu, “Modeling recurrence for transformer,” arXiv preprint arXiv:1904.03092, 2019.
  • [21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
  • [22] A. Raganato, Y. Scherrer, and J. Tiedemann, “Fixed encoder self-attention patterns in transformer-based machine translation,” arXiv preprint arXiv:2002.10260, 2020.
  • [23] Y. Tay, D. Bahri, D. Metzler, D.-C. Juan, Z. Zhao, and C. Zheng, “Synthesizer: Rethinking self-attention for transformer models,” in International conference on machine learning.   PMLR, 2021, pp. 10 183–10 192.
  • [24] H. Ramsauer et al., “Hopfield networks is all you need,” arXiv preprint arXiv:2008.02217, 2020.
  • [25] W. You, S. Sun, and M. Iyyer, “Hard-coded gaussian attention for neural machine translation,” arXiv preprint arXiv:2005.00742, 2020.
  • [26] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon, “Fnet: Mixing tokens with fourier transforms,” arXiv preprint arXiv:2105.03824, 2021.
  • [27] I. O. Tolstikhin et al., “Mlp-mixer: An all-mlp architecture for vision,” Advances in neural information processing systems, vol. 34, pp. 24 261–24 272, 2021.
  • [28] R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.   IEEE, 2010, pp. 966–973.
  • [29] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu, “I2t: Image parsing to text description,” Proceedings of the IEEE, vol. 98, no. 8, pp. 1485–1508, 2010.
  • [30] L. Wang, Z. Bai, Y. Zhang, and H. Lu, “Show, recall, and tell: Image captioning with recall mechanism,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 176–12 183.
  • [31] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention for image captioning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4634–4643.
  • [32] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” arXiv preprint arXiv:1805.07932, 2018.
  • [33] S. Sukhbaatar, E. Grave, G. Lample, H. Jegou, and A. Joulin, “Augmenting self-attention with persistent memory,” arXiv preprint arXiv:1907.01470, 2019.
  • [34] A. Gulati et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • [35] J. Ji et al., “Improving image captioning by leveraging intra-and inter-layer global representation in transformer network,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 2, 2021, pp. 1655–1663.
  • [36] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 578–10 587.
  • [37] P. Zeng, H. Zhang, J. Song, and L. Gao, “S2 transformer for image captioning,” in Proceedings of the International Joint Conferences on Artificial Intelligence, vol. 5, 2022.
  • [38] P. Wang et al., “Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” arXiv preprint arXiv:2202.03052, 2022.
  • [39] X. Hu et al., “Scaling up vision-language pre-training for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 980–17 989.
  • [40] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” arXiv preprint arXiv:2201.12086, 2022.
  • [41] T.-Y. Lin et al., “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [42] R. Xiong et al., “On layer normalization in the transformer architecture,” in International Conference on Machine Learning.   PMLR, 2020, pp. 10 524–10 533.
  • [43] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
  • [44] R. Luo, “A better variant of self-critical sequence training,” arXiv preprint arXiv:2003.09971, 2020.
  • [45] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137.
  • [46] H. Agrawal et al., “Nocaps: Novel object captioning at scale,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8948–8957.
  • [47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [48] J. C. Hu, R. Cavicchioli, and A. Capotondi, “Exploring the sequence length bottleneck in the transformer for image captioning,” 2022.
  • [49] J. Hu, R. Cavicchioli, and A. Capotondi, “A request for clarity over the end of sequence token in the self-critical sequence training,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14233 LNCS, pp. 39–50, 2023.
  • [50] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 684–699.
  • [51] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 685–10 694.
  • [52] X. Zhang et al., “Rstnet: Captioning with adaptive attention on visual and non-visual words,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 465–15 474.
  • [53] L. Liu et al., “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
  • [54] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Guided open vocabulary image captioning with constrained beam search,” arXiv preprint arXiv:1612.00576, 2016.
  • [55] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3558–3568.
  • [56] P. Zhang et al., “Vinvl: Revisiting visual representations in vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5579–5588.