
Off-Policy Self-Critical Training for Transformer in Visual Paragraph Generation

Shiyang Yan, Yang Hua, Neil M. Robertson
[email protected]
Abstract

Recently, several approaches have been proposed to solve language generation problems. Transformer is currently the state-of-the-art seq-to-seq model in language generation. Reinforcement Learning (RL) is useful for addressing exposure bias and for optimising non-differentiable metrics in seq-to-seq language learning. However, Transformer is hard to combine with RL, as sampling requires costly computing resources. We tackle this problem by proposing an off-policy RL learning algorithm in which a behaviour policy represented by GRUs performs the sampling. We reduce the high variance of importance sampling (IS) by applying the truncated relative importance sampling (TRIS) technique and the Kullback-Leibler (KL)-control concept. TRIS is a simple yet effective technique, and there is a theoretical proof that KL-control helps to reduce the variance of IS. We formulate this off-policy RL based on self-critical sequence training. Specifically, we use a Transformer-based captioning model as the target policy and an image-guided language auto-encoder as the behaviour policy to explore the environment. The proposed algorithm achieves state-of-the-art performance on visual paragraph generation and improved results on image captioning.

1 Introduction

Transformer (self-attention) is a kind of seq-to-seq model that has shown breakthrough successes in natural language processing (NLP), such as machine translation and image captioning [1] [2]. Seq-to-seq models are usually trained using either Maximum Likelihood Estimation (MLE) or Reinforcement Learning (RL) [3] [4]. In particular, RL for seq-to-seq models can tackle two problems in language generation: (1) exposure bias, i.e., the train-test discrepancy in seq-to-seq models, where training uses the ground truths while testing generates a new token based on the previously generated ones; (2) gradient estimation for optimising non-differentiable evaluation metrics such as BLEU or CIDEr [4]. Indeed, RL has brought significant performance gains in image captioning and language generation [4] [3]. However, there is little literature on Transformer trained with RL [5]. On-policy RL is known to be sample inefficient, and this is especially serious for Transformer in visual paragraph generation, where the generated paragraph usually contains about 200 words or more [6]. Expensive computing resources are required for the gradient graph of the decoder, which is established at each time step for on-policy training, making such training infeasible in practice.

Off-policy RL, on the contrary, uses another independent behaviour policy to explore the environment and transfers the experience to the target policy. Off-policy learning is sample efficient [7] and can also largely reduce the computing resources required. In RL, the concept of off-policy learning is usually rooted in value-based RL [8] [9]. However, RL in NLP usually relies on policy-based methods, e.g., REINFORCE-like [10] algorithms [3] [4] and actor-critic [11], as the action space (vocabulary) is large. Value-based RL is not advantageous in dealing with large action spaces [12].

Also, off-policy RL is sometimes inaccurate, as there exists a discrepancy between the target and the behaviour policy. True off-policy learning, where the target and behaviour policies are uncorrelated, is extremely hard [13]. Well-known off-policy RL algorithms such as DQN [8] and DDPG [14] are only capable of learning with data correlated to their current policy [13]. A common way of approximation in off-policy learning is using Importance Sampling (IS) estimators [15] [16], which try to correct the mismatch between the distributions under the behaviour policy and the target policy. IS, however, has high variance when the two policy distributions are very different: the ratio of the two sampled probabilities becomes either very small or very large (sometimes infinite), which leads to huge variance. This phenomenon is noticeable when the episode of RL is long, as in long sentence generation for visual paragraph generation.

Hence, we propose an off-policy self-critical sequence training algorithm based on its on-policy version [4], a REINFORCE-like Policy Gradient algorithm, and apply it to visual paragraph generation. We employ a smooth version of IS, i.e., truncated relative importance sampling (TRIS) [17], to reduce the variance of conventional IS. TRIS is proved to be effective in reducing the variance of IS as it introduces a relative distribution ratio, which is bounded. Also, there is evidence that the Kullback-Leibler (KL) divergence between the target and behaviour policies influences the variance of IS [18]. KL-control studies an RL problem in which the agent tries to maximise the task-related reward while minimising its deviation from a prior policy (the behaviour policy). Consequently, when training the target policy with off-policy RL, we penalise its divergence from the behaviour policy with KL-control [19]: we add a KL-divergence term between the target policy and the behaviour policy to the value function of our RL and incorporate it into self-critical sequence training.

To be specific, we train a Meshed Transformer [2] optimised under the proposed off-policy self-critical sequence training for visual paragraph generation. We design a GRU-based image-guided language auto-encoder as the behaviour policy and treat our Transformer as the target policy. The target policy learns from self-critical rewards while minimising its divergence from the behaviour policy, which reduces the variance of TRIS. To summarise, our contributions are threefold: (1) We propose a novel off-policy self-critical sequence training framework, making RL learning feasible for Transformer. (2) We reduce the variance of the IS ratio, which arises in the off-policy RL approximation, by applying TRIS and the concepts and techniques of KL-control. (3) We achieve state-of-the-art results on visual paragraph generation and improved results on image captioning. Empirical evidence also shows that the IS variance can be significantly reduced.

2 Related Works

2.1 Off-Policy RL Learning

RL with a replay buffer [20] can be considered a standard tool for off-policy learning [21]. In these schemes, the behaviour policy in the replay buffer is somewhat related to the target policy [8], which is not ‘true’ off-policy RL learning [13]. For example, Isele et al. [22] observe that the performance of an agent is most reliable when the distribution of data in the replay buffer matches the test distribution.

Many approaches [23] [15] use IS to re-weight the probability distribution when the target policy differs from the behaviour policy. However, IS has high variance, preventing the model from achieving stable performance. Hanna et al. [24] use function approximation to estimate the behaviour policy and reduce the variance. Liu et al. [25] model the stationary state visitation distribution of the behaviour policy for infinite-horizon off-policy RL tasks. Humayoo et al. [17] apply a simple technique, TRIS, to address this problem.

KL-control is a branch of stochastic optimal control in which the KL divergence from another distribution is used for regularisation [26] [27]. An example in on-policy policy gradient is Trust Region Policy Optimisation (TRPO) [28], where a KL penalty term is incorporated into the value function of the Policy Gradient algorithm. KL-control has also been used to improve transfer learning between MLE training on data and training with RL [29].

2.2 Image Paragraph Generation

Regions-Hierarchical [6] introduces the first large-scale paragraph captioning dataset, which uses images from the Visual Genome dataset with new annotations. The dataset contains more pronouns and verbs and greater diversity than single-sentence captioning datasets, which makes it more challenging.

Several approaches [6] [30] [31] propose different types of hierarchical model structures to generate visual paragraphs, with an effective coupling mechanism between sentences within one paragraph. These hierarchical architectures model each sentence and couple the sentences into one paragraph, often with superior performance to flat models [6]. Advanced methods such as VAEs [31] and GANs [30] are applied to boost performance further. However, there is little literature on Transformer-like models trained under RL for visual paragraph generation, as the sampling in online RL is computationally expensive.

3 Methods

In this section, we first formulate the RL setting of Transformer in visual paragraph generation, then introduce our off-policy self-critical framework for optimisation.

3.1 Formulation of Visual Paragraph Generation in On-Policy Self-Critical Training

We consider the visual paragraph generation process as a finite Markov Decision Process (MDP). The Transformer can be viewed as an agent that interacts with the environment (words and image features). In the MDP $\{S,A,P,R,\gamma\}$, $S=\{s_{0},...,s_{T}\}$ is the state space and $A=\{a_{0},...,a_{T}\}$ is the action space. $P(s_{t+1}|s_{t},a_{t})$ is the state transition probability, $R(s_{t},a_{t})$ is the reward function and $\gamma\in(0,1]$ is the discount factor. The agent selects an action from a conditional probability distribution called the policy $\pi_{\theta}(a|s)$, parametrised by $\theta$. In visual paragraph generation, the state is composed of the image features ($I_{F}$) and the actions generated so far, i.e., $s_{t}=\{I_{F},a_{0},a_{1},a_{2},...,a_{t-1}\}$. Value functions are the expectation of the accumulated discounted future reward, measuring how good each state is. There are two kinds of value functions, the state value function $V^{\pi}(s_{t})$ and the state-action value function $Q^{\pi}(s_{t},a_{t})$, defined as follows:

\begin{split}&V^{\pi}(s_{t})=\mathbb{E}_{a_{t},s_{t+1},...\sim\pi}\Big[\sum_{l=0}^{T}\gamma^{l}r_{t+l}\,\Big|\,S_{t}=s_{t}\Big]\\ &Q^{\pi}(s_{t},a_{t})=\mathbb{E}_{s_{t+1},a_{t+1},...\sim\pi}\Big[\sum_{l=0}^{T}\gamma^{l}r_{t+l}\,\Big|\,S_{t}=s_{t},A_{t}=a_{t}\Big].\end{split} (1)
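To make Eq. (1) concrete, the following minimal Python sketch (not from the paper; the reward values are hypothetical) computes the Monte-Carlo estimate of the return from a single sampled episode, which is how the value functions are approximated when only a terminal CIDEr reward is available.

```python
import numpy as np

def discounted_returns(rewards, gamma=1.0):
    """Monte-Carlo estimate of sum_l gamma^l * r_{t+l} for every t of one
    sampled episode, computed by a backward pass over the rewards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: 5 generated tokens, a single terminal reward of 0.8
# (e.g. a CIDEr score), no intermediate rewards.
print(discounted_returns([0.0, 0.0, 0.0, 0.0, 0.8]))  # -> [0.8 0.8 0.8 0.8 0.8]
```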

The agent tries to maximise the accumulated reward by updating the parameters; the objective is expressed as follows:

L(\theta)=V^{\pi}(s_{0})=\mathbb{E}_{\pi}\Big[\sum_{t=1}^{T}\gamma^{t-1}r_{t}\Big]. (2)

For Policy Gradient methods [32], which are widely applied in sequence generation problems, the optimisation can be formulated as:

\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi}\Big[Q^{\pi}(s_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big]. (3)

The Policy Gradient is unbiased but has high variance. A common way to address this issue is to subtract a baseline $b(s_{t})$, as follows:

\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi}\Big[(Q^{\pi}(s_{t},a_{t})-b(s_{t}))\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big]. (4)

The baseline can be an arbitrary function, as long as it is independent of the action $a_{t}$. In self-critical sequence training, the Q function in the above equations is the expectation of the accumulated reward. As there is no intermediate reward in the language generation task, self-critical training uses a single Monte Carlo sample to approximate the Q function, which in practice is the CIDEr score of the sampled sentence, $\text{CIDEr}^{s}$. Self-critical training uses the CIDEr score of the greedily decoded sentence, $\widehat{\text{CIDEr}}$, as the baseline to reduce the variance of the Policy Gradient,

\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi}\Big[(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big]. (5)
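A minimal PyTorch sketch of the on-policy self-critical loss in Eq. (5); the CIDEr scores and per-token log-probabilities are assumed to be computed elsewhere, and all values below are hypothetical.

```python
import torch

def self_critical_loss(logp_sampled, cider_sampled, cider_greedy):
    """Surrogate loss whose gradient matches Eq. (5).

    logp_sampled : (T,) tensor of log pi_theta(a_t|s_t) for the sampled
                   paragraph (requires grad).
    cider_sampled, cider_greedy : CIDEr of the sampled and greedy paragraphs.
    """
    advantage = cider_sampled - cider_greedy  # CIDEr^s minus the greedy baseline
    return -(advantage * logp_sampled.sum())

# Hypothetical per-token probabilities of the sampled tokens.
probs = torch.tensor([0.40, 0.30, 0.50], requires_grad=True)
loss = self_critical_loss(torch.log(probs), cider_sampled=0.90, cider_greedy=0.70)
loss.backward()   # gradients now follow the self-critical policy gradient
```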

3.2 The Proposed Off-Policy Self-Critical Training

One reason for the instability of off-policy learning is the discrepancy between the distributions of the target and behaviour policies: we wish to estimate expectations under the target policy, but the data are sampled from the distribution of the behaviour policy.

Importance Sampling (IS).

IS [33] [34] is a classical approach to handling the discrepancy between the target and behaviour policies. Let the behaviour policy be $\pi^{b}$ and let $\tau=\{a_{1},...,a_{t},...,a_{T}\}$ be a sampled sequence. Then IS in off-policy self-critical learning can be written as:

\begin{split}\nabla_{\theta}L(\theta)&=\mathbb{E}_{\pi^{b}}\Big[(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})\frac{\pi(\tau)}{\pi^{b}(\tau)}\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big]\\ &=\mathbb{E}_{\pi^{b}}\Big[(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})\Big(\prod_{t=0}^{T}\mu_{t}\Big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big],\end{split} (6)

where $\mu_{t}=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi^{b}(a_{t}|s_{t})}$ is the importance ratio. IS has high variance, especially when the discrepancy between the distributions of the target policy and the behaviour policy is large, as the probability ratio becomes unstable.
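The following sketch (assumed shapes and values, not the authors' code) computes the plain IS weight $\prod_{t}\mu_{t}$ of Eq. (6) in log-space and illustrates how quickly it explodes for a paragraph of roughly 200 tokens.

```python
import math
import torch

def is_weight(logp_target, logp_behaviour):
    """prod_t pi_theta(a_t|s_t) / pi^b(a_t|s_t), computed in log-space."""
    return torch.exp((logp_target - logp_behaviour).sum())

T = 200  # roughly the length of a generated paragraph
logp_target    = torch.full((T,), math.log(0.12))  # hypothetical per-token probs
logp_behaviour = torch.full((T,), math.log(0.10))
print(is_weight(logp_target, logp_behaviour))  # ~1.2**200, an astronomically large weight
```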

Truncated Relative Importance Sampling (TRIS).

Relative Importance Sampling (RIS) [35] [36] [17] can be applied to smooth IS and thereby reduce its variance, as follows:

\begin{split}&\mu_{t}^{r}=\frac{\pi_{\theta}(a_{t}|s_{t})}{\lambda\pi_{\theta}(a_{t}|s_{t})+(1-\lambda)\pi^{b}(a_{t}|s_{t})}\\ &\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi^{b}}\Big[(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})\Big(\prod_{t=0}^{T}\mu_{t}^{r}\Big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big],\end{split} (7)

where the RIS ratio is bounded, as it is no greater than $\frac{1}{\lambda}$, as proved in [17]. Accordingly, RIS has bounded variance and also low bias [35]. The probability ratio $\prod_{t=0}^{T}\mu_{t}^{r}$ therefore no longer involves a product of unbounded terms.

Truncated Relative Importance Sampling (TRIS), expressed as $\mu^{tr}_{t}=\min(c,\mu_{t}^{r})$, can further stabilise the training, as it truncates the minimum value of the ratio at $c$, introducing a lower bound on RIS.
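A small sketch of the RIS and TRIS ratios (hypothetical probabilities; `torch.clamp` performs the truncation). It follows the $\min(c,\mu_{t}^{r})$ expression written above; if, as the later discussion of a lower bound of 0.95 suggests, the truncation is instead a floor at $c$, `clamp(min=c)` would be used in its place.

```python
import torch

def ris_ratio(p_target, p_behaviour, lam=0.5):
    """Relative importance ratio mu_t^r of Eq. (7); bounded above by 1/lam."""
    return p_target / (lam * p_target + (1.0 - lam) * p_behaviour)

def tris_ratio(p_target, p_behaviour, lam=0.5, c=0.95):
    """Truncated relative importance ratio mu_t^{tr} = min(c, mu_t^r).
    For a floor at c (a lower bound), replace max=c with min=c."""
    return torch.clamp(ris_ratio(p_target, p_behaviour, lam), max=c)

# Hypothetical per-token probabilities of the sampled actions.
p_t = torch.tensor([0.30, 0.02, 0.50])   # under the target policy
p_b = torch.tensor([0.10, 0.40, 0.45])   # under the behaviour policy
print(ris_ratio(p_t, p_b))    # tensor([1.5000, 0.0952, 1.0526]) -- each <= 1/lam
print(tris_ratio(p_t, p_b))   # tensor([0.9500, 0.0952, 0.9500])
```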

Penalty in Policy Gradient.

A way to further reduce the variance of TRIS and stabilise the training is to encourage the learnt policy to stay close to the behaviour policy [37]: we can penalise the KL divergence in the value function. This is also related to the KL-control problem, in which a KL penalty is introduced into the value function. Formally, with $\tau=\{a_{1},a_{2},...,a_{t-1}\}$, we define the penalised objective as:

\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi^{b}}\Big\{(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})\Big(\prod_{t=0}^{T}\mu^{tr}_{t}\Big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big\}-\beta D[\pi(\tau)||b(\tau)], (8)

where $\mu^{tr}_{t}$ is the TRIS ratio and $D$ is a divergence between distributions over actions, such as the KL divergence. This formulation penalises the target policy when its behaviour diverges from the behaviour policy. Since $D_{kl}[q(x)||p(x)]=\sum_{x}q(x)(\log q(x)-\log p(x))$, the objective is equivalent to the following expression at the action level:

\nabla_{\theta}L(\theta)=\mathbb{E}_{\pi^{b}}\Big\{\big[(\text{CIDEr}^{s}-\widehat{\text{CIDEr}})+\beta(\log b(a_{t}|s_{t})-\log\pi(a_{t}|s_{t}))\big]\Big(\prod_{t=0}^{T}\mu^{tr}_{t}\Big)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\Big\}, (9)

where the term $\log b(a_{t}|s_{t})$ rewards the model for choosing actions that have a high probability under the prior (behaviour) policy, and $-\log\pi(a_{t}|s_{t})$ is the entropy regularisation [38], which is important for efficient exploration in RL. $\beta$ is the coefficient that controls the contribution of the penalty term.
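A sketch of the action-level objective in Eq. (9), assuming the log-probabilities of the sampled actions under both policies, the TRIS weights and the two CIDEr scores are already available; the gradient is taken only through the $\log\pi_{\theta}$ term, with the reward and the TRIS weight treated as constants.

```python
import torch

def off_policy_sc_loss(logp_target, logp_behaviour, tris_weights,
                       cider_sampled, cider_greedy, beta=0.05):
    """Surrogate loss whose gradient follows Eq. (9) (sketch).

    logp_target, logp_behaviour : (T,) log-probs of the sampled actions.
    tris_weights                : (T,) truncated relative importance ratios.
    """
    # Per-token reward: self-critical advantage plus the KL penalty term.
    reward = (cider_sampled - cider_greedy) \
             + beta * (logp_behaviour - logp_target)
    weight = tris_weights.prod()                     # prod_t mu_t^{tr}
    # Detach everything that plays the role of a constant coefficient.
    return -(weight.detach() * (reward.detach() * logp_target).sum())
```

In this sketch, `logp_target` must be produced by the Transformer evaluated on the behaviour-sampled tokens, so that gradients flow through the target policy only.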

The Rationale of Combining TRIS with KL-control.

A fundamental issue of IS is the choice of the importance function, which, in our case, is the behaviour policy $\pi^{b}$. The IS estimate can be written as $\pi(E=e)=\frac{1}{M}\sum_{i=1}^{M}\frac{\pi(h_{i},e)}{\pi^{b}(h_{i})}$, where $h_{i}$ is the instantiation of the variables $H$ in the $i^{th}$ sample and $e$ is the observed variable. The optimal importance function is $\pi^{b}=\pi(H|e)$, which is proportional to $\pi(H,e)$ and leads to zero variance of the IS estimator. In practice, the optimal importance function is not easy to sample from; hence, many researchers seek methods to reduce the variance of IS.

While the optimal importance function is hard to find, the KL divergence between the two distributions significantly affects the variance of IS, as proved in [18]: let $\pi^{b_{1}}$ and $\pi^{b_{2}}$ be two importance functions such that $D(\pi^{b_{1}}||\pi)-D(\pi^{b_{2}}||\pi)=d>\ln c>0$, where $D$ is the KL divergence; then $\frac{E_{b_{1}}[Var(\pi/\pi^{b})]}{E_{b_{2}}[Var(\pi/\pi^{b})]}\geq\frac{e^{2d}}{c^{2}}$, where $Var$ denotes the variance. Accordingly, even a small change in KL divergence can exponentially alter the variance of IS and RIS. Consequently, we penalise the target policy when it diverges from the behaviour policy, to further reduce the variance of TRIS.
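As a numerical illustration of this effect (not from the paper), the following simulation compares the empirical variance of the IS ratio $\pi/\pi^{b}$ for two hypothetical behaviour policies over a toy three-word vocabulary, one close to and one far from the target policy in KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.5, 0.3, 0.2])   # toy target policy pi over 3 actions

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def ratio_variance(behaviour, n=100_000):
    """Empirical Var_b[pi(a)/pi^b(a)] with actions drawn from the behaviour policy."""
    a = rng.choice(len(behaviour), size=n, p=behaviour)
    return float(np.var(target[a] / behaviour[a]))

close = np.array([0.45, 0.35, 0.20])   # small D(pi^b || pi)
far   = np.array([0.10, 0.10, 0.80])   # large D(pi^b || pi)
for b in (close, far):
    print(f"KL = {kl(b, target):.3f}   Var of IS ratio = {ratio_variance(b):.3f}")
# The behaviour policy with larger KL divergence yields a far larger IS variance.
```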

3.3 The Model Structure

The agent we utilise is the Meshed Transformer [2], which shows state-of-the-art performance on image captioning. We use a GRU-based image-guided language auto-encoder as the behaviour policy, as shown in Figure 1. The input (ground-truth) paragraph is encoded via a GRU-based language encoder into a hidden vector $h_{e}$ of size $batch\times M$. We then feed the region image features extracted from a pre-trained Faster R-CNN model, denoted as $F=\{f_{1},f_{2},...,f_{K}\}$, each of size $batch\times N$, into a visual attention module [39] at every time step of the language decoder. The input to the language decoder (a GRU model) at each time step $t$ is thus expressed as:

\begin{split}&e_{ti}=concat(f_{i},h_{e},h_{t-1})*W_{\alpha}\\ &\alpha_{ti}=\frac{\exp(e_{ti})}{\sum_{k=1}^{K}\exp(e_{tk})}\\ &I_{t}=\sum_{i=1}^{K}(\alpha_{ti}*f_{i})\\ &h_{t}=\left\{\begin{split}&h_{e}\ \ \text{if}\ t=0\\ &\text{GRU}(I_{t},h_{t-1})\ \ \text{if}\ t>0,\end{split}\right.\end{split} (10)

where $W_{\alpha}\in\mathbb{R}^{L\times 1}$ and $L=2M+N$. The hidden vector of the language decoder is initialised with $h_{e}$, so the language auto-encoder can be considered image-guided. $h_{t}$ is then decoded into the paragraph. We use this language auto-encoder as the behaviour policy to explore the environment. To approximate the off-policy learning, TRIS and a KL-divergence penalty are utilised in training.
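A minimal PyTorch sketch of one decoding step of such an image-guided behaviour policy, following Eq. (10); the dimensions, module names and vocabulary size are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GuidedDecoderStep(nn.Module):
    """One step of the GRU language decoder with image-guided attention (Eq. (10))."""
    def __init__(self, M=512, N=2048, vocab_size=5000):
        super().__init__()
        self.att = nn.Linear(2 * M + N, 1)   # plays the role of W_alpha (L = 2M + N)
        self.gru = nn.GRUCell(N, M)
        self.out = nn.Linear(M, vocab_size)

    def forward(self, F, h_e, h_prev):
        # F: (B, K, N) region features; h_e, h_prev: (B, M).
        B, K, _ = F.shape
        ctx = torch.cat([h_e, h_prev], dim=-1).unsqueeze(1).expand(B, K, -1)
        e = self.att(torch.cat([F, ctx], dim=-1)).squeeze(-1)   # attention scores e_{ti}
        alpha = torch.softmax(e, dim=-1)                        # alpha_{ti}
        I_t = (alpha.unsqueeze(-1) * F).sum(dim=1)              # attended image feature I_t
        h_t = self.gru(I_t, h_prev)                             # next hidden state
        return torch.log_softmax(self.out(h_t), dim=-1), h_t    # log pi^b(.|s_t), h_t

# Hypothetical shapes: batch of 2, K = 36 regions.
step = GuidedDecoderStep()
F, h_e = torch.randn(2, 36, 2048), torch.randn(2, 512)
logp, h = step(F, h_e, h_prev=h_e)   # h_0 = h_e, as in Eq. (10)
```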

Figure 1: The off-policy self-critical framework for visual paragraph generation. The image is first input to a Faster R-CNN model [40] to extract $n$ region features, each with a dimension of 4096. After a fully-connected (FC) transformation, the features are forwarded to the Transformer for training. Meanwhile, the input paragraph is encoded via a GRU encoder into a hidden vector. The hidden vector, along with the visual features, is subsequently input to a GRU decoder to perform multinomial sampling. The sampled words are then forwarded to the Transformer to obtain the action probabilities. The self-critical reward obtained from the GRU decoder is formulated with a KL penalty term, which reduces the variance of the TRIS used for re-weighting the probabilities. Best viewed in colour.

3.4 Training Algorithm

We first train an image-guided language auto-encoder using the image-paragraph pairs provided with the dataset; this model is then used as the behaviour policy. The Transformer [2] is pre-trained using the standard MLE learning scheme on the dataset. We then treat the Transformer model as the target policy and start the off-policy Policy Gradient training described previously. When training the model under RL, the total loss objective is a combination of the MLE loss and the RL loss, expressed as:

\begin{split}&Loss_{MLE}(\theta)=-\sum_{t=0}^{T}\log(\pi_{\theta}(a_{t}|a_{0:t-1},I_{F}))\\ &Loss_{total}(\theta)=(1-\alpha)*Loss_{MLE}(\theta)-\alpha*L(\theta),\end{split} (11)

where the MLE loss minimises the negative log probability of each generated word token given the previously generated word tokens.
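A short sketch of the combined objective in Eq. (11), where the RL term is the off-policy self-critical objective estimated earlier; the names are illustrative.

```python
import torch

def total_loss(logp_gt_tokens, rl_objective, alpha=0.5):
    """Combined training loss of Eq. (11).

    logp_gt_tokens : (T,) log pi_theta(a_t | a_{0:t-1}, I_F) of the
                     ground-truth tokens (teacher forcing).
    rl_objective   : scalar estimate of L(theta) from the off-policy
                     self-critical policy gradient (to be maximised).
    """
    loss_mle = -logp_gt_tokens.sum()
    return (1.0 - alpha) * loss_mle - alpha * rl_objective
```

With $\alpha=0.5$ this matches the setting reported in the ablation studies below.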

4 Experiments

We conduct experiments on two use cases of off-policy self-critical training for image-based language generation: visual paragraph generation and image captioning. The merits of our algorithm lie mainly in tackling long-sequence generation for Transformer models, e.g., visual paragraph generation. Image captioning generates a single caption for a given image and can be combined with on-policy self-critical training [2]. Nevertheless, we apply the proposed method to image captioning as well.

4.1 Visual Paragraph Generation

Implementation Details.

We experiment on the Stanford Visual Paragraph dataset [6], in which each image is annotated with one paragraph. The training, validation and testing sets contain 14,575, 2,487 and 2,489 images, respectively. We evaluate the BLEU, METEOR, ROUGE-L and CIDEr scores of the generated paragraphs. For the MLE baseline, we train the model for 40 epochs. For our off-policy self-critical algorithm, we further train the model for 8 epochs using a combination of off-policy RL and MLE, evaluating after every epoch and using early stopping on the CIDEr score to choose the best model. The learning rate is set to 4e-4 for MLE training and 4e-5 for our off-policy self-critical training. We use the Adam optimiser [41] with stochastic back-propagation. The batch size is set to 20. Our experiments are conducted using PyTorch 1.2.0 on a server equipped with an NVIDIA 2080-Ti GPU.

Ablation Studies.

First, we consider two kinds of behaviour policies, the visual attention-based captioning model [39] and our image-guided language auto-encoder, whose performance is shown in Table 1. The attention model yields poorer performance than our auto-encoder, as the auto-encoder additionally incorporates the language information.

The impact of these different behaviour policies on the target policy is less pronounced, as revealed in Table 2. This phenomenon shows that: (1) the behaviour policy is only used for exploration in RL and, in theory, does not directly constrain the target policy; (2) a behaviour policy that selects better actions can nevertheless have a better impact on the target policy, as the reward tends to be more positive.

RIS can reduce the variance of IS via a simple linear transformation. As the reduced variance leads to more stable training, the performance improves, as shown in Table 3. TRIS can further boost the performance as it additionally introduces a lower bound on the RIS ratio. This lower bound keeps $\prod_{t=0}^{T}\mu^{tr}_{t}$ mostly away from zero, leading to more effective training.

The KL-control technique described previously penalises the target policy when it diverges from the behaviour policy, thus reducing the variance of IS. The results are shown in Table 3 and Table 4. TRIS with KL-control increases the final performance of the target policy.

We study the value of $c$ in TRIS, as presented in Table 6. A suitable $c$ is critical for maintaining the performance, as it directly affects the TRIS ratio; $c=0.96$ yields the best results.

The coefficient $\alpha$ also has an impact on the performance; $\alpha=0.5$ strikes the right balance between supervised learning and off-policy RL learning, as shown in Table 5.

Figure 2: The IS ratio under different schemes: (a) IS ratio; (b) RIS ratio; (c) RIS ratio with KL penalty; (d) TRIS ratio with KL penalty. The X-axis is the iteration and the Y-axis is the ratio. We see an obvious impact of TRIS and the KL penalty on the ratio value.

We plot the IS ratio during training against the iteration number. We run the RL training for 2,000 iterations with a batch size of 20, as shown in Figure 2. The IS ratio reaches very high values (more than 3,000) at around 1,600 and 2,000 iterations, i.e., it is not bounded. RIS with a relative ratio of 0.5 significantly reduces the variance, keeping the ratio below 0.07, a stark contrast to the unbounded IS ratio. KL-control further reduces the variance of the RIS ratio, limiting it to below 0.05. TRIS introduces a lower bound of 0.95 on the ratio, leading to stable training.

Table 1: The performance of behaviour policies.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
behaviour Policy [39] 22.4 10.2 4.3 1.7 9.5 24.4 7.0
Our behaviour Policy 46.3 30.8 20.9 14.5 18.0 41.5 66.7
Table 2: The impact of behaviour policies on the performance of the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
off-policy with behaviour Policy [39] 37.8 22.1 13.0 7.6 14.9 29.2 14.1
off-policy with our behaviour Policy 41.9 24.8 14.8 8.9 16.6 29.8 19.0
Table 3: The impact of TRIS on the performance of the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
IS, $\alpha=0.5$ 41.9 24.8 14.8 8.9 16.6 29.8 19.0
IS + KL, $\beta=0.05,\alpha=0.5$ 42.1 24.2 14.1 9.2 16.5 28.2 16.9
RIS + KL, $\beta=0.05,\alpha=0.5$ 43.1 25.5 15.2 9.0 16.9 29.5 20.0
TRIS + KL, $\beta=0.05,\alpha=0.5$ 42.7 25.7 15.5 9.4 16.9 30.2 20.9
Table 4: The impact of the coefficient weight $\beta$ of KL-control on the performance of the target policy.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
RIS+KL, $\beta=0.2,\alpha=1.0$ 42.0 24.4 14.4 8.6 16.5 29.0 19.9
RIS+KL, $\beta=0.1,\alpha=1.0$ 42.9 25.4 15.1 8.9 16.7 29.7 20.2
RIS+KL, $\beta=0.05,\alpha=1.0$ 42.5 25.4 15.3 9.2 16.7 30.1 19.5
Table 5: The impact of the coefficient $\alpha$ of the off-policy policy gradient on the performance.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
RIS+KL, $\alpha=0.2$ 42.3 24.8 14.6 8.5 16.5 29.1 18.8
RIS+KL, $\alpha=0.5$ 43.1 25.5 15.2 9.0 16.9 29.5 20.0
RIS+KL, $\alpha=0.8$ 42.3 25.1 15.0 8.9 16.6 29.6 20.2
RIS+KL, $\alpha=1.0$ 42.5 25.4 15.3 9.2 16.7 30.1 19.5
Table 6: The impact of the truncated value $c$ on the performance of TRIS.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
TRIS+KL, $c=0.96$ 44.2 26.8 16.2 9.8 17.2 30.8 20.6
TRIS+KL, $c=0.95$ 42.7 25.7 15.5 9.4 16.9 30.2 20.9
TRIS+KL, $c=0.85$ 41.6 24.7 14.7 8.7 16.5 29.5 19.9

Comparison with the State-of-the-art.

The comparison of our scheme with the current leading methods is shown in Table 7. We achieve state-of-the-art results by using our algorithm with a Transformer optimised on CIDEr. The results even outperform the human annotations on the BLEU scores, and the CIDEr score is also state-of-the-art.

Table 7: The Performance Comparison with the State-of-the-art Methods on the Stanford Visual Paragraph Dataset.
Category Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR CIDEr
Flat Models Sentence-Concat [6] 31.1 15.1 7.6 4.0 12.1 6.8
Template [6] 37.5 21.0 12.0 7.4 14.3 12.2
Image-Flat [6] 34.0 19.1 12.2 7.7 12.8 11.1
Top-down Attention [42] 32.8 19.0 11.4 6.9 12.9 13.7
self-critical [4] 29.7 16.5 9.7 5.9 13.6 13.8
DAM-Att [43] 35.0 20.2 11.7 6.6 13.9 17.3
Meshed Transformer + MLE [2] 37.5 22.3 13.7 8.4 15.4 16.1
Hierarchical Models Regions-Hierarchical [6] 41.9 24.1 14.2 8.7 16.0 13.5
RTT-GAN [30] 42.0 24.9 14.9 9.0 17.1 16.9
Diverse (VAE) [31] 42.4 25.6 15.2 9.4 18.6 20.9
ParaCNN [44] 42.0 25.0 14.9 8.8 17.0 20.4
Ours Meshed Transformer [2] + off-policy (c = 0.95) 42.7 25.7 15.5 9.4 16.9 20.9
Meshed Transformer [2] + off-policy (c = 0.96) 44.2 26.8 16.2 9.8 17.2 20.6
Human [6] Annotations 42.9 25.7 15.6 9.7 19.2 28.6

4.2 Extending to the Convolutional Model and Image Captioning

Convolutional captioning [45] has a parallel-computing property similar to that of the Transformer. The sentences to be generated are shorter, requiring relatively fewer GPU resources. Nevertheless, we test our off-policy learning on the convolutional image captioning task. Following the practice of [45], we experiment on the MS-COCO dataset [46] under the ‘Karpathy’ split and report the results in Table 8. We follow the training protocol of [45] for the baseline and further train the model for 5 epochs using our off-policy self-critical algorithm. Our method improves convolutional captioning on almost every language evaluation metric. Notably, CIDEr is significantly enhanced, as our off-policy RL is optimised towards the CIDEr score.

Table 8: The impact of off-policy RL on convolutional image captioning.
Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Conv Captioning [45] 71.0 53.6 38.9 27.9 24.0 51.9 88.1
Our off-policy learning 70.9 53.8 39.3 28.3 24.2 52.0 90.0

5 Conclusions

It is hard for Transformer and convolution-based seq-to-seq models to perform online RL optimisation in visual paragraph generation, as the computing resources required for such models exceed what current equipment provides. Hence, we propose an off-policy self-critical algorithm in which a GRU-based model is set as the behaviour policy to perform the sampling in RL, which is much more efficient. To better approximate this off-policy RL, both TRIS and a KL-divergence penalty term are applied to reduce the high variance of the IS approximation. With Transformer thus empowered with RL learning capability, we achieve state-of-the-art results on visual paragraph generation and also improved results on image captioning.

Broader Impact

In this paper, we introduce an off-policy self-critical sequence training algorithm, especially targeting Transformer-like models in visual paragraph generation, making the combination of these advanced models and reinforcement learning (RL) feasible.

Usually, we can consider the language generation task as a sequential decision-making process: at each time step, the agent selects a word from a pre-defined vocabulary until the whole sentence or paragraph is generated. Previous studies usually make the agent act on-policy. However, off-policy learning can spare the agent from real exploration and is sample efficient. Transformer is especially computationally expensive in on-policy exploration, making on-policy RL for Transformer infeasible. One impact of this work is that the proposed algorithm not only directly saves computing resources by sparing Transformer from real exploration but also shows the possibilities of off-policy RL in language generation tasks.

This research is also a test of how off-policy RL performs in large-action-space problems. Off-policy learning is usually rooted in value-based RL algorithms, where the action space is small. Instead, we propose a policy gradient method, without Temporal Difference (TD) bootstrapping, to directly transfer the Monte-Carlo experience of the behaviour policy to the target policy. Typically, the policy gradient is better behaved when combined with function approximation, while TD methods are more readily applied to off-policy learning. The main obstacle is the high variance in the off-policy estimation of the policy gradient; we show that it is feasible to formulate and apply the off-policy policy gradient if the variance is handled properly.

A drawback of the proposed algorithm is that we need to introduce another RNN-based model as the behaviour policy, which increases the number of model parameters. Hence, further research could focus on reducing the complexity of the training models while achieving the same effects as off-policy RL.

In summary, this research helps existing natural language processing (NLP) models like Transformer to perform off-policy RL learning, which is a novel way of training Transformer. This research will also provide insights for other RL learning schemes, for instance, actor-critic learning, in various NLP tasks.

References

  • [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [2] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. $M^2$: Meshed-memory transformer for image captioning. In CVPR, 2020.
  • [3] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.
  • [4] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In CVPR, 2017.
  • [5] Emilio Parisotto, H Francis Song, Jack W Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant M Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. arXiv preprint arXiv:1910.06764, 2019.
  • [6] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In CVPR, 2017.
  • [7] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.
  • [8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NeurIPS, 2013.
  • [9] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In NeurIPS, 2016.
  • [10] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [11] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
  • [12] Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan K Reddy. Deep reinforcement learning for sequence-to-sequence models. IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • [13] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In ICML, 2019.
  • [14] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • [15] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
  • [16] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In ICML, 2016.
  • [17] Mahammad Humayoo and Xueqi Cheng. Relative importance sampling for off-policy actor-critic in deep reinforcement learning. arXiv preprint arXiv:1810.12558, 2018.
  • [18] Ydo Wexler and Dan Geiger. Importance sampling via variational optimization. In UAI, 2007.
  • [19] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In IJCAI, 2013.
  • [20] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
  • [21] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [22] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In AAAI, 2018.
  • [23] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In ICML, 2016.
  • [24] Josiah P Hanna, Scott Niekum, and Peter Stone. Importance sampling policy evaluation with an estimated behavior policy. In ICML, 2019.
  • [25] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In NeurIPS, 2018.
  • [26] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In ICLR, 2015.
  • [27] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.
  • [28] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, 2015.
  • [29] Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E Turner, and Douglas Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control. In ICML, 2017.
  • [30] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P Xing. Recurrent topic-transition gan for visual paragraph generation. In ICCV, 2017.
  • [31] Moitreya Chatterjee and Alexander G Schwing. Diverse and coherent paragraph generation from images. In ECCV, 2018.
  • [32] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000.
  • [33] Peter J Huber. Wiley series in probability and mathematics statistics. Robust statistics, pages 309–312, 1981.
  • [34] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
  • [35] Masashi Sugiyama. Introduction to statistical machine learning. Morgan Kaufmann, 2015.
  • [36] Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In NeurIPS, 2011.
  • [37] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
  • [38] Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In ICML, 2019.
  • [39] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • [40] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [41] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [42] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [43] Ziwei Wang, Yadan Luo, Yang Li, Zi Huang, and Hongzhi Yin. Look deeper see richer: Depth-aware image paragraph captioning. In ACMMM, 2018.
  • [44] Shiyang Yan, Yang Hua, and Neil Robertson. Paracnn: Visual paragraph generation via adversarial twin contextual cnns. arXiv preprint arXiv:2004.10258, 2020.
  • [45] Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. Convolutional image captioning. In CVPR, 2018.
  • [46] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.