
ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles

Haoqin Tu, Bowen Yang, Xianfeng Zhao
State Key Laboratory of Information Security, Institute of Information Engineering,
School of Cyber Security, University of Chinese Academy of Sciences
[email protected], {yangbowen,zhaoxianfeng}@iie.ac.cn
Abstract

Automatically generating textual content with desired attributes is an ambitious task that people have long pursued. Existing works have made a series of progress in incorporating unimodal controls into language models (LMs), whereas how to generate controllable sentences with multimodal signals and high efficiency remains an open question. To tackle this problem, we propose a new paradigm of zero-shot controllable text generation with multimodal signals (ZeroGen). Specifically, ZeroGen leverages controls from text and image successively, from the token level to the sentence level, and maps them into a unified probability space at decoding, which customizes the LM outputs by weighted addition without extra training. To achieve better inter-modal trade-offs, we further introduce an effective dynamic weighting mechanism to regulate all control weights. Moreover, we conduct substantial experiments to probe whether signals from distinct modalities relate in depth (as complements) or in breadth (as parallel controls). Encouraging empirical results on three downstream tasks show that ZeroGen not only outperforms its counterparts on captioning tasks by a large margin but also shows great potential in multimodal news generation with a higher degree of control. Our code will be released at https://github.com/ImKeTT/ZeroGen.

Figure 1: Traditional CTG only has unimodal guidance (up), while our ZeroGen follows Multimodal CTG (down) that incorporates multimodal controls to generate relevant texts. We mark words/sentences that are relevant to the textual control or visual control.

1 Introduction

Large-scale pre-trained models (PTMs) have recently achieved great success and become a milestone in the field of AI. Owing to their sophisticated pre-training objectives and huge parameter counts, PTMs can benefit a variety of downstream tasks just like Oracles. In the language domain, pre-trained language models (PLMs) have become a cornerstone of versatile generation tasks, including controllable text generation (CTG). By controlling the presence of certain linguistic attributes, these PLMs can be trained to generate texts with desired aspects such as length and topic Kikuchi et al. (2016); Ficler and Goldberg (2017). Conventional approaches usually construct a conditional LM with supervision (e.g., by fine-tuning), which is unscalable due to the combinatorially numerous conceivable compositions and the lack of annotated data Keskar et al. (2019); Liu et al. (2022). Most recent studies have begun to look into “plug-and-play” (PnP) solutions. These techniques plug arbitrary constraints into PLMs to guide the generation of desired sentences with little training expense, and the control signals of this paradigm are typically limited to unimodal domains, such as provided keywords or topics Dathathri et al. (2019); Pascual et al. (2021); Yang and Klein (2021); Tu et al. (2022). The PnP fashion has rapidly been adopted to bridge multimodal knowledge: recent works have introduced pre-trained multimodal models like CLIP Radford et al. (2021) into cross-modal tasks with vision-only controls such as captioning, obtaining exceptional performance with minimal or no task-oriented training Su et al. (2022a); Tewel et al. (2022); Nukrai et al. (2022).

On the one hand, meaningful interactions between human speakers often necessitate real-world experiences Bisk et al. (2020), and text-only instruction alone may not be sufficient to fulfill such communication purposes Harnad (1990). As a result, using unimodal controls for CTG creates a mismatch between reliably regulating current PLMs and meeting real-world scenarios (e.g., multimodal controlled news generation in Figure 1). On the other hand, unlike some keyword-guided PnP works Pascual et al. (2021); Gu et al. (2022), models that incorporate visual guidance into language generation insert constant controls at LM decoding rather than accounting for the dynamic nature of the decoding process, which may lead to task under-performance Su et al. (2022a); Tewel et al. (2022).

To overcome those shortcomings, we take a step further to extend the current unimodal PnP paradigm to a multimodal setting and propose ZeroGen. To accomplish the multimodal CTG task, we observe that inputs from different domains affect text at different granularities. As shown in Figure 1, while the textual control steers the generated news toward the science topic by presenting related keywords, the visual control provides richer ambient information at the sentence level. In order to plug in multimodal signals, we propose to unify the controls into the LM output probability using token- or sentence-level similarity with several Oracles. Specifically, we first regard the textual guidance as the token-level similarity between keywords and the LM vocabulary, computed from a textual Oracle before decoding, and then incorporate this guidance into the LM outputs by weighted addition at generation time. For the visual guidance, we use a multimodal score Su et al. (2022a) based on sentence-level probability determined by a multimodal Oracle. Finally, we employ beam search to find the token with the highest score at each step. To adapt to the dynamic nature of LM decoding and further promote model performance, we provide a word-level dynamic weighting mechanism that not only enhances visual information expression but also maintains output fluency.

We conduct three tasks (image captioning, stylized captioning, and controllable news generation) with ZeroGen. We explore whether the relationship between textual and visual control is vertical or lateral. Specifically, in the two captioning tasks, textual objects of the image extend the visual signal as a complement (vertical extension). For news generation, a collection of positive or negative words is used to imbue the generated news with a specific sentiment (lateral extension). The effectiveness of our approach in providing better captions and easily controlled news is demonstrated by results on both automatic metrics and human evaluations.

Contributions. (1) We explore the task of multimodal controllable text generation under a zero-shot setting and propose ZeroGen, which utilizes token- and sentence-level multimodal guidance to fulfill this task. (2) We present a word-level dynamic weighting scheme that can be applied to different modalities and boosts the fluency and controllability of generated texts. (3) Extensive experiments on two captioning tasks and the controllable news generation task not only justify the effectiveness of ZeroGen but also investigate the relationship between different types of modal controls.

Figure 2: Workflow of ZeroGen at decoding step $t$. Through multiple LM output changing stages, ZeroGen is essentially a decoding scheme that finds a word related to both the textual ($\textbf{C}_{T}$) and visual control ($\textbf{C}_{V}$) at each step. It then feeds the word back to the base LM for future conditional generation.

2 Related Work

Efficient Image Captioning.

The prerequisite of supervised captioning for a large amount of paired image-text data is unrealistic in real-life scenarios. Various attempts have been made to reduce the dependence on large paired image-text corpora. For example, some works Anderson et al. (2018); Laina et al. (2019); Chen et al. (2020); Honda et al. (2021) have sought to incorporate objects from given images into model training. Despite their efficiency compared with supervised methods, they still need to be trained with partial cross-modal guidance as supervision. CLIP Radford et al. (2021), as a milestone for vision-language alignment, has shown impressive zero-shot capabilities on various multimodal generation tasks. For example, Tewel et al. (2022) proposed the first zero-shot captioning model with CLIP and a base LM (i.e., GPT-2), which constantly updates the model’s transformer cache under the direction of CLIP guidance during decoding. Nevertheless, it still demands gradient computation and optimization during generation, introducing additional generation overhead. Su et al. (2022a) proposed MAGIC, which utilizes a CLIP-based token decoding score to produce plausible captions without task-specific training. More recently, Nukrai et al. (2022) employed text-only training with Gaussian noise parameterized by a few images to connect CLIP and the base LM’s textual embedding; still, this approach requires a small amount of external visual knowledge during training. As for our model, ZeroGen extends MAGIC with additional capabilities to facilitate multimodal guided generation with dynamic weighting, supporting several downstream applications while keeping its ability to transfer to different base LMs. Most recently, Zeng et al. (2023) proposed to employ sample-based sequential polishing during language decoding to produce plausible and fluent captions.

PnP Controllable Text Generation.

To avoid the excessive training costs of fine-tuning PLMs for CTG tasks, researchers have turned their attention to specialized training-free methods such as the “plug-and-play” (PnP) framework by Dathathri et al. (2019). This framework can be used along with an existing generative LM (the base LM) with minimal or no training between the PnP components and the base LM. These PnP approaches typically fall into two categories. In-model guidance approaches, including “prompt tuning” Lester et al. (2021), either aim at optimizing the input prompts and additional parameters fed into the base LM Houlsby et al. (2019); Li and Liang (2021); Lester et al. (2021), or seek to alter certain hidden representations (other than the model input or output layers) by plugging a trainable model into the middle of the base LM Dathathri et al. (2019); Duan et al. (2020); Mai et al. (2020); Tu et al. (2022). Out-model guidance techniques, on the contrary, focus on building controllable language models that only modify the output probabilities of the base LM at inference time Krause et al. (2021); Pascual et al. (2021); Yang and Klein (2021); Liu et al. (2021a). Our ZeroGen belongs to the latter category, imposing control signals only at LM decoding.

3 ZeroGen Methodology

For the multimodal CTG task, we formally define it as follows: given the visual control $\textbf{C}_{V}$ (i.e., an image) and $N$ representative words from a topic or an image as the textual control $\textbf{C}_{T}=\{C_{T_{1}},...,C_{T_{N}}\}$, we aim at getting the textual output $\textbf{X}=\{x_{1},x_{2},...\}$ to meet the two control aspects simultaneously.

ZeroGen focuses on the output probability space of the base LM. As shown in Figure 2, at decoding step $t$, it first adjusts the original LM output probability $p_{\text{LM}_{t}}$ to $p^{\prime}_{\text{LM}_{t}}$ following the token-level textual guidance from keyword-vocabulary similarities, then completes word searching on $p^{\prime}_{\text{LM}_{t}}$ using a sentence-level multimodal scoring function and beam search. Note that, instead of calculating the token similarity at every step Pascual et al. (2021), we compute it only once before decoding and turn it into the overall textual control with several options. Finally, a word-level dynamic weighting scheme regulates both control weights at every generation step.

3.1 Token-level Textual Guidance

Since the appearance of keywords from a certain topic can drive sentences in that direction, we consider the token-level similarity between LM tokens and the keywords in $\textbf{C}_{T}$ as the textual guidance. To avoid additional computational costs, we unify the textual control into probability space through a set of cosine similarities between each word $C_{T_{n}}\in\textbf{C}_{T}$ and the full base LM vocabulary $\textbf{V}\in\mathbb{R}^{V}$, computed before decoding. These word similarities are obtained with the textual Oracle $\phi_{T}$ (e.g., pre-trained word embeddings):

\[
p(\textbf{V},\textbf{C}_{T})=\left\{\cos\left(\phi_{T}(\textbf{V}),\phi_{T}(C_{T_{n}})\right)\right\}_{n=1}^{N},
\]

where $p(\textbf{V},\textbf{C}_{T})\in\mathbb{R}^{N\times V}$ and $V$ is the vocabulary size. To fully utilize all the given keywords, we explore three selection methods at time $t$ when $N>1$ to obtain the overall textual control $p_{t}(\textbf{C}_{T})\in\mathbb{R}^{V}$:

Step-wise Random (SR):

we provide changing controls throughout generation: at each step, we uniformly sample one keyword-vocabulary similarity row from $p(\textbf{V},\textbf{C}_{T})$ as the textual guidance.

Mean Pooling (MP):

an intuitive way to consider all textual information is to average the guiding similarities w.r.t. $\textbf{V}$ across the distinct keywords.

Word-wise Max (WM):

for every token $w$ in $\textbf{V}$, we choose the keyword in $\textbf{C}_{T}$ most similar to $w$ (i.e., with the highest cosine similarity) to compute its guiding probability, and compose all these highest similarities into $p_{t}(\textbf{C}_{T})$.

After this selection, the overall textual control $p_{t}(\textbf{C}_{T})\in\mathbb{R}^{V}$ is available; we introduce it into $p_{\text{LM}_{t}}$ as a control bias through a simple weighted addition: $p^{\prime}_{\text{LM}_{t}}=p_{\text{LM}_{t}}+\alpha\times p_{t}(\textbf{C}_{T})$.
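To make the token-level guidance concrete, below is a minimal PyTorch sketch of the procedure above (function and variable names such as `textual_guidance` and `vocab_emb` are ours for illustration and not part of the released implementation):

```python
import torch
import torch.nn.functional as F

def textual_guidance(vocab_emb, keyword_emb, mode="WM"):
    """Builds p_t(C_T) from keyword-vocabulary cosine similarities (Sec. 3.1).

    vocab_emb:   (V, d) textual-Oracle embeddings (e.g., GloVe) of the LM vocabulary.
    keyword_emb: (N, d) embeddings of the N keywords in C_T.
    Returns a (V,) vector used as the overall textual control at step t.
    """
    # p(V, C_T): cosine similarity of every keyword with every vocabulary token, shape (N, V).
    sim = F.cosine_similarity(keyword_emb.unsqueeze(1), vocab_emb.unsqueeze(0), dim=-1)
    if mode == "SR":   # Step-wise Random: one keyword's similarity row, sampled uniformly per step
        return sim[torch.randint(0, sim.size(0), (1,)).item()]
    if mode == "MP":   # Mean Pooling: average the rows over all keywords
        return sim.mean(dim=0)
    return sim.max(dim=0).values  # Word-wise Max: per-token maximum over keywords

def shift_lm_output(p_lm, p_ct, alpha):
    """p'_LM_t = p_LM_t + alpha * p_t(C_T)."""
    return p_lm + alpha * p_ct
```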

3.2 Sentence-level Visual Guidance

An image carries more general and higher-level global information than a single word. As discussed in Sec. 1, we thus consider the sentence-level similarity between the generated text and the visual control $\textbf{C}_{V}$ as the visual guidance.

We employ a scoring function $S_{t}$ for word $w\in\textbf{V}$ at the $t$-th step with weighted visual guidance as in Su et al. (2022a), and use beam search for generation:

\[
S_{t}\left(w,\textbf{C}_{V}\mid x_{<t},W_{t}^{(k)}\right)=
\begin{cases}
p^{\prime}_{\text{LM}_{t}}(w\mid x_{<t})+\beta\times\dfrac{e^{p_{\phi_{M}}\left(\left[x_{<t};w\right],\textbf{C}_{V}\right)}}{\sum_{z\in W_{t}^{(k)}}e^{p_{\phi_{M}}\left(\left[x_{<t};z\right],\textbf{C}_{V}\right)}}, & \text{if }w\in W_{t}^{(k)},\\
-\infty, & \text{otherwise}.
\end{cases}
\]

Here $[x_{<t};w]$ means appending $w$ to the texts generated before step $t$, and $W_{t}^{(k)}$ is the search beam consisting of the words with the $k$ highest probabilities in $p^{\prime}_{\text{LM}_{t}}$. In detail, we bridge texts and $\textbf{C}_{V}$ using the multimodal Oracle $\phi_{M}$ (e.g., CLIP) and compute their similarity: $p_{\phi_{M}}([x_{<t};w],\textbf{C}_{V})=\cos(\phi_{M}([x_{<t};w]),\phi_{M}(\textbf{C}_{V}))$. Our final goal is to find $x_{t}=\arg\max_{w\in\textbf{V}}S_{t}(w,\textbf{C}_{V}\mid x_{<t},W_{t}^{(k)})$.
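Below is a simplified sketch of this sentence-level scoring step, assuming the shifted probabilities and the beam-wise CLIP similarities have already been computed (names such as `visual_rerank` and `clip_sims` are illustrative only):

```python
import torch

def visual_rerank(p_lm_shifted, beam_ids, clip_sims, beta):
    """Selects x_t = argmax_w S_t(w, C_V | x_<t, W_t^(k)) (Sec. 3.2).

    p_lm_shifted: (V,) shifted LM probabilities p'_LM_t.
    beam_ids:     (k,) token ids of the beam W_t^(k), i.e. the k most probable tokens.
    clip_sims:    (k,) similarities p_phiM([x_<t; w], C_V) for each beam candidate w.
    """
    scores = torch.full_like(p_lm_shifted, float("-inf"))  # tokens outside the beam get -inf
    visual = torch.softmax(clip_sims, dim=-1)              # e^{p} / sum e^{p} over the beam
    scores[beam_ids] = p_lm_shifted[beam_ids] + beta * visual
    return int(torch.argmax(scores))
```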

3.3 Multimodal Dynamic Weighting

To further boost model performance and adapt the model to different generation steps, we propose a dynamic weighting mechanism that adjusts the multimodal weights step by step. Concretely, we replace $\alpha,\beta$ with dynamic $\alpha_{t},\beta_{t}$, respectively. The design follows two principles: (1) it is necessary to seek a balance between the textual control (which shifts the LM output probability) and the original LM modeling to avoid inconsistent outputs; (2) during generation, visual-relevant words ought to be encouraged, while irrelevant ones are penalized. Since the smallest comprehensible output unit of an LM is a word, we apply this mechanism at the word level.

Dynamic $\boldsymbol{\alpha_{t}}$.

To maintain the original language modeling ability while making the most of the provided textual guidance, we re-scale the textual control using a step-wise weighting calibration that incorporates the original LM output confidence $p_{\text{LM}_{t}}$. Specifically, we compute the average probability of $\hat{N}\in[1,N]$ keywords in $\textbf{C}_{T}$ under the unshifted LM output as the $t$-th textual control weight:

\[
D_{T}=\sum_{n=1}^{\hat{N}}\frac{p_{\text{LM}_{t}}\left(C_{T_{n}}\mid x_{<t}\right)}{\hat{N}},\qquad
\alpha_{t}=\min\left(\frac{D_{T}}{\lambda},\hat{\alpha}\right).
\]

If $D_{T}$ is high, keywords from $\textbf{C}_{T}$ are encouraged to appear. Since this is exactly when the unchanged base LM is highly confident in producing these words, we avoid jeopardizing output fluency while generating controlled texts.

Dynamic $\boldsymbol{\beta_{t}}$.

To reward generation steps where the words in $W_{t}^{(k)}$ are highly associated with the knowledge in $\textbf{C}_{V}$, and to penalize those that are not, we employ the average word-level similarity between the current candidate words and the visual control:

\[
D_{V}=\sum_{w\in W_{t}^{(k)}}\frac{p\left(w,\textbf{C}_{V}\right)}{k},\qquad
\beta_{t}=\min\left(\frac{D_{V}}{\lambda},\hat{\beta}\right).
\]

If $D_{V}$ is high, the words in $W_{t}^{(k)}$ are relevant to $\textbf{C}_{V}$ and should be expressed with a higher chance.

Inspired by Gu et al. (2022), $\lambda$ in this framework serves as a threshold that amplifies the control signal if $D_{V}$ or $D_{T}$ is larger than it, and vice versa. Meanwhile, $\hat{\alpha},\hat{\beta}$ are the weighting upper bounds.
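A compact sketch of both dynamic weights is given below; the default bounds follow Table 9 (MS-COCO) and $\lambda=0.2$, and the helper names are ours:

```python
import torch

def dynamic_alpha(p_lm, keyword_ids, lam=0.2, alpha_hat=2.5):
    """alpha_t = min(D_T / lambda, alpha_hat), where D_T averages the unshifted
    LM probabilities of the N_hat keywords (Sec. 3.3)."""
    d_t = p_lm[keyword_ids].mean()
    return torch.clamp(d_t / lam, max=alpha_hat)

def dynamic_beta(beam_image_sims, lam=0.2, beta_hat=1.0):
    """beta_t = min(D_V / lambda, beta_hat), where D_V averages the word-level
    similarities between the k beam candidates and C_V."""
    d_v = beam_image_sims.mean()
    return torch.clamp(d_v / lam, max=beta_hat)
```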

Model MS-COCO Flickr30k Speed
B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow
Weakly Supervised Approaches
IC-SME Laina et al. (2019) - 6.5 12.9 35.1 22.7 - - 7.9 13.0 32.8 9.9 - -
S2S-GCC Honda et al. (2021) 50.4 7.6 13.5 37.3 31.8 8.4 - - - - - - -
CapDec Nukrai et al. (2022) 69.2 26.4 25.1 51.9 91.8 - 55.5 17.7 20.0 43.9 39.1 - -
Unsupervised Approaches
CLIPRe 39.5 4.9 11.4 29.0 13.6 5.3 38.5 5.2 11.6 27.6 10.0 5.7 -
ZeroCap Tewel et al. (2022) 49.8 7.0 15.4 31.8 34.5 9.2 44.7 5.4 11.8 27.3 16.8 6.2 1.0 $\times$
MAGIC Su et al. (2022a) 56.5 12.4 17.3 39.6 48.3 11.2 43.3 6.8 12.3 30.8 20.5 6.8 26.6 $\times$
ConZIC Zeng et al. (2023) - 1.3 11.5 - 12.8 5.2 - - - - - - -
DeCap Li et al. (2023) - 8.9 17.5 - 50.6 13.1 - - - - - - -
ZeroGen 59.4 15.5 18.7 42.3 55.4 12.1 54.9 13.1 15.2 37.4 26.4 8.3 16.4 $\times$
-TDW 58.9 15.2 18.4 41.8 54.4 11.9 54.1 12.8 14.7 36.8 24.5 7.7 16.5 $\times$
-T 58.6 14.7 17.4 41.3 51.7 11.8 53.3 11.9 14.3 36.2 24.1 7.5 18.6 $\times$
-VDW 57.0 12.6 17.6 39.7 49.7 11.6 49.2 6.4 14.1 32.4 22.9 7.7 22.5 $\times$
-DW 57.0 12.6 17.6 39.7 49.7 11.6 47.7 7.1 13.8 32.3 21.9 7.6 21.6 $\times$
Table 1: Captioning results of ZeroGen with only 1 object as $\textbf{C}_{T}$ (i.e., $N=1$) on MS-COCO and Flickr30k. ZeroGen outperforms most baselines with a tolerable efficiency sacrifice. T, TDW/VDW, and DW represent textual control, textual/visual dynamic weighting, and the two dynamic weighting schemes combined, respectively.
Model B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow
ZeroGen ($N=1$) 59.4 15.5 18.7 42.3 55.4
ZeroGen ($N=2$) 60.1 15.6 18.5 42.3 55.9
ZeroGen ($N=3$) 60.4 15.6 18.6 42.3 56.5
ZeroGen ($N=4$) 60.5 15.7 18.7 42.4 57.0
ZeroGen ($N=5$) 60.6 15.8 18.7 42.4 57.1
Table 2: Captioning results of ZeroGen on MS-COCO with varied size $N$ of the textual control $\textbf{C}_{T}$.
Model B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow
ZeroGen SR 59.6 15.3 18.4 42.1 55.5
ZeroGen MP 59.9 15.2 18.3 42.0 55.2
ZeroGen WM 60.6 15.8 18.7 42.4 57.1
Table 3: Captioning results of ZeroGen with $N=5$ for $\textbf{C}_{T}$ on MS-COCO under the three $p_{t}(\textbf{C}_{T})$ options.

4 General Implementations and Baselines

General Implementations.

We take SimCTG Su et al. (2022b) as our base LM and first fine-tune it on each dataset with text-only data, as in previous works. Since ZeroGen follows the zero-shot paradigm, it can leverage any off-the-shelf LM and empower it with a pair of eyes. For the Oracles, we employ GloVe Pennington et al. (2014) as the textual Oracle $\phi_{T}$ and CLIP Radford et al. (2021) as the multimodal Oracle $\phi_{M}$. The $\hat{N}$ for $\alpha_{t}$ equals $N$ on the two captioning tasks, while $\hat{N}=2$ on the controllable news generation task, chosen via an ablation study. The amplifying factor $\lambda$ is $0.2$ throughout the paper. See Appendix A for full model details.
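As an illustration of how the two Oracles can be instantiated with public checkpoints, consider the sketch below; the specific GloVe dimensionality and CLIP variant are our assumptions, not necessarily the ones used in the paper:

```python
import gensim.downloader as api
import torch
from transformers import CLIPModel, CLIPProcessor

# Textual Oracle phi_T: GloVe vectors (the 300-d variant is an assumption).
glove = api.load("glove-wiki-gigaword-300")

# Multimodal Oracle phi_M: a CLIP checkpoint (ViT-B/32 chosen only as an example).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_text_scores(candidate_texts, image):
    """Temperature-scaled cosine similarities between candidate texts and C_V."""
    inputs = processor(text=candidate_texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    return out.logits_per_image.squeeze(0)  # one score per candidate text
```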

Baseline Models.

For the image captioning task, we select both weakly supervised and unsupervised methods as our baselines. (1) IC-SME Laina et al. (2019), S2S-GCC Honda et al. (2021), and CapDec Nukrai et al. (2022) are three weakly supervised approaches: the former two adapt neural network modules to align visual features with pseudo captions, while CapDec introduces CLIP guidance and a few images during training. (2) CLIPRe, ZeroCap Tewel et al. (2022), and MAGIC Su et al. (2022a) are three zero-shot methods, which follow a retrieval manner, CLIP-guided gradient updates, and a decoding scheme, respectively. For fair comparison, we use the same base LM as ours for ZeroCap and MAGIC. In stylized captioning, MemCap Zhao et al. (2020) is additionally considered.

Model FlickrStyle10k Romantic FlickrStyle10k Humorous
B@1 \uparrow B@3 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow B@1 \uparrow B@3 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow
MemCap Zhao et al. (2020) 21.2 4.8 8.4 - 22.4 - 19.9 4.3 7.4 - 19.4 -
ZeroCap Tewel et al. (2022) 19.3 2.7 7.6 16.5 14.9 7.0 18.4 2.7 7.7 16.5 15.6 7.7
MAGIC Su et al. (2022a) 23.3 4.9 8.6 21.7 24.4 8.6 23.7 5.2 9.0 21.2 27.8 10.1
CapDec Nukrai et al. (2022) 21.4 5.0 9.6 - 26.9 - 24.9 4.3 10.2 - 34.1 -
ConZIC Zeng et al. (2023) - 1.2 6.1 - - - - 1.2 6.1 - - -
ZeroGen 24.4 5.5 9.2 22.3 27.3 9.8 24.2 5.6 9.6 22.0 30.5 11.2
-TDW 23.5 5.4 8.7 21.9 26.1 9.0 24.2 5.6 9.6 21.9 30.5 11.2
-T 23.3 4.9 8.6 21.8 24.7 8.6 23.7 5.2 9.1 21.2 28.3 10.2
-VDW 24.0 5.5 9.0 22.1 26.9 9.4 24.1 5.6 9.5 21.9 30.2 11.1
-DW 23.4 5.1 8.7 21.8 25.0 9.0 23.8 5.3 9.1 21.4 29.1 10.3
Table 4: Stylized captioning results on two subsets of FlickrStyle10k with $N=1$. * means CapDec is a weakly supervised method that requires additional visual knowledge from several images during training.

For the controllable news generation task, we take MAGIC and MAGIC+PPLM as two baseline models. Specifically, MAGIC+PPLM is a combination of two existing PnP works that take an image and keywords as input, respectively. PPLM Dathathri et al. (2019) is the first controllable PnP LM and requires gradient descent on model hidden states at decoding time. More details and code links of baselines are available in Appendix A.5.

5 Experiments and Analysis

5.1 Image Captioning

Dataset and Metrics.

We conduct experiments on MS-COCO and Flickr30k using the Karpathy split Karpathy and Fei-Fei (2015). For the visual control, we take the image to be captioned as $\textbf{C}_{V}$. For the textual control, we take textual objects of the corresponding image as $\textbf{C}_{T}$ (textual objects are extracted from each picture ahead of generation using a pre-trained DETR Carion et al. (2020)). We use the following relevance-based metrics for evaluation: BLEU-1 (B@1), BLEU-4 (B@4) Papineni et al. (2002), METEOR (M) Denkowski and Lavie (2014), ROUGE-L (R-L) Lin and Och (2004), CIDEr Vedantam et al. (2015), and SPICE Anderson et al. (2016). Besides, we also compare the decoding speed of ZeroGen against baselines (measured sequentially on the same machine with one NVIDIA GeForce 3090 GPU).
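For reference, these relevance metrics can be computed with the pycocoevalcap toolkit roughly as sketched below; we omit the PTB tokenization step used in the official pipeline for brevity, so this is an approximation rather than the exact evaluation script:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# gts / res map an image id to the reference captions and the generated caption.
gts = {"img_0": ["a dog runs on the grass", "a brown dog running outside"]}
res = {"img_0": ["a dog running on grass"]}

bleu, _ = Bleu(4).compute_score(gts, res)    # [B@1, B@2, B@3, B@4]
cider, _ = Cider().compute_score(gts, res)
print(bleu, cider)
```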

Main Results.

Since both modal controls aim to enhance the model’s ability to understand image content and generate better captions, we consider $\textbf{C}_{T}$ a vertical augmentation (or complement) of $\textbf{C}_{V}$ in this task. From the results in Table 1, we can draw the following conclusions: (1) the proposed ZeroGen model consistently outperforms unsupervised baselines and most of the weakly supervised methods (except CapDec) by a large margin, demonstrating the superiority of the proposed method. (2) Textual guidance, as a vertical augmentation of the visual guidance, provides extra information about an image and thus promotes model performance by more than 2 absolute CIDEr points on both datasets (compared with the -T variant). (3) Both dynamic weighting techniques strengthen the model’s capacity, especially VDW; we ascribe this to its direct encouragement of tokens recognized in the image. (4) However, ZeroGen falls short in efficiency compared with MAGIC, though it is still far faster than ZeroCap. This is because additional computations are required for the multimodal controls and dynamic weighting, but, unlike ZeroCap, our model needs no gradient calculation.

We also make a series of cross-domain evaluations in Appendix B.3, which further verifies the robustness of ZeroGen across various domains.

Number of Objects in $\boldsymbol{\textbf{C}_{T}}$.

Are more objects always better? To answer this question, we conduct an experiment on MS-COCO with varied numbers of objects from the image (size $N$ of $\textbf{C}_{T}$), using word-wise max (WM) for $p(\textbf{C}_{T})$ selection. In Table 2, we can observe that, as the number of objects increases, our model generally performs better on most metrics, which verifies that more textual object guidance brings more information to the captioning task. Similar results on Flickr30k, shown in Appendix B.1, support the same conclusion.

$\boldsymbol{p(\textbf{C}_{T})}$ Selection Method.

In Table 3, the most effective method for $p(\textbf{C}_{T})$ selection is word-wise max (WM). We attribute this to WM highlighting each textual object together with all its relevant tokens in the vocabulary. While mean pooling (MP) also takes all given keywords into consideration, by weighting them equally it may introduce biases into the token similarity calculation and output control. Hence, we use WM for the remaining experiments.

Model Positive Negative Speed \uparrow
D-2 \uparrow D-4 \uparrow C-S \uparrow $\Delta\text{Acc}$ \uparrow PPL \downarrow D-2 \uparrow D-4 \uparrow C-S \uparrow $\Delta\text{Acc}$ \uparrow PPL \downarrow
Human 96.25 96.98 23.36 0.00 14.59 96.25 96.98 23.36 0.00 14.59 -
MAGIC$^{*}$ Su et al. (2022a) 95.62 95.92 20.07 - 10.01 95.62 95.92 20.07 - 10.01 25.0 $\times$
   +PPLM Dathathri et al. (2019) 74.22 81.44 20.44 11.00 29.07 74.47 83.66 20.79 18.76 27.32 1.0 $\times$
ZeroGen 72.04 79.32 18.11 22.12 12.22 76.42 83.01 19.11 31.75 13.04 9.8 $\times$
   -TDW 71.87 78.90 18.08 21.87 11.75 76.29 82.52 19.14 29.88 12.53 10.4 $\times$
   -VDW 75.44 82.06 17.56 20.50 11.62 77.80 83.84 18.20 29.63 12.62 11.7 $\times$
   -DW 81.70 86.38 17.22 19.00 12.62 77.73 83.60 18.19 29.13 12.13 12.4 $\times$
   -T 95.27 95.80 21.19 - 10.84 95.27 95.80 21.19 - 10.84 17.9 $\times$
ZeroGen w/ obj 81.56 87.93 19.42 16.37 12.93 82.23 87.93 19.66 29.76 13.30 9.8 $\times$
Table 5: Results of controllable news generation on VisNews. With $\textbf{C}_{T}$ controlling the sentiment, we regard the textual and the visual control as lateral elements in this task. Methods with * cannot be controlled w.r.t. sentiment.
Model Positive Negative
Flue.\uparrow Relv.\uparrow Sent.\uparrow/\downarrow Flue.\uparrow Relv.\uparrow Sent.\uparrow/\downarrow
MAGIC 3.37 2.77 28.7/22.0 3.85 3.13 46.0/14.7
+PPLM 2.24 2.85 34.0/7.3 3.12 3.11 52.0/10.7
ZeroGen 3.38 2.94 80.0/10.7 3.80 2.85 84.7/6.0
Table 6: Human evaluation results. Sent. scores are percentages of news that obeys/disobeys given sentiment.

5.2 Stylized Captioning

To explore our model’s capacity to adapt to different styles, such as “romantic” or “humorous”, we follow Nukrai et al. (2022) and conduct stylized-text fine-tuning of the base LM for stylized captioning.

Dataset and Metrics.

In this task we still take textual objects from images as $\textbf{C}_{T}$, and we follow the exact experimental setting of previous works Zhao et al. (2020); Nukrai et al. (2022) on the FlickrStyle10k dataset Gan et al. (2017). As for metrics, we use the same ones as in Sec. 5.1. Refer to Appendix A.3 for more detailed settings.

Main Results.

Table 4 shows quantitative results for this task. (1) ZeroGen outperforms most baselines on both stylized subsets, including the weakly supervised CapDec on Romantic and MemCap, which uses task-oriented training. (2) While it under-performs CapDec on some metrics on the Humorous data, our method produces more fluent and plausible captions with consistently higher B@3 scores. (3) On both stylized sets, textual guidance takes much of the credit for boosting model performance (compared with the -T variant), verifying the effectiveness of the proposed multimodal guidance in ZeroGen.

5.3 Controllable News Generation

Textual guidance can not only serve as a complement to visual knowledge but can also be a lateral extension. In this task, we assign textual control for news sentiment guidance and visual control for image-relevant news generation.

Dataset and Metrics.

We conduct experiments on VisNews Liu et al. (2021b). We fine-tune the base LM (i.e., SimCTG) on news data with the news title as an input prompt. We follow Dathathri et al. (2019) to obtain word lists for the two types of sentiment guidance. We evaluate four aspects. Diversity: Distinct-2 (D-2) and Distinct-4 (D-4) Li et al. (2016). Image-text relevance: the CLIP score (C-S), i.e., the image-text similarity calculated by a CLIP model. Control degree: $\Delta\text{Acc}$ (%) measures the accuracy gain of generated sentences over human-written news (human-written news in the test set consists of $62.88\%$ positive and $37.12\%$ negative content). Fluency: perplexity (PPL) measures the model output confidence.
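The diversity metrics are straightforward to reproduce; a minimal sketch of Distinct-n (computed corpus-wide here, which is one common convention) is:

```python
def distinct_n(texts, n=2):
    """Distinct-n (Li et al., 2016): unique n-grams divided by total n-grams."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

# D-2 over a toy batch of generated news sentences.
print(distinct_n(["the market rose sharply today", "the market fell again today"], n=2))
```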

For human evaluation, we take Fluency (Flue.) for content fluency, Relevance (Relv.) for image/title-news relevance, and Sentiment (Sent.) to measure sentiment control. We strictly follow a double-blind procedure in which the three annotators know nothing about the models, and we sample 100 instances per model (details of the automatic metrics and human evaluation settings are in Appendix A.4 and D, respectively).

Main Results.

From the results in Table 5, we can draw the following conclusions: (1) ZeroGen has the highest classification accuracy gain and competitive CLIP scores among all presented statistics, showing that the proposed method can successfully produce controllable outputs under both modal supervisions. However, our model generally reduces diversity, which we consider a trade-off. (2) Introducing dynamic weighting enhances overall model performance. While VDW strengthens connections between the given image and the generated news content with higher CLIP scores (C-S), TDW makes the output more recognizable w.r.t. the sentiment control without sacrificing content diversity. These findings validate the versatility and efficacy of the dynamic weighting mechanism even when the functional domains (i.e., $\textbf{C}_{T},\textbf{C}_{V}$) are not complementary. (3) External controls slightly hurt our model’s output confidence, with a somewhat higher PPL than MAGIC, yet ZeroGen still largely outperforms MAGIC+PPLM, its only controllable counterpart, on PPL. Also, ZeroGen without parts of the dynamic weighting (e.g., -VDW) still outgains MAGIC+PPLM on both controllability and diversity metrics. (4) ZeroGen requires no task-oriented training and is therefore nearly 10 times faster at decoding than MAGIC+PPLM.

We also present human evaluation results in Table 6 (the average Fleiss’s Kappa Fleiss and Cohen (1973) is 0.28, indicating that the three annotators reached fair agreement), which further verify our findings above.

$\hat{\alpha}$ 5.0 8.0 10.0
Pos. D-2 \uparrow 91.60 72.04 52.37
D-4 \uparrow 95.00 79.32 60.00
C-S \uparrow 19.96 18.11 17.19
$\Delta\text{Acc}$ \uparrow 10.75 22.12 26.62
PPL \downarrow 11.76 12.22 10.28
Neg. D-2 \uparrow 92.95 76.42 52.09
D-4 \uparrow 95.87 83.01 60.07
C-S \uparrow 21.59 19.11 18.81
$\Delta\text{Acc}$ \uparrow 11.26 31.75 51.26
PPL \downarrow 11.71 13.04 11.52
Table 7: Effect of different $\hat{\alpha}$ on the news generation task.

Effect of $\boldsymbol{\alpha}$ Upper Bound.

$\textbf{C}_{T}$ is the only source of sentiment manipulation in this task, and its weighting upper bound decides how distinguishable an output sentence is w.r.t. sentiment. In Table 7, as $\hat{\alpha}$ increases, both accuracy and fluency gain significant benefits. However, text diversity and the image-relevance indicator fall precipitously. We explain this phenomenon as follows: more sentiment guidance makes the model inclined to express only the desired sentiment, trading away imagination associated with the image and diversity. At the user end, $\hat{\alpha}$ can be tuned per task to fit different situations.

$\boldsymbol{\textbf{C}_{T}}$ Plays Two Roles.

So far we have examined $\textbf{C}_{T}$ either as a complement (captioning) or as an additional control element (news generation). Can $\textbf{C}_{T}$ play both roles well at the same time? We run ZeroGen with both image objects and sentiment words as textual guidance; this variant is marked “w/ obj”. Though ZeroGen w/ obj reaches a higher CLIP score, its accuracy drops and its PPL rises compared with the variant without textual object guidance, and a similar CLIP-score gain can also be obtained by adjusting $\hat{\alpha}$ as mentioned earlier. That is to say, $\textbf{C}_{T}$ may be confused when asked to act as both a complement and a lateral extension of the image at the same time.

Figure 3: An example of news from varied models. We highlight Positive and Negative words respectively.

Case Analysis.

We exhibit an example of generated controllable news in Figure 3. The image shows Culture Secretary Jeremy Hunt giving a talk. All methods are able to produce image/title-relevant sentences, but MAGIC+PPLM generates some false evidence, such as recognizing Jeremy Hunt as the “leader of Conservative and Nationalist Labour groups”. Besides, our ZeroGen produces more diverse and controllable words, such as “good” and “benefits” for positive and “deadbeat” and “lose” for negative, while MAGIC+PPLM fails to fulfill the control aspect. More cases are exhibited in Appendix C.

6 Conclusion

In this paper, we present ZeroGen, a paradigm of zero-shot controllable text generation with multimodal signals. We explicitly separate visual control and textual control into sentence-level and token-level guidance, and use two Oracles to unify the control signals into the LM output probability space. A dynamic weighting mechanism is applied to adapt to all multimodal controls, which further boosts the model’s generation ability. Three tasks, from captioning to controllable news generation, justify the effectiveness of ZeroGen and help us explore the relationship between distinct signals. By providing multimodal knowledge, we demonstrate that LMs without task-specific training can achieve strong performance on multimodal tasks across different setups and domains.

7 Limitations

Although ZeroGen successfully achieves zero-shot controllable text generation, our technique is still subject to a few limitations to be addressed in follow-up work. (1) There is still a large gap between weakly and fully supervised methods. We believe the rich semantic information contained in the large-scale pre-trained Oracles we employ can further narrow it. (2) The diversity in our controllable news generation task is insufficient. Since this is a widespread problem in zero-shot research, we plan to alleviate the issue by incorporating more diverse language decoding schemes Xu et al. (2022) or partial training of parameters in the model, such as adapters Houlsby et al. (2019). (3) The existence of spurious correlations Tu et al. (2020); Chai et al. (2022) in bad cases (as shown in Appendix C) is non-negligible; one of our future work directions is to handle it by introducing causal inference Pearl (2009).

8 Ethics Statement

We are well aware that text generation technologies may be abused to create deceptive, harmful, or objectionable content. For our ZeroGen, we can conduct experiments on detoxification datasets Gehman et al. (2020) to make it a useful tool for combating hate speech and eliminating harmful information in PLMs. As we are considering components to make our method more robust and effective in multimodal controllable tasks, we believe it is meaningful and beneficial to progress research on controllable text generation.

References

  • Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In European conference on computer vision, pages 382–398. Springer.
  • Anderson et al. (2018) Peter Anderson, Stephen Gould, and Mark Johnson. 2018. Partially-supervised image captioning. Advances in Neural Information Processing Systems, 31.
  • Bisk et al. (2020) Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, et al. 2020. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8718–8735.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer.
  • Chai et al. (2022) Junyi Chai, Reid Pryzant, Victor Ye Dong, Konstantin Golobokov, Chenguang Zhu, and Yi Liu. 2022. Fast: Improving controllability for text generation with feedback aware self-training. arXiv preprint arXiv:2210.03167.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR.
  • Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.
  • Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380.
  • Duan et al. (2020) Yuguang Duan, Canwen Xu, Jiaxin Pei, Jialong Han, and Chenliang Li. 2020. Pre-train and plug-in: Flexible conditional text generation with variational auto-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 253–262.
  • Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. EMNLP 2017, page 94.
  • Fleiss and Cohen (1973) Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and psychological measurement, 33(3):613–619.
  • Gan et al. (2017) Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, and Li Deng. 2017. Stylenet: Generating attractive visual captions with styles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3137–3146.
  • Gehman et al. (2020) Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. 2020. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369.
  • Gu et al. (2022) Yuxuan Gu, Xiaocheng Feng, Sicheng Ma, Jiaming Wu, Heng Gong, and Bing Qin. 2022. Improving controllable text generation with position-aware weighted decoding. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3449–3467.
  • Harnad (1990) Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346.
  • Hartmann et al. (2022) Jochen Hartmann, Mark Heitmann, Christian Siebert, and Christina Schamp. 2022. More than a feeling: Accuracy and application of sentiment analysis. International Journal of Research in Marketing.
  • Honda et al. (2021) Ukyo Honda, Yoshitaka Ushiku, Atsushi Hashimoto, Taro Watanabe, and Yuji Matsumoto. 2021. Removing word-level spurious alignment between images and pseudo-captions in unsupervised image captioning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3692–3702.
  • Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR.
  • Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
  • Keskar et al. (2019) Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  • Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. 2016. Controlling output length in neural encoder-decoders. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1328–1338.
  • Krause et al. (2021) Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. 2021. Gedi: Generative discriminator guided sequence generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4929–4952.
  • Laina et al. (2019) Iro Laina, Christian Rupprecht, and Nassir Navab. 2019. Towards unsupervised image captioning with shared multimodal embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7414–7424.
  • Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.
  • Li et al. (2023) Wei Li, Linchao Zhu, Longyin Wen, and Yi Yang. 2023. Decap: Decoding clip latents for zero-shot captioning via text-only training. arXiv preprint arXiv:2303.03032.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
  • Lin and Och (2004) Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 605–612.
  • Liu et al. (2021a) Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A Smith, and Yejin Choi. 2021a. Dexperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706.
  • Liu et al. (2021b) Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. 2021b. Visual news: Benchmark and challenges in news image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 6761–6771. Association for Computational Linguistics.
  • Liu et al. (2022) Guangyi Liu, Zeyu Feng, Yuan Gao, Zichao Yang, Xiaodan Liang, Junwei Bao, Xiaodong He, Shuguang Cui, Zhen Li, and Zhiting Hu. 2022. Composable text controls in latent space with odes. arXiv preprint arXiv:2208.00638.
  • Mai et al. (2020) Florian Mai, Nikolaos Pappas, Ivan Montero, Noah A Smith, and James Henderson. 2020. Plug and play autoencoders for conditional text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6076–6092.
  • Nukrai et al. (2022) David Nukrai, Ron Mokady, and Amir Globerson. 2022. Text-only training for image captioning using noise-injected clip. arXiv preprint arXiv:2211.00575.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
  • Pascual et al. (2021) Damian Pascual, Beni Egressy, Clara Meister, Ryan Cotterell, and Roger Wattenhofer. 2021. A plug-and-play method for controlled text generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3973–3997.
  • Pearl (2009) Judea Pearl. 2009. Causal inference in statistics: An overview. Statistics surveys, 3:96–146.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  • Su et al. (2022a) Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier. 2022a. Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655.
  • Su et al. (2022b) Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. 2022b. A contrastive framework for neural text generation. arXiv preprint arXiv:2202.06417.
  • Tewel et al. (2022) Yoad Tewel, Yoav Shalev, Idan Schwartz, and Lior Wolf. 2022. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17918–17928.
  • Tu et al. (2022) Haoqin Tu, Zhongliang Yang, Jinshuai Yang, Siyu Zhang, and Yongfeng Huang. 2022. Pcae: A framework of plug-in conditional auto-encoder for controllable text generation. Knowledge-Based Systems, 256:109766.
  • Tu et al. (2020) Lifu Tu, Garima Lalwani, Spandana Gella, and He He. 2020. An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics, 8:621–633.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Xu et al. (2022) Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. arXiv preprint arXiv:2206.02369.
  • Yang and Klein (2021) Kevin Yang and Dan Klein. 2021. Fudge: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3511–3535.
  • Zeng et al. (2023) Zequn Zeng, Hao Zhang, Zhengjue Wang, Ruiying Lu, Dongsheng Wang, and Bo Chen. 2023. Conzic: Controllable zero-shot image captioning by sampling-based polishing. arXiv preprint arXiv:2303.02437.
  • Zhao et al. (2020) Wentian Zhao, Xinxiao Wu, and Xiaoxun Zhang. 2020. Memcap: Memorizing style knowledge for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12984–12992.

Appendix A Implementation Details

A.1 General Model Details

For the base LM, we use SimCTG Su et al. (2022b), which is extended from a pre-trained GPT-2 Radford et al. (2019) model. SimCTG essentially consists of contrastive training and contrastive search. During training, it introduces an $\mathcal{L}_{\mathrm{CL}}$ term to learn discriminative and isotropic token representations:

\[
\mathcal{L}_{\mathrm{CL}}=\frac{1}{V\times(V-1)}\sum_{i=1}^{V}\sum_{j=1,j\neq i}^{V}\max\left\{0,\,\rho-s\left(h_{x_{i}},h_{x_{i}}\right)+s\left(h_{x_{i}},h_{x_{j}}\right)\right\}, \tag{1}
\]

where $V$ is the vocabulary size, $s(\cdot,\cdot)$ is the similarity function, $h_{x_{i}}$ is the LM hidden state of token $x_{i}$, and $\rho$ is a pre-defined margin. The final training objective of the LM then becomes:

\[
\mathcal{L}_{\text{SimCTG}}=\mathcal{L}_{\mathrm{MLE}}+\mathcal{L}_{\mathrm{CL}}, \tag{2}
\]

with $\mathcal{L}_{\mathrm{MLE}}$ being the vanilla MLE objective of the LM.
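For completeness, a minimal PyTorch sketch of the contrastive term in Eq. (1) is given below; we use cosine similarity for $s(\cdot,\cdot)$, as in SimCTG, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def simctg_contrastive_loss(hidden, rho=0.5):
    """L_CL of Eq. (1): a margin loss over pairwise token-representation similarities.

    hidden: (V, d) hidden states h_{x_i}; rho is the pre-defined margin
    (0.5 is the value used for fine-tuning in Appendix A.3).
    """
    h = F.normalize(hidden, dim=-1)
    sim = h @ h.t()                                  # s(h_{x_i}, h_{x_j}) for all pairs
    v = sim.size(0)
    diag = torch.diagonal(sim).unsqueeze(1)          # s(h_{x_i}, h_{x_i})
    margin = torch.clamp(rho - diag + sim, min=0.0)  # max{0, rho - s(h_i,h_i) + s(h_i,h_j)}
    margin = margin - torch.diag(torch.diagonal(margin))  # drop the j = i terms
    return margin.sum() / (v * (v - 1))
```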

As for contrastive decoding, at decoding time $t$, the token to be generated is formalized as:

\[
S_{\text{SimCTG}}(x_{<t})=(1-\eta)\times\underbrace{p_{\theta}\left(w\mid x_{<t}\right)}_{\text{model confidence}}-\eta\times\underbrace{\left(\max\left\{s\left(h_{w},h_{x_{j}}\right):1\leq j\leq t-1\right\}\right)}_{\text{degeneration penalty}},
\]
\[
x_{t}=\underset{w\in W_{t}^{(k)}}{\arg\max}\,S_{\text{SimCTG}}(x_{<t}),
\]

where $\eta$ is a parameter that balances generation diversity and consistency. For our ZeroGen, we then have the following decoding objective based on the shifted LM output $p^{\prime}_{\text{LM}_{t}}$:

\[
x_{t}=\underset{w\in W_{t}^{(k)}}{\arg\max}\left\{S_{\text{SimCTG}}(x_{<t})+\beta_{t}\times S_{t}(w,\textbf{C}_{V}\mid x_{<t})\right\},
\]

here $W_{t}^{(k)}$ is grouped from $p^{\prime}_{\text{LM}_{t}}$ and $S_{t}$ is the MAGIC score introduced in Sec. 3.2.

When applying ZeroGen, several parameters should be decided in advance: $k$ in $W_{t}^{(k)}$, $\eta$ for contrastive decoding, and $\beta,\alpha,\hat{\beta},\hat{\alpha}$ for the dynamic weighting mechanism. We present detailed parameters in Table 9 to aid reproducibility and show the workflow of our system in Algorithm 1.

We run all the experiments on the same machine with one NVIDIA GeForce RTX 3090 GPU with 24G memory and one Intel 3.70GHz i9-10900K CPU. We will release the code of all methods (including baselines) and the dataset processing once the paper is accepted.

Algorithm 1 ZeroGen

Input: Visual control $\textbf{C}_{V}$, textual control $\textbf{C}_{T}$
Output: Generated content $\textbf{X}=[x_{1},x_{2},...]$

1: initialize $\textbf{V}$; //LM vocabulary
2: initialize $\phi_{M},\phi_{T}$; //Oracles
3: initialize $\hat{\beta},\hat{\alpha},k,\lambda$; //hyper-parameters
4: compute $p(\textbf{V},\textbf{C}_{T})$ using $\phi_{T}$;
5: $x_{0}\leftarrow\texttt{[BOS]}$, $\textbf{X}\leftarrow[x_{0}]$, $t\leftarrow 0$;
6: while $x_{t}\neq\texttt{[EOS]}$ do
7:     $t\leftarrow t+1$;
8:     compute $p_{\text{LM}_{t}}$ using the base LM and $x_{<t}$;
9:     compute $p_{t}(\textbf{C}_{T})$ using $p(\textbf{V},\textbf{C}_{T})$;
10:     $D_{T}\leftarrow\sum_{n}p_{\text{LM}_{t}}(C_{T_{n}}\mid x_{<t})/N$;
11:     $\alpha_{t}\leftarrow\min(D_{T}/\lambda,\hat{\alpha})$;
12:     $p^{\prime}_{\text{LM}_{t}}\leftarrow p_{\text{LM}_{t}}+\alpha_{t}\times p_{t}(\textbf{C}_{T})$;
13:     $D_{V}\leftarrow\sum_{w\in W_{t}^{(k)}}p_{\phi_{M}}(w_{t},\textbf{C}_{V})/k$;
14:     $\beta_{t}\leftarrow\min(D_{V}/\lambda,\hat{\beta})$;
15:     compute $S_{t}(w,\textbf{C}_{V}\mid x_{<t})$ using $\phi_{M}$;
16:     $x_{t}\leftarrow\arg\max_{w}S_{t}(w,\textbf{C}_{V}\mid x_{<t})$;
17:     add $x_{t}$ to content $\textbf{X}$;
18: return generated content $\textbf{X}$;
Dataset Train Val Test # Voc # Len
F10k Humor 6,000 1,000 1,000 7,186 14.07
F10k Romantic 6,000 1,000 1,000 6,434 14.55
VisNews 13,098 200 800 23,274 136.20
Table 8: Detailed statistics of the data employed in our tasks. # Voc and # Len represent the vocabulary size and average sentence length of the current dataset. F10k represents the FlickrStyle10k dataset.
# Params / Data MS-COCO Flickr30k F10k Romantic F10k Humor VisNews
$k$ (int) 45 25 45 45 5
$\eta$ (float) 0.10 0.10 0.10 0.10 0.65
$\alpha$ (float) 1.0 2.0 1.0 1.0 8.0
$\beta$ (float) 1.0 1.0 1.0 1.0 1.0
$\hat{\alpha}$ (float) 2.5 2.0 3.0 2.5 8.0
$\hat{\beta}$ (float) 1.0 0.5 0.5 0.5 0.5
$N$ (int) 1$\sim$5 1$\sim$5 1 1 40
Table 9: Detailed parameters of ZeroGen for different tasks.

A.2 Image Captioning Details

For both the MS-COCO and Flickr30k datasets, we take the Karpathy split Karpathy and Fei-Fei (2015) and use the train and valid sets for base LM training and the test set for task evaluation. In detail, the MS-COCO dataset is under a Creative Commons Attribution 4.0 License and is publicly available at https://cocodataset.org, while Flickr30k is under a Creative Commons Public Domain Dedication License and is publicly available at https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset. For the base LM, we load the publicly available pre-trained language model weights (https://huggingface.co/cambridgeltl/magic_flickr30k and https://huggingface.co/cambridgeltl/magic_mscoco) and set the maximum decoding length to 16 for this task.
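A minimal sketch of loading one of these released checkpoints as the base LM and reading out $p_{\text{LM}_{t}}$ is given below; treating the checkpoint as a standard GPT-2 causal LM is our assumption:

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/magic_mscoco")
lm = GPT2LMHeadModel.from_pretrained("cambridgeltl/magic_mscoco").eval()

prompt = tokenizer("<|endoftext|>", return_tensors="pt")
with torch.no_grad():
    logits = lm(**prompt).logits             # (1, seq_len, V)
p_lm = torch.softmax(logits[0, -1], dim=-1)  # p_LM_t over the vocabulary for the next token
```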

A.3 Stylized Captioning Details

For the stylized captioning and controllable news generation tasks, the FlickrStyle10k dataset is introduced by Gan et al. (2017) under an unknown license; it can be publicly downloaded from https://zhegan27.github.io/Papers/FlickrStyle_v0.9.zip. On the model side, we follow Zhao et al. (2020) to randomly sample 6,000 instances from the original corpus as training data and 1,000 as test data. Their detailed statistics are shown in Table 8. Following Nukrai et al. (2022), we fine-tune two base LMs on the two training sets respectively to achieve stylized outputs. For base LM fine-tuning, we use SimCTG with a margin $\rho$ of 0.5 and a learning rate of 1e-5, training until the loss no longer decreases on the valid set. We set the maximum sentence length to 128 for base LM training and 25 for decoding.

A.4 Controllable News Generation Details

For the dataset and metrics, we conduct experiments based on VisNews Liu et al. (2021b). This dataset is under an unknown license and can be acquired by asking the authors directly (https://github.com/FuxiaoLiu/VisualNews-Repository). Specifically, for image-text data, we sampled 13,000, 200, and 800 image-news pairs from the original VisNews dataset as the train, valid, and test sets. We use the train and valid sets for base LM (i.e., SimCTG) fine-tuning with the news title as an input prompt. The test set is employed for the final evaluation. Details are shown in Table 8. The maximum training news length is 200, and the maximum decoding length is set to 130.

Refer to caption
Figure 4: Classification accuracy (Prob) and CLIP-score (C-S) with varied topic word number $N$.

For the word bags of the two sentiments, we follow Dathathri et al. (2019) and use the “happiness” and “negative words” lists for positive and negative sentiment guidance respectively (the word lists are downloaded from www.enchantedlearning.com/wordlist). For evaluation, we compute the CLIP score (C-S) with a CLIP model different from the one that guides multimodal generation in ZeroGen. For $\Delta$Acc (%), we take the pre-trained SiEBERT model Hartmann et al. (2022), which achieves SOTA performance on the SST-2 dataset Socher et al. (2013), as the classifier.
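The two automatic metrics can be sketched as follows. We assume the CLIP score is the cosine similarity between CLIP image and text embeddings, that the evaluation CLIP checkpoint is a standard public one ("openai/clip-vit-base-patch32" below is only a placeholder for a model different from the guiding CLIP), and that the public "siebert/sentiment-roberta-large-english" checkpoint corresponds to the SiEBERT classifier; these choices are our assumptions, not details confirmed by the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, pipeline

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clf = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")

def clip_score(image_path, text):
    # Cosine similarity between CLIP image and text embeddings (C-S).
    inputs = proc(text=[text], images=Image.open(image_path),
                  return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(-1).item()

def sentiment_accuracy(texts, target_label):
    # Fraction of generations classified with the intended sentiment label.
    preds = clf(texts, truncation=True)
    return sum(p["label"].upper() == target_label.upper() for p in preds) / len(texts)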

As shown in Table 9, we set the topic word number $N$ to 40 for this task, meaning that at every decoding step we feed 40 sentiment words as the textual control $\textbf{C}_T$. We conduct ablation experiments on $N$ with respect to classification accuracy and CLIP score in Figure 4. As $N$ increases, the degree of sentiment control rises (higher accuracy) while the image-news relevance declines (lower C-S). This is because providing more words of one sentiment as $\textbf{C}_T$ makes the model focus more on sentiment control at the cost of some image-related generation ability.

Model MS-COCO Flickr30k
B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow
ZeroGen ($N=1$) 59.4 15.5 18.7 42.3 55.4 12.1 54.9 13.1 15.2 37.4 26.4 8.3
ZeroGen ($N=2$) 60.1 15.6 18.5 42.3 55.9 11.9 55.3 13.3 15.4 37.7 27.5 8.3
ZeroGen ($N=3$) 60.4 15.6 18.6 42.3 56.5 12.1 55.0 13.1 15.2 37.4 27.1 8.2
ZeroGen ($N=4$) 60.5 15.7 18.7 42.4 57.0 12.1 54.5 13.1 15.2 37.5 27.2 8.3
ZeroGen ($N=5$) 60.6 15.8 18.7 42.4 57.1 12.1 55.0 13.0 15.2 37.5 27.3 8.2
Table 10: Captioning results of ZeroGen when only the number of objects $N$ used as textual control is varied for each generation.
Model MS-COCO Flickr30k
B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow
ZeroGen SR 59.6 15.3 18.4 42.1 55.5 11.8 54.8 13.1 15.0 37.2 26.7 8.1
ZeroGen MP 59.9 15.2 18.3 42.0 55.2 11.8 54.8 13.2 15.1 37.5 26.7 8.1
ZeroGen WM 60.6 15.8 18.7 42.4 57.1 12.1 55.3 13.3 15.4 37.7 27.5 8.3
Table 11: Captioning results of ZeroGen when only the $p_t(\textbf{C}_T)$ selection method is varied. WM, MP, and SR denote Word-wise Max, Mean Pooling, and Step-wise Random in Sec. 3.1, respectively.
Model MS-COCO \implies Flickr30k Flickr30k \implies MS-COCO
B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow B@1 \uparrow B@4 \uparrow M \uparrow R-L \uparrow CIDEr \uparrow SPICE \uparrow
ZeroCap 49.2 6.2 11.9 29.3 18.3 - 46.3 6.0 13.7 30.1 27.3 -
MAGIC 46.4 6.2 12.2 31.3 17.5 5.9 41.4 5.2 12.5 30.7 18.3 5.7
ZeroGen 50.5 8.1 13.1 34.5 17.3 6.0 46.9 7.6 14.0 34.4 26.1 6.8
-TDW 50.1 8.0 12.7 34.0 17.0 5.8 46.2 7.1 13.5 33.7 23.9 6.2
-T 49.3 8.1 12.5 33.7 16.7 5.8 45.1 6.7 13.1 33.3 22.4 5.9
-VDW 49.3 7.2 13.0 33.5 18.5 6.2 43.8 6.2 13.5 32.6 24.5 6.3
-DW 48.2 7.1 12.5 32.8 17.4 5.9 43.7 6.2 13.5 32.6 24.4 6.3
Table 12: Cross-domain results on two image-caption datasets MS-COCO and Flickr30k.

A.5 Baseline Model Details

For IC-SME, S2S-GCC and CapDec, we directly take the results from their respective papers. For ZeroCap, we take the official implementation from https://github.com/YoadTew/zero-shot-image-to-text and use its default parameter setting for the captioning tasks. For MAGIC, we take the official code from https://github.com/yxuansu/MAGIC to reproduce the results.

For the MAGIC+PPLM implementation, we consider two codebases: the official PPLM code from https://github.com/uber-research/PPLM and a simpler PPLM version from https://github.com/hit-scma/CAT-PAW. We add the MAGIC procedure at the decoding stage of PPLM to form MAGIC+PPLM, and provide a minimal reproducible implementation in this anonymous repository: https://anonymous.4open.science/r/Pplm_Magic-3E15. For positive sentiment control, we run 5 iterations per LM hidden-state update with a step size of 0.03, and 15 iterations for negative control; we therefore use (15+5)/2=10 iterations for the decoding-time measurement in Sec. 5.3. We use the same SimCTG model as ZeroGen on VisNews for MAGIC+PPLM generation, with a maximum decoding length of 130 and $k=5$ for the MAGIC search, as in our method. Other hyper-parameters take the default values from the official code repositories.
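At a high level, one decoding step of MAGIC+PPLM interleaves the two procedures: PPLM perturbs the cached LM hidden states toward the sentiment bag of words (5 iterations for positive, 15 for negative, step size 0.03), and MAGIC then selects the next token from the top-$k=5$ candidates using the image. The sketch below is schematic only; the helper callables are stand-ins supplied by the caller, not APIs of the linked repositories.

def magic_pplm_step(perturb_fn, logits_fn, magic_select_fn,
                    past, context_ids, image_feats, bow_ids,
                    negative=False, k=5, step_size=0.03):
    # Schematic MAGIC+PPLM decoding step; all *_fn arguments are placeholders.
    num_iters = 15 if negative else 5              # hidden-state update iterations
    for _ in range(num_iters):                     # PPLM: nudge cached hidden states
        past = perturb_fn(past, context_ids, bow_ids, step_size)
    logits = logits_fn(past, context_ids)          # next-token logits after perturbation
    candidates = logits.topk(k).indices.tolist()   # MAGIC: restrict to top-k candidates
    return magic_select_fn(candidates, context_ids, image_feats)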

Appendix B Additional Experimental Results

B.1 Number of Objects in $\textbf{C}_T$ for Captioning

We display the full results of varying the number of objects in $\textbf{C}_T$ (i.e., $N$) for the captioning task in Table 10. For images with fewer than $N$ extracted objects, we simply use all of their extracted objects as the textual control $\textbf{C}_T$. We observe that, on both datasets, more textual guidance can bring performance gains. Nevertheless, adding more objects does not necessarily improve the metrics. For instance, on Flickr30k, the model with $N=2$ performs best among all $N$ settings. This may be because too much textual guidance confuses ZeroGen on relatively easy tasks (i.e., shorter captions and a smaller vocabulary).

B.2 $p(\textbf{C}_T)$ Selection Method in Captioning

We also present the full results of the $p(\textbf{C}_T)$ selection methods in Table 11. On both datasets, the results are consistent and indicate that WM is the best way to compute $p_t(\textbf{C}_T)$ at the $t$-th step.
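The three strategies can be read as different ways of collapsing the per-keyword token distributions into a single distribution at step $t$. The sketch below reflects our reading of the strategy names (Word-wise Max as an element-wise maximum, Mean Pooling as an average, Step-wise Random as sampling one keyword per step); the exact definitions are given in Sec. 3.1, and the assumed input format is ours.

import torch

def aggregate_keyword_probs(word_probs, method="WM"):
    # word_probs: (num_keywords, vocab_size) tensor of keyword-conditioned
    # token probabilities at step t (assumed input format).
    if method == "WM":        # Word-wise Max: element-wise max over keywords
        return word_probs.max(dim=0).values
    if method == "MP":        # Mean Pooling: average over keywords
        return word_probs.mean(dim=0)
    if method == "SR":        # Step-wise Random: one keyword sampled per step
        idx = torch.randint(word_probs.size(0), (1,)).item()
        return word_probs[idx]
    raise ValueError(f"unknown method: {method}")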

B.3 Image Captioning in Cross Domain

For cross-domain captioning evaluation, we follow the setting in Su et al. (2022b) and fine-tune the base LM on one dataset (e.g., MS-COCO) while evaluating its captioning capacity on another dataset (e.g., MS-COCO $\implies$ Flickr30k). The results in Table 12 corroborate the findings from the in-domain evaluations in Sec. 5.1.

Refer to caption
Figure 5: A good case of news generated by our models compared with baselines.
Refer to caption
Figure 6: A good case of news generated by our models compared with baselines.

Appendix C More Cases of ZeroGen

In the presented cases, we highlight Positive words and Negative words respectively.

Good Cases.

As shown in Figures 5 and 6, the proposed method is capable of generating image-related and sentiment-controllable texts even with very few prompt words (Figure 5). Moreover, ZeroGen generally produces more diverse sentiment-specific words, such as “beautiful”, “unique”, “great-looking” and “good” for positive sentiment and “terrible”, “damaged”, “dirty” and “evil” for negative sentiment. The compared baselines are unable to generate controllable news: for instance, MAGIC+PPLM generates very short texts under negative sentiment for the Hair today image in Figure 5 and negative content under positive sentiment in Figure 6.

Bad Cases.

We present bad cases of ZeroGen in Figures 7 and 8. In both cases, we observe generation biases that may be caused by spurious correlations in the dataset Tu et al. (2020); Chai et al. (2022). In Figure 7, the image depicts a smiling woman in a flowered sweater, so the visual control may be confounded with the textual control when $\textbf{C}_T$ includes negative words (given that no task-oriented training is conducted). Our method struggles to generate purely negative content from this image and the negative keywords. In Figure 8, the title Morrisons faces gloomy week gives away the sentiment preference (i.e., negative) associated with the image. Similarly, both ZeroGen and MAGIC+PPLM fail to generate news with only positive sentiment given positive textual control. We plan to explore causal solutions such as self-training and causal intervention Pearl (2009); Chai et al. (2022) to address this issue in the future.

Refer to caption
Figure 7: A bad case of news generated by our models compared with baselines.
Refer to caption
Figure 8: A bad case of news generated by our models compared with baselines.

Appendix D Human Evaluation

For annotators, we hire three graduate students from America or China with fluent English reading skills. Each annotator is assigned 100 (instances) × 3 (models) × 3 (aspects) = 900 rating tasks, resulting in 900 (tasks) × 3 (annotators) = 2,700 human ratings in total. We use a three-scale scheme (i.e., scores 1 and 2 are merged into the poor class, 3 into the moderate class, and 4 and 5 into the good class) for Fluency and Relevance to compute Fleiss’s kappa Fleiss and Cohen (1973). The annotators acknowledged how the annotated data would be used and were paid an average annotation salary. All annotators were aware of the potential risks and ethical concerns of machine-generated texts.
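A minimal sketch of the agreement computation is given below, assuming ratings are stored as an items-by-annotators matrix of 1-5 scores and that statsmodels is used for Fleiss’s kappa; the paper does not specify the tooling.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def three_scale(score):
    # Map a 1-5 rating onto poor (0), moderate (1) or good (2).
    return 0 if score <= 2 else (1 if score == 3 else 2)

def rating_agreement(ratings):
    # ratings: (n_items, n_annotators) array of raw 1-5 scores (assumed layout).
    mapped = np.vectorize(three_scale)(np.asarray(ratings))
    table, _ = aggregate_raters(mapped)    # per-item counts over the three classes
    return fleiss_kappa(table)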

Annotation Instruction

Here we present the human evaluation standard:

Fluency (Flue.): Whether the generated news is fluent and easy to understand.

  1. The system’s result does not make sense and it is unreadable.

  2. Choose this score when you are hesitant between score 1 and score 3.

  3. The system’s result contains minor errors but they do not affect your understanding.

  4. Choose this score when you are hesitant between score 3 and score 5.

  5. The system’s result is human-like, grammatically correct, and very easy to understand.

Relevance (Relv.): Whether the generated news is related to the given image and the corresponding title.

  1. The system’s result is completely irrelevant to the given image.

  2. Choose this score when you are hesitant between score 1 and score 3.

  3. The system’s result is partially related to the image and some of its content can be found in the image.

  4. Choose this score when you are hesitant between score 3 and score 5.

  5. The system’s result is very related to the given image and contains a diverse set of concepts in the image.

Sentiment (Sent.): Whether the generated news has a positive or negative sentiment.

  • Positive: The system’s result has a positive sentiment.

  • Negative: The system’s result has a negative sentiment.

  • Can’t Tell: The system’s result is neither negative nor positive.