
Align Vision-Language Semantics by Multi-task Learning for Multi-Modal Summarization

Chenhao Cui2*, Xinnian Liang1*, Shuangzhi Wu2, Zhoujun Li1†
*Equal contribution. †Corresponding author.
1State Key Lab of Software Development Environment, Beihang University, Beijing, China
2Tencent Inc., Beijing, China
{xnliang,cuich,lzj}@buaa.edu.cn,{frostwu}@tencent.com
Abstract

Most current multi-modal summarization methods follow a cascaded manner, where an off-the-shelf object detector is first used to extract visual features, which are then fused with language representations for the decoder to generate the text summary. However, the cascaded approach employs separate encoders for the different modalities, which makes it hard to learn a joint vision-language representation. In addition, existing methods ignore the semantic alignment between paragraphs and images, which is crucial for a precise multi-modal summary. To tackle these issues, we propose ViL-Sum, which jointly models paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. ViL-Sum contains two components for learning and aligning multi-modal semantics: a joint multi-modal encoder, and two well-designed auxiliary tasks for multi-task learning, namely image reordering and image selection. Specifically, the joint multi-modal encoder converts images into visual embeddings and attaches them to the text embeddings as the encoder input. The reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides it to select summary-related images for the final summary. Experimental results show that ViL-Sum significantly outperforms current state-of-the-art methods. Further analysis shows that the two auxiliary tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations.

1 Introduction

The dramatic increase of multi-modal data (including text, image, audio, and video) on the Internet makes research on multi-modal summarization necessary. Multi-modal summarization aims to generate a condensed summary that covers the salient information from inputs in one or more modalities Evangelopoulos et al. (2013); Li et al. (2017). Different from traditional pure-text summaries, Zhu et al. (2018) pointed out that summaries containing both text and images can effectively increase user satisfaction. Intuitively, people can grasp key information more easily from multiple modalities than from text alone. This task is defined as multi-modal summarization with multi-modal output (MSMO). Figure 2 shows an example: the input is text with images, and the output is one summary with two selected images.

Figure 1: Multi-modal summarization models with different encoder structures.

Recent single-modal summarization models typically employ an encoder-decoder framework with a transformer structure Zhang et al. (2019); Lewis et al. (2020). Existing multi-modal models usually add separate encoders for the other modalities to this single-modal encoder-decoder framework Li et al. (2017, 2018); Chen and Zhuge (2018); Zhu et al. (2018); Khullar and Arora (2020); Zhu et al. (2020); Im et al. (2021). We show their widely-used structures in Figures 1a and 1b. The representations of the different modalities are obtained separately from single-modal encoders, so the model cannot effectively capture the interaction between them. Recently, some works have tried to enhance image-text interaction by adding interactive modules or auxiliary tasks Zhu et al. (2020); Zhang et al. (2021a).

Figure 2: An example explaining the semantic alignment between images and paragraphs in a document. “…” means some content is omitted. Each image is aligned with the paragraph on its right.

However, previous works ignore paragraph-level vision-language semantic alignment; an example is shown in Figure 2, where the semantics of each paragraph corresponds closely to the image on its left. Besides, joint vision-language encoding, which has been proven effective on many multi-modal natural language understanding (NLU) tasks (e.g. Visual Question Answering) Li et al. (2020a, b, c); Zhang et al. (2021b); Xu et al. (2021); Zhou et al. (2020), has not been well applied to multi-modal summarization.

To address these deficiencies, we propose the Vision-Language Summarization model ViL-Sum, built on a universal transformer-based encoder-decoder structure. The core of ViL-Sum is a joint multi-modal encoder with two well-designed tasks, image reordering and image selection, which guide the model to learn better vision-language representations and to capture paragraph-level vision-language semantic alignment. Specifically, we use a backbone (e.g. ViT Dosovitskiy et al. (2021)) to convert images into visual token embeddings and concatenate them with document token embeddings as the input of the joint multi-modal encoder. The ViL-Sum structure with the joint multi-modal encoder is shown in Figure 1c. To model paragraph-level vision-language semantic alignment, we propose a simple but effective image reordering task: the model must reorder shuffled input images, which guides it to learn the correspondence between paragraphs and images. To further enhance the vision-language representation, we also train ViL-Sum with an image selection task, which selects several summary-related images as part of the multi-modal summary; following Zhu et al. (2020), we use image captions to construct pseudo labels. Finally, we train ViL-Sum on text summary generation, image selection, and image reordering in a multi-task manner.

Experiments show that ViL-Sum with multi-task training outperforms the baselines by a wide margin, and further analysis demonstrates that the improvement comes precisely from joint modelling and multi-task training. However, image captions are not always available, so the image selection task does not generalize to all datasets. It is worth mentioning that even if we remove the image selection task, our proposed multi-modal encoder and the image reordering task still help the model beat all comparison models.

Our contributions can be summarized as follows (code will be released after de-anonymization):

  • We propose a novel Vision-Language Summarization (ViL-Sum) model, which jointly encodes images and text to capture their interrelation.

  • We propose two auxiliary tasks and employ multi-task learning to guide the model to learn the paragraph-level vision and language semantic alignment.

  • Our model outperforms all current state-of-the-art methods on both automatic and manual evaluation metrics. Further analysis shows that the improvement stems directly from paragraph-level semantic alignment modelling and multi-task training.

2 Methodology

Figure 3: The overall framework of our proposed ViL-Sum model. Figure a) is the detail of the ViT-based image tokenizer. Figure b) is the encoder-decoder framework with multi-task learning.

We show the main architecture of our ViL-Sum model in Figure 3. Firstly, we employ a backbone network as the image tokenizer to convert images into visual token embeddings (Figure 3a). Then, text embeddings and visual token embeddings are concatenated as the input of the main encoder-decoder framework (Figure 3b). Finally, we train ViL-Sum in a multi-task manner. In the following sections, we first introduce the vision-language joint representation and then describe the details of multi-task learning.

2.1 Vision-Language Joint Representation

First of all, we formalize the input and output of ViL-Sum as $(D, I)$ and $(S, I_S)$, where $D=\{t_1,t_2,\dots,t_T\}$ is the sequence of tokens in the input document, $I=\{img_1,img_2,\dots,img_M\}$ is the sequence of images in the input document, $S=\{t_1,t_2,\dots\}$ is the sequence of tokens of the gold text summary, and $I_S=\{img_1,img_2,\dots,img_K\}$ contains the $K$ images selected for the multi-modal summary.

2.1.1 Document Embeddings

Each document is first converted into a sequence of tokens $\{t_1,t_2,\dots,t_T\}$, and two special tokens “$\langle s\rangle$” and “$\langle /s\rangle$” are added to mark the start and end of the document. After that, we map each token to a vector representation $E_D=\{e_{start},e_1,\dots,e_T,e_{end}\}$ with the text embedding layer.
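As a rough illustration of this step, the sketch below builds $E_D$ with a Hugging Face BART tokenizer and embedding layer; the facebook/bart-base checkpoint and the placeholder document are assumptions for illustration, not the authors' released code.

```python
# Hedged sketch of the document-embedding step; checkpoint choice is illustrative.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
bart = BartModel.from_pretrained("facebook/bart-base")

document = "A multi-modal news article body goes here ..."
# The tokenizer adds the <s> and </s> special tokens and truncates long documents.
enc = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
# E_D: one vector per token, including the start/end tokens, from the text embedding layer.
E_D = bart.get_input_embeddings()(enc["input_ids"])  # shape: (1, T+2, D)
```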

2.1.2 Image Embeddings

Different from previous methods, which extract image features via existing object detection models, we employ ViT Dosovitskiy et al. (2021) as the backbone, which splits each image into several patches and then encodes them. The details of the image tokenizer are shown in Figure 3a.

Firstly, we reshape each image $img\in\mathbb{R}^{H\times W\times C}$ into a sequence of flattened 2D patches $\{img^p\}_{p=1}^{N}$ with $img^p\in\mathbb{R}^{P^2\cdot C}$, where $(H,W)$ is the resolution of the original image, $C$ is the number of channels, $(P,P)$ is the resolution of each image patch, and $N=HW/P^2$ is the resulting number of patches. This sequence of image patches $\{img^p\}_{p=1}^{N}$ is the input of the image tokenizer.

Secondly, the patches of image $img_i$ are linearly projected to patch embeddings $e_i^p = E\cdot img_i^p$, where $E\in\mathbb{R}^{D\times(P^2\cdot C)}$. We also prepend a special “[class]” token with a learnable embedding $e_i^0$. Position embeddings are then added to the patch embeddings to retain positional information, giving the input $Z_0$ of the image encoder:

Z_0 = [e_i^0; e_i^1; \dots; e_i^N] + E_{pos}    (1)

where $Z_0, E_{pos}\in\mathbb{R}^{(N+1)\times D}$ and $E_{pos}$ denotes the position embeddings.

Finally, we employ the pre-trained ViT as the backbone to encode the patches of each image. This backbone can also be replaced by other encoders (e.g. a linear projection layer):

Z_\ell = \mathtt{Transformer}(Z_{\ell-1}), \quad \ell = 1, 2, \dots, L    (2)

The global max-pooling of the output vectors is taken as the visual token embedding of image $img_i$:

v_i = \mathtt{Maxpooling}(Z_L)    (3)

where $v_i\in\mathbb{R}^{D}$. Through the image tokenizer, we convert the sequence of input images into a sequence of visual token embeddings $E_v=\{v_i\}_{i=1}^{M}$.
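As a concrete but unofficial sketch of the image tokenizer described above, the snippet below encodes a batch of images with a pre-trained ViT and applies global max-pooling to obtain one visual token embedding per image (Eq. 3); the google/vit-base-patch16-224 checkpoint and the dummy random inputs are assumptions for illustration.

```python
# Hedged sketch of the ViT-based image tokenizer; checkpoint is an assumption.
import torch
from transformers import ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")  # ViT-B/16 backbone

def images_to_visual_tokens(pixel_values: torch.Tensor) -> torch.Tensor:
    """pixel_values: (M, 3, 224, 224) preprocessed images -> (M, D) visual tokens."""
    # last_hidden_state: (M, N+1, D), i.e. the [class] token plus N = HW/P^2 patch states.
    patch_states = vit(pixel_values=pixel_values).last_hidden_state
    # Eq. (3): global max-pooling over the token dimension gives v_i for each image.
    v, _ = patch_states.max(dim=1)
    return v

E_v = images_to_visual_tokens(torch.randn(4, 3, 224, 224))  # (4, 768), dummy inputs
```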

2.1.3 Multi-modal Encoder

The input of the multi-modal encoder is the concatenation of the visual token embeddings $E_v$ and the token embeddings $E_D$. We formalize the input as $H_0=\{E_v; E_D\}$ and encode the visual and text embeddings with 12 transformer blocks. Finally, we obtain the vision-language representation $H_L$ from the last layer of this encoder:

H_L = \{h_{v_1},\dots,h_{v_M}, h_{start}, h_1,\dots,h_{end}\}    (4)

The vision and language semantics interact with the self-attention mechanism of the transformer structure during the encoding process.
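A minimal sketch of this joint encoding is shown below, assuming the encoder accepts pre-computed embeddings via inputs_embeds (as the Hugging Face BART encoder does); the shapes and mask handling are illustrative rather than the authors' exact implementation.

```python
# Hedged sketch of building H_0 = {E_v; E_D} and encoding it jointly.
import torch

def encode_jointly(bart_encoder, E_v, E_D, text_mask):
    """E_v: (B, M, D) visual tokens, E_D: (B, T, D) text embeddings, text_mask: (B, T)."""
    H_0 = torch.cat([E_v, E_D], dim=1)                                  # (B, M+T, D)
    img_mask = torch.ones(E_v.shape[:2], dtype=text_mask.dtype, device=E_v.device)
    attention_mask = torch.cat([img_mask, text_mask], dim=1)            # (B, M+T)
    # Self-attention over the concatenated sequence lets vision and language interact.
    H_L = bart_encoder(inputs_embeds=H_0, attention_mask=attention_mask).last_hidden_state
    return H_L  # (B, M+T, D): [h_v1, ..., h_vM, h_start, h_1, ..., h_end]
```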

2.2 Visual-enhanced Summary Generation

The vision-language representation $H_L$ from the multi-modal encoder contains multi-modal features of the input text and images. After encoding, we feed $H_L$ into the decoder to generate the text summary. The objective of the summary generation task is to minimize the negative log-likelihood of the reference tokens $y$ given the input document $D$ and images $I$ by updating the model parameters $\theta$. The loss function of the summary generation task is as follows:

\mathcal{L}_{\theta}^{GEN} = -\sum_{j=1}^{|y|} \log P_{\theta}(y_j \mid y_{<j}, D, I)    (5)

Different from single-modal summarization tasks, this objective also depends on the features from the input images $I$, which enhances the final summary generation.
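For illustration, Eq. (5) corresponds to a standard token-level cross-entropy over the decoder outputs; the sketch below assumes the decoder logits and reference token ids are already available.

```python
# Hedged sketch of the generation loss L_GEN in Eq. (5).
import torch
import torch.nn.functional as F

def generation_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """logits: (B, |y|, V) decoder outputs; target_ids: (B, |y|) reference summary tokens."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,  # padded positions do not contribute to the NLL
    )
```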

2.3 Image Reordering

To align the paragraphs and images from the input, in this section, we introduce a simple yet effective task, image reordering, to guide the model to learn semantic alignment. Specifically, we shuffle the order of input images and then force the ViL-Sum model to predict the original order of input images with a classification head:

y_i = P(pos_i) = \mathtt{softmax}(W\cdot h_{v_i}+b)    (6)

where all input images share one classification head. The classification layer is trained by minimizing the following objective:

\mathcal{L}_{\theta}^{IR} = -\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{C} \hat{y}_{ic}\log y_{ic}    (7)

where $C$ is the number of position categories, which depends on the number of input images. We set $C=10$; if a document contains more than 10 images, we only keep the first 10 as input.
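A minimal sketch of this reordering head is given below, assuming a hidden size of 768 and the $C=10$ position classes stated above; a single linear classifier is shared across all image positions.

```python
# Hedged sketch of the image reordering task (Eqs. 6-7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReorderHead(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_positions: int = 10):
        super().__init__()
        # One classification head shared by all input images.
        self.classifier = nn.Linear(hidden_dim, num_positions)

    def forward(self, h_v: torch.Tensor, original_positions: torch.Tensor) -> torch.Tensor:
        """h_v: (B, M, D) encoder states of the shuffled images;
        original_positions: (B, M) gold position indices before shuffling."""
        logits = self.classifier(h_v)                              # (B, M, C)
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               original_positions.view(-1))        # L_IR
```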

2.4 Image Selection

We also train ViL-Sum with a multi-modal output reference following Zhu et al. (2020). To build pseudo image selection labels for the training data, we use the similarity between each image caption and the gold summary to select the top-$K$ images as labels $\hat{y}$ ($K$ is empirically set to 3). The similarity is the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores. The probability of selecting each image is as follows:

y_i = P(img_i) = \sigma(W\cdot h_{v_i}+b)    (8)

The loss function of the image selection task is as follows:

\mathcal{L}_{\theta}^{IS} = -\frac{1}{M}\sum_{i=1}^{M}\left[\hat{y}_i\log y_i + (1-\hat{y}_i)\log(1-y_i)\right]    (9)
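The sketch below illustrates how the pseudo labels and the selection loss could be computed; the rouge-score package is an assumption for the ROUGE computation (the paper only states that the similarity is the average of ROUGE-1/2/L), and the linear head passed in is illustrative.

```python
# Hedged sketch of pseudo-label construction and the selection loss (Eqs. 8-9).
import torch
import torch.nn.functional as F
from rouge_score import rouge_scorer  # assumed ROUGE implementation

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def pseudo_selection_labels(captions, gold_summary, k=3):
    """Mark the K images whose captions best match the gold summary as positives."""
    sims = []
    for cap in captions:
        s = scorer.score(gold_summary, cap)
        sims.append((s["rouge1"].fmeasure + s["rouge2"].fmeasure + s["rougeL"].fmeasure) / 3)
    labels = torch.zeros(len(captions))
    labels[torch.tensor(sims).topk(min(k, len(captions))).indices] = 1.0
    return labels  # (M,) with K ones

def selection_loss(h_v, labels, head):
    """h_v: (M, D) image states; head: nn.Linear(D, 1); labels: (M,) pseudo labels."""
    logits = head(h_v).squeeze(-1)                                 # (M,)
    return F.binary_cross_entropy_with_logits(logits, labels)      # L_IS
```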

2.5 Enhanced by Multi-task Learning

We train ViL-Sum on the text summary generation task and the two well-designed auxiliary tasks, introduced in the previous sections, in a multi-task manner to enhance the vision-language representation and paragraph-level semantic alignment. ViL-Sum is trained jointly on summary generation, image selection, and image reordering by simultaneously minimizing the three loss functions:

\mathcal{L}_{\theta}^{TOTAL} = \mathcal{L}_{\theta}^{GEN} + \mathcal{L}_{\theta}^{IS} + \mathcal{L}_{\theta}^{IR}    (10)
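Eq. (10) is an unweighted sum of the three losses. A hedged sketch of one training step is shown below; the model interface returning the three losses is hypothetical.

```python
# Hedged sketch of one multi-task training step under Eq. (10).
def training_step(model, batch, optimizer):
    out = model(**batch)                       # hypothetical forward returning all three losses
    loss = out["loss_gen"] + out["loss_is"] + out["loss_ir"]  # equal-weight sum, Eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```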

It is worth mentioning that image captions are not always available, so the image selection task does not generalize to all datasets. If we remove the image selection task, we can instead select images by measuring the similarity between the generated summary and the vector representations of the images. Our proposed multi-modal encoder and the image reordering task still help the model achieve excellent performance.

3 Experimental Setup

3.1 Dataset

Table 1: Statistics of the MSMO dataset. D refers to the input document; S refers to the summary.
                 train      valid    test
#Documents       293,965    10,355   10,261
#AvgTokens(D)    721        766      731
#AvgTokens(S)    70         70       72
#Images          1,928,356  68,520   71,509
#AvgImgs         6.56       6.62     6.97

We employ the MSMO dataset Zhu et al. (2018) to evaluate the effectiveness of ViL-Sum. MSMO is a large-scale dataset for multi-modal summarization with multi-modal output; each example is a (document, images, summary) triplet containing more than one image. The dataset consists of online news articles (723 tokens on average) paired with multiple image-caption pairs (6.58 images on average) and multi-sentence summaries (70 tokens on average). For the test data, at most three images per example are annotated by humans, based on the text reference, to produce a multi-modal reference. Detailed statistics of the MSMO dataset are shown in Table 1.

3.2 Baseline Models

We report results for the existing multi-modal summarization methods ATG, ATL, HAN, and GR Zhu et al. (2018) and MOF^{RR}_{dec} Zhu et al. (2020) on multiple metrics. We also report the result of PGC See et al. (2017), which is a single-modal summarization model.

To prove the effectiveness of our proposed joint representation and multi-task learning, we mainly compare with the BART-base Lewis et al. (2020) model and a reproduced two-stream model, BART-cross, which has the same structure as MOF^{RR}_{dec} but replaces GRU and VGG19 Liu and Deng (2015) with BART and ViT Dosovitskiy et al. (2021), respectively. To be fair, we mainly compare our model with BART-base and BART-cross, since previous methods did not employ pre-trained models. The details of these models are as follows:

  • PGC is the BiGRU-based pointer-generator network that allows both copying words from the input text and generating words from a fixed vocabulary.

  • ATG is based on the PGC model. It fuses static visual features from VGG19 with text features after the BiGRU encoder. Besides, ATG selects final images by the visual-text attention weight.

  • ATL replaces the global image features of ATG with local features (multiple pooling features) and selects images by measuring the sum of the visual attention distribution over the local patch features of each image.

  • HAN is based on the ATL model and adds a hierarchical attention mechanism. This attention mechanism first attends to the image patches to get the intermediate vectors to represent images and then attends to these vectors to get the visual context vector.

  • GR is an extractive method that employs LexRank Erkan and Radev (2004) to rank captions of images and select images based on the rank score. The text summary of it is generated by the PGC model.

  • MOF^{RR}_{dec} is based on the ATG model. It first constructs pseudo labels of image selection for the final summary; specifically, it employs the ROUGE score to measure the relevance between each image caption and the summary text.

  • BART-base is a pre-trained seq2seq generation model that has achieved promising results on many NLP generation tasks, especially text summarization. We use this model to assess the contribution of visual features to summary generation.

  • BART-cross is a BART-based model with the same model structure as ATG, ATL, HAN, GR, and MOF^{RR}_{dec}. It first encodes images with ViT and then fuses them with the text representation from the BART encoder output; the fusion of image and text representations employs cross-attention, as in ATG. This is our main comparison model.

For a fair comparison, we construct the BART-cross model to prove the effectiveness of the joint multi-modal encoder and multi-task training in ViL-Sum, because ViL-Sum without multi-task training only changes the encoding mechanism from separate encoders to the joint multi-modal encoder.

Model  ROUGE-1  ROUGE-2  ROUGE-L  MAX_sim  IP  MMAE
Block 1
PGC See et al. (2017)  41.11  18.31  37.74  -  -  -
ATG Zhu et al. (2018)  40.63  18.12  37.53  25.82  59.28  3.35
ATL Zhu et al. (2018)  40.86  18.27  37.75  13.26  62.44  3.26
HAN Zhu et al. (2018)  40.82  18.30  37.70  12.22  61.83  3.25
GR Zhu et al. (2018)  37.13  15.03  30.21  26.60  61.70  3.20
MOF^{RR}_{dec} Zhu et al. (2020)  41.20  18.33  37.80  26.38  65.45  3.37
Block 2
BART-base  43.75  20.70  40.66  -  -  -
BART-cross  43.67  20.65  40.65  30.25  65.98  3.45
ViL-Sum  44.29  20.96  41.34  32.17  66.27  3.48
ViL-Sum+SEL  44.20  20.90  41.22  34.47  68.18  3.51
ViL-Sum+REO  44.21  20.98  41.20  34.35  69.03  3.52
ViL-Sum+SEL,REO  44.16  20.88  41.21  34.52  71.73  3.55
Table 2: Main results of all comparison models on different metrics. Models in block 1 are based on the pointer network with Bi-GRU; models in block 2 are based on the BART-base model. SEL denotes the image selection task and REO denotes the image reordering task. All reported results of our models are the average of 3 different checkpoints.

3.3 Implementation Details

We train our model for 10 epochs on 8 V100 GPUs using Adam Kingma and Ba (2015) with $\beta_1=0.9$, $\beta_2=0.99$, and a batch size of 64. We use a linear learning rate warm-up over 1,000 steps, and the weight decay is set to $10^{-4}$. The model is initialized with ViT-B/16 and BART-base parameters. The maximum numbers of input images and tokens are 10 and 512, respectively. For the image tokenizer, we use the same setting as ViT-B/16 in Dosovitskiy et al. (2021). During testing, we generate the summary with a beam size of 3, and the minimum and maximum decoding lengths are set to 15 and 150, respectively.
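A hedged sketch of this optimization setup is given below (Adam with $\beta_1=0.9$, $\beta_2=0.99$, weight decay $10^{-4}$, and 1,000 linear warm-up steps); the peak learning rate is not stated above and is therefore a placeholder.

```python
# Hedged sketch of the optimizer/scheduler configuration; peak_lr is a placeholder.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, num_training_steps, peak_lr=3e-5):  # peak_lr: assumed value
    optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr,
                                 betas=(0.9, 0.99), weight_decay=1e-4)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps)
    return optimizer, scheduler
```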

3.4 Evaluation Metrics

We evaluate the pictorial summary with the MMAE metric Zhu et al. (2018). (Zhu et al. (2020) also proposed MMAE+ to better evaluate the MSMO task; however, the authors did not release their MR model, which is the core component of MMAE+, and we find that the performance of MMAE and MMAE+ is very close and consistent.)

MMAE consists of three sub-metrics: the ROUGE score (ROUGE-L), Image Precision (IP), and Image-Text Relevance (MAX_sim). The ROUGE Lin (2004) score measures the salience of the generated summary text and is widely used for evaluating summarization systems. Image precision measures the salience of the selected images and is computed as Eq. (11):

\mathbf{IP} = \frac{|ref_{img}\cap rec_{img}|}{|rec_{img}|}    (11)

where $ref_{img}$ and $rec_{img}$ denote the reference images and the images recommended by the MSMO system, respectively. MAX_sim measures the relevance between the selected images and the generated text summary using an image-text retrieval model Faghri et al. (2018) trained with a max-margin loss. Finally, Zhu et al. (2018) fit a linear regression of the three metrics against human judgments to obtain MMAE; the weights for ROUGE-L, MAX_sim, and IP are 1.641, 0.854, and 0.806 respectively, and the intercept is 1.978.
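For clarity, the sketch below shows IP from Eq. (11) and the reported linear combination for MMAE; it assumes the three sub-metrics are expressed as fractions in [0, 1], which is consistent with the MMAE values around 3.5 in Table 2.

```python
# Hedged sketch of IP (Eq. 11) and the MMAE linear combination reported above.
def image_precision(ref_imgs, rec_imgs):
    """Fraction of recommended images that appear in the human reference set."""
    ref, rec = set(ref_imgs), set(rec_imgs)
    return len(ref & rec) / len(rec) if rec else 0.0

def mmae(rouge_l, max_sim, ip):
    """Linear regression against human judgments; sub-metrics assumed in [0, 1]."""
    return 1.641 * rouge_l + 0.854 * max_sim + 0.806 * ip + 1.978
```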

We report ROUGE-1/2/L, MAX_sim, IP, and MMAE for each model to comprehensively measure performance. The results of our model are the average of three different checkpoints.

4 Results

4.1 Overall Performance

The main results of all models are shown in Table 2.

Compared with the baselines, ViL-Sum gains significant improvements on all metrics, and ViL-Sum+SEL,REO achieves the best overall performance. Compared with BART-cross, both the joint representation and the multi-task training bring satisfactory improvements, which proves the effectiveness of our proposed methods. Interestingly, introducing image features through separate encoders hurts the ROUGE scores relative to the corresponding single-modal summarization models, especially for the Bi-GRU-based ones.

4.2 Performance of Joint Representation

Firstly, ATG, ATL, HAN, and GR all obtain lower ROUGE scores than PGC by simply introducing images as independent visual features. Through multi-modal objective optimization, MOF^{RR}_{dec} achieves a significant improvement in IP without decreasing the quality of the generated text summary. This shows that modelling vision and language information independently brings no benefit to text summary generation. BART-cross, which also introduces images as independent features, likewise has lower ROUGE scores than BART-base, which further confirms this conclusion.

In contrast, our ViL-Sum with joint vision-language representation obtains better ROUGE scores, and both Image Precision (IP) and MAX_sim improve significantly. This demonstrates that using the joint multi-modal encoder to obtain vision-language representations is better than using separate encoders with cross-attention to fuse multi-modal features.

4.3 Performance of Multi-task Learning

ViL-Sum without multi-task learning already achieves new state-of-the-art performance and outperforms BART-cross. In this section, we analyze the influence of our proposed multi-task learning. From the results, we can see that introducing image selection and reordering brings a slight decrease in ROUGE scores, while the IP and MAX_sim scores increase significantly, which makes the overall MMAE score better than that of ViL-Sum without multi-task training.

We report the ablation results of the two auxiliary tasks in the second block of Table 2. Image selection and reordering each improve the IP and MAX_sim scores, and their combination pushes the overall MMAE score higher. This comparison demonstrates that multi-task learning indeed improves the vision-language representation and semantic alignment, which is reflected in the improvement of the multi-modal metrics IP, MAX_sim, and MMAE.

5 Discussion

5.1 Human Evaluation

Systems Human Score
BART-base 3.29
BART-cross 3.46
ViL-Sum (best) 3.78
Reference 4.02
Table 3: Results evaluated by human annotators. Each summary is scored by three persons, and we take the average value.

We randomly sample 100 examples from the test set for human evaluation. The multi-modal summaries of the gold reference, BART-base, BART-cross, and our best ViL-Sum are evaluated by three human annotators, each scoring every example on a scale from 1 (worst) to 5 (best). Table 3 shows the average scores from the three annotators (t-test, $p<0.05$). Annotators tend to give higher scores to the multi-modal summaries from BART-cross and ViL-Sum. In addition, ViL-Sum outperforms the two strong baselines by a wide margin and is close to the references.

5.2 Impact of Different Numbers of Images

K  ROUGE-L  MAX_sim  IP  MMAE
1  40.97  34.63  70.94  3.54
2  41.12  34.33  70.40  3.53
3  41.21  34.52  71.73  3.55
4  41.08  34.49  70.61  3.53
Table 4: Results of ViL-Sum with different numbers K of images selected into the final summary.

Table 4 shows how model performance varies with $K$, the number of images in the final summary. Since the gold references in the test set contain three images, the consistency between training and testing makes the model perform best when $K=3$. Overall, our model is not very sensitive to $K$: for all tested values of $K$, ViL-Sum achieves excellent performance, which shows that our method can identify the truly important images from the multi-modal input. We also conjecture that image selection on the MSMO dataset is relatively easy because the data come from news articles.

5.3 Impact of Different Image Tokenizer

Tokenizer  ROUGE-L  MAX_sim  IP  MMAE
ViT  41.21  34.52  71.73  3.55
Linear  40.18  33.89  70.44  3.51
Vision  41.10  34.28  71.04  3.54
Table 5: Results of ViL-Sum with different image tokenizers. Linear denotes an image tokenizer that replaces the transformer blocks with linear layers; Vision denotes the image tokenizer from Vision Transformer.

To further evaluate the effectiveness of joint modelling and multi-task learning, we replace the backbone of the image tokenizer and observe the performance of ViL-Sum. We replace the ViT backbone with a linear layer and with the image tokenizer from Vision Transformer Wu et al. (2020); both have far fewer parameters than the ViT backbone. Specifically, Linear is a simplified version of ViT that replaces the transformer image encoder with a single linear layer to map images into visual token embeddings, while Vision is the image tokenizer from Vision Transformer Wu et al. (2020), which converts one image into several visual token embeddings. Table 5 reports the results. We can see that ViT indeed provides better visual features than the other two backbones; however, performance does not drop sharply when the image tokenizer is replaced. This shows that our two proposed strategies are robust and that ViL-Sum is flexible with respect to the image tokenizer.

Figure 4: Example from the test set with the generated multi-modal summary. Figure a) is the full example. Figure b) is the heatmap that shows the relevance of the summary and selected images. Figure c) is the heatmap that shows the relevance of selected paragraphs and images. Each color block means cosine similarity between the image and text object. The darker color refers to higher similarity.
Figure 5: The heatmap shows the relevance of all input tokens and images. The darker color refers to higher similarity.

5.4 Case Study and Relevance Visualization

We select one typical example from the test set and visualize the relevance of 1) summary sentences and selected images; 2) selected paragraphs and images; and 3) all tokens and images in Figures 4 and 5. Each colour block represents the cosine similarity between an image and a text object; darker colours indicate higher similarity. As shown in Figure 4a, the generated output contains a high-quality text summary with three related images. From the three relevance visualizations, we can see that ViL-Sum effectively aligns the semantic representations of summary sentences and selected images (Figure 4b), and that the input images are aligned with paragraphs through training with image reordering (Figure 4c). The heatmap of all input tokens and images in Figure 5 is consistent with Figures 4b and 4c. This case shows that multi-task training really helps ViL-Sum learn reasonable relations between images and input paragraphs.

6 Related Work

6.1 Single-modal Summarization

Recently, text summarization models have achieved remarkable performance with the development of pre-trained language models. Liu and Lapata (2019) first applied the pre-trained language model BERT to summarization tasks: they added several transformer layers as a decoder on top of the BERT encoder and trained the two parts with different learning rates, outperforming all traditionally trained neural models. Pegasus Zhang et al. (2019) and BART Lewis et al. (2020) are two fully pre-trained models for summary generation with well-designed self-supervised tasks; their appearance provided powerful base models for summarization and changed the research paradigm of the task. Since then, more and more summarization works have focused on pre-trained language models, including supervised and unsupervised methods Zhong et al. (2020); Liang et al. (2021b, a, 2022); Dong et al. (2023).

6.2 Vision-Language Representation

Large-scale Transformer-based Vaswani et al. (2017) vision and language representation models Radford et al. (2019); Devlin et al. (2019); Lewis et al. (2020) have achieved state-of-the-art results on many Natural Language Processing (NLP) tasks. They are first pre-trained on a large-scale corpus with self-supervised tasks and then fine-tuned on specific downstream tasks. Most existing vision-language pre-training (VLP) models Tan and Bansal (2019); Li et al. (2021) adopt two different encoders to model vision and language separately: they extract visual features with an object detection model and then combine the derived object-centric image representation with the text. Recently, large-scale vision-language representation learning has tried to jointly encode the different modalities with the same encoder and achieved promising improvements Li et al. (2020a, b, c); Zhang et al. (2021b); Xu et al. (2021); Zhou et al. (2020). This success shows that joint modelling of different modalities is practicable.

6.3 Multi-modal Summarization

Different from single-modal text summarization, multi-modal summarization aims to generate a condensed summary covering the primary information in multimedia data. One of its most significant characteristics is that it is not only based on text but can also exploit rich visual information from images, audio, and videos. Multi-modal summarization tasks can be divided into two types by their outputs: single-modal output Evangelopoulos et al. (2013); Chen and Zhuge (2018); Li et al. (2018) and multi-modal output Bian et al. (2015); Zhu et al. (2018, 2020). Zhu et al. (2018) showed that, compared with single-modal output, a multi-modal output summary can increase user satisfaction, and they proposed the first large-scale Multi-modal Summarization with Multi-modal Output (MSMO) dataset. To close the gap between training and testing in the MSMO task, Zhu et al. (2020) proposed two methods to obtain pseudo image labels and trained the model with multi-modal optimization objectives.

However, previous works all obtain vision-language representations via separate encoders for the different modalities, which has been shown to be weaker than joint representation in vision-language representation learning research Zhou et al. (2020); Xu et al. (2021). Besides, they ignore the paragraph-level semantic alignment between modalities. In this paper, we propose the vision-language summarization model ViL-Sum with a multi-task learning framework to tackle these issues.

7 Conclusion

In this paper, we propose a novel Vision-Language Summarization (ViL-Sum) model, which enhances the vision-language representation and paragraph-level semantic alignment through joint modelling and multi-task training. ViL-Sum achieves new state-of-the-art results on both automatic and manual evaluation metrics, and further analysis demonstrates that the improvement comes from the joint multi-modal encoder and multi-task training. Our proposed image reordering task is straightforward yet effective; we believe it can be extended to more scenarios (e.g. vision-language pre-training models) and modalities (e.g. audio and video).

References