
Sample Efficient Multimodal Semantic Augmentation for
Incremental Summarization



Sumanta Bhattacharyya†
UIC, USA

Ramesh Manuvinakurike
Intel Labs, USA

Sahisnu Mazumder
Intel Labs, USA

Saurav Sahay
Intel Labs, USA

†Work done while at Intel Labs
Abstract

In this work, we develop a prompting approach for incremental summarization of task videos. We develop a sample-efficient few-shot approach for extracting semantic concepts as an intermediate step. We leverage an existing model for extracting concepts from images, extend it to videos, and introduce a clustering-and-querying approach for sample efficiency, motivated by recent advances in perceiver-based architectures. Our work provides further evidence that enriching the input context with relevant entities and actions from the videos, and using these as prompts, can enhance the summaries generated by the model. We report results on a relevant dataset and discuss possible directions for future work.

1 Introduction

Summarization condenses a large document into a compact form and is widely used in many applications, e.g., understanding a long meeting or event, story summarization, etc. Abstractive summarization is a challenging Natural Language Generation (NLG) task, as it requires understanding all the salient information in the input document and rewriting it logically in a condensed form, rather than merely selecting spans (extractive summarization). Recent transformer-based abstractive summarization work has shown promising results Su et al. (2020); Hoang et al. (2019); Wang et al. (2020), with ideas ranging from two-stage methods and domain-adaptive training to plug-and-play topic models on top of the transformer. Despite these strong advances in text-based summarization, there remains large potential for improving summarization from multimodal data. Because real-world data arrives in multiple modalities rather than text alone, there is growing demand for bridging the gap between modalities, e.g., cross-modal search applications that use the text associated with a video to retrieve relevant video content Otani et al. (2016); Song et al. (2011), which requires a complete understanding of the video without ignoring subtle differences Wang et al. (2012). Recent work Palaskar et al. (2021) suggests that learning semantic concepts as an intermediate step can help the model learn efficiently. Semantic concept learning has long been beneficial in categorization tasks such as scene recognition and video tagging Zhou et al. (2017); Ghadiyaram et al. (2019).

Recent advances in vision-language models Radford et al. (2021); Alayrac et al. (2022) have shown immense potential for generating text-based descriptions from images and videos. In our context, we refer to these text-based descriptions as "semantic concepts". Our work learns these semantic concepts from the videos as an intermediate step; providing them along with the transcriptions (semantic augmentation) as input to a pre-trained summarizer model enriches its output. In this work, we address the problems of (i) generating semantically relevant annotations of a video (semantic concepts) using a fixed number of sampled frames from each video segment, and (ii) utilizing these semantic concepts along with the input transcription (semantic augmentation) to enrich the summarization output of pre-trained models (e.g., BART).

In summary, our contributions are the following:

  • We propose a novel CLIP-based approach Radford et al. (2021) to generate semantic concepts from video frames.

  • In order to maintain diversity in each batch, we propose a clustering-based batch creation approach.

  • We evaluate our proposed approach on the YouCook2 Zhou et al. (2018a) dataset. The results demonstrate the effectiveness of our approach.

Figure 1: Architecture of the system and the use of semantic augmentation for summarization.

2 Related work

Early attempts show promising ideas (e.g., reinforcement learning, the copy-pointer mechanism) in abstractive summarization using advances in sequence-to-sequence modeling Rush et al. (2015); Nallapati et al. (2016); See et al. (2017); Henß et al. (2015). Although these approaches mainly focus on single-document summarization, there are also attempts at multi-document summarization Yasunaga et al. (2017); Cao et al. (2015).

Recent advances in deep learning and transformer-based models Li et al. (2019); Liu et al. (2019) have achieved impressive performance in abstractive summarization tasks Zhang et al. (2020); Raffel et al. (2020); Lewis et al. (2020); Zhu et al. (2020). Such transformer-based models are typically pre-trained on a large dataset and then fine-tuned on a smaller one. There are also methods that improve summarization through auxiliary tasks. Since a summary should contain all the salient information, it should be able to answer logical questions about the input document. Automatic question-answer (QA) generation during summarization has shown promise recently Guo et al. (2018); Dong et al. (2020): the generated QA pairs are used to verify that the summary entails the same information as the content, by matching answers derived from the content and from the summary.

Text generation from multimodal data has long been a challenging research area in NLG. Tasks such as video captioning Zhou et al. (2018b) and summarization involve generating a compressed textual description of the data Palaskar et al. (2019). Recent developments show how these tasks can benefit from semantic representation learning in a latent space that provides general-purpose embeddings for downstream tasks Lu et al. (2019); Hubert Tsai et al. (2017). Despite its performance, this approach is limited by controllability issues in tasks like summarization. As an alternative, there is also recent interest in reranking-based approaches to abstractive summarization Pernes et al. (2022), similar to machine translation Bhattacharyya et al. (2020).

Evaluating generated summaries is challenging because there is no single correct summary for a dialogue Lloret et al. (2018). Numerous automatic metrics have been proposed Lin (2004); Yogatama et al. (2015); Jung et al. (2019); Zhang et al. (2019); Hashimoto et al. (2019); Gao et al. (2020); Sellam et al. (2020). Human evaluation of summaries, either by experts or by crowd-workers, is another popular approach Iskender et al. (2020); Dang (2006); Khashabi et al. (2021).

Our approach does not contribute a new model architecture for summarization; instead, it benchmarks and adapts the training methodology for incremental temporal summarization tasks. We adopt current state-of-the-art transformer architectures and use transfer learning to generate summaries. We also qualitatively evaluate the expert-generated summaries using crowd-workers.

Our work is also related to recent approaches that prompt or compose frozen pre-trained models for few-shot multimodal summarization and reasoning Liu et al. (2022); Tsimpoukelli et al. (2021); Zeng et al. (2022); Pasca et al. (2023).

3 Task Formulation

3.1 Image frame sampling

Each video segment contains many image frames depending on its duration, so sampling a fixed number of frames is essential for computational efficiency; however, choosing a fixed number of frames that describe the entire event from the full pool is tricky Shi et al. (2019). We designed various experiments with and without sampling and observed that the middle frames of a video segment are the best frames for capturing reasonable augmentation. For all our experiments, we use three frames from each video: if N is the total number of frames in a video, we use frames N/2 - 1, N/2, and N/2 + 1. For ease of understanding, we use "frames" to denote these three middle frames for the rest of the discussion.
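As a concrete illustration of this rule, the following is a minimal sketch using OpenCV (the function name and the choice of library are ours, not part of the paper's pipeline):

import cv2

def middle_three_frames(video_path):
    """Return the three middle frames (N/2 - 1, N/2, N/2 + 1) of a video segment.

    Minimal sketch of the sampling rule described above; error handling and
    frame preprocessing (resizing, normalization) are omitted.
    """
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    mid = n // 2
    frames = []
    for idx in (mid - 1, mid, mid + 1):
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(0, min(idx, n - 1)))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # BGR numpy array of shape (H, W, 3)
    cap.release()
    return frames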

For details of the experiments we designed, please refer to the Appendix. We also designed a separate network that learns to sample frames from the frame pool, but we leave this discussion to future work.

3.2 Clustering-based batch creation

Instead of composing a batch of similar event frames and their corresponding event annotations, we perform k-means clustering on the encoded features of the image frames. Since similar event frames in a single batch do not provide enough diversity, the clustering identifies which event frames have dissimilar features, and we draw features from different clusters to form each batch. Because "frames" denotes a collection of three middle frames, we concatenate the features of these three frames and cluster the concatenated vectors; the concatenation preserves the temporal relation among the middle frames for a particular annotation. This strategy improved our augmentation-generation performance compared to keeping similar event frames together as they appear in the video data.
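A minimal sketch of this batching strategy, assuming pre-computed concatenated frame features and scikit-learn's k-means (the function name and the round-robin batch assembly are illustrative, not the exact implementation):

import numpy as np
from sklearn.cluster import KMeans

def diverse_batches(frame_feats, batch_size, n_clusters, seed=0):
    """Build batches that mix segments from different k-means clusters.

    frame_feats has shape (num_segments, 3 * feat_dim): the concatenated
    features of the three middle frames of each segment.
    """
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(frame_feats)
    # Segment indices grouped by cluster.
    by_cluster = [list(np.where(labels == c)[0]) for c in range(n_clusters)]
    batches, current = [], []
    # Round-robin over clusters so each batch draws dissimilar segments.
    while any(by_cluster):
        for bucket in by_cluster:
            if bucket:
                current.append(int(bucket.pop(0)))
                if len(current) == batch_size:
                    batches.append(current)
                    current = []
    if current:
        batches.append(current)
    return batches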

# samples    Top-1 (k-means)    Top-1 (random)    Top-3 (k-means)    Top-3 (random)
150 (10)     0.2156             0.196             0.549              0.5098
300 (10)     0.2749             0.2745            0.6176             0.5098
1500 (10)    0.2941             0.2156            0.6176             0.598
150 (20)     0.480392           0.441176          0.803922           0.77451
300 (20)     0.470588           0.382353          0.784314           0.735294
1500 (20)    0.303922           0.245             0.647059           0.6372
Table 1: Accuracy of semantic entities extracted with k-means-based batching vs. random sampling.

3.3 Perceiver Resampler

Recent developments in transformer architectures Jaegle et al. (2021) show that transformers can be scaled without the quadratic cost of attention: a predefined number of latent queries is learned and fed to the transformer, and these queries cross-attend to the input features. The state-of-the-art vision-language model of Alayrac et al. (2022) uses this concept (the perceiver resampler) to generate fixed-size embeddings from variable-length inputs.
Traditional transformer-based image and text encoders use pooling layers (mean/linear) to produce fixed-size embeddings from variable-length input. We replace the last pooling layer with a perceiver resampler to obtain fixed-size outputs from both encoders, in a similar fashion to Alayrac et al. (2022), keeping the encoder layers frozen. This approach also scales to larger inputs while retaining the model's expressivity. As shown in Table 2, using a learnable attention-based layer to generate fixed-size embeddings improves feature quality over pooling: the Top-1 accuracy for correctly predicting the annotation (semantic concept/augmentation) of the frames is higher than with the other approaches.
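For intuition, the following is a single-block PyTorch sketch of a perceiver-resampler-style layer; the resampler in Alayrac et al. (2022) stacks several such blocks, and the hyperparameters here are illustrative (dim must be divisible by num_heads):

import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Map a variable-length sequence of encoder features to a fixed-size
    embedding via learned latent queries that cross-attend to the features."""

    def __init__(self, dim, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, features):
        # features: (batch, seq_len, dim) from the frozen encoder.
        b = features.size(0)
        q = self.latents.unsqueeze(0).expand(b, -1, -1)       # (batch, num_latents, dim)
        attended, _ = self.cross_attn(q, features, features)  # latents attend to features
        out = attended + self.ff(attended)                    # residual feed-forward
        return out.mean(dim=1)                                # single fixed-size embedding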

Architecture                                  Accuracy (Top-1% / Top-3%)
Pre-trained encoder                           18 / 27
Pre-trained encoder + custom pooling          21 / 32
Pre-trained encoder + perceiver resampler     26 / 38
Table 2: In custom pooling, the pooling layer of the pre-trained model is replaced with a learnable pooling layer. For the learnable parameters, the results are after 5 epochs on the YouCook2 dataset.

4 Models

We develop a two-stage approach: (i) Phase I: learning the correct annotation from the video frames (frame to text); (ii) Phase II: using these augmentations along with the summarizer's input to generate summaries (text to text) with pre-trained models (e.g., BART or distilBART).

4.1 Phase I:

We use the CLIP model to generate annotations from the video frames. CLIP models are known for learning visual concepts from language supervision: two pre-trained encoders, one for images and one for text, are trained to predict the correlation between image and text. For images, CLIP uses a ResNet50-like architecture, and for text, a masked self-attention Transformer. To train the CLIP model, we use frames (as discussed in Section 3.1) and the corresponding annotations as input. Our experiments on feeding data into CLIP answer the following questions: (i) how to efficiently create a batch of data consisting of diverse examples (Section 3.2), and (ii) which frames from the pool of frames describing a single event should be used as input (Section 3.1).
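A hedged sketch of Phase I inference, scoring a segment's three middle frames against candidate annotations with the Hugging Face CLIP implementation (the checkpoint and helper function are our own choices for illustration; the paper additionally fine-tunes CLIP and adds a perceiver resampler):

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_annotations(frame_paths, candidate_annotations, top_k=3):
    """Score a segment's frames against candidate annotations and return the
    top-k semantic concepts."""
    images = [Image.open(p) for p in frame_paths]
    inputs = processor(text=candidate_annotations, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (num_frames, num_texts); average over the three frames.
    scores = out.logits_per_image.softmax(dim=-1).mean(dim=0)
    best = scores.topk(top_k).indices.tolist()
    return [candidate_annotations[i] for i in best]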

4.2 Phase II:

Phase II takes the semantic augmentation generated in Phase I, along with the transcription of the video, as input to the pre-trained model and generates the summary (system flowchart in Figure 1).

Since each video is divided into segments based on the procedure, we can learn to predict the annotation (semantic concepts) for each segment from the video frames and use these concepts to augment the summarizer model's input, i.e., the transcript of the entire video, to generate summaries.
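A minimal sketch of this augmentation step with a Hugging Face summarization pipeline; the checkpoint, prompt layout, and generation settings are assumptions for illustration, not the paper's exact configuration:

from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_with_concepts(transcript, concepts):
    """Prepend the predicted semantic concepts to the transcript and summarize."""
    augmented_input = " ".join(concepts) + " " + transcript
    result = summarizer(augmented_input, max_length=128, min_length=16,
                        truncation=True)
    return result[0]["summary_text"]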

5 Experimental Setup

In Phase I, we fine-tune the CLIP model for one epoch; for the remaining epochs, we train only the learnable perceiver resampler, keeping the encoder layers frozen. We observed that fine-tuning the CLIP model for more than two epochs heavily degrades prediction performance. Since the model is already pre-trained on huge datasets, we found fine-tuning for one epoch on the new dataset to be reasonable. For Phase II, since we also use a pre-trained summarizer model, we adopt a similar strategy.
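A small sketch of this schedule, assuming a CLIP-style model and a separate resampler module (the helper, learning rate, and optimizer choice are illustrative):

import torch

def build_optimizer(clip_model, resampler, epoch, lr=1e-5):
    """Epoch 0: fine-tune all of CLIP; later epochs: freeze the encoders and
    train only the perceiver-resampler parameters."""
    for p in clip_model.parameters():
        p.requires_grad = (epoch == 0)
    for p in resampler.parameters():
        p.requires_grad = True
    trainable = [p for p in list(clip_model.parameters()) + list(resampler.parameters())
                 if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)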

5.1 Dataset

We use YouCook2 for all our experiments. Existing datasets of instructional videos are limited in many respects (e.g., few videos, limited actions). The YouCook2 dataset is a collection of around 2,000 cooking videos covering about 89 recipes, with roughly 14,000 annotated clips, each paired with one descriptive sentence.

Dataset      Duration
YouCook      140 minutes
50Salads     320 minutes
Breakfast    34.25 hours
YouCook2     176 hours
Table 3: Comparison with other instructional video datasets.

Unlike the other datasets shown in Table 3, YouCook2 includes temporally localized procedure annotations with descriptions, along with long-duration videos. Each video contains 3 to 16 procedure annotations. These procedure segments preserve rich semantic information, which is useful for our task compared to other datasets. We randomly split the dataset into 67% for training, 8% for validation, and 25% for testing.

5.2 Evaluation Metrics.

Our experiments are evaluated with the widely used Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics for text-based summarization, which consider both precision and recall between predicted and target summaries. Recall is the proportion of words in the target summary that also appear in the predicted summary, and precision is the proportion of words in the predicted summary that appear in the target summary. ROUGE has several variants; as shown in Table 5, we evaluate ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L), which compare the similarity of unigrams, bigrams, and the longest common subsequence between target and generated summaries. For summarization, recall is particularly important since it indicates how much of the target summary's information the generated summary captures. We gain a significant improvement in recall using our method compared to the existing pre-trained model.
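For reference, the per-summary ROUGE precision, recall, and F-1 reported in Table 5 can be computed with the rouge_score package (a sketch; the paper does not state its exact ROUGE implementation):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_prf(target_summary, predicted_summary):
    """Return {metric: (precision, recall, F1)} for ROUGE-1/2/L."""
    scores = scorer.score(target_summary, predicted_summary)
    return {name: (s.precision, s.recall, s.fmeasure) for name, s in scores.items()}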

Experiment    Vision encoder frozen    Text encoder frozen            Batch mechanism                          Result (%) (Top-1/Top-3)
1             Yes                      Yes                            k-means on image features                39/70
2             Yes                      Yes                            uniform sampling + k-means clustering    37/71
3             Yes                      Yes                            k-means on text features                 36/68
4             Yes                      Yes (except 2nd-last layer)    temporal k-means                         37/70
5             Yes                      Yes                            temporal k-means                         41/73
Table 4: Results for CLIP-based semantic concept learning. The vision encoder and text encoder columns indicate whether the encoder is frozen. For all experiments, CLIP is fine-tuned for one epoch; for the remaining epochs the encoder layers are frozen.
Model          Semantic Augmentation    R-1 (P/R/F)       R-2 (P/R/F)       R-L (P/R/F)
BART-Base      No                       0.49/0.42/0.44    0.26/0.23/0.23    0.46/0.41/0.42
Distil-BART    No                       0.50/0.50/0.48    0.26/0.27/0.26    0.46/0.46/0.44
BART-Base      Yes                      0.61/0.53/0.54    0.36/0.33/0.33    0.58/0.51/0.52
Distil-BART    Yes                      0.60/0.58/0.57    0.39/0.38/0.37    0.57/0.54/0.54
Table 5: P, R, and F denote precision, recall, and F-1 score for the corresponding ROUGE metrics.

5.3 Results and Analysis

Table 4 contains the results of the CLIP-based semantic augmentation experiments with different data batching. We found that clustering on image features improves performance more than clustering on text features. We also tried uniformly sampling frames from the pool of frames as input, which includes frames that do not contribute to the event depicted in the video (experiment 2); we found that adding frames from the beginning and end of a video segment contributes little to accuracy compared to the middle frames. One experiment (experiment 4) trains the text encoder's second-to-last layer along with the perceiver resampler; this yields lower accuracy than keeping the encoder completely frozen and learning only the perceiver resampler layer.

Table 5 contains the summarization results. We use the predicted semantic concepts along with the transcription as input to the pre-trained summarizer model. To show the contribution of the augmentation alone, we did not fine-tune the pre-trained summarizer model on the semantic concepts. Our approach shows significant improvement on all metrics when the input is augmented with concepts predicted by the CLIP model.

6 Conclusion

We presented a two-stage multimodal abstractive video-to-text summarization model that takes advantage of extra semantic concepts alongside the summarizer input. We provide a detailed evaluation of each step and demonstrate that our method gains a significant improvement over the existing pre-trained summarizer model, using semantic augmentation generation as an intermediate process. We also show that techniques such as adding a perceiver resampler layer and batching with k-means-based clustering that preserves temporal relations can improve the accuracy of concept generation, which in turn improves summary quality.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
  • Bhattacharyya et al. (2020) Sumanta Bhattacharyya, Amirmohammad Rooshenas, Subhajit Naskar, Simeng Sun, Mohit Iyyer, and Andrew McCallum. 2020. Energy-based reranking: Improving neural machine translation using energy-based models. arXiv preprint arXiv:2009.13267.
  • Cao et al. (2015) Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. 2015. Ranking with recursive neural networks and its application to multi-document summarization. In Proceedings of the AAAI conference on artificial intelligence, volume 29.
  • Dang (2006) Hoa Trang Dang. 2006. Duc 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering, pages 48–55.
  • Dong et al. (2020) Yue Dong, Shuohang Wang, Zhe Gan, Yu Cheng, Jackie Chi Kit Cheung, and Jingjing Liu. 2020. Multi-fact correction in abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9320–9331.
  • Gao et al. (2020) Yang Gao, Wei Zhao, and Steffen Eger. 2020. Supert: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354.
  • Ghadiyaram et al. (2019) Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. 2019. Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12046–12055.
  • Guo et al. (2018) Han Guo, Ramakanth Pasunuru, and Mohit Bansal. 2018. Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 687–697.
  • Hashimoto et al. (2019) Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701.
  • Henß et al. (2015) Stefan Henß, Margot Mieskes, and Iryna Gurevych. 2015. A reinforcement learning approach for adaptive single-and multi-document summarization. In GSCL, pages 3–12.
  • Hoang et al. (2019) Andrew Hoang, Antoine Bosselut, Asli Celikyilmaz, and Yejin Choi. 2019. Efficient adaptation of pretrained transformers for abstractive summarization. arXiv preprint arXiv:1906.00138.
  • Hubert Tsai et al. (2017) Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. 2017. Learning robust visual-semantic embeddings. In Proceedings of the IEEE International conference on Computer Vision, pages 3571–3580.
  • Iskender et al. (2020) Neslihan Iskender, Tim Polzehl, and Sebastian Möller. 2020. Towards a reliable and robust methodology for crowd-based subjective quality assessment of query-based extractive text summarization. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 245–253.
  • Jaegle et al. (2021) Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR.
  • Jung et al. (2019) Taehee Jung, Dongyeop Kang, Lucas Mentch, and Eduard Hovy. 2019. Earlier isn’t always better: Sub-aspect analysis on corpus and system biases in summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3315–3326.
  • Khashabi et al. (2021) Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A Smith, and Daniel S Weld. 2021. Genie: A leaderboard for human-in-the-loop evaluation of text generation. arXiv preprint arXiv:2101.06561.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
  • Li et al. (2019) Manling Li, Lingyu Zhang, Heng Ji, and Richard J Radke. 2019. Keep meeting summaries on topic: Abstractive multi-modal meeting summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2190–2196.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  • Liu et al. (2022) Xiaochen Liu, Yang Gao, Yu Bai, Jiawei Li, Yinan Hu, He-Yan Huang, and Boxing Chen. 2022. Psp: Pre-trained soft prompts for few-shot abstractive summarization. In Proceedings of the 29th International Conference on Computational Linguistics, pages 6355–6368.
  • Liu et al. (2019) Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw, and Nancy F Chen. 2019. Topic-aware pointer-generator networks for summarizing spoken conversations. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 814–821. IEEE.
  • Lloret et al. (2018) Elena Lloret, Laura Plaza, and Ahmet Aker. 2018. The challenging task of summary evaluation: an overview. Language Resources and Evaluation, 52(1):101–148.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.
  • Otani et al. (2016) Mayu Otani, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Naokazu Yokoya. 2016. Learning joint representations of videos and sentences with web image search. In Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part I 14, pages 651–667. Springer.
  • Palaskar et al. (2019) Shruti Palaskar, Jindrich Libovickỳ, Spandana Gella, and Florian Metze. 2019. Multimodal abstractive summarization for how2 videos. arXiv preprint arXiv:1906.07901.
  • Palaskar et al. (2021) Shruti Palaskar, Ruslan Salakhutdinov, Alan W Black, and Florian Metze. 2021. Multimodal speech summarization through semantic concept learning. In Interspeech, pages 791–795.
  • Pasca et al. (2023) Razvan-George Pasca, Alexey Gavryushin, Yen-Ling Kuo, Otmar Hilliges, and Xi Wang. 2023. Summarize the past to predict the future: Natural language descriptions of context boost multimodal object interaction. arXiv preprint arXiv:2301.09209.
  • Pernes et al. (2022) Diogo Pernes, Afonso Mendes, and André FT Martins. 2022. Improving abstractive summarization with energy-based re-ranking. arXiv preprint arXiv:2210.15553.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67.
  • Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
  • See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.
  • Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. Bleurt: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.
  • Shi et al. (2019) Jing Shi, Jia Xu, Boqing Gong, and Chenliang Xu. 2019. Not all frames are equal: Weakly-supervised video grounding with contextual similarity and visual clustering losses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10444–10452.
  • Song et al. (2011) Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, and Richang Hong. 2011. Multiple feature hashing for real-time large scale near-duplicate video retrieval. In Proceedings of the 19th ACM international conference on Multimedia, pages 423–432.
  • Su et al. (2020) Ming-Hsiang Su, Chung-Hsien Wu, and Hao-Tse Cheng. 2020. A two-stage transformer-based approach for variable-length abstractive summarization. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2061–2072.
  • Tsimpoukelli et al. (2021) Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212.
  • Wang et al. (2012) Meng Wang, Richang Hong, Guangda Li, Zheng-Jun Zha, Shuicheng Yan, and Tat-Seng Chua. 2012. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia, 14(4):975–985.
  • Wang et al. (2020) Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly topic assistant for transformer based abstractive summarization. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 485–497.
  • Yasunaga et al. (2017) Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. arXiv preprint arXiv:1706.06681.
  • Yogatama et al. (2015) Dani Yogatama, Fei Liu, and Noah A Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1961–1966.
  • Zeng et al. (2022) Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. 2022. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.
  • Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.
  • Zhou et al. (2017) Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464.
  • Zhou et al. (2018a) Luowei Zhou, Chenliang Xu, and Jason J Corso. 2018a. Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598.
  • Zhou et al. (2018b) Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. 2018b. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8739–8748.
  • Zhu et al. (2020) Chenguang Zhu, Ruochen Xu, Michael Zeng, and Xuedong Huang. 2020. A hierarchical network for abstractive meeting summarization with cross-domain pretraining. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 194–203.

Appendix A Example Appendix

This is an appendix.