Multimodal Frame-Scoring Transformer for Video Summarization
Abstract
As the amount of video content has grown rapidly in recent years, automatic video summarization has become useful for quickly previewing a video's content. However, the generic video summarization task has two underlying limitations. First, most previous approaches take only visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are insufficient to train a caption generator for extracting text information from a video and to train multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework that exploits visual, text, and audio features and scores a video frame by frame. Our MFST framework first extracts features for each modality (audio-visual-text) using learned encoders. Then, MFST trains the multimodal frame-scoring transformer, which takes multimodal representations based on the extracted features as input and predicts frame-level scores. Our extensive experiments with previous models and ablation studies on the TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method by a large margin in both F1 score and rank-based evaluation.
1 Introduction
Our Intuition about Video Summarization. When humans watch a video on YouTube or Netflix, they perceive visual, linguistic, and audio information through various sense organs and know which parts of the video are interesting. To judge whether a scene in a movie is absorbing, for example, we observe characters' facial expressions and actions, recognize the background and situation through language, and listen to the characters' utterances and sound effects. Intuitively, humans have access to a well-defined scoring function in their minds, built on versatile sensory systems, while video summarization models from previous studies did not.

Two Underlying Research Challenges. Video summarization aims to capture key frames using predicted frame-wise importance scores, given datasets as shown in Figure 1. Despite its importance and convenience, video summarization has two inherent challenges: (i) Most previous approaches exploit only visual features, leaving other modality features behind Zhao et al. (2018); Rochan et al. (2018); Zhang et al. (2018); Zhou et al. (2018); Jung et al. (2019); Rochan and Wang (2019); Park et al. (2020); Jung et al. (2020); Ghauri et al. (2021). (ii) The existing datasets Gygli et al. (2014); Song et al. (2015), which consist only of videos and frame-level ground truths, are insufficient to train a caption generator for extracting text information from the datasets and to train audio-visual-text feature extractors. Note that frame-wise human-scored video datasets are expensive to obtain compared to other common video datasets.
Language-attended methods exploit text information extracted from videos and predict importance scores based on visual and text features. Bor-Chun Chen and Chen (2017) jointly combine video summarization and video captioning and train a recurrent network in an end-to-end manner. Narasimhan et al. (2021) propose a language-guided video summarization model given videos and corresponding user queries or automatically generated video captions. Haopeng et al. (2022) collect video titles and descriptions to pre-train a language-attended self-supervised learning model.
Although language-guided approaches alleviate the modality issue somewhat, there are inherent limits to conveying vivid audio features into a video summarization model (e.g., there is only so much that the text "beautiful song" can express about a beautiful song).
Our Solutions. This paper proposes the Multimodal Frame-Scoring Transformer (MFST) to handle modality combinations and frame-level scoring for video summarization. Note that existing datasets for generic video summarization Gygli et al. (2014); Song et al. (2015) are insufficient to pretrain a dense caption generator and audio, visual, and text feature extractors. We therefore investigate a new multimodal setting that mitigates the lack of human-scored videos for training a video summarization model.
To this end, our framework consists of three stages: (i) generating dense video captions using a learned caption generator and extracting features for each modality (audio-visual-text) using feature encoders; (ii) building a multimodal representation using a caption-guided attention mechanism and a coarse-grained projection based on modality fusion; and (iii) a frame-scoring transformer that takes audio-visual-text multimodal representations as input and predicts frame-level scores.
Our extensive experiments with previous models and ablation studies on TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method by a large margin in both F1 score and Rank-based evaluation.
Our Contributions. The main contributions of this work are summarized as below:
- To the best of our knowledge, our MFST is the first to introduce a frame-scoring transformer that exploits multimodal features (audio-visual-text) for the generic video summarization task.
- We investigate a new multimodal setting that mitigates the lack of human-scored videos for training a generic video summarization model by exploiting pretrained modules.
- Our empirical study on generic video summarization datasets (TVSum and SumMe) demonstrates that MFST surpasses all its counterparts by nontrivial margins, attesting to the effectiveness and superiority of our approach.
2 Related Work
Two Broad Categories of Video Summarization. In the video summarization task, there are two broad categories of methods: (i) generic video summarization Park et al. (2020); Jung et al. (2020); Ghauri et al. (2021); Narasimhan et al. (2021); Haopeng et al. (2022) and (ii) query-guided video summarization Narasimhan et al. (2021); Wu et al. (2022); Liu et al. (2022); Jiang and Mu (2022).
Generic Video Summarization. The first category of methods aims to extract representative frames from original videos using a well-defined frame-wise scoring function. Existing models cover both supervised and unsupervised learning. Zhou et al. (2018) designed a reward function that measures the diversity and representativeness of generated summaries within an end-to-end reinforcement learning framework. Rochan et al. (2018) treated video summarization as a sequence labeling problem based on fully convolutional sequence models. Zhang et al. (2018) proposed a retrospective encoder that embeds both the predicted summary and the original video. Zhao et al. (2018) integrated shot-level segmentation and video summarization into a hierarchical RNN. Jung et al. (2019) proposed a variance loss with a variational autoencoder and generative adversarial networks. Rochan and Wang (2019) learned a mapping function between raw videos and summarized videos, because such data are much easier to obtain. Yuan et al. (2019) proposed cycle-consistent adversarial networks consisting of a frame selector and an evaluator. Jung et al. (2020) exploited global and local input decomposition to capture the interdependencies of video frames. To represent a relation graph, Park et al. (2020) leveraged recursive graph modeling networks.
Note that these methods employ only visual features, leaving other modality features behind. In this paper, we investigate a new multimodal setting for video summarization that includes audio-visual-text features.
Query-Guided Video Summarization. The second category of methods finds relevant moments according to a user-defined query. Unlike generic video summarization, most query-guided models take the Query-Focused Video Summarization Sharghi et al. (2016), UT Egocentric Lee et al. (2012), and QVHighlights Lei et al. (2021) datasets as input. Sharghi et al. (2017) introduced a parametrized memory network for query-focused video summarization. Wei et al. (2018) proposed a semantic attended network consisting of a frame selector and a video descriptor. Kanehira et al. (2018) investigated how to divide videos into groups under the assumption that video summaries extracted from similar videos should be similar. Narasimhan et al. (2021) propose a framework that handles both the generic and the query-guided settings given videos and corresponding user queries or automatically generated video captions. To effectively address generic queries from different modalities, Wu et al. (2022) introduced graph convolutional networks used for both a summary module and an intent module. Liu et al. (2022) proposed a unified multimodal transformer to cover different input modality combinations. Jiang and Mu (2022) jointly leveraged a video summarization model and a moment localization model.
Note that although the query-driven approach is necessary because defining salient scenes is often subjective, it is difficult to apply when we do not know the contents of the video or when we do not need a subjective summary (e.g., YouTube video previews). In this paper, we aim to tackle the first category with a novel multimodal frame-scoring framework.
Frame-Scoring Transformer. Some recent works apply the transformer Vaswani et al. (2017) to generating video summaries. Narasimhan et al. (2021) employed a transformer to predict importance scores using a language-attended representation. Liu et al. (2022) leveraged a modality encoder, a query generator, and a query decoder within a transformer framework. Compared to Narasimhan et al. (2021), we propose a frame-scoring transformer exploiting an audio-visual-text modality representation.
3 Our Approach

Our Goal. In this section, we present the proposed MFST framework (as shown in Figure 2) with extracted modality features. We consider the standard generic video summarization setting: given a set of videos $V$ and ground truth frame scores $s$, the goal is to minimize the loss with respect to the predicted frame scores $\hat{s}$:
$\min_{\theta} \; \mathcal{L}\big(\hat{s},\, s\big)$   (1)
Multimodal Feature Extraction. Note that existing datasets for generic video summarization Gygli et al. (2014); Song et al. (2015) are insufficient to train a caption generator and audio, visual, and text feature extractors. Considering that large-scale videos scored by humans are not available, we co-opt a pretrained caption generator and feature extractors for each modality:
$C = G(V)$   (2)
$F_t = E_t(C)$   (3)
$F_v = E_v(V)$   (4)
$F_a = E_a(A)$   (5)
where $C$ denotes the set of dense video captions, $G$ is a dense video caption generator Iashin and Rahtu (2020a), $E_t$ is a CLIP-based text feature extractor Xu et al. (2021), $E_v$ is a CLIP-based visual feature extractor Xu et al. (2021), and $E_a$ is the learned audio feature extractor of Baevski et al. (2020) applied to the audio track $A$ of the video. Here, we use audio-visual-text features simultaneously, as humans do, to make it easier for our model to understand the video context and to demote peripheral parts.
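As a concrete illustration of Eq. (5), the sketch below extracts audio features with a pretrained wav2vec 2.0 model via the Hugging Face transformers library. The checkpoint name and the simple per-video interface are illustrative assumptions, not the paper's exact pipeline; the dense caption generator and VideoCLIP encoders of Eqs. (2)-(4) would be wrapped analogously but do not ship as a single off-the-shelf API.

```python
# A minimal, hedged sketch of the audio branch (Eq. 5) with wav2vec 2.0.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

def extract_audio_features(waveform: torch.Tensor, sampling_rate: int = 16000) -> torch.Tensor:
    """Return frame-level audio features F_a for a mono waveform (Eq. 5)."""
    processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()
    inputs = processor(waveform.numpy(), sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # last_hidden_state: (1, T_a, 768) -> (T_a, 768)
        return model(inputs.input_values).last_hidden_state.squeeze(0)

# Usage: 10 seconds of (placeholder) 16 kHz mono audio.
f_audio = extract_audio_features(torch.randn(160000))
```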
Fine and Coarse Spaces. Since our model handles three modality features, $F_v$, $F_t$, and $F_a$, the configuration of the modality space must be considered. Fine and coarse spaces (FAC) Alayrac et al. (2020) combine the modalities into a common embedding while preserving fine-grained information and integrating modality features. Inspired by Alayrac et al. (2020), but unlike that study, we first propose to learn visual-text granularities in the FAC modality embedding spaces.
Although video captioning methods based on automatic speech recognition are useful Hessel et al. (2019); Alayrac et al. (2020); Iashin and Rahtu (2020b), most videos in generic video summarization rarely contain human dialogue. Note that in real-world applications, not all videos contain human voices, nor do we always need to figure out exactly what people are saying for highlight detection (e.g., crowd cheers at a soccer game).
In this paper, we co-opt the feature-based video caption generator Iashin and Rahtu (2020a) to extract dense video captions $C$. Then, we leverage a fine-grained embedding space in which the visual features $F_v$ and text features $F_t$ lie. Note that Liang et al. (2022) demonstrate that the "multimodal video-text pretraining" paradigm cannot completely remove the modality gap phenomenon, which causes performance degradation. In this work, we compute a text-attended visual representation using an attention layer:
$F_{vt} = \mathrm{Attention}(F_v, F_t, F_t) \in \mathcal{E}_f$   (6)
where $\mathcal{E}_f$ stands for the fine-grained feature embedding space. Intuitively, in the fine-grained embedding space, each $f_v$ from $F_v$ chooses the most relevant caption in $F_t$ via the attention mechanism.
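The following is a minimal sketch of the caption-guided attention in Eq. (6), assuming both modalities have already been projected to a shared dimension d; the class name and the use of nn.MultiheadAttention are illustrative choices, not the authors' exact implementation.

```python
# Each visual frame (query) attends over the dense captions (keys/values),
# producing a text-attended visual representation F_vt.
import torch
import torch.nn as nn

class CaptionGuidedAttention(nn.Module):
    def __init__(self, d: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)

    def forward(self, f_visual: torch.Tensor, f_text: torch.Tensor) -> torch.Tensor:
        f_vt, _ = self.attn(query=f_visual, key=f_text, value=f_text)
        return f_vt

# Usage: f_visual (1, N_frames, 512), f_text (1, N_captions, 512)
f_vt = CaptionGuidedAttention()(torch.randn(1, 240, 512), torch.randn(1, 12, 512))
```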
Lastly, we project the fine-grained embedding space into the coarse-grained embedding space by modality fusion:
$F_c = g(F_{vt}, F_a)$   (7)
where $g$ denotes the fusion function and $F_c$ represents the coarse-grained feature representation. The complete process is summarized in Algorithm 1. Importantly, although extracting features for three modalities has a time cost, the transformer architecture and the fusion operation keep the training cost and model size from increasing compared to existing models that use single or bimodal features.
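The sketch below illustrates one possible instantiation of the coarse-grained projection in Eq. (7); since the excerpt does not spell out the fusion function g, concatenation followed by a linear projection is an assumption.

```python
# Hypothetical fusion g: concatenate the text-attended visual features and the
# (frame-aligned) audio features, then project into the coarse-grained space.
import torch
import torch.nn as nn

class CoarseFusion(nn.Module):
    def __init__(self, d: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * d, d)  # projection into the coarse-grained space

    def forward(self, f_vt: torch.Tensor, f_audio: torch.Tensor) -> torch.Tensor:
        # f_vt, f_audio: (1, N, d) features aligned to the video's N frames
        # (e.g., audio features temporally pooled to the frame rate).
        return self.proj(torch.cat([f_vt, f_audio], dim=-1))

f_c = CoarseFusion()(torch.randn(1, 240, 512), torch.randn(1, 240, 512))  # (1, 240, 512)
```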
Multimodal Frame-Scoring Transformer. The frame-scoring transformer Narasimhan et al. (2021) takes a feature representation as input and predicts importance scores. Note that this previous approach exploits only a text-visual representation, leaving audio features behind. MFST introduces a frame-scoring transformer to video summarization, modified to predict frame-level importance scores $\hat{s}$ based on the coarse-grained feature representation $F_c$:
$Q = F_c W_Q, \quad K = F_c W_K, \quad V = F_c W_V$   (8)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$   (9)
$\hat{s} = \mathrm{FST}(F_c)$   (10)
where $W_Q$, $W_K$, and $W_V$ denote parameter matrices, $d_k$ is the dimension of the keys $K$, and $\mathrm{FST}$ denotes the frame-scoring transformer.
Finally, we feed $F_c$ to the frame-scoring transformer, with positional encodings added at the bottom of the transformer encoder and decoder stacks. Given the ground truth frame scores $s = \{s_1, \dots, s_N\}$ of the $N$ frames of a video, we train MFST using the mean squared error:
$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{s}_i - s_i\right)^2$   (11)
The complete process is summarized in Algorithm 2.
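Below is a simplified, hedged sketch of the frame-scoring transformer (Eqs. 8-11): positional encodings are added to the coarse-grained features F_c, a transformer maps them to per-frame scores, and training uses the MSE loss. For brevity it uses an encoder-only stack and a sigmoid output (assuming scores normalized to [0, 1]), whereas the paper describes both encoder and decoder stacks (6 layers each, 8 heads, hidden size 512).

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding added to the input features."""
    def __init__(self, d: int = 512, max_len: int = 10000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
        pe = torch.zeros(max_len, d)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, d)
        return x + self.pe[: x.size(1)]

class FrameScoringTransformer(nn.Module):
    def __init__(self, d: int = 512, heads: int = 8, layers: int = 6):
        super().__init__()
        self.pe = SinusoidalPE(d)
        layer = nn.TransformerEncoderLayer(d, heads, dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d, 1)  # per-frame importance score

    def forward(self, f_c: torch.Tensor) -> torch.Tensor:  # f_c: (B, N, d)
        return torch.sigmoid(self.head(self.encoder(self.pe(f_c)))).squeeze(-1)  # (B, N)

model = FrameScoringTransformer()
scores = model(torch.randn(2, 240, 512))            # predicted frame scores
loss = nn.MSELoss()(scores, torch.rand(2, 240))     # Eq. (11) against ground truth
```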
4 Experiments
Table 1: F1 scores on SumMe under the Canonical (Can), Augment (Aug), and Transfer (Tran) settings.

| Methods | Can | Aug | Tran |
|---|---|---|---|
| vsLSTM Zhang et al. (2016) | 0.376 | 0.416 | 0.407 |
| SGAN Mahasseni et al. (2017) | 0.387 | 0.417 | — |
| SGANs Mahasseni et al. (2017) | 0.417 | 0.436 | — |
| H-RNN Zhao et al. (2017) | 0.421 | 0.438 | — |
| DRDSN Zhou et al. (2018) | 0.421 | 0.439 | 0.426 |
| HSA-RNN Zhao et al. (2018) | 0.423 | 0.421 | — |
| ACGAN He et al. (2019) | 0.460 | 0.470 | 0.445 |
| WS-HRL Chen et al. (2019) | 0.436 | 0.445 | — |
| re-S2S Zhang et al. (2018) | 0.425 | 0.449 | — |
| S-FCN Rochan et al. (2018) | 0.475 | 0.511 | 0.441 |
| VASNet Fajtl et al. (2018) | 0.497 | 0.510 | — |
| CSNets Jung et al. (2019) | 0.513 | 0.521 | 0.451 |
| GLRPE Jung et al. (2020) | 0.502 | — | — |
| SumGraph Park et al. (2020) | 0.514 | 0.529 | 0.487 |
| RSGN Zhao et al. (2021) | 0.450 | 0.457 | 0.440 |
| RSGNuns Zhao et al. (2021) | 0.423 | 0.436 | 0.412 |
| MSVA Ghauri et al. (2021) | 0.545 | — | — |
| CLIP-It Narasimhan et al. (2021) | 0.542 | 0.564 | 0.519 |
| SSPVS Haopeng et al. (2022) | 0.501 | — | — |
| iPTNet Jiang and Mu (2022) | 0.545 | 0.569 | 0.492 |
| MFST (Ours) | 0.595 | 0.655 | 0.576 |
Table 2: F1 scores on TVSum under the Canonical (Can), Augment (Aug), and Transfer (Tran) settings.

| Methods | Can | Aug | Tran |
|---|---|---|---|
| vsLSTM Zhang et al. (2016) | 0.542 | 0.579 | 0.569 |
| SGAN Mahasseni et al. (2017) | 0.508 | 0.589 | — |
| SGANs Mahasseni et al. (2017) | 0.563 | 0.612 | — |
| H-RNN Zhao et al. (2017) | 0.579 | 0.619 | — |
| DRDSN Zhou et al. (2018) | 0.581 | 0.598 | 0.589 |
| HSA-RNN Zhao et al. (2018) | 0.587 | 0.598 | — |
| ACGAN He et al. (2019) | 0.585 | 0.589 | 0.578 |
| WS-HRL Chen et al. (2019) | 0.584 | 0.585 | — |
| re-S2S Zhang et al. (2018) | 0.603 | 0.639 | — |
| S-FCN Rochan et al. (2018) | 0.568 | 0.592 | 0.582 |
| VASNet Fajtl et al. (2018) | 0.614 | 0.623 | — |
| CSNets Jung et al. (2019) | 0.588 | 0.590 | 0.592 |
| GLRPE Jung et al. (2020) | 0.591 | — | — |
| SumGraph Park et al. (2020) | 0.639 | 0.658 | 0.605 |
| RSGN Zhao et al. (2021) | 0.601 | 0.611 | 0.600 |
| RSGNuns Zhao et al. (2021) | 0.580 | 0.591 | 0.597 |
| MSVA Ghauri et al. (2021) | 0.628 | — | — |
| CLIP-It Narasimhan et al. (2021) | 0.663 | 0.690 | 0.655 |
| SSPVS Haopeng et al. (2022) | 0.607 | — | — |
| iPTNet Jiang and Mu (2022) | 0.634 | 0.642 | 0.598 |
| MFST (Ours) | 0.737 | 0.779 | 0.691 |
Table 3: Rank-based evaluation on SumMe (Kendall's τ and Spearman's ρ).

| Methods | Kendall's τ | Spearman's ρ |
|---|---|---|
| Random | 0.000 | 0.000 |
| Human | 0.205 | 0.213 |
| Ground Truth | 1.000 | 1.000 |
| SGAN Mahasseni et al. (2017) | — | — |
| WS-HRL Chen et al. (2019) | — | — |
| DRDSN Zhou et al. (2018) | 0.047 | 0.048 |
| dppLSTM Zhang et al. (2016) | — | — |
| CSNets Jung et al. (2019) | — | — |
| GLRPE Jung et al. (2020) | — | — |
| HSA-RNN Zhao et al. (2018) | 0.064 | 0.066 |
| RSGN Zhao et al. (2021) | 0.083 | 0.085 |
| RSGNu Zhao et al. (2021) | 0.071 | 0.073 |
| SumGraph Park et al. (2020) | — | — |
| SSPVS Haopeng et al. (2022) | 0.123 | 0.170 |
| MSVA Ghauri et al. (2021) | 0.200 | 0.230 |
| MFST (Ours) | 0.229 | 0.229 |
Table 4: Rank-based evaluation on TVSum (Kendall's τ and Spearman's ρ).

| Methods | Kendall's τ | Spearman's ρ |
|---|---|---|
| Random | 0.000 | 0.000 |
| Human | 0.177 | 0.204 |
| Ground Truth | 0.364 | 0.456 |
| SGAN Mahasseni et al. (2017) | 0.024 | 0.032 |
| WS-HRL Chen et al. (2019) | 0.078 | 0.116 |
| DRDSN Zhou et al. (2018) | 0.020 | 0.026 |
| dppLSTM Zhang et al. (2016) | 0.042 | 0.055 |
| CSNets Jung et al. (2019) | 0.025 | 0.034 |
| GLRPE Jung et al. (2020) | 0.070 | 0.091 |
| HSA-RNN Zhao et al. (2018) | 0.082 | 0.088 |
| RSGN Zhao et al. (2021) | 0.083 | 0.090 |
| RSGNu Zhao et al. (2021) | 0.048 | 0.052 |
| SumGraph Park et al. (2020) | 0.094 | 0.138 |
| SSPVS Haopeng et al. (2022) | 0.169 | 0.231 |
| MSVA Ghauri et al. (2021) | 0.190 | 0.210 |
| MFST (Ours) | 0.222 | 0.224 |
4.1 Two Fundamental Research Questions.
In our extensive experiments, we address two fundamental research questions: (1) how to alleviate the data sparsity problem, and (2) how to learn diverse modality features to better predict importance scores. The first question arises because summarization datasets consist of frame-level human annotations, which are hard to collect. The second question must be solved in order to create a video summarization model that considers multimodal information as people do. Note that these two questions are not separate but highly correlated.
4.2 Dataset Description
We conduct our video summarization experiments on two benchmarks: TVSum Song et al. (2015) and SumMe Gygli et al. (2014).
- TVSum Song et al. (2015) contains 50 videos covering topics such as news and documentaries. The duration of each video varies from 1 to 10 minutes. 20 annotators provide frame-level importance scores for each video.
- SumMe Gygli et al. (2014) consists of 25 user videos covering various topics (e.g., holidays and sports). Each video ranges from 1 to 6 minutes. 15 to 18 annotators provided multiple ground-truth summaries for each video.
4.3 Metric Description
We follow the same experimental metrics used in existing works: F-score and rank-based evaluation. A true positive is the highlight overlap between the model-generated summary and the human-generated summary, based on importance scores. The precision and recall are calculated as follows:
$\mathrm{Precision} = \frac{\text{overlapped duration}}{\text{duration of generated summary}}, \quad \mathrm{Recall} = \frac{\text{overlapped duration}}{\text{duration of ground-truth summary}}, \quad F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$   (12)
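A minimal sketch of the F-score in Eq. (12), assuming the generated and ground-truth summaries are given as binary per-frame indicator arrays (the shot-level segmentation and knapsack selection used by the benchmarks are omitted):

```python
import numpy as np

def f_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: binary arrays of length N (1 = frame selected for the summary)."""
    overlap = float(np.sum(pred * gt))            # true positives: frames selected by both
    precision = overlap / max(float(np.sum(pred)), 1e-8)
    recall = overlap / max(float(np.sum(gt)), 1e-8)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_score(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])))
```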
Rank-based evaluations Otani et al. (2019) compute Kendall's $\tau$ and Spearman's $\rho$, which measure non-parametric rank correlations:
$\tau = \frac{N_c - N_d}{N(N-1)/2}$   (13)
$\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^{2}}{N(N^{2} - 1)}$   (14)

where $N_c$ and $N_d$ denote the numbers of concordant and discordant frame pairs, and $d_i$ is the difference between the ranks of frame $i$ in the two score lists.
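A minimal sketch of the rank-based evaluation in Eqs. (13)-(14), computing Kendall's τ and Spearman's ρ between predicted and annotator scores with SciPy; averaging over all annotators and videos (as in Otani et al., 2019) is left out for brevity, and the score values below are placeholders.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

pred_scores = np.array([0.9, 0.2, 0.7, 0.4, 0.1])    # model-predicted frame scores
human_scores = np.array([0.8, 0.3, 0.6, 0.5, 0.2])   # one annotator's frame scores

tau, _ = kendalltau(pred_scores, human_scores)
rho, _ = spearmanr(pred_scores, human_scores)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```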
Although Apostolidis et al. (2020) proposed Performance over Random (PoR), a new evaluation protocol for handling non-overlapping splits, most existing methods did not disclose their code and were evaluated under the existing protocols, which use a fixed number of test splits. Therefore, we follow the same experimental metrics and show the superiority of our model by a large margin.
4.4 Settings
Experimental Settings. We compare MFST with existing models in three different experimental settings:
- In the Canonical setting, we select one dataset (e.g., TVSum or SumMe) and randomly split it into training and evaluation sets.
- In the Augment setting, we merge the two datasets into one and randomly split the merged dataset into training and evaluation sets.
- In the Transfer setting, we train a model on one dataset and evaluate the trained model on the other dataset.
Following the experimental protocol proposed by existing studies, in all experimental settings we conduct experiments over 5 splits and average the results. Each experiment randomly selects 20% of the dataset for evaluation.
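A hedged sketch of how the three settings and the 80/20 random splits could be constructed, assuming each dataset is simply a list of video IDs (the IDs below are placeholders):

```python
import random

def make_split(videos, seed, eval_ratio=0.2):
    """Randomly hold out eval_ratio of the videos for evaluation."""
    rng = random.Random(seed)
    vids = list(videos)
    rng.shuffle(vids)
    n_eval = int(len(vids) * eval_ratio)
    return vids[n_eval:], vids[:n_eval]          # (train, eval)

def build_setting(setting, tvsum, summe, seed):
    if setting == "canonical":                   # split a single dataset
        return make_split(tvsum, seed)
    if setting == "augment":                     # merge both datasets, then split
        return make_split(tvsum + summe, seed)
    if setting == "transfer":                    # train on one, evaluate on the other
        return list(tvsum), list(summe)
    raise ValueError(setting)

# Five random splits; results are averaged over them.
splits = [build_setting("canonical", [f"tvsum_{i}" for i in range(50)], [], s) for s in range(5)]
```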
Implementation Details. We leverage a Frame-Scoring Transformer with 8 heads, 6 encoder layers, and 6 decoder layers. We use a linear layer whose hidden dimension is 512.
Note that summarization datasets are hard to collect because they consist of frame-level human annotations. To mitigate data scarcity, we co-opt the learned VideoCLIP Xu et al. (2021) to obtain both visual and text features. We compute the text-attended visual representation using an attention mechanism, finding the most relevant caption per frame. We leverage a feature extractor based on Wav2Vec2 Baevski et al. (2020) to exploit audio features.
Training Details. We train our model on 8 NVIDIA GeForce TITAN GPUs for 20 epochs. The batch size is selected based on available GPU memory. We use the Adam optimizer with a learning rate of 1e-4 and a weight decay of 1e-3. Note that although extracting features for three modalities has a time cost, the transformer architecture and the fusion operation keep the training cost and model size from increasing compared to existing models using single or bimodal features (not shown).
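A minimal sketch of the training loop with the stated hyperparameters (Adam, lr 1e-4, weight decay 1e-3, 20 epochs, MSE loss); the random tensors stand in for precomputed coarse-grained features and ground-truth scores, and the tiny linear scorer is a placeholder for MFST.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(40, 240, 512)   # placeholder coarse-grained features F_c
scores = torch.rand(40, 240)           # placeholder ground-truth frame scores
loader = DataLoader(TensorDataset(features, scores), batch_size=4, shuffle=True)

model = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())   # stand-in for MFST
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.MSELoss()

for epoch in range(20):
    for f_c, s in loader:
        optimizer.zero_grad()
        pred = model(f_c).squeeze(-1)   # (B, N) frame-level scores
        loss = criterion(pred, s)       # Eq. (11)
        loss.backward()
        optimizer.step()
```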
4.5 Performance Comparison
Results on Video Summarization. We conduct experiments to answer the two questions. Table 1 and Table 3 report our extensive comparisons with previous methods on the SumMe dataset, and Table 2 and Table 4 report the corresponding comparisons on the TVSum dataset. As shown in Table 1 and Table 2, under the canonical, augment, and transfer settings, MFST outperforms existing methods on both benchmarks. Table 3 and Table 4 also show that MFST outperforms prior methods by a large margin in rank-based evaluation, with MSVA as the only exception. However, since MSVA did not report results in the Augment and Transfer settings, it is hard to compare it comprehensively with our model.
The experimental results show that methods that exploit two or more modalities and mitigate the data sparsity problem achieve higher scores than other methods. Among them, MFST, which properly exploits the visual, text, and audio modalities, achieves state-of-the-art performance by nontrivial margins in both F1 score and rank-based evaluation. Furthermore, thanks to its architecture, we speculate that MFST can circumvent the data-hunger and overfitting-sensitivity issues caused by the transformer structure. We leave the proof of this for future work.
4.6 Ablation Studies
Effect of Each Modality. We further conduct ablation studies on the SumMe and TVSum datasets. As shown in Table 5 and Table 6, to validate the effect of our feature representation, we compare three model variants obtained by adding modality features one at a time. The results in Table 5 and Table 6 demonstrate that our approach improves as extracted modality features are added. Interestingly, our framework without audio features also outperforms existing methods on both SumMe and TVSum. We conjecture that MFST, by exploiting pretrained models, can represent modality features in an embedding space that leads to higher prediction performance than other models. Although existing methods achieve higher scores as they use more modality information than their counterparts, most of them did not disclose their code, so we leave modality ablations with those methods for future work. In the following section, we show qualitative analysis and ablation.
Table 5: Ablation on modality features (F1 score on SumMe).

| Visual | Text | Audio | Can | Aug | Tran |
|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 0.525 | 0.564 | 0.505 |
| ✓ | ✓ | ✗ | 0.542 | 0.629 | 0.553 |
| ✓ | ✓ | ✓ | 0.595 | 0.655 | 0.576 |
Table 6: Ablation on modality features (F1 score on TVSum).

| Visual | Text | Audio | Can | Aug | Tran |
|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 0.659 | 0.674 | 0.629 |
| ✓ | ✓ | ✗ | 0.708 | 0.753 | 0.659 |
| ✓ | ✓ | ✓ | 0.737 | 0.779 | 0.691 |

4.7 Qualitative Results.
In Figure 3, we visualize importance scores generated by MFST against ground-truth scores (left), and video summarization results (right). While comparing the summary results, we add each modality sequentially to our frame-scoring transformer when predicting the importance score. Interestingly, we observe that MFST generates a summary that reflects the modality information whenever additional modality features are used. For example, the summary from the model using all three modalities focuses on frames where the electric voltage is checked, accompanied by the sound of iron striking, while the summaries from models using one or two modalities fail to capture these highlight frames. As shown in Figure 3, MFST predicts importance scores closer to the ground-truth scores as more modality information is exploited. Indeed, the model-generated score graph and the ground-truth graph are very similar, meaning the maxima and minima of both graphs largely overlap. The results in Figure 3 indicate that our model predicts which parts humans find interesting and which they do not.
5 Conclusion
The conclusion of this paper is threefold:
- We propose MFST, a simple and effective frame-scoring framework for videos. MFST exploits audio-visual-text features using learned feature extractors and a frame-scoring multimodal transformer. MFST first generates a caption-attended representation in a fine-grained embedding space using an attention mechanism. Then, our model projects the fine-grained space into a coarse-grained space based on modality fusion.
- We wrestle with two underlying research questions for generic video summarization: (1) how to alleviate the data sparsity problem, and (2) how to learn diverse modality features to better predict importance scores. The first question arises because summarization datasets consist of frame-level human annotations, which are hard to collect. The second question must be solved in order to create a video summarization model that considers multimodal information as people do.
- Our comprehensive comparisons with previous approaches and ablation studies on generic video summarization datasets (TVSum and SumMe) show that MFST surpasses all its counterparts by nontrivial margins, attesting to the effectiveness and superiority of our approach.
6 Discussion and Future Work
Limitations. Although we demonstrate the effectiveness and superiority of our proposed method, the lack of deep insight into modality representations and into utilizing large-scale models is the biggest limitation of our work. For example, we could not discover the precise reasons why our framework is superior with respect to modality representations.
Additionally, this work lacks a thorough examination of how well each modality is represented in the common space. In other words, we recognize that this work does not provide experiments with various model architectures for reflecting modality information in our model.
Despite these limitations, it is clear that our approach handles three different modalities well and achieves state-of-the-art performance by nontrivial margins in both F1 score and rank-based evaluation.
Future Works. Our direction of future work is threefold:
- Proving, both theoretically and empirically, that our framework can circumvent the data-hunger and overfitting-sensitivity issues caused by the transformer structure.
- Further study of how to represent each modality well in a common space (e.g., contrastive learning).
- Further research and ablations using existing approaches and various model architectures to better reflect modality information.
References
- Alayrac et al. [2020] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020. Curran Associates Inc.
- Apostolidis et al. [2020] Evlampios Apostolidis, Eleni Adamantidou, Alexandros I. Metsai, Vasileios Mezaris, and Ioannis Patras. Performance over random: A robust evaluation protocol for video summarization methods. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, page 1056–1064, New York, NY, USA, 2020. Association for Computing Machinery.
- Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 12449–12460. Curran Associates, Inc., 2020.
- Bor-Chun Chen and Chen [2017] Yan-Ying Chen Bor-Chun Chen and Francine Chen. Video to text summary: Joint video summarization and captioning with recurrent neural networks. In Gabriel Brostow Tae-Kyun Kim, Stefanos Zafeiriou and Krystian Mikolajczyk, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 118.1–118.14. BMVA Press, September 2017.
- Chen et al. [2019] Yiyan Chen, Li Tao, Xueting Wang, and Toshihiko Yamasaki. Weakly supervised video summarization by hierarchical reinforcement learning. In Proceedings of the ACM Multimedia Asia, pages 1–6. 2019.
- Fajtl et al. [2018] Jiri Fajtl, Hajar Sadeghi Sokeh, Vasileios Argyriou, Dorothy Monekosso, and Paolo Remagnino. Summarizing videos with attention. In Asian Conference on Computer Vision, pages 39–54. Springer, 2018.
- Ghauri et al. [2021] Junaid Ahmed Ghauri, Sherzod Hakimov, and Ralph Ewerth. Supervised video summarization via multiple feature sets with parallel attention. 2021.
- Gygli et al. [2014] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. Creating summaries from user videos. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 505–520, Cham, 2014. Springer International Publishing.
- Haopeng et al. [2022] Li Haopeng, Ke Qiuhong, Gong Mingming, and Zhang Rui. Video summarization based on video-text modelling, 2022.
- He et al. [2019] Xufeng He, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Unsupervised video summarization with attentive conditional generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2296–2304, 2019.
- Hessel et al. [2019] Jack Hessel, Bo Pang, Zhenhai Zhu, and Radu Soricut. A case study on combining ASR and visual features for generating instructional video captions. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 419–429, Hong Kong, China, November 2019. Association for Computational Linguistics.
- Iashin and Rahtu [2020a] Vladimir Iashin and Esa Rahtu. A better use of audio-visual cues: Dense video captioning with bi-modal transformer. In British Machine Vision Conference (BMVC), 2020.
- Iashin and Rahtu [2020b] Vladimir Iashin and Esa Rahtu. Multi-modal dense video captioning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 958–959, 2020.
- Jiang and Mu [2022] Hao Jiang and Yadong Mu. Joint video summarization and moment localization by cross-task sample transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16388–16398, 2022.
- Jung et al. [2019] Yunjae Jung, Donghyeon Cho, Dahun Kim, Sanghyun Woo, and In So Kweon. Discriminative feature learning for unsupervised video summarization. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 8537–8544. AAAI Press, 2019.
- Jung et al. [2020] Yunjae Jung, Donghyeon Cho, Sanghyun Woo, and In So Kweon. Global-and-local relative position embedding for unsupervised video summarization. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, page 167–183, Berlin, Heidelberg, 2020. Springer-Verlag.
- Kanehira et al. [2018] Atsushi Kanehira, Luc Van Gool, Y. Ushiku, and Tatsuya Harada. Viewpoint-aware video summarization. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7435–7444, 2018.
- Lee et al. [2012] Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. Discovering important people and objects for egocentric video summarization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1346–1353, 2012.
- Lei et al. [2021] Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021.
- Liang et al. [2022] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. arXiv preprint arXiv:2203.02053, 2022.
- Liu et al. [2022] Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Mahasseni et al. [2017] Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic. Unsupervised video summarization with adversarial lstm networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 202–211, 2017.
- Narasimhan et al. [2021] Medhini Narasimhan, Anna Rohrbach, and Trevor Darrell. Clip-it! language-guided video summarization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 13988–14000. Curran Associates, Inc., 2021.
- Otani et al. [2019] Mayu Otani, Yuta Nakashima, Esa Rahtu, and Janne Heikkila. Rethinking the evaluation of video summaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7596–7604, 2019.
- Park et al. [2020] Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn. Sumgraph: Video summarization via recursive graph modeling. In ECCV, 2020.
- Rochan and Wang [2019] Mrigank Rochan and Yang Wang. Video summarization by learning from unpaired data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7894–7903, 2019.
- Rochan et al. [2018] Mrigank Rochan, Linwei Ye, and Yang Wang. Video summarization using fully convolutional sequence networks. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 358–374, Cham, 2018. Springer International Publishing.
- Sharghi et al. [2016] Aidean Sharghi, Boqing Gong, and Mubarak Shah. Query-focused extractive video summarization. In ECCV, 2016.
- Sharghi et al. [2017] Aidean Sharghi, Jacob S. Laurel, and Boqing Gong. Query-focused video summarization: Dataset, evaluation, and a memory network based approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2127–2136, 2017.
- Song et al. [2015] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. Tvsum: Summarizing web videos using titles. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5179–5187, 2015.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Wei et al. [2018] Huawei Wei, Bingbing Ni, Yichao Yan, Huanyu Yu, and Xiaokang Yang. Video summarization via semantic attended networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018.
- Wu et al. [2022] Guande Wu, Jianzhe Lin, and Claudio T. Silva. Intentvizor: Towards generic query guided interactive video summarization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10503–10512, June 2022.
- Xu et al. [2021] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 2021. Association for Computational Linguistics.
- Yuan et al. [2019] Li Yuan, Francis E. H. Tay, Ping Li, Li Zhou, and Jiashi Feng. Cycle-sum: Cycle-consistent adversarial lstm networks for unsupervised video summarization. In AAAI, 2019.
- Zhang et al. [2016] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In European conference on computer vision, pages 766–782. Springer, 2016.
- Zhang et al. [2018] Ke Zhang, Kristen Grauman, and Fei Sha. Retrospective encoders for video summarization. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 391–408, Cham, 2018. Springer International Publishing.
- Zhao et al. [2017] Bin Zhao, Xuelong Li, and Xiaoqiang Lu. Hierarchical recurrent neural network for video summarization. In Proceedings of the 25th ACM international conference on Multimedia, pages 863–871, 2017.
- Zhao et al. [2018] Bin Zhao, Xuelong Li, and Xiaoqiang Lu. Hsa-rnn: Hierarchical structure-adaptive rnn for video summarization. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7405–7414, 2018.
- Zhao et al. [2021] Bin Zhao, Haopeng Li, Xiaoqiang Lu, and Xuelong Li. Reconstructive sequence-graph network for video summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- Zhou et al. [2018] Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018.