LGDN: Language-Guided Denoising Network
for Video-Language Modeling

Haoyu Lu^1,2 Mingyu Ding³ Nanyi Fei^1,2 Yuqi Huo⁴ Zhiwu Lu^1,2,
¹Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
²Beijing Key Laboratory of Big Data Management and Analysis Methods
³The University of Hong Kong, Pokfulam, Hong Kong
⁴JD Corporation, Beijing, China
{lhy1998, luzhiwu}@ruc.edu.cn The corresponding author.

Abstract

Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and text description are semantically correlated, and focus on video-language modeling at video level. However, this hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is difficult to cover all frames with a single video-level description; (2) A raw video typically has noisy/meaningless information (e.g., scenery shot, transition or teaser). Although a number of recent works deploy attention mechanism to alleviate this problem, the irrelevant/noisy information still makes it very difficult to address. To overcome such challenge, we thus propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all extracted video frames, LGDN dynamically filters out the misaligned or redundant frames under the language supervision and obtains only 2–4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that our LGDN outperforms the state-of-the-arts by large margins. We also provide detailed ablation study to reveal the critical importance of solving the noise issue, in hope of inspiring future video-language work.

1 Introduction

Humans are exposed to the world through a variety of sensory organs, such as eyes, ears, and the sense of touch. In the past few years, multi-modal data (e.g., text or video) has grown and accumulated rapidly on the Internet, which brings the increasing demands for video-language understanding. As one of the fundamental topics, video-language modeling is still challenging due to the heterogeneity of the video-text data. More notably, the video-text data is typically noisy (e.g., misaligned or semi-relevant, as shown in Figure 1), leading to intractable video-language modeling.

The dominant paradigm [8, 29, 12, 13, 45] for video-language modeling is to first extract language features and dense video features via off-the-shelf language and vision models (e.g., BERT [7], 3D CNN [48]), and then model the cross-modal representation by defining the objective function (e.g., triplet loss [16]) within a joint semantic space. Although achieving great success, these methods typically densely sample frames from a full sequence of raw video to obtain richer representation and thus cost excessive computation. Since the heavy computation makes it challenging to train the whole network end-to-end, they often achieve sub-optimal performance in video-language modeling. Recently, ClipBERT [24] proposes a sparse sampling strategy to tackle this drawback. Concretely, ClipBERT first samples video frames sparsely (8–16 frames per video), and then models the cross-modal alignment at frame-level. This sparse sampling paradigm enables end-to-end training, leading to much better performance. Nevertheless, token-level cross-modal interaction, which has achieved great success in image-text modeling [20, 25], is still not well explored for video-language modeling due to the heavy resource computation (even with 8–16 frames per video). Moreover, both the dominant paradigm and ClipBERT’s sparse sampling paradigm assume that video frames and the text description (w.r.t. a video-text pair) are semantically correlated, which is often invalid in practice.

Refer to caption — Figure 1: An example of video-text pair where the raw video contains misaligned and redundant video frames given the text description. Instead of aggregating all video frames for token-level alignment, we obtain *only 2–4 salient frames* per video by filtering out the misaligned ones. We find that utilizing 2–4 salient frames is much more effective while enjoying faster speed.

The correlation hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is hard to cover all frames with a single video-level description; (2) A raw video often has noisy or meaningless information (e.g., scenery shot, transition or teaser). For the dominant paradigm which utilizes densely-sampled frames, though often with self-attention mechanism [43], the irrelevant/noisy information makes it hard to learn high-quality video-language representation. For the sparse sampling paradigm used in ClipBERT that models the cross-modal alignment at frame-level, the misaligned frame-text pairs are wrongly forced to become closer, which inevitably leads to inaccurate cross-modal alignment. Overall, due to this noise issue (see Figure 1), video-language modeling is still challenging. Note that humans also encounter such problem in reality, but seem to be born with the ability to resist noise. That is, everyone can quickly scan through the entire video, easily ignore the noisy frames and focus on the salient ones given the text.

Motivated by this human ability, we propose a Language-Guided Denoising Network termed LGDN to dynamically filter out irrelevant or redundant information under the language supervision for better video-language modeling. Concretely, we devise a Salient Frame Proposal (SFP) mechanism which adopts four strategies to estimate frame-level relevance scores under the language supervision and proposes/selects only salient frames (per video) for precisely video-language modeling. Although the frame embeddings and text embeddings can be (roughly) aligned by introducing a Momentum Video-Level Contrastive Learning (MVCL) module, it is vital to precisely establish frame-text alignment for proposing salient frames. Therefore, based on multiple instance learning (MIL), we propose a Momentum Frame-Level Multiple Salient-instance learning (MSL) -Contrastive Learning (MFCL) module for video-language modeling at frame-level. Finally, with our SFP mechanism, we propose a Language-Guided Salient Frame Matching (LSFM) module for fine-grained alignment, which adopts a token-aware cross-attention Transformer for cross-modal token-level alignment.

Our main contributions are as follows: (1) We devise a salient frame proposal mechanism that can dynamically filter out irrelevant information under the language supervision, meanwhile maintaining salient information. (2) We propose an end-to-end framework termed LGDN for video-language modeling with cross-modal interaction at three levels: language-guided salient frame matching at token-level, momentum frame-level MSL-contrastive learning, and momentum video-level contrastive learning. (3) We evaluate our LGDN on five public datasets and find that our LGDN outperforms the latest competitors by large margins. We also provide detailed ablation study to reveal the critical importance of solving the noise issue, in hope of inspiring future video-language work.

2 Related Work

Video-Language Modeling.

Video-language modeling, a fundamental research topic that is beneficial for search engine and video recommendation, has attracted a lot of attention in recent years with the rapid growth of web videos. Previous works have made great efforts to model richer representations for video and text modalities and then align the features of the two modalities by the objective function (e.g., triplet loss). One common representative approach [5, 19] is to adopt a Graph Convolution Network (GCN) to extract richer information for video-text retrieval. Another representative approach [29, 12, 13, 52, 28] is to exploit extra experts (e.g., object, motion, speech) for video-language modeling. Recently, ClipBERT [24] proposes a sparse sampling strategy that enables end-to-end training, thus achieving higher performance. Moreover, Frozen in Time [2] also follows a sparse sampling paradigm, and proposes an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets. However, as illustrated in Figure 1, a raw video typically has noisy/meaningless information, and thus the presence of misaligned frames is inevitable during video-language modeling. Note that most existing methods assume that the video frames and paired text are semantically correlated, without considering the noise phenomenon. Although a self-attention mechanism has been widely applied, the misaligned frames still harm the cross-modal alignment. In this work, we thus propose a salient frame proposal mechanism to effectively (and directly) address this problem.

Cross-Modal Alignment Objective Functions

Most previous methods adopt triplet loss as a major objective function for video-language modeling. CGMSCD [13] points out that the triplet loss sometimes leads to a wrong learning direction and thus devises an adaptive margin triplet loss for representation learning. More recent works [40, 17, 18] propose to apply the InfoNCE contrastive loss [47, 37, 6] to enhance representation learning. Particularly, BriVL [17], ALBEF [25] and COTS [31] introduce a momentum mechanism [14] to maintain more negative samples for image-text contrastive learning. Following these state-of-the-art models, we propose momentum video-level contrastive learning for video-text global alignment in this paper. Note that MIL-NCE [34] enhances the InfoNCE loss with multiple-instance learning (MIL) to cope with the misaligned narration descriptions in HowTo100M [35]. In this work, we thus propose momentum frame-level MSL-contrastive learning to assist in addressing the misaligned frame problem.

3 Methodology

Figure 2 gives a brief overview of our LGDN framework for video-language modeling, which is composed of four main components: 1) language and vision representation extractors; 2) momentum video-level contrastive learning; 3) momentum frame-level MSL-contrastive learning, and 4) language-guided salient frame matching. In the following, we will describe each component in detail.

3.1 Feature Representation

Vision Representation.

Given an input video $V$ as a sequence of frames $\{E_{i}\}_{i=1}^{N}$ , where $N$ is the length of the video, we utilize a 2-D vision Transformer (e.g., ViT) as our vision backbone to extract frame-level features $\mathbf{E}=\{\mathbf{E}_{1},\mathbf{E}_{2},...,\mathbf{E}_{N}\}$ . Each frame $E_{i}$ of video $V$ can be represented as $\mathbf{E}_{i}=[\mathbf{e}_{cls};\mathbf{e}_{1};...;\mathbf{e}_{k_{v}-1}]\in R^{k_{v}\times D_{v}}$ , where $\mathbf{e}_{cls}$ denotes the [CLS] token, $k_{v}$ denotes the patch sequence length, and $D_{v}$ denotes the dimension of the patch embeddings. We utilize a fully-connected layer to project the [CLS] token into the frame embedding $\mathbf{f}_{i}^{e}$ . We then deploy a temporal module $T$ (e.g., a Transformer layer) to aggregate the frame embeddings to obtain the final video embedding:

\mathbf{f}^{v}=T([\mathbf{f}^{e}_{1},\mathbf{f}^{e}_{2},...,\mathbf{f}^{e}_{N}])=f^{v}(V),

(1)

where $f^{v}$ denotes the entire vision (video) encoder.

Language Representation.

Given an input text $L$ , we utilize BERT-Base as our language backbone to extract text feature $\mathbf{L}$ , which can be represented as $\mathbf{L}=[\mathbf{l}_{cls};\mathbf{l}_{1};...;\mathbf{l}_{k_{l}-1}]\in R^{k_{l}\times D_{l}}$ , where $\mathbf{l}_{cls}$ is the [CLS] token, $k_{l}$ is the token sequence length, and $D_{l}$ is the dimension of the token embeddings. We deploy a fully-connected layer to project the [CLS] token into the text embedding $\mathbf{f}^{l}=f^{l}(L)$ , where $f^{l}$ is the language encoder.

3.2 Momentum Video-Level Contrastive Learning (MVCL) Module

Note that our LGDN is designed to filter out the unmatched/redundant frames for better token-level alignment, without leveraging the temporal information of the videos explicitly. Therefore, we firstly introduce a Momentum Video-Level Contrastive Learning (MVCL) module to address this problem.

The MVCL module utilizes a temporal module (e.g., Transformer block) to aggregate the frame embeddings to obtain the video embedding. Contrastive learning is then applied for holistic video-text alignment. However, video data takes up large GPU memory and the mini-batch size tends to be small with strict resource, which brings harm to contrastive learning. Inspired by MoCo [14], we introduce the momentum mechanism to maintain massive negative samples in memory bank for contrastive learning. Concretely, We firstly maintain video memory bank $\mathcal{M}^{v}=\{\mathbf{\hat{q}}^{v}_{j}\}_{j=1}^{N_{m}}$ and text memory bank $\mathcal{M}^{l}=\{\mathbf{\hat{q}}^{l}_{j}\}_{j=1}^{N_{m}}$ to store video/text features, where $N_{m}$ denotes the memory bank size and $\mathbf{\hat{q}}^{v}_{j}$ / $\mathbf{\hat{q}}^{l}_{j}$ denotes the $j$ -th stored video/text feature vector. Let $f^{v}$ (with parameters $\theta^{v}$ ) and $\hat{f}^{v}$ (with parameters $\hat{\theta}^{v}$ ) denote vision encoder and vision momentum encoder, respectively. Similarly, let $f^{l}$ (with parameters $\theta^{l}$ ) and $\hat{f}^{l}$ (with parameters $\hat{\theta}^{l}$ ) denote language encoder and language momentum encoder, respectively. The parameters of momentum encoders are updated by:

\displaystyle\hat{\theta}^{v}

\displaystyle=m\cdot\hat{\theta}^{v}+(1-m)\cdot\theta^{v},~{}~{}~{}~{}\hat{\theta}^{l}=m\cdot\hat{\theta}^{l}+(1-m)\cdot\theta^{l},

(2)

where $m$ is the momentum coefficient hyper-parameter.

The loss function is thus constructed as follow: for each video $V_{i}$ in mini-batch $\mathcal{B}$ , we define the video-to-text contrastive loss between its paired text $L_{i}$ and all negative samples in the text memory bank $\mathcal{M}^{l}$ , resulting in an InfoNCE loss (with $\tau$ being the temperature hyper-parameter):

\mathcal{L}_{\text{V2T}}=-\frac{1}{|\mathcal{B}|}\sum_{(V_{i},L_{i})\in\mathcal{B}}\log\frac{\exp(\cos(\mathbf{f}^{v}_{i},\mathbf{\hat{f}}^{l}_{i})/\tau)}{\exp(\cos(\mathbf{f}^{v}_{i},\mathbf{\hat{f}}^{l}_{i})/\tau)+\sum_{\mathbf{\hat{q}}^{l}_{j}\in\mathcal{M}^{l}}\exp(\cos(\mathbf{f}^{v}_{i},\mathbf{\hat{q}}^{l}_{j})/\tau)},

(3)

where $\mathbf{\hat{f}}^{l}_{i}=\hat{f}^{v}(l_{i})$ , and the similarity of two features is measured by the cosine similarity. Similarly, given each text description $L_{i}$ in mini-batch $\mathcal{B}$ , we define the text-to-video contrastive loss as:

\mathcal{L}_{\text{T2V}}=-\frac{1}{|\mathcal{B}|}\sum_{(V_{i},L_{i})\in\mathcal{B}}\log\frac{\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{f}}^{v}_{i})/\tau)}{\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{f}}^{v}_{i})/\tau)+\sum_{\mathbf{\hat{q}}^{v}_{j}\in\mathcal{M}^{v}}\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{q}}^{v}_{j})/\tau)},

(4)

where $\mathbf{\hat{f}}^{v}_{i}=\hat{f}^{v}(V_{i})$ . Finally, the objective function for MVCL is defined as follows:

\mathcal{L}_{\text{MVCL}}=\mathcal{L}_{\text{V2T}}+\mathcal{L}_{\text{T2V}}.

(5)

3.3 Salient Frame Proposal (SFP) Mechanism

As shown in Figure 1, video-text data inevitably contains misaligned frame-text pairs. Although an attention mechanism has been applied in Eq. (1), the irrelevant and noisy information would still mislead the cross-modal alignment in our model. To alleviate this problem, we thus propose a Salient Frame Proposal (SFP) mechanism for video-language modeling.

The core idea of our SFP mechanism is to dynamically filter out misaligned or redundant frames and maintain only a few important frames to represent the video well, which are called as salient frames. Formally, for each video-text pair, we first identify the relevance score $R(j|i)$ between the text $L_{i}$ and the $j$ -th frame $E_{i,j}$ of the video $V_{i}$ . Further, we perform language-guided denoising to retain only top- $N_{salient}$ salient frames by filtering out the unmatched/redundant frames from each video.

Table 1: Four strategies for estimating relevance scores.

SimDot	Momentum	CrossMom	Collaborative
$R(j\|i)=\mathbf{{f}}^{e}_{i,j}\cdot\mathbf{{f}}^{l}_{i}$	$R(j\|i)=\mathbf{{f}}^{e}_{i,j}\cdot\mathbf{f}^{l}_{i}+\mathbf{\hat{f}}^{e}_{i,j}\cdot\mathbf{\hat{f}}^{l}_{i}$	$R(j\|i)=\mathbf{\hat{f}}^{e}_{i,j}\cdot\mathbf{{f}}^{l}_{i}+\mathbf{{f}}^{e}_{i,j}\cdot\mathbf{\hat{f}}^{l}_{i}$	$R(j\|i)=(\mathbf{{f}}^{e}_{i,j}+\mathbf{\hat{f}}^{e}_{i,j})\cdot(\mathbf{{f}}^{l}_{i}+\mathbf{\hat{f}}^{l}_{i})$

Since only video-level annotations are provided, we need to estimate the relevance scores $R$ automatically. As shown in Table 1, we introduce four strategies for estimating relevance scores. (1) SimDot prediction relies on the output of two separate encoders (i.e., frame encoder $f^{e}$ and language encoder $f^{l}$ ) to model the relevance score $R(j|i)$ by computing the dot product of the frame embedding $\mathbf{{f}}^{e}_{i,j}=f^{e}(E_{i,j})$ and the text embedding $\mathbf{{f}}^{l}_{i}=f^{l}(L_{i})$ . However, since video-text data is noisy, only utilizing single-modality encoders may result in incorrect salient frames. (2) Momentum prediction improves SimDot prediction by introducing the supervision of momentum encoders (i.e., momentum frame encoder $\hat{f}^{e}$ and momentum language encoder $\hat{f}^{l}$ ), where the momentum frame embedding $\mathbf{\hat{f}}^{e}_{i,j}=\hat{f}^{e}(E_{i,j})$ and the momentum text embedding $\mathbf{\hat{f}}^{l}_{i}=\hat{f}^{l}(L_{i})$ . (3) CrossMom prediction considers the frame-text alignment that is directly built on the interaction between one modality encoder and another modality’s momentum encoder. (4) Collaborative prediction combines Momentum prediction and CrossMom prediction for better performance.

Although the frame embeddings and text embeddings can be (roughly) aligned through applying video-text contrastive learning in Sec. 3.2, it is vital to precisely establish frame-text alignment for proposing/selecting salient frames. To this end, we introduce the MFCL module below.

3.4 Momentum Frame-Level MSL-Contrastive Learning (MFCL) Module

To dynamically filter out the unmatched/redundant frames, we propose to adopt frame-level contrastive learning to directly measure the relevance scores $R$ between video frames and paired text. However, video data often contains misaligned frame-text pairs. Simply applying standard NCE-based contrastive learning would force the misaligned frame-text pairs to be pulled closer, which inevitably has negative effect on learning high-quality frame-text representation. Inspired by MIL-NCE [34], we thus propose a Momentum Frame-Level Multiple Salient-instance Learning (MSL) Contrastive Learning (MFCL) module to assist in alleviating the noise problem. The core idea is to use the salient frames filtered by the SFP Mechanism in each video to form a set of positive candidate pairs, instead of considering each positive pair independently. In this work, we suppose that MFCL and SFP have mutual interdependence so that they can bring boost to each other during training.

Similar to MVCL, we additionally maintain a frame-level memory bank $\mathcal{M}^{e}=\{\mathbf{\hat{q}}^{e}_{j^{\prime}}\}_{{j^{\prime}}=1}^{N_{m}*N}$ to store frame features, where $N_{m}$ is the memory bank size, $N$ is the number of sampled frames per video, and $\mathbf{\hat{q}}^{e}_{j^{\prime}}$ is a stored frame feature vector.

Given each text description $L_{i}$ in mini-batch $\mathcal{B}$ , we select salient frames filtered by the SFP Mechanism in the paired video $V_{i}$ to form a set of positive candidate (frame-text) pairs $S_{i}$ and all frame samples in $\mathcal{M}^{e}$ to form the negative ones. We then define the text-to-frame contrastive loss as:

\mathcal{L}_{\text{T2E}}=-\frac{1}{|\mathcal{B}|}\sum_{(S_{i},L_{i})\in\mathcal{B}}\log\frac{\sum_{\mathbf{\hat{f}}^{e}_{ij}\in\mathbf{\hat{f}}^{s}_{i}}\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{f}}^{e}_{ij})/\tau)}{\sum_{\mathbf{\hat{f}}^{e}_{ij}\in\mathbf{\hat{f}}^{s}_{i}}\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{f}}^{e}_{ij})/\tau)+\sum_{\mathbf{\hat{q}}^{e}_{j^{\prime}}\in\mathcal{M}^{e}}\exp(\cos(\mathbf{f}^{l}_{i},\mathbf{\hat{q}}^{e}_{j^{\prime}})/\tau)},

(6)

where $S_{i}=\{E_{i,j}\}_{j=1}^{N}$ is the positive frame set of the video $V_{i}$ , $N$ is the frame sequence length of the video, $\mathbf{f}^{l}_{i}=f^{l}(L_{i})$ , and $\mathbf{\hat{f}}^{s}_{i}=\{\mathbf{\hat{f}}^{e}_{ij}\}_{j=1}^{N}=\{\hat{f}^{e}(E_{i,j})\}_{j=1}^{N}$ .

Similarly, given each positive frame set $S_{i}$ , we define the frame-to-text contrastive loss as:

\mathcal{L}_{\text{E2T}}=-\frac{1}{|\mathcal{B}|}\sum_{(S_{i},L_{i})\in\mathcal{B}}\log\frac{\sum_{\mathbf{f}^{e}_{ij}\in\mathbf{f}^{s}_{i}}\exp(\cos(\mathbf{f}^{e}_{ij},\mathbf{\hat{f}}^{l}_{i})/\tau)}{\sum_{\mathbf{f}^{e}_{ij}\in\mathbf{f}^{s}_{i}}\exp(\cos(\mathbf{f}^{e}_{ij},\mathbf{\hat{f}}^{l}_{i})/\tau)+\sum_{\mathbf{f}^{e}_{ij}\in\mathbf{f}^{s}_{i}}\sum_{\mathbf{\hat{q}}^{l}_{j^{\prime}}\in\mathcal{M}^{l}}\exp(\cos(\mathbf{f}^{e}_{ij},\mathbf{\hat{q}}^{l}_{j^{\prime}})/\tau)},

(7)

where $\mathbf{\hat{f}}^{l}_{i}=\hat{f}^{l}(L_{i})$ and $\mathbf{f}^{s}_{i}=\{\mathbf{f}^{e}_{ij}\}_{j=1}^{N}=\{f^{e}(E_{i,j})\}_{j=1}^{N}$ (text memory bank $\mathcal{M}^{l}=\{\mathbf{\hat{q}}^{l}_{j^{\prime}}\}_{{j^{\prime}}=1}^{N_{m}}$ is defined in Sec. 3.2). As a result, by combining the text-to-frame and frame-to-text contrastive losses, the objective function for MFCL is given by:

\mathcal{L}_{\text{MFCL}}=\mathcal{L}_{\text{E2T}}+\mathcal{L}_{\text{T2E}}.

(8)

3.5 Language-Guided Salient Frame Matching (LSFM) Module

After obtaining language-guided salient frames, we utilize a multi-modal cross-attention fusion Transformer (see Figure 2) to capture token-level semantic alignment between visual patches and words for better performance (see the design details of this Transformer in the supp. material).

Further, we take the [CLS] token embedding outputted by the multi-modal fusion Transformer as the joint representation of a frame-text pair ( $V_{i},L_{i}$ ), and deploy a fully-connected layer to predict the matched probability, which is similar to the sentence pair classification task in BERT’s pre-training phase. The matching loss is defined as:

\mathcal{L}_{\text{LSFM}}=-\mathbb{E}_{(\mathbf{E}_{i,j},\mathbf{L}_{i})\sim\mathcal{D}_{salient}}\log P(y_{i,j}|\mathbf{E}_{i,j},\mathbf{L}_{i}),

(9)

where $\mathbf{E}_{i,j}$ denotes $j$ -th frame feature of video $V_{i}$ , $\mathbf{L}_{i}$ denotes text feature, $\mathcal{D}_{salient}$ is the set of salient frame-text pairs obtained by applying the SFP mechanism to the mini-batch, and $y_{i,j}$ is the ground-truth matching label (0 or 1) of the frame-text pair $(\mathbf{E}_{i,j},\mathbf{L}_{i})$ . During inference, we use a mean pooling layer to aggregate all salient frame scores as the video-level prediction score.

Finally, by combining all the proposed modules for video-language modeling at three levels, we train our LGDN model via minimizing the total objective function:

\mathcal{L}_{\text{LGDN}}=\mathcal{L}_{\text{MVCL}}+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{LSFM}}.

(10)

4 Experiments

4.1 Datasets and Settings

Pre-Training Datasets. Due to the restricted computing resources, we follow COTS [31] to pre-train our LGDN on the pure image-text datasets. Our pre-training datasets consists of Conceptual Captions [41], SBU [38], VG [22] and MSCOCO [27], which contains 5.2 million image-text pairs. We additionally apply CC12M [3] (about 2 million URLs are now invalid) for better performance, which accumulates 15.2 million image-text pairs in total.

Downstream Datasets. We evaluate our proposed LGDN on four public video-text retrieval datasets: MSR-VTT [50], MSVD [4], DiDeMo [15], and VATEX [46]. To further demonstrate the general applicability of our LGDN, we also carry out experiments on a public video-question answering dataset: MSRVTT-QA [49]. We present the details of these downstream datasets as well as the evaluation metrics for downstream tasks in the supp. material.

Implementation Details. Following previous work [24], we sample $N=16$ frames per video: each video is equally split into 16 segments and one frame is randomly sampled from each segment. We empirically set the initial learning rate to 1e-5 and adopt AdamW [30] with a weight decay of 0.02 for 5 epochs. In the warm-up stage (first epoch), the model is trained to optimize Eq. (10) without applying SFP mechanism. We also set the other hyper-parameters uniformly as: salient frame numbers $N_{salient}=2$ , mini-batch size $|\mathcal{B}|=24$ , momentum hyper-parameter $m=0.99$ , temperature $\tau=0.07$ , and queue size $N_{m}=9,600$ . We adopt pre-trained BERT-Base as language encoder and ViT-Base [9] as vision encoder. More details are given in the supp. material.

Table 2: Ablation study for our full LGDN model. Retrieval results are reported on the MSR-VTT 1k-A test set. Local: only token-level alignment during inference. Global: only global alignment during inference. Ensemble: ensemble of global and local (token-level) alignment.

Method	Inference	Text-to-Video Retrieval				Video-to-Text Retrieval				R@SUM
Method	Inference	R@1	R@5	R@10	MdR	R@1	R@5	R@10	MdR	R@SUM
$\mathcal{L}_{\text{LSFM}}$ (w/o SFP)	Local	31.4	59.8	70.3	4.0	34.9	61.9	72.9	3.0	331.2
$\mathcal{L}_{\text{LSFM}}$ (w/o SFP) $+\mathcal{L}_{\text{MFCL}}$ (w/o MSL)	Local	32.2	58.6	70.0	3.0	34.7	61.7	73.3	3.0	330.5
$\mathcal{L}_{\text{LSFM}}$ (w/o SFP) $+\mathcal{L}_{\text{MFCL}}$	Local	33.2	60.2	71.0	3.0	34.7	62.5	73.6	3.0	335.2
$\mathcal{L}_{\text{LSFM}}$ (w/o SFP) $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$	Local	33.0	60.4	71.2	3.0	35.6	62.2	73.7	3.0	336.1
$\mathcal{L}_{\text{LSFM}}$ $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$	Local	35.3	65.0	75.3	3.0	36.3	65.0	76.0	3.0	352.9
$\mathcal{L}_{\text{LSFM}}$ $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$	Global	32.5	60.4	71.7	3.0	32.1	61.8	72.2	3.0	330.7
$\mathcal{L}_{\text{LSFM}}$ $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$	Ensemble	38.9	65.7	76.5	2.0	37.9	65.4	76.0	2.0	360.4

Evaluation Metrics. We adopt two widely-used metrics in cross-modal retrieval: Recall at K (R@K, K $=1,5,10$ ), and Median Rank (MdR) / Mean Rank (MnR). R@K means the percentage of correct matching in the K nearest points, and MdR / MnR measures the median / mean rank of target items in the retrieved ranking list. We also report two additional metrics named ‘R@Sum’ and ‘R@Mean’ in our ablation study, which sums/averages all recall metrics for overall evaluation. Following ClipBERT [24], we also report accuracy (Acc) in video-question answering task.

4.2 Ablation Study

In this subsection, we conduct comprehensive ablation study to investigate the contributions of different components of our full model. If not specifically indicated, we set $N=16$ for global alignment and $N_{salient}=2$ for token-level alignment as the default setting.

Effect of Value Change of $N_{salient}$ and $N$ . A common perspective for video/video-language understanding is that more frames per video bring better performance. We thus conduct experiments on the frame number used for token-level alignment in Figure 3(a-b). We sample $N=16$ frames from each video and evaluate different variants that use $N_{salient}\in\{1,2,3,4,8,16\}$ frames. Note that when $N_{salient}=16$ , sampling by our SFP degrades to w/o SFP. It can be observed that utilizing only $N_{salient}=\{2,3,4\}$ salient frames filtered by our SFP significantly outperforms utilizing all 16 extracted frames meanwhile enjoying the faster speed (see the green lines). This suggests that our SFP mechanism not only selects correct salient frames but also alleviates the noise problem. To investigate the influence of value change of $N$ on our LGDN, we evenly sample $N\in\{2,3,4,8\}$ frames per video and freeze $N_{salient}=\{1,2\}$ salient frames. The results in Figure 3(c) indicate that more extracted frames per video are beneficial to the token-level alignment in our LGDN model, as it provides larger candidate set for selecting salient frames. Meanwhile, when $N$ becomes larger (> 4), the performance tends to converge, further demonstrating the redundancy in the videos.

Contributions of Each Components. We further demonstrate the contributions of the three objective functions as well as the salient frame proposal (SFP) mechanism used in our full LGDN model in Table 2. We start with the objective function $\mathcal{L}_{\text{LSFM}}$ (w/o SFP), which means only applying matching loss in token-level alignment without using the SFP mechanism. It can be observed that: (1) $\mathcal{L}_{\text{MFCL}}$ (and $\mathcal{L}_{\text{MVCL}}$ ) combined with $\mathcal{L}_{\text{LSFM}}$ (w/o SFP) can bring improvements, suggesting that global alignment is beneficial to token-level alignment (during the training stage). (2) Simply applying the frame-level alignment may cause negative effect while combing with our MSL design brings better results. This demonstrates that our design of $\mathcal{L}_{\text{MFCL}}$ does help alleviate the noise problem. (3) When the SFP mechanism is added (see $\mathcal{L}_{\text{LSFM}}$ (w/o SFP) $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$ vs. $\mathcal{L}_{\text{LSFM}}$ $+\mathcal{L}_{\text{MFCL}}+\mathcal{L}_{\text{MVCL}}$ ), the performance is significantly improved, which clearly shows the effectiveness of our proposed SFP mechanism. (4) For the same trained full LGDN model, combining the global and token-level alignment during inference can bring further improvements. Note that our full LGDN still achieves the state-of-the-art on MSR-VTT even without considering global alignment during inference.

Table 3: Comparison to the state-of-the-arts for video-text retrieval on MSR-VTT. Extra Expert: methods utilized expert features (e.g., object, motion and OCR features). # PT Pairs: the number of pre-training pairs. ^† denotes that our LGDN is additionally pre-trained with CC12M [3].

Method	Extra	#PT Pairs	Text-to-Video Retrieval				Video-to-Text Retrieval
Method	Expert	#PT Pairs	R@1	R@5	R@10	MdR	R@1	R@5	R@10	MdR
Full Split:
HGR [5]		-	9.2	26.2	36.5	24.0	15.0	36.7	48.8	11.0
CE [29]	$\checkmark$	-	10.0	29.0	41.2	16.0	15.6	40.9	55.2	8.3
CMGSD [13]	✓	$>$ 100M	11.3	32.0	44.1	14.2	17.2	43.6	57.2	7.6
T2VLAD [45]	✓	$>$ 100M	12.7	34.8	47.1	12.0	20.7	48.9	62.1	6.0
LGDN (ours)		5.2M	22.9	46.0	56.8	7.0	41.8	65.2	74.6	2.0
LGDN^† (ours)		15.2M	27.5	51.7	61.9	5.0	50.2	73.9	82.3	1.0
7k-1k Split:
HERO [26]		$>$ 100M	16.8	43.4	57.7	-	-	-	-	-
UniVL [32]	$\checkmark$	$>$ 100M	21.2	49.6	63.1	6.0	-	-	-	-
ClipBERT [24]		5.6M	22.0	46.8	59.9	6.0	-	-	-	-
TACo [52]	$\checkmark$	$>$ 100M	24.8	52.1	64.5	5.0	-	-	-	-
LGDN (ours)		5.2M	34.3	62.5	72.2	3.0	34.7	60.8	70.4	3.0
LGDN^† (ours)		15.2M	39.8	65.2	77.0	2.0	39.2	66.4	76.1	3.0
1k-A Split:
MMT [12]	✓	$>$ 100M	26.6	57.1	69.6	4.0	27.0	57.5	69.7	3.7
Support Set [39]	✓	$>$ 100M	30.1	58.5	69.3	3.0	28.5	58.6	71.6	3.0
TACo [52]	✓	$>$ 100M	28.4	57.8	71.2	4.0
Frozen in Time [2]		5.5M	31.0	59.5	70.5	3.0	-	-	-	-
LGDN (ours)		5.2M	38.9	65.7	76.5	2.0	37.9	65.4	76.0	2.0
LGDN^† (ours)		15.2M	43.7	71.4	80.4	2.0	42.6	71.6	80.6	2.0

Table 4: Results on the VATEX test set.

Method	R@1	R@5	R@10	MdR
VSE [21]	28.0	64.3	76.9	3.0
VSE++ [10]	33.7	70.1	81.0	2.0
Dual [36]	31.1	67.4	78.9	3.0
HGR [5]	35.1	73.5	83.5	2.0
Support Set [39]	45.9	82.4	90.4	1.0
LGDN (ours)	57.1	87.5	93.6	1.0
LGDN^† (ours)	61.0	90.2	95.1	1.0

Table 5: Results on the MSVD test set.

Method	R@1	R@5	R@10	MdR
VSE++ [10]	15.4	39.6	53.0	9.0
Multi. Cues [36]	20.3	47.8	61.1	6.0
CE [29]	19.8	49.0	63.8	6.0
Support Set [39]	28.4	60.0	72.9	4.0
Frozen in Time [2]	33.7	64.7	76.3	3.0
LGDN (ours)	39.7	70.2	79.8	2.0
LGDN^† (ours)	43.2	73.3	82.4	2.0

Table 6: Results on the DiDeMo test set. ^∗ denotes using temporal labels of captions.

Method	R@1	R@5	R@10	MdR
S2VT [44]	11.9	33.6	-	13.0
FSE [53]	13.9	36.0	-	11.0
CE [29]	16.1	41.1	-	8.3
ClipBERT [24]^∗	20.4	48.0	60.8	6.0
Frozen in time [2]^∗	34.6	65.0	74.7	2.0
LGDN (ours)	44.1	71.9	82.3	2.0
LGDN^† (ours)	47.8	76.2	83.3	2.0

Table 7: Results on MSRVTT-QA. ^∗ denotes utilizing large-scale VideoQA datasets.

Method	#PT Pairs	Acc
Heterogeneous Memory [11]	-	33.0
HCRN [23]	-	35.6
SSML [1]	100M	35.1
ClipBERT [24]	5.6M	37.4
Just Ask^∗ [51]	69.0M	41.5
LGDN (ours)	5.2M	42.4
LGDN^† (ours)	15.2M	43.1

4.3 Comparison to the State-of-the-Arts

We first report the text-video retrieval results on MSR-VTT with three data partitions in Table 3. It can be observed that: (1) our LGDN outperforms all previous works by large margins. Particularly, as compared with the most recent model Frozen in Time [2], our LGDN achieves an improvement of 7.9% (38.9% vs. 31.0%) for Text-to-Video R@1 on the MSR-VTT 1k-A test set. (2) Our LGDN also outperforms methods utilizing extra modalities (e.g., motion and audio) or those pre-trained on extremely-large video data (e.g., HowTo100M). (3) When leveraging a much larger pre-training (image-text) dataset, our LGDN (marked with ^†) achieves significant improvements.

To demonstrate the robustness of our model, we also evaluate it on VATEX, MSVD, and Didemo in Tables 4.2–4.2, respectively. Due to limited space, only text-to-video retrieval is considered here. For VATEX (Table 4.2), our LGDN significantly outperforms the state-of-the-art method Support Set which is trained on an order of magnitude more data. Our LGDN still performs the best on MSVD (Table 4.2) and Didemo (Table 4.2). Particularly, in the Didemo dataset, each description is annotated with localization information, in other words, annotations may only be aligned with the localized moments, thus causing the noise problem as many methods utilize all frames as the input. Recent works exploit temporal labels of captions to alleviate the noise problem and achieve higher performance. However, even without considering this, our LGDN still largely outperforms the most recent method Frozen [2], further demonstrating the effectiveness of our LGDN.

To show the general applicability of our LGDN, we evaluate our LGDN on the VideoQA task in Table 4.2. Even without utilizing large-scale video datasets devoted to the VideoQA task, our LGDN outperforms all competitors, validating the effectiveness of our LGDN in VideoQA. In addition, to reveal the critical importance of solving the noise issue for video-language modeling, we directly apply the SFP mechanism to the latest model CLIP4Clip [33] in Table 8. We find that applying the SFP mechanism brings boost to Clip4CLIP. The ensemble mechanism further improves the results, indicating that the proposed SFP mechanism is complementary to the baseline.

Table 8: Further evaluation results by directly applying our SFP mechanism to Clip4CLIP [33] (re-implemented) for video-text retrieval on the MSR-VTT 1k-A test set. ^∗ denotes the ensemble results of Clip4CLIP and Clip4CLIP+SFP.

Method	Text-to-Video Retrieval					Video-to-Text Retrieval
Method	R@1	R@5	R@10	MdR	MnR	R@1	R@5	R@10	MdR	MnR
Clip4CLIP [33]	44.9	71.8	81.7	2.0	14.2	46.2	73.9	84.3	2.0	10.8
+ SFP	45.3	73.0	83.4	2.0	13.4	47.6	75.5	85.3	2.0	9.6
+ SFP^∗	47.2	73.4	83.9	2.0	13.0	48.1	76.7	86.1	2.0	9.3

Table 9: Results by applying SFP to different sampling techniques on the MSR-VTT 1kA test set.

Sampling	w\o SFP (R@SUM)			w\SFP (R@SUM)
Sampling	4 frames	8 frames	16 frames	4 frames	8 frames	16 frames
Random Sampling	164.7	169.5	172.8	168.0	174.3	179.4
Dense Uniform	166.5	171.3	174.3	173.1	179.4	180.6
Sparse Sampling	168.0	171.9	173.7	179.1	180.3	181.1

Table 10: Ablation study for relevance score estimator. Retrieval results are reported on the MSR-VTT 1k-A test set. Random / SFP:

N_{salient}

frames are sampled randomly/by our SFP mechanism. # Frames:

N_{salient}

N

for local / global alignment.

Method	Strategy	# Frames	Text-to-Video Retrieval				Video-to-Text Retrieval				R@SUM
Method	Strategy	# Frames	R@1	R@5	R@10	MdR	R@1	R@5	R@10	MdR	R@SUM
All	Random	16/16	35.7	63.8	74.2	3.0	35.4	63.8	74.9	3.0	347.8
Random	Random	2/16	34.1	62.1	73.4	3.0	34.3	61.6	74.0	3.0	339.5
SimDot	SFP	2/16	37.4	65.0	76.4	3.0	37.2	65.1	75.4	2.0	356.5
Momentum	SFP	2/16	38.1	65.8	76.4	2.0	37.9	65.4	75.9	2.0	359.5
CrossMom	SFP	2/16	38.4	65.4	76.5	2.0	37.9	65.3	76.2	2.0	359.7
Collaborative	SFP	2/16	38.9	65.7	76.5	2.0	37.9	65.4	76.0	2.0	360.4

4.4 Additional Results

Applying SFP to Different Frame Sampling Techniques. Note that our SFP mechanism must be combined with a frame sampling technique since we adopt a two-stage sampling strategy in this paper. Thus, we apply our SFP mechanism to three frame sampling techniques: Sparse Sampling, Random Sampling, and Dense Uniform (equally interval sampling). The obtained results on the MSR-VTT 1kA test set are provided in Table 9. It can be observed that our SFP significantly boosts different sampling strategies, further demonstrating the general applicability of our SFP mechanism.

Expansion of Relevance Score Estimator. In Sec. 3.3, we have proposed four relevance score estimators for the LSFM module. To find out which is the best, we present the ablation study results for different relevance score estimators in Table 10. We can see a large gap between SFP and random sampling (w/o SFP), directly demonstrating the effectiveness of the proposed SFP mechanism. Meanwhile, both Momentum and CrossMom outperform SimDot, suggesting that introducing momentum encoder is beneficial to relevance score estimation. Collaborative that combines Momentum and CrossMom generally leads to further improvements.

Model Capacity. We also provide the detailed comparison to other methods in terms of model capacity and R@SUM (on the MSR-VTT 1kA test set) in Table 11. It can be clearly seen that: (i) When fusion layers are not used (i.e., only global alignment is adopted), our LGDN (global) outperforms the state-of-the-art method Frozen in Time [2], but with much less model parameters. (ii) Our full LGDN performs much better than all the competitors, but its parameter number (215M) is still comparable to that of Frozen in Time (180M) and even significantly smaller than those of the other competitors. These observations suggest that the performance gains obtained by our LGDN is not due to utilizing more model parameters.

Table 11: The model capacity of different recent methods on the MSR-VTT 1kA test set.

Methods	Visual Encoder	Lingual Encoder	Fusion Layer	Total	R@SUM
TACo [52]	155M	110M	14M	279M	157.4
Support Set [39]	136M	220M	-	356M	157.9
Frozen in Time [2]	114M	66M	-	180M	161.0
LGDN (global)	93M	55M	-	148M	164.6
LGDN (ours)	93M	55M	68M	215M	181.1

4.5 Visualization Results

We provide visualization of our LGDN in Figure 4. We uniformly sample 5 frames from each video and provide relevance scores of 5 frames on the left, and the red ones denote salient frames selected by the SFP mechanism. It can be seen that: (1) Although the holistic video is semantically related to the paired text, there still exist noisy frames (e.g., the transition in Frame 1 and Frame 3 of Query7500) and unrelated frames (e.g., in Frame 4 and Frame 5 of Query7544, a man is rolling while the paired text is ‘a car goes racing down the road’). (2) The relevance scores obtained from the SFP mechanism correctly measure the consistency between each frame and the paired text, which indeed helps our LGDN to precisely filter out noisy information for better video-language modeling.

5 Conclusion

In this work, we propose a novel Language-Guided Denoising Network (LGDN) for video-language modeling, which can dynamically filter out the unmatched or redundant frames under the language supervision and thus maintain only 2–4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that our LGDN outperforms the state-of-the-arts by large margins. In the future, we will consider aggregating temporal information on salient frames and apply our approach to more challenging video-language tasks (e.g., video grounding).

Acknowledgments and Disclosure of Funding

This work was supported in part by National Natural Science Foundation of China (61976220 and 61832017), Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098), and the Research Seed Funds of School of Interdisciplinary Studies, Renmin University of China.

References

[1] Elad Amrani, Rami Ben-Ari, Daniel Rotman, and Alex M. Bronstein. Noise estimation using density estimation for self-supervised multimodal learning. In AAAI, pages 6644–6652, 2021.
[2] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.
[3] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021.
[4] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, pages 190–200, 2011.
[5] Shizhe Chen, Yida Zhao, Qin Jin, and Qi Wu. Fine-grained video-text retrieval with hierarchical graph reasoning. In CVPR, pages 10635–10644, 2020.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
[8] Jianfeng Dong, Xirong Li, and Cees GM Snoek. Predicting visual features from text for image and video caption retrieval. IEEE Transactions on Multimedia, 20(12):3377–3388, 2018.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: improving visual-semantic embeddings with hard negatives. In BMVC, page 12, 2018.
[11] Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, and Heng Huang. Heterogeneous memory enhanced multimodal attention model for video question answering. In CVPR, pages 1999–2007, 2019.
[12] Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In ECCV, pages 214–229, 2020.
[13] Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lu, Yong Zhu, and Xiao Tan. Improving video retrieval by adaptive margin. In SIGIR, pages 1359–1368, 2021.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9726–9735, 2020.
[15] Anne Lisa Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5804–5813, 2017.
[16] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
[17] Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561, 2021.
[18] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
[19] Weike Jin, Zhou Zhao, Pengcheng Zhang, Jieming Zhu, Xiuqiang He, and Yueting Zhuang. Hierarchical cross-modal graph consistency learning for video-text retrieval. In SIGIR, pages 1114–1124, 2021.
[20] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, pages 5583–5594, 2021.
[21] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[23] Thao Minh Le, Vuong Le, Svetha Venkatesh, and Truyen Tran. Hierarchical conditional relation networks for video question answering. In CVPR, pages 9972–9981, 2020.
[24] Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: ClipBERT for video-and-language learning via sparse sampling. In CVPR, pages 7331–7341, 2021.
[25] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, pages 9694–9705, 2021.
[26] Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. HERO: Hierarchical encoder for video+ language omni-representation pre-training. EMNLP, pages 2046–2065, 2020.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[28] Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, and Zhongyuan Wang. HiT: Hierarchical transformer with momentum contrast for video-text retrieval. In ICCV, pages 11915–11925, 2021.
[29] Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. Use what you have: Video retrieval using representations from collaborative experts. In BMVC, page 279, 2019.
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[31] Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, and Ji-Rong Wen. COTS: Collaborative two-stream vision-language pre-training model for cross-modal retrieval. In CVPR, pages 15692–15701, 2022.
[32] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, and Ming Zhou. UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[33] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. CLIP4Clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021.
[34] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, pages 9879–9889, 2020.
[35] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, pages 2630–2640, 2019.
[36] Niluthpol Chowdhury Mithun, Juncheng Li, Florian Metze, and Amit K Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR, pages 19–27, 2018.
[37] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[38] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, pages 1143–1151, 2011.
[39] Mandela Patrick, Po-Yao Huang, Yuki Markus Asano, Florian Metze, Alexander G. Hauptmann, João F. Henriques, and Andrea Vedaldi. Support-set bottlenecks for video-text representation learning. In ICLR, 2021.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[41] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
[42] Hao Tan and Mohit Bansal. LXMERT: learning cross-modality encoder representations from transformers. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, EMNLP-IJCNLP, pages 5099–5110, 2019.
[43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
[44] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond J. Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL-HLT, pages 1494–1504, 2015.
[45] Xiaohan Wang, Linchao Zhu, and Yi Yang. T2VLAD: global-local sequence alignment for text-video retrieval. In CVPR, pages 5079–5088, 2021.
[46] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, pages 4580–4590, 2019.
[47] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[48] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 318–335, 2018.
[49] Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACMMM, pages 1645–1653, 2017.
[50] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016.
[51] Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Just ask: Learning to answer questions from millions of narrated videos. In ICCV, pages 1666–1677, 2021.
[52] Jianwei Yang, Yonatan Bisk, and Jianfeng Gao. TACo: Token-aware cascade contrastive learning for video-text alignment. In ICCV, pages 11562–11572, 2021.
[53] Bowen Zhang, Hexiang Hu, and Fei Sha. Cross-modal and hierarchical modeling of video and text. In ECCV, pages 385–401, 2018.

Appendix A Appendix

A.1 Limitations and Potential Negative Societal Impacts

Limitations. The key idea of our LGDN is to propose the SFP mechanism to filter out noisy/redundant frames for fine-grained semantic alignment, along with MVCL for capturing global temporal information. In most downstream tasks, these two modules are complementary to each other. And we also observe that only a few salient frames (e.g., 2 ones) are enough for most downstream tasks, and thus we do not consider aggregating temporal information across salient frames. However, the SFP mechanism may need to be slightly changed when facing specified scenarios (e.g., long-term complicated videos over 30 minutes that highly rely on temporal information). On the one hand, we could adjust the weights between the two modules (MVCL and SFP) according to the situation. On the other hand, we can split the full video into several clips (e.g., 3 minutes per clip), apply our SFP mechanism on each clip, and obtain the salient frames from all clips. In this way, we could consider aggregating temporal information across salient frames.

Potential Negative Societal Impacts. Video-language learning, especially large-scale video-language modeling, has developed rapidly over the past few years and led to the greatest advance in search engines, video recommendation, and multimedia data management. Despite its effectiveness, existing video-language pre-training models still face possible risks. As these models often rely on a large amount of web data, they may acquire biases or prejudices (especially in search engines and recommendation systems), which must be properly addressed before model deploying.

A.2 More Implementation Details

Two-Stage Sampling Strategy. The sampling strategy of our LGDN has two stages as shown in Figure 5: (1) We first adopt sparse sampling to sample 16 frames from each video before feeding them into the LGDN, which is the same as ClipBERT. (2) We further utilize salient sampling (SFP) to select a few salient frames (from 16 frames per video) before fusion layers.

Details of Network Architecture. We adopt ViT-B/16 [9] as our frame encoder and the first 6 layers of BERT-base [7] as our text encoder. The dimensions of the output vectors of the frame and text tokens are both $N_{seq}\times 768$ , where $N_{seq}$ is the sequence length. For each frame/text, the final output vector of the [CLS] token is used as the frame/text embedding. We utilize one Transformer layer (with 768 hidden units and 12 heads) as the temporal module to aggregate the frame embeddings to obtain the video embedding. We then utilize a single fully-connected layer for each modality to project the frame/video/text embeddings to the joint cross-modal space. The final dimensions of the frame and text embeddings are 256. We further apply a 6-layer cross-attention Transformer with an additional cross-attention module in each layer as our multi-modal encoder as in LXMERT [42] and ALBEF [25], where the network parameters are initialized from the last 6 layers of BERT-base. Each layer of the cross-attention Transformer consists of a self-attention sub-layer, a cross-attention sub-layer, and a feed-forward sub-layer with 768 hidden units and 12 heads.

Downstream Datasets. (1) MSR-VTT [50] is a popular video-text dataset with three data partitions. The full split is the official partition which uses 6,513/497/2,990 videos for training/validation/testing. The 1k-A split is a widely-used partition with 9,000/1,000 videos for training/testing. Recent works also apply the 7k-1k split, which uses 7,000/1,000 videos for training/testing. All three data partitions are considered in our experiments. (2) MSVD [50] contains 80K descriptions for 1,970 videos from YouTube. Following Frozen in Time [2], we employ the standard split with 1,200 videos for training and 670 videos for testing. (3) DiDeMo [15] consists of 10K videos and 40K sentences. Each sentence includes the temporal localization information. Following Frozen in Time [2], we conduct the paragraph-to-video retrieval task, where all descriptions in the same video are concatenated into a single description. (4) VATEX [46] is composed of 34,911 videos. We use 25,991/1,500/1,500 videos for training/validation/testing, following the split in Support Set [39]. (5) MSRVTT-QA [49] is a widely-used video-question answering dataset. We employ the standard split as in ClipBERT [24].

Resources Used. It takes around 3 / 9 days to pre-train LGDN (5.2M / 15.2M) with 16 Tesla V100 GPUs. For each downstream task, it takes about 5-15 hours with 8 Tesla V100 GPUs.

A.3 More Experimental Results

Effect of SFP mechanism. To further demonstrating the effectiveness of SFP mechanism, we conduct experiments considering model variants that select $N_{salient}$ salient frames from $N=16$ frames by our SFP or just randomly select $N_{salient}$ frames (denoted as w/o SFP) for token-level alignment in Figure 6. We sample $N=16$ frames from each video and evaluate different model variants that use $N_{salient}\in\{1,2,3,4,8,16\}$ frames for token-level alignment. Note that when $N_{salient}=16$ , sampling by our SFP degrades to w/o SFP. As expected, when randomly selecting frames like most existing methods, adding more frames does bring better results. However, utilizing only $N_{salient}=\{2,3,4\}$ salient frames filtered by our SFP significantly outperforms random sampling (i.e., w/o SFP), and even outperforms utilizing all 16 extracted frames. This suggests that our SFP mechanism not only selects correct salient frames but also alleviates the noise problem.

We also present the experiment results on other public datasets in Figure 7. Though the best parameters $N_{salient}$ are not quite the same on different datasets ( $N_{salient}=2$ for MSRVTT, Dedimo; $N_{salient}=4$ for MSVD, VATEX, MSRVTT-QA), it can be observed that SFP mechanism significantly improves the baseline. Meanwhile, the performance changes among the three parameters ( $N_{salient}\in\{2,3,4\}$ ) are very marginal, which further verifies the robustness and effectiveness of our LGDN.

Table 12: Effect of value change of memory bank size on the performance of our LGDN model. Video-text retrieval results are reported on the MSR-VTT 1k-A test set.

Bank Size	Text-to-Video Retrieval			Video-to-Text Retrieval
Bank Size	R@1	R@5	MdR	R@1	R@5	MdR
1,200	36.0	65.2	3.0	36.2	64.8	3.0
2,400	36.9	64.6	3.0	37.0	64.8	2.0
4,800	37.7	65.7	3.0	37.8	65.0	2.0
9,600	38.9	65.7	2.0	37.9	65.4	2.0
19,200	38.3	64.9	3.0	37.5	64.3	3.0

Effect of Value Change of Memory Bank Size. In Table 12, we show the influence of different values of the memory bank size $N_{m}$ on the performance of our LGDN. With the increase of memory bank size $N_{m}$ , the performance of our LGDN begins with an increase, indicating that the introduction of large-scale negatives for contrastive learning indeed brings performance improvements. However, when $N_{m}$ becomes too large ( $>9,600$ ), the performance drops slightly. One possible reason is that too large memory bank size may introduce more hard negative samples, which makes it harder to learn a good vision-language representation. Our LGDN thus performs the best at $N_{m}=9,600$ .

Table 13: The speed and memory cost of different sampling strategies on the Deidemo test set.

Sampling Strategy

Speedup

Memory Cost

R@SUM

Sparse sampling

1.0x

183.0

Salient sampling (

N_{salient}=1

)

10.4x

0.60x

193.5

Salient sampling (

N_{salient}=2

)

6.5x

0.62x

198.3

Salient sampling (

N_{salient}=4

)

3.6x

0.68x

195.6

Speed and Memory Cost. We present the speed and memory cost on the Didemo test set in Table 13. For fair comparison, all experiments are conducted on 8 Tesla-V100 GPUs with mini-batch size 24. It can be seen that our salient sampling strategy is obviously faster and costs less memory, as compared with sparse sampling (utilizing all 16 frames for feature extraction and multi-modal fusion).

A.4 Visualization Results

We provide visualization examples in Figure 8. Figures 8(a)-(b) are the retrieved results by our LGDN and Figures 8(c)-(d) are the retrieved results by our model without using the SFP mechanism. We can see from Figures 8(a)-(b) that although the target videos have noisy frames (e.g., in the first frame of rank 1 video of Figure 8(a), the man is laughing; the last frame of rank 1 video of Figure 8(b) is the close up of the player who run into the crowd), LGDN precisely retrieves the ground-truth. In Figure 8(c), LGDN without using SFP also retrieves the corresponding video where the people are singing to the audience, however, in rank 1 video, there are a woman and a man both singing, which is incorrect to the query “A man”. In Figure 8(d), LGDN without using SFP only retrieves video corresponding to basketball and crowd. These examples indicate that the noisy information misleads cross-modal modeling and our LGDN with SFP can help to alleviate it.

Further, we provide more visualization results obtained by our LGDN in Figures 9-10. We uniformly sample 6 frames from each video, among which the red ones denote salient frames selected by the SFP mechanism. It can be clearly observed that: (1) Although the holistic video is semantically related to the paired text, there still exist noisy frames (e.g., the transition in Frame 3-6 of Query7466) and unrelated frames (e.g., Frame 4-6 of Query7468, Frame 1-5 of Query7586, Frame 2-3 of Query8069, and Frame 2-5 of Query8265). (2) The salient frames obtained from the SFP mechanism correctly represent the semantic information of the video given the paired text, which indeed helps our LGDN to precisely filter out noisy information for better video-language modeling.

LGDN: Language-Guided Denoising Network for Video-Language Modeling

Abstract

1 Introduction

2 Related Work

Video-Language Modeling.

Cross-Modal Alignment Objective Functions

3 Methodology

3.1 Feature Representation

Vision Representation.

Language Representation.

3.2 Momentum Video-Level Contrastive Learning (MVCL) Module

3.3 Salient Frame Proposal (SFP) Mechanism

3.4 Momentum Frame-Level MSL-Contrastive Learning (MFCL) Module

3.5 Language-Guided Salient Frame Matching (LSFM) Module

4 Experiments

4.1 Datasets and Settings

4.2 Ablation Study

4.3 Comparison to the State-of-the-Arts

4.4 Additional Results

4.5 Visualization Results

5 Conclusion

Acknowledgments and Disclosure of Funding

References

Appendix A Appendix

A.1 Limitations and Potential Negative Societal Impacts

A.2 More Implementation Details

A.3 More Experimental Results

A.4 Visualization Results

LGDN: Language-Guided Denoising Network
for Video-Language Modeling