CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Abstract
Vision-Language Models pre-trained on large-scale image-text datasets have shown superior performance in downstream tasks such as image retrieval. Most of the images used for pre-training are presented in the form of open-domain, common-sense visual elements. Differently, video covers in short video search scenarios are presented as user-originated contents that provide important visual summaries of videos. In addition, a portion of the video covers come with manually designed cover texts that provide semantic complements. In order to fill in the gaps in short video cover data, we establish the first large-scale cover-text benchmark for Chinese short video search scenarios. Specifically, we release two large-scale datasets, CBVS-5M/10M, to provide short video covers, and the manually fine-labeled dataset CBVS-20K to provide real user queries, which serves as an image-text benchmark test in the Chinese short video search field. To integrate the semantics of cover text in the case of modality missing, we propose UniCLIP, where cover texts play a guiding role during training but are not relied upon during inference. Extensive evaluation on CBVS-20K demonstrates the excellent performance of our proposal. UniCLIP has been deployed to Tencent’s online video search systems with hundreds of millions of visits and achieved significant gains. The dataset and code are available at https://github.com/QQBrowserVideoSearch/CBVS-UniCLIP.
1 Introduction


CLIP Radford et al. (2021) demonstrates the promise of contrastive pre-training on large-scale image-text data from the web, with a data size of 400 million pairs. In that work, visual base models represented by ViT Dosovitskiy et al. (2020) are aligned with textual base models represented by BERT Vaswani et al. (2017); Devlin et al. (2018) by learning on large-scale unsupervised data. These base models can be transferred to downstream tasks such as image search Hendriksen et al. (2022) via natural language prompts Zhou et al. (2022).
In the field of Chinese multi-modal representation learning, previous work Yang et al. (2022); Xie et al. (2023); Gu et al. (2022) supplements high-quality Chinese image-text datasets and successfully pre-trains Chinese vision-language models. Most of the data are open-domain images collected from the web or reused from publicly available English datasets. These images are captured by cameras and presented in the form of common-sense visual elements, including animals, buildings, activities, etc., with corresponding descriptive text.
With the rise of short videos, video search has become a popular topic Spolaôr et al. (2020); Wray et al. (2021). Previous works Zhang et al. (2022b); Nie et al. (2022); Xu et al. (2023) create large-scale datasets for the Chinese short-video search domain and provide publicly available video frames or video features to support content-based search. However, cover-based search remains to be investigated. When creating short videos, creators craft video covers with the aim of attracting the interest of the most relevant viewers. Therefore, in short video search scenarios, short video covers provide direct overviews and serve as crucial visual features of the videos. Besides, cover-based search has efficiency advantages over content-based search.
However, there are remarkable morphological differences between short video cover images and open-domain images. As shown in Fig. 1, compared to open-domain visual elements, short video covers, as user-originated content, are mostly artificial combinations of various visual elements and may undergo post-processing such as cropping and splicing. On the other hand, many creators craft cover texts for video covers to complement or emphasize the semantic information of the video, a feature that open-domain images do not share. Therefore, short video cover images represent a different form of data from open-domain images, and the availability of a large-scale cover dataset is crucial; however, such datasets are currently lacking.
In this work, we release a large-scale Chinese image-text Benchmark for short Video Search scenarios (CBVS) to fill the gap of data in real Chinese video search scenarios. CBVS is designed in three versions: the manually fine-labeled CBVS-20K and the large-scale unsupervised CBVS-5M/10M. Fig. 2 shows their data examples. Specifically, CBVS-20K contains 20K high-quality user query-video cover pairs, which serves as an image-text benchmark test in the field of Chinese short video search. Well-trained human experts annotate the relevance of each user query to the video cover and at least two cross-validations are performed. In addition, Optical Character Recognition (OCR) texts of cover images are provided after machine extraction and human correction. Due to the constraints of user privacy and platform rules, the large-scale CBVS-5M/10M contains about 5M/10M video title-video cover pairs, where the text is provided in the form of video titles and OCR texts. These data are available for visual language models to learn modal alignment in pre-training or fine-tuning tasks. The CBVS dataset includes 32 categories such as Film and animation, Character, Education, Game, Commodity, etc. to avoid data distribution bias. Tab. 2 shows a detailed comparison of various versions.
In short video search scenarios, cover texts complement the semantics of cover images. On one hand, CLIP lacks the ability to fuse multiple semantic signals on the visual side. On the other hand, not all cover images come with cover texts, so the modality missing problem needs to be considered. In order to effectively integrate the semantics of cover images with cover texts, we propose UniCLIP, inspired by OCR-free work Kim et al. (2022); Davis et al. (2022). Cover text signals are unified to guide image-text contrastive learning in a presence-guided and semantic-guided manner. It is worth emphasizing that the inference process does not depend on any OCR-related module and the model is immune to the problem of missing cover-text modality. Extensive experimental evaluations demonstrate the effectiveness of our proposal.
Our contributions can be summarized as follows:
- To fill the lack of cover data for short video search scenarios, we release the largest Chinese cover image-text dataset, providing video title texts and cover texts.
- We build a manually fine-labeled image-text benchmark for Chinese short video search scenarios, containing real user queries from browser logs.
- We propose UniCLIP, which introduces an image classification task and an image-text matching task to guide image-text contrastive learning. UniCLIP imposes no additional inference cost, and its training is immune to the modality missing problem.
Dataset | Vision | Text | Source | Vision type | Text type | Availability |
---|---|---|---|---|---|---|
Chinese Video-Text Datasets | | | | | | |
VATEX | 41,269 | 825,380 | Kinetics-600 | Video | Caption | ✓ |
BFVD | 43,166 | 43,166 | E-Commerce | Feature | Title | ✓ |
FFVD | 32,763 | 32,763 | E-Commerce | Feature | Title | ✓ |
CREATE-210K | 216,303 | 268,593 | Open Websites | Video | Title, Caption | ✗ |
CREATE-3M | 3,000,000 | 3,000,000 | Open Websites | Video | Title | ✗ |
CREATE-10M | 10,000,000 | 10,000,000 | Open Websites | Video | Title | ✗ |
Kwai-SVC | 222,077 | 143,569 | Video Search | Feature | Title, OCR, ASR | ✗ |
Kwai-SVC-11M | 11,075,084 | 3,931,879 | Video Search | Video | Title, ASR | ✗ |
CNVid-3.5M | 3,508,120 | 3,508,120 | Open Websites | Video | ASR, Title | ✓ |
ALIVOL-10M | 10,300,000 | 11,000,000 | E-Commerce | Video, Image | Title, Abstract | ✗ |
Youku-mPLUG | 10,000,000 | 10,000,000 | Open Websites | Video | Title | ✓ |
Chinese Image-Text Datasets | | | | | | |
Wukong | 101,483,885 | 101,483,885 | Open Websites | Image | Caption | ✓ |
Wukong-Test | 33,365 | 33,365 | Open Websites | Image | Caption | ✓ |
Product1M | 1,182,083 | 1,182,083 | E-Commerce | Image | Caption | ✓ |
M6-Corpus | 60,500,000 | 60,500,000 | Open Websites | Image | Caption | ✗ |
ZERO-Corpus | 250,000,000 | 750,000,000 | Image Search | Image | Title, Content, Query | ✓ |
R2D2-ICR | 200,000 | 200,000 | Image Search | Image | Caption | ✓ |
R2D2-IQR | 200,000 | 200,000 | Image Search | Image | Query | ✓ |
CBVS-20K | 20,001 | 20,001 | Video Search | Cover Image | OCR, Query | ✓ |
CBVS-5M | 4,767,435 | 4,767,435 | Video Search | Cover Image | OCR, Title | ✓ |
CBVS-10M | 10,075,989 | 10,075,989 | Video Search | Cover Image | OCR, Title | ✓ |
2 Related Work
2.1 Chinese Video/Image-text benchmark
Compared to English multi-modal pre-training, the Chinese community lags behind. Yang et al. (2022) introduces translated versions of the English multi-modal datasets Chen et al. (2015); Krishna et al. (2017) to support Chinese multi-modal pre-training. Gu et al. (2022) releases Wukong, a large-scale Chinese dataset containing 100 million image-text pairs collected from the web, to bridge the language gap. Xie et al. (2023) further establishes large-scale Chinese cross-modal benchmarks by releasing two pre-training datasets and five fine-tuning datasets. Besides, Product1M Zhan et al. (2021) provides additions for the e-commerce domain. Zhang et al. (2022b); Nie et al. (2022); Xu et al. (2023) supplement Chinese video-text data, with visual modalities provided in the form of video frames. However, large-scale video cover data is scarce.
Version | #Pairs | Purpose | Annotated | Text type |
---|---|---|---|---|
CBVS-20K | 20,001 | Test | ✓ | Query, OCR |
CBVS-5M | 4,767,435 | Train | ✗ | Title, OCR |
CBVS-10M | 10,075,989 | Train | ✗ | Title, OCR |
2.2 Image-text Matching
The image-text matching task aims to measure the semantic similarity of different modalities in the same embedding space Abdullah and Rangarajan (2021). Existing implementations fall into two categories. The first is embedding-based, i.e., encoding global representations of vision and text separately and then computing their similarity Chen et al. (2021); Faghri et al. (2017); Qu et al. (2020). The second is score-based, i.e., performing cross-modal interactions locally and calculating cumulative scores Chen et al. (2020); Diao et al. (2021); Liu et al. (2020). Due to their advantages in performance and efficiency, embedding-based methods have attracted the attention of researchers Fu et al. (2023). In particular, CLIP Radford et al. (2021) provides new ideas for multi-modal representation learning, and a series of studies following CLIP have been applied to downstream tasks including image-text matching. For example, GLIP Li et al. (2022), CLIP-Adapter Zhang et al. (2021) and SLIP Mu et al. (2022) raise the upper performance limit of CLIP, while AltCLIP Chen et al. (2022), CN-CLIP Yang et al. (2022), and TaiyiCLIP Zhang et al. (2022a) extend the language coverage of CLIP.
3 CBVS Dataset

3.1 Comparison
A comparison with other Chinese image-text/video-text datasets is shown in Tab. 1. On the one hand, CBVS provides cover images and user queries, and is larger in size than publicly available video datasets such as VATEX Wang et al. (2019), BFVD/FFVD Zhang et al. (2020), and CNVid-3.5M Gan et al. (2023); datasets of similar scale, such as CREATE Zhang et al. (2022b), Kwai-SVC Nie et al. (2022), and ALIVOL-10M Lei et al. (2021), are access-restricted. On the other hand, compared to image-text datasets such as Wukong Gu et al. (2022), Product1M Zhan et al. (2021), M6-Corpus Lin et al. (2021), and ZERO-Corpus/R2D2 Xie et al. (2023), the biggest advantage of CBVS lies in the uniqueness of video cover images and the cover texts specific to them. To the best of our knowledge, CBVS is the largest publicly available Chinese image-text dataset providing cover images.
3.2 Data Collection
In order to provide real video search data, we capture the user query logs of QQ Browser and divide user queries into two parts. Using the first part of user queries, we retrieve more than 8M videos from the Chinese video website BiliBili (https://www.bilibili.com). To avoid data distribution bias from a single platform, we retrieve more than 5M videos from Tencent Video (https://v.qq.com) as a supplement in the same way, and finally obtain more than 13M cover-title pairs as the data source for CBVS-5M/10M. Besides, we manually select more than 2K high-quality user queries from the second part and collect 20K high-quality user query-cover image pairs in the same way, as the data source of CBVS-20K. The number of cover images under each user query is controlled within a fixed range.
We design the data cleaning procedure from two aspects: data quality and image-text relevance. First, we filter out video covers with low resolution or disproportionate aspect ratios, and eliminate dead links. After that, we score the relevance of video covers and titles in the 13M data with the open-source Chinese image-text model QA-CLIP (https://github.com/TencentARC-QQ/QA-CLIP), and filter out the lowest-scoring 3M pairs to obtain CBVS-10M. We randomly sample 1K pairs from the 13M and 10M data, respectively, and human experts evaluate whether each video cover is relevant to its title. The evaluation shows that relevance improves from 75.6% to 93.0% after data cleaning. Finally, CBVS-5M is obtained by sampling from CBVS-10M.
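For illustration, a minimal sketch of the relevance-scoring step is given below. It assumes QA-CLIP follows the Chinese-CLIP interface (load_from_name, encode_image, encode_text, tokenize); the file names and the keep-ratio are hypothetical placeholders rather than the exact production pipeline.

```python
import torch
from PIL import Image
import cn_clip.clip as clip
from cn_clip.clip import load_from_name  # QA-CLIP is assumed to expose the Chinese-CLIP API

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = load_from_name("ViT-B-16", device=device, download_root="./")
model.eval()

@torch.no_grad()
def cover_title_score(image_path: str, title: str) -> float:
    """Cosine similarity between a cover image and its video title."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([title]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Hypothetical cleaning loop: score all pairs, then drop the lowest-scoring tail.
pairs = [("cover_001.jpg", "视频标题示例"), ("cover_002.jpg", "另一个标题")]
scored = sorted(pairs, key=lambda p: cover_title_score(*p), reverse=True)
kept = scored[: int(len(scored) * 0.77)]  # 13M -> 10M roughly corresponds to keeping the top ~77%
```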
3.3 Data Annotation
The data annotation of CBVS-20K is performed by trained experts in the field of video search in a two-stage, cross-validated approach. First, they annotate whether the user query reveals a clear need for video and filter out queries related to pornography and violence; the queries with a clear video need are then taken as candidates for the second-stage annotation. After that, the annotators mark the degree of relevance for each user query-cover image pair, which is categorized into three grades: strongly relevant, weakly relevant, and irrelevant. The semantics of cover images and cover texts are required to be considered together. Meanwhile, the annotators correct the OCR texts extracted by the machine.
We exclude the controversial data, and end up with 20,001 image-text pairs consisting of 2,486 unique queries and 19,648 unique images. The percentage of strongly relevant, weakly relevant and irrelevant data is 29.74%, 30.80% and 39.46% respectively. The average length of user queries is 7.0 and the average length of OCR texts is 14.5. 33.41% of cover images come with OCR texts. Detailed data distribution is provided in the supplementary material.
4 Methodology
4.1 Image-Text Contrastive Learning
We follow CLIP Radford et al. (2021) to co-train the image encoder with the text encoder, taking the InfoNCE loss Oord et al. (2018) as the Image-Text Contrastive (ITC) loss $\mathcal{L}_{ITC}$, as shown in Fig. 3. Specifically, we maximize the similarity scores of matched image and text embeddings within each batch. We adopt ViT Dosovitskiy et al. (2020) and RoBERTa Liu et al. (2019) as the visual and textual backbones, respectively, and initialize their weights from QA-CLIP.
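For reference, a minimal PyTorch sketch of the symmetric InfoNCE objective used as $\mathcal{L}_{ITC}$ is shown below; the encoder interfaces and the handling of the learnable temperature are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
             logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over a batch: matched (i, i) pairs are positives,
    all other in-batch pairs serve as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits_per_image = logit_scale * image_emb @ text_emb.t()  # (B, B) similarity matrix
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_image.t(), labels)
    return 0.5 * (loss_i2t + loss_t2i)
```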
In particular, to bridge the gap between the data morphology of the title and the user query, we employ a Chinese word-segmentation component, Lexical Analysis of Chinese (LAC, https://github.com/baidu/lac). To simulate the morphological distribution of user queries, the segmentation result is joined into a string using spaces as the separator. If word segmentation fails, the original title is used. This setting takes effect for all fine-tuning tasks unless otherwise specified.
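A sketch of this title pre-processing is given below, assuming the LAC Python package from the repository above (its run method returns a word list in segmentation mode); the fallback mirrors the failed-segmentation case described above.

```python
from LAC import LAC  # pip install lac

lac = LAC(mode="seg")  # segmentation-only mode

def query_style_title(title: str) -> str:
    """Segment a video title and join the words with spaces to mimic
    the morphology of user queries; fall back to the raw title on failure."""
    try:
        words = lac.run(title)  # e.g. a title string is split into a list of Chinese words
        return " ".join(words) if words else title
    except Exception:
        return title
```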
4.2 Presence-guided Encoder
Video cover images differ from open domain images in that they partially carry cover texts. One option is to outsource the cover text understanding task to an external OCR engine and fuse the cover image with the cover text on the image side in an ALBEF Li et al. (2021) manner. However, the cost of image-text similarity inference becomes expensive. In addition, the image-text contrastive learning becomes highly dependent on the accuracy of the OCR engine, which creates an obstacle for model generalization.
Differently, inspired by Kim et al. (2022); Davis et al. (2022), we design UniCLIP in an OCR-free form. The idea is to guide the ViT to perceive cover texts during training through agent tasks, so that inference relies on no OCR-related module. Since the presence of cover text is uncertain, we propose the presence-guided encoder, where the first agent task of UniCLIP is set as an Image Classification (IC) task: "determine whether an image carries cover texts".
Specifically, as shown in Fig. 3, the presence-guided encoder takes the output tokens from the last layer of the ViT as input. These tokens pass through a 3-layer, 8-head Transformer Vaswani et al. (2017) structure and are then fed into an MLP layer to predict the presence or absence of cover texts. The loss of this branch is denoted as $\mathcal{L}_{IC}$.
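A minimal sketch of the presence-guided encoder under these settings is given below; the mean pooling over tokens and the MLP hidden size are assumptions not specified above.

```python
import torch
import torch.nn as nn

class PresenceGuidedEncoder(nn.Module):
    """Predicts whether a cover image carries cover text from the ViT output tokens."""
    def __init__(self, dim: int = 768, heads: int = 8, layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, vit_tokens: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, N, dim) output tokens of the last ViT layer
        hidden = self.encoder(vit_tokens)
        pooled = hidden.mean(dim=1)          # assumed mean pooling over tokens
        return self.mlp(pooled).squeeze(-1)  # logits for "has cover text"

# L_IC as binary cross-entropy against the OCR-presence label
ic_criterion = nn.BCEWithLogitsLoss()
```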
4.3 Semantic-guided Encoder
The presence-guided encoder directs the cover image encoder to focus on the cover texts, but does not involve semantic information. We further propose the semantic-guided encoder, which sets the second agent task as an Image-Text Matching (ITM) task: "determine whether the specified text is consistent with the text on the cover image", encouraging the ViT to incorporate gains from the semantics of the cover texts. This design is motivated by the notion that if the visual tokens from the ViT can be successfully employed to discriminate whether a given text is consistent with the cover text, they must contain the semantic information of the cover texts.
We design negative samples by nearest-neighbor lookup. Take training on the CBVS-5M dataset as an example. First, we adopt RoBERTa-wwm-Base as the encoder and load the checkpoints released by QA-CLIP to encode all 2.0M valid OCR texts. The Hierarchical Navigable Small World (HNSW) algorithm Malkov and Yashunin (2018) is applied to retrieve the Top-$K$ OCR texts that are most semantically similar but not identical to the anchor, and one of them is randomly selected as the negative sample. For covers without OCR texts, only negative samples are considered. For covers with OCR texts, the percentage of positive samples is set to 70%.
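The negative-sample lookup can be sketched with the hnswlib implementation of HNSW as follows; the embedding file name and helper function are illustrative, and the embeddings are assumed to be pre-computed with the RoBERTa encoder described above.

```python
import numpy as np
import hnswlib

# ocr_embeddings: (N, 512) float32, L2-normalized embeddings of the valid OCR texts
ocr_embeddings = np.load("ocr_text_embeddings.npy")  # hypothetical pre-computed file
num, dim = ocr_embeddings.shape

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num, ef_construction=200, M=16)
index.add_items(ocr_embeddings, np.arange(num))
index.set_ef(64)

def sample_hard_negative(anchor_id: int, k: int = 10) -> int:
    """Randomly pick one of the Top-K nearest OCR texts, excluding the anchor itself."""
    labels, _ = index.knn_query(ocr_embeddings[anchor_id], k=k + 1)
    candidates = [int(l) for l in labels[0] if int(l) != anchor_id][:k]
    return int(np.random.choice(candidates))
```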
The semantic-guided encoder accepts two inputs, i.e., the tokens from the last layer of the ViT and the embeddings of the positive or negative samples. As shown in Fig. 3, the module is a 3-layer structure, where each layer consists of a self-attention block, a cross-attention block and an MLP, with residual connections. The sample embeddings are updated layer by layer. The loss of this process is denoted as $\mathcal{L}_{ITM}$.
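A sketch of a single semantic-guided encoder layer is shown below; the pre-norm placement, hidden sizes and the omitted binary ITM head are assumptions.

```python
import torch
import torch.nn as nn

class SemanticGuidedLayer(nn.Module):
    """One of the three layers that update the sample-text embedding by
    attending to the ViT tokens of the cover image."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, T, dim) embeddings of the positive/negative OCR sample
        # vit_tokens: (B, N, dim) output tokens of the last ViT layer
        x = text_emb
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]   # self-attention + residual
        x = x + self.cross_attn(self.norm2(x), vit_tokens, vit_tokens)[0]        # cross-attention + residual
        x = x + self.mlp(self.norm3(x))                                          # MLP + residual
        return x

# L_ITM: binary consistent/inconsistent classification on the pooled output
# (final linear head and BCE loss omitted in this sketch).
```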
4.4 Training and Inference
Pre-training for UniCLIP starts with the checkpoints released by QA-CLIP. Fine-tuning is then performed on the CBVS-5M/10M dataset with a total loss of:
$$\mathcal{L} = \lambda_{1}\mathcal{L}_{ITC} + \lambda_{2}\mathcal{L}_{IC} + \lambda_{3}\mathcal{L}_{ITM} \tag{1}$$
where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are hyperparameters.
It is worth noting that the fine-tuning of UniCLIP relies on positive samples, negative samples and ground truths related to OCR texts, but the inference process does not rely on any OCR-related components. As shown in Fig. 3, the presence-guided encoder and semantic-guided encoder guide the training of the image-text alignment task in UniCLIP, but do not participate in the inference process. Therefore, UniCLIP infers in a manner consistent with CLIP.
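The resulting training/inference asymmetry can be summarized in the following sketch; the loss weights shown are placeholders for the hyperparameters of Eq. (1), not the exact values used in our experiments, and the encoder interfaces follow the CLIP-style assumptions made earlier.

```python
import torch
import torch.nn.functional as F

def total_loss(img_emb, txt_emb, logit_scale,
               ic_logits, ocr_present,
               itm_logits, itm_labels,
               lambdas=(0.8, 0.2, 0.2)):  # placeholder weights for Eq. (1)
    """Training objective: ITC plus the two OCR-guided agent losses."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    l_itc = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    l_ic = F.binary_cross_entropy_with_logits(ic_logits, ocr_present.float())   # L_IC, Sec. 4.2
    l_itm = F.binary_cross_entropy_with_logits(itm_logits, itm_labels.float())  # L_ITM, Sec. 4.3
    return lambdas[0] * l_itc + lambdas[1] * l_ic + lambdas[2] * l_itm

@torch.no_grad()
def clip_style_score(model, image_tensor, text_tokens):
    """Inference: only the image and text encoders are used, exactly as in CLIP."""
    i = F.normalize(model.encode_image(image_tensor), dim=-1)
    t = F.normalize(model.encode_text(text_tokens), dim=-1)
    return i @ t.T
```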
5 Experiments
Mode | Method | R@1 | R@5 | R@10 | MR | PNR | NDCG@1 | NDCG@5 | NDCG@10 | MAP |
---|---|---|---|---|---|---|---|---|---|---|
Zero-shot | CN-CLIPViT-B/16 | 0.384 | 0.628 | 0.704 | 0.572 | 2.718 | 0.768 | 0.835 | 0.885 | 0.764 |
Zero-shot | CN-CLIPViT-L/14 | 0.434 | 0.685 | 0.756 | 0.625 | 2.812 | 0.773 | 0.840 | 0.889 | 0.775 |
Zero-shot | WuKongViT-B/32 | 0.197 | 0.356 | 0.439 | 0.331 | 2.000 | 0.702 | 0.791 | 0.858 | 0.712 |
Zero-shot | WuKongViT-L/14 | 0.311 | 0.503 | 0.583 | 0.466 | 2.229 | 0.739 | 0.811 | 0.872 | 0.738 |
Zero-shot | Taiyi-CLIPViT-B | 0.251 | 0.445 | 0.525 | 0.407 | 2.142 | 0.718 | 0.800 | 0.861 | 0.727 |
Zero-shot | Taiyi-CLIPViT-L | 0.269 | 0.492 | 0.577 | 0.446 | 2.278 | 0.726 | 0.805 | 0.866 | 0.733 |
Zero-shot | Ernie-ViL2.0ViT-B | 0.413 | 0.660 | 0.742 | 0.605 | 2.759 | 0.764 | 0.835 | 0.886 | 0.768 |
Zero-shot | R2D2-23MViT-L/14 | 0.258 | 0.407 | 0.436 | 0.367 | 2.312 | 0.733 | 0.810 | 0.868 | 0.738 |
Zero-shot | R2D2-250MViT-L/14 | 0.356 | 0.512 | 0.535 | 0.468 | 2.829 | 0.789 | 0.842 | 0.891 | 0.775 |
Zero-shot | AltCLIPViT-L | 0.162 | 0.284 | 0.336 | 0.261 | 1.869 | 0.669 | 0.771 | 0.842 | 0.701 |
Zero-shot | QA-CLIPViT-B/16 | 0.400 | 0.652 | 0.724 | 0.592 | 2.804 | 0.774 | 0.838 | 0.888 | 0.770 |
Fine-tuning | CN-CLIPViT-B/16 | 0.471 | 0.721 | 0.796 | 0.663 | 2.914 | 0.767 | 0.838 | 0.888 | 0.767 |
Fine-tuning | R2D2-250MViT-L/14 | 0.418 | 0.605 | 0.650 | 0.558 | 2.934 | 0.780 | 0.844 | 0.891 | 0.774 |
Fine-tuning | QA-CLIPViT-B/16 | 0.473 | 0.711 | 0.783 | 0.656 | 2.907 | 0.778 | 0.841 | 0.890 | 0.771 |
Fine-tuning | ALBEF-CLIPViT-B/16 | 0.468 | 0.731 | 0.794 | 0.664 | 2.906 | 0.771 | 0.839 | 0.889 | 0.769 |
Fine-tuning | UniCLIPViT-B/16 | 0.503 | 0.754 | 0.820 | 0.692 | 3.069 | 0.784 | 0.846 | 0.893 | 0.779 |
5.1 Evaluation Metrics
5.1.1 Recall Metrics
Recall is a widely adopted retrieval metric. We report Recall@K (R@K) for K ∈ {1, 5, 10} and the Mean Recall (MR), i.e., the average of R@1, R@5 and R@10, of the models.
5.1.2 Rank Metrics
Positive-to-Negative Ratio (PNR) measures the consistency of predicted results with ground truth. Formally, PNR is defined as:
$$\mathrm{PNR} = \frac{\sum_{i,j \in D_q} \mathbb{1}(y_i > y_j)\cdot\mathbb{1}(\hat{y}_i > \hat{y}_j)}{\sum_{i,j \in D_q} \mathbb{1}(y_i > y_j)\cdot\mathbb{1}(\hat{y}_i < \hat{y}_j)} \tag{2}$$
where $\mathbb{1}(\cdot)$ is the indicator function, which equals 1 if the internal expression is true and 0 otherwise. $D_q$ represents the set of all documents under query $q$, and $y_i$, $\hat{y}_i$ are the ground-truth and predicted relevance labels between the query and the $i$-th cover image-text document, respectively. In particular, we compute PNR only for documents under the same user query.
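For concreteness, a small sketch of how PNR can be computed for the documents under a single query, following Eq. (2):

```python
from itertools import permutations
from typing import Sequence

def pnr(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Positive-to-Negative Ratio for the documents under one query:
    concordant ordered pairs divided by discordant ordered pairs."""
    pos = neg = 0
    for i, j in permutations(range(len(y_true)), 2):
        if y_true[i] > y_true[j]:
            if y_pred[i] > y_pred[j]:
                pos += 1
            elif y_pred[i] < y_pred[j]:
                neg += 1
    return pos / neg if neg else float("inf")

# Example: three covers under one query with relevance grades 2 > 1 > 0.
print(pnr([2, 1, 0], [0.9, 0.4, 0.6]))  # 2 concordant pairs / 1 discordant pair = 2.0
```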
In addition to PNR, we also report on the Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) metrics to fully evaluate our model.
5.2 Implementation Details
We follow the model architecture setup of OpenAI CLIP, with ViT-B/16 as the visual backbone of UniCLIP. The text encoder is the 12-layer RoBERTa-wwm-Base. Both are implemented as 12-layer, 12-head Transformers with a hidden dimension of 768 and are finally mapped linearly to 512 dimensions. The vocabulary of the text tokenizer is consistent with CN-CLIP. The visual and text encoders are initialized with the QA-CLIP weights as described in Sec. 4.4. In addition, the weights of the presence-guided encoder and semantic-guided encoder are randomly initialized from a normal distribution, and $K$ of the semantic-guided encoder (Sec. 4.3) is set to 10.
In the fine-tuning stage, we perform random resized cropping and AutoAugment Cubuk et al. (2019) on the input images. All parameters of the image encoder and text encoder are allowed to be updated. Training is executed on 8 NVIDIA A100 GPUs for 20 epochs with a learning rate of 2e-5. The maximum input length of the text encoder is 12 and the weight decay is 1e-3. We leverage all-gather communication across GPU workers to compute the contrastive loss on the global batch. The batch size is set to 1,520 due to GPU memory limitations. Mixed-precision training is activated. We save a checkpoint for each epoch and report the results of the checkpoint with the highest PNR on CBVS-20K. Due to computational resource constraints, we report the results of training on the CBVS-5M dataset, and we simultaneously release the CBVS-10M dataset for subsequent studies.
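The augmentation pipeline above roughly corresponds to the following torchvision sketch; the AutoAugment policy, crop size and normalization statistics are assumptions, as they are not specified here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),  # random resized cropping to the ViT-B/16 input size (assumed)
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),  # AutoAugment, Cubuk et al. (2019)
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),  # CLIP statistics (assumed)
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```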
5.3 Comparisons
To demonstrate the performance of our proposal and the value of the CBVS dataset, we extensively evaluate advanced Chinese image-text models on CBVS-20K. The competing models include CN-CLIP Yang et al. (2022), R2D2 Xie et al. (2023), Wukong Gu et al. (2022), TaiyiCLIP Zhang et al. (2022a), Ernie-ViL2.0 Shan et al. (2022) and AltCLIP Chen et al. (2022). The results are shown in Tab. 3.
The results show that WuKong, for example, achieves competitive performance on other public datasets but lacks image-text matching ability on CBVS-20K. The same conclusion holds for most of the other competitors, e.g., the MR of Taiyi-CLIPViT-B is only 0.407, which is much lower than UniCLIP’s 0.692. The generalization of models trained on large-scale open-domain images is generally poor on CBVS-20K, which demonstrates the domain uniqueness of video cover images: Vision-Language Models that perform well in the open domain do not transfer to the video cover domain as expected.
It is worth noting that although the recall metrics of R2D2-250MViT-L/14 lag significantly behind UniCLIP, its ranking metrics are close to those of UniCLIP; in particular, its NDCG@1 of 0.789 slightly outperforms UniCLIP. We attribute this to the fact that R2D2-250MViT-L/14 employs a more powerful visual architecture and enjoys a training corpus of 250M pairs. We encourage subsequent studies to incorporate CBVS-10M into the training corpus on the one hand, and to adopt ViT-L as the visual backbone on the other hand, to further improve UniCLIP performance.
Compared to the pre-trained QA-CLIP, fine-tuning on the CBVS-5M dataset comprehensively improves the metrics, in particular PNR by 3.67% and R@1 by 18.25%. When fine-tuning the publicly released CN-CLIPViT-B/16, we observe consistent findings, with a 7.21% improvement in PNR and a 22.66% improvement in R@1. When fine-tuning R2D2-250MViT-L/14, we observe significantly higher recall metrics as well as largely comparable rank metrics. These results demonstrate that fine-tuning on a large-scale cover dataset can improve model performance in the video search domain. In addition, UniCLIP achieves state-of-the-art performance with the highest metrics among its competitors and introduces no additional inference cost compared to the simple and efficient CN-CLIP.
$\mathcal{L}_{IC}$ | $\mathcal{L}_{ITM}$ | R@1 | R@5 | R@10 | MR | PNR | NDCG@1 | NDCG@5 | NDCG@10 | MAP |
---|---|---|---|---|---|---|---|---|---|---|
✗ | ✗ | 0.473 | 0.711 | 0.783 | 0.656 | 2.907 | 0.778 | 0.841 | 0.890 | 0.771 |
✓ | ✗ | 0.491 | 0.747 | 0.818 | 0.685 | 2.991 | 0.776 | 0.843 | 0.890 | 0.772 |
✗ | ✓ | 0.499 | 0.754 | 0.812 | 0.688 | 3.006 | 0.783 | 0.845 | 0.893 | 0.779 |
✓ | ✓ | 0.503 | 0.754 | 0.820 | 0.692 | 3.069 | 0.784 | 0.846 | 0.893 | 0.779 |
Model | Both w/ cover text (11.71%) | Both w/o cover text (46.51%) | Hybrid (41.78%) | All (100.00%) |
---|---|---|---|---|
QA-CLIPViT-B/16 | 3.203 | 2.722 | 2.975 | 2.877 |
ALBEF-CLIPViT-B/16 | 3.375 | 2.689 | 3.051 | 2.906 |
UniCLIPViT-B/16 | 3.331 | 2.904 | 3.194 | 3.069 |
5.4 Ablation Study
In order to evaluate the proposed presence-guided encoder and semantic-guided encoder, we implement versions of the model with and without $\mathcal{L}_{IC}$ and $\mathcal{L}_{ITM}$, respectively. The weights of $\mathcal{L}_{ITC}$ and $\mathcal{L}_{IC}$ (or $\mathcal{L}_{ITM}$) are set to 0.8 and 0.2. Tab. 4 shows the results of the ablation study of UniCLIP.
If both the presence-guided encoder and the semantic-guided encoder are removed and the cover text is discarded, the model degenerates into a fine-tuned version of QA-CLIP: the PNR is reduced from 3.069 to 2.907 and the MR from 0.692 to 0.656. In addition, removing either of the two encoders degrades model performance to varying degrees. Removing the presence-guided encoder reduces the PNR of UniCLIP by 2.05%, and removing the semantic-guided encoder reduces the PNR by 2.54%. Interestingly, when only the semantic-guided encoder is employed, the NDCG@1/5/10, MAP, and R@1/5 of the model are basically the same as in the final scheme, which indicates that the gain from cover text lies mainly in its semantic information. When both encoders are employed, i.e., the two proposed agent tasks are considered simultaneously, the highest metrics are achieved in all aspects. This result provides strong evidence for the validity of our proposal, and we encourage follow-up studies to further generalize our ideas.
5.5 Cover Text Capability Assessment
Since the training process of UniCLIP relies on the cover text modality, for a fair comparison we implement an explicit OCR-text fusion scheme, denoted as ALBEF-CLIP. Compared to CLIP, we replace the ViT with an ALBEF structure, where the cover image and the cover text go through their respective encoders before passing through a 6-layer attention-based fusion structure. For the case of missing OCR texts, a fixed prompt word is employed. The results of ALBEF-CLIP are shown in Tab. 3. When the cover text is introduced with ALBEF-CLIP rather than our proposal, the metrics are much lower than UniCLIP in all aspects. We hypothesize that this is because UniCLIP guides the semantic training of the ViT and handles the modality missing problem more consistently, reducing information confusion.
To further evaluate the cover text capability, we categorize the cover images in CBVS-20K into two categories according to the presence or absence of cover texts. Tab. 5 reports the PNR for the different combinations of the two categories within a document pair: both covers with cover texts, both without, and the hybrid case. Compared to the scheme without cover texts (QA-CLIP), ALBEF-CLIP significantly improves the matching ability for covers with cover texts, increasing the PNR from 3.203 to 3.375. However, for covers without cover texts, the scheme degrades performance, which may be due to the semantic confusion brought about by the prompt word.
In comparison, UniCLIP is basically comparable to ALBEF-CLIP for matching between covers with cover texts. This is in line with expectations, as we discard the cover text modality during inference; nevertheless, the very close results indicate the promise of UniCLIP. Moreover, UniCLIP performs much better than QA-CLIP and ALBEF-CLIP in both of the other cases. For matching between covers without cover texts, which is the most frequent case, UniCLIP’s PNR exceeds that of the fusion scheme by 8.00%, at a lower inference cost. For the hybrid cases, UniCLIP achieves a PNR of 3.194. This suggests that UniCLIP is able to overcome the modality missing problem to some extent and handle cover images with or without cover texts uniformly. Thanks to this, UniCLIP shows the best performance on the full range of data.
6 Conclusion
In this work, we release the largest publicly available Chinese video cover-video title dataset to fill the lack of cover data for short video search scenarios. We further build a manually fine-labeled video cover-user query benchmark for the short video search domain and propose UniCLIP, which unifies cover texts to guide contrastive learning; the image classification task and the image-text matching task are performed in an OCR-free manner. We believe in the ability of CBVS-5M/10M to expand the domain of large-scale Chinese image-text training and are pleasantly surprised to observe the base-model-independent potential of UniCLIP. However, significant room for exploration remains in balancing the video search domain with the generalized domain. We look forward to the extension of CBVS to downstream tasks such as title generation, as well as inspiration from UniCLIP for performing multi-modal fusion in the CLIP framework.
References
- Abdullah and Rangarajan [2021] Taghreed Abdullah and Lalitha Rangarajan. Image-text matching: Methods and challenges. Inventive Systems and Control: Proceedings of ICISC 2021, pages 213–222, 2021.
- Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Chen et al. [2020] Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12655–12663, 2020.
- Chen et al. [2021] Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, and Changhu Wang. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15789–15798, 2021.
- Chen et al. [2022] Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the language encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022.
- Cubuk et al. [2019] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019.
- Davis et al. [2022] Brian Davis, Bryan Morse, Brian Price, Chris Tensmeyer, Curtis Wigington, and Vlad Morariu. End-to-end document recognition and understanding with dessurt. In European Conference on Computer Vision, pages 280–296. Springer, 2022.
- Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Diao et al. [2021] Haiwen Diao, Ying Zhang, Lin Ma, and Huchuan Lu. Similarity reasoning and filtration for image-text matching. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 1218–1226, 2021.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Faghri et al. [2017] Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
- Fu et al. [2023] Zheren Fu, Zhendong Mao, Yan Song, and Yongdong Zhang. Learning semantic relationship among instances for image-text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15159–15168, 2023.
- Gan et al. [2023] Tian Gan, Qing Wang, Xingning Dong, Xiangyuan Ren, Liqiang Nie, and Qingpei Guo. Cnvid-3.5 m: Build, filter, and pre-train the large-scale public chinese video-text dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14815–14824, 2023.
- Gu et al. [2022] Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431, 2022.
- Hendriksen et al. [2022] Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne van Noord, Ernst Kuiper, and Maarten de Rijke. Extending clip for category-to-image retrieval in e-commerce. In European Conference on Information Retrieval, pages 289–303. Springer, 2022.
- Kim et al. [2022] Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. Ocr-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.
- Krishna et al. [2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- Lei et al. [2021] Chenyi Lei, Shixian Luo, Yong Liu, Wanggui He, Jiamang Wang, Guoxin Wang, Haihong Tang, Chunyan Miao, and Houqiang Li. Understanding chinese video and language via contrastive multimodal pre-training. In Proceedings of the 29th ACM International Conference on Multimedia, pages 2567–2576, 2021.
- Li et al. [2021] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
- Li et al. [2022] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- Lin et al. [2021] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, et al. M6: A chinese multimodal pretrainer. arXiv preprint arXiv:2103.00823, 2021.
- Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Liu et al. [2020] Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. Graph structured network for image-text matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10921–10930, 2020.
- Malkov and Yashunin [2018] Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4):824–836, 2018.
- Mu et al. [2022] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In European Conference on Computer Vision, pages 529–544. Springer, 2022.
- Nie et al. [2022] Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, and Alberto Del Bimbo. Search-oriented micro-video captioning. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3234–3243, 2022.
- Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Qu et al. [2020] Leigang Qu, Meng Liu, Da Cao, Liqiang Nie, and Qi Tian. Context-aware multi-view summarization network for image-text matching. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1047–1055, 2020.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Shan et al. [2022] Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. Ernie-vil 2.0: Multi-view contrastive learning for image-text pre-training. arXiv preprint arXiv:2209.15270, 2022.
- Spolaôr et al. [2020] Newton Spolaôr, Huei Diana Lee, Weber Shoity Resende Takaki, Leandro Augusto Ensina, Claudio Saddy Rodrigues Coy, and Feng Chung Wu. A systematic review on content-based video retrieval. Engineering Applications of Artificial Intelligence, 90:103557, 2020.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wang et al. [2019] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4581–4591, 2019.
- Wray et al. [2021] Michael Wray, Hazel Doughty, and Dima Damen. On semantic similarity in video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3650–3660, 2021.
- Xie et al. [2023] Chunyu Xie, Heng Cai, Jincheng Li, Fanjing Kong, Xiaoyu Wu, Jianfei Song, Henrique Morimitsu, Lin Yao, Dexin Wang, Xiangzheng Zhang, et al. Ccmb: A large-scale chinese cross-modal benchmark. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4219–4227, 2023.
- Xu et al. [2023] Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, et al. Youku-mplug: A 10 million large-scale chinese video-language dataset for pre-training and benchmarks. arXiv preprint arXiv:2306.04362, 2023.
- Yang et al. [2022] An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, and Chang Zhou. Chinese clip: Contrastive vision-language pretraining in chinese. arXiv preprint arXiv:2211.01335, 2022.
- Zhan et al. [2021] Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. Product1m: Towards weakly supervised instance-level product retrieval via cross-modal pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11782–11791, 2021.
- Zhang et al. [2020] Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, and Fei Wu. Poet: Product-oriented video captioner for e-commerce. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1292–1301, 2020.
- Zhang et al. [2021] Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- Zhang et al. [2022a] Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, et al. Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence. arXiv preprint arXiv:2209.02970, 2022.
- Zhang et al. [2022b] Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, and Weiming Hu. Create: A benchmark for chinese short video retrieval and title generation. arXiv preprint arXiv:2203.16763, 2022.
- Zhou et al. [2022] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.