AntM²C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction

Zhaoxin Huan, Ke Ding, Ang Li, Xiaolu Zhang, Xu Min, Yong He, Liang Zhang, Jun Zhou, Linjian Mo, Jinjie Gu, Zhongyi Liu, Wenliang Zhong, and Guannan Zhang
Ant Group, Hangzhou, China

Abstract.

Click-through rate (CTR) prediction is a crucial problem in recommendation systems, directly impacting user experience and platform revenue. In recent years, CTR prediction has garnered attention from both industry and academia, leading to the emergence of various public CTR datasets. However, existing CTR datasets suffer from the following limitations. First, users generally click different types of items from multiple scenarios, and modeling CTR across multiple scenarios can provide a more comprehensive understanding of users and share knowledge between scenarios. Existing datasets only include CTR data for the same type of items from a single scenario. Second, multi-modal features are essential in multi-scenario CTR prediction because they effectively address the issue of inconsistent ID encoding between different scenarios. Existing datasets are based on ID features and lack multi-modal features. Third, a large-scale CTR dataset can provide a more reliable and comprehensive evaluation of complex models, fully reflecting the performance differences between models. However, the scale of existing datasets is around 100 million samples, which is relatively small compared to real-world industrial CTR prediction. To address these limitations, we propose AntM²C, a Multi-Scenario Multi-Modal CTR dataset based on real industrial data from the Alipay platform. Specifically, AntM²C possesses the following characteristics: 1) It covers CTR data of 5 different types of items from Alipay, providing insights into the preferences of users for different items, including advertisements, vouchers, mini-programs, contents, and videos. 2) Apart from ID-based features, AntM²C also provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs. 3) AntM²C provides 1 billion CTR samples with 200 features, covering 200 million users and 6 million items. 
It is currently the largest-scale CTR dataset available, providing a reliable and comprehensive evaluation for CTR models. Based on AntM²C, we construct several typical CTR tasks, including multi-scenario modeling, item and user cold-start modeling, and multi-modal modeling. For each task, we provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.

Click-through rate prediction; Multi-Scenario; Multi-Modal

1. Introduction

Click-through rate (CTR) prediction plays a significant role in various domains, including online advertising, search engines, and recommendation systems. CTR prediction refers to the task of estimating the probability that a user will click on a given item. It is essential for optimizing ad revenue, enhancing user experience, and improving engagement. One of the challenging issues in CTR prediction lies in the faithful evaluation of the model. Public CTR datasets provide a standardized and benchmarked environment for evaluating the performance of different CTR models. This enables researchers to compare the effectiveness of different models and identify the most suitable ones for specific applications.

However, to meet the constantly growing demands of users, CTR scenarios and items are becoming increasingly diverse, and the amount of CTR data is also increasing. For example, in Alipay, CTR prediction is applied to consumer coupons in marketing campaigns, videos on the Tab3 page, and mini-programs after a search. As a result, existing CTR datasets suffer from the following limitations. First, in real-world industrial CTR prediction, users generally click various types of items from different business scenarios, reflecting their preferences for different items. For example, on Alipay, a user may browse a video about coffee on the Tab3 page, then click on a coffee coupon during a marketing campaign, and finally use the Alipay search to click a coffee ordering mini-program to place an order. Jointly modeling this multi-scenario CTR data can provide a more comprehensive understanding of user preferences, and the knowledge across scenarios can be shared to improve the CTR performance in each scenario. However, existing CTR datasets have a limited range of item types and generally originate from the same business scenario, which fails to capture the multi-scenario preferences of users. For example, Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction) only involve CTR data for advertisements. As e-commerce platforms, both Amazon (https://nijianmo.github.io/amazon/index.html) and AliExpress (https://tianchi.aliyun.com/dataset/74690) provide CTR data for their e-commerce items. Tenrec (Yuan et al., 2022) focuses more on video and article recommendations. Second, multi-modal features can address the issue of inconsistent IDs for similar items in different business scenarios and effectively establish a bridge between different scenarios. For example, a video about coffee and a coffee coupon have different IDs in different business scenarios. ID features alone cannot capture the relationship between these two items. Multi-modal features inherently carry semantic meaning and can better compensate for the inconsistency of ID features across different domains. Additionally, with the rise of large language models (LLMs), combining LLMs with CTR prediction has become an emerging research field. Existing CTR datasets are based on ID features and lack abundant multi-modal features, so CTR models cannot be evaluated in multi-scenario and multi-modal settings. Third, large-scale datasets can reliably and comprehensively reflect the performance of CTR models, while also highlighting the differences between CTR models. The existing datasets are typically at the scale of 100 million samples, which is insufficient to validate model capabilities in larger-scale industrial scenarios.

To address the aforementioned challenges, we propose the AntM²C dataset, a large-scale multi-scenario multi-modal dataset for CTR prediction. Compared with existing CTR datasets, AntM²C has the following advantages:

• Diverse business scenarios and item types: AntM²C contains different types of items from five typical business scenarios on the Alipay platform, including advertisements, vouchers, mini-programs, contents, and videos. Each business scenario has a unique data distribution. The abundant intersecting users and similar items between scenarios enable a more comprehensive evaluation for multi-scenario CTR modeling. Through one evaluation, the effectiveness of the CTR model can be evaluated in multiple business scenarios.
• Multi-modal feature system: AntM²C not only includes ID features but also provides rich multi-modal features such as text and image, which can establish connections between similar items across scenarios and provide better evaluation for multi-modal CTR models. Furthermore, the feature system in AntM²C includes up to 200 features, making it more closely aligned with real-world CTR prediction in industrial scenarios. (In the first release, AntM²C open-sourced 10 million samples, including 29 ID features and 2 text features. More data and image features will be gradually released in subsequent phases.)
• Largest data scale: AntM²C comprises 200 million users and 6 million items, with a total of 1 billion samples (10 million of which are open-sourced in the first release). The average number of interactions per user is above 50. To the best of our knowledge, AntM²C is the largest public CTR dataset in terms of scale, which can provide comprehensive and reliable CTR evaluation results.
• Comprehensive benchmark: Based on AntM²C, three typical CTR tasks have been built, including multi-scenario modeling, cold-start modeling, and multi-modal modeling. Benchmark evaluation results based on state-of-the-art models are also provided.

The rest of the paper is organized as follows. In Section 2, we briefly review some related works about public CTR datasets. In Section 3, we give a detailed introduction to the dataset collection and data analysis. In Section 4, we conduct empirical studies with baseline CTR methods on different CTR tasks.

Figure 1. An illustration of typical CTR prediction scenarios on the Alipay platform, including service/content search, marketing voucher, Tab3 video recommendation, and advertisement. Each scenario has different types of items, and users have different mindsets when browsing different scenarios.

2. Existing CTR Datasets

The existing public CTR datasets can be roughly divided into two categories: single-scenario and multi-scenario. Both have been widely adopted for evaluating CTR methods.

2.1. Single-Scenario CTR Datasets

The Criteo dataset is one of the publicly available datasets for CTR prediction. It contains over 45 million records of user interactions with advertisements, including features such as click-through rates, impression rates, and user demographics. Similar to the Criteo dataset, the Avazu dataset contains over 40 million records of user interactions with mobile advertisements. It includes features such as device information, app category, and user demographics. One of the main limitations of the Criteo and Avazu datasets is that they only include CTR data for advertisements and cannot be used to evaluate CTR for other business scenarios or types of items. Additionally, the datasets do not provide text information about the advertisements or users, which limits the scope of multi-modal modeling.

2.2. Multi-Scenario CTR Datasets

The AliExpress dataset is gathered from real-world traffic logs of the search system in AliExpress. It is collected from 5 countries: Russia, Spain, France, Netherlands, and America, which can be seen as 5 scenarios. It can be used to develop and evaluate CTR prediction models for e-commerce platforms. The Tenrec dataset is a multipurpose dataset for CTR prediction where click data was collected from two scenarios: articles and videos. Although the above datasets cover different scenarios, the items within these scenarios are similar. The AliExpress dataset only consists of e-commerce items, and Tenrec involves videos and articles that only reflect the personal interests of users in the entertainment and cultural aspects. Additionally, similar to single-scenario datasets, both of these datasets lack text features and only provide features such as IDs. This limitation restricts the application of multi-modal modeling.

3. Data Description

3.1. Data Collection

AntM²C’s data is collected from Alipay, a leading platform for payments and digital services. In order to meet the growing demands of users, Alipay recommends various types of items from different business scenarios to users.

3.1.1. Scenarios

AntM²C collects CTR data in five scenarios on Alipay, and the types of items differ between scenarios. As shown in Figure 1, CTR prediction occurs in multiple scenarios, including services and content on search, vouchers on marketing, videos on the Tab3 page, and advertisements on the membership page. In the search scenario, when a user enters search words, several relevant mini-apps of services or content are displayed for the user to click on. Marketing scenarios recommend consumer vouchers, and users click the coupons they are willing to use. On the Tab3 page, the recommended items are primarily short videos, and users click to watch the videos they are interested in. On the membership page, users may click on online advertisements. In conclusion, AntM²C includes various types of items from different business scenarios. In Section 3.2.2, we will show that there are differences in the data distribution of these different scenarios. The rich and diverse items provide a more comprehensive evaluation for CTR prediction.

Table 1. Data statistics of AntM²C. To protect user privacy, AntM²C anonymizes the scenario names as A-E. The click rate is calculated by dividing the number of clicks by the number of exposures. Since negative sampling is applied to the samples, the click rate may be higher than the actual value.

Scenario	Exposure	Users	Items	Click	Click Rate
A	3,996,614	93,465	112,098	147,656	3.69%
B	8,983,124	104,016	29,835	4,301,270	47.88%
C	1,211,813	96,689	6,408	68,566	5.66%
D	1,981,484	37,095	19,092	722,009	36.44%
E	955,162	17,904	18,265	102,671	10.75%
ALL	17,128,197	120,721	184,306	5,342,172	31.19%
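
The click-rate column can be recomputed from the exposure and click counts. The following is a minimal sketch, not part of the released tooling: it copies the counts from Table 1 (reading the click count of scenario B as 4,301,270) and applies the definition from the caption, clicks divided by exposures. The names `TABLE_1` and `click_rate` are illustrative.

```python
# Exposure and click counts copied from Table 1 (scenario -> (exposures, clicks)).
TABLE_1 = {
    "A": (3_996_614, 147_656),
    "B": (8_983_124, 4_301_270),
    "C": (1_211_813, 68_566),
    "D": (1_981_484, 722_009),
    "E": (955_162, 102_671),
}


def click_rate(exposures, clicks):
    """Click rate as defined in the Table 1 caption: clicks / exposures."""
    return clicks / exposures


# Per-scenario click rates.
rates = {scene: click_rate(e, c) for scene, (e, c) in TABLE_1.items()}

# Overall click rate over the pooled samples (the "ALL" row).
total_exposures = sum(e for e, _ in TABLE_1.values())
total_clicks = sum(c for _, c in TABLE_1.values())
overall_rate = click_rate(total_exposures, total_clicks)

for scene, rate in rates.items():
    print(f"{scene}: {rate:.2%}")
print(f"ALL: {overall_rate:.2%}")
```

Note that the overall rate is a pooled ratio, so it is dominated by scenario B, which contributes more than half of the exposures.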

3.1.2. Data Sampling

AntM²C collects 9 days (from 20230709 to 20230717) of CTR samples from the above-mentioned five scenarios and then filters out 1 billion samples of relatively high-activity users who have a total click count $\geq$ 30 across all scenarios. In the first stage of open sourcing, we randomly sampled 10 million samples from these 1 billion, and their statistical properties are shown in Table 1. We will release all 1 billion samples in the subsequent stage. To protect user privacy, we do not explicitly indicate the names of the scenarios in the dataset, but instead use the letters 'A-E' as substitutes.
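
The two sampling steps described above (keep users with at least 30 clicks across all scenarios, then draw a random subset) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: samples are represented as dictionaries with the `user_id`, `scene`, and `label` fields from Table 3, and the helper names and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def filter_active_users(samples, min_clicks=30):
    """Keep all samples of users whose total clicks across scenarios >= min_clicks."""
    clicks = Counter(s["user_id"] for s in samples if s["label"] == 1)
    return [s for s in samples if clicks[s["user_id"]] >= min_clicks]


def random_subsample(samples, k, seed=0):
    """Draw k samples uniformly at random without replacement (seed is arbitrary)."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))


# Toy data: u1 has 30 clicks spread over two scenarios, u2 has only 5 clicks.
toy = (
    [{"user_id": "u1", "scene": "A", "label": 1} for _ in range(20)]
    + [{"user_id": "u1", "scene": "B", "label": 1} for _ in range(10)]
    + [{"user_id": "u2", "scene": "A", "label": 1} for _ in range(5)]
    + [{"user_id": "u2", "scene": "C", "label": 0} for _ in range(40)]
)
active = filter_active_users(toy)
released = random_subsample(active, k=10)
print(len(active), len(released))
```

The click threshold is applied per user over all scenarios, so a user can qualify even if no single scenario contains 30 of their clicks.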

3.1.3. Data Desensitization

AntM²C does not contain any personally identifiable information (PII) and has been desensitized and encrypted. Each user in the dataset was de-linked from the production system when securely encoded into an anonymized ID. Adequate data protection measures were undertaken during the experiment to mitigate the risk of data copy leakage. It is important to note that the dataset is solely utilized for academic research purposes and does not represent any actual commercial use.

Table 2. Overlapped users across the five scenarios in AntM²C. AntM²C includes the preferences of the same user for items in different scenarios.

Scenario	A	B	C	D	E
A	-	90537	75227	19561	14937
B	-	-	83141	22721	15978
C	-	-	-	31704	17019
D	-	-	-	-	4788
E	-	-	-	-	-

3.2. Data Distribution

3.2.1. Data Overlapping

AntM²C contains a portion of overlapped users across the five scenarios. Table 2 shows the number of intersecting users among different scenarios, indicating that AntM²C can reflect the preferences of the same user for items in different scenarios to effectively conduct multi-scenario CTR evaluation. As for items, due to the significant diversity in item types among different scenarios, there is no intersection of items between different scenarios.
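
A pairwise overlap table like Table 2 can be derived from the samples by collecting the set of users seen in each scenario and intersecting the sets. The sketch below assumes dictionary-style samples with the `user_id` and `scene` fields from Table 3. The function name `user_overlap` is illustrative.

```python
from itertools import combinations


def user_overlap(samples):
    """Return {(scene_a, scene_b): number of users appearing in both scenarios}."""
    users_by_scene = {}
    for s in samples:
        users_by_scene.setdefault(s["scene"], set()).add(s["user_id"])
    return {
        (a, b): len(users_by_scene[a] & users_by_scene[b])
        for a, b in combinations(sorted(users_by_scene), 2)
    }


# Toy data: u1 appears in A and B, u2 in A, B, and C, u3 only in C.
toy_samples = [
    {"user_id": "u1", "scene": "A"},
    {"user_id": "u1", "scene": "B"},
    {"user_id": "u2", "scene": "A"},
    {"user_id": "u2", "scene": "B"},
    {"user_id": "u2", "scene": "C"},
    {"user_id": "u3", "scene": "C"},
]
overlap = user_overlap(toy_samples)
print(overlap)
```

Only the upper triangle is produced, matching the layout of Table 2.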

3.2.2. Item & User Frequency

Figure 2 illustrates the frequency of users and items in the AntM²C dataset, including all samples and samples from different scenarios (A-E). The horizontal axis represents the exposure frequency of users/items, while the vertical axis represents the number of users/items at that frequency. In terms of item distribution, all scenarios exhibit a long-tail distribution, with 80% of items appearing fewer than 5 times. This long-tail distribution is consistent with real-world situations. As for user distribution, there are differences between scenarios. In scenario B, the distribution of user frequency has two peaks, one below 5 exposures and the other around 50 exposures. Beyond a frequency of 50, the number of users decreases as the frequency increases. In the other scenarios, the exposure frequency of users follows a long-tail distribution similar to that of items, where higher exposure frequencies correspond to fewer users. Due to the overlapping users between scenarios, the long-tail distributions of users in individual scenarios combine into an approximately normal distribution in the global samples. Most users have an exposure frequency of around 50. Overall, the distribution of items and users in AntM²C reflects CTR prediction in practice.

Table 3. Features of AntM²C. In addition to ID features, AntM²C also includes the raw text features of users and items.

Category	Feature name	Description	Type	Coverage
User Features	user_id	user number	ID	100%
User Features	features_0-26	user sequences	ID	85.50%
User Features	query_entity_seq	search sequence	Text	90.32%
Item Features	item_id	item number	ID	100%
Item Features	item_entity_names	entity name of item	Text	100%
Item Features	item_title	title of item	Text	95.50%
Other Features	log_time	time in log	Text	100%
Other Features	scene	scenario number	ID	100%
Label	label	click label	Int	100%

3.3. Features

The feature system of AntM²C, as shown in Table 3, includes ID features of users and items, as well as raw text features.

3.3.1. User Features

The user features consist of static profile features and user sequence features (user static attributes and item titles will be open-sourced in subsequent phases). The static profile features include basic user attributes such as gender, age, occupation, etc. The sequence features provide the user's recent activities on Alipay, including clicked mini-apps, searched services, purchased items, etc. As mentioned in Section 3.1.3, these user features have been desensitized and encrypted for the purpose of user privacy protection and appear in the dataset in an encrypted ID format, making it impossible to reconstruct the original user features. In addition to the ID-based features, AntM²C also includes the raw text of user search entities to provide multi-modal evaluation.

3.3.2. Item Features

The item features consist of item ID and item textual features. The item ID is a globally unique identifier for each item, and the encoding of item IDs varies across different scenarios. To address the inconsistency of item IDs across scenarios, AntM²C also includes the original title text of the items (to be open-sourced in subsequent phases) and entities extracted based on the title text.

3.3.3. Other Features

In addition to user and item features, AntM²C also provides additional features such as log time and scene identification. Users can utilize these extra features to flexibly split the training, validation, and testing sets based on time and evaluate the performance in different scenarios.

3.3.4. Label

The label in AntM²C indicates whether the user clicked on the corresponding item. If the user clicked, the label is set to 1; otherwise it is set to 0. The ratio of positive to negative samples in AntM²C can be obtained from the click rate in Table 1. It should be noted that the actual online logs contain a large number of negative samples (samples that were exposed but not clicked). To address this imbalance, negative sampling was performed, which results in a higher click-through rate in the AntM²C dataset than in the actual online logs.
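
To see why negative sampling raises the click rate, consider a simple model of it. The sketch below is built on assumptions that the text does not confirm: all positives are kept and each negative is kept independently with probability `neg_keep_rate`. Under that assumption a true click rate p becomes p / (p + r(1 - p)) after sampling, and the mapping can be inverted to recover p. The sampling rate actually used for AntM²C is not given here, so the numbers are purely illustrative.

```python
def observed_ctr(true_ctr, neg_keep_rate):
    """Click rate after keeping every positive and a fraction of the negatives."""
    return true_ctr / (true_ctr + neg_keep_rate * (1.0 - true_ctr))


def recalibrate(observed, neg_keep_rate):
    """Invert observed_ctr: map a click rate on sampled data back to the original scale."""
    return neg_keep_rate * observed / (1.0 - observed + neg_keep_rate * observed)


# Illustrative values only: a 2% true click rate with 10% of negatives kept.
p_true = 0.02
p_sampled = observed_ctr(p_true, neg_keep_rate=0.1)
p_back = recalibrate(p_sampled, neg_keep_rate=0.1)
print(f"sampled: {p_sampled:.4f}, recovered: {p_back:.4f}")
```

Uniform negative downsampling of this kind changes calibration but preserves the ranking of samples in expectation, which is one reason AUC is a convenient metric on sampled data.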

4. Experimental Evaluation

In this section, we describe the applications of AntM²C in several CTR prediction tasks. We briefly introduce each task and report the results of some baseline methods. We select the commonly used AUC (Area Under the Curve) as the metric for all experiments. The baseline methods and evaluation results in the experiment provide a demo of using AntM²C. More baselines and evaluations will continue to be updated in future work.
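
For reference, AUC can be computed from the ranks of the positive samples (the Mann-Whitney formulation), and the per-scenario numbers reported below correspond to computing it separately on the samples of each scenario. The following is a minimal pure-Python sketch, not the evaluation code used in the experiments. The function names and the `(scene, label, score)` row layout are illustrative.

```python
def auc(labels, scores):
    """AUC via the rank-sum formula, with tied scores sharing their average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_scene(rows):
    """rows: iterable of (scene, label, score). Returns {scene: AUC}."""
    grouped = {}
    for scene, label, score in rows:
        labels, scores = grouped.setdefault(scene, ([], []))
        labels.append(label)
        scores.append(score)
    return {scene: auc(ls, ss) for scene, (ls, ss) in grouped.items()}


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
per_scene = auc_by_scene(
    [("A", 0, 0.1), ("A", 1, 0.9), ("B", 1, 0.2), ("B", 0, 0.7)]
)
print(example_auc, per_scene)
```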

4.1. Multi-Scenario CTR prediction

Multi-scenario CTR prediction is a common problem in industrial recommendation systems. It builds a unified model by leveraging CTR data from multiple scenarios. The knowledge sharing between scenarios enables the multi-scenario model to achieve better performance compared to single-scenario modeling. We conduct an evaluation on multi-scenario CTR prediction using different baseline methods based on the 5 scenarios in the AntM²C dataset.

Table 4. The distribution of training and testing data in multi-scenario CTR evaluation. The data is divided by time, and there are differences in the data volume between scenarios.

Scenario	Train Set	Test Set
A	3,499,645	496,969
B	7,890,222	1,092,901
C	1,059,578	151,670
D	1,802,707	178,777
E	846,791	104,359
Total	15,098,943	2,024,676

4.1.1. Data preprocess

In the multi-scenario CTR evaluation, we divide the AntM²C dataset based on time, using the data before 20230717 as the training set and the data on 20230717 as the test set. The training and test sets include samples from all five scenarios, and their data distribution is shown in Table 4. The number of training and test samples differs considerably among scenarios. Scenario B has the highest number of samples, roughly ten times that of Scenario E. In terms of features, we use the user and item features from the ID category as shown in Table 3. The text features will be used for multi-modal evaluation (see Section 4.3).
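
The time-based split can be sketched as below. The exact format of `log_time` is not documented in this section, so the code assumes, hypothetically, that it is a string beginning with the date as `YYYYMMDD`. Adjust the parsing if the released files use a different layout.

```python
def split_by_day(samples, test_day="20230717"):
    """Samples before test_day go to training, samples on test_day go to testing.

    Assumes (hypothetically) that log_time is a string starting with YYYYMMDD,
    so that the 8-character prefix compares correctly as text.
    """
    train = [s for s in samples if s["log_time"][:8] < test_day]
    test = [s for s in samples if s["log_time"][:8] == test_day]
    return train, test


toy_log = [
    {"user_id": "u1", "scene": "A", "label": 1, "log_time": "20230709 08:00:00"},
    {"user_id": "u1", "scene": "B", "label": 0, "log_time": "20230716 23:59:59"},
    {"user_id": "u2", "scene": "B", "label": 1, "log_time": "20230717 00:00:01"},
]
train_set, test_set = split_by_day(toy_log)
print(len(train_set), len(test_set))
```

Splitting by time rather than at random keeps future interactions out of the training set, which is closer to how a model is used online.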

4.1.2. Baselines and hyper-parameters

We mainly choose the multi-task methods as the baseline methods for multi-scenario CTR prediction. We treat the CTR estimation for each scenario as a task and share the knowledge among the scenarios at the bottom layer, with each scenario’s CTR score output at the tower layer. The baseline methods and hyperparameter settings are as follows:

• DNN: The DNN is trained on a mixture of samples from all scenarios without task-specific towers, serving as the baseline for multi-scenario CTR prediction. The DNN consists of three layers with 128, 32, and 2 units, respectively. The following multi-task models have the same number of layers and unit settings as the DNN.
• Shared Bottom (Ruder, 2017): Shared bottom is the most fundamental model in multi-task learning, where the knowledge is shared among the tasks at the bottom layer. Each task has its own independent tower layer and outputs the corresponding CTR score (https://github.com/shenweichen/DeepCTR).
• MMoE (Ma et al., 2018): Based on the shared bottom, MMoE introduces multiple expert networks that share a common input layer. Additionally, MMoE adds a gating network for each task that assigns different weights to each expert based on the input data to determine their influence on predicting the output for that task. In the experiment, we set the number of experts in MMoE to 6 (https://github.com/drawbridge/keras-mmoe).
• PLE (Tang et al., 2020): Based on MMoE, PLE further designs task-specific experts for each task, while retaining the shared expert. This structure allows the model to better learn the differences and correlations among tasks. We set the number of experts in PLE to be the same as MMoE, with each of the five scenarios having its own specific expert and one globally shared expert (https://github.com/shenweichen/DeepCTR).

All baseline methods utilized the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e-3 for parameter optimization. The models were trained for 5 epochs with a batch size of 512.
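
To make the MMoE structure concrete, the following sketch runs a forward pass with 6 experts and 5 scenario-specific gates and towers, mirroring the expert count stated above. It is a toy illustration, not the implementation used in the experiments: the weights are random and untrained, the layer sizes are deliberately tiny, and the class name `TinyMMoE` and all helper names are hypothetical.

```python
import math
import random


def linear(x, w, b):
    """Dense layer without activation; w has shape (out_dim, in_dim)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init(rng, out_dim, in_dim):
    """Small random weights and zero biases."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim


class TinyMMoE:
    def __init__(self, in_dim, expert_dim=8, n_experts=6, n_tasks=5, seed=0):
        rng = random.Random(seed)
        self.experts = [init(rng, expert_dim, in_dim) for _ in range(n_experts)]
        self.gates = [init(rng, n_experts, in_dim) for _ in range(n_tasks)]
        self.towers = [init(rng, 1, expert_dim) for _ in range(n_tasks)]

    def gate_weights(self, x, task):
        """Per-task softmax weights over the shared experts."""
        return softmax(linear(x, *self.gates[task]))

    def predict(self, x, task):
        """Click probability for input x in the given scenario (task index)."""
        expert_out = [relu(linear(x, w, b)) for w, b in self.experts]
        g = self.gate_weights(x, task)
        mixed = [
            sum(g[k] * expert_out[k][d] for k in range(len(g)))
            for d in range(len(expert_out[0]))
        ]
        logit = linear(mixed, *self.towers[task])[0]
        return 1.0 / (1.0 + math.exp(-logit))


model = TinyMMoE(in_dim=4)
x = [0.5, -1.0, 0.25, 2.0]
probs = [model.predict(x, task) for task in range(5)]
print(probs)
```

A shared-bottom model is the special case where a single shared network replaces the gated mixture. PLE differs by adding experts that only one task's gate can see.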

Table 5. Multi-scenario CTR evaluation on AntM²C. The table shows the AUC metric of the baseline methods in different scenarios.

Methods	Scenario A	Scenario B	Scenario C	Scenario D	Scenario E
DNN	0.7846	0.9328	0.8733	0.6880	0.8338
Shared Bottom	0.8039	0.9414	0.8798	0.6915	0.8525
MMoE	0.7986	0.9438	0.8751	0.6854	0.8519
PLE	0.8039	0.9429	0.8785	0.6903	0.8506

4.1.3. Results

Table 5 shows the evaluation results of different baseline methods on multi-scenario CTR prediction, from which we can draw the following conclusions. First, compared to the DNN model that trains on all data together without considering scenario characteristics, the multi-task models achieve better performance in nearly all cases (the one exception is MMoE in scenario D, 0.6854 versus 0.6880). This suggests that in AntM²C there are both differences and commonalities between scenarios, and simply mixing training data will not achieve the best results. Second, the CTR performance varies across scenarios, indicating different levels of difficulty. For example, in scenario B, where there is a large amount of data, the AUC is generally above 0.93, while in scenario D, the AUC is only around 0.69. The diverse business scenarios and items in AntM²C enable a more comprehensive evaluation of CTR models. Finally, the expert-structured MMoE and PLE outperform the shared bottom model in scenario B, while the shared bottom model matches or slightly exceeds them in the other scenarios. The gaps among the multi-task models are small (below 0.01 AUC), but they are consistent enough for AntM²C to reflect differences between models.

4.2. Cold-start CTR prediction

The cold-start problem is a challenging issue in recommendation systems, because high-quality CTR models must be trained from sparse user-item interaction data. Cold-start primarily involves two aspects: users and items. As shown in Figure 2, the AntM²C dataset exhibits a natural long-tail distribution in both users and items. Therefore, we conduct a comprehensive evaluation of cold-start baseline methods based on the AntM²C dataset.

4.2.1. Data preprocess

In cold-start CTR prediction, we split the dataset based on time, using data before 20230717 as the training set and data on 20230717 as the validation and test sets. Based on this data division, we simulated two common cold-start problems in practice: few-shot and zero-shot.

• Few-shot: users and items that appear in the training set with a count greater than 0 and less than $N$, meaning there is only a small amount of training data for these users and items. (The selection of the threshold $N$ can vary based on experiments, and we use 100 as an example for all experiments.)
• Zero-shot: users and items that have never appeared in the training set, indicating that either the user is visiting the scenario for the first time or the item has been launched and added to the scenario on the first day.

Table 6 shows the data distribution of the test set under cold-start CTR evaluation. By using this dataset division, we can comprehensively evaluate and compare the performance of CTR models on few-shot and zero-shot samples. For few-shot samples, we can observe the model’s performance with only a small amount of training data and evaluate the model’s generalization ability. For zero-shot samples, we can evaluate the model’s recommendation ability on samples that it has never seen before.
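
The few-shot and zero-shot definitions above translate directly into a rule on training-set counts. The sketch below tags each test sample by the number of times its user (or item) appears in the training set, using the threshold $N = 100$ mentioned as an example. The function names and the extra "warm" label for everything else are illustrative, not taken from the paper.

```python
from collections import Counter


def cold_start_category(train_count, n=100):
    """Zero-shot: never seen in training. Few-shot: seen, but fewer than n times."""
    if train_count == 0:
        return "zero-shot"
    if train_count < n:
        return "few-shot"
    return "warm"


def tag_test_samples(train, test, key="user_id", n=100):
    """Pair every test sample with the cold-start category of its user or item."""
    counts = Counter(s[key] for s in train)  # missing keys count as 0
    return [(s, cold_start_category(counts[s[key]], n)) for s in test]


toy_train = [{"user_id": "u_warm"}] * 100 + [{"user_id": "u_few"}] * 3
toy_test = [{"user_id": "u_warm"}, {"user_id": "u_few"}, {"user_id": "u_new"}]
tags = [category for _, category in tag_test_samples(toy_train, toy_test)]
print(tags)
```

Passing `key="item_id"` applies the same rule to items, which gives the item columns of Table 6.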

Table 6. Data statistics of cold-start CTR evaluation. "Zero-shot" means that the users and items have never appeared in the training set, while "few-shot" means that there are only a small number of samples of users and items in the training set.

Category	Cold-start users (count)	Cold-start users (samples)	Cold-start items (count)	Cold-start items (samples)
Few-Shot	67,110	685,774	30,315	306,964
Zero-Shot	65	2,752	14,230	121,447

4.2.2. Baselines and hyper-parameters

The key issue in cold-start modeling is how to learn user preferences and embeddings of users and items with limited data. In recent years, meta-learning-based cold-start methods have become state-of-the-art methods. We selected several representative methods with publicly available code as our baseline models.

• DropoutNet (Volkovs et al., 2017): DropoutNet is a popular cold-start method which applies dropout to control input, and exploits the average representations of interacted items/users to enhance the embeddings of users/items. We implemented the DropoutNet algorithm based on open-source code (https://github.com/layer6ai-labs/DropoutNet).
• MAML (Finn et al., 2017): The MAML algorithm is a popular meta-learning approach that aims to enable fast adaptation to new tasks with limited data. MAML learns a good initialization of model parameters that can be quickly adapted to new tasks. We treat each user and item as a task in MAML, and conduct meta-training on warm items. Then we perform meta-testing on cold-start items. The subsequent meta-learning-based algorithms also follow this task setting.
• MeLU (Lee et al., 2019): The MeLU algorithm is the first to apply MAML to address the cold-start problem in recommender systems. Building upon MAML, MeLU ensures the stability of the learning process by not updating the embeddings in the inner loop (support set). The hyperparameter settings in MeLU were determined based on the public code implementation (https://github.com/hoyeoplee/MeLU).
• MetaEmb (Pan et al., 2019): The MetaEmb algorithm also applies MAML to address the cold-start problem in recommender systems. Unlike MeLU, MetaEmb focuses on optimizing the embeddings of items. It learns an initial representation using all training samples and then quickly adapts the embeddings of cold-start items. We implemented the MetaEmb algorithm based on open-source code (https://github.com/Feiyang/MetaEmbedding). Although MetaEmb only optimizes the embeddings of items, we have also applied the same approach to optimize the embeddings of users.

These base models share the common embedding and DNN structure. The dimensionality of embedding vectors of each input field is fixed to 32 for all our experiments. The Adam optimizer with a learning rate of 1e-3 is used to optimize the model parameters, and the training is performed for 3 epochs with a batch size of 512. In addition to the aforementioned cold-start algorithms, the DNN (without any cold-start optimization) is also considered as the baseline method for cold-start CTR.

Table 7. Cold-start evaluation on AntM²C. The table shows the AUC metrics of cold start users and items in zero-shot and few-shot situations.

Methods	Item Zero-Shot	Item Few-Shot	User Zero-Shot	User Few-Shot
DNN	0.8021	0.8339	0.7931	0.9365
DropoutNet	0.8097	0.8498	0.7957	0.9387
MAML	0.8131	0.8511	0.8133	0.9393
MeLU	0.8197	0.8519	0.8103	0.9404
MetaEmb	0.8203	0.8583	0.8091	0.9399

4.2.3. Results

Table 7 shows the CTR performance for cold-start users and items. Because there is limited data for cold-start users and items, we do not calculate AUC by scenario, and instead evaluate the overall performance on cold-start users and items. From the table, we can observe several phenomena. First, compared to the results shown in Table 5, the AUC for cold-start users and items is generally lower than the overall level, which suggests that AntM²C's data can reflect the differences between cold and warm items and users. Second, different cold-start methods show distinguishable results on AntM²C, and all of them outperform the DNN model without cold-start optimization in every setting. This indicates that AntM²C can be used to compare the effects of different cold-start methods. Finally, the lower performance of zero-shot compared to few-shot indicates that zero-shot CTR prediction is more challenging than few-shot. The two cold-start modes provided by AntM²C allow a comprehensive evaluation of cold-start CTR prediction.

4.3. Multi-Modal CTR prediction

With the rise of large language models (LLMs), effectively transferring the knowledge of LLMs to CTR prediction has become a hot research topic. There have been many works (Sun et al., 2019; Geng et al., 2022; Hou et al., 2022; Penha and Hauff, 2020) on multi-modal CTR modeling using features such as item and user text. AntM²C contains raw text features for both users and items, which can provide a more comprehensive evaluation of multi-modal modeling compared to existing CTR datasets. Therefore, we conduct the evaluation of different multi-modal methods based on the AntM²C dataset.

4.3.1. Data preprocess

In multi-modal evaluation, we adopt the same data processing approach as in the multi-scenario evaluation described in Section 4.1.1, and additionally include the text features from Table 3: user query entities and item entities. The text features are used as inputs to the model together with the ID features.

4.3.2. Baselines and hyper-parameters

For the baseline model, we use a language model to process the text features, and then concatenate the text embedding with the ID features and input them into the multi-scenario model described in Section 4.1.2. For ease of evaluation, we choose MMoE as the backbone and pre-trained Bert-base (https://huggingface.co/docs/transformers/main/model_doc/bert) (Devlin et al., 2018) as the text embedding extractor. The output dimension of Bert's embeddings is 768. Then, a two-layer DNN with 768 and 32 units is used to reduce the dimension of Bert's embedding to 32. This reduced embedding is concatenated with the other features and input into the MMoE model. More powerful language models and further applications of the text features will be added in future work.
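
The dimension flow described above (768-dimensional text embedding, a two-layer reduction to 32 dimensions, then concatenation with the ID features) can be sketched as follows. This is a shape-level illustration only, with several assumptions. A random vector stands in for the Bert output so that no model download is needed, the weights are random and untrained, the ReLU after the first layer is assumed since the activation is not specified, and the 32-dimensional ID feature vector is a placeholder.

```python
import random


def dense(x, w, b, activation=None):
    """Dense layer; w has shape (out_dim, in_dim)."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]
    if activation == "relu":
        out = [max(0.0, v) for v in out]
    return out


def init(rng, out_dim, in_dim):
    """Small random weights and zero biases (untrained)."""
    w = [[rng.uniform(-0.05, 0.05) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim


rng = random.Random(0)
layer1 = init(rng, 768, 768)  # first reduction layer: 768 units
layer2 = init(rng, 32, 768)   # second reduction layer: 32 units


def reduce_text_embedding(text_emb):
    """Map a 768-dim text embedding to 32 dims (ReLU placement is an assumption)."""
    hidden = dense(text_emb, *layer1, activation="relu")
    return dense(hidden, *layer2)


def build_model_input(text_emb, id_feature_vector):
    """Concatenate the reduced text embedding with the ID feature vector."""
    return reduce_text_embedding(text_emb) + list(id_feature_vector)


fake_bert_output = [rng.gauss(0.0, 1.0) for _ in range(768)]  # stand-in for Bert
id_features = [0.0] * 32                                      # placeholder ID embedding
model_input = build_model_input(fake_bert_output, id_features)
print(len(model_input))
```

In a real pipeline the stand-in vector would be replaced by the pooled output of the pre-trained text encoder, and the reduction layers would be trained jointly with the MMoE backbone.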

Table 8. Multi-modal evaluation on AntM²C. The table reports the AUC in each scenario for the MMoE multi-scenario baseline with and without text features encoded by the Bert-base model.

Methods	Scenario A	Scenario B	Scenario C	Scenario D	Scenario E
MMoE	0.7986	0.9438	0.8751	0.6854	0.8519
MMoE+Bert	0.7951	0.9437	0.8851	0.6974	0.8642

4.3.3. Results

Table 8 shows the evaluation results of multi-modal CTR prediction. After adding the text modality, the AUC improves in the data-sparse scenarios C, D, and E compared with the ID-only MMoE, while scenarios A and B are marginally lower (by 0.0035 and 0.0001 AUC, respectively). Because the current text-modality baseline is relatively simple, the improvement is modest. Nevertheless, the results show the potential of the text modality provided in AntM²C to improve CTR performance.
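As a quick check of the comparison above, the per-scenario AUC differences can be computed directly from the values reported in Table 8. The variable names are illustrative; the numbers are copied from the table.

```python
# AUC values copied from Table 8.
mmoe = {"A": 0.7986, "B": 0.9438, "C": 0.8751, "D": 0.6854, "E": 0.8519}
mmoe_bert = {"A": 0.7951, "B": 0.9437, "C": 0.8851, "D": 0.6974, "E": 0.8642}

# Change in AUC after adding the text modality, rounded to table precision.
deltas = {s: round(mmoe_bert[s] - mmoe[s], 4) for s in mmoe}
improved = [s for s, d in deltas.items() if d > 0]

for scenario, delta in deltas.items():
    print(f"Scenario {scenario}: {delta:+.4f}")
print("Improved scenarios:", improved)
```

The differences are -0.0035 (A), -0.0001 (B), +0.0100 (C), +0.0120 (D), and +0.0123 (E), so the gains are confined to scenarios C, D, and E.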

5. Conclusion And Future Work

This paper introduces AntM²C, a large-scale Multi-Scenario Multi-Modal CTR prediction dataset. It includes 1 billion CTR samples from five business scenarios on the Alipay platform, and each sample contains multi-modal features in addition to ID features, enabling a comprehensive evaluation of CTR models. In the first release, we have made 10 million samples publicly available, and we will continue to release more data and features. At the same time, we will gradually evaluate more state-of-the-art baseline methods on AntM²C and provide comprehensive and solid evaluation results.

References

Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning. PMLR, 1126–1135.
Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073–1082.
Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
Pan et al. (2019) Feiyang Pan, Shuokai Li, Xiang Ao, Pingzhong Tang, and Qing He. 2019. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 695–704.
Penha and Hauff (2020) Gustavo Penha and Claudia Hauff. 2020. What does bert know about books, movies and music? probing bert for conversational recommendation. In Proceedings of the 14th ACM Conference on Recommender Systems. 388–397.
Ruder (2017) Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).
Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
Tang et al. (2020) Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems. 269–278.
Volkovs et al. (2017) Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. Dropoutnet: Addressing cold start in recommender systems. Advances in neural information processing systems 30 (2017).
Yuan et al. (2022) Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. 2022. Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems. Advances in Neural Information Processing Systems 35 (2022), 11480–11493.
