AntM2C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction
Abstract.
Click-through rate (CTR) prediction is a crucial issue in recommendation systems, directly impacting user experience and platform revenue. In recent years, CTR prediction has garnered attention from both industry and academia, leading to the emergence of various public CTR datasets. However, existing CTR datasets suffer from the following limitations. First, users generally click different types of items from multiple scenarios, and modeling the CTR from multiple scenarios can provide a more comprehensive understanding of users and share knowledge between different scenarios. Existing datasets only include CTR data for the same type of items from a single scenario. Second, multi-modal features are essential in multi-scenario CTR prediction as they effectively address the issue of inconsistent ID encoding between different scenarios. Existing datasets are based on ID features and lack multi-modal features. Third, a large-scale CTR dataset can provide a more reliable and comprehensive evaluation of complex models, fully reflecting the performance differences between models. The scale of existing datasets is around 100 million, which is relatively small compared to real-world industrial CTR prediction. To address these limitations, we propose AntM2C, a Multi-Scenario Multi-Modal CTR dataset based on real industrial data from the Alipay platform. Specifically, AntM2C possesses the following characteristics: 1) It covers CTR data of 5 different types of items from Alipay, including advertisements, vouchers, mini-programs, contents, and videos, providing insights into the preferences of users for different items. 2) Apart from ID-based features, AntM2C also provides 2 multi-modal features, raw text and image features, which can effectively establish connections between items with different IDs. 3) AntM2C provides 1 billion CTR samples with 200 features, covering 200 million users and 6 million items.
It is currently the largest-scale CTR dataset available, providing a reliable and comprehensive evaluation for CTR models. Based on AntM2C, we construct several typical CTR tasks, including multi-scenario modeling, item and user cold-start modeling, and multi-modal modeling. For each task, we provide comparisons with baseline methods. The dataset homepage is available at https://www.atecup.cn/home.
1. Introduction
Click-through rate (CTR) prediction plays a significant role in various domains, including online advertising, search engines, and recommendation systems. CTR prediction refers to the task of estimating the probability that a user will click on a given item. It is essential for optimizing ad revenue, enhancing user experience, and improving engagement. One of the challenging issues in CTR prediction lies in the faithful evaluation of the model. Public CTR datasets provide a standardized and benchmarked environment for evaluating the performance of different CTR models. This enables researchers to compare the effectiveness of different models and identify the most suitable ones for specific applications.
However, to meet the constantly growing demands of users, CTR scenarios and items are becoming increasingly diverse, and the amount of CTR data is also increasing. For example, in Alipay, CTR prediction applies to consumer coupons in marketing campaigns, videos on the Tab3 page, and mini-programs after a search. As a result, the existing CTR datasets suffer from the following limitations. Firstly, in real-world industrial CTR prediction, users generally click various types of items from different business scenarios, reflecting their preferences for different items. For example, on Alipay, a user may browse a video about coffee on the Tab3 page, then click on a coffee coupon during a marketing campaign, and finally use the Alipay search to click a coffee ordering mini-program to place an order. Jointly modeling this multi-scenario CTR data can provide a more comprehensive understanding of user preferences, and knowledge can be shared across scenarios to improve the CTR performance in each scenario. However, existing CTR datasets have a limited range of item types and generally originate from the same business scenario, so they fail to capture the multi-scenario preferences of users. For example, Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge) and Avazu (https://www.kaggle.com/c/avazu-ctr-prediction) only involve CTR data for advertisements. As e-commerce platforms, both Amazon (https://nijianmo.github.io/amazon/index.html) and AliExpress (https://tianchi.aliyun.com/dataset/74690) provide CTR data for their e-commerce items. Tenrec (Yuan et al., 2022) focuses more on video and article recommendations. Secondly, multi-modal features can address the issue of inconsistent IDs for similar items in different business scenarios and effectively establish a bridge between different scenarios. For example, a video about coffee and a coffee coupon have different IDs in different business scenarios.
A model that relies directly on ID features cannot perceive the relationship between these two items. Multi-modal features inherently carry semantic meaning and can better compensate for the inconsistency of ID features across different domains. Additionally, with the rise of large language models (LLMs), combining LLMs with CTR prediction has become an emerging research field. Existing CTR datasets are based on ID features and lack abundant multi-modal features, so CTR models cannot be evaluated in multi-scenario and multi-modal settings. Thirdly, large-scale datasets can reliably and comprehensively reflect the performance of CTR models, while also highlighting the differences between CTR models. The existing datasets are typically at the scale of 100 million, which is insufficient to validate model capabilities in larger-scale industrial scenarios.
To address the aforementioned challenges, we propose the AntM2C dataset, a large-scale multi-scenario multi-modal dataset for CTR prediction. Compared with existing CTR datasets, AntM2C has the following advantages:
• Diverse business scenarios and item types: AntM2C contains different types of items from five typical business scenarios on the Alipay platform, including advertisements, vouchers, mini-programs, contents, and videos. Each business scenario has a unique data distribution. The abundant intersecting users and similar items between scenarios enable a more comprehensive evaluation for multi-scenario CTR modeling. Through one evaluation, the effectiveness of the CTR model can be evaluated in multiple business scenarios.
• Multi-modal feature system: AntM2C not only includes ID features but also provides rich multi-modal features such as text and image, which can establish connections between similar items across scenarios and provide better evaluation for multi-modal CTR models. Furthermore, the feature system in AntM2C includes up to 200 features, making it more closely aligned with real-world CTR prediction in industrial scenarios. (In the first release, AntM2C open-sourced 10 million samples, including 29 ID features and 2 text features. More data and image features will be gradually released in subsequent phases.)
• Largest data scale: AntM2C comprises 200 million users and 6 million items, with a total of 1 billion samples (see the note on the first release above). The average number of interactions per user is above 50. To the best of our knowledge, AntM2C is the largest public CTR dataset in terms of scale, which can provide comprehensive and reliable CTR evaluation results.
• Comprehensive benchmark: Based on AntM2C, three typical CTR tasks have been built, including multi-scenario modeling, cold-start modeling, and multi-modal modeling. Benchmark evaluation results based on state-of-the-art models are also provided.
The rest of the paper is organized as follows. In Section 2, we briefly review some related works about public CTR datasets. In Section 3, we give a detailed introduction to the dataset collection and data analysis. In Section 4, we conduct empirical studies with baseline CTR methods on different CTR tasks.

2. Existing CTR Datasets
The existing public CTR datasets can be roughly divided into two categories: single-scenario and multi-scenario. Both have been widely adopted for evaluating CTR methods.
2.1. Single-Scenario CTR Datasets
The Criteo dataset is one of the publicly available datasets for CTR prediction. It contains over 45 million records of user interactions with advertisements, including features such as click-through rates, impression rates, and user demographics. Similar to the Criteo dataset, the Avazu dataset contains over 40 million records of user interactions with mobile advertisements. It includes features such as device information, app category, and user demographics. One of the main limitations of the Criteo and Avazu datasets is that they only include CTR data for advertisements and cannot be used to evaluate CTR for other business scenarios or types of items. Additionally, the datasets do not provide text information about the advertisement or user, which limits the scope of multi-modal modeling.
2.2. Multi-Scenario CTR Datasets
The AliExpress dataset is gathered from real-world traffic logs of the search system in AliExpress. This dataset is collected from 5 countries: Russia, Spain, France, the Netherlands, and America, which can be seen as 5 scenarios. It can be used to develop and evaluate CTR prediction models for e-commerce platforms. The Tenrec dataset is a multipurpose dataset for CTR prediction where click data was collected from two scenarios: articles and videos. Although the above datasets cover different scenarios, the items within these scenarios are similar. The AliExpress dataset only consists of e-commerce items, and Tenrec involves videos and articles that only reflect the personal interests of users in entertainment and culture. Additionally, similar to single-scenario datasets, both of these datasets lack textual information and only provide features such as IDs. This limitation restricts the application of multi-modal modeling.
3. Data Description
3.1. Data Collection
AntM2C’s data is collected from Alipay, a leading platform for payments and digital services. In order to meet the growing demands of users, Alipay recommends various types of items from different business scenarios to users.
3.1.1. Scenarios
AntM2C collects CTR data in five scenarios on Alipay, and the types of items differ between scenarios. As shown in Figure 1, the CTR prediction occurs in multiple scenarios, including services and content on search, vouchers on marketing, videos on the Tab3 page, and advertisements on the membership page. In the search scenario, when a user enters search words, several relevant mini-apps of services or content are displayed for the user to click on. Marketing scenarios recommend consumer vouchers, and users click the coupons they are willing to use. On the Tab3 page, the recommended items are primarily short videos, and users click to watch the videos they are interested in. On the membership page, users may click on online advertisements. In summary, AntM2C includes various types of items from different business scenarios. In Section 3.2.2, we will show that there are differences in the data distribution of these different scenarios. The rich and diverse items provide a more comprehensive evaluation for CTR prediction.
Scenario | Exposure | Users | Items | Click | Click Rate |
A | 3,996,614 | 93,465 | 112,098 | 147,656 | 3.69% |
B | 8,983,124 | 104,016 | 29,835 | 4,301,270 | 47.88% |
C | 1,211,813 | 96,689 | 6,408 | 68,566 | 5.66% |
D | 1,981,484 | 37,095 | 19,092 | 722,009 | 36.44% |
E | 955,162 | 17,904 | 18,265 | 102,671 | 10.75% |
ALL | 17,128,197 | 120,721 | 184,306 | 5,342,172 | 31.19% |
3.1.2. Data Sampling
AntM2C collects 9 days (from 20230709 to 20230717) of CTR samples from the above-mentioned five scenarios and then retains 1 billion samples from relatively high-activity users, selected by a total click count threshold of 30 across all scenarios. In the first stage of open sourcing, we randomly sampled 10 million samples from these 1 billion samples, and their statistical properties are shown in Table 1. We will open all 1 billion samples in the subsequent stage. For the purpose of protecting user privacy, we do not explicitly indicate the names of the scenarios in the dataset, but instead use the letters 'A' to 'E' as substitutes.
3.1.3. Data Desensitization
AntM2C does not contain any Personally Identifiable Information (PII) and has been desensitized and encrypted. Each user in the dataset was de-linked from the production system when securely encoded into an anonymized ID. Adequate data protection measures were undertaken during the experiment to mitigate the risk of data copy leakage. It is important to note that the dataset is solely utilized for academic research purposes and does not represent any actual commercial use.
Scenario | A | B | C | D | E |
A | - | 90537 | 75227 | 19561 | 14937 |
B | - | - | 83141 | 22721 | 15978 |
C | - | - | - | 31704 | 17019 |
D | - | - | - | - | 4788 |
E | - | - | - | - | - |
3.2. Data Distribution
3.2.1. Data Overlapping
AntM2C contains a portion of overlapping users across the five scenarios. Table 2 shows the number of intersecting users for each pair of scenarios, indicating that AntM2C can reflect the preferences of the same user for items in different scenarios and therefore supports multi-scenario CTR evaluation. As for items, due to the significant diversity in item types among different scenarios, there is no intersection of items between different scenarios.
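The pairwise counts in Table 2 amount to intersecting the per-scenario user sets. The following is a minimal sketch, assuming each sample has been reduced to a (user_id, scene) pair using the field names from Table 3; the helper name `pairwise_user_overlap` and the toy rows are illustrative and are not part of the released dataset or tooling.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_user_overlap(samples):
    """Count the users shared by each pair of scenarios.

    `samples` is an iterable of (user_id, scene) pairs.
    """
    users_by_scene = defaultdict(set)
    for user_id, scene in samples:
        users_by_scene[scene].add(user_id)

    overlap = {}
    for a, b in combinations(sorted(users_by_scene), 2):
        overlap[(a, b)] = len(users_by_scene[a] & users_by_scene[b])
    return overlap


# Toy data for illustration only, not taken from AntM2C.
toy_samples = [
    ("u1", "A"), ("u1", "B"), ("u1", "C"),
    ("u2", "A"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
]
toy_overlap = pairwise_user_overlap(toy_samples)
```

Only the upper triangle is produced, which mirrors the layout of Table 2.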

3.2.2. Item & User Frequency
Figure 2 illustrates the frequency of users and items in the AntM2C dataset, including all samples and samples from different scenarios (A-E). The horizontal axis represents the frequency of users/items, while the vertical axis represents the number of users/items at that frequency. It can be observed that, in terms of item distribution, all scenarios exhibit a long-tail distribution, with 80% of the items appearing fewer than 5 times. This long-tail distribution is consistent with real-world situations. As for user distribution, there are differences between scenarios. In scenario B, the distribution of user frequency has two peaks, one at fewer than 5 times and the other around 50 times. Beyond a frequency of 50, the number of users decreases as the frequency increases. In other scenarios, the exposure frequency of users follows a long-tail distribution similar to that of items, where a higher exposure frequency corresponds to fewer users. Due to the overlapping users between scenarios, the long-tail distribution of users in individual scenarios becomes close to a normal distribution in the global samples, where most users have an exposure frequency of around 50. Overall, the distribution of items and users in AntM2C reflects CTR prediction in practice.
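The quantity plotted in Figure 2 is a histogram of per-entity exposure counts. A minimal sketch of how it could be computed is given below; the function names and the toy ID list are hypothetical and serve only to illustrate the definition of the two axes.

```python
from collections import Counter


def frequency_histogram(ids):
    """Map each frequency to the number of entities seen that many times."""
    per_entity = Counter(ids)            # entity -> number of exposures
    return Counter(per_entity.values())  # frequency -> number of entities


def share_below(ids, limit):
    """Fraction of distinct entities that appear fewer than `limit` times."""
    per_entity = Counter(ids)
    rare = sum(1 for count in per_entity.values() if count < limit)
    return rare / len(per_entity)


# Toy item IDs for illustration only, not taken from AntM2C.
toy_item_ids = ["i1", "i1", "i2", "i3", "i3", "i3"]
toy_histogram = frequency_histogram(toy_item_ids)
toy_share = share_below(toy_item_ids, 3)
```

Applying `share_below(item_ids, 5)` to the item column of a scenario would give the long-tail share discussed above.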
Category | Feature_name | Description | Type | Coverage |
User Features | user_id | user number | ID | 100% |
User Features | features_0-26 | user sequences | ID | 85.50% |
User Features | query_entity_seq | search sequence | Text | 90.32% |
Item Features | item_id | item number | ID | 100% |
Item Features | item_entity_names | entity name of item | Text | 100% |
Item Features | item_title | title of item | Text | 95.50% |
Other Features | log_time | time in log | Text | 100% |
Other Features | scene | scenario number | ID | 100% |
Label | label | click label | Int | 100% |
3.3. Features
The feature system of AntM2C, as shown in Table 3, includes ID features of users and items, as well as raw text features.
3.3.1. User Features
The user features consist of static profile features (user static attributes and item title will be open-sourced in the subsequent phases) and user sequence features. The static profile features include basic user attributes such as gender, age, occupation, etc. The sequence features provide the user’s recent activities on Alipay, including clicked mini-apps, searched services, purchased items, etc. As mentioned in Section 3.1.3, these user features have been desensitized and encrypted for the purpose of user privacy protection and appear in the dataset in an encrypted ID format, making it impossible to reconstruct the original user features. In addition to the ID-based features, AntM2C also includes the raw text of user search entities to provide multi-modal evaluation.
3.3.2. Item Features
The item features consist of item ID and item textual features. The item ID is a globally unique identifier for each item, and the encoding of item IDs varies across different scenarios. To address the inconsistency of item IDs across scenarios, AntM2C also includes the original title text of the items (to be open-sourced in the subsequent phases) and entities extracted based on the title text.
3.3.3. Other Features
In addition to user and item features, AntM2C also provides additional features such as log time and scene identification. Users can utilize these extra features to flexibly split the training, validation, and testing sets based on time and evaluate the performance in different scenarios.
3.3.4. Label
The label in AntM2C indicates whether the user clicked on the corresponding item. If the user clicked, the label is set to 1, otherwise it is set to 0. The ratio of positive to negative samples in AntM2C can be obtained from the click rate in Table 1. It should be noted that there are a large number of negative samples in the actual online logs (samples that were exposed but not clicked on). To address this issue, negative sampling was performed which resulted in a higher click-through rate in the AntM2C dataset compared to that in the actual online logs.
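As a worked check of how the click rate in Table 1 relates to the ratio of positive to negative samples, the sketch below recomputes both from the exposure and click counts. The counts are copied from Table 1, with the click count of scenario B read as 4,301,270, the value that reproduces the reported 47.88% and the ALL total; the function names are illustrative.

```python
# (exposure, clicks) per scenario, copied from Table 1.
TABLE_1 = {
    "A": (3_996_614, 147_656),
    "B": (8_983_124, 4_301_270),
    "C": (1_211_813, 68_566),
    "D": (1_981_484, 722_009),
    "E": (955_162, 102_671),
}


def click_rate(exposure, clicks):
    """Share of exposures that were clicked."""
    return clicks / exposure


def pos_neg_ratio(exposure, clicks):
    """Number of positive samples per negative sample."""
    return clicks / (exposure - clicks)


total_exposure = sum(exposure for exposure, _ in TABLE_1.values())
total_clicks = sum(clicks for _, clicks in TABLE_1.values())

for scene, (exposure, clicks) in TABLE_1.items():
    rate = 100 * click_rate(exposure, clicks)
    ratio = pos_neg_ratio(exposure, clicks)
    print(f"{scene}: click rate {rate:.2f}%, positives per negative {ratio:.3f}")
```

A click rate p corresponds to a positive-to-negative ratio of p / (1 - p), so scenario B is close to balanced while scenario A has roughly one positive per 26 negatives.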
4. Experimental Evaluation
In this section, we describe the applications of AntM2C in several CTR prediction tasks. We briefly introduce each task and report the results of some baseline methods. We select the commonly used AUC (Area Under the Curve) as the metric for all experiments. The baseline methods and evaluation results in the experiment provide a demonstration of how AntM2C can be used. More baselines and evaluations will continue to be updated in future work.
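For reference, AUC can be computed from the ranks of the positive samples (the Mann-Whitney U statistic). The sketch below is a plain-Python illustration of the metric under the assumption of binary 0/1 labels; it is not the evaluation script used for the reported results.

```python
def auc(labels, scores):
    """Area under the ROC curve via rank sums, with average ranks for ties.

    `labels` holds 0/1 click labels and `scores` the predicted CTR values.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The value equals the probability that a randomly chosen clicked sample is scored above a randomly chosen non-clicked one, with ties counted as one half.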
4.1. Multi-Scenario CTR prediction
Multi-scenario CTR prediction is a common issue in industrial recommendation systems. It builds a unified model by leveraging CTR data from multiple scenarios. The knowledge sharing between scenarios enables the multi-scenario model to achieve better performance compared to single-scene modeling. We conduct an evaluation on multi-scenario CTR prediction using different baseline methods based on the 5 scenarios in the AntM2C dataset.
Scenario | Train Set | Test Set |
A | 3,499,645 | 496,969 |
B | 7,890,222 | 1,092,901 |
C | 1,059,578 | 151,670 |
D | 1,802,707 | 178,777 |
E | 846,791 | 104,359 |
Total | 15,098,943 | 2,024,676 |
4.1.1. Data preprocess
In the multi-scenario CTR evaluation, we divide the AntM2C dataset based on time, using the data before 20230717 as the training set and the data on 20230717 as the test set. The training and test sets include samples from all five scenarios, and their data distribution is shown in Table 4. It can be observed that there are differences in the number of training and test samples among different scenarios. Among them, Scenario B has the highest number of samples, which is about ten times that of Scenario E. In terms of features, we use the user and item features from the ID category as shown in Table 3. The text features will be used for multi-modal evaluation (see Section 4.3).
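The time-based split can be sketched as follows. This assumes that the `log_time` text field starts with the day in YYYYMMDD form, matching the way dates are written in this paper; the actual format of `log_time` in the released files may differ, and the helper name and toy rows are hypothetical.

```python
TEST_DAY = "20230717"


def split_by_day(rows, test_day=TEST_DAY):
    """Put rows logged before `test_day` in train and rows on `test_day` in test.

    Assumes row["log_time"] begins with an 8-character YYYYMMDD day, so that
    string comparison agrees with chronological order.
    """
    train, test = [], []
    for row in rows:
        day = row["log_time"][:8]
        if day < test_day:
            train.append(row)
        elif day == test_day:
            test.append(row)
        # Rows after the test day (none are expected) are left out.
    return train, test


# Toy rows for illustration only, not taken from AntM2C.
toy_rows = [
    {"log_time": "20230709 08:15:00", "scene": "A", "label": 0},
    {"log_time": "20230716 23:59:59", "scene": "B", "label": 1},
    {"log_time": "20230717 00:00:01", "scene": "B", "label": 0},
]
toy_train, toy_test = split_by_day(toy_rows)
```

Because the split is by day rather than by scenario, both sets keep samples from all five scenarios, as in Table 4.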
4.1.2. Baselines and hyper-parameters
We mainly choose the multi-task methods as the baseline methods for multi-scenario CTR prediction. We treat the CTR estimation for each scenario as a task and share the knowledge among the scenarios at the bottom layer, with each scenario’s CTR score output at the tower layer. The baseline methods and hyperparameter settings are as follows:
• DNN: The DNN is trained on a mixture of samples from all scenarios without distinguishing tasks, serving as the baseline for multi-scenario CTR prediction. The DNN consists of three layers with 128, 32, and 2 units, respectively. The following multi-task models have the same number of layers and unit settings as the DNN.
• Shared Bottom (Ruder, 2017): Shared bottom is the most fundamental model in multi-task learning, where the knowledge is shared among the tasks at the bottom layer. Each task has its own independent tower layer and outputs the corresponding CTR score (https://github.com/shenweichen/DeepCTR).
• MMoE (Ma et al., 2018): Based on the shared bottom, MMoE introduces multiple expert networks that share a common input layer. Additionally, MMoE adds a gating network for each task that assigns different weights to the experts based on the input data, determining their influence on the output for that task. In the experiment, we set the number of experts in MMoE to 6 (https://github.com/drawbridge/keras-mmoe).
• PLE (Tang et al., 2020): Based on MMoE, PLE further designs task-specific experts for each task, while retaining the shared expert. This structure allows the model to better learn the differences and correlations among tasks. We set the number of experts in PLE to be the same as MMoE, with each of the five scenarios having its own specific expert and one globally shared expert (https://github.com/shenweichen/DeepCTR).
All baseline methods utilized the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 1e-3 for parameter optimization. The models were trained for 5 epochs with a batch size of 512.
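To make the expert and gate structure concrete, the following is a toy forward pass of an MMoE-style model with 6 experts and one task per scenario, written in plain Python. It is only a sketch under simplifying assumptions: weights are random and untrained, each tower is reduced to a single sigmoid unit rather than the layer sizes listed above, and the input width of 8 is arbitrary. It is not the implementation used for the reported results.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_mmoe(rng, n_in, n_experts=6, expert_dim=32, n_tasks=5):
    """Shared experts, plus one gate and one tower per task (scenario)."""
    return {
        "experts": [init_layer(rng, n_in, expert_dim) for _ in range(n_experts)],
        "gates": [init_layer(rng, n_in, n_experts) for _ in range(n_tasks)],
        "towers": [init_layer(rng, expert_dim, 1) for _ in range(n_tasks)],
    }


def mmoe_forward(params, x):
    """Return one CTR score per task for a single input vector `x`."""
    expert_outs = [relu(linear(x, w, b)) for w, b in params["experts"]]
    scores = []
    for (gate_w, gate_b), (tower_w, tower_b) in zip(params["gates"], params["towers"]):
        gate = softmax(linear(x, gate_w, gate_b))  # task-specific expert weights
        mixed = [
            sum(g * out[d] for g, out in zip(gate, expert_outs))
            for d in range(len(expert_outs[0]))
        ]
        scores.append(sigmoid(linear(mixed, tower_w, tower_b)[0]))
    return scores


rng = random.Random(0)
params = init_mmoe(rng, n_in=8)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
scores = mmoe_forward(params, x)  # five scores, one per scenario A-E
```

Replacing every gate with a fixed uniform weighting over a single expert recovers the shared bottom structure, which is why the two models are natural points of comparison.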
Methods | A | B | C | D | E |
DNN | 0.7846 | 0.9328 | 0.8733 | 0.6880 | 0.8338 |
Shared Bottom | 0.8039 | 0.9414 | 0.8798 | 0.6915 | 0.8525 |
MMoE | 0.7986 | 0.9438 | 0.8751 | 0.6854 | 0.8519 |
PLE | 0.8039 | 0.9429 | 0.8785 | 0.6903 | 0.8506 |
4.1.3. Results
Table 5 shows the evaluation results of different baseline methods on multi-scenario CTR prediction, from which we can draw the following conclusions. Firstly, compared to the DNN model that trains all data together without considering scenario characteristics, the multi-task models achieve better performance in almost every case; the only exception is MMoE in scenario D (0.6854 versus 0.6880). This demonstrates that in AntM2C, there are differences and commonalities between scenarios, and simply mixing training data will not achieve the best results. Secondly, the CTR performance varies across scenarios, indicating different levels of difficulty between scenarios. For example, in scenario B, where there is a large amount of data, the AUC is generally above 0.93, while in scenario D, the AUC is only around 0.69. The diverse business scenarios and items in AntM2C enable a more comprehensive and diverse evaluation of CTR. Finally, the differences among the multi-task models are small: the expert-structured MMoE obtains the best AUC in scenario B, PLE matches the shared bottom model in scenario A, and the shared bottom model is best in scenarios C, D, and E. AntM2C is therefore capable of reflecting the differences between different models.
4.2. Cold-start CTR prediction
The cold-start problem is a challenging issue in recommendation systems, because high-quality CTR models must be trained using sparse user-item interaction data. Cold-start primarily involves two aspects: users and items. As shown in Figure 2, the AntM2C dataset exhibits a natural long-tail distribution in both users and items. Therefore, we conduct a comprehensive evaluation of cold-start baseline methods based on the AntM2C dataset.
4.2.1. Data preprocess
In cold-start CTR prediction, we split the dataset based on time, using data before 20230717 as the training set and data on 20230717 as the validation and test sets. Based on this data division, we simulated two common cold-start problems in practice: few-shot and zero-shot.
• Few-shot: users and items that appear in the training set with a count greater than 0 and less than a threshold (the selection of this threshold can vary based on experiments, and we use 100 as an example for all experiments), meaning there is only a small amount of training data for these users and items.
• Zero-shot: users and items that have never appeared in the training set, indicating that either the user is visiting the scenario for the first time or the item has been launched and added to the scenario on the first day.
Table 6 shows the data distribution of the test set under cold-start CTR evaluation. By using this dataset division, we can comprehensively evaluate and compare the performance of CTR models on few-shot and zero-shot samples. For few-shot samples, we can observe the model’s performance with only a small amount of training data and evaluate the model’s generalization ability. For zero-shot samples, we can evaluate the model’s recommendation ability on samples that it has never seen before.
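The two cold-start categories can be assigned by counting how often each test-set user or item occurs in the training set. A minimal sketch follows, using the example threshold of 100 mentioned above; the function names, the "warm" label for the remaining entities, and the toy IDs are illustrative assumptions.

```python
from collections import Counter

FEW_SHOT_THRESHOLD = 100  # example threshold stated in the text


def cold_start_bucket(train_count, threshold=FEW_SHOT_THRESHOLD):
    """Classify an entity by its number of occurrences in the training set."""
    if train_count == 0:
        return "zero-shot"
    if train_count < threshold:
        return "few-shot"
    return "warm"


def bucket_test_entities(train_ids, test_ids, threshold=FEW_SHOT_THRESHOLD):
    """Map every distinct test entity (user or item ID) to its bucket."""
    train_counts = Counter(train_ids)  # missing IDs count as 0
    return {
        entity: cold_start_bucket(train_counts[entity], threshold)
        for entity in set(test_ids)
    }


# Toy IDs for illustration only, not taken from AntM2C.
toy_train_ids = ["u1"] * 150 + ["u2"] * 3
toy_test_ids = ["u1", "u2", "u3", "u3"]
toy_buckets = bucket_test_entities(toy_train_ids, toy_test_ids)
```

Running the same function once on user IDs and once on item IDs yields the four groups reported in Table 6.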
Category | Cold-start user count | Cold-start user samples | Cold-start item count | Cold-start item samples |
Few-Shot | 67,110 | 685,774 | 30,315 | 306,964 |
Zero-Shot | 65 | 2,752 | 14,230 | 121,447 |
4.2.2. Baselines and hyper-parameters
The key issue in cold-start modeling is how to learn user preferences and embeddings of users and items with limited data. In recent years, meta-learning-based cold-start methods have become state-of-the-art methods. We selected several representative methods with publicly available code as our baseline models.
• DropoutNet (Volkovs et al., 2017): DropoutNet is a popular cold-start method which applies dropout to control input, and exploits the average representations of interacted items/users to enhance the embeddings of users/items. We implemented the DropoutNet algorithm based on open-source code (https://github.com/layer6ai-labs/DropoutNet).
• MAML (Finn et al., 2017): The MAML algorithm is a popular meta-learning approach that aims to enable fast adaptation to new tasks with limited data. MAML learns a good initialization of model parameters that can be adapted to new tasks quickly. We treat each user and item as a task in MAML, and conduct meta-training on warm items. Then we perform meta-testing on cold-start items. The subsequent meta-learning-based algorithms also follow this task setting.
• MeLU (Lee et al., 2019): The MeLU algorithm is the first to apply MAML to address the cold-start problem in recommender systems. Building upon MAML, MeLU ensures the stability of the learning process by not updating the embeddings in the inner loop (support set). The hyperparameter settings in MeLU were determined based on the public code implementation (https://github.com/hoyeoplee/MeLU).
• MetaEmb (Pan et al., 2019): The MetaEmb algorithm also applies MAML to address the cold-start problem in recommender systems. Unlike MeLU, MetaEmb focuses on optimizing the embeddings of items. It learns an initial representation using all training samples and then quickly adapts the embeddings of cold-start items. We implemented the MetaEmb algorithm based on open-source code (https://github.com/Feiyang/MetaEmbedding). Although MetaEmb only optimizes the embeddings of items, we have also applied the same approach to optimize the embeddings of users.
These base models share the common embedding and DNN structure. The dimensionality of embedding vectors of each input field is fixed to 32 for all our experiments. The Adam optimizer with a learning rate of 1e-3 is used to optimize the model parameters, and the training is performed for 3 epochs with a batch size of 512. In addition to the aforementioned cold-start algorithms, the DNN (without any cold-start optimization) is also considered as the baseline method for cold-start CTR.
Methods | Item Zero-Shot | Item Few-Shot | User Zero-Shot | User Few-Shot |
DNN | 0.8021 | 0.8339 | 0.7931 | 0.9365 |
DropoutNet | 0.8097 | 0.8498 | 0.7957 | 0.9387 |
MAML | 0.8131 | 0.8511 | 0.8133 | 0.9393 |
MeLU | 0.8197 | 0.8519 | 0.8103 | 0.9404 |
MetaEmb | 0.8203 | 0.8583 | 0.8091 | 0.9399 |
4.2.3. Results
Table 7 shows the CTR performance for cold-start users and items. Because there is limited data for cold-start users and items, we do not calculate AUC by scenario and instead evaluate the overall performance on cold-start users and items. From the table, we can observe several phenomena. Firstly, compared to the results shown in Table 5, the AUC for cold-start users and items is generally lower than the overall level, which demonstrates that AntM2C’s data can effectively reflect the differences between cold and warm items and users. Secondly, different cold-start methods show distinguishable results on AntM2C, and all of them outperform the DNN model without cold-start optimization. This indicates that AntM2C can be used to compare the effects of different cold-start methods and to distinguish between them. Finally, the lower performance of zero-shot compared to few-shot indicates that zero-shot CTR prediction is more challenging than few-shot. The two cold-start modes provided by AntM2C can comprehensively evaluate cold-start CTR prediction.
4.3. Multi-Modal CTR prediction
With the rise of large language models (LLMs), effectively transferring the knowledge of LLMs to CTR prediction has become a hot research topic. There have been many works (Sun et al., 2019; Geng et al., 2022; Hou et al., 2022; Penha and Hauff, 2020) on multi-modal CTR modeling using features such as item and user text. AntM2C contains raw text features for both users and items, which can provide a more comprehensive evaluation of multi-modal modeling compared to existing CTR datasets. Therefore, we conduct the evaluation of different multi-modal methods based on the AntM2C dataset.
4.3.1. Data preprocess
In multi-modal evaluation, we adopt the same data processing approach as in the multi-scenario evaluation described in Section 4.1.1, and additionally include the text features from Table 3: user query entities and item entities. The text features are used as inputs to the model together with the ID features.
4.3.2. Baselines and hyper-parameters
For the baseline model, we use the language model to process the text features, and then concatenate the text embedding with other ID features and input them into the multi-scenario model described in Section 4.1.2. For ease of evaluation, we choose MMoE as the backbone and pre-trained Bert-base (https://huggingface.co/docs/transformers/main/model_doc/bert) (Devlin et al., 2018) as the text embedding extractor. The output dimension of Bert’s embeddings is 768. Then, a two-layer DNN with 768 and 32 units is used to reduce the dimension of Bert’s embedding to 32. This reduced embedding is concatenated with other features and input into the MMoE model. More powerful language models and the application of text features will continue to be supplemented in future work.
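The dimension reduction and concatenation step can be sketched in plain Python as follows. The sketch assumes a two-layer network with 768 and 32 units and ReLU activations, random untrained weights, a random vector standing in for the 768-dimensional Bert output, and 32-dimensional ID embeddings as in Section 4.2.2; the activation choice and all names are assumptions for illustration, not details confirmed by the experiments.

```python
import random

BERT_DIM = 768
REDUCED_DIM = 32


def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_relu(x, weights, bias):
    """Dense layer followed by ReLU (the activation is an assumption)."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
layer_1 = init_layer(rng, BERT_DIM, 768)         # 768 -> 768
layer_2 = init_layer(rng, 768, REDUCED_DIM)      # 768 -> 32


def reduce_text_embedding(text_emb):
    """Project a 768-d text embedding down to 32 dimensions."""
    hidden = dense_relu(text_emb, *layer_1)
    return dense_relu(hidden, *layer_2)


def build_model_input(text_emb, id_embs):
    """Concatenate the reduced text embedding with the ID feature embeddings."""
    flat_ids = [value for emb in id_embs for value in emb]
    return reduce_text_embedding(text_emb) + flat_ids


# Stand-ins for a Bert sentence embedding and two 32-d ID embeddings.
fake_text_emb = [rng.uniform(-1.0, 1.0) for _ in range(BERT_DIM)]
fake_id_embs = [[rng.uniform(-1.0, 1.0) for _ in range(32)] for _ in range(2)]
model_input = build_model_input(fake_text_emb, fake_id_embs)
```

The resulting vector is what would be passed to the shared experts and gates of the MMoE backbone in place of the ID-only input.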
Methods | A | B | C | D | E |
MMoE | 0.7986 | 0.9438 | 0.8751 | 0.6854 | 0.8519 |
MMoE+Bert | 0.7951 | 0.9437 | 0.8851 | 0.6974 | 0.8642 |
4.3.3. Results
Table 8 shows the evaluation results of the multi-modal CTR. It can be observed that, after adding the text modality, the CTR performance is better in data-sparse scenarios C, D, and E compared to using only the ID modality in the MMoE. Since the current baseline for using the text modality is relatively simple, the improvement in performance is not significant. However, this shows the potential of the text modality provided in AntM2C to improve CTR performance.
5. Conclusion And Future Work
This paper introduces AntM2C, a large-scale Multi-Scenario Multi-Modal CTR prediction dataset. It includes 1 billion CTR samples from five business scenarios on the Alipay platform, and each sample contains multi-modal features in addition to ID features, providing a comprehensive evaluation for CTR models. In the first release, we have made 10 million samples publicly available, and we will continue to release more data and features. At the same time, we will gradually evaluate more state-of-the-art baseline methods on AntM2C and provide comprehensive and solid evaluation results.
References
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning. PMLR, 1126–1135.
- Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
- Hou et al. (2022) Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073–1082.
- Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
- Pan et al. (2019) Feiyang Pan, Shuokai Li, Xiang Ao, Pingzhong Tang, and Qing He. 2019. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 695–704.
- Penha and Hauff (2020) Gustavo Penha and Claudia Hauff. 2020. What does bert know about books, movies and music? probing bert for conversational recommendation. In Proceedings of the 14th ACM Conference on Recommender Systems. 388–397.
- Ruder (2017) Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017).
- Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
- Tang et al. (2020) Hongyan Tang, Junning Liu, Ming Zhao, and Xudong Gong. 2020. Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations. In Proceedings of the 14th ACM Conference on Recommender Systems. 269–278.
- Volkovs et al. (2017) Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. Dropoutnet: Addressing cold start in recommender systems. Advances in neural information processing systems 30 (2017).
- Yuan et al. (2022) Guanghu Yuan, Fajie Yuan, Yudong Li, Beibei Kong, Shujie Li, Lei Chen, Min Yang, Chenyun Yu, Bo Hu, Zang Li, et al. 2022. Tenrec: A Large-scale Multipurpose Benchmark Dataset for Recommender Systems. Advances in Neural Information Processing Systems 35 (2022), 11480–11493.