
G-Meta: Distributed Meta Learning in GPU Clusters for Large-Scale Recommender Systems

Youshao Xiao, Shangchun Zhao, Zhenglei Zhou, Zhaoxin Huan, Lin Ju, Xiaolu Zhang, Lin Wang, and Jun Zhou
Ant Group, Hangzhou, China
[email protected]
(2023)
Abstract.

Recently, a new paradigm, meta learning, has been widely applied to Deep Learning Recommendation Models (DLRM) and significantly improves statistical performance, especially in cold-start scenarios. However, existing systems are not tailored for meta learning based DLRM models and have critical efficiency problems in distributed training over GPU clusters. This is because the conventional deep learning pipeline is not optimized for the two task-specific datasets and two update loops in meta learning. This paper provides a high-performance framework for large-scale training of optimization-based Meta DLRM models over the GPU cluster, namely G-Meta. Firstly, G-Meta utilizes both data parallelism and model parallelism with careful orchestration regarding computation and communication efficiency to enable high-speed distributed training. Secondly, it proposes a Meta-IO pipeline for efficient data ingestion to alleviate the I/O bottleneck. Various experimental results show that G-Meta achieves notable training speedups without loss of statistical performance. Since early 2022, G-Meta has been deployed in Alipay's core advertising and recommender system, shortening the continuous model delivery time by four times. It also obtains a 6.48% improvement in Conversion Rate (CVR) and a 1.06% increase in Cost Per Mille (CPM) in Alipay's homepage display advertising, with the benefit of larger training samples and tasks.

Recommender System; Deep Meta Learning; Distributed Training
Copyright 2023 ACM. Conference: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), October 21–25, 2023, Birmingham, United Kingdom. DOI: 10.1145/3583780.3615208. ISBN: 979-8-4007-0124-5/23/10. Price: 15.00. CCS: Computing methodologies → Machine learning; Computing methodologies → Distributed computing methodologies.

1. Introduction

Deep Learning Recommendation Models (DLRM) are ubiquitously adopted by internet companies in core applications such as Advertising, Search, and Recommendation (ASR) scenarios, which influence revenues at the level of billions of dollars (Cheng et al., 2016; Ma et al., 2020; Zhou et al., 2019). However, learning-based methods are data-demanding and work poorly on users, items, or scenarios with little logging data, which is known as the cold-start problem (Volkovs et al., 2017; Pan et al., 2019). It not only significantly hinders revenue but also largely degrades the satisfaction of new users and advertisers.

Meta learning, especially optimization-based meta learning, has recently been widely applied to the cold-start problem in DLRM and effectively alleviates it (Du et al., 2019; Pan et al., 2019; Lu et al., 2020; Huan et al., 2022; Vartak et al., 2017). The paradigm of optimization-based meta learning (Hospedales et al., 2021) usually consists of two update loops over two task-specific datasets: the inner loop learns the task-specific parameters on the support set, and the outer loop updates the meta parameters on the query set using the parameters computed in the inner loop. Among these methods, Model Agnostic Meta Learning (MAML) is the most classical approach, and subsequent variants refine the algorithm by varying the base neural network (Ravi and Larochelle, 2017; Bertinetto et al., 2018) or the optimization method (Nichol et al., 2018; Rajeswaran et al., 2019). Additionally, parallelization is essential in industry to train massive datasets with up to billions of samples or model parameters while ensuring statistical performance and in-time delivery (Abadi et al., 2016; Paszke et al., 2019; Chen et al., 2015).

Figure 1. The Distributed Training Architecture of G-Meta

However, there are two primary problems when parallelizing Meta DLRM. The first is that the existing distributed training architectures do not fit recommender system workloads under the meta learning paradigm. Parameter Server (Dean et al., 2012; Li et al., 2014) is one of the most popular architectures for distributed DLRM training, and DMAML (Bollenbacher et al., 2020) customizes the Parameter Server architecture for MAML training in the CPU cluster. However, the two update loops in meta learning double the computation, and the computation-intensive dense layer becomes more complicated in DLRM (Zhang et al., 2022; Wang et al., 2022), which makes CPU computation time-consuming and calls for GPU acceleration. Nevertheless, Parameter Server is mainly used in CPU clusters, and its design underutilizes GPUs since the embedding layers held in servers are I/O and communication-intensive operators (Sergeev and Del Balso, 2018). AllReduce (Gibiansky, 2017; Sergeev and Del Balso, 2018) is another popular architecture in GPU clusters, and QMAML (Kunde et al., 2021) parallelizes the MAML algorithm on top of AllReduce; however, the GPU memory of a single device cannot hold the huge embedding layer in DLRM (Wang et al., 2022).

Secondly, meta learning requires different data management from traditional deep learning training, and the conventional I/O design bottlenecks the training speed (Aizman et al., 2019; Mudigere et al., 2022). Meta learning requires assembling the batch data at both the task level and the batch level, whereas traditional deep learning only requires batch-level assembly in the training pipeline (Abadi et al., 2016; Wang et al., 2022; Zhang et al., 2022). More specifically, each worker may hold batch data from different tasks, but after shuffling the samples in a batch should belong to one identical task for correctness. QMAML (Kunde et al., 2021) customizes data loading for CV scenarios; nevertheless, it is inefficient for ingesting massive datasets in ASR scenarios, e.g., billions of KB-level small samples (Aizman et al., 2019).

Therefore, we propose G-Meta to speed up the distributed training of optimization-based Meta DLRM in GPU clusters. This work makes three primary contributions: (1) Hybrid Parallelism. G-Meta parallelizes the training of industrial Meta DLRM in GPU clusters by first introducing hybrid parallelism via the AlltoAll and AllReduce communication primitives. (2) Optimization for Parallelization. G-Meta significantly improves the computation and communication efficiency of distributed meta learning from both algorithmic and engineering aspects. (3) High-Performance Meta-IO. G-Meta offers high-throughput data loading for meta samples by designing the Meta-IO pipeline with I/O optimizations.

2. Our System

G-Meta consists of two components: Meta-Train and Meta-IO. Firstly, Meta-Train is responsible for high-performance distributed training in the GPU cluster. Secondly, the Meta-IO pipeline is designed to maximize the speed of data ingestion. Without loss of generality, we take the MAML algorithm as an example; it can easily be extended to other optimization-based algorithms. Following the notation of the MAML paper (Finn et al., 2017), we present G-Meta in detail.

2.1. Meta-Train

The goal of Meta-Train is to learn the meta model parameters: the embedding layer parameters $\xi^{\ast}$ and the dense layer parameters $\theta^{\ast}$, where $\xi^{\ast}$ is too huge to be held on a single device in GPU clusters. Firstly, G-Meta evenly partitions the enormous embedding parameters and distributes them to all workers; it shares the small dense parameters among all workers. Then it employs a hybrid parallelism algorithm to enable efficient training with careful orchestration. Also, we introduce additional network optimizations from the hardware aspect. The training architecture of G-Meta is depicted in Figure 1.
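To make the parameter placement concrete, below is a minimal sketch of the row-wise sharding described above. It is illustrative only: the helper name shard_embedding and the modulo partition of rows are our assumptions, not G-Meta's internal layout.

import numpy as np

def shard_embedding(num_rows, dim, num_workers, seed=0):
    # Illustrative row-wise sharding: row r is owned by worker r % num_workers.
    # Returns, for each worker, (global row ids, local parameter slice).
    rng = np.random.default_rng(seed)
    full_table = rng.normal(size=(num_rows, dim)).astype(np.float32)
    shards = []
    for w in range(num_workers):
        rows = np.arange(w, num_rows, num_workers)   # rows owned by worker w
        shards.append((rows, full_table[rows]))
    return shards

# Dense parameters theta are simply replicated on every worker.
shards = shard_embedding(num_rows=100_000, dim=16, num_workers=4)
print([s[1].shape for s in shards])   # each worker holds roughly 1/4 of the rows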

2.1.1. Distributed Inner Loop

Following Algorithm 1, we first parallelize the inner loop by distributing the learning processes over all workers to accelerate training and save scarce GPU memory. In the inner loop, MAML only uses the mini-batch support set $\mathcal{D}_{i}^{Sup}$ for model computation and uses the mini-batch query set $\mathcal{D}_{i}^{Query}$ only in the subsequent outer loop. In this case, the Meta DLRM model would fetch the embedding parameters through the I/O and communication-intensive embedding lookup operation twice, once for the support set and once for the query set, which is inefficient. Therefore, to lower the communication frequency, we aggregate these two embedding lookups and prefetch both the embedding parameters $\xi_{i}^{Sup}$ for the support set and $\xi_{i}^{Query}$ for the query set together. During this phase, to fully utilize the bandwidth among workers, we perform the AlltoAll primitive (Jeaugey, 2017) to exchange the embedding parameters among all workers, in contrast to the parameter server architecture. After that, each worker locally computes the loss $\mathcal{L}^{Sup}_{i}$ in the inner forward pass and updates the model parameters in the backward propagation. As a result, each worker $i$ owns its task-specific model parameters ${\xi^{\prime}_{i}}^{Sup}$ and $\theta^{\prime}_{i}$.

2.1.2. Distributed Outer Loop

In the outer loop, we locally update the embedding parameters of the query dataset $\mathcal{D}_{i}^{Query}$ that overlap with the support dataset $\mathcal{D}_{i}^{Sup}$, using the formerly computed ${\xi^{\prime}_{i}}^{Sup}$ to obtain ${\xi^{\prime}_{i}}^{Query}$; otherwise, the outer loop would use stale embedding parameters due to the prefetch optimization above. Then we take the outer forward propagation on the query dataset and obtain the loss $\mathcal{L}^{Query}_{i}$ for each worker. We update the huge embedding parameters globally via the AlltoAll primitive and the small DNN parameters via the AllReduce primitive. Lastly, we obtain the meta parameters $[\xi,\theta]$ for the next iteration, and repeat the procedure until convergence.

Algorithm 1 Hybrid Parallelism Algorithm for G-Meta

Input: $\alpha$, $\beta$ are step-size hyperparameters, $N$ is the number of workers

1: Worker $i$:
2: Randomly initialize $\xi_{i}$ and $\theta_{i}$ for each worker $i$ ▷ $\xi$ is bucketized into shards by rows and evenly distributed to workers as $\xi_{i}$; $\theta_{i}$ is the replica of $\theta$ on worker $i$.
3: while not converged do
4:     Sample a batch of tasks $\mathcal{T}_{i}\sim p(\mathcal{T})$ for worker $i$
5:     Split $\mathcal{T}_{i}$ into mini-batch support set $\mathcal{D}_{i}^{Sup}$ and mini-batch query set $\mathcal{D}_{i}^{Query}$.
6:     Perform the AlltoAll primitive to prefetch $\xi_{i}^{Sup}$ for the support set $\mathcal{D}_{i}^{Sup}$ and $\xi_{i}^{Query}$ for the query set $\mathcal{D}_{i}^{Query}$.
7:     Inner forward: $\mathcal{L}^{Sup}_{i}\leftarrow\mathcal{L}(f(\xi_{i}^{Sup},\theta_{i};\mathcal{D}_{i}^{Sup}))$
8:     Local backpropagation: ${\xi^{\prime}_{i}}^{Sup}\leftarrow\xi_{i}^{Sup}-\alpha\nabla_{\xi_{i}^{Sup}}\mathcal{L}^{Sup}_{i}$
9:     Local backpropagation: $\theta^{\prime}_{i}\leftarrow\theta_{i}-\alpha\nabla_{\theta_{i}}\mathcal{L}^{Sup}_{i}$
10:    Update overlapping embedding parameters using ${\xi^{\prime}_{i}}^{Sup}$ to get ${\xi^{\prime}_{i}}^{Query}$ for the outer loop.
11:    Outer forward: $\mathcal{L}_{i}^{Query}\leftarrow\mathcal{L}(f({\xi^{\prime}_{i}}^{Query},\theta^{\prime}_{i};\mathcal{D}_{i}^{Query}))$
12:    Global backpropagation and communication via the AlltoAll primitive: $\xi\leftarrow\xi-\beta\sum_{i=1}^{N}\nabla_{\xi}\mathcal{L}^{Query}_{i}$
13:    Global backpropagation and communication via the AllReduce primitive: $\theta\leftarrow\theta-\beta\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}^{Query}_{i}$
14: end while
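To make the control flow of Algorithm 1 concrete, below is a minimal single-process simulation of the hybrid-parallel training loop for a toy linear model. It is a sketch only: it uses a first-order approximation of MAML for brevity, emulates the AlltoAll/AllReduce collectives with plain array operations, and all names are illustrative rather than taken from the TensorFlow implementation.

import numpy as np

rng = np.random.default_rng(0)
N, ROWS, DIM = 4, 32, 8          # workers, embedding rows, embedding dim
alpha, beta = 0.1, 0.05          # inner / outer step sizes

xi = rng.normal(size=(ROWS, DIM))    # embedding table (row-sharded in the real system)
theta = rng.normal(size=DIM)         # small dense parameters, replicated on every worker

def forward(emb, th, ids, y):
    # Toy DLRM: mean-pooled embedding -> linear layer -> 0.5*err^2 loss and its gradients.
    pooled = emb[ids].mean(axis=0)
    err = pooled @ th - y
    g_th = err * pooled                      # dL/dtheta
    g_emb = np.zeros_like(emb)
    g_emb[ids] += err * th / len(ids)        # dL/dxi for the looked-up rows
    return 0.5 * err ** 2, g_th, g_emb

for step in range(100):
    g_xi_sum = np.zeros_like(xi)
    g_th_sum = np.zeros_like(theta)
    for i in range(N):                       # each iteration plays the role of worker i
        sup_ids = rng.choice(ROWS, size=4, replace=False)
        qry_ids = rng.choice(ROWS, size=4, replace=False)
        y_sup, y_qry = rng.normal(), rng.normal()
        # "AlltoAll prefetch": rows for BOTH sets arrive in one exchange in G-Meta;
        # here we simply read from the (conceptually sharded) table.
        xi_view = xi.copy()
        # Inner loop on the support set: task-specific parameters
        _, g_th, g_xi = forward(xi_view, theta, sup_ids, y_sup)
        xi_i = xi_view - alpha * g_xi
        th_i = theta - alpha * g_th
        # Outer loop on the query set with the adapted parameters
        _, g_th_q, g_xi_q = forward(xi_i, th_i, qry_ids, y_qry)
        g_xi_sum += g_xi_q                   # aggregated via AlltoAll for the sharded xi
        g_th_sum += g_th_q                   # aggregated via AllReduce for replicated theta
    xi -= beta * g_xi_sum
    theta -= beta * g_th_sum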

2.1.3. Optimization for Outer Update Rule

The outer-loop update rule in MAML naturally corresponds to $\theta\leftarrow\theta-\beta\nabla_{\theta}\sum_{i=1}^{N}\mathcal{L}^{Query}_{i}$ in the parallel setting (Kunde et al., 2021). However, it requires a central node to Gather all $N$ task-specific parameters from the inner loop and update the meta parameters, which becomes a bottleneck. The data transferred is $K(N-1)$ ($K$ is the data transmission for each node), and the computational complexity is $O(KN)$ on the central node. Nevertheless, we can exchange the order of the gradient and summation operators, according to the derivative rules, into $\theta\leftarrow\theta-\beta\sum_{i=1}^{N}\nabla_{\theta}\mathcal{L}^{Query}_{i}$. Then each worker locally calculates the gradients and aggregates them into the meta parameters via the AllReduce primitive (e.g., Ring-AllReduce). Thus, the data transferred shrinks to $\frac{2K(N-1)}{N}$ and the computational complexity reduces to $O(K)$.
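As a quick sanity check of the two communication costs above, the sketch below compares the gather-based and ring-AllReduce-based traffic; $K$ and $N$ follow the notation in the text and the concrete numbers are illustrative.

def gather_bytes(K, N):
    # the central node receives K bytes from each of the other N-1 workers
    return K * (N - 1)

def ring_allreduce_bytes(K, N):
    # each worker sends/receives 2K(N-1)/N bytes in a ring all-reduce
    return 2 * K * (N - 1) / N

K = 100e6  # 100 MB of dense gradients per worker (illustrative)
for N in (4, 8, 32):
    print(N, gather_bytes(K, N) / 1e6, ring_allreduce_bytes(K, N) / 1e6)
# Gather traffic at the central node grows linearly with N,
# while per-worker ring all-reduce traffic stays below 2K.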

2.1.4. Optimization for Network

The above hybrid parallelism algorithm enables parallelizing the large Meta DLRM; however, AlltoAll and AllReduce are densely connected communication patterns, and the socket-based network in the data center impedes communication efficiency. Therefore, we further utilize a dedicated RDMA over Converged Ethernet (RoCE) network for inter-node data transmission to enable scalable high-speed communication. For intra-node data transmission, we leverage NVLink for higher bandwidth instead of going through the PCIe bus and system memory. The usage of RDMA and NVLink promotes high-performance data transfers for the densely connected communications in G-Meta training.
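For illustration, the snippet below shows the kind of NCCL environment configuration that steers inter-node traffic onto the RDMA (RoCE) transport and intra-node traffic onto NVLink. The interface and HCA names are placeholders, and the exact settings depend on the cluster; treat this as a hedged sketch rather than G-Meta's actual configuration.

import os

# Set before initializing the communication library so NCCL picks them up.
os.environ["NCCL_IB_DISABLE"] = "0"         # allow the RDMA (IB/RoCE) transport
os.environ["NCCL_IB_HCA"] = "mlx5_0"        # RoCE-capable NIC (cluster-specific placeholder)
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"   # bootstrap/fallback interface (placeholder)
os.environ["NCCL_P2P_LEVEL"] = "NVL"        # prefer NVLink for intra-node GPU-to-GPU traffic
os.environ["NCCL_DEBUG"] = "INFO"           # log which transports NCCL actually selects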

2.2. Meta-IO

In contrast to CPU devices, using GPUs largely shortens the model computation time, but it requires high-throughput I/O and network to ingest data fast enough; otherwise, training is still bottlenecked. After optimizing the computation and communication phases, we further design and optimize the data ingestion for meta-learning data in the data preprocessing and training phases. Moreover, the pipeline fits the data ingestion of all Meta DLRM models with high-performance I/O, not just optimization-based meta learning.

Figure 2. Dataflow of Meta-IO

2.2.1. Meta I/O pipeline

Different from the conventional deep learning pipeline, the meta data for training requires that the batch of records $\mathcal{T}_{i}$ sampled from $p(\mathcal{T})$ belong to one identical task. To achieve efficient data loading, we first sort the samples by the $task$ column, which is the unique id of each task, and generate a $batch\_id$ for each sample according to the batch size and the $task$ column in the data preprocessing phase, as shown in Figure 2. Secondly, we randomly shuffle the samples at the batch level; the traditional sample-level shuffle would mix samples from different tasks and bring unnecessary complexity to data organization. Lastly, in the training phase, samples are loaded evenly by each worker, and only records from the same task are assembled into a batch by our GroupBatchOp according to both the $task$ id and the $batch\_id$. We implement the preprocessing phase in MapReduce (Dean and Ghemawat, 2008) and the GroupBatchOp in the training phase in C++.
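The following is a minimal Python sketch of the two steps above ($batch\_id$ assignment and task-consistent batch grouping). It is illustrative only, since the production pipeline implements them in MapReduce and a C++ GroupBatchOp; the helper names are our own.

import random
from collections import defaultdict

def assign_batch_ids(samples, batch_size):
    # samples: list of dicts with a 'task' key. Sort by task, then number
    # consecutive chunks of `batch_size` within each task with a global batch_id.
    samples = sorted(samples, key=lambda s: s["task"])
    batch_id, count, prev_task = -1, 0, None
    for s in samples:
        if s["task"] != prev_task or count == batch_size:
            batch_id, count, prev_task = batch_id + 1, 0, s["task"]
        s["batch_id"] = batch_id
        count += 1
    return samples

def group_batches(samples, shuffle=True):
    # Batch-level shuffle: whole batches (not individual samples) are permuted,
    # so each emitted batch still holds samples from one identical task.
    batches = defaultdict(list)
    for s in samples:
        batches[s["batch_id"]].append(s)
    order = list(batches)
    if shuffle:
        random.shuffle(order)
    return [batches[b] for b in order]

data = [{"task": t, "x": i} for t in ("a", "b") for i in range(5)]
for batch in group_batches(assign_batch_ids(data, batch_size=2)):
    assert len({s["task"] for s in batch}) == 1   # single task per batch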

2.2.2. Optimization for Data Ingestion

We further optimize this phase to feed more samples and avoid an I/O bottleneck. Due to the massive volume of samples in ASR scenarios, we have to store samples in an HDD-based file system, such as HDFS (Borthakur, 2007), rather than on expensive SSDs. To efficiently load massive KB-level small samples, we first generate an extra $offset$ column in the preprocessing phase and store the samples sequentially according to the $offset$ column. Then samples can be loaded sequentially in the training phase according to $(offset \times i,\ offset \times i + total\_samples/N)$ for each worker $i$. This sequential read access allows high-throughput I/O in the block-based file system. Secondly, our profiling shows that decoding is time-consuming in the mainstream string-based storage formats. It incurs considerable latency in data loading, since GPUs with the aforementioned optimizations largely shorten the model computation. Therefore, we use TFRecords (Tensorflow, 2023) in TensorFlow and WebDataset (Aizman et al., 2019) in PyTorch as the storage formats to speed up deserialization and reduce data transmission.
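A small sketch of the per-worker sequential read range follows, under the assumption that $offset$ equals the number of records assigned to each worker ($total\_samples/N$) and that records are laid out contiguously during preprocessing; the helper name is hypothetical.

def worker_read_range(worker_id, total_samples, num_workers):
    # One contiguous slice of records per worker: a single large sequential read
    # instead of many random reads of KB-level samples on the HDD-backed file system.
    offset = total_samples // num_workers      # records per worker
    start = offset * worker_id
    end = start + offset                       # exclusive end of the slice
    return start, end

total_samples, num_workers = 1_600_000_000, 32
for i in range(num_workers):
    start, end = worker_read_range(i, total_samples, num_workers)
    assert end - start == total_samples // num_workers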

3. Experiments

3.1. Experiments Settings

3.1.1. Environments Setup

We conduct experiments over a dedicated CPU cluster and a GPU cluster, since using the PS-based method in GPU clusters is ineffective as previously stated. The CPU cluster consists of at most 200 nodes (160 workers and 40 servers), where each worker owns 18 cores and each server owns 22 cores. In contrast, the GPU cluster contains up to 32 NVIDIA A100 GPUs. These nodes are connected via RDMA or socket-based networks. Similarly, GPUs in each node are connected via NVLink or through system memory for comparison. We implement G-Meta in TensorFlow and employ NVIDIA's collective communications library (NCCL) (Jeaugey, 2017) for the AlltoAll and AllReduce operations.

3.1.2. Experiments Setup

Three datasets and four models are used for evaluation from two aspects. For statistical correctness, three popular meta DLRM models (MAML (Vuorio et al., 2019), MeLU (Lee et al., 2019), and CBML (Song et al., 2021)) on the MovieLens dataset (Harper and Konstan, 2015) are used to verify the correctness of the implementation, following the model settings in TSAML (Yang et al., 2022). For efficiency, we evaluate G-Meta using an in-house Meta DLRM model on both the Ali-CCP dataset (Ma et al., 2018) and an in-house dataset with 1.6 billion records, since the MovieLens dataset lacks enough samples for large-scale training. For the baseline, we use DMAML (Bollenbacher et al., 2020), which employs the parameter server architecture for distributed meta learning training; for fairness, it also uses the optimized Meta-IO to avoid I/O bottlenecks. We run all experiments three times and report average values.

3.2. Experiment Results

This subsection evaluates the efficiency (training speed and scalability) and statistical performance (AUC) of G-Meta against DMAML. For training speed, Table 1 shows that G-Meta on $2\times 4$ GPUs is even 22% faster than DMAML on 160 workers and 40 servers using a total of 3760 cores on the public dataset, processing 169k and 138k samples per second respectively. In this case, G-Meta even achieves cost savings of 62.29% compared to DMAML, according to the pricing information provided by Aliyun (Aliyun, 2023). Furthermore, G-Meta can process 618k samples per second when scaled out to $8\times 4$ GPUs. Regarding scalability, the speedup ratio of DMAML falls quickly while G-Meta's shrinks only gradually. The speedup ratio of DMAML is 0.88 and 0.59 when scaling out to 40 and 160 workers, respectively. In contrast, the speedup ratio of G-Meta slowly reduces from 0.94 to 0.86 when scaling out to $2\times 4$ and $8\times 4$ GPU workers. On the more complicated in-house dataset, we see a similar trend: the speedup ratio of G-Meta declines to 0.88, while that of the PS-based baseline drops to 0.58. Lastly, Figure 3 verifies that G-Meta does not lose statistical performance.
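For clarity, we read the speedup ratio in Table 1 as the measured throughput divided by perfect linear scaling from the smallest configuration; the reported numbers are consistent with this reading, as the quick check below shows (the function name is illustrative).

def speedup_ratio(throughput, base_throughput, scale_factor):
    # measured throughput relative to linear scaling from the smallest configuration
    return throughput / (base_throughput * scale_factor)

print(round(speedup_ratio(169e3, 90e3, 2), 2))   # 0.94 for G-Meta at 2x4 GPUs
print(round(speedup_ratio(618e3, 90e3, 8), 2))   # 0.86 for G-Meta at 8x4 GPUs
print(round(speedup_ratio(138e3, 29e3, 8), 2))   # 0.59 for DMAML at 160 workers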

3.3. Ablation Study

We perform an ablation study of the I/O and network optimizations to further understand how they contribute to the acceleration. As shown in Figure 4, the two optimizations together bring up to 45% and 51% speedup compared with the baseline (G-Meta without these two optimizations) on $2\times 4$ and $8\times 4$ GPUs, respectively. Specifically, the I/O and network optimizations contribute 27% and 12% speedup respectively on $2\times 4$ GPUs. However, the acceleration from the I/O optimization shrinks on $8\times 4$ GPUs. This is because synchronous training scales poorly with more participating nodes, and the I/O stage on one node may block the whole iteration with high probability. Additionally, it is noteworthy that the baseline processes 72k samples per second on $2\times 4$ GPUs, which approximates the throughput (79k) of the PS baseline with 80 workers as displayed in Table 1. This shows that G-Meta with our algorithm is still efficient even without the extra I/O and network optimizations.

Table 1. Average throughput (samples/second) and speedup ratio of G-Meta and DMAML on the Ali-CCP (public) and in-house datasets. In particular, $2\times 4$ means 2 nodes, each equipped with 4 GPUs.
Throughput / Speedup Ratio
CPU workers: 20 | 40 | 80 | 160
PS (public): 29k/1 | 51k/0.88 | 91k/0.78 | 138k/0.59
PS (in-house): 27k/1 | 48k/0.88 | 79k/0.73 | 126k/0.58
GPU workers: $1\times 4$ | $2\times 4$ | $4\times 4$ | $8\times 4$
G-Meta (public): 90k/1 | 169k/0.94 | 322k/0.89 | 618k/0.86
G-Meta (in-house): 54k/1 | 105k/0.97 | 197k/0.91 | 380k/0.88
Figure 3. Model performance of G-Meta vs. DMAML using MAML (Vuorio et al., 2019), MeLU (Lee et al., 2019), and CBML (Song et al., 2021) on the MovieLens dataset.
Figure 4. Throughput under different experimental settings, i.e., with and without the I/O and network optimizations, on the in-house dataset.

3.4. Online Deployment

Since early 2022, G-Meta has been widely used in Alipay's core applications. Using G-Meta remarkably shortens the overall training time by four times on average and ensures in-time delivery. Specifically, in our homepage display advertising, online experimental results show that the model delivery time decreases from 3.7 to 1.2 hours on 1.6 billion records. The Conversion Rate (CVR) and Cost Per Mille (CPM) of the trained model rise by 6.48% and 1.06% respectively, benefiting from substantially more training data and tasks.

4. Conclusion

In this paper, we propose G-Meta for high-performance distributed training of optimization-based Meta DLRM models on GPU clusters. We design a hybrid parallelism algorithm with several optimizations to enable efficient distributed training of industrial Meta DLRM models over the GPU cluster. Our extensive experimental results and online deployment in several core applications clearly demonstrate the efficiency and effectiveness of the method.

References

  • Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: a system for large-scale machine learning.. In Osdi, Vol. 16. Savannah, GA, USA, 265–283.
  • Aizman et al. (2019) Alex Aizman, Gavin Maltby, and Thomas Breuel. 2019. High performance I/O for large scale deep learning. In 2019 IEEE International Conference on Big Data (Big Data). IEEE, 5965–5967.
  • Aliyun (2023) Aliyun. 2023. Aliyun pricing. Retrieved May 14, 2023 from https://www.alibabacloud.com/pricing
  • Bertinetto et al. (2018) Luca Bertinetto, Joao F Henriques, Philip HS Torr, and Andrea Vedaldi. 2018. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136 (2018).
  • Bollenbacher et al. (2020) Jan Bollenbacher, Florian Soulier, Beate Rhein, and Laurenz Wiskott. 2020. Investigating Parallelization of MAML. In International Conference on Discovery Science. Springer, 294–306.
  • Borthakur (2007) Dhruba Borthakur. 2007. The hadoop distributed file system: Architecture and design. Hadoop Project Website 11, 2007 (2007), 21.
  • Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
  • Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. 2012. Large scale distributed deep networks. Advances in neural information processing systems 25 (2012).
  • Dean and Ghemawat (2008) Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
  • Du et al. (2019) Zhengxiao Du, Xiaowei Wang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Sequential scenario-specific meta learner for online recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2895–2904.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning. PMLR, 1126–1135.
  • Gibiansky (2017) Andrew Gibiansky. 2017. Bringing HPC Techniques to Deep Learning. Retrieved February 19, 2023 from https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce
  • Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) 5, 4 (2015), 1–19.
  • Hospedales et al. (2021) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. 2021. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence 44, 9 (2021), 5149–5169.
  • Huan et al. (2022) Zhaoxin Huan, Gongduo Zhang, Xiaolu Zhang, Jun Zhou, Qintong Wu, Lihong Gu, Jinjie Gu, Yong He, Yue Zhu, and Linjian Mo. 2022. An Industrial Framework for Cold-Start Recommendation in Zero-Shot Scenarios. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3403–3407.
  • Jeaugey (2017) Sylvain Jeaugey. 2017. Nccl 2.0. In GPU Technology Conference (GTC), Vol. 2.
  • Kunde et al. (2021) Shruti Kunde, Amey Pandit, Mayank Mishra, and Rekha Singhal. 2021. Distributed training for accelerating metalearning algorithms. In Proceedings of the International Workshop on Big Data in Emergent Distributed Environments. 1–6.
  • Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. Melu: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073–1082.
  • Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 583–598.
  • Lu et al. (2020) Yuanfu Lu, Yuan Fang, and Chuan Shi. 2020. Meta-learning on heterogeneous information networks for cold-start recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1563–1573.
  • Ma et al. (2018) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
  • Ma et al. (2020) Yifei Ma, Balakrishnan Narayanaswamy, Haibin Lin, and Hao Ding. 2020. Temporal-contextual recommendation in real-time. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2291–2299.
  • Mudigere et al. (2022) Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, et al. 2022. Software-hardware co-design for fast and scalable training of deep learning recommendation models. In Proceedings of the 49th Annual International Symposium on Computer Architecture. 993–1011.
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018).
  • Pan et al. (2019) Feiyang Pan, Shuokai Li, Xiang Ao, Pingzhong Tang, and Qing He. 2019. Warm up cold-start advertisements: Improving ctr predictions via learning to learn id embeddings. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 695–704.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
  • Rajeswaran et al. (2019) Aravind Rajeswaran, Chelsea Finn, Sham M Kakade, and Sergey Levine. 2019. Meta-learning with implicit gradients. Advances in neural information processing systems 32 (2019).
  • Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International conference on learning representations.
  • Sergeev and Del Balso (2018) Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
  • Song et al. (2021) Jiayu Song, Jiajie Xu, Rui Zhou, Lu Chen, Jianxin Li, and Chengfei Liu. 2021. CBML: A cluster-based meta-learning model for session-based recommendation. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1713–1722.
  • Tensorflow (2023) Tensorflow. 2023. TFRecords. Retrieved February 19, 2023 from https://www.tensorflow.org/tutorials/load_data/tfrecord
  • Vartak et al. (2017) Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, and Hugo Larochelle. 2017. A meta-learning perspective on cold-start recommendations for items. Advances in neural information processing systems 30 (2017).
  • Volkovs et al. (2017) Maksims Volkovs, Guangwei Yu, and Tomi Poutanen. 2017. Dropoutnet: Addressing cold start in recommender systems. Advances in neural information processing systems 30 (2017).
  • Vuorio et al. (2019) Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J Lim. 2019. Multimodal model-agnostic meta-learning via task-aware modulation. Advances in neural information processing systems 32 (2019).
  • Wang et al. (2022) Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, Fan Yu, Jie Liu, Shijie Liu, Daniel G Abel, Xu Guo, Jianbing Dong, et al. 2022. Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference. In Proceedings of the 16th ACM Conference on Recommender Systems. 534–537.
  • Yang et al. (2022) Jieyu Yang, Zhaoxin Huan, Yong He, Ke Ding, Liang Zhang, Xiaolu Zhang, Jun Zhou, and Linjian Mo. 2022. Task Similarity Aware Meta Learning for Cold-Start Recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 4630–4634.
  • Zhang et al. (2022) Yuanxing Zhang, Langshi Chen, Siran Yang, Man Yuan, Huimin Yi, Jie Zhang, Jiamang Wang, Jianbo Dong, Yunlong Xu, Yue Song, et al. 2022. PICASSO: Unleashing the Potential of GPU-centric Training for Wide-and-deep Recommender Systems. arXiv preprint arXiv:2204.04903 (2022).
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5941–5948.