Saturn: Efficient Multi-Large-Model Deep Learning
Abstract
In this paper, we propose Saturn, a new data system to improve the efficiency of multi-large-model training (e.g. during model selection/hyperparameter optimization). We first identify three key interconnected systems challenges for users building large models in this setting — parallelism technique selection, distribution of GPUs over jobs, and scheduling. We then formalize these as a joint problem, and build a new system architecture to tackle these challenges simultaneously. Our evaluations show that our joint-optimization approach yields 39-49% lower model selection runtimes than typical current DL practice.
1 Introduction
Fine-tuning large-scale deep learning (DL) models has become a common practice in a variety of domains. It is now common for a developer to download an open-source model spanning tens of billions of parameters from a model hub (e.g. HuggingFace [21]) and fine-tune it on their application-specific data [20]. Unfortunately, models of these sizes tend to impose considerable time & cost burdens, often making them impractical for users in smaller companies & the domain sciences [8, 16].
We identify a few key systems-oriented headaches for users seeking to fine-tune large models: (1) GPU memory remains a bottleneck. Large-memory GPUs are expensive, and even public cloud vendors still ration them. (2) Multi-GPU parallelism is needed, but dozens of parallelism techniques exist [6, 7, 1, 18, 19, 24, 2, 13, 11, 14, 12, 9, 4], and understanding the performance behaviors of these complex approaches can be difficult for DL users. (3) Model selection, i.e. tuning hyper-parameters, model layers, etc., only amplifies the computational load by producing multiple models to train.
These challenges result in a three-part technical problem for practitioners. First, resource allocation [15]. Given a cluster of GPUs, how should they be distributed across jobs? Second, parallelism selection. For each model in a multi-job, which technique should be applied? Third, scheduling. What job ordering will produce the lowest makespan?
These three questions are deeply intertwined. GPU allocations affect the optimal parallelism selection for each job, and constrain schedule orderings. Per-job parallelism selections affect runtimes, thus impacting scheduling & allocation decisions. In this paper, we describe a new approach to tackling this intertwined problem, and demonstrate how taking a novel “joint optimization” approach can significantly improve performance.
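To make this coupling concrete, the joint problem can be stated informally as a makespan-minimization problem. The notation below is illustrative only and is not the exact MILP formulation our Solver uses (Section 2):

$$\min_{\{p_j,\, g_j,\, s_j\}} \;\; \max_j \big(s_j + T_j(p_j, g_j)\big) \quad \text{s.t.} \quad \sum_{j\ \text{running at time}\ t} g_j \le G \quad \forall t,$$

where each job $j$ is assigned a parallelism technique $p_j$, a GPU count $g_j$, and a start time $s_j$; $T_j(p_j, g_j)$ is its runtime under that configuration; and $G$ is the number of GPUs in the cluster. Every decision variable appears in both the objective and the capacity constraint, which is why the three questions cannot be answered independently.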
2 System Architecture & Approach
Saturn has 3 main modules — the Parallelism Library, the Trial Runner, & the Solver. Figure 1(A) illustrates the overall design.

The Parallelism Library allows users to register and apply various model parallelization techniques with minimal effort. A technique is added to the Library by implementing a simple two-function interface, as shown in Figure 1(B). Once registered, it can be reused across execution sessions (and even across different cluster users).
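As a rough illustration of this interface, the sketch below shows what registering one technique could look like. The names here (`profile`, `execute`, `register_parallelism`) are hypothetical stand-ins for the two functions in Figure 1(B), not Saturn's actual API.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    runtime_per_batch: float  # seconds per mini-batch under this configuration
    fits_in_memory: bool      # False if the trial run hit an out-of-memory error

class FSDPTechnique:
    """Hypothetical wrapper around PyTorch's FSDP, written against the assumed
    two-function interface (one function to profile, one to train)."""
    name = "fsdp"

    def profile(self, model, sample_batch, n_gpus: int) -> Profile:
        # Wrap the model with FSDP on n_gpus, time one or two mini-batches,
        # and report whether the configuration fits in GPU memory.
        ...

    def execute(self, model, dataloader, n_gpus: int, epochs: int):
        # Launch the actual fine-tuning job under FSDP on n_gpus GPUs.
        ...

# A registry keyed by technique name; registered techniques can be reused
# across sessions and users.
PARALLELISM_LIBRARY = {}

def register_parallelism(technique):
    PARALLELISM_LIBRARY[technique.name] = technique

register_parallelism(FSDPTechnique())
```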
The Trial Runner is our empirical approach to profiling & evaluating user-provided black boxes (models and parallelism techniques). Given a set of models, the Trial Runner profiles each model under every registered parallelism and every GPU count it could be assigned. Since profiling only requires processing one or two mini-batches, its cost tends to be negligible in the context of a larger job. The information from the Trial Runner is critical for our next component, the Solver.
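A minimal sketch of that profiling pass, continuing the hypothetical interface above, might look as follows. The triple loop mirrors the description in the text; the names and data layout are our assumptions.

```python
def run_trials(models, library, gpu_counts, sample_batch):
    """Profile every (model, parallelism, GPU count) combination on one or two
    mini-batches and record the estimated per-batch runtime (or infeasibility)."""
    estimates = {}
    for model_name, model in models.items():
        for tech_name, technique in library.items():
            for g in gpu_counts:
                prof = technique.profile(model, sample_batch, n_gpus=g)
                estimates[(model_name, tech_name, g)] = (
                    prof.runtime_per_batch if prof.fits_in_memory else float("inf")
                )
    return estimates  # consumed by the Solver
```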
The Solver uses the empirical estimates from the Trial Runner to formulate our joint optimization problem (parallelism selection, resource distribution, and scheduling) as a mixed-integer linear program (MILP). We apply the popular MILP solver Gurobi [5] to produce solutions. The output determines: (1) which parallelism technique from the Library each model should use, (2) how many GPUs each model should get, and (3) in what order & when the models should be submitted to the cluster. We augment the Solver with an “introspection” mechanism, adapted from prior art [22, 23]. As models train, per-model remaining runtimes change and shift the workload, so we re-run the solver at fixed intervals. When a new plan is produced, executing jobs are checkpointed and re-launched under the new scheme.
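The sketch below shows how such a formulation could be expressed with gurobipy. It is deliberately simplified: it covers only parallelism selection and GPU allocation for one concurrent round of jobs, dropping the scheduling/ordering dimension, and all names are ours rather than Saturn's.

```python
import gurobipy as gp
from gurobipy import GRB

def solve_round(estimates, models, techniques, gpu_counts, total_gpus, batches_left):
    """Pick one (parallelism, GPU count) per model so the slowest job finishes as
    early as possible, subject to the cluster's GPU supply."""
    m = gp.Model("saturn_sketch")

    # Only consider configurations the Trial Runner found feasible.
    feasible = [(j, p, g) for j in models for p in techniques for g in gpu_counts
                if estimates[(j, p, g)] != float("inf")]
    # x[j, p, g] = 1 iff model j runs under parallelism p with g GPUs.
    x = m.addVars(feasible, vtype=GRB.BINARY, name="x")
    makespan = m.addVar(lb=0.0, name="makespan")

    for j in models:
        opts = [(p, g) for (jj, p, g) in feasible if jj == j]
        # Each model picks exactly one configuration.
        m.addConstr(gp.quicksum(x[j, p, g] for p, g in opts) == 1)
        # The slowest job lower-bounds the makespan.
        m.addConstr(
            gp.quicksum(batches_left[j] * estimates[(j, p, g)] * x[j, p, g]
                        for p, g in opts) <= makespan
        )

    # Concurrently running jobs cannot exceed the available GPUs.
    m.addConstr(gp.quicksum(g * x[j, p, g] for (j, p, g) in feasible) <= total_gpus)

    m.setObjective(makespan, GRB.MINIMIZE)
    m.optimize()
    return {j: (p, g) for (j, p, g) in feasible if x[j, p, g].X > 0.5}
```

Under this sketch, the introspection mechanism would re-invoke a solve like this at each interval with updated `batches_left`, checkpointing and relaunching any job whose assigned plan changed.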
3 Experiments
We evaluate Saturn’s performance against four baselines drawn from standard practice & prior art. The first, “Current Practice”, allocates all GPUs on a node to one job at a time, with models running in sequence; task parallelism is applied across nodes. The second baseline, “Random”, randomizes allocations, parallelisms, and schedule orderings. The third baseline, “Optimus” [17], greedily assigns GPUs one at a time based on the estimated marginal runtime improvement. The fourth baseline, “Optimus-Dynamic”, augments Optimus with our introspection mechanism. Since all approaches operate at the orchestration level, none of them affects correctness or accuracy. We register 4 parallelism techniques in the Library for Saturn: FSDP & DDP from PyTorch Distributed, GPipe, and model offloading from FairScale. Table 1 provides the details of our workloads.
| Hardware | Epochs | Learning Rates | Batch Sizes | Models | Datasets |
|---|---|---|---|---|---|
| p4d.24xlarge | 10 | 1e-5/1e-4/1e-3 | 16/32 | GPT-2/GPT-J | WikiText-2 [10] |
| p4d.24xlarge | 10 | 1e-5/1e-4/1e-3 | 64/128 | ViT-G/ResNet-200 | ImageNet [3] |

Table 1: Workload details.
Table 2 reports our results for both workloads. We evaluate first on a single 8-GPU node, then on two such nodes. In both cases, Saturn significantly outperforms the baselines: we see speedups of 1.64-1.96X versus the Current Practice baseline, corresponding to training time reductions of 39-49%. We find that the allocations Saturn chooses are often unintuitive to a human (e.g. giving 5 GPUs to one model and applying GPipe, but 3 GPUs to another with FSDP), yet the overall result is lower costs & runtimes. These results indicate that using Saturn for multi-large-model training may help encourage adoption of large models by time- & cost-constrained users.
| Workload | Current Practice | Random | Optimus | Optimus-Dynamic | Saturn |
|---|---|---|---|---|---|
| WikiText | 28.39/14.57 | 41.45/21.76 | 34.9/16.62 | 24.87/13.62 | 17.24/8.23 |
| ImageNet | 19.05/10.15 | 28.34/14.44 | 19.44/10.19 | 17.31/8.32 | 11.31/5.16 |

Table 2: End-to-end runtimes under each approach (single node / two nodes).
References
- [1] Fully Sharded Data Parallel: faster AI training with fewer GPUs, 2021.
- [2] Adnan, M., Maboud, Y. E., Mahajan, D., and Nair, P. J. Heterogeneous acceleration pipeline for recommendation system training, 2022.
- [3] Deng, J., et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (2009).
- [4] Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021.
- [5] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2022.
- [6] Huang, C.-C., Jin, G., and Li, J. SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping, 2020.
- [7] Huang, Y., et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism, 2018.
- [8] Kumar, A., Nakandala, S., Zhang, Y., Li, S., Gemawat, A., and Nagrecha, K. Cerebro: A layered data platform for scalable deep learning. In 11th Annual Conference on Innovative Data Systems Research (CIDR’21) (2021).
- [9] Li, Z., Zhuang, S., Guo, S., Zhuo, D., Zhang, H., Song, D., and Stoica, I. Terapipe: Token-level pipeline parallelism for training large-scale language models, 2021.
- [10] Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
- [11] Nagrecha, K. Model-parallel model selection for deep learning systems. In Proceedings of the 2021 International Conference on Management of Data (jun 2021), ACM.
- [12] Nagrecha, K. Systems for parallel and distributed large-model deep learning training, 2023.
- [13] Nagrecha, K., and Kumar, A. Hydra: A system for large multi-model deep learning, 2021.
- [14] Nagrecha, K., and Kumar, A. Tech report of saturn: An optimized data system for multi-large model deep learning, May 2023.
- [15] Nagrecha, K., Liu, L., Delgado, P., and Padmanabhan, P. Intune: Reinforcement learning-based data pipeline optimization for deep recommendation models, 2023.
- [16] Nakandala, S., Nagrecha, K., Kumar, A., and Papakonstantinou, Y. Incremental and approximate computations for accelerating deep cnn inference. ACM Transactions on Database Systems (TODS) 45, 4 (2020), 1–42.
- [17] Peng, Y., et al. Optimus: an efficient dynamic resource scheduler for deep learning clusters. In Proceedings of the Thirteenth EuroSys Conference (2018), pp. 1–14.
- [18] Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models, 2019.
- [19] Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019.
- [20] Wang, P., Nagrecha, K., and Vasconcelos, N. Gradient-based algorithms for machine teaching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), pp. 1387–1396.
- [21] Wolf, T., et al. Huggingface’s transformers: State-of-the-art natural language processing, 2019.
- [22] Xiao, W., et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (2018), pp. 595–610.
- [23] Xiao, W., et al. AntMan: Dynamic scaling on GPU clusters for deep learning. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), pp. 533–548.
- [24] Zheng, L., et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning, 2022.