
Deploying Foundation Model Powered Agent Services: A Survey

Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, Wenhui Zhu, Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen
W. Xu, J. Chen, P. Zheng, and Y. Fan are with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China (e-mail: {jinyu.chen, peirong.zheng, yunfeng.fan}@connect.polyu.hk, [email protected]). X. Yi, H. Wang, and Q. Wan are with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (e-mail: {yixiaoquan, hz_wang}@hust.edu.cn, [email protected]). T. Tian and W. Zhu are with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, China (e-mail: {tian_tianyi, wenhui_zhu}@bupt.edu.cn). Q. Su is with the School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (e-mail: [email protected]). X. Shen is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada (e-mail: [email protected]).
Abstract

Foundation model (FM) powered agent services are regarded as a promising solution to develop intelligent and personalized applications for advancing toward Artificial General Intelligence (AGI). To achieve high reliability and scalability in deploying these agent services, it is essential to collaboratively optimize computational and communication resources, thereby ensuring effective resource allocation and seamless service delivery. In pursuit of this vision, this paper proposes a unified framework aimed at providing a comprehensive survey on deploying FM-based agent services across heterogeneous devices, with an emphasis on the integration of model and resource optimization to establish a robust infrastructure for these services. In particular, this paper begins by exploring various low-level optimization strategies during inference and studies approaches that enhance system scalability, such as parallelism techniques and resource scaling methods. The paper then discusses several prominent FMs and investigates research efforts focused on inference acceleration, including techniques such as model compression and token reduction. Moreover, the paper also investigates critical components for constructing agent services and highlights notable intelligent applications. Finally, the paper presents potential research directions for developing real-time agent services with high Quality of Service (QoS).

Index Terms:
Foundation Model, AI Agent, Cloud/Edge Computing, Serving System, Distributed System, AGI.

I Introduction

Figure 1: The framework of FM-powered agent services. The execution layer runs model inference with low-level optimizations. The resource layer focuses on designing strategies for parallelism and resource scaling. The model and agent layers work on optimizing FMs and various agent components. The application layer constructs different intelligent applications.
Figure 2: The framework of our survey. Each technical section corresponds to a layer in Figure 1.

The rapid advancement of artificial intelligence (AI) has positioned foundation models (FMs) as a cornerstone of innovation, driving progress in various fields such as natural language processing, computer vision, and autonomous systems. These models, characterized by their vast parameter spaces and extensive training on broad datasets, enable numerous applications, from automated text generation to advanced multi-modal question answering and autonomous robot services [1]. Popular FMs, such as GPT, Llama, ViT, and CLIP, are pivotal in pushing the boundaries of AI capabilities, offering sophisticated solutions for processing and analyzing large volumes of data across different formats and modalities. The continuous advancement of FMs significantly enhances AI’s ability to comprehend and interact with the world in a manner akin to human cognition.

However, traditional FMs are typically confined to providing question-and-answer services and generating responses based on pre-existing knowledge, often lacking the ability to incorporate the latest information or employ advanced tools. FM-powered agents are designed to enhance the capabilities of FMs. These agents incorporate dynamic memory management, long-term task planning, advanced computational tools, and interactions with the external environment [2]. For example, FM-powered agents can call different external APIs to access real-time data, perform complex calculations, and generate updated responses based on the most current information available. This approach improves the reliability and accuracy of the responses and enables more personalized interactions with users.

Developing a serving system with low latency, high reliability, high elasticity, and minimal resource consumption is crucial for delivering high-quality agent services to users. Such a system can efficiently manage varying query loads while maintaining swift responses and reducing resource costs. Moreover, constructing a serving system on heterogeneous edge-cloud devices is a promising solution to leverage the idle computational resources at the edge and the abundant computational clusters available in the cloud. The collaborative inference of edge-cloud devices can enhance overall system efficiency by dynamically allocating tasks to various edge-cloud machines based on computational load and real-time network conditions.

Although many research works investigate edge-cloud collaborative inference for small models, deploying FMs under this paradigm for diverse agent services still faces several severe challenges. First, fluctuating query loads severely challenge model serving. A rapidly growing number of users want to experience intelligent agent services with FMs. For example, as of April 2024, ChatGPT had approximately 180.5 million users, with around 100 million of them active weekly [3]. These users access the service at different times, resulting in varying request rates. An elastic serving system should dynamically scale the system capacity according to the current system characteristics. Second, the parameter space of an FM is particularly large, reaching the scale of several hundred billion parameters, which poses a significant challenge to the storage system. The storage capacity of edge devices and consumer GPUs is limited, making them unable to accommodate an entire model. The large number of parameters also results in significant inference overhead and long execution latency. Therefore, it is necessary to design model compression methods and employ different parallelism approaches in diverse execution environments. Third, users have different service requirements and inputs in different applications. For example, some applications prioritize low latency, while others prioritize high accuracy. This necessitates dynamic resource allocation and adjustment of the inference process. Moreover, AI agents need to handle many difficult tasks in complex environments, which requires effective management of large-scale memory, real-time processing of updated rules, and specific domain knowledge. Additionally, agents possess distinct personalities and roles, necessitating the design of an efficient multi-agent collaboration framework.

To address the aforementioned challenges and promote the development of real-time FM-powered agent services, this survey proposes a unified framework and investigates various research works from different optimization aspects. This framework is shown in Figure 1. The bottom layer is the execution layer, where edge or cloud devices execute FM inference. Joint computation optimization, I/O optimization, and communication optimization are applied to accelerate inference and promote the building of a powerful infrastructure for FMs. The resource layer, comprised of two components, facilitates the deployment of the model on various devices. Parallelism methods design different model splitting and placement strategies to collaboratively utilize the available resources and improve throughput. Resource scaling dynamically adjusts the hardware resources during runtime based on query load and resource utilization, thereby improving overall scalability. The model layer focuses on the optimization of FMs. Two lightweight methods, model compression and token reduction, are specifically designed to promote the widespread adoption of FMs. Based on these FMs, many AI agents are constructed to accomplish various tasks. Numerous methods have been proposed to enhance the four key components of agents, which encompass the multi-agent framework, planning capabilities, memory storage, and tool utilization. Ultimately, leveraging the aforementioned techniques, a wide range of applications can be developed to deliver intelligent and low-latency agent services to users.

I-A Previous Works

Many research works focus on system optimization to deploy machine learning models in edge-cloud environments. Kachris reviews hardware accelerators for Large Language Models (LLMs) to address their computational challenges [4]. Tang et al. summarize scheduling methods designed for optimizing both network and computing resources [5]. Miao et al. present acceleration methods to improve the efficiency of LLMs [6]. Their survey includes system optimizations, such as memory management and kernel optimization, as well as algorithm optimizations, such as architectural design and compression algorithms, to accelerate model inference. Xu et al. focus on the deployment of Artificial Intelligence-Generated Content (AIGC) and provide an overview of mobile network optimization for AIGC, covering the processes of dataset collection, AIGC pre-training, AIGC fine-tuning, and AIGC inference [7]. Djigal et al. investigate the application of machine learning and deep learning techniques in resource allocation for Multi-access Edge Computing (MEC) systems [8]. Their survey includes resource offloading, resource scheduling, and collaborative allocation. Many research works propose different algorithms to optimize the design of FMs and agents. [1], [9], and [10] present popular FMs, especially LLMs. [11], [12], and [13] summarize model compression and inference acceleration methods for LLMs. [2], [14], and [15] review the challenges and progress in the development of agents.

In summary, the above studies either optimize edge-cloud resource allocation and scheduling for small models or design acceleration and efficiency methods for large FMs. To the best of our knowledge, this paper is the first comprehensive survey to review and discuss the deployment of real-time FM-powered agent services on heterogeneous devices, a research direction that has gained significant importance in recent years. We design a unified framework to fill this research gap and review current research works from different perspectives. This framework not only delineates essential techniques for the deployment of FMs but also identifies key components of FM-based agents and corresponding system optimizations specifically tailored for agent services.

I-B Contribution

This paper presents a comprehensive survey on the deployment of FM-powered agent services in edge-cloud environments, covering optimization approaches spanning from hardware to software layers. For the convenience of readers, we provide an outline of the survey in Figure 2. The contributions of this survey are summarized in the following aspects:

  • This survey proposes the first comprehensive framework to provide a deep understanding of the deployment of FM-powered agent services within the edge-cloud environment. Such a framework holds the potential to foster the advancement of AGI greatly.

  • From a low-level hardware perspective, we present research on various runtime optimization methods and resource allocation and scheduling methods. These techniques are designed to establish a reliable and flexible infrastructure for FMs.

  • From a high-level software perspective, we elucidate research efforts focused on model optimization and agent optimization, thereby offering diverse opportunities for building intelligent and lightweight agent applications.

The remainder of this article is organized as follows: Section II presents some low-level execution optimization methods. Section III describes resource allocation and parallelism mechanisms. Section IV discusses current FMs, as well as techniques for model compression and token reduction. Section V illustrates key components for agents. Section VI presents batching methods and some related applications. Finally, Section VII discusses future work and draws conclusions.

II Execution Optimization

II-A Computation Optimization

Figure 3: A multi-layer optimization framework for edge computing systems serving FMs. At the hardware level, heterogeneous resources such as FPGAs, ASICs, IMCs, CPUs, and GPUs are utilized and optimized in terms of computation and memory. The integrated frameworks can support heterogeneous hardware. The network level focuses on semantic communication.

Edge deployment of FMs poses severe challenges due to the heterogeneity of edge devices in terms of computing ability, hardware architecture, and communication bandwidth [16]. It is important to design computation optimization methods for different devices to accelerate model inference. Figure 3 provides an overview of this section. The diagram offers a global perspective for algorithm design (e.g., computation optimization) and depicts the components of an edge computing system, including heterogeneous backends and the network. Table I shows some commonly used hardware devices at the edge. Traditionally, devices with CPUs have been deployed in edge environments. However, due to their limited parallel computing capabilities and memory constraints, various specialized accelerators have been designed for Deep Learning (DL) tasks. FPGAs have gained widespread attention for their programmable and parallel computing capabilities. ASICs, while non-programmable after manufacture, offer high speed and low power consumption. Many edge devices, such as personal computers and smartphones, contain different computational resources, such as GPUs and CPUs. Consequently, many approaches focus on optimizing computational efficiency by jointly utilizing these resources for model inference. In-memory computing, with its non-von-Neumann architecture, has emerged as a prominent research area because its intrinsic parallelism can significantly reduce I/O latency and improve computational efficiency. To enhance the understanding of these devices, we provide more detailed information about each hardware type below.

TABLE I: Computing Resource Types and Features
Resource Type Features and Functions
Field Programmable Gate Arrays
(FPGAs)
FPGAs are integrated circuits that can be reconfigured at the hardware level after manufacturing. They can achieve higher performance and lower latency than GPU accelerators.
Application-Specific Integrated Circuit
(ASIC)
An ASIC is a customized accelerator for a specific use scenario. It is not programmable after manufacturing and offers computing performance comparable to FPGAs. The producer decides the specifications.
In-memory Compute
(IMC)
IMC is a non-von-Neumann paradigm that performs computation directly in memory. For instance, IMC utilizes physical processes to execute addition and multiplication, thus accelerating matrix-vector computation.
Central Processing Unit
(CPU)
The CPU is a general-purpose processor; it executes basic operations and instructions and offers only limited parallelism.
Graphics Processing Unit
(GPU)
The GPU excels in parallel computing and is the dominant accelerator for DL.

II-A1 FPGAs

FPGAs are widely deployed in edge computing applications and are re-configurable hardware devices that are efficient in power consumption. FMs are built upon Transformers, which contain non-linear computation components, such as layer normalization, SoftMax, and non-ReLU activation functions. Serving these models requires specific accelerator designs. On the other hand, matrix multiplications are conducted during the inference of FMs. Their substantial computational complexity poses challenges for optimizing FPGA-based accelerators. To solve these problems, a specialized hardware accelerator is designed for Multi-Head Attention (MHA) and feed-forward networks (FFN) in Transformers [17]. It incorporates a matrix partitioning strategy to optimize resource sharing between Transformer blocks, a computation flow that maximizes systolic array utilization, and optimizations of nonlinear functions to diminish complexity and latency. MnnFast provides a scalable architecture specifically for Memory-augmented Neural Networks [18]. It incorporates a column-based streaming algorithm, zero-skipping optimization, and a dedicated embedding cache to tackle the challenges posed by large-scale memory networks. NPE offers software-like programmability to developers [19]. Unlike previous FPGA designs that implement specialized accelerators for each nonlinear function, NPE can be easily upgraded for new models without requiring extensive reconfiguration, making it a more cost-effective solution. DFX is a multi-FPGA system that optimizes the latency and throughput for text generation. It leverages model parallelism and an optimized, model-and-hardware-aware dataflow to handle the sequential nature of text generation. It demonstrates the superior speed and cost-effectiveness of FPGA compared to GPU implementations[20].

Additionally, the development of Transformer-OPU (Overlay Processor) [21] provides a flexible and efficient FPGA-based processor that expedites the computation of Transformer networks. Moreover, implementing a tiny Transformer model through a Neural-ODE (Neural Ordinary Differential Equation) approach [22] leads to a substantial reduction in model size and power usage, making it ideal for edge computing devices in Internet-of-Things (IoT) applications. FlightLLM [23] addresses the computational efficiency, memory bandwidth utilization, and compilation overheads on FPGAs for LLM inference. It includes a configurable Digital Signal Processor (DSP) chain optimized for varying sparsity patterns in LLMs, always-on-chip activations during decoding, and a length-adaptive compilation method for dynamic sparsity patterns and input lengths.

In summary, the above designs for FPGA-based FM inference focus on several key aspects: enhancing computational efficiency through specialized architectures such as systolic arrays and DSP chains, optimizing memory usage via techniques like matrix partitioning and dedicated caches, reducing latency through efficient dataflows and model-and-hardware-aware optimizations, and improving adaptability and programmability to accommodate diverse and evolving model structures and non-linear functions.

II-A2 ASIC

An ASIC is an integrated circuit chip customized for a specific application. For example, router ASICs can handle packet processing and signal modulation. It often includes microprocessors, memory, and other components as a System-on-Chip (SoC). Recent advancements in ASIC design significantly enhance the performance and efficiency of attention mechanisms, which is crucial for applications across NLP and CV. The A³ [24] accelerator employs algorithmic approximations and a prototype chip to significantly enhance energy efficiency and processing speed. Essentially, the attention mechanism is a content-based search to evaluate the correlation between the currently processed token and previous tokens. A³ addresses the inefficiency of matrix-vector multiplication in the self-attention mechanism, which is sub-optimal for the content-based search, by implementing an efficient greedy candidate search method. Similarly, ELSA (Efficient, Lightweight Self-Attention) [25] tackles the quadratic complexity of self-attention by selectively filtering out less important relations in self-attention and implements an ASIC to achieve high energy efficiency.

SpAtten [26] prunes Transformer models at both the token and head levels and reduces the model size with quantization to minimize the computational and memory demands of Transformers. The Sanger [27] framework enables sparse attention mechanisms through a reconfigurable architecture that supports dynamic software pruning and efficient sparse operations. Additionally, the Energon [28] co-processor, working together with other FM accelerators, introduces a dynamic sparse attention mechanism that uses a mix-precision multi-round filtering algorithm to optimize query-key pair evaluations. The above ASIC solutions demonstrate significant advancements in speed and energy efficiency compared to traditional devices while still preserving high accuracy. They pave the way for real-time, resource-efficient implementations for complex FMs.

II-A3 In-memory Compute

In-memory computing (IMC) is an emerging computational paradigm performing computational tasks directly within memory, eliminating frequent I/O requirements between processors and memory units. IMC is a scalable and energy-efficient solution to handle long sequence data in Transformers. [29] introduces ATT, a fault-tolerant Resistive Random-access Memory (ReRAM) accelerator specifically designed for attention-based neural networks. This accelerator capitalizes on the high-density storage capabilities and low leakage power of ReRAM to address the compatibility issues between traditional neural network architectures and the complex data flow of attention mechanisms. ReTransformer[30] is a ReRAM-based IMC architecture for Transformers. It accelerates the scaled dot-product attention mechanism, utilizes a matrix decomposition technique to avoid storing intermediate results, and designs the sub-matrix pipelines of MHA.

iMCAT combines crossbar arrays, which store matrices in the memory array, with Content Addressable Memories (CAM) to overcome the significant memory and computational bottlenecks of processing long sequences with MHA [30]. Recent works design different optimization methods to deploy computationally intensive Transformer models on edge AI accelerators [31]. The Google TPU, categorized as ASIC and IMC, is widely used in edge and cloud computing. Large Transformer models can be executed efficiently on the Coral Edge TPU by optimizing the computational graph and employing quantization techniques, ensuring real-time inference with minimal energy consumption. The techniques employed in IMC, including matrix decomposition and quantization, reduce the computational resources required to serve FMs, thereby facilitating the deployment of FMs at the edge.

II-A4 CPU & GPU

Current optimization algorithms, such as CPU-GPU hybrid computation, are hardware-independent and can be applied to various backends. Different methods are designed to accelerate Transformer inference by optimizing the MHA, FFN, skip connections, and normalization modules, which are identified as critical bottlenecks in Transformer architectures [32]. [33] aims to optimize the execution time of the SoftMax layer by decomposing it into multiple sub-layers and then fusing them with adjacent layers. LLMA [34] leverages the overlap between the text generated by LLMs and existing reference texts to enhance computational parallelism and speed up the inference process without compromising output quality. These algorithms are hardware-agnostic and can be applied to various backends beyond the CPU and GPU. UltraFastBERT optimizes inference by activating only a fraction of the available neurons, thereby maintaining competitive performance levels [35]. Intel CPU clusters can serve LLMs by adopting algorithms [36], such as quantization and optimized kernels, specifically designed to accelerate computation on CPUs.

Many optimizations have been proposed that coordinate GPUs and CPUs to improve the efficiency and speed of Transformer models. For example, [37] introduces PowerInfer, a high-speed LLM inference engine that leverages the high locality and power-law distribution in neuron activation to reduce GPU memory demands and CPU-GPU data transfers. [38] aims to reduce latency by coordinating CPU and GPU computing while mitigating I/O bottlenecks by overlapping data processing and I/O time. [39] utilizes a single GPU to serve LLMs, prioritizing throughput at the expense of latency. It designs a scheduling algorithm to schedule model parameters, KV caches, and activations between the CPU, GPU, and disk. These advancements focus on hardware-independent algorithms and CPU-GPU collaborative inference strategies, signifying a robust movement toward more efficient FMs.

II-B Memory Optimization

II-B1 Vanilla FMs

Besides computational costs, memory overhead is another major challenge in deploying LLMs on resource-constrained edge devices. The inference process of an LLM involves two key stages: the prefill and decode phases. The prefill phase is responsible for creating the KV cache based on the prompts, while the decode phase generates subsequent tokens autoregressively. The first stage is compute-intensive, whereas the primary challenge of the decode stage lies in the memory wall. This barrier hinders loading the vast parameters of LLMs and maintaining the KV cache throughout the inference process. Consequently, considerable research effort has focused on memory scheduling, utilizing multiple storage levels to execute a single model. For example, jointly optimizing CPU and GPU memory usage makes it possible to deploy LLMs on personal computers, smartphones, and other resource-limited devices.
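To make the memory wall concrete, the size of the KV cache can be estimated directly from the model configuration. The short sketch below uses illustrative, assumed dimensions for a 7B-class model (32 layers, 32 KV heads, head dimension 128, FP16); the resulting numbers are back-of-the-envelope estimates rather than measurements.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V tensor per layer, each of
    shape [batch_size, seq_len, n_kv_heads, head_dim] (FP16 by default)."""
    return 2 * n_layers * batch_size * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 7B-class configuration (assumed values, not taken from any paper).
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for batch, seq in [(1, 2048), (8, 2048), (8, 8192)]:
    gb = kv_cache_bytes(batch, seq, **cfg) / 1e9
    print(f"batch={batch:2d}  seq={seq:5d}  ->  KV cache ~ {gb:.1f} GB")
```

Even at modest batch sizes and sequence lengths, the cache alone approaches or exceeds the memory of consumer GPUs, which is why the scheduling and offloading techniques discussed below spill it across multiple memory tiers.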

Contextual sparsity, which refers to the input-dependent activation of a small subset of neurons in LLMs, has been exploited to reduce computational and memory costs. [40] proposes a method to predict contextual sparsity on the fly and an asynchronous, hardware-aware implementation to accelerate LLM inference. [41] explores storing model parameters in flash memory and loading only the required subset during inference, considering bandwidth, energy constraints, and read throughput.

Several studies have focused on optimizing the attention mechanism in LLMs to reduce I/O costs. [42] introduces Multi-Query Attention (MQA) to accelerate decoding by reducing memory bandwidth requirements during incremental inference. [43] proposes Grouped-Query Attention (GQA), an intermediate between MHA and MQA, achieving a balance between quality and speed. [44] introduces PagedAttention, which is inspired by virtual memory and paging techniques, to efficiently manage key-value cache memory, and integrates it into vLLM, a high-throughput LLM serving system. [45] proposes FlashAttention, an I/O-aware attention algorithm that reduces memory accesses between High Bandwidth Memory (HBM) and on-chip Static Random-Access Memory (SRAM) on the GPU; it avoids fully loading the large attention matrix, reducing I/O overhead significantly. [46] further improves FlashAttention to enhance efficiency and scale Transformers to longer sequence lengths by strategically reducing the number of non-matrix-multiplication operations and parallelizing along the sequence dimension. This adjustment enables improved resource utilization across GPU thread blocks. FlashDecoding++ [47] introduces an asynchronous, partially parallel SoftMax. SoftMax requires subtracting a value from all inputs before exponentiation to avoid overflow; the subtracted value is usually the maximum input, so synchronization is required to obtain this maximum when computing SoftMax in parallel. FlashDecoding++ avoids this synchronization by using a unified maximum value decided in advance from the statistical distribution of SoftMax inputs in the target LLM.
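Because SoftMax is shift-invariant, subtracting any constant leaves the result unchanged, so the synchronization-free idea can be illustrated in a few lines of NumPy: the standard safe SoftMax subtracts the true row maximum, whereas a FlashDecoding++-style variant subtracts a value fixed in advance, allowing partial chunks to be exponentiated independently. The fixed bound used here is an arbitrary placeholder, not a value from the paper.

```python
import numpy as np

def safe_softmax(x):
    # Standard formulation: needs a synchronized maximum over the full row.
    m = np.max(x, axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=-1, keepdims=True)

def unified_max_softmax(x, unified_max=10.0):
    # FlashDecoding++-style: subtract a pre-decided constant instead of the
    # true maximum, so partial results can be computed asynchronously.
    e = np.exp(x - unified_max)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = 3.0 * np.random.randn(4, 128)          # stand-in attention logits
print(np.allclose(safe_softmax(scores), unified_max_softmax(scores)))  # True
```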

BMInf [48] introduces the Big Model Inference and Tuning toolkit, which employs model quantization, parameter-efficient tuning, and CPU-GPU scheduling optimization to reduce computational and memory costs. [49] proposes Splitwise, a technique that schedules the prompt computation and token generation phases to different machines, optimizing hardware utilization and overall system efficiency. FastServe is a distributed serving system that optimizes job completion time for LLMs in interactive AI applications [50]. [51] introduces SpecInfer, a system that accelerates generative LLM serving through tree-based speculative inference and verification. LLMCad is an innovative on-device inference engine specifically designed for efficient generative NLP tasks [52]. It executes LLMs on weak devices with limited memory capacity by utilizing a compact LLM that resides in memory to generate tokens and a high-precision LLM for validation.
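The draft-and-verify idea shared by SpecInfer and LLMCad can be sketched in a few lines. The sketch below is a deliberately simplified linear (non-tree) version under greedy decoding: a small model drafts k tokens, and the large model accepts the longest agreeing prefix, substituting its own token at the first disagreement. The draft_model and target_model callables are hypothetical stand-ins, and a real system would verify all k draft tokens in a single batched forward pass rather than one by one.

```python
def speculative_decode(prompt_ids, draft_model, target_model, k=4, max_new=64):
    """Minimal draft-and-verify loop (greedy decoding).

    draft_model(ids)  -> next-token id from a small, fast model (hypothetical)
    target_model(ids) -> next-token id from the large, accurate model (hypothetical)
    """
    ids = list(prompt_ids)
    while len(ids) < len(prompt_ids) + max_new:
        # 1) The small model drafts k tokens autoregressively (cheap).
        ctx, draft = list(ids), []
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) The large model checks the draft; accept the agreeing prefix and
        #    emit its own token at the first mismatch (guarantees progress).
        ctx, accepted = list(ids), []
        for t in draft:
            t_target = target_model(ctx)
            if t_target == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(t_target)
                break
        ids.extend(accepted)
    return ids
```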

II-B2 MoE FMs

Mixture-of-Experts (MoE), which replaces an FFN with a router and multiple expert FFNs, is a promising approach to enhance the efficiency and scalability of LLMs. However, deploying MoE-LLMs presents challenges in memory scheduling because of their huge number of parameters and the uncertain choices of experts, particularly in resource-constrained environments. Recent research has addressed these challenges through novel architecture designs, model compression techniques, and efficient inference engines. [53] proposes DeepSpeed-MoE, an end-to-end solution for training and deploying large-scale MoE models. [54] introduces EdgeMoE, an on-device inference engine designed for MoE-LLMs that loads popular experts onto the GPU to prioritize memory and computational efficiency. [55] proposes a novel expert offloading strategy utilizing the intrinsic properties of MoE-LLMs. It uses Least Recently Used (LRU) caching to store experts since certain experts are reused across consecutive tokens. It also accelerates the loading process by predicting the selection of experts for future layers based on the hidden states of earlier layers.
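The LRU expert cache can be sketched with an OrderedDict: experts that fit in GPU memory stay resident, a hit refreshes recency, and a miss evicts the least-recently-used expert before the requested one is fetched from host memory or flash. The cache capacity and the load_expert_weights loader are hypothetical placeholders, and the prediction of future experts described above is omitted.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` MoE experts resident on the GPU (LRU policy)."""

    def __init__(self, capacity, load_expert_weights):
        self.capacity = capacity
        self.load = load_expert_weights   # hypothetical loader: expert_id -> weights
        self.cache = OrderedDict()        # expert_id -> GPU-resident weights

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)            # hit: refresh recency
            return self.cache[expert_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)               # miss: evict the LRU expert
        weights = self.load(expert_id)                   # fetch from CPU RAM / flash
        self.cache[expert_id] = weights
        return weights

# Usage sketch: cache = ExpertCache(capacity=8, load_expert_weights=my_loader)
# and, inside each MoE layer, weights = cache.get(router_choice).
```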

Several studies have focused on developing efficient serving systems and inference engines for MoE-LLMs. [56] introduces MOE-INFINITY, an efficient MoE-LLMs serving system that implements cost-efficient expert offloading through activation-aware techniques, significantly reducing latency overhead and deployment costs. [57] proposes Fiddler, a resource-efficient inference engine that orchestrates CPU and GPU collaboration to accelerate the inference of MoE-LLMs in resource-constrained settings. Furthermore, [58] introduces Pre-gated MoE, an algorithm-system co-design approach that employs a novel pre-gating function to enhance the inference capabilities of MoE-LLMs by enabling more efficient activation of sparse experts and reducing the memory footprint. These advancements in MoE-LLMs deployment and inference engines demonstrate the ongoing efforts to make these powerful models more accessible and efficient across various computing environments.

II-C Communication Optimization

In addition to computation and memory optimization, communication overhead is another major challenge in deploying large models within the edge-cloud environment. The heterogeneity of edge devices and communication channels necessitates a scheduler for computation and network resources when executing FM tasks across different devices.

[59] introduces an “Intelligence-Endogenous Management Platform” for Computing and Network Convergence (CNC), which efficiently matches supply and demand within a highly heterogeneous CNC environment. The CNC brain is prototyped using a deep reinforcement learning model. It theoretically comprises four key components: perception, scheduling, adaptation, and governance, collectively supporting the entire CNC lifecycle. Specifically, Perception monitors incoming service requests and real-time computing resources; Scheduling assigns workloads to heterogeneous computing nodes in a heterogeneous network; Adaptation handles dynamic resources by ensuring service continuity through backup measures; and Governance refers to the self-governed, decentralized computing nodes.

[60] discusses the integration of LLMs into 6G vehicular networks, focusing on the challenges and solutions related to computational demands and energy consumption. It proposes a framework where vehicles handle initial LLM computations locally and offload more intensive tasks to Roadside Units (RSUs), leveraging edge computing and 6G networks’ capabilities. The authors formulate a multi-objective optimization framework that aims to minimize the cumulative cost incurred by vehicles and RSUs in processing computational tasks, including the cost associated with communication. LinguaLinked [61] is a system for deploying LLMs on distributed mobile devices. The high memory and computational demands of these models typically exceed the capabilities of a single mobile device. It utilizes load balancing and a ring topology to optimize computation and communication delay while maintaining the original model structure. The results demonstrate improvements in inference throughput on mobile devices. [62] presents MegaScale, a production system designed to train and deploy LLMs across over 10,000 GPUs. The synchronization of model parameters and gradients across GPUs can become a bottleneck as the number of GPUs increases. MegaScale addresses communication challenges through a combination of parallelism strategies, overlapping techniques, and network optimizations, resulting in improved training efficiency and fault tolerance [63].

Semantic communication enables much more efficient utilization of limited network resources at the edge by transmitting only the essential semantic information needed for communication purposes. [64] investigates multiple access designs to facilitate the coexistence of semantic and bit-based transmissions in future networks. The authors propose a heterogeneous semantic and bit communication framework where an access point simultaneously sends semantic and bit streams to a semantics-interested user and a bit-interested user. [65] presents a framework for semantic communications enabled by computing networks, aiming to provide sufficient computational resources for semantic processing and transmission. This framework leverages computing networks to support semantic communication. It introduces key techniques to optimize the network, such as semantic sampling and reconstruction, semantic-channel coding, and semantic-aware resource allocation and optimization based on cloud-edge-end computing coordination. Two use cases, an end-cloud computing-enabled video transmission system and a semantic-aware task offloading system, are provided to demonstrate the advantages of the proposed framework.

II-D Integrated Frameworks

With the emergence of computational and memory optimization methods, numerous integrated frameworks have incorporated these techniques, supporting various LLMs and lowering the barriers to LLM deployment. These frameworks leverage CPU and GPU backends, enabling the widespread local execution of LLMs.

The most popular framework for deploying LLMs at the edge is llama.cpp [66], which pioneered the open-source implementation of LLM execution using only C++. We provide an overview of various open-source frameworks and their respective features in Table II. The frameworks can be categorized into heterogeneous and GPU-only backends, ranging from mobile and embedded devices to high-end computing systems. Different suppliers offer specialized development platforms and APIs for their hardware products (i.e., CPUs and GPUs). CUDA (Compute Unified Device Architecture) is a parallel computing platform developed for NVIDIA GPUs. ROCm (Radeon Open Compute platform) is a software platform for high-performance computing on AMD GPUs. Metal is a low-overhead, hardware-accelerated 3D graphics and compute shader API developed by Apple for its own GPUs. Some frameworks, such as Vulkan and OpenCL (Open Computing Language), facilitate development across different hardware and operating systems. Vulkan is an API for cross-platform access to GPUs for graphics and computing. OpenCL is a framework for writing programs that execute across heterogeneous backends, including CPUs, GPUs, FPGAs, and other hardware.

TABLE II: Integrated Frameworks
Framework Features and Functions Backend
LLaMA.cpp[66] High-performance inference of LLaMA and other LLMs CPU(x64, ARM), GPU(CUDA, ROCm, Metal), OpenCL
MLC-LLM[67] Accelerating LLMs inference using Machine Learning Compilation GPU(CUDA, ROCm, Metal, Vulkan, WebGPU), OpenCL
MNN-LLM[68] Inference for Mobile Neural Network LLMs CPU(x64, ARM), GPU(CUDA), OpenCL
FastChat[69] Platform for both training and inference of LLM-based chatbots CPU(x64), GPU(CUDA, ROCm, Metal)
DeepSpeed[70] DeepSpeed-Inference integrates multiple parallelisms and custom kernels, communication, and memory optimizations. CPU(x64, ARM), GPU(CUDA, ROCm)
OpenVINO[71] Convert FMs and deploy them on Intel hardware CPU(x64), GPU(OpenCL)
MLLM[72] Mobile LLMs Inference CPU(x86, ARM)
FP6[73] Mixed precision for LLMs GPU(CUDA)
Colossal-AI[74] Distributed training and inference for LLMs using multiple parallelisms GPU(CUDA)
Megatron-LM[75] Efficient LLMs inference on GPUs with system-level optimizations GPU(CUDA)
TensorRT-LLM[76] NVIDIA TensorRT for LLMs inference GPU(CUDA)
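Most of the frameworks in Table II can be driven through a simple local interface once a model is loaded; llama.cpp, for instance, ships a server that exposes an OpenAI-compatible chat endpoint. The sketch below assumes such a server is already running on localhost:8080; the host, port, and model name are placeholders rather than fixed values.

```python
import json
import urllib.request

# Assumes a locally running, OpenAI-compatible server (e.g., llama.cpp's
# llama-server); the address and model name below are placeholders.
url = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "local-model",
    "messages": [{"role": "user",
                  "content": "Summarize edge-cloud LLM inference in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```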

Recently, several novel frameworks have been specifically designed for the deployment of LLM-based applications/agents. These frameworks provide abstract interfaces that facilitate the design of complex LLM-based applications, such as chain-of-thought reasoning and retrieval-augmented LLMs. LangChain is a powerful framework aimed at simplifying the development of applications that interact with LLMs [77]. It offers a set of tools and abstractions that enable developers to construct complex, dynamic workflows by chaining various operations, such as querying a language model, retrieving documents, or processing data. Parrot is a serving system for LLM applications that optimizes performance across multiple requests [78]. It introduces the concept of a semantic variable to define the input or output information of LLM requests. To enhance system performance, Parrot schedules requests based on a predefined execution graph and latency sensitivity, and it shares the key-value cache among requests to accelerate the prefilling process. SGLang is another LLM agent framework designed to efficiently execute complex language model programs [79]. It simplifies the programming of LLM applications by providing primitives for generation and parallelism control. Additionally, it accelerates the execution of these applications by reusing key-value caches, enabling faster constrained decoding, and supporting speculative execution of API calls.
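A key optimization shared by Parrot and SGLang is reusing the KV cache across requests that share a common prompt prefix, such as an identical system prompt or few-shot examples (SGLang organizes this with a radix tree; a flat dictionary is used below for brevity). The sketch only shows the bookkeeping; compute_kv and extend_kv are hypothetical stand-ins for the real prefill kernels.

```python
class PrefixKVStore:
    """Toy bookkeeping for sharing KV caches among requests with a common prefix."""

    def __init__(self, compute_kv, extend_kv):
        self.compute_kv = compute_kv   # hypothetical: tokens -> KV cache
        self.extend_kv = extend_kv     # hypothetical: (cache, suffix tokens) -> KV cache
        self.store = {}                # tuple(prompt tokens) -> cached KV

    def prefill(self, prompt_tokens):
        # Find the longest cached prompt that is a prefix of this one.
        best = ()
        for prefix in self.store:
            if len(prefix) > len(best) and tuple(prompt_tokens[:len(prefix)]) == prefix:
                best = prefix
        if best:
            # Reuse the shared prefix; only the remaining suffix is prefilled.
            kv = self.extend_kv(self.store[best], prompt_tokens[len(best):])
        else:
            kv = self.compute_kv(prompt_tokens)
        self.store[tuple(prompt_tokens)] = kv  # later requests can build on this
        return kv
```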

III Resource Allocation and Parallelism

III-A Resource Allocation and Scaling

Figure 4: The illustration of resource allocation. Resource allocation in a serving framework primarily involves dynamically adjusting the resource allocation strategy based on real-time resource conditions and query load.

Edge-cloud computing leverages strong cloud servers and distributed edge devices to handle tasks near the data source. As shown in Figure 4, adaptive resource allocation is crucial for achieving an optimal balance between system performance and cost in edge-cloud environments. Resource management facilitates the optimal utilization of system resources, including processing power, storage space, and network bandwidth. Adaptive algorithms are designed to automatically adjust resource configurations based on real-time workloads and environmental changes, ensuring optimal performance under varying conditions.
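As a concrete illustration of this adaptive loop, the sketch below implements a minimal reactive policy: whenever average utilization leaves a target band, the controller requests a replica count that steers utilization back toward the target, similar in spirit to a Kubernetes-style horizontal autoscaler. The thresholds and bounds are illustrative assumptions; the systems surveyed below typically combine such reactive rules with workload prediction.

```python
import math

def desired_replicas(current_replicas, avg_utilization, target=0.6,
                     upper=0.8, lower=0.3, min_replicas=1, max_replicas=32):
    """Return the replica count a simple reactive autoscaler would request.

    avg_utilization is the mean utilization (0.0-1.0) across current replicas;
    all thresholds and bounds are illustrative assumptions.
    """
    if avg_utilization > upper or avg_utilization < lower:
        wanted = math.ceil(current_replicas * avg_utilization / target)
        return max(min_replicas, min(max_replicas, wanted))
    return current_replicas  # inside the band: do nothing, avoiding thrashing

print(desired_replicas(current_replicas=4, avg_utilization=0.9))  # -> 6
```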

Despite their numerous advantages, several challenges also arise in edge-cloud environments. 1) In edge-cloud environments, resources need to be allocated and managed between the central cloud and multiple edge nodes. Distributed resource management may increase the system’s complexity and require more fine-grained scheduling and coordination mechanisms. 2) The load in edge computing scenarios is highly dynamic. Accurately predicting this query load and dynamically adjusting resource allocation demands is difficult. 3) To meet the real-time processing requirements, the system must quickly adapt to changes in execution environments and adjust processing strategies and resource allocations in time. 4) Modern computing environments provide a variety of computing resources (CPU, GPU, TPU, etc.). Optimizing the allocation of these heterogeneous resources for various models based on their computational requirements and current workloads is a complex task.

We have summarized previous research works on resource allocation and adaptive optimization in Table III. The collected research works are divided into two categories: optimization for model services in cloud environments and optimization for IoT applications in edge computing environments. 1) Cloud Environments. The task here is to efficiently deploy, manage, and scale machine learning models in the cloud. The objective is to optimize resource usage, reduce computing costs, and achieve performance goals in the face of dynamic and unpredictable query loads. This includes resource provisioning algorithms and strategies for utilizing heterogeneous computing resources. 2) Edge Environments. Because of the unique architectures and application scenarios in edge environments, data needs to be processed closer to the user or the geographic location of the data source. This requires well-designed scheduling algorithms as well as fault tolerance and resilience mechanisms to minimize latency and mitigate network overhead.

TABLE III: Summary of resource allocation and adaptation methods.
Scenario Ref. Year Target Method
Cloud Clipper [80] 17 Accuracy, Latency and Throughput Model Containers.
MArk [81] 19 Latency and Resource Cost Predictive scaling.
Nexus [82] 19 Latency and Throughput Squishy bin packing.
InferLine [83] 20 Latency and Resource Cost 1. The low-frequency planner. 2. The high-frequency tuner.
Clockwork [84] 20 Latency DNN Workers.
INFaaS  [85] 21 Latency, Throughput and Resource Cost Dynamic Model Variant Selection and Scaling.
Morphling [86] 21 Resource Cost Model-Agnostic Meta-Learning.
Cocktail [87] 22 Accuracy, Latency and Resource Cost 1. Resource controller. 2. Autoscaler.
Kairos [88] 23 Throughput Query-distribution mechanism.
SHEPHERD [89] 23 Throughput and Utilization HERD: Planner for Resource Provisioning.
SpotServe [90] 23 Resource Cost Device Mapper.
Edge Na et al. [91] 18 Throughput and Latency Resource allocation scheme based on Lagrangian and Karush-Kuhn-Tucker conditions.
Avasalcai et al. [92] 19 Latency Deployment policy module.
Yang et al. [93] 19 Latency 1. Multi-Dimensional Search and Adjust (MDSA) Algorithm. 2. Cooperative Online Scheduling (COS) Method.
Tong et al. [94] 20 Latency and Energy consumption Deep Reinforcement Learning (DRL) Approach.
Xiong et al. [95] 20 Latency and Utilization Improved deep reinforcement learning (DQN) Algorithm.
CE-IOT [96] 20 Resource Cost 1. Delay-Aware Lyapunov Optimization Technique 2. Economic-Inspired Greedy Heuristic.
Chang et al. [97] 20 Latency and Energy consumption Online Algorithm for Real-Time Decision-Making.
LaSS [98] 21 Latency and Utilization Model-driven approaches.
Ascigil et al. [99] 21 Latency and Utilization Decentralized Strategies.
KneeScale [100] 22 Throughput, Latency and Utilization. Adaptive Auto-scaling with Knee Detection.
CEC [101] 22 Latency, Utilization and Accuracy Control-Based Resource Pre-Provisioning Algorithm (PRCT).

1) Cloud Environments. The following articles present various advanced solutions to optimize resource allocation in the cloud. Clipper uses model containers to encapsulate the model inference process in a Docker container [80]. Clipper supports replicating these model containers across the cluster to increase the system throughput and utilize additional hardware accelerators for serving. MArk is a generic inference system built on Amazon Web Services. It utilizes serverless functions to address service delays with horizontal and vertical scaling. Horizontal scaling expands the system by adding more hardware instances, while vertical scaling expands the system by increasing the resources of a single instance [81]. MArk uses a Long Short-Term Memory (LSTM) network for multi-step workload prediction. Leveraging the workload prediction results, MArk determines the instance type and quantity required to meet the Service Level Objectives (SLOs) using a heuristic approach. Nexus adopts the squishy bin packing method to batch different types of tasks on the same GPU, enhancing resource efficiency by considering the latency requirements and execution costs of each task [82]. It also merges multiple tasks into the same GPU execution cycle as long as the latency constraints are not violated. InferLine utilizes a low-frequency planner and a high-frequency tuner to manage the machine learning prediction pipeline effectively [83]. The low-frequency combinatorial planner finds the cost-optimal pipeline configuration under a given latency SLO. The high-frequency auto-scaling tuner monitors the dynamic request arrival pattern and adjusts the number of replicas for each model. Clockwork achieves adaptive resource management through a fine-grained central controller for worker scheduling and resource management [84, 102]. DNN workers pre-allocate all GPU memory and divide it into three categories: workspace, I/O cache, and page cache. This avoids repeated memory allocation calls and improves predictability. To cope with changes in query load, INFaaS adopts two automatic scaling mechanisms: vertical auto-scaling at the model level and horizontal auto-scaling at the Virtual Machine (VM) level [85]. The model-level auto-scaling is handled by the Model-Autoscaler, which decides each model variant’s scaling operations (replication, upgrade, or downgrade) by solving an integer linear programming (ILP) problem. The VM-level horizontal auto-scaling adds a new VM if the utilization of any hardware resource exceeds a configurable threshold. Morphling utilizes model-agnostic meta-learning techniques to effectively navigate the high-dimensional configuration space, such as CPU cores, GPU memory, GPU timeshare, and GPU type [86]. It can significantly reduce the cost of configuration search and quickly find near-optimal configurations. Cocktail designs a resource controller to manage CPU and GPU instances in a cost-optimized manner and a load balancer to allocate queries to appropriate instances [87]. It also proposes an autoscaler that leverages predictive models to predict future request loads and dynamically adjusts the number of instances in each model pool based on the importance weight of the models. Kairos designs a query distribution mechanism to intelligently allocate queries of different batch sizes to different instances in order to maximize throughput [88].
Kairos transforms the query distribution problem into a minimum-cost bipartite matching problem and uses a heterogeneity coefficient to represent the relative importance of different types of instances. This allows for better balancing of the resource usage of different instance types. SHEPHERD comprises a planner (HERD), a request router, and a scheduler (FLEX) for each serving group [89]. HERD is responsible for partitioning the entire GPU cluster into multiple service groups. In addition, it performs periodic planning and informs each GPU worker of its designated service group and the models it must serve. SpotServe is an LLM serving system that flexibly adjusts GPU instances and updates the parallelism strategy [90]. It uses a bipartite graph matching algorithm (the Kuhn-Munkres algorithm) to map model layers to hardware devices, thus maximizing the reusable model parameters and key-value caches.
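The layer-to-device orchestration performed by SpotServe can be illustrated with a generic assignment formulation: given an estimated cost of placing each pipeline stage on each candidate instance (for instance, the volume of parameters and KV cache that could not be reused and would have to be migrated), the Kuhn-Munkres algorithm returns a minimum-cost one-to-one mapping. The cost matrix below is an assumed toy example, not SpotServe's actual cost model; SciPy's linear_sum_assignment implements the same Hungarian/Kuhn-Munkres matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: assumed cost of placing pipeline stage i on GPU instance j,
# e.g., bytes of parameters / KV cache that cannot be reused and must be moved.
cost = np.array([
    [0, 5, 9],   # stage 0 already lives on instance 0 -> zero migration cost
    [7, 1, 8],   # stage 1 is cheapest on instance 1
    [6, 4, 2],   # stage 2 is cheapest on instance 2
])

stages, instances = linear_sum_assignment(cost)   # Kuhn-Munkres / Hungarian method
for s, d in zip(stages, instances):
    print(f"stage {s} -> instance {d} (cost {cost[s, d]})")
print("total migration cost:", cost[stages, instances].sum())
```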

2) Edge Environments. Given the limited and heterogeneous nature of edge computing resources, efficient management and scaling of these resources are crucial. The following papers present efficient solutions to improve the performance and efficiency of edge systems. To optimize resource allocation and interference management, [91] proposes a resource allocation scheme based on Lagrangian and Karush-Kuhn-Tucker conditions. Based on the number of associated IoT end devices (IDs) and the queue length of the edge gateways (EGs), this scheme calculates the optimal resource allocation parameters and allocates resource blocks to each EG. A decentralized resource management technique is proposed in [92] to deploy latency-sensitive IoT applications on edge devices. The key design is the deployment policy module, which is responsible for finding a task allocation scheme that meets the application's requirements based on the bids from the bidder nodes and deploying the application to the corresponding edge nodes. The authors in [93] introduce a novel method, called MDSA, to address the challenge of joint model partitioning and resource allocation for latency-sensitive applications in mobile edge clouds. They employ the task-oriented online scheduling method, COS, to collaboratively balance the workload across computing and network resources, thereby preventing excessive wait times. [94] designs a DRL algorithm to determine whether a task needs to be offloaded and to allocate computing resources efficiently. Tasks generated by mobile user equipment (UE) are submitted to the task queue to wait for execution. The algorithm models the task queue as a Poisson distribution. It then uses the DRL method to select a suitable computing node for each task, learning the optimal strategy during the training process. The authors of [95] formulate the resource allocation problem using a Markov decision process model. They enhance the DQN algorithm by incorporating multiple replay memories to refine the training process. This modification involves using different replay memories to store experiences under different circumstances. CE-IoT is an online cloud-edge resource provisioning framework based on delay-aware Lyapunov optimization to minimize operational costs while meeting the latency requirements of requests [96]. The framework allows for online resource allocation decisions without prior knowledge of system statistics. The authors of [97] propose a dynamic optimization scheme that coordinates and allocates resources for multiple mobile devices in fog computing systems. They introduce a dynamic subcarrier allocation, power allocation, and computation offloading scheme to minimize the system execution cost through Lyapunov optimization. LaSS is a platform designed to run latency-sensitive serverless computations on edge resources [98]. LaSS employs model-driven strategies for resource allocation, auto-scaling, fair-share allocation, and resource reclamation to efficiently manage serverless functions at the edge. In [99], the authors discuss resource provisioning and allocation in FaaS edge-cloud environments, considering decentralized approaches to optimize CPU resource utilization and meet request deadlines. This paper designs resource allocation and configuration algorithms with varying degrees of centralization and decentralization.
KneeScale optimizes resource utilization for serverless functions by dynamically adjusting the number of function instances until reaching the knee point, where increasing resource allocation no longer provides significant performance benefits [100]. KneeScale utilizes the Kneedle algorithm to detect the knee points of functions through online monitoring and dynamic adjustment. CEC is a containerized edge computing framework that integrates workload prediction and resource pre-provisioning [101]. It adjusts the resource allocation of the container through the PRCT algorithm to achieve zero steady-state error between the actual response time and the ideal response time.

III-B Parallelism

III-B1 Parallelism type

Figure 5: The illustration of different parallelism methods. Data parallelism divides the data into multiple micro-batches for processing. Model parallelism partitions a model into several modules (stages). Tensor parallelism splits a tensor into various segments.
Figure 6: Orchestrating different sub-models onto different devices. Many research works design auto-parallelism methods to deploy sub-models on different devices according to the execution environment.
TABLE IV: Summary of parallel methods.
Model Ref. Type Target Design
LLM Megatron-lm [103] TP Latency Introduce an intra-layer model parallelism method.
DeepSpeed-inference [70] TP, MP, DP Latency and Throughput 1. Leverages heterogeneous memory systems; 2. Custom GEMM operations, kernel fusion, and memory access optimizations.
Pope et al. [104] TP, MP, DP Latency and Throughput Get the best partitioning strategy for a given model size with specific application requirements.
AlpaServe [105] MP Latency Model parallelism & Statistically multiplexing multiple devices when serving multiple models.
LightSeq [106] Sequence parallelism Throughput Partitioning solely the input tokens.
PETALS [107] MP, DP Throughput Fault-tolerant inference algorithms and load-balancing protocols.
SpotServe [90] TP, MP, DP Latency 1. Dynamic Re-Parallelization. 2. Instance Migration Optimization. 3. Stateful Inference Recovery.
SARATHI [108] MP, DP Throughput Constructs a batch using a single prefill chunk and fills the remaining batch with decode requests.
DistServe [109] MP, DP Latency Assigns prefill and decoding computation to different GPUs.
Small Model Zhou et al. [110] MP Latency Utilizes dynamic spatial partitioning and layer fusion techniques to optimally distribute the DNN computation across multiple devices.
DINA [111] MP Latency A fine-grained, adaptive DNN partitioning and offloading strategy.
CoEdge [112] MP Latency and Resource Exploiting model parallelism and adaptive workload partitioning across heterogeneous edge devices.
Li et al. [113] MP Throughput Partitioning the DNN model on the IOT devices and a cloudlet.
JellyBean [114] MP Throughput Create optimized execution plans for complex ML workflows on diverse infrastructures.
PipeEdge [115] MP, DP Throughput An optimal partition strategy that accounts for the heterogeneity in computing power, memory capacity, and network bandwidth.
PDD [116] MP Latency A multipartitioning and offloading approach for streaming tasks with a Directed Acyclic Graph topology.
B&B [117] MP Throughput 1. Latency Modeling and Prediction. 2. A branch and bound solver for DNN partitioning.
Li et al. [118] MP Latency Fine-grained model partitioning mechanism with multi-task learning based A3C approach.
MoEI [119] MP Latency Optimizes model partition and service migration in mobile systems.

TP: Tensor Parallelism; MP: Model Parallelism; DP: Data Parallelism.

Large FMs, such as the GPT and Llama series, and other Transformer-based architectures have significantly advanced the capabilities of AI applications. However, these models come with a substantial increase in computational requirements due to their size and complexity. Parallelism, including data parallelism (DP), model parallelism (MP), pipeline parallelism (PP), and tensor parallelism (TP), is a critical design aspect that addresses the scalability, efficiency, and performance challenges associated with deploying and serving large-scale machine learning models. As shown in Figure 5, data parallelism is a technique where data is split into smaller batches and distributed across multiple processors. Different machines can execute the inference simultaneously, thus significantly improving throughput. Model parallelism splits the model across different processors, with each processor responsible for a portion of the model’s layers or parameters. This approach is useful for very large models that cannot fit into the memory of a single processor. Model parallelism can be more challenging to implement than data parallelism because it requires careful partitioning of the model and management of the dependencies between layers. Pipeline parallelism combines data parallelism and model parallelism: the data passes through the model in a sequential manner, transitioning from one processor to another as it traverses different layers of the model, while multiple micro-batches are kept in flight across the stages. Tensor parallelism is a more fine-grained approach than model parallelism. It splits the individual operations within a layer across multiple processors. For example, if a layer performs a large matrix multiplication, the computation of this matrix can be distributed across multiple processors. This approach is particularly useful for operations that are computationally intensive and can be easily parallelized, albeit at the cost of increased communication volume. Parallelism enhances performance by enabling simultaneous processing, significantly reducing execution time and increasing the scalability of applications to handle large numbers of users and more complex models.
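The distinction between these schemes is easiest to see on a single linear layer. The NumPy sketch below splits the weight matrix of a matmul column-wise across two hypothetical devices (tensor parallelism): each device computes a slice of the output, and a concatenation, which would be an all-gather in a real multi-GPU setting, reassembles the result. Data parallelism splits the batch instead, as shown for contrast, while model parallelism would place whole layers on different devices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512))      # a batch of activations
W = rng.standard_normal((512, 1024))   # weight matrix of one linear layer

# Tensor parallelism: split W column-wise across two "devices".
W0, W1 = np.hsplit(W, 2)               # each shard is 512 x 512
y0 = x @ W0                            # computed on device 0
y1 = x @ W1                            # computed on device 1
y = np.concatenate([y0, y1], axis=1)   # all-gather of the output shards
assert np.allclose(y, x @ W)           # identical to the unsharded layer

# Data parallelism, for contrast: split the batch instead of the weights.
x0, x1 = np.vsplit(x, 2)
assert np.allclose(np.vstack([x0 @ W, x1 @ W]), x @ W)
```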

There are still several challenges when designing a parallelism strategy for FMs in the edge-cloud environment. 1. Memory Constraints: Large models may not fit into the memory of a single GPU, necessitating strategies to distribute them across multiple processing units. 2. Computational Load: The volume of computation required for a single request can be substantial, requiring efficient distribution of computational tasks. 3. Latency Requirements: Applications often require real-time responses, imposing strict latency constraints on the serving infrastructure. 4. Scalability: The ability to serve a large number of concurrent requests without performance degradation is crucial for ensuring user satisfaction. 5. Heterogeneous Environment: Edge devices can vary widely in their computation and communication capabilities, operating systems, and available software. A parallelism strategy must be adaptable to different platforms and capable of optimizing execution according to the specific characteristics of each device.

III-B2 Auto-parallelism

Existing research works design different automatic parallelism methods in cloud and edge scenarios. As shown in Figure 6, the key problem lies in determining an optimal partitioning scheme of a model and strategically allocating each stage to an appropriate device. We summarize some automatic parallelism methods in Table IV. NVIDIA Triton and Tensorflow-serving are two famous inference frameworks for machine learning models [120, 121]. Megatron-lm introduces an efficient intra-layer model parallel approach (i.e., tensor parallelism) that allows for the training of transformer models with billions of parameters without requiring new compilers or significant library changes, fully implementable in PyTorch [103]. DeepSpeed-inference is a comprehensive system designed to address the challenges of efficiently executing transformer model inference at large scales with heterogeneous memory systems, custom GEMM operations, and more [70]. To understand and optimize the trade-offs (e.g., efficiency and latency) for LLM inference, Pope et al. develop an abstract and powerful partitioning framework designed for model parallelism [104]. It allows for dynamically analyzing the best partitioning strategy based on the specific requirements of a given model size and application scenario. AlpaServe presents a novel approach that harnesses the power of model parallelism to scale and statistically multiplex multiple devices, enabling efficient serving of multiple models [105]. This approach is particularly beneficial in scenarios where workloads are bursty and the demand fluctuates significantly. LightSeq designs a sequence parallelism method for long-context transformers that performs partitioning solely the input tokens [106]. PETALS explores cost-efficient methods for inference and fine-tuning of LLMs on consumer GPUs that are connected by the Internet [107]. This approach could enable the pooling of idle compute resources from multiple research groups and volunteers to run LLMs efficiently. It introduces two main innovations: fault-tolerant inference algorithms and load-balancing protocols. SpotServe is a novel system designed to serve generative LLMs on preemptible GPU instances in cloud environments [90]. Preemptible instances offer a cost-effective solution for accessing spare GPU resources at significantly reduced prices, although they can be interrupted or terminated at any moment. SpotServe addresses this challenge with dynamic re-parallelization, instance migration optimization and stateful inference recovery. SARATHI detects inference bubbles in pipeline parallelism caused by the imbalance (i.e., different execution times) between two distinct phases in the LLM: the prefill phase and the decode phase [108]. To tackle this issue, SARATHI addresses it through a decode-maximal batching approach and a chunked-prefills approach. This method divides a prefill request into equal-sized chunks and constructs a batch by utilizing a single chunk from the chunked-prefills and then populates the remaining batch slots with decode requests. DistServe enhances the performance of serving LLMs by disaggregating the prefill and decoding phases [109]. The disaggregation allows each phase to be assigned to different GPUs, eliminating interferences between prefill and decoding operations and allowing for tailored resource allocation and parallelism strategies for each phase.
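
As a rough illustration of the chunked-prefill, decode-maximal batching idea described above (a simplified Python sketch, not SARATHI’s actual scheduler; the request fields, chunk size, and token budget are hypothetical), the following snippet builds one batch from a single prefill chunk and fills the remaining slots with decode requests:

def build_batch(prefill_queue, decode_queue, chunk_tokens=256, batch_tokens=512):
    """Form one batch: at most one prefill chunk, rest filled with decode requests.

    Each prefill request is a dict {"id": ..., "remaining_prompt": int};
    each decode request contributes exactly one token per iteration.
    """
    batch, used = [], 0
    if prefill_queue:
        req = prefill_queue[0]
        take = min(chunk_tokens, req["remaining_prompt"])
        batch.append(("prefill", req["id"], take))
        req["remaining_prompt"] -= take
        if req["remaining_prompt"] == 0:
            decode_queue.append(prefill_queue.pop(0))  # prompt done, start decoding
        used += take
    # Populate the rest of the token budget with decode requests (1 token each).
    for req in decode_queue[: batch_tokens - used]:
        batch.append(("decode", req["id"], 1))
    return batch

prefills = [{"id": "A", "remaining_prompt": 700}]
decodes = [{"id": "B"}, {"id": "C"}]
print(build_batch(prefills, decodes))
# [('prefill', 'A', 256), ('decode', 'B', 1), ('decode', 'C', 1)]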

Some research works design parallelism methods to deploy models on edge devices [122]. Zhou et al. present a framework for deploying deep neural network (DNN) inference on edge devices using model partitioning with containers [110]. This approach utilizes dynamic model partitioning and layer fusion techniques to optimally distribute the DNN computation across multiple devices, considering the available computational resources and network conditions. DINA is a system designed to optimize the deployment of DNNs across edge devices in fog computing environments [111]. DINA employs a fine-grained, adaptive DNN partitioning and offloading strategy. By leveraging matching theory, DINA dynamically adapts the partitioning and offloading process based on real-time network conditions and device capabilities, aiming to minimize the total latency of DNN inference tasks. CoEdge is designed to facilitate cooperative DNN inference across heterogeneous edge devices by dynamically partitioning the DNN inference workload with an adaptive algorithm [112]. Li et al. partition DNN models and utilize parallel processing techniques to meet the real-time requirements of various applications in mobile edge computing (MEC) environments [113]. They design an approximation algorithm and an online algorithm to determine the model partitioning and placement on a cloudlet as well as the local device. JellyBean aims to optimize machine learning inference workflows across heterogeneous computing infrastructures [114]. JellyBean performs model selection and worker assignment to reduce the costs associated with computing and network resources while meeting constraints on input throughput and accuracy. Recognizing the computational and memory limitations of edge devices, PipeEdge implements a strategic partitioning approach that considers the diversity in computational power, memory size, and network speed across different devices [115]. It designs a dynamic programming algorithm to find the optimal partitioning strategy for distributing DNN inference tasks. PDD designs an efficient partitioning and offloading method based on greedy and dichotomy principles for DNNs with a directed acyclic graph topology in streaming tasks [116]. B&B first introduces a prediction model for both inference and transmission latency in distributed DNN deployments, then formulates the DNN partitioning problem and presents a branch and bound solver to tackle it [117]. Li et al. investigate a multi-task learning approach with asynchronous advantage actor-critic (A3C) to optimize model partitioning in MEC networks for reducing inference delay [118]. MoEI is a task scheduling framework for device-edge systems that optimizes model partitioning and service migration in an MEC environment [119]. They develop two algorithms: one based on game theory for offline optimization and another leveraging proximal policy optimization for online, adaptive decision-making in a distributed environment.
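
Many of these systems reduce to a common core problem: split a chain of layers into contiguous stages, one per device, so that the slowest stage (the pipeline bottleneck) is as fast as possible. The sketch below solves a toy instance by enumerating cut points over per-layer latency estimates; the numbers, the uniform-device assumption, and the omission of communication costs are simplifications relative to systems such as PipeEdge, which use dynamic programming and model heterogeneity explicitly.

import itertools

def best_partition(layer_cost, num_stages):
    """Split layers into contiguous stages minimizing the maximum stage cost."""
    n = len(layer_cost)
    best_max, best_bounds = float("inf"), None
    # Enumerate the (num_stages - 1) cut points between layers.
    for cuts in itertools.combinations(range(1, n), num_stages - 1):
        bounds = (0,) + cuts + (n,)
        stage_cost = [sum(layer_cost[a:b]) for a, b in zip(bounds, bounds[1:])]
        if max(stage_cost) < best_max:
            best_max, best_bounds = max(stage_cost), bounds
    return best_max, best_bounds

# Hypothetical per-layer latencies (ms) for a 6-layer model on identical devices.
costs = [4.0, 2.5, 6.0, 3.0, 1.5, 5.0]
bottleneck, bounds = best_partition(costs, num_stages=3)
print(bottleneck, bounds)   # 9.0 (0, 2, 4, 6): stages [4, 2.5], [6, 3], [1.5, 5]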

IV Foundation Model & Compression

IV-A Current Foundation Model

Refer to caption
Figure 7: The timeline of some popular LLMs.
TABLE V: Basic Language Models
Model Time (Y.M) $n_{para}$ $n_{layer}$ $n_{head}$ $d_{model}$ AF Attention Type PE
T5 19.10 60M–11B 6–24 8–128 512–1024 ReLU Multi-head Relative
GPT-3 20.05 125M–175B 12–96 12–96 768–12288 GELU Multi-head Sinusoidal
PanGu-α 21.04 2.6B–207.0B 32–64 40–128 2560–16384 GELU Multi-head Learned
ERNIE 3.0 21.07 10B 48, 12 64, 12 4096, 768 GELU Multi-head Relative
Jurassic-1 21.08 7.5B, 178B 32, 76 32, 96 4096, 13824 GELU Multi-head Sinusoidal
Gopher 21.12 44M–280B 8–80 16–128 512–16384 GELU Multi-head Relative
LaMDA 22.01 2B–137B 10–64 40–128 2560–8192 gated-GELU Multi-head Relative
MT-NLG 22.01 530B 205 128 20480 GELU Multi-head \
Chinchilla 22.04 44M–16.183B 8–47 8–40 512–5120 \ Multi-head \
PaLM 22.04 8.63B–540.35B 32–118 16–48 4096–18432 SwiGLU Multi-query RoPE
OPT* 22.05 125M–175B 12–96 12–96 768–12288 ReLU Multi-head RoPE
Galactica* 22.11 125M–120.0B 12–96 12–80 768–10240 GELU Multi-head Learned
BLOOM* 22.11 559M–176.274B 24–70 16–112 1024–14336 GELU Multi-head Alibi
LLaMA* 23.02 6.7B–65.2B 32–80 32–64 4096–8192 SwiGLU Multi-head RoPE
LLaMA 2* 23.07 7B–70B 32–80 32–64 4096–8192 SwiGLU Grouped-query RoPE
Baichuan* 23.09 7B, 13B 32, 40 32, 40 4096, 5120 SwiGLU Multi-head RoPE, AliBi
Qwen* 23.09 1.8B–14B 24–40 16–40 2048–5120 SwiGLU Multi-head RoPE
Skywork-13B* 23.10 13B 52 36 4608 SwiGLU Multi-query RoPE
Falcon* 23.11 7B–170B 32–80 64 4544–14848 GELU Multi-group RoPE
StarCoder* 23.12 15.5B 40 48 2048 \ Multi-query Learned
Yi* 24.03 6B, 34B 32, 60 32, 56 4096, 7168 SwiGLU Grouped-query RoPE
Llama 3* 24.04 8B, 70B \ \ \ SwiGLU Grouped-query RoPE

* indicates open-source. PE: Positional Embedding. AF: Activation Function.

TABLE VI: Some Instruction-Tuned Models
Model Time(Y.M) nparan_{para} Basic LM
mT5 20.10 300M–13B T5
FLAN 21.09 137B LaMDA
Flan-T5 21.10 80M–11B FLAN, T5
Flan-cont-PaLM 21.10 62B FLAN, PaLM
Flan-U-PaLM 21.10 540B FLAN, U-PaLM
Alpaca 23.03 7B LLaMA-7B
FLM-101B 23.09 101B FreeLM
TABLE VII: Some Popular Multimodal Models
Model Time Language Model Vision Model Vision→Language Input Output
Flamingo 22.04 Chinchilla 70B NFNet F6 Perceiver Resampler text, image, video text
MiniGPT-4 23.04 Vicuna EVA-CLIP ViT-g/14 Linear text, image, box text, box
mPLUG-owl 23.04 Vicuna CLIP-ViT-L/14 Linear text, image text
PandaGPT 23.05 Vicuna ImageBind Linear text, image, audio, video, depth, thermal, IMU text
Shikra 23.06 Vicuna CLIP ViT-L/14 Linear text, box, image, point text, point, box
Qwen-VL 23.08 Qwen-7B CLIP ViT-G/14 Cross Attention text, image, box text, box
NExT-GPT 23.09 Vicuna ImageBind Linear text, image, audio, video text, image, audio, video
CogVLM 23.09 Vicuna EVA-CLIP ViT-E/14 MLP text, image text
Ferret 23.10 Vicuna CLIP-ViT-L/14 Layer text, image text
OneLLM 23.12 LLaMA-2 CLIP-ViT Transformer text, image, video, audio, point, IMU, fMRI text
NExT-Chat 23.12 Vicuna CLIP-ViT \ text, image, box text, box, mask
Gemini 23.12 \ \ \ text, image, video, audio text
Gemini 1.5 24.04 \ \ \ text, image, video, audio text

Since the release of ChatGPT, FMs, especially LLMs, have become increasingly important in daily life. Figure 7 presents a timeline of some popular LLMs. Table V highlights the structural features of several basic LLMs. Table VI and Table VII list various instruction-tuned models and multimodal models respectively.

IV-A1 Large Language Models

Current LLMs are based on the Transformer, a neural network architecture built on attention mechanisms [123]. Most LLMs use the decoder-only architecture, which is beneficial for few-shot capabilities. Researchers design different LLMs based on various Transformer configurations and pretraining datasets.

T5 is based on the encoder-decoder transformer architecture [123]. The authors find that transfer learning can significantly enhance performance, especially when combined with large amounts of high-quality data [124], so they cast pre-training as text-to-text tasks. GPT-3 is published by OpenAI, with a major enhancement in in-context learning capabilities compared with GPT-2 [125] through model scaling [126]. To improve multilingual understanding, Zeng et al. train PanGu-α on a 1.1TB high-quality Chinese text corpus, and it exhibits decent performance on various Chinese NLP tasks under few-shot or zero-shot conditions [127]. ERNIE 3.0 integrates autoregressive and autoencoding networks, which allows easy customization for natural language understanding and generation tasks and addresses the weak downstream language understanding of previous LLMs trained on plain text [128]. The Jurassic-1 series includes J1-Large (7.5B) and J1-Jumbo (178B); J1-Jumbo’s architecture is adjusted to address the depth-to-width expressivity tradeoff found in self-attention networks [129]. At a later stage, Gopher is introduced and achieves state-of-the-art performance on most of its 152 evaluation tasks [130].

To enhance LLMs’ safety and response quality, LaMDA is fine-tuned on labeled data and can reference external knowledge sources. LaMDA generates multiple response candidates in dialogues, filters out those with lower safety scores, and outputs the one with the highest quality score [131]. MT-NLG, a 530B model with strong zero-, one-, and few-shot capabilities, is trained with an efficient and scalable 3D parallel system [132]. Previous studies focus on increasing the number of model parameters without enlarging the number of pretraining tokens. Hence, DeepMind studies how to balance the number of parameters and tokens within a given computational budget and finds that the two should scale equally. Based on this, Chinchilla is trained on 1.4T tokens and demonstrates superior performance on numerous downstream tasks compared to other LLMs such as Gopher (280B) [133] and GPT-3 (175B) [126] [130]. PaLM uses a decoder-only transformer architecture instead of an encoder-decoder architecture [125] to enhance few-shot capability. OPT, comparable to GPT-3, is developed to address the problem that most LLMs are not open-sourced and offer limited access to outside developers [134]. To help researchers find useful information, Galactica is trained on a vast amount of scientific corpora, reference materials, and other academic databases and outperforms other LLMs on various scientific tasks [135]. To promote the transparency of LLM research, BLOOM is published and open-sourced. It is trained on a dataset with hundreds of sources in 46 natural languages and 13 programming languages. Benchmarks suggest that fine-tuning BLOOM with multitask prompts can improve its performance [136].

Based on Chinchilla’s contribution, which indicates that the numbers of tokens and parameters should scale equally [130], LLaMA focuses on using more training tokens to achieve optimal performance. Although Hoffmann et al. suggest training a 10B model on 200B tokens, LLaMA-7B is trained on 1T tokens, indicating that increasing the number of tokens can still enhance the model’s performance [137]. Building on this observation, Meta AI publishes and open-sources LLaMA 2, which utilizes a larger corpus, longer context lengths, and grouped-query attention [138]. Following the LLaMA series, Baichuan is released to enhance the performance of Chinese NLP tasks. To improve the compression rate for Chinese, Byte-Pair Encoding is adopted as the tokenization algorithm, and the tokenization model is trained on 20M multilingual corpora. Baichuan separates all numbers into individual digits to enhance the model’s mathematical capabilities and integrates various optimizations, including operator optimization, tensor partitioning, mixed precision, training recovery, and communication technologies [139] [140]. Bai et al. publish the QWEN series of language models, which includes QWEN, Qwen-Chat, CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT. The QWEN series shows strong performance compared with other open-source models, while remaining slightly inferior to proprietary models [141]. Wei et al. develop Skywork-13B, which is trained on a corpus of over 3.2T tokens extracted from English and Chinese texts [142]. The Falcon series is trained on high-quality corpora primarily assembled from web data. Almazrouei et al. release a custom distributed training codebase that allows efficient pretraining of these models on up to 4,096 A100 GPUs on AWS cloud infrastructure [143]. Given the widespread use of code-generating LLMs, StarCoder, trained on over 80 programming languages with multi-query attention, is released and open-sourced for public use [144].

The Yi team designs a 34B-parameter model to retain complex reasoning and emergent capabilities while enabling inference on consumer-grade hardware, such as the RTX 4090. The model is trained on a high-quality dataset of 3.1T tokens [145].

IV-A2 Instruction-tuned Models

Instruction-tuning is an approach that fine-tunes pretrained LLMs on a formatted language dataset [146]. This technique improves a model’s understanding of, and response to, inputs with specific instructions by training it on instructional task datasets. A dataset instance usually consists of an instruction, an input, and an output. For example, the instruction is “What is the answer to the formula?”, the input is “7+3”, and the output is “10”.
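
Concretely, a single training instance of this kind is often stored as a structured record. The sketch below shows one possible layout in Python together with a simple prompt template applied before fine-tuning; the field names and template wording resemble common open instruction datasets such as Alpaca but are illustrative rather than a fixed standard.

# One instruction-tuning example (schema is illustrative, not a fixed standard).
example = {
    "instruction": "What is the answer to the formula?",
    "input": "7 + 3",
    "output": "10",
}

# A simple prompt template that concatenates instruction and input; the model is
# then fine-tuned to generate the "output" field given this prompt.
PROMPT = (
    "Below is an instruction paired with an input. Write a suitable response.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
print(PROMPT.format(**example) + example["output"])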

In this subsection, several instruction-tuned models are introduced to illustrate the development of instruction-tuning. To improve T5’s multilingual capabilities [124], mT5 is trained on a dataset covering 101 languages [147]. FLAN is based on LaMDA 137B [131] and instruction-tuned on over 60 datasets with natural language instruction templates, significantly improving its performance. Ablation studies reveal that the number of fine-tuning datasets, the model scale, and the use of natural language instructions are crucial to the success of instruction tuning [148]. The fine-tuning of Flan-T5, Flan-PaLM, Flan-cont-PaLM, and Flan-U-PaLM is expanded to include datasets from Muffin, T0-SF, NIV2, and CoT. Notably, adding nine CoT datasets greatly enhances the models’ reasoning capabilities [149].

To promote academic research on LLMs, Taori et al. train Alpaca based on LLaMA-7B, using 52k instruction-following demonstrations generated by OpenAI’s text-davinci-003. Alpaca’s performance is very similar to that of text-davinci-003, yet it is much smaller in scale and inexpensive to train (under $600) [150]. Based on FreeLM, different strategies are adopted to train FLM-101B on 0.31T tokens, significantly reducing training costs. With just a $100k training budget, FLM-101B is comparable to GPT-3 [126] and GLM-130B [151] [152].

IV-A3 Multimodal Models

LLMs are designed to process and generate text-based information, whereas humans interact with the world through multiple senses, such as vision and hearing. To bridge this gap, multimodal LLMs (MLLMs) are designed to handle text, images, videos, audio, points, boxes, inertial measurement unit (IMU) readings, functional magnetic resonance imaging (fMRI) signals, and more, enabling them to process diverse tasks. Currently, the training of MLLMs typically involves three stages: pre-training, fine-tuning, and prompting. Training an MLLM from scratch is very costly, so most prior works on MLLMs focus on aligning existing vision models (VMs) with pre-trained LLMs and fine-tuning the alignment module to improve performance. We list some popular MLLMs in Table VII.

Many research works design different methods to align a VM with an LLM. Flamingo designs new benchmarks for few-shot visual and language tasks. It uses the Perceiver Resampler and Gated XATTN-Dense layers to align an LLM and a VM, allowing it to process and integrate sequences of visual and textual data. It demonstrates strong few-shot performance in visual question answering and close-ended tasks [153]. Different from Flamingo, MiniGPT-4, mPLUG-owl, and PandaGPT use a linear layer for alignment. MiniGPT-4 demonstrates that correctly aligning visual features with an LLM can unlock the advanced multimodal capabilities of GPT-4 [154]. Training solely on a large-scale image-text paired dataset can lead to unnatural language outputs such as repetition and fragmentation. To overcome this, MiniGPT-4 is fine-tuned on a small but higher-quality dataset with more detailed textual descriptions [155]. Meanwhile, mPLUG-owl also demonstrates impressive instruction-following and visual comprehension abilities, multi-turn dialogue, and knowledge reasoning [156]. Previous works align the LLM and VM by training them on joint image-text tasks, which could compromise the LLM’s language capabilities. CogVLM instead incorporates trainable visual expert modules between the attention and feed-forward network layers to align the LLM and VM. This allows deep integration of visual and textual elements without compromising language capabilities, because all parameters of the LLM remain fixed [157].
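
The linear-alignment recipe used by MiniGPT-4, mPLUG-owl, and PandaGPT can be summarized in a few lines: frozen vision features are mapped by a trainable linear layer into the LLM’s embedding space and prepended to the text embeddings. The NumPy sketch below uses made-up dimensions purely to show the shapes involved; it is not the actual implementation of any of these systems.

import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 1024, 4096                          # hypothetical feature sizes

patch_feats = rng.standard_normal((256, d_vision))    # frozen ViT patch features
W_proj = rng.standard_normal((d_vision, d_llm)) * 0.01  # the only trainable part
text_embeds = rng.standard_normal((32, d_llm))        # embedded text prompt tokens

visual_tokens = patch_feats @ W_proj                  # project into the LLM space
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(llm_input.shape)                                # (288, 4096): image + text tokens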

Many research works focus on enhancing MLLMs to process different input and output modalities, because previous models can only deal with a small set of modalities. PandaGPT utilizes ImageBind as an encoder that can embed data from different modalities into the same feature space [158]. Therefore, PandaGPT has strong cross-modal zero-shot capabilities, allowing it to naturally integrate multimodal inputs and perform complex multimodal tasks efficiently [159]. NExT-GPT is an end-to-end, general-purpose any-to-any MLLM. It connects Vicuna with multimodal adapters and various diffusion decoders, enabling it to perceive different inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. It leverages pre-trained encoders and decoders and requires only a few parameters to be tuned, thus reducing training costs and facilitating expansion to more modalities [160]. OneLLM aligns eight modalities through a universal encoder and projection modules, along with a step-by-step multimodal alignment process, pioneering a universal MLLM architecture. Initially, it connects the LLM and visual encoder via an image projection module. It then expands to additional modalities using a universal projection module and dynamic routing, demonstrating its scalability and generality [161]. Gemini is an MLLM pre-trained on a large multimodal dataset, capable of handling a variety of text inputs interleaved with audio, video, and images. Gemini 1.5 utilizes an MoE architecture, enabling it to process multimodal inputs up to 10M tokens in length, thus exhibiting exceptional performance across modalities [162, 163].

A few research works enhance MLLMs to understand fine-grained visual information in an image. Based on QWEN [141], Qwen-VL uses a high-quality dataset with fine-grained vision-language annotations and a larger image input resolution, achieving remarkable fine-grained visual understanding. It outperforms other models in various vision-centered benchmarks such as image description, question answering, and visual localization [141]. Shikra is the first MLLM capable of detecting specific areas in images: it takes natural language as input and outputs coordinates. This feature supports visual question answering, image description, and more specialized spatial tasks without additional complex setups or external modules [164]. To address spatial understanding issues, especially in referring and grounding tasks, Ferret designs a hybrid regional representation that jointly integrates discrete coordinates and continuous features to represent an area in an image. Considering that grid-based processing (e.g., convolution, patch attention) has difficulty handling regions with irregular shapes, a spatial-aware visual sampler is adopted, allowing it to handle various regional inputs [165]. NExT-Chat adopts the pix2emb approach instead of pix2seq [166], enabling it to process and output different positional formats, which allows for more flexible and precise object localization and representation, such as bounding boxes and segmentation masks [167].

IV-A4 MoE Models

The mixture-of-experts (MoE) model is a sparse model based on the Transformer architecture, in which parameters are divided into groups, each representing an “expert” with unique weights. During inference, only a subset of these experts is activated and contributes to the computation, as determined by a dynamic routing mechanism. This design gives MoE models higher computational efficiency, easier scalability, and faster pretraining and inference speeds.
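
A minimal sketch of the routing step (top-2 gating over per-token router logits, roughly in the spirit of the description above; the dimensions, the expert parameterization, and the softmax renormalization over the selected experts are illustrative assumptions rather than any specific model’s implementation):

import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    x: (tokens, d); router_w: (d, n_experts); experts: list of (d, d) weight matrices.
    """
    logits = x @ router_w                                   # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]           # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                                # softmax over the top-k only
        for gate, e in zip(gates, top[t]):
            out[t] += gate * (x[t] @ experts[e])            # only k experts run per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal((4, d))
router_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
print(moe_forward(x, router_w, experts).shape)   # (4, 16)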

Based on PanGu-α’s work [127], PanGu-Σ is trained using the MindSpore framework on the Ascend 910 AI processor. Random Routed Experts is utilized to expand the dense transformer model into a sparse one. The authors implement expert computation and storage separation for efficient training on 329 billion tokens, which increases training throughput by 6.3 times [168]. Mixtral 8×7B is one of the most widely known MoE models. Its architecture is similar to Mistral 7B [169], but each layer consists of eight feed-forward blocks (experts), and it is pretrained on a multilingual dataset. Two experts assigned by the routing network process each token, and their outputs are merged. Therefore, although it has 47B parameters, only 13B are active in computation. The performance of Mixtral 8×7B matches or exceeds that of LLaMA 70B [137] and GPT-3 [126], showing its broad application potential [170]. The FLAN-MoE series is another well-known family of MoE models. Shen et al. discover that after instruction tuning, MoE models demonstrate significant efficiency and performance gains compared to dense models with equivalent computational power, indicating that the MoE structure can enhance both the computational efficiency and the performance of models [148].

Based on Gemini [162], Gemini 1.5 Pro leverages the advantages of MoE, scaling the model size and enabling it to process multimodal inputs of up to 10M tokens. The model achieves desirable performance in cross-modal long-context retrieval tasks [163]. Switch Transformer extends T5 [124] to an MoE model, scaling the parameters up to 1.6T while maintaining constant computational cost per token and showing remarkable scalability and effectiveness across various NLP tasks [171].

IV-A5 Tiny Models

The LLMs mentioned above typically require high-capacity memory and computational resources, preventing their deployment on resource-constrained edge devices. Tiny models fill this gap by employing techniques such as model pruning, quantization, tensor decomposition, and knowledge distillation to compress neural networks, allowing these models to be executed on edge devices like smartphones.

TinyBERT is an optimized version of BERT designed for deployment on resource-constrained devices. The authors propose a knowledge distillation method for Transformer models that transfers BERT’s knowledge to TinyBERT, which significantly reduces model size and increases inference speed while exhibiting performance comparable to BERT [172] [173]. PanGu-π 1B Pro and PanGu-π 1.5B Pro optimize the tokenizer by removing low-frequency vocabulary to enhance the models’ representational efficiency. Tiny models suffer from severe data forgetting, which can be mitigated by multiple rounds of training, and a sample selection strategy is employed to decrease the training cost. Model configurations (e.g., depth, width, and FFN expansion rate) are adjusted to enhance performance, and initial parameters are inherited from an LLM to improve the tiny models’ generalization capabilities. The optimized models achieve significant improvements in benchmark evaluations [174]. Based on the LLaMA series, TinyLlama has 1.1B parameters and is trained on 3T tokens, showing competitive performance in various downstream tasks compared to existing open-source LMs of similar size [175]. Phi-3-mini, a 3.8B-parameter model, is small enough to be deployed on mobile devices. 3.3T tokens of high-quality training data are used to enhance the performance of this small model. Despite its small size, Phi-3-mini’s performance is comparable to that of larger models such as Mixtral 8×7B and GPT-3.5 [176].

IV-B Model Compression

IV-B1 Pruning

TABLE VIII: Pruning Methods
Type Ref. Challenge Method
LLMs Wanda [177] Pruning LLMs requires costly retraining. Prune weights with the smallest magnitudes multiplied by the corresponding input activations.
Llm-pruner [178] The training corpus of LLMs is enormous. Remove non-critical coupled structures selectively based on gradient information.
LoRAPRUNE [179] Unstructured pruning cannot work with LoRA weights. Use the weights and gradients of LoRA for importance estimation.
LoRAShear [180] The enormous size of LLM leads to computational costs. A dynamic fine-tuning scheme with dynamic data adaptors.
An et al. [181] Existing retraining-free pruning approaches require hardware support for acceleration. FLAP: fluctuation-based adaptive structured pruning.
Shao et al. [182] Existing pruning methods require extensive retraining of pruned models. Allocate sparsity adaptively based on sensitivity.
Compresso [183] One-shot structured pruning leads to performance decline. Learn optimal pruning decisions during the training process.
SHEARED LLAMA [184] Training smaller yet powerful LLMs from scratch is costly. Targeted structured pruning and dynamic batch loading.
Ji et al. [185] The manual design of pruning features leads to a complex optimization pipeline. Train a non-neural model as an accuracy predictor.
GBLM-Pruner [186] Prior approaches overlooked informative gradients derived from pretrained LLMs. Harness normalized gradients from calibration samples to determine pruning metric.
Anagnostidis et al. [187] Autoregressive transformers are hard to scale to long sequences. Employ a learnable mechanism to determine when to drop uninformative tokens from the context.
ZipLM [188] Generalize to different pruning settings. Identify and remove components with the worst loss-runtime trade-off iteratively.
Audio DPHuBERT [189] High inference cost hinders speech model deployment. Joint distillation and pruning for speech model compression.
Peng et al.  [190] The frontend network of speech models has a large computational cost. Design three task-specific structured pruning methods for heterogeneous networks.
Vision & Multi-modal MoPE-CLIP [191] Uni-modal compression metrics lead to limited performance and costly mask-search processes. Evaluate the importance of modules by performance decline on cross-modal tasks.
X-pruner [192] Overlooking the relationship between network units and target classes leads to inferior model performance. An explainable pruning framework considering the explainability of the pruning criterion.
Fang et al. [193] Generative models have large computational overheads. Disregard non-contributory diffusion steps and ensemble informative gradients to identify important weights.
UP-ViT [194] Limited research on model pruning for ViTs. Prune the channels in ViTs in a unified manner.
CAP [195] It is hard to prune highly accurate ViTs. A theoretically-justified pruner and an efficient finetuning procedure.

LLMs demonstrate remarkable capabilities in various tasks, including text generation, translation, and sentiment analysis. However, their deployment in real-world applications is often hindered by their huge size and computational requirements. Pruning, a technique aimed at lowering the computational cost of LLMs while maintaining their performance, has attracted significant attention from researchers. Pruning selectively removes weights or neurons from a pre-trained model to reduce its size and computational requirements. However, pruning LLMs poses unique challenges due to their complex architectures and the need to preserve specific language patterns during compression. First, LLMs are extremely large, often consisting of billions of parameters, which leads to significant computational expense during both training and inference. Second, existing pruning methods often necessitate retraining or fine-tuning, resulting in additional computational overhead. Moreover, pruning techniques should retain performance across various NLP tasks, such as language modeling, text classification, and machine translation. Achieving explainability in pruned models is also essential for understanding their decision-making processes. Furthermore, pruning techniques must adapt to different tasks with varied language data: language models trained on diverse datasets may exhibit significant differences and biases, requiring pruning methods to handle such variations while maintaining model performance.

Recent advances in pruning techniques have addressed several of these challenges. These methods typically simplify the pruning pipeline or enhance its compatibility: for instance, they eliminate the need for retraining or weight updates, avoid task-specific compression, and reduce dependence on the original training corpus. Alternatively, they combine the strengths of several pruning methods or make pruning compatible with particular hardware platforms. New pruning criteria, such as sensitivity-based sparsity allocation, are also introduced to improve effectiveness and flexibility.

Dynamic context pruning [187] selectively removes contextual information in autoregressive transformers during inference to improve scalability. This approach reduces memory and computational requirements without significantly sacrificing model performance. It also aids interpretability, since uninformative tokens are dynamically pruned by a learnable mechanism, exposing part of the model’s decision-making process. Joint distillation and pruning methods, such as DPHuBERT [189], achieve task-agnostic compression by training a student model to emulate a teacher model while integrating structured pruning techniques; the method is versatile and applicable to various self-supervised (SSL) speech models. By leveraging sensitivity analysis, these methods identify and prune redundant weights or neurons to obtain compressed models with reduced computational costs and memory requirements.

Wanda [177] aims to induce sparsity in LLMs without requiring retraining or extensive computational resources. Wanda utilizes a simple pruning metric that combines each weight’s magnitude with the norm of the corresponding input activation. LLM-Pruner [178] removes non-critical coupled structures according to gradient information; it is an automatic structural pruning framework that reduces reliance on the original training data. LoRAPrune [179] considers neural network pruning together with LoRA, introducing a criterion for weight importance estimation and integrating parameter-efficient fine-tuning, demonstrating superior compression rates and reduced memory usage over existing methods on LLaMA models. An et al. [181] propose FLAP, which introduces structured importance metrics and compensation mechanisms to mitigate performance loss. Shao et al. [182] introduce a pruning method based on mixed Hessian-sensitive sparsity to achieve at least 50% sparsity in LLMs without retraining; it reduces pruning errors and maintains overall sparsity levels by adaptively allocating sparsity based on sensitivity. Compresso [183] incorporates techniques such as LoRA and L0 regularization to mitigate training expenses and data collection challenges, and introduces a collaborative prompt to enhance interaction between the LLM and the pruning algorithm, leading to notable performance improvements. Sheared LLaMA [184] prunes LLaMA 2 from 7B to 1.3B and 2.7B parameters, outperforming equivalent-sized models while needing only 3% of the compute for training. It effectively addresses the challenges of optimizing pruned architectures and continuing pre-training. Ji et al. [185] employ gradient boosting decision trees (GBDT) as an accuracy predictor to guide the pruning process based on specific performance requirements. This predictor is then utilized to narrow the search space and select the most suitable pruned model. Das et al. [186] introduce GBLM-Pruner, a sparsity-centric pruning method for billion-parameter LLMs that operates without retraining and surpasses competitors such as SparseGPT and Wanda across benchmarks. It unveils structural patterns in unstructured pruning of LLMs while maintaining simplicity and efficiency. Anagnostidis et al. [187] employ an adaptable mechanism to identify and eliminate irrelevant tokens from the context, thereby reducing memory and computational demands during inference. ZipLM [188] achieves advanced performance across different compression settings and model categories. Its structured pruning algorithm considers both local and global correlations, ensuring precise pruning, and is enhanced with a layer-wise token-level distillation technique.
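
For concreteness, the core of Wanda’s metric can be written in a few lines: each weight is scored by its magnitude times the norm of the corresponding input activation, and the lowest-scoring weights in each output row are zeroed without any retraining. The sketch below is a simplified NumPy rendering of that idea (tensor shapes and the 50% sparsity target are illustrative), not the reference implementation.

import numpy as np

def wanda_style_prune(W, X_calib, sparsity=0.5):
    """Zero the lowest-scoring weights per output row, score = |W| * ||x_j||_2.

    W: (out_features, in_features) weight matrix.
    X_calib: (samples, in_features) calibration activations feeding this layer.
    """
    act_norm = np.linalg.norm(X_calib, axis=0)        # per-input-channel activation norm
    score = np.abs(W) * act_norm                      # broadcast over output rows
    k = int(W.shape[1] * sparsity)                    # weights to remove per row
    prune_idx = np.argsort(score, axis=1)[:, :k]      # lowest scores in each row
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, prune_idx, 0.0, axis=1)
    return W_pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))
X = rng.standard_normal((128, 64))
print((wanda_style_prune(W, X) == 0).mean())          # about half the weights removed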

The above research works describe pruning methods for LLMs. Next, we enumerate some pruning methods aimed at audio, visual, and multi-modal models. Pada [196] introduces cross-domain task-aware pruning, a novel pruning paradigm that utilizes fine-tuned out-of-domain models to enhance adaptation to the target domain. Peng et al. [190] propose heterogeneous joint pruning, which prunes both the CNN and Transformer components. MoPE-CLIP [191] accurately evaluates the importance of CLIP modules by assessing the performance decline on cross-modal tasks with model pruning. MoPE-CLIP effectively harnesses knowledge from the teacher model, substantially reducing pre-training costs and generating competitive task-specific models. X-pruner [192] proposes an explainable pruning framework that quantifies each unit’s contribution to class prediction using explainability-aware masks. Applied to representative transformer models such as DeiT and Swin Transformer, X-Pruner demonstrates superior performance with reduced computational costs and minimal performance degradation. Fang et al. [193] present Diff-Pruning, which utilizes a Taylor expansion over pruned timesteps to identify crucial weights and alleviate the computational burden. This approach achieves a notable 50% reduction in FLOPs with a mere 10% to 20% of the original training resources. Yu et al. [194] introduce UP-ViTs, a unified pruning framework tailored for ViTs, which focuses on structurally pruning ViTs and their variants while ensuring consistency in model structure. UP-ViTs achieves high accuracy in compressed models, surpassing previous ViTs and their variants. CAP [195] effectively and efficiently manages weight correlations throughout the pruning process. It is compatible with structured pruning and quantization, facilitating practical speedups without compromising accuracy. CAP achieves high sparsity levels with minimal impact on accuracy.

IV-B2 Quantization

TABLE IX: Quantization Methods
Type Ref. Challenge Method
W8A8 Smoothquant [197] The presence of outliers in activation. Outlier smoothing and per-channel scaling transformation.
RPTQ [198] Varying ranges across channels. Rearrange channels and quantize them in clusters.
LoftQ [199] Performance differences between two fine-tuning methods. Find a proper low-rank initialization for LoRA fine-tuning.
Outlier suppression+ [200] Existence of detrimental outliers in activations. Channel-wise shifting for asymmetry and channel-wise scaling for concentration.
FPTQ [201] The W4A8 faces notorious performance degradation. Layerwise activation quantization strategies.
Low-bit weight-only OWQ [202] The presence of activation outliers. Prioritize structured weights sensitive to quantization in high-precision.
AWQ [203] The significant model sizes of modern LLMs. Search for optimal per-channel scaling by observing the activation.
Zhang et al. [204] The superiority of low-bit Integer versus Floating Point formats is unclear. Select the optimal format on a layer-wise basis.
Omniquant [205] Hand-crafted quantization parameters lead to low performance. Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET).
IntactKV [206] Previous quantization methods compromise LLM performance. Construct KV cache of pivot tokens from full-precision model.
Kim et al. [207] Previous quantization methods are designed for inference. Update solely the quantization scales during fine-tuning.
QLLM [208] Activation outliers in particular channels. Reallocate the magnitude of outliers to other channels.
QAT LLM-QAT [209] Post-training quantization methods cannot perform well at lower bit precision. A data-free distillation method leveraging outputs generated by a pre-trained model.
QuIP [210] High memory usage of LLMs. Use random orthogonal matrices to guarantee weight and Hessian incoherence.
Norm tweaking [211] Lower-bit quantization leads to severe performance degradation. Update normalization weights with calibration data generation and channel-wise constraints.
Zeroquant-v2 [212] Lack of a systematic examination of various quantization schemes. An evaluation and comparison of existing quantization methods.
QA-LORA [213] The imbalance of quantization and adaptation during fine-tuning. Use group-wise operators.
Int2.1 [214] Errors induced by the quantization process. Add LoRA layers to bring the quantized model close to its floating-point counterpart.

W8A8: 8-bit weights and activations. QAT: Quantization-Aware Training.

LLMs have brought NLP into a new era, showing great performance across a wide range of tasks, from translation to text generation. However, the widespread deployment of these models presents severe challenges, such as massive memory requirements, computational overhead, and huge model sizes. To alleviate these issues, researchers have developed different quantization techniques tailored specifically for LLMs. Quantization reduces the bit-width of model parameters and activations, thereby diminishing the memory footprint and enhancing inference speed, while researchers must strike a balance between compression ratios and task-specific performance metrics. Key concepts in quantization include post-training quantization (PTQ), quantization-aware training (QAT), and methodologies for managing weight and activation quantization.

Despite the promise of quantization, many challenges persist in its application to LLMs. Achieving high compression rates while preserving task performance is difficult, which requires balancing quantization ratios against task-specific accuracy. Quantization also introduces errors, particularly at lower bit precisions, which can significantly impair model accuracy. Moreover, quantization methods should be compatible with a variety of hardware, such as different edge devices.

In recent years, the research community has made significant efforts to address the challenges associated with quantizing LLMs. Quantization methods can be divided into three categories: 8-bit weight and 8-bit activation (W8A8) quantization, low-bit weight-only quantization, and quantization-aware training. The W8A8 approach reduces both weights and activations to 8-bit formats. SmoothQuant [197] achieves a balance between accuracy and hardware efficiency by handling activation outliers and simplifying quantization complexity; enabling W8A8 quantization, it delivers up to 1.56× speedup and 2× memory reduction without sacrificing accuracy. RPTQ [198] saves up to 80% memory while maintaining high accuracy when quantizing OPT-175B. LoftQ [199] quantizes LLMs while identifying an appropriate low-rank initialization for LoRA fine-tuning, thereby enhancing generalization in downstream tasks. Outlier Suppression+ [200] presents efficient techniques for determining optimal shifting and scaling values; its performance is comparable to floating-point precision and establishes new benchmark results for 4-bit BERT models. FPTQ [201] introduces layerwise activation quantization strategies, including a novel logarithmic equalization technique, to enhance performance. FPTQ achieves outstanding W4A8 quantized performance without the need for additional fine-tuning, thereby simplifying the production of LLMs. QLLM [208] is an efficient low-bitwidth quantization method with a channel reassembly technique for handling activation outliers. It also includes an adaptive strategy to determine the optimal number of disassembled channels and an efficient error correction mechanism with low-rank parameters.
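
The core transformation behind SmoothQuant can be illustrated compactly: a per-channel scale migrates the quantization difficulty from activation outliers into the weights while leaving the layer’s output mathematically unchanged. The sketch below follows the scale formula described in the SmoothQuant paper, $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$, but uses toy tensors and omits the subsequent 8-bit quantization step; it is an illustration, not the reference code.

import numpy as np

def smooth(X, W, alpha=0.5):
    """Rescale activations and weights channel-wise so (X/s) @ (diag(s) W) == X @ W."""
    act_max = np.abs(X).max(axis=0)            # per-channel activation range
    w_max = np.abs(W).max(axis=1)              # per-input-channel weight range
    s = act_max**alpha / w_max**(1 - alpha)    # smoothing factors
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8)); X[:, 3] *= 50.0      # channel 3 has outliers
W = rng.standard_normal((8, 4))
X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)                   # layer output preserved
print(np.abs(X_s).max(axis=0).round(2))                # outlier channel is tamed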

Low-bit weight-only quantization methods quantize LLM weights into low-bit integers, usually 4 bits or fewer. OWQ [202] prioritizes critical weights for high-precision storage while quantizing the remaining dense weights, achieving desirable performance by significantly reducing quantization error. AWQ [203] utilizes activation information to identify significant weights and optimizes per-channel scaling to preserve these important weights during quantization. AWQ surpasses existing methods on various language modeling and domain-specific benchmarks, exhibiting outstanding quantization performance for instruction-tuned and multi-modal LLMs. The Mixture of Formats Quantization (MoFQ) [204] selects the optimal format on a layer-wise basis, performing well in weight-only and weight-activation post-training quantization scenarios, and incurs no hardware overhead compared to INT/FP-only quantization. Omniquant [205] efficiently optimizes quantization parameters while preserving the original full-precision weights with a limited number of learnable parameters, performing well in low-bit scenarios. IntactKV [206] uses the full-precision model to generate the KV cache of pivot tokens, thereby effectively reducing quantization error and achieving lossless weight-only INT4 quantization. PEQA [207] combines parameter-efficient fine-tuning with quantized LLMs. It updates only the quantization scales, minimizing memory overhead and model size. PEQA-tuned LLMs exhibit competitive performance in language modeling, few-shot learning, and comprehension, even at sub-4-bit precision.
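
As a baseline for what these weight-only methods improve upon, the following sketch performs plain round-to-nearest 4-bit quantization with per-group scales; this is a generic illustration with a hypothetical group size, not the AWQ or OWQ algorithm, both of which add importance-aware treatment on top of this basic scheme.

import numpy as np

def quantize_rtn_int4(w_group):
    """Symmetric round-to-nearest INT4 quantization of one weight group."""
    scale = np.abs(w_group).max() / 7.0              # symmetric INT4 range: [-7, 7]
    q = np.clip(np.round(w_group / scale), -7, 7)
    return q.astype(np.int8), scale

def quantize_weight(W, group_size=128):
    """Quantize a flattened weight matrix group by group; return a dequantized copy."""
    flat = W.reshape(-1)
    deq = np.empty_like(flat)
    for start in range(0, flat.size, group_size):
        q, scale = quantize_rtn_int4(flat[start:start + group_size])
        deq[start:start + group_size] = q * scale    # dequantize to measure the error
    return deq.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
W_deq = quantize_weight(W)
print(float(np.abs(W - W_deq).mean()))               # mean absolute quantization error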

Some research works propose quantization-aware training (QAT) methods that simulate quantization during training, enabling the model to adapt to lower bit-widths without losing accuracy. LLM-QAT [209] preserves the original output distribution through data-free distillation, facilitating the quantization of any generative model; it can quantize LLMs to 4-bit weights and 6-bit activations. Norm tweaking [211] redistributes the quantized activations to match their floating-point values, achieving high-performance quantization for general LLMs. Zeroquant-v2 [212] explores PTQ to reduce memory and computational costs in LLMs, introducing Low-Rank Compensation (LoRC) to maintain model quality in low-bit settings. The QA-LoRA [213] algorithm employs group-wise operators to increase quantization flexibility while reducing adaptation complexity. This enables efficient fine-tuning by quantizing the weights of LLMs and integrating them into a quantized model without sacrificing accuracy. Int2.1 [214] incorporates an Extremely Memory-Efficient Fine-Tuning (EMEF) framework utilizing LoRA alongside an error correction framework (LREC) to minimize quantization errors; memory requirements are reduced by up to 5.6×, enabling fine-tuning on consumer laptops. Michaud et al. [215] elucidate the power-law drop-off of loss with model and data size and the emergence of new capabilities with scale. They propose a method for automatically discovering quanta in language models and find that the frequency at which these quanta are used follows a power law. MobileBERT [216] assesses the performance of converted and quantized models on edge devices for English tweet reputation analysis, reducing accuracy loss with smaller footprints.

These approaches represent tremendous advances in mitigating the challenges associated with quantizing LLMs, preparing for future more efficient and scalable language models that can be seamlessly deployed across diverse applications and platforms.

IV-B3 Knowledge Distillation

TABLE X: Distillation Methods
Type Ref. Challenge Method
Language Models Hsieh et al. [217] Finetuning and distillation require large amounts of training data. Train small models in a multi-task system by extracting LLM rationales as extra supervision.
ZEPHYR [218] Models from distilled supervised fine-tuning do not respond well to natural prompts. Apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
Lion [219] Prior approaches overlook incorporating reciprocal feedback. An adversarial cycle including imitation, discrimination, and generation.
PaD [220] Synthetic Chain-of-Thought (CoT) data often contains faulty reasoning. Utilize the reasoning program to substitute the CoT, allowing automated error checking of synthetic data.
DISTILLSPEC [221] Identify a well-aligned compact draft model with a target model. Use knowledge distillation to better align the draft model with the target model for speculative inference.
Latif et al. [222] Deploy large models on constrained devices. Use prediction probabilities of LLM as soft labels to train smaller student models.
Less is more [223] Student models are often under-fitted. Match hidden features of student and teacher by task-aware filters for every layer.
HOMODISTIL [224] Student models cannot produce predictions that match the results of teacher models over massive training data. A task-agnostic distillation approach equipped with iterative pruning.
SCoTD [225] Only large models (beyond 50B parameters) can gain from the chain of thought. Sample in a larger teacher model to generate a smaller student model.
EvoKD [226] Lack of exploring LLMs’ potential to comprehend the target task and acquire valuable knowledge. Interactively enhance data generation using LLMs with active learning.
MiniDisc [227] Teacher assistant-based distillation requires numerous trials to find the optimal teacher assistant. Introduce a new λ-tradeoff metric that quantifies the optimality of the teacher assistant.
SLaM [228] Noise in the teacher’s pseudo-labels degrades student performance. Student-label mixing: knowledge distillation with unlabeled examples.
UniversalNER [229] Student models trail original LLMs by large margins in downstream applications. Targeted distillation with mission-focused instruction tuning to train student models.
SCOTT [230] Generated rationales of LLMs are seldom consistent with the predictions or faithfully justify the decisions. Train counterfactual inference student model by teacher-provided concepts.
Visual Models DETRDistill [231] Knowledge distillation methods designed for convolution-based detectors may not be directly applicable to Transformers. A Hungarian-matching logits distillation, a target-aware feature distillation, and a query-prior assignment distillation.
AdaAD [232] Student models are more likely to encounter adversarial attacks at the edge. Adaptively searches for optimal match points in the inner optimization.
DIME-FM [233] The difficulty of training a small custom vision-language FM for resource-limited applications. Transfer knowledge from large vision-language FMs to compact, personalized foundation models.

LLMs have revolutionized NLP and other AI domains, yet their deployment poses challenges because of immense computational requirements and storage costs. To resolve these difficulties, researchers have explored knowledge distillation techniques to compress LLMs into smaller, more deployable models while retaining their performance. These techniques leverage teacher-student paradigms, in which a large teacher model transfers its knowledge to a smaller student model, often by distilling its predictions on task-specific data.
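
The classic soft-label objective underlying most of these teacher-student schemes can be written in a few lines: the student is trained to match the teacher’s temperature-softened output distribution in addition to the ground-truth label. The NumPy sketch below shows only the loss computation (the temperature, mixing weight, and logits are illustrative; model definitions and gradient updates are omitted).

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * cross-entropy."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))).sum(axis=-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-9).mean()
    return alpha * kl + (1 - alpha) * ce

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))   # hypothetical logits over 10 classes
student = rng.standard_normal((4, 10))
labels = np.array([1, 3, 3, 7])
print(distillation_loss(student, teacher, labels))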

However, the optimization of large model distillation faces some challenges. One major challenge is the significant capacity gap between teacher and student models, leading to suboptimal distillation performance. Additionally, noisy pseudo-labels generated by the teacher model may negatively influence the distillation process. LLMs possess the capability for chain-of-thought reasoning, and it is crucial to transfer these reasoning abilities to smaller models through distillation.

Researchers have put forward innovative methods to improve knowledge distillation with large FMs. Distilling step-by-step [217] is a multi-task framework that utilizes LLM rationales as additional supervision; it performs better with fewer training examples and enhances performance at smaller model sizes. Zephyr [218] employs distilled direct preference optimization (dDPO) to enhance user intent alignment in chat models, leveraging preference data from AI feedback, and achieves significant improvements in aligning 7B models with user intent in chat tasks. Lion [219] uses a three-stage adversarial loop of imitation, discrimination, and generation to shift knowledge from a sophisticated, large LLM to a compact, open-source one. Program-aided Distillation (PaD) [220] utilizes reasoning programs to check errors in synthetic data for distillation, enabling small models to outperform large LLMs with significantly fewer parameters and less training data. DistillSpec [221] uses knowledge distillation to align a compact draft model with a larger target model in speculative decoding, achieving a substantial reduction in latency with minimal performance drop. TED [223] aligns hidden representations and selects pertinent knowledge through task-aware filters, achieving notable advancements in compressing language models. HomoDistil [224] mitigates large prediction discrepancies between teacher and student models: it initializes the student model from the teacher and gradually prunes neurons until reaching the desired width, ensuring consistent knowledge transfer with minimal prediction discrepancies throughout the distillation process. Li et al. [225] introduce Symbolic Chain-of-Thought Distillation (SCoTD), which equips smaller language models with the capability for chain-of-thought prompting, improving performance in supervised and few-shot settings. Liu et al. [226] present EvoKD, which leverages active learning to enhance data generation with LLMs, thereby improving the capabilities of smaller domain (student) models; EvoKD integrates evolving knowledge distillation and active learning to optimize model training and distill informative knowledge effectively. Zhang et al. [227] introduce the Minimal Distillation Schedule (MiniDisc), which aims to identify an optimal teacher assistant in a single trial for extreme compression scenarios, such as compressing to 5% scale. SLaM [228] incorporates the teacher model’s label noise into the student’s loss function to improve performance. UniversalNER [229] trains student models that excel at open named entity recognition, yielding more cost-efficient student models. SCOTT [230] utilizes contrastive decoding to extract rationales that support gold answers from the teacher model and employs counterfactual reasoning to ensure faithful distillation in the student model; it generates more faithful CoT rationales than baselines while maintaining comparable end-task performance. Li et al. [234] instruct LLMs to transform structural triplets into context-rich segments and introduce auxiliary tasks for smaller knowledge graph completion (KGC) models, enhancing KGC models by leveraging LLMs. Hu et al. [235] propose Linguistic Graph Knowledge Distillation (LinguGKD), which enhances the predictive accuracy and convergence rate of GNNs by distilling knowledge from LLMs without requiring additional data or model parameters. Marrie et al. [236] propose leveraging Mixup based on stable diffusion as a data augmentation strategy to enhance distillation. Their findings demonstrate the effectiveness of linear probing, task-specific distillation, and the successful use of diffusion models for data augmentation without class information, offering insights for improved distillation techniques.

Some research works design knowledge distillation methods for visual tasks, such as object detection [237]. DETRDistill [231] includes Hungarian-matching logits distillation, target-aware feature distillation, and query-prior assignment distillation; it enhances various methods by more than 2.0 mAP, often surpassing their teacher models. Huang et al. [232] address the vulnerability of student models to adversarial attacks at the edge by introducing AdaAD, which incorporates the teacher model in knowledge optimization and significantly enhances the student model’s accuracy and adversarial robustness. DIME-FM [233] transfers knowledge from large vision-language foundation models (VLFMs) to smaller models with minimal data requirements and employs a novel distillation mechanism that matches image-to-sentence similarity, ensuring transferability and robustness. It efficiently selects visually grounded sentences to construct a distillation text corpus.

IV-C Model adaptation

Figure 8: The illustration of model adaptation methods: (a) model selection dynamically selects a suitable model for inference; (b) model iteration sequentially runs models and decides when to return results; (c) speculative decoding uses a small model for text generation and a large model for verification.
TABLE XI: The model adaptation methods.
Type | Ref. | Selector/Exit | Target | Scenario
Model Selection | INFaaS [85] | A greedy heuristic | Accuracy & Latency & Resource cost | Cloud
Model Selection | EdgeAdaptor [238] | Online optimization and approximate optimization | Accuracy & Latency & Resource cost | Edge
Model Selection | JellyBean [114] | Profiling and beam search | Accuracy & Throughput | Edge/Cloud
Model Selection | STI [239] | A greedy method | Accuracy & Latency | Edge
Model Iteration | Tabi [240] | Confidence of logits | Accuracy & Latency | Cloud
Model Iteration | CATs [241] | A meta consistency classifier | Accuracy & Latency | Edge/Cloud

Model adaptation in edge-cloud systems aims to dynamically select and possibly adapt ML models for inference tasks based on the current execution context, such as available computational resources, network conditions, and specific requirements of the task. Dynamic model selection and adaptation enable elastic acceleration in edge-cloud systems. By exploring the trade-off between performance, efficiency, privacy, and cost, this approach significantly enhances the feasibility and effectiveness of AI applications across different edge-cloud scenarios. While model adaptation for inference in edge-cloud systems brings substantial benefits, it also introduces several challenges. First, edge devices, such as IoT sensors or smartphones, often have limited computational power, memory, and energy resources. Adapting and running complex FMs within these constraints without compromising performance or accuracy is a significant challenge. Second, edge environments can be highly variable, with changes in network conditions, device capabilities, and application contexts. Dynamically adapting models to these changing conditions without human intervention requires real-time monitoring and long-term decision-making algorithms that can accurately assess the current environment and predict the best model or configuration. Third, quickly finding a sweet spot between inference speed (latency) and model accuracy is a challenge in different applications.

As shown in Figure 8, we categorize research works on model adaptation into three types. 1) Model Selection. The most prevalent method is model selection, which dynamically selects a suitable model for inference in a serving system. Research works design a model selector by considering various system characteristics. The first factor is hardware resources, which fundamentally determine the computational capacity of the system: lightweight models are generally deployed on edge devices, while stronger models are deployed in cloud centers. The second factor is the input: if the images are complex and span different domains, or if the language task is challenging, a larger model is needed to maintain accuracy. The third factor is the service demands of users in different applications. Users' requirements for accuracy and latency can vary significantly based on their use cases. For instance, users of entertainment devices such as virtual reality headsets may prioritize low-latency AI services, even at the cost of a slight decrease in accuracy [242]. Conversely, users in the medical industry require impeccable accuracy, even if it results in longer waiting times [243]. The selector takes these factors as input and outputs a model selection strategy based on an optimization algorithm or a heuristic greedy method. A larger model delivers more accurate results at the expense of increased latency, while a smaller model provides faster responses but with potentially reduced accuracy; the model selector should find a sweet spot between latency and accuracy under different serving-system settings. 2) Model Iteration. Model selection must estimate both accuracy and latency in advance and executes only one model per request, so directly choosing a small model may produce inaccurate results. A few works instead execute model inference iteratively, starting from a small model and progressively moving to a larger model, and decide when to return results based on the prediction probability. Existing methods rely on the entropy of this probability: a higher entropy value indicates greater uncertainty and necessitates forwarding the request to a larger model, as sketched below. 3) Speculative Decoding. Speculative decoding has emerged as an efficient and widely adopted technique in LLM inference to mitigate the sequential bottleneck of token-by-token decoding. It uses a smaller model to generate a sequence of candidate tokens, which are then verified simultaneously by a larger model, resulting in improved performance. The benefits of speculative decoding stem from two aspects: (1) small models can quickly generate text sequences, and (2) the large model can verify the generated sequences as a batch, improving throughput by leveraging hardware resources more effectively.
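To make the model-iteration idea concrete, here is a minimal sketch of an entropy-based escalation rule, assuming hypothetical `small_model` and `large_model` callables that return class-probability vectors; real systems such as Tabi or CATs use calibrated confidence scores or meta classifiers rather than this toy threshold.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector; higher means less certain."""
    return float(-np.sum(probs * np.log(probs + eps)))

def iterative_inference(x, small_model, large_model, threshold=0.5):
    """Model-iteration sketch: try the small model first and escalate to
    the large model only when the prediction is too uncertain."""
    probs = small_model(x)
    if entropy(probs) <= threshold:      # confident enough: return early
        return probs, "small"
    return large_model(x), "large"       # otherwise fall back to the large model

# Toy models with fixed outputs, only for illustration.
small = lambda x: np.array([0.9, 0.05, 0.05])   # confident prediction
large = lambda x: np.array([0.6, 0.3, 0.1])
print(iterative_inference(None, small, large))
```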

We summarize recent works on model adaptation in Table XI. INFaaS is an automated model-less serving system that generates model variants optimized along different dimensions and automatically selects the most appropriate variant for each query based on performance, cost, and accuracy objectives [85]. EdgeAdaptor is designed to efficiently manage the trade-offs between inference accuracy, latency, and resource cost for edge-based DNN inference services; it jointly optimizes application configuration, DNN model selection, and edge resource provisioning in response to fluctuating demand and system conditions [238]. JellyBean is a system for optimizing and serving machine learning inference workflows across heterogeneous computing infrastructures [114]. For each ML operator within a workflow, JellyBean selects a model that meets the accuracy requirement at the lowest possible cost; it considers the interaction between models to estimate the overall workflow accuracy and uses a beam search algorithm to explore the space of possible model configurations efficiently. STI is an on-device inference system with two novel techniques: model sharding and elastic pipeline planning with a preload buffer [239]. STI manages model parameters as independently tunable shards, profiles their importance to accuracy, and manages them on disk. Its elastic pipeline planning module uses a small preload buffer to initiate execution without delay, selecting and assembling shards according to their importance to maximize inference accuracy within resource constraints. Tabi is an inference system that employs a multi-level inference engine to serve queries with smaller models by default and only switches to more computationally expensive LLMs for demanding queries [240]. Tabi uses a calibrated confidence score to decide whether the results from smaller models are accurate or whether a query should be rerouted to a larger model. CATs trains additional prediction heads on intermediate layers of a transformer model and uses a meta consistency classifier to dynamically decide when to stop computation for each input, based on a unique extension of conformal prediction [241]. Leviathan et al. first introduce speculative decoding. Let $p(x_t|x_{<t})$ be the distribution of a target model $M_p$ and $q(x_t|x_{<t})$ be the distribution of a draft model $M_q$. A token $x$ is first sampled from $q(x)$ and kept if $q(x)\leq p(x)$. If $q(x)>p(x)$, the sample is rejected with probability $1-\frac{p(x)}{q(x)}$ and $x$ is resampled from the adjusted distribution $p^{\prime}(x)=\mathrm{norm}(\max(0,p(x)-q(x)))$, which guarantees that accepted outputs follow the target distribution. LLMCad designs an on-device system with three novel techniques: constructing a token tree for broader candidate token pathways, a self-adjusting fallback strategy for error correction, and speculative token generation during the verification process to maintain efficiency [52]. SpecInfer introduces an approach in which small speculative models predict the LLM's outputs and organize these predictions into a token tree, with each node representing a candidate token sequence [51]. The system then verifies the correctness of all candidate sequences in parallel against the LLM, significantly reducing end-to-end latency and computational requirements while maintaining model quality.
Sequoia employs a dynamic programming algorithm to determine the optimal speculative token tree structure, enabling it to scale with the size of the speculation budget [244]. Sequoia also features a hardware-aware tree optimizer that selects the optimal token tree size and depth based on the available resources to maximize speculative decoding performance. Minions employs multiple small speculative models to predict the output of an LLM, using a majority-voted approach to improve inference performance [245]. It also dynamically adjusts the speculation length of small models, optimizing the trade-off between the number of tokens speculated by small models and the verification cost by the LLM.
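The acceptance rule introduced by Leviathan et al. above can be written compactly. Below is a minimal NumPy sketch of a single-token speculative sampling step for given target and draft distributions $p$ and $q$; the toy vocabulary and distributions are illustrative, and production systems such as SpecInfer and Sequoia verify whole token trees in parallel rather than one token at a time.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    """One token of speculative sampling: draw x from the draft distribution
    q, accept it with probability min(1, p[x]/q[x]), otherwise resample from
    the residual distribution norm(max(0, p - q)). The accepted output is
    distributed exactly according to p."""
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                              # draft token accepted
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)     # corrected sample from p'

# Toy target/draft distributions over a 4-token vocabulary.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
print([speculative_step(p, q) for _ in range(5)])
```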

IV-D Token adaptation

TABLE XII: The token reduction methods.
Type | Reduction Method | Ref. | Highlight | Model
Language | Token Pruning | FastGen [246] | Pruning the KV cache based on special tokens, punctuation, locality, and frequency. | LLaMA
Language | Token Pruning | StreamingLLM [247] | 1. Attention sink mechanism: a small set of initial tokens is important. 2. Placeholder token as a dedicated attention sink. | Llama-2, MPT, Pythia, and Falcon
Language | Token Pruning | H2O [248] | KV cache pruning based on accumulated attention scores. | OPT, LLaMA, and GPT-NeoX
Language | Token Pruning | ToP [249] | 1. Ranking-distilled token distillation. 2. A two-tier binary masking system. | BERT
Language | Token Pruning | LLMLingua [250] | 1. Budget controller: allocates different compression ratios. 2. Token-level iterative compression: evaluates the importance of each token and iteratively removes tokens. | GPT-3.5, Claude-v1.3
Language | Token Pruning | LongLLMLingua [251] | A post-compression subsequence recovery strategy. | Llama-2
Language | Token Pruning | Selective Context [252] | Pruning tokens on lexical units based on self-information. | GPT-3.5, GPT-4, LLaMA, Vicuna
Language | Token Pruning | LTP [253] | 1. Learnable threshold-based token pruning. 2. Differentiable soft binarized mask. | BERT
Language | Token Pruning | AdapLeR [254] | 1. Contribution predictor. 2. Gradient-based saliency method to train the predictor. 3. A soft-removal function for masking. | BERT
Language | Token Pruning | LengthDrop [255] | 1. Randomly generates a length configuration during training. 2. Multi-objective evolutionary search to find an optimal length. 3. Drop-and-restore process. | BERT
Language | Token Pruning | PoWER-BERT [256] | 1. Determining word-vector significance with attention. 2. Learning how many word vectors to eliminate. | BERT
Language | Token Summary | ICAE [257] | The encoder is adapted from an LLM with LoRA to encode a long context into a small number of memory slots. | LLaMA
Language | Token Summary | Gist [258] | Compresses prompts into smaller sets of "gist" tokens for language models. | LLaMA, FLAN-T5
Language | Token Summary | AutoCompressor [259] | Adapts language models to process long documents by compressing them into compact summary vectors. | Llama-2
Vision | Token Pruning | STAR [260] | 1. Dynamic evaluation of intra-layer patch importance. 2. Offline evaluation of inter-layer patch importance. | DeiT
Vision | Token Pruning | METR [261] | Integrates a multi-exit architecture into ViTs to encourage the model to prioritize task-relevant information from the initial stages. | ViT, DeiT
Vision | Token Pruning | HeatViT [262] | 1. An attention-based selector for pruning less informative tokens. 2. Hardware optimization strategies such as 8-bit quantization. | DeiT, LV-ViT
Vision | Token Pruning | Slimming [263] | Dynamically prunes less informative patches with the guidance of the last layer. | ViT
Vision | Token Pruning | DynamicViT [264] | Several trainable prediction modules determine the tokens to be pruned. | DeiT, LV-ViT
Vision | Token Merging | ToMe [265] | A simple, lightweight matching algorithm gradually combines similar tokens within a transformer. | ViT, DeiT
Vision | Token Merging | EViT [266] | Computes token attentiveness between image tokens and the class token, preserving attentive tokens and fusing inattentive tokens. | DeiT, LV-ViT
Vision | Token Pruning & Token Merging | DiffRate [267] | Automatic learning of different compression rates for different layers. | ViT, DeiT
Vision | Token Pruning & Token Merging | Beyond [268] | An efficient token decoupling and merging method that considers both token importance and diversity. | DeiT, LV-ViT
Vision | Token Pruning & Token Merging | TPS [269] | 1. Splits tokens into reserved and pruned subsets. 2. Merges pruned tokens into the reserved tokens. | DeiT, ViT
Multi-modal & Video | Token Pruning & Token Merging | PuMer [270] | A token reduction strategy combining text-informed pruning and modality-aware merging to reduce tokens from both input images and text. | ViLT, METER
Multi-modal & Video | Token Merging | TESTA [271] | 1. Samples input frames from the video. 2. Adaptively aggregates similar frames and patches within frames. | BLIP
Multi-modal & Video | Token Pruning | STA [272] | Evaluates tokens based on temporal redundancy and semantic importance. | ViT, VideoSwin
Multi-modal & Video | Token Pruning | STTS [273] | Dynamically selects informative tokens in both temporal and spatial dimensions. | MViT-B16
Figure 9: The illustration of token reduction methods: (a) token pruning removes useless tokens; (b) token merging combines similar tokens; (c) token summary condenses long token sequences into memory tokens.

Although model adaptation can accelerate the inference of a transformer model, it may incur large I/O latency because it needs to switch different model versions between GPU memory and disk, which can become the bottleneck during inference. By analyzing the inference process, researchers find that the dominant computational cost of transformer models comes from the self-attention mechanism [274], which computes matrix multiplications between queries and keys and scales quadratically with the sequence length. Processing long sequences or large batches therefore becomes computationally intensive as the number of tokens grows. Token reduction, encompassing token pruning, token merging, and token summary, is a technique aimed at enhancing the efficiency and performance of large transformer models [275]. The core idea is to shorten the input that a model processes, thereby reducing computational overhead and potentially improving the model's ability to focus on the most important information. A key research topic in token reduction is identifying redundant or similar tokens in the input and removing or merging them without negatively affecting model performance.

Designing token reduction methods faces several challenges. 1) Information loss due to token reduction. Although token reduction decreases computation cost, it can lead to significant information loss: errors introduced by pruning strategies may discard essential context and details necessary for accurate predictions, hurting model performance. 2) Trade-off between performance and efficiency. Achieving a good balance between model performance and computational efficiency is a major challenge, as aggressive token pruning can cause large accuracy drops due to the loss of essential information. 3) Robustness to various reduction ratios. A large reduction ratio may inevitably remove crucial tokens, resulting in incomplete inputs; a robust token reduction method should maintain strong performance across different reduction ratios. 4) Adaptability to different transformer architectures. The diversity of transformer architectures, including vanilla and hybrid transformers, makes it difficult to develop a compression method that is both effective and flexible across models. 5) Maintaining hardware-friendly inference. Reduction methods must also remain compatible with efficient execution on hardware, particularly on edge devices, which is crucial for the practical deployment of compression methods in real-world applications.

To address the above challenges, a number of token reduction methods have been developed to accelerate transformer inference. 1) Token pruning. Only a limited number of words or image patches contribute to the final prediction, and many redundant tokens can be regarded as noise. Token pruning therefore selectively removes tokens from the input sequence during inference. The core idea is to identify and retain only the most informative tokens for the task, thereby reducing the sequence length and the computational load. As shown in Figure 9a, two common approaches estimate the significance of tokens to guide pruning. The first is training a prediction module that takes tokens as input and outputs an importance score for each token. This module is typically a two-layer linear network trained with soft masking of tokens; during inference, the $k$ tokens with the lowest importance scores are removed. The second approach directly uses attention weights to prioritize tokens. The attention mechanism captures the interdependence among tokens, and tokens with lower attention values contribute less to the outputs of other tokens, indicating smaller significance; the attention weight therefore serves as a valuable metric for evaluating token importance (a minimal sketch follows below). 2) Token Merging. Token merging combines adjacent or similar tokens into single tokens, condensing the content and further reducing the sequence length without substantial information loss. This technique is particularly useful for large transformer models, where processing long token sequences is computationally intensive. The motivation behind token merging is the presence of numerous similar patches in images and videos, which play similar roles within the transformer. One of the most well-known methods is ToMe, which demonstrates the effectiveness of token merging [265]. As shown in Figure 9b, the input tokens are divided into two sets, and each token in set A selects its most similar token in set B using the cosine similarity of their features. The top $k$ edges, representing the highest token similarities, are retained, and the matched tokens are merged with a weighted average to shorten the input. 3) Token Summary. Token summary is a technique for optimizing LLMs by condensing lengthy instructions and external knowledge sources. Instructions typically contain detailed task information and example references, while knowledge pools provide additional external information to assist the LLM's inference. However, these sources can be voluminous, potentially exceeding the model's window size and causing latency issues. Token summary expedites inference by distilling this content into a small number of memory tokens using the LLM and concatenating them with the user input tokens for inference. By leveraging token summaries, LLMs can manage the overloaded information effectively and improve their ability to generate accurate responses.
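As a concrete illustration of the attention-based pruning described above, the following is a minimal NumPy sketch; the token features, attention matrix, and keep ratio are toy assumptions rather than any specific paper's implementation.

```python
import numpy as np

def prune_tokens(tokens, attn, keep_ratio=0.5):
    """Attention-guided token pruning sketch: score each token by the total
    attention it receives from all other tokens (column sums of the attention
    matrix) and keep only the top-k highest-scoring tokens.
    `tokens` has shape (n, d); `attn` has shape (n, n)."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    scores = attn.sum(axis=0)                 # attention received by each token
    keep = np.sort(np.argsort(scores)[-k:])   # indices of the k most important tokens
    return tokens[keep], keep

# Toy example: 6 tokens with 4-dimensional features and a random attention map.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
attn = rng.random((6, 6))
attn /= attn.sum(axis=1, keepdims=True)       # row-normalize, like softmax attention
kept_tokens, kept_idx = prune_tokens(tokens, attn, keep_ratio=0.5)
print(kept_idx)
```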

Based on the application scenarios, we can classify existing token reduction methods into three types. We summarize all token reduction methods and list their contributions in Table XII.

1) Language: Token reduction was initially applied to the BERT model. PoWER-BERT determines which tokens to prune based on attention values and learns a configuration of how many tokens should be eliminated [256]. To improve robustness, LengthDrop randomly generates a token length configuration during training and designs a multi-objective evolutionary search to find an optimal token length during inference; a drop-and-restore process recovers pruned tokens that might be important for deeper layers [255]. AdapLeR trains a contribution predictor to evaluate tokens and designs a soft-removal function to mask tokens for gradient propagation [254]. LTP assesses token importance with attention weights and learns a threshold to dynamically remove tokens [253]. To improve the accuracy of token importance evaluation, ToP transfers the token importance rankings derived from the final layer of unpruned models to the initial layers of pruned models, and introduces a coarse-to-fine pruning approach to dynamically select pruning layers [249]. The above methods target discriminative language models such as BERT. In recent years, generative language models such as GPT-3 and Llama have advanced significantly, and several studies propose token reduction techniques tailored to them. Self-information measures, such as entropy and perplexity, are used to estimate token importance for pruning [252]. LLMLingua uses a small model to compress prompts, sets different compression ratios for instructions, questions, and demonstrations, and iteratively prunes fine-grained tokens to improve accuracy [250]. A few methods aim to reduce the size of the key-value (KV) cache to expedite the decoding process. The authors of Heavy-Hitter Oracle (H2O) observe that a small subset of tokens, termed heavy hitters (H2), contributes most of the value when computing attention scores [248]. Based on this, they propose H2O, a KV cache eviction policy based on accumulated attention scores that dynamically retains a balance of recent and heavy-hitter tokens, effectively reducing the cache size without sacrificing performance; it improves throughput by up to $29\times$ on OPT-6.7B and OPT-30B compared with other leading inference systems (a sketch of this eviction rule is given below). FastGen incorporates four policies to dynamically adjust the KV cache [246]: 1. retaining special tokens such as <s> and <INST>; 2. preserving punctuation tokens such as "." and "?"; 3. evicting tokens that are distant from the current token; and 4. retaining tokens with high attention scores. StreamingLLM is an efficient method for deploying LLMs in streaming applications such as multi-round dialogue [247]. It leverages an observed phenomenon called the "attention sink", where maintaining the KV states of the initial tokens significantly improves performance, and it accelerates Llama-2 inference by up to $22.2\times$ compared with other methods. For token summary, AutoCompressor and ICAE adopt compact summary vectors to distill information from the original contexts [259, 257], and Gist compresses instruction prompts into "gist" tokens that represent a specific task [258].
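The following is a hedged sketch of an accumulated-attention eviction rule in the spirit of H2O; the cache budget, the recent-window size, and the random scores are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def h2o_keep_indices(acc_scores, budget, recent):
    """KV-cache eviction sketch: retain the `recent` newest tokens plus the
    heavy-hitter tokens with the largest accumulated attention scores, up to
    a total cache size of `budget`; all other cached tokens are evicted."""
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))
    recent_idx = list(range(n - recent, n))          # always keep the newest tokens
    older_idx = list(range(n - recent))
    k = budget - recent                              # slots left for heavy hitters
    heavy_idx = sorted(older_idx, key=lambda i: acc_scores[i], reverse=True)[:k]
    return sorted(heavy_idx) + recent_idx

# Toy example: 12 cached tokens, keep a budget of 8 (4 recent + 4 heavy hitters).
rng = np.random.default_rng(0)
acc_scores = rng.random(12)          # accumulated attention score per cached token
print(h2o_keep_indices(acc_scores, budget=8, recent=4))
```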

2) Vision. Images contain many redundant tokens, such as background patches. DynamicViT introduces trainable prediction modules to determine token scores [264]. Slimming dynamically prunes less informative patches in a top-down manner by formulating token pruning as an optimization problem [263]. HeatViT deploys attention-based token pruning on FPGA devices and incorporates hardware optimization strategies such as 8-bit quantization [262]. The authors of METR find that the attention weights of the class token ([CLS]) fail to gather task-specific information because the attention mechanism tends to concentrate on more general tokens in the initial layers, which degrades attention-based token reduction through inaccurate importance estimation; METR therefore integrates a multi-exit architecture into ViTs to encourage the model to prioritize task-relevant information from the early stages [261]. The motivation of STAR is to evaluate and prune patches based on their importance within and across layers [260]: it combines online intra-layer importance assessment with offline inter-layer importance analysis and uses a fusion mechanism to selectively prune patches while retaining those most critical for model performance. The idea of token merging originated from EViT [266], which divides tokens into attentive and inattentive ones based on attention weights and fuses the inattentive tokens into a single token. ToMe is a pure token merging method that gradually combines similar tokens within a transformer, achieving a speed-accuracy balance that rivals pruning methods [265] (a minimal sketch of this bipartite matching appears below). Recent works combine token pruning and token merging to improve accuracy. Long et al. exploit the importance and diversity of image patches to preserve discriminative local tokens while maximizing global token diversity [268]: tokens are first decoupled into two groups via class-token attention, inattentive tokens are clustered with a density peak clustering algorithm, and attentive tokens are merged with a gentle matching method. TPS is a token pruning and squeezing method [269]: it splits tokens into reserved and pruned subsets, then employs unidirectional nearest-neighbor matching and similarity-based fusing to fold the information of pruned tokens into the reserved ones. DiffRate enables automatic learning of different compression rates for different layers, improving the optimization of pruning and merging configurations [267].
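As an illustration of the bipartite matching behind ToMe, here is a minimal NumPy sketch that merges the r most similar token pairs by simple averaging; ToMe itself additionally tracks token sizes and applies a weighted average inside the attention blocks, and the toy features here are random.

```python
import numpy as np

def tome_merge(tokens, r):
    """ToMe-style bipartite merging sketch: alternate tokens into sets A and B,
    match each A token to its most similar B token by cosine similarity, and
    merge the r most similar pairs by averaging them into the B token."""
    a_idx, b_idx = np.arange(0, len(tokens), 2), np.arange(1, len(tokens), 2)
    a, b = tokens[a_idx], tokens[b_idx]
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                              # cosine similarity matrix
    best_b = sim.argmax(axis=1)                    # each A token's best partner in B
    best_sim = sim[np.arange(len(a)), best_b]
    merge_a = np.argsort(best_sim)[-r:]            # the r most similar A tokens
    merged_b = b.copy()
    for i in merge_a:                              # fold A tokens into their partners
        j = best_b[i]
        merged_b[j] = (merged_b[j] + a[i]) / 2.0
    keep_a = np.delete(a, merge_a, axis=0)
    return np.concatenate([keep_a, merged_b], axis=0)

# 8 tokens with 4-dim features; merging the 2 most similar pairs leaves 6 tokens.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
print(tome_merge(tokens, r=2).shape)
```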

3) Multi-modal & Video. When designing token reduction methods for multi-modal models, it is important to account for the interactions between modalities and adjust the token importance evaluation accordingly. To speed up video transformers, it is crucial to exploit both the redundancy within individual frames (intra-frame redundancy) and the redundancy between consecutive frames (inter-frame redundancy). TESTA extends the idea of token merging to video-language applications [271]. It begins by densely sampling input frames from the video and then applies two merging schemes: frame merging, which combines similar frames, and token merging, which merges similar tokens, improving the efficiency and effectiveness of video processing and analysis. PuMer employs a token reduction strategy that combines text-informed pruning and modality-aware merging to selectively reduce the number of tokens from both input images and text [270]; it removes image tokens that are irrelevant to the accompanying text and unimportant for the vision-language prediction. STTS dynamically selects informative tokens in both the temporal and spatial dimensions to improve the efficiency of video transformers [273]. By formulating token selection as a ranking problem, a lightweight scorer network estimates the importance of each token, and only the top-scoring tokens are used for downstream computation. Ding et al. optimize the speed-accuracy trade-off in video recognition by pruning spatio-temporal tokens with a Semantic-aware Temporal Accumulation (STA) score [272], which evaluates tokens based on temporal redundancy and semantic importance, allowing temporally redundant or semantically insignificant tokens to be pruned without additional parameters or retraining.

Adaptive token reduction. Token reduction techniques demonstrate remarkable advantages in expediting transformer inference while maintaining prediction performance. Furthermore, it is crucial to develop adaptation methods for token reduction to cater to various applications across diverse execution environments. By doing so, we can fully leverage the potential of token reduction and optimize its effectiveness. OTAS incorporates token prompting and token reduction in an elastic serving system, enabling dynamic selection of an optimal token number [276]. This selection process takes into account factors such as fluctuating query load, diverse service targets, and limited hardware resources. It designs an optimization method to investigate the balance between accuracy and throughput across various token reduction ratios. AdaTape enhances the flexibility and performance of a transformer model by dynamically adjusting the computation based on the input’s complexity [277]. It employs an elastic input sequence mechanism through adaptive tape tokens (i.e., prompt), which are generated from a tape bank and appended to the input sequences, allowing for dynamic read-and-write operations. This method enables the model to adaptively control both the content and the number of tape tokens used for each input, thereby adjusting the computational budget and potentially improving efficiency and performance on tasks.

V AI Agent

As early as the 1950s, Alan Turing had already expanded the concept of “intelligence” to artificial entities and proposed the famous Turing Test to assess whether machines could exhibit intelligence similar to that of humans. This test determines the level of machine intelligence based on whether it can make its text-based communication indistinguishable from that of humans. Such evaluated AI entities are commonly referred to as Agents. The concept of an agent originally stemmed from philosophy and is used to describe an entity with autonomy, not only capable of action but also possessing the desire, belief, and intent to decide when and how to act. This concept has been further developed in AI. Agents are not just entities executing programmed instructions. They also make decisions, solve problems, and navigate complex environments.

Over time, the application of intelligent agents has expanded from simple automation tasks to scenarios requiring complex decision-making and adaptability. For example, intelligent agents are now widely used in areas such as autonomous vehicles, high-frequency trading, personalized medicine, and smart home systems. These systems can respond to external inputs, learn users’ behavior patterns, and adjust their strategies according to environmental changes. This evolution from philosophy to technology marks technical advancements in AI and triggers meaningful sociological and ethical discussions about machine ethics and the control of intelligent systems. As AI technology develops and becomes more widespread, intelligent agents will keep shaping our ways of working and living while challenging our traditional understanding of intelligence, autonomy, and control.

The development from NLP to AGI can be divided into five levels: corpora, internet, perception, embodiment, and social attributes[278]. LLMs have reached the second level, where they can handle text input and output on the internet scale. If perceptual and action spaces are introduced to LLM-based agents, these agents could advance to the third and fourth stages of development. This means that these intelligent agents would not only be able to understand vast amounts of textual data but also understand and influence their physical or virtual environments through perception and interaction.

In an autonomous agent system powered by an LLM, the LLM serves as the agent's central nervous system or cognitive core, playing a pivotal role in processing information, making decisions, and generating actions. The agent is augmented by other essential elements that ensure its robustness, adaptability, and efficiency: the multi-agent framework, planning, memory, and tool use.

V-A Multi-agent Framework

An LLM multi-agent framework is a system that leverages multiple LLMs as independent agents working collaboratively on complex tasks. Each agent may specialize in different areas or subtasks, and the agents communicate and cooperate to share information, verify outputs, and enhance overall performance. This approach improves robustness, efficiency, and scalability by enabling parallel processing and modular updates. For example, a multi-agent framework for software development organizes specialized roles such as architects, programmers, and testers into a cohesive, collaborative system. Each agent has distinct responsibilities and communicates through defined protocols. Such a framework enhances efficiency, reduces errors, and ensures consistency by leveraging task management systems, collaboration tools, and shared knowledge bases; a minimal sketch of this role-based pipeline is given below.
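The following is a purely illustrative sketch of such a role-based pipeline; `llm_call` is a hypothetical stand-in for a real LLM API, and the roles and task are toy examples rather than any specific framework's design.

```python
def llm_call(role_prompt, task, context):
    """Hypothetical placeholder for an LLM API call; it only echoes the role
    so that the pipeline structure below is runnable."""
    return f"[{role_prompt}] output for: {task} (given {len(context)} prior messages)"

def run_pipeline(task):
    """Minimal multi-agent sketch: role-specialized agents process a task in
    sequence, each reading the shared message history left by earlier agents."""
    roles = ["architect: design the modules",
             "programmer: implement the design",
             "tester: write tests and report defects"]
    history = []
    for role in roles:
        message = llm_call(role, task, history)
        history.append(message)          # shared knowledge base for later agents
    return history

for msg in run_pipeline("build a to-do list web app"):
    print(msg)
```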

Figure 10: The LLM agent framework. The LLM serves as the central nervous system or cognitive core of the agent and plays a pivotal role in processing information, making decisions, and generating actions.

In multi-agent systems, communication and coordination are the foundations of cooperation. Practical communication protocols must be designed so that agents exchange information efficiently and accurately, and agents must maintain consistent semantic understanding and objectives to avoid misunderstandings and conflicts. Reasonable allocation and division of tasks are central to multi-agent cooperation: tasks should be distributed so that no agent is overloaded while others sit idle. Knowledge sharing is the basis for collaborative work in multi-agent systems; different agents may have different backgrounds, and efficiently integrating their knowledge is challenging.

Multi-agent systems integrate the advantages of multiple LLMs with distinct functionalities. Each agent possesses expertise in specific domains, and this broad spectrum of specialized knowledge helps ensure that the generated results are comprehensive and accurate. Solving complex problems requires a multifaceted approach, and by leveraging the strengths of multiple agents, these systems can offer more refined and practical solutions. AgentVerse [279] is a multi-agent framework designed to facilitate collaborative problem-solving among autonomous agents powered by LLMs; it emulates human group dynamics to improve task performance and explores emergent behaviors within agent collaborations. AgentBench [280] presents a comprehensive benchmark for evaluating LLMs acting as autonomous agents across a variety of real-world challenges; it consists of eight distinct environments categorized into code-grounded, game-grounded, and web-grounded scenarios to test LLMs' reasoning, decision-making, and performance in multi-turn, open-ended generation settings. CoELA [281] presents a framework for building cooperative embodied agents with LLMs, focusing on multi-agent cooperation without fine-tuning or few-shot prompting; evaluated in various embodied environments, its agents effectively plan, communicate, and cooperate on long-horizon tasks, surpassing traditional planning methods and showing that natural language communication enhances cooperation with humans. LLM-Co explores the potential of LLMs for multi-agent coordination, focusing on their ability to collaborate effectively in various scenarios [282]. CLIP [283] explores an approach to enhance the ability of LLMs to execute tasks in embodied settings, such as with robots, where understanding the physical world is crucial. Semantic ICL offers a promising approach to improving conversational agents by leveraging both semantic search and LLMs [284]. Yashar [285] introduces a framework that enhances the capabilities of LLMs with multi-agent systems, creating collaborative environments in which intelligent agents handle complex tasks efficiently; the authors demonstrate the framework's performance through case studies on models such as Auto-GPT, BabyAGI, and Gorilla, which incorporate external APIs into the LLM. DyLAN [286] optimizes the collaboration of LLM agents on complex tasks such as reasoning and code generation by enabling dynamic interaction architectures and inference-time agent selection. Traditional methods employ a static ensemble of agents, limiting adaptability and requiring extensive human input for agent design; DyLAN overcomes these limitations through a framework that dynamically constructs agent teams based on task requirements, supports multi-round interactions among agents, and incorporates an automatic agent-team optimization algorithm. This optimization relies on an unsupervised metric, the Agent Importance Score, to assess and select the most contributory agents. Empirical evaluations demonstrate DyLAN's superiority in reasoning and code generation tasks, with significant improvements over single-agent performance and traditional static approaches.

Multi-agent systems can fully leverage the expertise and skills of multiple independent agents to achieve solutions for complex tasks through collaborative cooperation. Each agent focuses on in-depth research in a specific domain, ensuring high professionalism, while the overall system integrates this information to provide comprehensive and accurate results. This architecture enhances the ability to handle complex problems. It possesses high flexibility and scalability, allowing adjustments and optimizations based on different needs, ultimately offering a more robust solution than a single LLM.

Figure 11: LLM for planning. Intelligent agents enhance task-handling capabilities in complex systems by breaking down large tasks into smaller, more specific sub-goals.

V-B Planning

Intelligent agents enhance task-handling capabilities in complex systems by breaking down large tasks into smaller, more specific sub-goals. This decomposition strategy makes task management more feasible and improves overall efficiency and effectiveness. Planning has the following characteristics (a minimal decomposition sketch follows the list):

  • Hierarchical approach: Agents solve problems layer by layer by creating a hierarchical structure of tasks, from high-level strategic decisions down to low-level concrete operations, prompting the LLM to think step by step.

  • Parallel processing: After task decomposition, agents can process multiple sub-goals in parallel, utilizing resources effectively. For example, in multi-agent systems, different agents can simultaneously deal with various tasks.

  • Dynamic adjustment: Agents adjust the priority and resource allocation of sub-goals based on real-time feedback to adapt to environmental changes and unforeseen circumstances, ensuring the optimal task execution strategy.
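As a purely illustrative sketch of this decomposition-and-adjustment loop (referenced above), the snippet below uses hypothetical `llm_decompose` and `execute` helpers in place of real LLM and environment calls.

```python
def llm_decompose(goal):
    """Hypothetical LLM call that splits a goal into ordered sub-goals;
    replaced here by a canned answer so the sketch runs."""
    return [f"{goal}: step {i}" for i in range(1, 4)]

def execute(sub_goal):
    """Hypothetical executor; fails on the original step 2 but succeeds once
    the sub-goal has been revised, to simulate feedback-driven adjustment."""
    return "step 2" not in sub_goal or "revised" in sub_goal

def plan_and_act(goal, max_retries=1):
    """Hierarchical planning loop: decompose the goal, execute sub-goals in
    order, and revise a sub-goal when its execution fails."""
    for sub_goal in llm_decompose(goal):
        for attempt in range(max_retries + 1):
            if execute(sub_goal):
                break
            sub_goal = f"{sub_goal} (revised after failure {attempt + 1})"
        print("done:", sub_goal)

plan_and_act("prepare a survey reading list")
```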

Intelligent agents can enhance their ability to handle complex tasks through effective sub-goal setting, task decomposition, and continuous reflection and improvement. This strengthens the agents’ ability to solve problems independently and optimizes the entire system’s performance, contributing to efficient, adaptive, intelligent system design.

MindAgent [287] is designed to evaluate planning and coordination capabilities in gaming interactions. DEPS [288] is an approach that leverages LLMs to plan tasks for multi-task agents in open-world environments; it integrates interactive planning with LLMs, focusing on error correction and efficiency improvement in plan execution through a goal selector module, and significantly improves task success rates in Minecraft as well as in tasks such as ALFWorld and tabletop manipulation. LLM-MCTS [289] integrates LLMs with Monte Carlo Tree Search for task planning, leveraging the commonsense knowledge of LLMs to improve the efficiency of planning algorithms; it significantly outperforms traditional MCTS and LLM-induced policies in complex task scenarios, demonstrating the advantage of using LLMs as both a world model and a heuristic policy guide. ICPI [290] implements policy iteration with LLMs by updating prompt content through trial-and-error interaction with an RL environment; experiments demonstrate its efficacy across six RL tasks with a range of LLMs such as Codex and GPT variants. Code-LLMs [291] investigate the limitations of current code-oriented LLMs in handling code completion when the provided code contains potential bugs.

There are limitations to the direct use of LLMs as planners, such as their limited reasoning and planning capabilities and their inefficient use of human feedback. PDDL [292] proposes a method for planning with pre-trained LLMs: it uses LLMs to construct PDDL models, employs PDDL validation and human feedback to correct initial errors in these models, and generates plans with the corrected PDDL models. ReCon [293] incorporates two processes: formulation contemplation, which generates initial thoughts and speech, and refinement contemplation, which polishes these thoughts. MPC [294] proposes an approach for creating high-quality conversational agents without fine-tuning, harnessing pre-trained LLMs as discrete modules to ensure sustained coherence and adaptability in open-domain conversational contexts. Human evaluations show that MPC performs comparably to fine-tuned chatbot models in open-domain settings, offering an effective solution for building consistent and engaging chatbots.

In an LLM-powered autonomous agent system, the LLM functions as the agent's brain. The agent breaks large tasks down into smaller, manageable sub-goals, enabling efficient handling of complex tasks. It can also engage in autonomous introspection, critique its past actions, extract lessons from its errors, and refine its strategies for subsequent attempts, thereby improving the quality of the final outcome.

V-C Memory

Figure 12: The memory of an LLM agent. Memory of the historical sequence is dynamically maintained and extracted through a retrieval method.

Based on the LLM, memory of the historical chat sequence is dynamically maintained and extracted through a retrieval method. The agent manages and updates a memory system to ensure that relevant information is retained and can be accessed efficiently. A reflection step then refines, organizes, and integrates this information for future use, enhancing the model's ability to learn from past interactions and apply this knowledge to new situations. This methodology improves the model's decision-making capabilities and helps it adapt to evolving contexts.

Many scholars are currently studying long-term memory in agents. Agents capable of storing and retrieving information over extended periods can maintain context and consistency across multiple interactions. REMEMBERER [295] introduces a framework that equips an LLM with a long-term experience memory, allowing the model to leverage past interaction experiences for decision-making across different tasks. This approach, termed reinforcement learning (RL) with experience memory, lets the LLM evolve its capabilities without fine-tuning its parameters, positioning it as a semi-parametric reinforcement learning agent; the experience memory is updated through RL processes, and experiments on two RL task sets demonstrate REMEMBERER's superiority over state-of-the-art methods. LongMem [296] enhances LLMs with long-term memory capabilities, addressing the limitation that current LLMs are constrained by a fixed-size input and cannot leverage extensive long-context information from past interactions. LongMem can handle unlimited-length context, significantly expanding its use cases, particularly in scenarios requiring understanding and generation based on extensive historical data; it demonstrates superior performance on long-context modeling benchmarks such as ChapterBreak and remarkable improvements in memory-augmented in-context learning over traditional LLMs. MaLP [297] is a framework for personalizing LLMs such as GPT-3.5 without the resource-intensive process of full retraining; the authors introduce a dual-process enhanced memory mechanism and a parameter-efficient fine-tuning schema, making LLMs more user-oriented by coordinating short-term and long-term memory components. Hatalis [298] focuses on agents' memory management, particularly long-term memory implemented with vector databases for storing and retrieving information, which is crucial for maintaining context-specific knowledge and recalling past experiences to enable coherent and efficient interactions. However, these methods require substantial extra space to store long-term memory. To alleviate this issue, some research combines self-reflection with memory mechanisms to achieve self-improvement without additional data. MoT [299] aims to enhance the capabilities of LLMs such as ChatGPT without additional annotated datasets or computationally expensive fine-tuning; inspired by the human ability to self-improve through self-reflection and memory, it proposes a mechanism for LLMs to improve themselves by leveraging their internal generation processes together with an external memory component. Liu [300] introduces Reasoning and Acting through Scratchpad and Examples (RAISE), an architecture designed to better integrate LLMs such as GPT-4 into conversational agents. The architecture uses a memory system to maintain the continuity of contextual conversations and outlines a comprehensive agent-construction method, including historical conversation selection, scene extraction, chain-of-thought completion, and scene augmentation, leading to the LLM training phase; the approach aims to enhance agent controllability and adaptability in complex, multi-turn dialogues. Maharana [301] introduces a machine-human pipeline that generates high-quality, very long-term dialogues using LLM-based agent architectures. These dialogues are grounded on personas and temporal event graphs, agents can share and react to images, and human annotators verify and edit the generated conversations for consistency and grounding to the event graphs. Wang [302] presents a larger-context language model that incorporates corpus-level discourse information, introducing a late-fusion technique for an LSTM-based recurrent language model that distinguishes intra-sentence from inter-sentence dependencies more effectively. To improve the efficiency of retrieval-based generation, RaLMSpec and Speculative RAG approximate retrieval over a large knowledge base with a smaller knowledge pool [303, 304].
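A minimal sketch of such retrieval-based memory is shown below, assuming a hypothetical embedding function (here a toy bag-of-words hash); real systems would use a learned text encoder and a vector database.

```python
import numpy as np

class VectorMemory:
    """Minimal retrieval-based long-term memory: past interactions are stored
    as embedding vectors and the most similar entries are retrieved for the
    current query by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed                   # hypothetical text-embedding function
        self.texts, self.vectors = [], []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def retrieve(self, query, k=2):
        q = self.embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors]
        top = np.argsort(sims)[-k:][::-1]    # indices of the k most similar memories
        return [self.texts[i] for i in top]

def toy_embed(text, dim=32):
    """Toy stand-in for a text encoder: hash words into a bag-of-words vector."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

memory = VectorMemory(toy_embed)
memory.add("the user prefers short answers")
memory.add("the user is planning a trip to Kyoto")
print(memory.retrieve("what trip is the user planning?"))
```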

Traditional LLMs remember patterns, facts, and information across the data they were trained on, allowing them to generate knowledgeable responses. With the support of the memory system, LLMs can quickly retrieve information from memory, making them highly effective for tasks that require rapid access to a broad range of information, such as question-answering systems or recommendation engines.

V-D Tool Use

Figure 13: An LLM agent using tools. External APIs and tools can extend the abilities of an LLM beyond its original training.

In the context of multi-agent systems, particularly those involving LLMs, the capability for agents to call external APIs represents a significant enhancement to their functionality and adaptability. This approach addresses several limitations of pre-trained models, offering the following benefits:

  • Access to current information: LLMs, once trained, do not inherently update their knowledge unless retrained with new data. By enabling agents to call external APIs, they can access the most current information available, which is crucial for tasks that depend on up-to-date data such as news updates, stock prices, or weather forecasts.

  • Enhanced Functional Capabilities: External APIs can extend the abilities of an LLM beyond its original training. For instance, agents can perform calculations, access mapping services, or retrieve specialized data that is not stored within their pre-trained weights.

  • Interactivity with Proprietary Systems: Agents can interact with proprietary information sources that are not publicly available or part of their training data. This is particularly useful in enterprise settings where access to internal databases or specific industry-related systems is required.

  • Dynamic Adaptation to New Domains: The ability to call external APIs allows agents to dynamically adapt to new domains or changes in their operating environment without the need for retraining. This makes the system more flexible and responsive to user needs.

Many researchers design automation and standardization methods for integrating diverse tools into LLM execution, enabling LLMs to seamlessly invoke various APIs and external resources. They develop context-aware tool selection mechanisms that allow LLMs to intelligently choose the most suitable tools based on the dialogue content and to carry out multi-step decision-making in complex tasks. Toolformer [305] is specifically designed to make informed decisions about selecting and invoking appropriate Application Programming Interfaces (APIs), determining not only the optimal timing for API calls but also the most suitable arguments to pass. It integrates a diverse array of utilities, including a calculator, a question-answering system, a search engine, a multilingual translation service, and a scheduling calendar, and it demonstrates significantly enhanced zero-shot performance across a wide array of downstream tasks, frequently matching or even outperforming much larger models. GPT4Tools [306] empowers LLMs such as LLaMA and OPT to use tools by crafting an instruction-following dataset through prompting a sophisticated teacher model across diverse multimodal environments; Low-Rank Adaptation is then employed during fine-tuning, enabling these LLMs to perform visual understanding tasks and generate images, with marked accuracy gains on both familiar and novel tools. ToolEmu [307] is a framework that uses LLMs to emulate tool execution for testing LM agents against diverse tools and scenarios, identifying risks such as data leakage or financial losses; it features an LM-based automatic safety evaluator to quantify the risks of agent failures. LLMs have shown remarkable abilities in various NLP tasks but struggle with issues such as hallucination and numerical reasoning [308]. To enhance their capabilities, the integration of external tools has been explored, but existing evaluation methods fail to clearly distinguish between LLMs leveraging internal knowledge and those effectively using external tools. The ToolQA [308] dataset addresses this gap: it consists of data from eight domains and incorporates 13 types of tools for accessing external information, and it is unique in requiring the use of external tools for question answering, minimizing reliance on LLMs' internal knowledge. Ruan [309] focuses on task planning and tool usage, introducing one-step and sequential agents that execute tasks through planning and external tools. ToolkenGPT [310] augments LLMs with external tools without fine-tuning, aiming to solve complex problems by incorporating a vast array of tools such as calculators or databases; its key innovation is representing each tool as a token and learning an embedding for it, so the LLM can call tools as easily as it generates text, combining the strengths of fine-tuning-based tool use while avoiding its limitations. Automatic Reasoning and Tool-use [311] enhances the capabilities of LLMs by automatically generating multi-step reasoning chains and integrating external tool use.
It generates reasoning programs for new tasks by retrieving related demonstrations and then uses external tools as needed within those programs. This approach allows for a structured and extensible way to incorporate complex reasoning and external knowledge into LLMs’ responses. INFERCEPT is a novel framework designed to optimize inference for augmented LLMs that interact with external tools or environments [312]. Current LLM inference systems handle external interactions by discarding the context and re-initiating a new request after receiving the response, leading to significant GPU resource wastage due to the recomputation of context. INFERCEPT addresses this by minimizing GPU memory waste through improved handling of intercepted LLM generations, employing techniques like swap pipelining and recomputation chunking. It dynamically selects a strategy from discarding, serving, and swapping. The framework significantly improves throughput and request completion rates.
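To illustrate the basic tool-invocation loop these systems build on, here is a hedged sketch with a hypothetical tool registry and a stubbed tool-selection function; none of this corresponds to a specific framework's API.

```python
# Hypothetical tool registry mapping tool names to callables.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy arithmetic tool
    "search": lambda q: f"(canned search result for '{q}')",           # stand-in for a web search API
}

def llm_choose_tool(query):
    """Placeholder for the LLM's tool-selection step; a real agent would
    prompt the model with tool descriptions and parse a structured reply."""
    if any(ch.isdigit() for ch in query):
        return {"tool": "calculator", "argument": "12 * (3 + 4)"}
    return {"tool": "search", "argument": query}

def answer(query):
    """Agent loop sketch: pick a tool, call the external function, and
    fold the observation into the final response."""
    decision = llm_choose_tool(query)
    observation = TOOLS[decision["tool"]](decision["argument"])
    return f"(used {decision['tool']}) {observation}"

print(answer("What is 12 * (3 + 4)?"))
print(answer("Who maintains the Punica CUDA kernels?"))
```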

Using APIs, agents can leverage external computing resources, which can be more efficient and scalable than processing everything locally. This is especially valuable for resource-intensive tasks. This approach transforms the agent from a static information processor to a dynamic information seeker, greatly enhancing its utility and ensuring its relevance regardless of when it was last trained. This capability is central to developing intelligent systems that are expected to operate in real-world, ever-changing environments.

VI Application Layer

In this section, we first illustrate batching, an effective strategy that groups requests for execution in an application to improve throughput. Next, we describe some representative applications of foundation models and AI agents.

VI-A Batching

Figure 14: The illustration of batching. Requests with similar lengths are grouped together to reduce zero padding.
TABLE XIII: The summary of batching methods.
Issue | Ref. | Contribution | Model
Different lengths | ORCA [313] | Iteration-level batching instead of request-level batching. | GPT
Different lengths | DVABatch [314] | A multi-entry multi-exit batching scheme. | CNN, BERT
Different lengths | PiA [315] | Accurately perceives and predicts the response length. | GPT-2, Llama, Vicuna
Different lengths | TurboTransformers [316] | Uses dynamic programming for batching to maximize response throughput. | BERT series
Different service features | Clipper [80] | Dynamic batch size and delayed batching. | CNN
Different service features | OTAS [276] | Batches queries with similar service characteristics. | ViT
Different weights | FLORA [317] | Introduction of an example-specific adapter. | StarCoder
Different weights | S-LoRA [318] | Highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. | Llama
Different weights | Punica [319] | A new CUDA kernel for batching GPU operations across different LoRA models. | Llama-2

Batching is a widely adopted and beneficial strategy in serving systems, as it involves grouping incoming queries into batches and executing inferences for them together. This approach offers several advantages, including improved throughput and reduced resource consumption. The advantages of batching can be attributed to two primary factors. Firstly, it helps mitigate the costs associated with RPC (Remote Procedure Call) calls and minimizes overhead in internal frameworks, such as the need to copy inputs to GPU memory. Secondly, batching enables machine learning frameworks to leverage data-parallel optimizations more effectively. By performing simultaneous batch inference on multiple inputs, the framework can exploit parallel processing capabilities, such as those offered by GPUs, to enhance overall inference performance.

However, implementing batching in a serving system poses certain challenges. We summarize the issues and the corresponding methods in Table XIII. First, text queries usually have different lengths. To handle this, shorter requests are often padded with zeros, which becomes computationally inefficient when request lengths differ significantly. Requests can also have very different output lengths, ranging from a few words to long articles, so shorter requests may have to wait for longer ones before returning results. TurboTransformers addresses this issue with a dynamic programming algorithm that groups requests with similar input lengths, maximizing overall throughput and reducing inference latency [316] (a simplified length-aware grouping sketch is given below). PiA considers both the input length and the output length of requests [315]: it introduces a predictor that perceives and predicts the response length and groups similar requests together, allowing better resource allocation and improved efficiency. DVABatch proposes a multi-entry multi-exit batching scheme that uses three meta operations (new, stretch, and split) to dynamically adjust ongoing batches for different kinds of diversity, enabling, for example, short queries to exit early, batches to be split to match an operator's preferred batch size, and later queries to join ongoing batches [314]. ORCA introduces iteration-level scheduling, which processes inference requests more efficiently by scheduling execution at the granularity of iterations rather than entire requests [313].
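For intuition, here is a simplified greedy sketch of length-aware batching; the waste threshold, batch-size limit, and toy request lengths are illustrative assumptions, and TurboTransformers solves a dynamic programming formulation rather than this greedy rule.

```python
def batch_by_length(requests, max_batch_size, max_padding_waste=0.3):
    """Length-aware batching sketch: sort requests by input length and greedily
    grow a batch until it is full or padding shorter requests up to the longest
    one would waste too large a fraction of the token slots."""
    order = sorted(requests, key=lambda r: r["length"])
    batches, current = [], []
    for req in order:
        if current:
            longest = req["length"]                        # sorted, so this is the max
            padded = longest * (len(current) + 1)          # slots if this request joins
            real = sum(r["length"] for r in current) + req["length"]
            waste = 1.0 - real / padded                    # fraction of slots spent on padding
            if len(current) >= max_batch_size or waste > max_padding_waste:
                batches.append(current)
                current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches

# Toy requests with input lengths in tokens.
requests = [{"id": i, "length": l} for i, l in enumerate([12, 15, 14, 120, 130, 128, 40])]
for b in batch_by_length(requests, max_batch_size=4):
    print([r["length"] for r in b])
```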

The second challenge involves addressing the varying service features of requests, including arrival time, latency requirements, and utility. A common dilemma is balancing throughput and latency: prioritizing high throughput may lead to longer wait times for individual requests and potentially violate their latency requirements. Clipper tackles this issue by dynamically adjusting the batch size and employing delayed batching [80], which allows more queries to be batched together to enhance throughput while still respecting the latency requirements of individual queries. Similarly, OTAS designs its batching algorithm around service characteristics, ensuring that requests with comparable attributes are grouped together to optimize overall system performance [276]. A third challenge concerns model weights: batching presumes that requests are processed by the same model, yet pre-trained models are often fine-tuned with LoRA for different tasks, resulting in distinct model parameters tailored to specific queries [320]. To address this, FLORA introduces an example-specific adapter [317], which is computationally efficient because it relies on matrix multiplications and element-wise operations that are inherently batch-friendly on modern GPUs. S-LoRA and Punica design highly optimized custom CUDA kernels for heterogeneous batching of LoRA computations [318, 319].
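As a concrete illustration of delayed batching under a latency bound, the sketch below (a simplified analogue of the behavior described for Clipper, not its actual implementation) waits up to a small delay to accumulate queries while capping the batch size; the queue interface, delay, and size bound are illustrative assumptions.

# A minimal sketch of delayed batching with a dynamic size cap (illustrative only).
import queue
import time

def collect_batch(request_queue, max_batch_size=16, max_delay_s=0.01):
    # Block until at least one request arrives, then wait at most max_delay_s
    # for additional requests, never exceeding max_batch_size.
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for i in range(5):
    q.put({"id": i})
print(len(collect_batch(q, max_batch_size=4, max_delay_s=0.005)))  # -> 4

Shrinking max_delay_s tightens per-query latency at the cost of smaller batches, which is precisely the trade-off such systems tune against service-level objectives.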

VI-B Foundation Model Applications

Generative AI has demonstrated robust performance and vast potential in commercial applications, playing a significant role across various industries. Commercial generative AI can be categorized into four types: (1) text-generating AI (including some multimodal AI), (2) code-generating AI, (3) image-generating AI, and (4) video-generating AI. Among these, text-generating, code-generating, and image-generating AI have already seen widespread application. Video-generating AI lags slightly behind the other three but exhibits considerable potential for future applications.

  • Text-generating and multimodal AI. ChatGPT, developed by OpenAI, is the most famous chatbot and performs well in various NLP tasks [321]. ERNIE Bot (Wenxin Yiyan) is similar to ChatGPT, supports image and text input and output, and reportedly outperforms ChatGPT 4.0 on several Chinese-language tasks [322].

  • Code-generating AI. GitHub Copilot is the most widely adopted code-generation tool in the world. It can automatically generate code and comments and provide code explanations, allowing developers to focus on problem-solving and collaboration [323]. Codex, similar to GitHub Copilot, is built on GPT-3 and is proficient in many programming languages; OpenAI claims that Codex is more powerful than GitHub Copilot [324].

  • Photo and video-generating AI. Midjourney generates high-quality images from natural-language prompts; it entered public beta on July 12, 2022, and is continuously evolving [325]. Sora, built on diffusion models and drawing on techniques from GPT and DALL·E, can generate videos from images and language or extend already generated videos; OpenAI describes it as an important step towards AGI [326]. Runway Gen-2 is a high-performance video-generation AI developed by Runway AI that can take text, image, or video input and convert it into new videos [327].

VI-C AI Agent Applications

The rise of LLMs has significantly accelerated research on AI agents, which are now regarded as a principal avenue toward achieving AGI. Andrej Karpathy, a founding member of OpenAI, has stated that AI agents represent a significant future direction for AI. In the development of digital entities across various industries, these agents are expected to play a pivotal role in applying AGI, and agent products are anticipated to carry out business operations.

Since March 2023, the field of AI agents has witnessed several groundbreaking developments. AutoGPT, a project developed by Significant Gravitas, chains GPT calls to pursue user-specified objectives autonomously [328]: users simply set one or more goals, and AutoGPT generates and executes tasks on its own, continually analyzing and refining its strategies throughout the process. Researchers at Stanford and Google have developed a simulated environment named Smallville, in which 25 AI agents exhibit complex behaviors and interactions [329]; each agent possesses a unique seed memory, which enables it to form and recall relationships through social interactions with other agents. VOYAGER, designed with the long-term goal of ‘discovering as many different things as possible’, is the first AI agent in Minecraft to continuously explore the world, acquiring a diverse set of skills and making new discoveries [330]. It introduces an ever-growing skill library for storing and retrieving complex behaviors as executable code, alongside an iterative prompting mechanism that enables the agent to autonomously explore and adapt to unknown environments without human intervention. ChatDev (Chat-powered Software Development) is proposed as a virtual software company operated by multiple intelligent entities [331]. After a user specifies a task, ChatDev follows the software engineering waterfall model and, through a sequence of atomic tasks known as the Chat Chain, facilitates automated interaction, collaboration, and decision-making among diverse AI agents to produce complete software, including source code, environment dependency specifications, and user manuals.

Research on AI agents is currently led primarily by the academic community and developers. The strong performance of AI agents in independent operation, collective collaboration, and human-machine interaction is increasingly recognized as a potent tool for enhancing productivity across industries in the digital economy. However, the widespread application of LLM-based agents faces significant challenges due to the costs associated with token interactions, and the absence of a profit-sharing mechanism further hampers their development. The main strategies for reducing costs include combining models of different sizes and optimizing the inference infrastructure, and rapid advancements in hardware and models are expected to alleviate these cost issues in the near future. OpenAI predicts that AI will surpass human intelligence within the next decade, achieving what is termed superintelligence. AI agents are expected to become mainstream in future products, with numerous agent-centered products likely to emerge and be deployed across various fields in the coming years.

VII Summary and Future Work

VII-A Lesson Learned

VII-A1 Heterogeneous edge computing for FM inference

Our survey of various edge computing devices for FM inference has yielded several important insights. Each class of edge device has unique characteristics designed to meet specific requirements.

  • ASICs, while optimal for fixed model architectures in FMs, are immutable post-production, limiting their adaptability to new models. FPGAs offer greater flexibility through programmability, supporting various model sizes. However, they are primarily suited for linear computations, necessitating additional design considerations for non-linear operators prevalent in transformer architectures.

  • Among IMCs, TPUs are representative devices with limited memory capacity for FMs. Another drawback is their reliance on integer operations, which constrains precision and performance, making them suboptimal for direct FM inference.

  • CPUs, while significantly less powerful than other accelerators, offer larger memory. This characteristic enables offloading some computation workload and part of model caches to CPUs, facilitating collaborative inference with other accelerators.

  • GPUs benefit from a mature CUDA ecosystem, making them suitable for out-of-the-box FM inference; consequently, numerous research efforts focus on accelerating GPU inference in data centers. However, their high cost and energy consumption hinder widespread adoption in edge scenarios. This limitation has spurred interest in edge GPU devices, exemplified by NVIDIA Jetson, which offer limited computing capacity but sufficient memory.

In conclusion, heterogeneous edge computing for different kinds of FMs remains under-explored because most existing designs focus only on LLMs. Researchers tend to assume a fixed transformer architecture and overlook hybrid architectures, such as ViT plus LLM, which are common in multi-modal FMs. As a result, these approaches are limited to specific models and sub-optimal for others.

On the other hand, most works focus on hardware-specific optimization methods and do not consider inter-accelerator optimization (e.g., parallelism strategies and communication optimization between heterogeneous devices). Yet these accelerators could cooperate to form a stronger FM serving system if the designed approaches exploited the complementary strengths and weaknesses of heterogeneous devices.

VII-A2 Trade-off on the FMs

While foundation models exhibit stronger reasoning and generation capabilities as their scale increases, this improvement is not linear and comes with significant trade-offs in resource consumption and complexity. Larger models require substantially more computation, memory, and energy, which can create bottlenecks in real-world deployments, especially at the resource-scarce edge, where specialized tasks and private data often require local fine-tuning. Tuning larger models is also more challenging because they tend to over-rely on patterns in their massive pre-training datasets, potentially limiting their broader reasoning abilities. Therefore, practical applications require balancing model size with inference efficiency and using domain-specific optimizations to maximize performance while controlling computational costs.

Another key lesson is the challenge of effectively integrating and balancing multiple data modalities (e.g., text, image, audio) within a single model. While multi-modal FMs have shown the potential to understand and generate rich, cross-modal content, they often struggle with maintaining performance consistency across all modalities. This is largely due to discrepancies in data availability, quality, and structure for different modalities, which can lead to imbalanced learning. Moreover, aligning diverse modalities in a common representational space requires sophisticated cross-attention mechanisms or specialized fusion techniques, which are still areas of active research. A major takeaway is that while multimodal FMs can offer robust flexibility and broad applicability, ensuring seamless integration and coherence across modalities remains a significant hurdle.
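Where this lesson touches concrete mechanisms, a small example may help. The following minimal numpy sketch shows single-head cross-attention, in which text-token queries attend over image-patch features, the basic operation underlying many of the fusion techniques mentioned above; the dimensions and random inputs are illustrative only.

# A minimal sketch of single-head cross-attention for modality alignment
# (illustrative dimensions and random data, not any specific model).
import numpy as np

def cross_attention(text_queries, image_keys, image_values):
    d = text_queries.shape[-1]
    scores = text_queries @ image_keys.T / np.sqrt(d)           # (n_text, n_patches)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over patches
    return weights @ image_values                               # image-conditioned text features

rng = np.random.default_rng(0)
fused = cross_attention(rng.standard_normal((4, 64)),           # 4 text tokens
                        rng.standard_normal((16, 64)),          # 16 image patches (keys)
                        rng.standard_normal((16, 64)))          # 16 image patches (values)
print(fused.shape)                                              # (4, 64)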

VII-A3 Elastic agent serving system

A key lesson learned from examining current serving systems for FM agents is the significant gap in elasticity at the agent layer. While substantial progress has been made in introducing elasticity at the level of resources, models, and execution tokens, there remains a notable lack of dynamic adaptability within the agent itself. Flexibility at the agent layer is essential for unleashing the full potential of FM-based systems, particularly for handling complex, real-world tasks. The agent layer should become more flexible by incorporating adaptivity in API calls, external knowledge retrieval, reasoning capabilities, and multi-agent collaboration.

  • Firstly, an adaptive agent could decide when it needs to access external databases or APIs for more specialized information, or when it should rely on internal reasoning capabilities to make decisions autonomously.

  • Additionally, instead of hard-coded processes for knowledge integration, agents could be equipped with mechanisms to interact with evolving knowledge repositories and to search across knowledge bases at different levels.

  • Besides, agents could be adaptive in reasoning and multi-agent collaboration. Current systems tend to employ static reasoning approaches, where the agent either follows a predefined logic or relies entirely on a single model’s output. In contrast, a more flexible agent could dynamically choose between different reasoning strategies depending on the task complexities, and potentially break down a problem into subtasks to be handled by specialized agents within a collaborative network. This multi-agent coordination could allow the system to generate more scalable and robust solutions.

Without this level of flexibility, LLM agents remain limited in their ability to handle a broad spectrum of tasks. They miss opportunities to leverage external knowledge, adapt their problem-solving strategies, or collaborate effectively in multi-agent environments. As more complex and specialized applications arise, the lack of agent-layer flexibility becomes a bottleneck, preventing these systems from fully realizing their potential. Thus, future work in this field should prioritize developing more flexible, adaptable agents capable of handling diverse, evolving tasks in real-world environments.
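To illustrate the kind of agent-layer adaptivity described above, the following minimal Python sketch routes each query either to the model's internal reasoning, to external retrieval, or to a specialized collaborator, based on a confidence estimate. All interfaces here (generate_with_confidence, retriever.search, the specialists mapping, and the threshold) are hypothetical placeholders rather than the API of any existing framework.

# A hypothetical sketch of an adaptive routing policy at the agent layer.
def route_query(query, llm, retriever, specialists, conf_threshold=0.8):
    # 1) Try internal reasoning first; keep the draft if the model is confident.
    draft, confidence = llm.generate_with_confidence(query)     # assumed interface
    if confidence >= conf_threshold:
        return draft
    # 2) Otherwise consult external knowledge and regenerate with context.
    docs = retriever.search(query, top_k=3)                     # assumed interface
    if docs:
        return llm.generate(query, context=docs)
    # 3) As a last resort, delegate to a specialized collaborator agent.
    task_type = llm.classify(query, labels=list(specialists))   # assumed interface
    return specialists[task_type].solve(query)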

VII-B Future Directions

VII-B1 Efficiently deploying large FMs on heterogeneous devices

Efficiently deploying FMs on heterogeneous devices across the edge and cloud involves leveraging a combination of edge and cloud resources to optimize performance, scalability, and cost-effectiveness. Existing research primarily emphasizes optimizing the deployment of small models at the edge or large models in the cloud, leaving the effective deployment of large FMs on heterogeneous edge-cloud devices as an unresolved issue. This paradigm can make use of idle computational resources at the edge, enabling low-latency and low-cost inference while also preserving privacy. Several challenges remain when deploying FMs at the edge. First, large FMs have substantial computational requirements, while edge devices have limited processing power and vary significantly in computational capability. Second, FMs typically require significant memory to store model parameters, and memory constraints on heterogeneous devices, such as edge devices with limited RAM or storage, can hinder the deployment of large FMs. Third, in distributed deployments at the edge, communication capacity is limited: transferring data between devices and coordinating computation introduces communication overhead and greatly increases inference latency. Additionally, energy consumption is a critical consideration, especially for battery-powered edge devices or devices with limited power budgets; deploying large FMs on such devices requires energy-efficient algorithms and optimizations to prolong battery life and reduce operating costs. In summary, advanced model compression and resource scaling methods need to be designed for this scenario.
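As one concrete ingredient of such a deployment, the following minimal Python sketch partitions a model's layers across heterogeneous devices in proportion to their memory capacity; the layer sizes, device names, and capacities are illustrative assumptions rather than measurements from any real system, and a practical scheduler would also account for compute and communication costs.

# A hypothetical sketch of memory-proportional layer partitioning across devices.
def partition_layers(layer_sizes_gb, device_memory_gb):
    """Assign consecutive layers so each device's share of the model roughly
    matches its share of the total memory."""
    total_model = sum(layer_sizes_gb)
    total_memory = sum(device_memory_gb.values())
    assignment, layer_idx = {}, 0
    for device, mem in device_memory_gb.items():
        budget = total_model * mem / total_memory          # proportional memory budget
        used, layers = 0.0, []
        while layer_idx < len(layer_sizes_gb) and (
            used + layer_sizes_gb[layer_idx] <= budget or not layers
        ):
            used += layer_sizes_gb[layer_idx]
            layers.append(layer_idx)
            layer_idx += 1
        assignment[device] = layers
    # Any remaining layers fall back to the last (typically cloud) device.
    assignment[device].extend(range(layer_idx, len(layer_sizes_gb)))
    return assignment

print(partition_layers([0.5] * 32, {"jetson": 8, "edge_server": 24, "cloud_gpu": 80}))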

VII-B2 Deploying multi-modal models and MoE models on edge-cloud devices

Traditional serving systems have mainly been developed for vision or language models that exclusively handle visual or textual inputs. Recently, advanced multi-modal models have been developed to support diverse types of inputs, expanding the range of interaction possibilities for users. Besides, new model architectures (e.g., MoE) have been designed to enlarge the parameter space and enhance model capability. Effectively supporting these models requires enhancements to serving system design. First, multi-modal models consist of multiple modules, such as modality-aware encoders and decoders, and these modules have different dependencies across tasks; consequently, serving systems need new resource allocation and inference acceleration methods tailored to such models. Second, multi-modal and MoE models have particularly large parameter spaces, and their inference paths depend on the received input; for example, different inputs activate distinct experts. Thus, during online inference, it becomes crucial to dynamically schedule available resources to accommodate these varying demands.
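To make the input-dependent nature of MoE inference concrete, the following minimal numpy sketch performs top-k gating over a set of experts: different inputs select different experts, so the memory and compute demand varies per token. The gate weights, expert count, and hidden size are random placeholders, not taken from any particular MoE model.

# A minimal sketch of top-k expert routing in an MoE layer (illustrative only).
import numpy as np

def topk_route(token_embedding, gate_weights, k=2):
    logits = gate_weights @ token_embedding             # one score per expert
    top = np.argsort(logits)[-k:]                       # indices of the k selected experts
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return top, probs

rng = np.random.default_rng(0)
gate = rng.standard_normal((8, 16))                     # 8 experts, hidden size 16
for _ in range(3):
    experts, weights = topk_route(rng.standard_normal(16), gate)
    print("activated experts:", experts)                # differs across inputs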

VII-B3 Specific serving system for agents

Current serving systems are designed for FM inference. With the increasing prevalence of agent services, serving systems need to be optimized specifically around the characteristics of agents. For example, different agent services employ various execution strategies, such as different planning methods, memory retrieval schemes, and multi-agent frameworks across applications. These variations offer opportunities to optimize the inference process and resource scheduling for agent systems. For instance, simple agents can be strategically allocated to weaker devices and complex agents to more powerful devices; by considering computation and communication capacities, their collaboration frequency and mode can be optimized effectively.
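As a toy illustration of such agent-aware placement, the sketch below ranks devices by capacity and assigns heavier agents to stronger devices. The single "load" score per agent and the capacity numbers are hypothetical simplifications of the computation and communication profiling a real scheduler would require.

# A hypothetical sketch of capacity-aware agent-to-device placement.
def place_agents(agents, devices):
    """agents: {name: load}, devices: {name: capacity} -> {agent: device}."""
    placement = {}
    ordered_devices = sorted(devices.items(), key=lambda d: d[1], reverse=True)
    for i, (agent, _) in enumerate(
        sorted(agents.items(), key=lambda a: a[1], reverse=True)
    ):
        device, _ = ordered_devices[i % len(ordered_devices)]   # heaviest agent -> strongest device
        placement[agent] = device
    return placement

print(place_agents({"planner": 10, "retriever": 3, "summarizer": 5},
                   {"cloud_gpu": 100, "edge_gpu": 20, "cpu_box": 5}))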

VII-C Conclusion

FM-powered agent services are widely recognized as a crucial path toward achieving AGI and are expected to spearhead development in this field over the next decade. Constructing a robust infrastructure for these agent services within an edge-cloud environment is of great importance and has a substantial impact on the user experience. In this survey, we have presented a unified framework for a thorough literature review on diverse techniques for deploying FM-powered agent services. We have introduced a series of low-level optimization methods for model execution, including computation, communication, and I/O optimization. Subsequently, we have discussed the parallelism methods and resource allocation schemes designed to optimize resource utilization. We have also highlighted several popular FMs and introduced two lightweight methods, namely model compression and token reduction, to expedite the inference process. Furthermore, we have reviewed the latest research endeavors focusing on the development of effective multi-agent frameworks and explored several recent applications. Finally, the future research directions for serving multi-modal agents on heterogeneous devices are outlined.

This survey is anticipated to significantly advance the development of large-scale model applications in both academia and industry. The proposed framework, along with the referenced research works, provides a comprehensive perspective on deploying FM-powered agent services. It showcases the latest advancements and encourages more researchers to contribute to this practical and compelling field. The technologies discussed in the paper collectively contribute to the development of a reliable and flexible serving system, enabling low-latency agent services that enhance users’ daily experiences with increased intelligence. The integrated optimization of system architecture and AI algorithms will expedite the deployment of large-scale model applications for a diverse range of users, thereby promoting societal progress. Looking ahead, we are dedicated to advancing research in large model applications and aim to expand our investigation to include a broader range of models, execution environments, and practical applications.

References

  • [1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, and P. Liang, “On the Opportunities and Risks of Foundation Models,” 2022.
  • [2] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui, “The Rise and Potential of Large Language Model Based Agents: A Survey,” 2023.
  • [3] Nerdynav, “107 Up-to-Date ChatGPT Statistics & User Numbers,” 2024, accessed: 2024-04-24. [Online]. Available: https://nerdynav.com/chatgpt-statistics/
  • [4] C. Kachris, “A Survey on Hardware Accelerators for Large Language Models,” 2024.
  • [5] S. Tang, Y. Yu, H. Wang, G. Wang, W. Chen, Z. Xu, S. Guo, and W. Gao, “A Survey on Scheduling Techniques in Computing and Network Convergence,” IEEE Communications Surveys & Tutorials, vol. 26, no. 1, pp. 160–195, 2024.
  • [6] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,” 2023.
  • [7] M. Xu, H. Du, D. Niyato, J. Kang, Z. Xiong, S. Mao, Z. Han, A. Jamalipour, D. I. Kim, X. Shen, V. C. M. Leung, and H. V. Poor, “Unleashing the Power of Edge-Cloud Generative AI in Mobile Networks: A Survey of AIGC Services,” IEEE Communications Surveys & Tutorials, pp. 1–1, 2024.
  • [8] H. Djigal, J. Xu, L. Liu, and Y. Zhang, “Machine and Deep Learning for Resource Allocation in Multi-Access Edge Computing: A Survey,” IEEE Communications Surveys & Tutorials, vol. 24, no. 4, pp. 2449–2494, 2022.
  • [9] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A Survey of Large Language Models,” 2023.
  • [10] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A Comprehensive Overview of Large Language Models,” 2024.
  • [11] X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang, “A Survey on Model Compression for Large Language Models,” 2023.
  • [12] W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin, D. Cai, and X. He, “Model Compression and Efficient Inference for Large Language Models: A Survey,” 2024.
  • [13] X. Xu, M. Li, C. Tao, T. Shen, R. Cheng, J. Li, C. Xu, D. Tao, and T. Zhou, “A survey on knowledge distillation of large language models,” arXiv preprint arXiv:2402.13116, 2024.
  • [14] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” Frontiers of Computer Science, vol. 18, no. 6, pp. 1–26, 2024.
  • [15] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024.
  • [16] H. Wang, Y. Jia, M. Zhang, Q. Hu, H. Ren, P. Sun, Y. Wen, and T. Zhang, “FedDSE: Distribution-aware Sub-model Extraction for Federated Learning over Resource-constrained Devices,” in Proceedings of the ACM on Web Conference 2024, 2024, pp. 2902–2913.
  • [17] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer,” in 2020 IEEE 33rd International System-on-Chip Conference (SOCC).   IEEE, 2020, pp. 84–89.
  • [18] H. Jang, J. Kim, J.-E. Jo, J. Lee, and J. Kim, “Mnnfast: A fast and scalable system architecture for memory-augmented neural networks,” in Proceedings of the 46th International Symposium on Computer Architecture, 2019, pp. 250–263.
  • [19] H. Khan, A. Khan, Z. Khan, L. B. Huang, K. Wang, and L. He, “Npe: An fpga-based overlay processor for natural language processing,” arXiv preprint arXiv:2104.06535, 2021.
  • [20] S. Hong, S. Moon, J. Kim, S. Lee, M. Kim, D. Lee, and J.-Y. Kim, “Dfx: A low-latency multi-fpga appliance for accelerating transformer-based text generation,” in 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2022, pp. 616–630.
  • [21] Y. Bai, H. Zhou, K. Zhao, J. Chen, J. Yu, and K. Wang, “Transformer-opu: An fpga-based overlay processor for transformer networks,” in 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).   IEEE, 2023, pp. 221–221.
  • [22] I. Okubo, K. Sugiura, and H. Matsutani, “A cost-efficient fpga implementation of tiny transformer model using neural ode,” arXiv preprint arXiv:2401.02721, 2024.
  • [23] S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang et al., “Flightllm: Efficient large language model inference with a complete mapping flow on fpga,” arXiv preprint arXiv:2401.03868, 2024.
  • [24] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee et al., “A^3: Accelerating attention mechanisms in neural networks with approximation,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2020, pp. 328–341.
  • [25] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and J. W. Lee, “Elsa: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA).   IEEE, 2021, pp. 692–705.
  • [26] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA).   IEEE, 2021, pp. 97–110.
  • [27] L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang, “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 977–991.
  • [28] Z. Zhou, J. Liu, Z. Gu, and G. Sun, “Energon: Toward efficient acceleration of transformers using dynamic sparse attention,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 1, pp. 136–149, 2022.
  • [29] H. Guo, L. Peng, J. Zhang, Q. Chen, and T. D. LeCompte, “Att: A fault-tolerant reram accelerator for attention-based neural networks,” in 2020 IEEE 38th International Conference on Computer Design (ICCD).   IEEE, 2020, pp. 213–221.
  • [30] A. F. Laguna, A. Kazemi, M. Niemier, and X. S. Hu, “In-memory computing based accelerator for transformer networks for long sequences,” in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2021, pp. 1839–1844.
  • [31] B. Reidy, M. Mohammadi, M. Elbtity, H. Smith, and Z. Ramtin, “Work in progress: Real-time transformer inference on edge ai accelerators,” in 2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS), 2023, pp. 341–344.
  • [32] B. He and T. Hofmann, “Simplifying transformer blocks,” arXiv preprint arXiv:2311.01906, 2023.
  • [33] J. Choi, H. Li, B. Kim, S. Hwang, and J. H. Ahn, “Accelerating transformer networks through recomposing softmax layers,” in 2022 IEEE International Symposium on Workload Characterization (IISWC).   IEEE, 2022, pp. 92–103.
  • [34] N. Yang, T. Ge, L. Wang, B. Jiao, D. Jiang, L. Yang, R. Majumder, and F. Wei, “Inference with reference: Lossless acceleration of large language models,” arXiv preprint arXiv:2304.04487, 2023.
  • [35] P. Belcak and R. Wattenhofer, “Exponentially faster language modelling,” arXiv preprint arXiv:2311.10770, 2023.
  • [36] H. Shen, H. Chang, B. Dong, Y. Luo, and H. Meng, “Efficient llm inference on cpus,” arXiv preprint arXiv:2311.00502, 2023.
  • [37] Y. Song, Z. Mi, H. Xie, and H. Chen, “Powerinfer: Fast large language model serving with a consumer-grade gpu,” arXiv preprint arXiv:2312.12456, 2023.
  • [38] X. Zhao, B. Jia, H. Zhou, Z. Liu, S. Cheng, and Y. You, “Hetegen: Heterogeneous parallel inference for large language models on resource-constrained devices,” arXiv preprint arXiv:2403.01164, 2024.
  • [39] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Re, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” 2023.
  • [40] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re et al., “Deja vu: Contextual sparsity for efficient llms at inference time,” in International Conference on Machine Learning.   PMLR, 2023, pp. 22 137–22 176.
  • [41] K. Alizadeh, I. Mirzadeh, D. Belenko, K. Khatamifard, M. Cho, C. C. Del Mundo, M. Rastegari, and M. Farajtabar, “Llm in a flash: Efficient large language model inference with limited memory,” arXiv preprint arXiv:2312.11514, 2023.
  • [42] N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv:1911.02150, 2019.
  • [43] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,” in The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  • [44] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611–626.
  • [45] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.
  • [46] T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,” arXiv preprint arXiv:2307.08691, 2023.
  • [47] K. Hong, G. Dai, J. Xu, Q. Mao, X. Li, J. Liu, K. Chen, H. Dong, and Y. Wang, “Flashdecoding++: Faster large language model inference on gpus,” arXiv preprint arXiv:2311.01282, 2023.
  • [48] X. Han, G. Zeng, W. Zhao, Z. Liu, Z. Zhang, J. Zhou, J. Zhang, J. Chao, and M. Sun, “Bminf: An efficient toolkit for big model inference and tuning,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 224–230.
  • [49] P. Patel, E. Choukse, C. Zhang, Í. Goiri, A. Shah, S. Maleki, and R. Bianchini, “Splitwise: Efficient generative llm inference using phase splitting,” arXiv preprint arXiv:2311.18677, 2023.
  • [50] B. Wu, Y. Zhong, Z. Zhang, G. Huang, X. Liu, and X. Jin, “Fast distributed inference serving for large language models,” arXiv preprint arXiv:2305.05920, 2023.
  • [51] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, C. Shi, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “Specinfer: Accelerating generative large language model serving with speculative inference and token tree verification,” 2023.
  • [52] D. Xu, W. Yin, X. Jin, Y. Zhang, S. Wei, M. Xu, and X. Liu, “Llmcad: Fast and scalable on-device large language model inference,” arXiv preprint arXiv:2309.04255, 2023.
  • [53] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, “Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale,” in International conference on machine learning.   PMLR, 2022, pp. 18 332–18 346.
  • [54] R. Yi, L. Guo, S. Wei, A. Zhou, S. Wang, and M. Xu, “Edgemoe: Fast on-device inference of moe-based large language models,” arXiv preprint arXiv:2308.14352, 2023.
  • [55] A. Eliseev and D. Mazur, “Fast inference of mixture-of-experts language models with offloading,” arXiv preprint arXiv:2312.17238, 2023.
  • [56] L. Xue, Y. Fu, Z. Lu, L. Mai, and M. Marina, “Moe-infinity: Activation-aware expert offloading for efficient moe serving,” arXiv preprint arXiv:2401.14361, 2024.
  • [57] K. Kamahori, Y. Gu, K. Zhu, and B. Kasikci, “Fiddler: Cpu-gpu orchestration for fast inference of mixture-of-experts models,” arXiv preprint arXiv:2402.07033, 2024.
  • [58] R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, T. Cao, M. Yang, and M. Rhu, “Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference,” arXiv preprint arXiv:2308.12066, 2023.
  • [59] Z. Hong, X. Qiu, J. Lin, W. Chen, Y. Yu, H. Wang, S. Guo, and W. Gao, “Intelligence-endogenous management platform for computing and network convergence,” IEEE Network, 2023.
  • [60] C. Liu and J. Zhao, “Resource allocation in large language model integrated 6g vehicular networks,” arXiv preprint arXiv:2403.19016, 2024.
  • [61] J. Zhao, Y. Song, S. Liu, I. G. Harris, and S. A. Jyothi, “Lingualinked: A distributed large language model inference system for mobile devices,” arXiv preprint arXiv:2312.00388, 2023.
  • [62] Z. Jiang, H. Lin, Y. Zhong, Q. Huang, Y. Chen, Z. Zhang, Y. Peng, X. Li, C. Xie, S. Nong et al., “MegaScale: Scaling large language model training to more than 10,000 GPUs,” in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 745–760.
  • [63] H. Wang, Z. Qu, S. Guo, N. Wang, R. Li, and W. Zhuang, “LOSP: Overlap synchronization parallel with local compensation for fast distributed training,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2541–2557, 2021.
  • [64] X. Mu, Y. Liu, L. Guo, and N. Al-Dhahir, “Heterogeneous semantic and bit communications: A semi-noma scheme,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 155–169, 2022.
  • [65] Z. Qin, J. Ying, D. Yang, H. Wang, and X. Tao, “Computing networks enabled semantic communications,” IEEE Network, 2024.
  • [66] G. Gerganov, “ggerganov/llama.cpp: Port of facebook’s llama model in c/c++.” https://github.com/ggerganov/llama.cpp, 2023.
  • [67] M. team, “MLC-LLM,” 2023. [Online]. Available: https://github.com/mlc-ai/mlc-llm
  • [68] mnn llm, “mnn-llm: llm deploy project based mnn.” https://github.com/wangzhaode/mnn-llm, 2023.
  • [69] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” 2023.
  • [70] R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al., “Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2022, pp. 1–15.
  • [71] Y. Gorbachev, M. Fedorov, I. Slavutin, A. Tugarev, M. Fatekhov, and Y. Tarkan, “Openvino deep learning workbench: Comprehensive analysis and tuning of neural networks inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
  • [72] mllm, “mllm is a fast and lightweight multimodal llm inference engine for mobile and edge devices.” https://github.com/UbiquitousLearning/mllm, 2023.
  • [73] H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, A. Bakhtiari, M. Wyatt, D. Zhuang, Z. Zhou et al., “Fp6-llm: Efficiently serving large language models through fp6-centric algorithm-system co-design,” arXiv preprint arXiv:2401.14112, 2024.
  • [74] S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, “Colossal-ai: A unified deep learning system for large-scale parallel training,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
  • [75] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
  • [76] tensorrtllm, “A tensorrt toolbox for optimized large language model inference,” https://github.com/NVIDIA/TensorRT-LLM, 2023.
  • [77] Harrison Chase, “Langchain,” https://github.com/langchain-ai/langchain, 2024, accessed: 2024-04-07.
  • [78] C. Lin, Z. Han, C. Zhang, Y. Yang, F. Yang, C. Chen, and L. Qiu, “Parrot: Efficient Serving of LLM-based Applications with Semantic Variable,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24).   Santa Clara, CA: USENIX Association, Jul. 2024.
  • [79] L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng, “SGLang: Efficient Execution of Structured Language Model Programs,” 2024.
  • [80] D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A low-latency online prediction serving system,” in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 613–627.
  • [81] C. Zhang, M. Yu, W. Wang, and F. Yan, “MArk: Exploiting cloud services for cost-effective, SLO-aware machine learning inference serving,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 1049–1062.
  • [82] H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Krishnamurthy, and R. Sundaram, “Nexus: A GPU cluster engine for accelerating DNN-based video analysis,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 322–337.
  • [83] D. Crankshaw, G.-E. Sela, X. Mo, C. Zumar, I. Stoica, J. Gonzalez, and A. Tumanov, “InferLine: latency-aware provisioning and scaling for prediction serving pipelines,” in Proceedings of the 11th ACM Symposium on Cloud Computing, 2020, pp. 477–491.
  • [84] A. Gujarati, R. Karimi, S. Alzayat, W. Hao, A. Kaufmann, Y. Vigfusson, and J. Mace, “Serving DNNs like clockwork: Performance predictability from the bottom up,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 443–462.
  • [85] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “INFaaS: Automated model-less inference serving,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 397–411.
  • [86] L. Wang, L. Yang, Y. Yu, W. Wang, B. Li, X. Sun, J. He, and L. Zhang, “Morphling: Fast, near-optimal auto-configuration for cloud-native model serving,” in Proceedings of the ACM Symposium on Cloud Computing, 2021, pp. 639–653.
  • [87] J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T. Kandemir, and C. R. Das, “Cocktail: A multidimensional optimization for model serving in cloud,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 1041–1057.
  • [88] B. Li, S. Samsi, V. Gadepally, and D. Tiwari, “Kairos: Building cost-efficient machine learning inference systems with heterogeneous cloud resources,” in Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing, 2023, pp. 3–16.
  • [89] H. Zhang, Y. Tang, A. Khandelwal, and I. Stoica, “SHEPHERD: Serving DNNs in the wild,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 787–808.
  • [90] X. Miao, C. Shi, J. Duan, X. Xi, D. Lin, B. Cui, and Z. Jia, “Spotserve: Serving generative large language models on preemptible instances,” arXiv preprint arXiv:2311.15566, 2023.
  • [91] W. Na, S. Jang, Y. Lee, L. Park, N.-N. Dao, and S. Cho, “Frequency resource allocation and interference management in mobile edge computing for an Internet of Things system,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4910–4920, 2018.
  • [92] C. Avasalcai, C. Tsigkanos, and S. Dustdar, “Decentralized resource auctioning for latency-sensitive edge computing,” in 2019 IEEE international conference on edge computing (EDGE).   IEEE, 2019, pp. 72–76.
  • [93] L. Yang, B. Liu, J. Cao, Y. Sahni, and Z. Wang, “Joint computation partitioning and resource allocation for latency sensitive applications in mobile edge clouds,” IEEE Transactions on Services Computing, vol. 14, no. 5, pp. 1439–1452, 2019.
  • [94] Z. Tong, X. Deng, F. Ye, S. Basodi, X. Xiao, and Y. Pan, “Adaptive computation offloading and resource allocation strategy in a mobile edge computing environment,” Information Sciences, vol. 537, pp. 116–131, 2020.
  • [95] X. Xiong, K. Zheng, L. Lei, and L. Hou, “Resource allocation based on deep reinforcement learning in IoT edge computing,” IEEE Journal on Selected Areas in Communications, vol. 38, no. 6, pp. 1133–1146, 2020.
  • [96] Z. Zhou, S. Yu, W. Chen, and X. Chen, “CE-IoT: Cost-effective cloud-edge resource provisioning for heterogeneous IoT applications,” IEEE Internet of Things Journal, vol. 7, no. 9, pp. 8600–8614, 2020.
  • [97] Z. Chang, L. Liu, X. Guo, and Q. Sheng, “Dynamic resource allocation and computation offloading for IoT fog computing system,” IEEE Transactions on Industrial Informatics, vol. 17, no. 5, pp. 3348–3357, 2020.
  • [98] B. Wang, A. Ali-Eldin, and P. Shenoy, “Lass: Running latency sensitive serverless computations at the edge,” in Proceedings of the 30th international symposium on high-performance parallel and distributed computing, 2021, pp. 239–251.
  • [99] O. Ascigil, A. G. Tasiopoulos, T. K. Phan, V. Sourlas, I. Psaras, and G. Pavlou, “Resource provisioning and allocation in function-as-a-service edge-clouds,” IEEE Transactions on Services Computing, vol. 15, no. 4, pp. 2410–2424, 2021.
  • [100] X. Li, P. Kang, J. Molone, W. Wang, and P. Lama, “KneeScale: Efficient resource scaling for serverless computing at the edge,” in 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid).   IEEE, 2022, pp. 180–189.
  • [101] S. Hu, W. Shi, and G. Li, “CEC: A containerized edge computing framework for dynamic resource provisioning,” IEEE Transactions on Mobile Computing, 2022.
  • [102] H. Wang, H. Xu, Y. Li, Y. Xu, R. Li, and T. Zhang, “FedCDA: Federated Learning with Cross-rounds Divergence-aware Aggregation,” in The Twelfth International Conference on Learning Representations, 2024.
  • [103] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
  • [104] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
  • [105] Z. Li, L. Zheng, Y. Zhong, V. Liu, Y. Sheng, X. Jin, Y. Huang, Z. Chen, H. Zhang, J. E. Gonzalez et al., “Alpaserve: Statistical multiplexing with model parallelism for deep learning serving,” in 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), 2023, pp. 663–679.
  • [106] D. Li, R. Shao, A. Xie, E. P. Xing, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “Lightseq: Sequence level parallelism for distributed training of long context transformers,” arXiv preprint arXiv:2310.03294, 2023.
  • [107] A. Borzunov, M. Ryabinin, A. Chumachenko, D. Baranchuk, T. Dettmers, Y. Belkada, P. Samygin, and C. A. Raffel, “Distributed Inference and Fine-tuning of Large Language Models Over The Internet,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [108] A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,” arXiv preprint arXiv:2308.16369, 2023.
  • [109] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang, “DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving,” arXiv preprint arXiv:2401.09670, 2024.
  • [110] L. Zhou, H. Wen, R. Teodorescu, and D. H. Du, “Distributing deep neural networks with containerized partitions at the edge,” in 2nd USENIX Workshop on Hot Topics in Edge Computing (HotEdge 19), 2019.
  • [111] T. Mohammed, C. Joe-Wong, R. Babbar, and M. Di Francesco, “Distributed inference acceleration with adaptive DNN partitioning and offloading,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 854–863.
  • [112] L. Zeng, X. Chen, Z. Zhou, L. Yang, and J. Zhang, “Coedge: Cooperative dnn inference with adaptive workload partitioning over heterogeneous edge devices,” IEEE/ACM Transactions on Networking, vol. 29, no. 2, pp. 595–608, 2020.
  • [113] J. Li, W. Liang, Y. Li, Z. Xu, X. Jia, and S. Guo, “Throughput maximization of delay-aware DNN inference in edge computing by exploring DNN model partitioning and inference parallelism,” IEEE Transactions on Mobile Computing, vol. 22, no. 5, pp. 3017–3030, 2021.
  • [114] Y. Wu, M. Lentz, D. Zhuo, and Y. Lu, “Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures,” Proceedings of the VLDB Endowment, vol. 16, no. 3, pp. 406–419, 2022.
  • [115] Y. Hu, C. Imes, X. Zhao, S. Kundu, P. A. Beerel, S. P. Crago, and J. P. Walters, “PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices,” in 2022 25th Euromicro Conference on Digital System Design (DSD), 2022, pp. 298–307.
  • [116] L. Wu, G. Gao, J. Yu, F. Zhou, Y. Yang, and T. Wang, “PDD: partitioning DAG-topology DNNs for streaming tasks,” IEEE Internet of Things Journal, 2023.
  • [117] T. Feltin, L. Marchó, J.-A. Cordero-Fuertes, F. Brockners, and T. H. Clausen, “Dnn partitioning for inference throughput acceleration at the edge,” IEEE Access, vol. 11, pp. 52 236–52 249, 2023.
  • [118] H. Li, X. Li, Q. Fan, Q. He, X. Wang, and V. C. Leung, “Distributed DNN Inference with Fine-grained Model Partitioning in Mobile Edge Computing Networks,” IEEE Transactions on Mobile Computing, 2024.
  • [119] Z. Liu, M. Tian, M. Dong, X. Wang, C. Qiu, and C. Zhang, “MoEI: Mobility-Aware Edge Inference Based on Model Partition and Service Migration,” IEEE Transactions on Mobile Computing, no. 01, pp. 1–14, 2024.
  • [120] NVIDIA Corporation, “NVIDIA Triton Inference Server,” https://developer.nvidia.com/nvidia-triton-inference-server, 2024, accessed: 2024-04-17.
  • [121] Google LLC, “TensorFlow Serving,” https://www.tensorflow.org/tfx/guide/serving, 2024, accessed: 2024-04-17.
  • [122] H. Wang, P. Zheng, X. Han, W. Xu, R. Li, and T. Zhang, “FedNLR: Federated Learning with Neuron-wise Learning Rates,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 3069–3080.
  • [123] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [124] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
  • [125] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners.”
  • [126] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [127] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang et al., “Pangu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation,” arXiv e-prints, pp. arXiv–2104, 2021.
  • [128] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu et al., “Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation,” arXiv preprint arXiv:2107.02137, 2021.
  • [129] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical details and evaluation,” White Paper. AI21 Labs, vol. 1, p. 9, 2021.
  • [130] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022.
  • [131] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
  • [132] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
  • [133] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021.
  • [134] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., “Opt: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
  • [135] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large language model for science,” arXiv preprint arXiv:2211.09085, 2022.
  • [136] T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., “Bloom: A 176b-parameter open-access multilingual language model,” 2023.
  • [137] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [138] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [139] BaiChuan-Inc, “Baichuan-7b: About: A large-scale 7b pretraining language model developed by baichuan-inc,” https://github.com/baichuan-inc/Baichuan-7B/tree/main, 2024, accessed: 2024-04-07.
  • [140] Baichuan Intelligent Technology, “A 13b large language model developed by baichuan intelligent technology,” https://github.com/baichuan-inc/Baichuan-13B/tree/main, 2024, accessed: 2024-04-07.
  • [141] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023.
  • [142] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang, B. Li, C. Cheng, W. Lü, R. Hu et al., “Skywork: A more open bilingual foundation model,” arXiv preprint arXiv:2310.19341, 2023.
  • [143] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023.
  • [144] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source be with you!” arXiv preprint arXiv:2305.06161, 2023.
  • [145] A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang et al., “Yi: Open foundation models by 01. ai,” arXiv preprint arXiv:2403.04652, 2024.
  • [146] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” 2022.
  • [147] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A massively multilingual pre-trained text-to-text transformer,” arXiv preprint arXiv:2010.11934, 2020.
  • [148] S. Shen, L. Hou, Y. Zhou, N. Du, S. Longpre, J. Wei, H. W. Chung, B. Zoph, W. Fedus, X. Chen et al., “Flan-moe: Scaling instruction-finetuned language models with sparse mixture of experts,” arXiv e-prints, pp. arXiv–2305, 2023.
  • [149] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling instruction-finetuned language models,” 2022.
  • [150] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Alpaca,” 2023. [Online]. Available: https://crfm.stanford.edu/2023/03/13/alpaca.html
  • [151] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and J. Tang, “Glm-130b: An open bilingual pre-trained model,” 2023.
  • [152] X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan, P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open llm and how to train it with $100K budget,” arXiv preprint arXiv:2309.03852, 2023.
  • [153] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022.
  • [154] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [155] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
  • [156] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi et al., “mplug-owl: Modularization empowers large language models with multimodality,” arXiv preprint arXiv:2304.14178, 2023.
  • [157] W. Wang, Q. Lv, W. Yu, W. Hong, J. Qi, Y. Wang, J. Ji, Z. Yang, L. Zhao, X. Song et al., “Cogvlm: Visual expert for pretrained language models,” arXiv preprint arXiv:2311.03079, 2023.
  • [158] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190.
  • [159] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai, “Pandagpt: One model to instruction-follow them all,” arXiv preprint arXiv:2305.16355, 2023.
  • [160] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Next-gpt: Any-to-any multimodal llm,” arXiv preprint arXiv:2309.05519, 2023.
  • [161] J. Han, K. Gong, Y. Zhang, J. Wang, K. Zhang, D. Lin, Y. Qiao, P. Gao, and X. Yue, “Onellm: One framework to align all modalities with language,” arXiv preprint arXiv:2312.03700, 2023.
  • [162] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
  • [163] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.
  • [164] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal llm’s referential dialogue magic,” arXiv preprint arXiv:2306.15195, 2023.
  • [165] H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y. Yang, “Ferret: Refer and ground anything anywhere at any granularity,” arXiv preprint arXiv:2310.07704, 2023.
  • [166] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A language modeling framework for object detection,” 2022.
  • [167] A. Zhang, L. Zhao, C.-W. Xie, Y. Zheng, W. Ji, and T.-S. Chua, “Next-chat: An lmm for chat, detection and segmentation,” arXiv preprint arXiv:2311.04498, 2023.
  • [168] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov et al., “Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing,” arXiv preprint arXiv:2303.10845, 2023.
  • [169] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
  • [170] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
  • [171] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
  • [172] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2019.
  • [173] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
  • [174] Y. Tang, F. Liu, Y. Ni, Y. Tian, Z. Bai, Y.-Q. Hu, S. Liu, S. Jui, K. Han, and Y. Wang, “Rethinking optimization and architecture for tiny language models,” arXiv preprint arXiv:2402.02791, 2024.
  • [175] P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,” arXiv preprint arXiv:2401.02385, 2024.
  • [176] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree, A. Bakhtiari, H. Behl et al., “Phi-3 technical report: A highly capable language model locally on your phone,” arXiv preprint arXiv:2404.14219, 2024.
  • [177] M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” arXiv preprint arXiv:2306.11695, 2023.
  • [178] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” Advances in neural information processing systems, vol. 36, 2024.
  • [179] M. Zhang, H. Chen, C. Shen, Z. Yang, L. Ou, X. Yu, and B. Zhuang, “Loraprune: Pruning meets low-rank parameter-efficient fine-tuning,” 2023.
  • [180] T. Chen, T. Ding, B. Yadav, I. Zharkov, and L. Liang, “Lorashear: Efficient large language model structured pruning and knowledge recovery,” arXiv preprint arXiv:2310.18356, 2023.
  • [181] Y. An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-based adaptive structured pruning for large language models,” arXiv preprint arXiv:2312.11983, 2023.
  • [182] H. Shao, B. Liu, and Y. Qian, “One-shot sensitivity-aware mixed sparsity pruning for large language models,” arXiv preprint arXiv:2310.09499, 2023.
  • [183] S. Guo, J. Xu, L. L. Zhang, and M. Yang, “Compresso: Structured pruning with collaborative prompting learns compact large language models,” arXiv preprint arXiv:2310.05015, 2023.
  • [184] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerating language model pre-training via structured pruning,” arXiv preprint arXiv:2310.06694, 2023.
  • [185] Y. Ji, Y. Cao, and J. Liu, “Pruning large language models via accuracy predictor,” arXiv preprint arXiv:2309.09507, 2023.
  • [186] R. J. Das, L. Ma, and Z. Shen, “Beyond size: How gradients shape pruning decisions in large language models,” arXiv preprint arXiv:2311.04902, 2023.
  • [187] S. Anagnostidis, D. Pavllo, L. Biggio, L. Noci, A. Lucchi, and T. Hofmann, “Dynamic context pruning for efficient and interpretable autoregressive transformers,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [188] E. Kurtić, E. Frantar, and D. Alistarh, “ZipLM: Inference-Aware Structured Pruning of Language Models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [189] Y. Peng, Y. Sudo, S. Muhammad, and S. Watanabe, “DPHuBERT: Joint distillation and pruning of self-supervised speech models,” arXiv preprint arXiv:2305.17651, 2023.
  • [190] Y. Peng, K. Kim, F. Wu, P. Sridhar, and S. Watanabe, “Structured Pruning of Self-Supervised Pre-Trained Models for Speech Recognition and Understanding,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [191] H. Lin, H. Bai, Z. Liu, L. Hou, M. Sun, L. Song, Y. Wei, and Z. Sun, “MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric,” arXiv preprint arXiv:2403.07839, 2024.
  • [192] L. Yu and W. Xiang, “X-pruner: explainable pruning for vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 355–24 363.
  • [193] G. Fang, X. Ma, and X. Wang, “Structural pruning for diffusion models,” Advances in neural information processing systems, vol. 36, 2024.
  • [194] H. Yu and J. Wu, “A unified pruning framework for vision transformers,” Science China Information Sciences, vol. 66, no. 7, p. 179101, 2023.
  • [195] D. Kuznedelev, E. Kurtić, E. Frantar, and D. Alistarh, “CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 28 805–28 831.
  • [196] V. S. Lodagala, S. Ghosh, and S. Umesh, “Pada: Pruning assisted domain adaptation for self-supervised speech representations,” in 2022 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2023, pp. 136–143.
  • [197] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning.   PMLR, 2023, pp. 38 087–38 099.
  • [198] Z. Yuan, L. Niu, J. Liu, W. Liu, X. Wang, Y. Shang, G. Sun, Q. Wu, J. Wu, and B. Wu, “Rptq: Reorder-based post-training quantization for large language models,” arXiv preprint arXiv:2304.01089, 2023.
  • [199] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning-aware quantization for large language models,” arXiv preprint arXiv:2310.08659, 2023.
  • [200] X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, “Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling,” arXiv preprint arXiv:2304.09145, 2023.
  • [201] Q. Li, Y. Zhang, L. Li, P. Yao, B. Zhang, X. Chu, Y. Sun, L. Du, and Y. Xie, “FPTQ: Fine-grained Post-Training Quantization for Large Language Models,” arXiv preprint arXiv:2308.15987, 2023.
  • [202] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “Owq: Lessons learned from activation outliers for weight quantization in large language models,” arXiv preprint arXiv:2306.02272, 2023.
  • [203] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” arXiv preprint arXiv:2306.00978, 2023.
  • [204] Y. Zhang, L. Zhao, S. Cao, W. Wang, T. Cao, F. Yang, M. Yang, S. Zhang, and N. Xu, “Integer or floating point? new outlooks for low-bit quantization on large language models,” arXiv preprint arXiv:2305.12356, 2023.
  • [205] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” arXiv preprint arXiv:2308.13137, 2023.
  • [206] R. Liu, H. Bai, H. Lin, Y. Li, H. Gao, Z. Xu, L. Hou, J. Yao, and C. Yuan, “IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact,” arXiv preprint arXiv:2403.01241, 2024.
  • [207] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, “Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 36 187–36 207.
  • [208] J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, and B. Zhuang, “Qllm: Accurate and efficient low-bitwidth quantization for large language models,” arXiv preprint arXiv:2310.08041, 2023.
  • [209] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: Data-free quantization aware training for large language models,” arXiv preprint arXiv:2305.17888, 2023.
  • [210] J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, “Quip: 2-bit quantization of large language models with guarantees,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [211] L. Li, Q. Li, B. Zhang, and X. Chu, “Norm tweaking: High-performance low-bit quantization of large language models,” arXiv preprint arXiv:2309.02784, 2023.
  • [212] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,” arXiv preprint arXiv:2303.08302, 2023.
  • [213] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low-rank adaptation of large language models,” arXiv preprint arXiv:2309.14717, 2023.
  • [214] Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei, “Int2.1: Towards fine-tunable quantized large language models with error correction through low-rank adaptation,” arXiv preprint arXiv:2306.08162, 2023.
  • [215] E. J. Michaud, Z. Liu, U. Girit, and M. Tegmark, “The Quantization Model of Neural Scaling,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [216] M. W. U. Rahman, M. M. Abrar, H. G. Copening, S. Hariri, S. Shao, P. Satam, and S. Salehi, “Quantized transformer language model implementations on edge devices,” arXiv preprint arXiv:2310.03971, 2023.
  • [217] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” arXiv preprint arXiv:2305.02301, 2023.
  • [218] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib et al., “Zephyr: Direct distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
  • [219] Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial distillation of closed-source large language model,” arXiv preprint arXiv:2305.12870, 2023.
  • [220] X. Zhu, B. Qi, K. Zhang, X. Long, and B. Zhou, “Pad: Program-aided distillation specializes large models in reasoning,” arXiv preprint arXiv:2305.13888, 2023.
  • [221] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J.-F. Kagy, and R. Agarwal, “Distillspec: Improving speculative decoding via knowledge distillation,” arXiv preprint arXiv:2310.08461, 2023.
  • [222] E. Latif, L. Fang, P. Ma, and X. Zhai, “Knowledge distillation of llm for education,” arXiv preprint arXiv:2312.15842, 2023.
  • [223] C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao, “Less is more: Task-aware layer-wise distillation for language model compression,” in International Conference on Machine Learning.   PMLR, 2023, pp. 20 852–20 867.
  • [224] C. Liang, H. Jiang, Z. Li, X. Tang, B. Yin, and T. Zhao, “Homodistil: Homotopic task-agnostic distillation of pre-trained transformers,” arXiv preprint arXiv:2302.09632, 2023.
  • [225] L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi, “Symbolic chain-of-thought distillation: Small models can also ‘think’ step-by-step,” arXiv preprint arXiv:2306.14050, 2023.
  • [226] C. Liu, Y. Kang, F. Zhao, K. Kuang, Z. Jiang, C. Sun, and F. Wu, “Evolving Knowledge Distillation with Large Language Models and Active Learning,” arXiv preprint arXiv:2403.06414, 2024.
  • [227] C. Zhang, Y. Yang, Q. Wang, J. Liu, J. Wang, W. Wu, and D. Song, “Minimal Distillation Schedule for Extreme Language Model Compression,” in Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 1378–1394.
  • [228] V. Kontonis, F. Iliopoulos, K. Trinh, C. Baykal, G. Menghani, and E. Vee, “Slam: Student-label mixing for distillation with unlabeled examples,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [229] W. Zhou, S. Zhang, Y. Gu, M. Chen, and H. Poon, “Universalner: Targeted distillation from large language models for open named entity recognition,” arXiv preprint arXiv:2308.03279, 2023.
  • [230] P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren, “Scott: Self-consistent chain-of-thought distillation,” arXiv preprint arXiv:2305.01879, 2023.
  • [231] J. Chang, S. Wang, H.-M. Xu, Z. Chen, C. Yang, and F. Zhao, “Detrdistill: A universal knowledge distillation framework for detr-families,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6898–6908.
  • [232] B. Huang, M. Chen, Y. Wang, J. Lu, M. Cheng, and W. Wang, “Boosting accuracy and robustness of student models via adaptive adversarial distillation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 668–24 677.
  • [233] X. Sun, P. Zhang, P. Zhang, H. Shah, K. Saenko, and X. Xia, “Dime-fm: Distilling multimodal and efficient foundation models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 521–15 533.
  • [234] D. Li, Z. Tan, T. Chen, and H. Liu, “Contextualization distillation from large language model for knowledge graph completion,” arXiv preprint arXiv:2402.01729, 2024.
  • [235] S. Hu, G. Zou, S. Yang, B. Zhang, and Y. Chen, “Large Language Model Meets Graph Neural Network in Knowledge Distillation,” arXiv preprint arXiv:2402.05894, 2024.
  • [236] J. Marrie, M. Arbel, J. Mairal, and D. Larlus, “On Good Practices for Task-Specific Distillation of Large Pretrained Models,” arXiv preprint arXiv:2402.11305, 2024.
  • [237] H. Wang, Y. Li, W. Xu, R. Li, Y. Zhan, and Z. Zeng, “Dafkd: Domain-aware federated knowledge distillation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 412–20 421.
  • [238] K. Zhao, Z. Zhou, X. Chen, R. Zhou, X. Zhang, S. Yu, and D. Wu, “Edgeadaptor: Online configuration adaption, model selection and resource provisioning for edge dnn inference serving at scale,” IEEE Transactions on Mobile Computing, 2022.
  • [239] L. Guo, W. Choe, and F. X. Lin, “Sti: Turbocharge nlp inference at the edge via elastic pipelining,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 791–803.
  • [240] Y. Wang, K. Chen, H. Tan, and K. Guo, “Tabi: An efficient multi-level inference system for large language models,” in Proceedings of the Eighteenth European Conference on Computer Systems, 2023, pp. 233–248.
  • [241] T. Schuster, A. Fisch, T. Jaakkola, and R. Barzilay, “Consistent Accelerated Inference via Confident Adaptive Transformers,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 4962–4979.
  • [242] T. Ribeiro de Oliveira, B. Biancardi Rodrigues, M. Moura da Silva, R. Antonio N. Spinassé, G. Giesen Ludke, M. Ruy Soares Gaudio, G. Iglesias Rocha Gomes, L. Guio Cotini, D. da Silva Vargens, M. Queiroz Schimidt et al., “Virtual reality solutions employing artificial intelligence methods: A systematic literature review,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–29, 2023.
  • [243] P. Esmaeilzadeh, “Use of AI-based tools for healthcare purposes: a survey study from consumers’ perspectives,” BMC medical informatics and decision making, vol. 20, pp. 1–19, 2020.
  • [244] Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen, “Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding,” arXiv preprint arXiv:2402.12374, 2024.
  • [245] S. Wang, H. Yang, X. Wang, T. Liu, P. Wang, X. Liang, K. Ma, T. Feng, X. You, Y. Bao et al., “Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding,” arXiv preprint arXiv:2402.15678, 2024.
  • [246] S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao, “Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs,” in The Twelfth International Conference on Learning Representations, 2023.
  • [247] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in The Twelfth International Conference on Learning Representations, 2023.
  • [248] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett et al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [249] J. Li, L. L. Zhang, J. Xu, Y. Wang, S. Yan, Y. Xia, Y. Yang, T. Cao, H. Sun, W. Deng et al., “Constraint-aware and ranking-distilled token pruning for efficient transformer inference,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 1280–1290.
  • [250] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 13 358–13 376.
  • [251] H. Jiang, Q. Wu, X. Luo, D. Li, C.-Y. Lin, Y. Yang, and L. Qiu, “Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression,” arXiv preprint arXiv:2310.06839, 2023.
  • [252] Y. Li, B. Dong, F. Guerin, and C. Lin, “Compressing Context to Enhance Inference Efficiency of Large Language Models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 6342–6353.
  • [253] S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, and K. Keutzer, “Learned token pruning for transformers,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 784–794.
  • [254] A. Modarressi, H. Mohebbi, and M. T. Pilehvar, “AdapLeR: Speeding up Inference by Adaptive Length Reduction,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 1–15.
  • [255] G. Kim and K. Cho, “Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 6501–6511.
  • [256] S. Goyal, A. R. Choudhury, S. Raje, V. Chakaravarthy, Y. Sabharwal, and A. Verma, “Power-bert: Accelerating bert inference via progressive word-vector elimination,” in International Conference on Machine Learning.   PMLR, 2020, pp. 3690–3699.
  • [257] T. Ge, H. Jing, L. Wang, X. Wang, S.-Q. Chen, and F. Wei, “In-context Autoencoder for Context Compression in a Large Language Model,” in The Twelfth International Conference on Learning Representations, 2023.
  • [258] J. Mu, X. Li, and N. Goodman, “Learning to compress prompts with gist tokens,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [259] A. Chevalier, A. Wettig, A. Ajith, and D. Chen, “Adapting Language Models to Compress Contexts,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 3829–3846.
  • [260] Y. Zhang, L. Wei, and N. M. Freris, “Synergistic Patch Pruning for Vision Transformer: Unifying Intra- & Inter-Layer Patch Importance.”
  • [261] D. Liu, M. Kan, S. Shan, and X. Chen, “A Simple Romance Between Multi-Exit Vision Transformer and Token Reduction,” in The Twelfth International Conference on Learning Representations, 2023.
  • [262] P. Dong, M. Sun, A. Lu, Y. Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang et al., “Heatvit: Hardware-efficient adaptive token pruning for vision transformers,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA).   IEEE, 2023, pp. 442–455.
  • [263] Y. Tang, K. Han, Y. Wang, C. Xu, J. Guo, C. Xu, and D. Tao, “Patch slimming for efficient vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 165–12 174.
  • [264] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynamicvit: Efficient vision transformers with dynamic token sparsification,” Advances in neural information processing systems, vol. 34, pp. 13 937–13 949, 2021.
  • [265] D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token Merging: Your ViT But Faster,” in The Eleventh International Conference on Learning Representations, 2022.
  • [266] Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, and P. Xie, “EViT: Expediting Vision Transformers via Token Reorganizations,” in International Conference on Learning Representations, 2021.
  • [267] M. Chen, W. Shao, P. Xu, M. Lin, K. Zhang, F. Chao, R. Ji, Y. Qiao, and P. Luo, “DiffRate: Differentiable Compression Rate for Efficient Vision Transformers,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV).   IEEE, 2023, pp. 17 118–17 128.
  • [268] S. Long, Z. Zhao, J. Pi, S. Wang, and J. Wang, “Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2023, pp. 10 334–10 343.
  • [269] S. Wei, T. Ye, S. Zhang, Y. Tang, and J. Liang, “Joint token pruning and squeezing towards more aggressive compression of vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2092–2101.
  • [270] Q. Cao, B. Paranjape, and H. Hajishirzi, “PuMer: Pruning and Merging Tokens for Efficient Vision Language Models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 12 890–12 903.
  • [271] S. Ren, S. Chen, S. Li, X. Sun, and L. Hou, “TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 932–947.
  • [272] S. Ding, P. Zhao, X. Zhang, R. Qian, H. Xiong, and Q. Tian, “Prune spatio-temporal tokens by semantic-aware temporal accumulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 945–16 956.
  • [273] J. Wang, X. Yang, H. Li, L. Liu, Z. Wu, and Y.-G. Jiang, “Efficient video transformers with spatial-temporal token selection,” in European Conference on Computer Vision.   Springer, 2022, pp. 69–86.
  • [274] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient Transformers: A Survey,” ACM Comput. Surv., vol. 55, no. 6, dec 2022.
  • [275] J. B. Haurum, S. Escalera, G. W. Taylor, and T. B. Moeslund, “Which tokens to use? investigating token reduction in vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 773–783.
  • [276] J. Chen, W. Xu, Z. Hong, S. Guo, H. Wang, J. Zhang, and D. Zeng, “OTAS: An Elastic Transformer Serving System via Token Adaptation,” 2024.
  • [277] F. Xue, V. Likhosherstov, A. Arnab, N. Houlsby, M. Dehghani, and Y. You, “Adaptive computation with elastic input sequence,” in International Conference on Machine Learning.   PMLR, 2023, pp. 38 971–38 988.
  • [278] M. R. Morris, J. Sohl-Dickstein, N. Fiedel, T. Warkentin, A. Dafoe, A. Faust, C. Farabet, and S. Legg, “Levels of AGI: Operationalizing Progress on the Path to AGI,” arXiv preprint arXiv:2311.02462, 2023.
  • [279] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, Y. Yu, Y. Lu, Y.-H. Hung, C. Qian, Y. Qin, X. Cong, R. Xie, Z. Liu, M. Sun, and J. Zhou, “AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=EHg5GDnyq1
  • [280] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang et al., “Agentbench: Evaluating llms as agents,” arXiv preprint arXiv:2308.03688, 2023.
  • [281] H. Zhang, W. Du, J. Shan, Q. Zhou, Y. Du, J. B. Tenenbaum, T. Shu, and C. Gan, “Building cooperative embodied agents modularly with large language models,” arXiv preprint arXiv:2307.02485, 2023.
  • [282] S. Agashe, Y. Fan, and X. E. Wang, “Evaluating multi-agent coordination abilities in large language models,” arXiv preprint arXiv:2310.03903, 2023.
  • [283] W. Huang, F. Xia, D. Shah, D. Driess, A. Zeng, Y. Lu, P. Florence, I. Mordatch, S. Levine, K. Hausman, and B. Ichter, “Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents,” in Conference on Neural Information Processing Systems, 2023.
  • [284] A. Omidvar and A. An, “Empowering Conversational Agents using Semantic In-Context Learning,” in Annual Meeting of the Association for Computational Linguistics, 2023.
  • [285] Y. Talebirad and A. Nadiri, “Multi-agent collaboration: Harnessing the power of intelligent llm agents,” arXiv preprint arXiv:2306.03314, 2023.
  • [286] Z. Liu, Y. Zhang, P. Li, Y. Liu, and D. Yang, “Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization,” arXiv preprint arXiv:2310.02170, 2023.
  • [287] R. Gong, Q. Huang, X. Ma, H. Vo, Z. Durante, Y. Noda, Z. Zheng, S.-C. Zhu, D. Terzopoulos, L. Fei-Fei et al., “Mindagent: Emergent gaming interaction,” arXiv preprint arXiv:2309.09971, 2023.
  • [288] Z. Wang, S. Cai, G. Chen, A. Liu, X. S. Ma, and Y. Liang, “Describe, explain, plan and select: interactive planning with LLMs enables open-world multi-task agents,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [289] Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [290] E. Brooks, L. Walls, R. L. Lewis, and S. Singh, “Large language models can implement policy iteration,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [291] T. Dinh, J. Zhao, S. Tan, R. Negrinho, L. Lausen, S. Zha, and G. Karypis, “Large language models of code fail at completing code with potential bugs,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [292] L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [293] S. Wang, C. Liu, Z. Zheng, S. Qi, S. Chen, Q. Yang, A. Zhao, C. Wang, S. Song, and G. Huang, “Avalon’s Game of Thoughts: Battle Against Deception through Recursive Contemplation,” arXiv preprint arXiv:2310.01320, 2023.
  • [294] G. Lee, V. Hartmann, J. Park, D. Papailiopoulos, and K. Lee, “Prompted LLMs as Chatbot Modules for Long Open-domain Conversation,” in Annual Meeting of the Association for Computational Linguistics, 2023.
  • [295] D. Zhang, L. Chen, S. Zhang, H. Xu, Z. Zhao, and K. Yu, “Large Language Models Are Semi-Parametric Reinforcement Learning Agents,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [296] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [297] K. Zhang, F. Zhao, Y. Kang, and X. Liu, “Memory-augmented llm personalization with short-and long-term memory coordination,” arXiv preprint arXiv:2309.11696, 2023.
  • [298] K. Hatalis, D. Christou, J. Myers, S. Jones, K. Lambert, A. Amos-Binks, Z. Dannenhauer, and D. Dannenhauer, “Memory Matters: The Need to Improve Long-Term Memory in LLM-Agents,” in Proceedings of the AAAI Symposium Series, vol. 2, no. 1, 2023, pp. 277–280.
  • [299] X. Li and X. Qiu, “Mot: Memory-of-thought enables chatgpt to self-improve,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 6354–6374.
  • [300] N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui, “From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models,” arXiv preprint arXiv:2401.02777, 2024.
  • [301] A. Maharana, D.-H. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang, “Evaluating Very Long-Term Conversational Memory of LLM Agents,” arXiv preprint arXiv:2402.17753, 2024.
  • [302] T. Wang and K. Cho, “Larger-Context Language Modelling with Recurrent Neural Network,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1319–1329.
  • [303] Z. Wang, Z. Wang, L. Le, H. S. Zheng, S. Mishra, V. Perot, Y. Zhang, A. Mattapalli, A. Taly, J. Shang et al., “Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting,” arXiv preprint arXiv:2407.08223, 2024.
  • [304] Z. Zhang, A. Zhu, L. Yang, Y. Xu, L. Li, P. M. Phothilimthana, and Z. Jia, “Accelerating retrieval-augmented language model serving with speculation,” arXiv preprint arXiv:2401.14021, 2024.
  • [305] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [306] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: Teaching large language model to use tools via self-instruction,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [307] Y. Ruan, H. Dong, A. Wang, S. Pitis, Y. Zhou, J. Ba, Y. Dubois, C. J. Maddison, and T. Hashimoto, “Identifying the risks of lm agents with an lm-emulated sandbox,” arXiv preprint arXiv:2309.15817, 2023.
  • [308] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, “ToolQA: A Dataset for LLM Question Answering with External Tools,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 50 117–50 143. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/9cb2a7495900f8b602cb10159246a016-Paper-Datasets_and_Benchmarks.pdf
  • [309] J. Ruan, Y. Chen, B. Zhang, Z. Xu, T. Bao, G. Du, S. Shi, H. Mao, Z. Li, X. Zeng et al., “TPTU: large language model-based AI agents for task planning and tool usage,” arXiv preprint arXiv:2308.03427, 2023.
  • [310] S. Hao, T. Liu, Z. Wang, and Z. Hu, “ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 45 870–45 894. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/file/8fd1a81c882cd45f64958da6284f4a3f-Paper-Conference.pdf
  • [311] B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro, “ART: Automatic multi-step reasoning and tool-use for large language models,” 2023.
  • [312] R. Abhyankar, Z. He, V. Srivatsa, H. Zhang, and Y. Zhang, “InferCept: Efficient Intercept Support for Augmented Large Language Model Inference,” in Forty-first International Conference on Machine Learning, 2024.
  • [313] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, “Orca: A distributed serving system for Transformer-Based generative models,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 521–538.
  • [314] W. Cui, H. Zhao, Q. Chen, H. Wei, Z. Li, D. Zeng, C. Li, and M. Guo, “DVABatch: Diversity-aware Multi-Entry Multi-Exit batching for efficient processing of DNN services on GPUs,” in 2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 183–198.
  • [315] Z. Zheng, X. Ren, F. Xue, Y. Luo, X. Jiang, and Y. You, “Response length perception and sequence scheduling: An llm-empowered llm inference pipeline,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [316] J. Fang, Y. Yu, C. Zhao, and J. Zhou, “Turbotransformers: an efficient gpu serving system for transformer models,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 389–402.
  • [317] Y. Wen and S. Chaudhuri, “Batched Low-Rank Adaptation of Foundation Models,” in The Twelfth International Conference on Learning Representations, 2023.
  • [318] Y. Sheng, S. Cao, D. Li, C. Hooper, N. Lee, S. Yang, C. Chou, B. Zhu, L. Zheng, K. Keutzer, J. E. Gonzalez, and I. Stoica, “S-LoRA: Serving Thousands of Concurrent LoRA Adapters,” arXiv preprint arXiv:2311.03285, 2023.
  • [319] L. Chen, Z. Ye, Y. Wu, D. Zhuo, L. Ceze, and A. Krishnamurthy, “Punica: Multi-tenant lora serving,” arXiv preprint arXiv:2310.18547, 2023.
  • [320] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-Rank Adaptation of Large Language Models,” in International Conference on Learning Representations, 2021.
  • [321] OpenAI, “ChatGPT,” https://chat.openai.com, accessed: 2024-05-12.
  • [322] Baidu, “Wenxin yiyan,” 2024. [Online]. Available: https://yiyan.baidu.com/
  • [323] GitHub, “Github copilot: Your ai pair programmer,” 2023. [Online]. Available: https://github.com/features/copilot
  • [324] OpenAI, “Openai codex,” 2021. [Online]. Available: https://openai.com/index/openai-codex/
  • [325] Midjourney, “Midjourney,” 2023. [Online]. Available: https://www.midjourney.com/
  • [326] OpenAI, “Sora,” 2024. [Online]. Available: https://openai.com/index/sora/
  • [327] Runway, “Gen-2 by runway,” 2023. [Online]. Available: https://research.runwayml.com/gen2
  • [328] T. B. Richards, Pi, B. Werlinger, D. Schonholtz, H. Araujo, Dion, D. Wurtz, Fergus, A. Minnella, Ian, and R. Sallay, “Auto-GPT,” https://news.agpt.co/, 2023.
  • [329] J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22.
  • [330] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
  • [331] C. Qian, X. Cong, C. Yang, W. Chen, Y. Su, J. Xu, Z. Liu, and M. Sun, “Communicative agents for software development,” arXiv preprint arXiv:2307.07924, 2023.