Near-Optimal Stochastic Bin-Packing in Large Service Systems with Time-Varying Item Sizes
Abstract.
In modern computing systems, jobs’ resource requirements often vary over time. Accounting for this temporal variability during job scheduling is essential for meeting performance goals. However, theoretical understanding of how to schedule jobs with time-varying resource requirements is limited. Motivated by this gap, we propose a new setting of the stochastic bin-packing problem in service systems that allows for time-varying job resource requirements, also referred to as ‘item sizes’ in traditional bin-packing terms. In this setting, a job or ‘item’ must be dispatched to a server or ‘bin’ upon arrival. Its resource requirement may vary over time while in service, following a Markovian assumption. Once the job’s service is complete, it departs from the system. Our goal is to minimize the expected number of active servers, or ‘non-empty bins’, in steady state.
Under our problem formulation, we develop a job dispatch policy, named Join-Requesting-Server (JRS). Broadly, JRS lets each server independently evaluate its current job configuration and decide whether to accept additional jobs, balancing the competing objectives of maximizing throughput and minimizing the risk of resource capacity overruns. The JRS dispatcher then utilizes these individual evaluations to decide which server to dispatch each arriving job to. The theoretical performance guarantee of JRS is in the asymptotic regime where the job arrival rate scales up linearly with a scaling factor . We show that JRS achieves an additive optimality gap of in the objective value, where the optimal objective value is . When specialized to constant job resource requirements, our result improves upon the state-of-the-art optimality gap. Our technical approach highlights a novel policy conversion framework that reduces the policy design problem to a single-server problem.
1. Introduction
1.1. Background and Motivation
In modern computing systems, a job often takes the form of a virtual machine (VM) or a container (Cloud, 2023b; Foundation, 2023a). Such a job comes with a resource requirement, such as a certain number of CPUs and amount of memory, while in service. Each server in the system offers a limited amount of these resources. When a job arrives at the system, the job dispatch policy needs to decide which server the job should be assigned to, given the job’s resource requirement and servers’ current job configurations. This job scheduling problem can be approached as a Stochastic Bin-Packing (SBP) problem, where jobs are viewed as items, job resource requirements as item sizes, and servers as bins. A traditional SBP setting considers a finite set of jobs that arrive online but do not depart from the system. The objective is to minimize the number of servers that have jobs on them, or ‘non-empty bins’, subject to the resource capacities of the servers. SBP, with a rich history in operations research and theoretical computer science (Courcoubetis and Weber, 1986; Courcobetis and Weber, 1990; Csirik et al., 2006), is a field of continuous developments and advancements (Gupta and Radovanović, 2020; Freund and Banerjee, 2019; Ayyadevara et al., 2022).
To incorporate job departures into the problem formulation, a setting referred to as stochastic bin-packing in service systems has been proposed recently (Stolyar, 2013; Stolyar and Zhong, 2013, 2015; Stolyar, 2017; Stolyar and Zhong, 2021; Ghaderi et al., 2014). In this setting, jobs not only arrive but also depart over time. More specifically, jobs are assumed to arrive according to Poisson processes, and each job is assumed to stay in the system for an exponentially distributed service time. The service time of a job remains unknown until the job departs. Before delving further into SBP in service systems, it is worth mentioning that there is a parallel thread of research on the so-called dynamic bin-packing problem that also handles job departures (see, e.g., (Coffman et al., 1983; Li et al., 2014; Buchbinder et al., 2021), and references therein), but it is primarily from a worst-case analysis perspective. Additionally, the virtual machine scheduling problem with objectives different from minimizing the number of active servers has also been widely studied (see, e.g., (Maguluri et al., 2012; Maguluri and Srikant, 2013; Maguluri et al., 2014; Xie et al., 2015; Psychas and Ghaderi, 2018, 2019, 2021, 2022)).
For SBP in service systems, the goal is to design a job dispatch policy that minimizes the expected number of active servers in steady state, denoted as . The optimality gap of a policy is defined as , where is the optimal policy. Since SBP in service systems aims to model today’s large-scale computing systems, the optimality gap of a policy is usually studied in the regime where the total job arrival rate becomes large. As we scale up the total job arrival rate linearly with a scaling factor, the optimal value can be shown to be .¹ Therefore, we say a policy is asymptotically optimal if its optimality gap is .
¹We use the standard Bachmann–Landau notation. Consider two functions and , where is positive for large enough . Then if ; if ; if and .
The optimality gap for SBP in service systems has been characterized in the line of work (Stolyar, 2013; Stolyar and Zhong, 2013, 2015; Stolyar, 2017; Stolyar and Zhong, 2021). In particular, Stolyar (2013), Stolyar and Zhong (2013) propose greedy policies that are asymptotically optimal, but the scheduler that executes these policies needs to know detailed state information, which is in a high-dimensional space. Stolyar and Zhong (2015, 2021) later develop policies that use much less state information and achieve (with an arbitrarily small constant) and optimality gaps, respectively.
While prior work on SBP in service systems has provided substantial insights into scheduling virtual-machine-type jobs, it primarily focuses on job resource requirements that remain fixed over time. However, in modern computing systems, jobs’ resource requirements often vary over time (Reiss et al., 2012; Delimitrou and Kozyrakis, 2014; Lo et al., 2015; Tirmazi et al., 2020; Rzadca et al., 2020; Bashir et al., 2021). For example, when a job involves providing user-facing services, the instantaneous requirement on CPUs and memory depends on the service demand, which is subject to fluctuation over time (Delimitrou and Kozyrakis, 2014; Lo et al., 2015). Time-varying job resource requirements pose significant challenges in optimizing system efficiency, particularly when aiming to minimize the number of active servers, thereby improving server utilization. It is pertinent to note that low utilization has been recognized as a significant obstacle to the continued scaling of today’s computing systems.
Motivated by this gap, in this paper, we propose a new setting of stochastic bin-packing in service systems that allows job resource requirements, or ‘item sizes’, to vary over time.
1.2. Problem Formulation: A Simplified Version


We first describe our job model that features time-varying resource requirements. For ease of exposition, here we present a simplified setting where each job in service can be in one of two phases, and , associated with low and high resource requirements, respectively. Our full model, presented in Section 2, allows more than two phases and more than one type of resource. To model the temporal variation in the resource requirement, we assume that each job transitions between the two phases while in service until it is completed, following the continuous-time Markov chain illustrated in Figure 1(a). We use an absorbing state to denote that the job is completed. A job can start in either phase or phase , and it is referred to as a type or type job, respectively. Note that the setting where a job’s resource requirement does not vary over time is a special case of our job model in which the transition rates between phases are zero.
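To make the job model concrete, the short sketch below simulates one job's phase trajectory as a continuous-time Markov chain with an absorbing completion state. The phase labels and all rate values are illustrative placeholders of our own choosing, not parameters taken from the paper.

```python
import random

# Placeholder transition rates for the two-phase job model (Figure 1(a)).
# "done" is the absorbing state that marks job completion.
RATES = {
    "L": [("H", 0.5), ("done", 1.0)],   # from the low-requirement phase
    "H": [("L", 0.8), ("done", 0.7)],   # from the high-requirement phase
}

def simulate_job(initial_phase, rng):
    """Sample one job's phase trajectory until it completes."""
    t, phase, path = 0.0, initial_phase, []
    while phase != "done":
        transitions = RATES[phase]
        total_rate = sum(rate for _, rate in transitions)
        t += rng.expovariate(total_rate)             # exponential holding time
        u, acc = rng.random() * total_rate, 0.0      # pick the next state w.p. proportional to its rate
        for next_state, rate in transitions:
            acc += rate
            if u <= acc:
                phase = next_state
                break
        path.append((t, phase))
    return path

print(simulate_job("L", random.Random(0)))   # a type-L job starts in the low phase
```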
We consider a system with an infinite number of identical servers, illustrated in Figure 1(b). As in existing work on SBP in service systems, we assume that jobs arrive according to Poisson processes. In particular, we assume that the two types of jobs arrive at the system following two independent Poisson processes, with rates and , respectively; i.e., the interarrival times of type and type jobs are i.i.d. following exponential distributions with means and , respectively. Upon arrival, a job needs to be dispatched to a server according to a dispatch policy, and the job enters service immediately. The goal is to design a policy to minimize the expected number of active servers (servers currently serving a positive number of jobs) in steady state, denoted as .
As job resource requirements vary over time, situations can arise where the total job resource requirement on a server exceeds the server’s resource capacity, resulting in resource contention. Modern computing systems can tolerate temporary overruns of resource capacity, though they often incur performance degradation or other costs (Cloud, 2023a; Foundation, 2023b). In our model, we incorporate a rate at which cost accumulates due to resource contention. We first represent the state of a server by its configuration, a vector where and are the numbers of jobs in phase and phase , respectively. Then a cost rate function maps a server’s configuration to a rate of cost. For example, the cost rate can be proportional to how much the total resource requirement of the jobs on the server exceeds this server’s resource capacity. A more general definition of is given in Section 2. We assume that resource contention neither affects the transition rates in the job model nor prompts jobs to be terminated, which is suitable for application scenarios where the contention level is low and manageable. Let denote the average expected cost rate per server.
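As a concrete instance of one admissible cost rate function in the two-phase setting, the sketch below charges cost at a rate equal to the amount by which the total resource requirement on a server exceeds its capacity, as in the example above. The per-phase requirements and the capacity are made-up numbers for illustration only.

```python
# Hypothetical per-phase resource requirements (e.g., CPUs) and server capacity.
REQ_LOW, REQ_HIGH = 1.0, 4.0
CAPACITY = 16.0

def cost_rate(config):
    """config = (n_low, n_high): numbers of jobs in the low and high phases.
    Cost accrues at a rate proportional to the capacity overrun (zero if within capacity)."""
    n_low, n_high = config
    total_requirement = n_low * REQ_LOW + n_high * REQ_HIGH
    return max(0.0, total_requirement - CAPACITY)

print(cost_rate((4, 2)))   # 4*1 + 2*4 = 12 <= 16, so the cost rate is 0.0
print(cost_rate((4, 4)))   # 4*1 + 4*4 = 20 > 16, so the cost rate is 4.0
```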
Now our bin-packing problem can be formulated as follows:
(1)    minimize over dispatch policies:    the expected number of active servers in steady state
       subject to:    the average expected cost rate per server is within the budget
where is a budget for the cost rate of resource contention. We are interested in solving this problem in the asymptotic regime where the arrival rates scale to infinity (Stolyar and Zhong, 2013, 2015; Stolyar, 2017; Stolyar and Zhong, 2021), motivated by the ever-increasing computing demand that drives today’s computing systems to be large-scale. Specifically, we assume for some fixed coefficients and and a scaling factor , and we study the asymptotic regime where increases.
1.3. Main Result
Our main result is an asymptotically optimal policy, named Join-Requesting-Server (JRS), for this new setting of SBP in service systems with time-varying job resource requirements. The asymptotic optimality is in the sense that under our proposed policy JRS, the expected number of active servers is at most times the optimal objective value of the optimization problem in (1), while the cost rate incurred is at most (i.e., exceeding the budget by at most a diminishing fraction). This asymptotic optimality result translates into an additive optimality gap of in the objective value (expected number of active servers), since the optimal objective value can be shown to be . This main result is formally presented in Theorem 1.
Our model can be specialized to the traditional setting of SBP in service systems where jobs’ resource requirements remain fixed over time. For this specialization, we replace the constraint in the problem formulation (1) with a capacity constraint, which requires the total resource requirement by jobs on a server to be within the server’s resource capacity. Our proposed policy JRS can then be adapted into one that has an optimality gap in the objective value, which improves upon the state-of-the-art optimality gap. A discussion on the implementation complexity of JRS and how it compares with existing policies for the traditional setting of SBP in service systems is provided in Section 4.3. To be clear, this setting is not a strict special case of the formulation in (1) because is required there, but our approach and proof carry over.
From a technical approach perspective, our contribution is a novel approach that decomposes the policy design into two steps: defining a single-server sub-problem, and then converting the solution of the sub-problem into a policy in the original problem. This decomposition not only reduces the complexity of policy design but also makes the analysis tractable. We provide an overview of this approach in Section 1.4.
1.4. Approach Overview
To motivate our approach, we ask two questions:
How should we design a good dispatch policy for this system?
How can we prove that a dispatch policy is asymptotically optimal?
Before presenting our answers to these two questions, we first comment on why they are challenging to answer. On the one hand, solving this problem directly via dynamic programming is intractable due to the unbounded state space resulting from the infinitely many servers. Even if we restrict ourselves to the servers that are active, the state space is still prohibitively large. On the other hand, we can consider designing a heuristic policy. However, unlike traditional SBP problems where we can simply seek to pack the servers as compactly as possible, here the question of how many jobs should be put on a server is complicated. The complexity comes from the time-varying resource requirements of jobs, which affect future resource contention.
A policy-conversion framework.
We answer the two questions at the same time with a novel policy-conversion framework. The framework has two steps:
(1) Define the single-server problem, an easy-to-solve, low-dimensional subproblem;
(2) Convert the optimal policy of the single-server problem into a policy for the original problem.
This framework allows us to break down the complicated policy design problem into two components. In defining the single-server problem in Step 1, our goal is to quantify the throughput of each server under the resource contention constraint; in the policy conversion in Step 2, our goal is to dispatch jobs optimally based on each single server’s characteristics. As we will show, a careful construction of the single-server problem and the conversion procedure naturally leads to a policy for the original system and a proof of its asymptotic optimality. Below we give a quick overview of how we carry out these two steps and the motivation for the design choices.
Single-server problem.
To define the single-server problem, consider the following setting: suppose that our goal is to maximize the throughput of one specific server while keeping its expected cost rate of resource contention below ; how, then, should we send jobs to this server? Observe that even though we want to send jobs to the server as frequently as possible, the frequency is fundamentally limited by how fast the server is able to serve jobs and how many jobs can be packed on the server. This motivates us to consider the single-server system illustrated in Figure 2. The system has one server and an infinite supply of jobs of all types, so the server can start the service of any number of new jobs of any type at any time. We say the server requests a job from the infinite supply whenever it starts serving a new job. We assume the same job model and cost model as in the original infinite-server system. The single-server problem aims to find a job-requesting policy that maximizes the throughput (the number of jobs of each type served per unit time) along the direction of the arrival rate vector , while maintaining the steady-state expected cost rate of resource contention below .

How is the single-server problem related to the original problem?
Let be the number such that the total throughput of single-server systems under the optimal job-requesting policy is equal to (assuming is an integer for simplicity). Consider the following policy in the original system: let each of the first servers in the original system adopt the optimal job-requesting policy and send requests to the dispatcher based on its current configuration. If the requested jobs were to arrive as soon as the dispatcher received the requests, the dispatcher would be able to fulfill the requests immediately. In this case, the first servers in the original system would have the same dynamics as i.i.d. single-server systems, achieving the largest possible throughput and satisfying the constraint on resource contention. So the original system would have achieved the optimal number of active servers.
However, in the actual model, the dispatcher cannot immediately fulfill a job request because jobs arrive stochastically over time. Nevertheless, the dispatcher can still find a suitable way to match each job arrival with the requests. To see this, note that the time points when the dispatcher receives type requests result from the superposition of independent point processes, each with the average rate . As , the instantaneous rates of requesting type jobs concentrate around the arrival rate for each . As a result, most job requests can be fulfilled within a diminishing delay, so most servers in the original system can closely track the optimal single-server dynamics.
A meta-policy, Join-Requesting-Server (JRS), and its asymptotic optimality.
Based on the single-server problem and the idea of tracking the optimal single-server dynamics, we propose a meta-policy, Join-Requesting-Server (JRS), which converts a single-server policy to a dispatch policy in the original infinite-server system. We say that JRS takes as a subroutine. The full definition of JRS is given in Section 4, along with discussions on various practical considerations in its implementation.
We show that the asymptotic performance of JRS is related to its subroutine in the sense described in Theorem 3, which we refer to as the conversion theorem. In particular, JRS with the optimal single-server policy (which we refer to as Single-OPT) as the subroutine is asymptotically optimal for the original infinite-server problem as .
In order to track the optimal single-server dynamics, JRS uses a more sophisticated mechanism to control the long-term consequences of missing job requests or fulfilling requests with delays. The mechanism involves the auxiliary variables of tokens and virtual jobs, which regularize the process of generating requests and matching arrivals with requests. These auxiliary variables play a crucial role in the proof of Theorem 3 in Section 5, where a novel Stein’s method argument is carried out. We discuss the role of tokens and virtual jobs and their necessity at the end of Section 4 and in Section 5.5.
Finally, we comment that our policy conversion framework is potentially applicable to other systems with similar structures. Specifically, we can try to define a suitable single-server problem, solve for its optimal policy, and convert the optimal single-server policy into a policy for the original problem. A similar conversion theorem should hold as long as the servers are weakly coupled in some sense. See Section 6 for a discussion of such systems.
Relation to the mean-field approach.
We remark that the mean-field approach often studies the empirical distribution of configurations on all servers (Stolyar, 2013; Stolyar and Zhong, 2013, 2015; Stolyar, 2017; Stolyar and Zhong, 2021; Ghaderi et al., 2014), which can be viewed as a probability distribution of a single server’s configuration. However, the mean-field approach is typically used to analyze this empirical distribution under a given policy for the original system. In contrast, our approach solves the single-server problem to design a single-server policy, and then converts it to a policy in the original system with a performance guarantee.
1.5. Paper organization
In Section 2, we present the general problem formulation, which generalizes the simplified version in this section. In Section 3, we give a more detailed overview of our main result and approach, with a short proof of Theorem 1 (main result) based on Theorems 2–4 at the end of the section. Section 4 provides a detailed description of our meta-policy, Join-Requesting-Server (JRS), along with discussions on practical considerations in its implementation. In Section 5, we prove the performance guarantee of JRS (Theorem 3) under an irreducibility assumption. The proof for the general case and other proofs are deferred to the appendices. We conclude the paper and discuss some future directions in Section 6.
2. Problem Formulation
Job Model.
As described in Section 1, we consider a job model where each job in service can be in one of multiple phases, each phase associated with a different resource requirement. Here the resource requirement can be a multi-dimensional vector, with each coordinate specifying the requirement of one type of resource. To model the temporal variation in the resource requirement, we assume that each job transitions between phases while in service until it is completed. The phase transition process is described by a continuous-time Markov chain on the state space , where is the set of phases and is the absorbing state that denotes the completion of the job. We call a transition between two phases in an internal transition, and let denote the transition rate from phase to phase ; the departure of a job then corresponds to a transition from a phase to , whose transition rate is denoted as . The phase transitions of different jobs are assumed to be independent of each other.
We classify a job as a type job if it starts from phase when entering service. Jobs of each type arrive at the system according to an independent Poisson process with rate .
Server Model.
We consider an infinite-server system with identical servers. As soon as a job arrives at the system, the job needs to be dispatched to a server to start service immediately. Note that this is always feasible because there are infinitely many servers in the system. We assume that jobs cannot be preempted or migrated. To describe the state of a server, we define the configuration of a server as an -dimensional vector , whose -th entry is the number of jobs in phase on the server. Each server has a limit on the total number of jobs in service at the same time. This limit is denoted as and referred to as the service limit. Then the set of feasible server configurations is .
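For intuition about the size of the configuration space, here is a minimal sketch (with parameter names of our own choosing) that enumerates the feasible configurations for a given number of phases and a given service limit.

```python
from itertools import product

def feasible_configurations(num_phases, service_limit):
    """All vectors k with nonnegative integer entries, one per phase,
    whose entries sum to at most the service limit."""
    return [k for k in product(range(service_limit + 1), repeat=num_phases)
            if sum(k) <= service_limit]

# With 2 phases and a service limit of 3 there are 10 feasible configurations,
# including the empty configuration (0, 0).
print(len(feasible_configurations(2, 3)))   # 10
```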
System Dynamics.
The system state can be represented by the concatenation of the configurations of all servers. Specifically, we index the servers by positive integer numbers and denote the configuration of server at time as . Then the state of the entire system can be represented by the infinite vector .
Suppose that the system is in state . Let be an -dimensional vector whose -th entry is and all other entries are . Then the following state transitions can happen:
• , : a type job arrives and is dispatched to server ;
• , : a job on server transitions from phase to phase ;
• , : a job on server departs the system from phase .
The specifics of the system dynamics depend on the employed dispatch policy that decides which server to dispatch to when a job arrives.
Active Servers.
We are interested in the number of active servers, i.e., servers currently serving a positive number of jobs. Note that given the arrival rates of jobs, the smaller the number of active servers, the better the system is utilized. Let be the number of servers in configuration at time . Then the number of active servers can be written as , where is the zero vector.
Cost of Resource Contention.
Recall that the cost rate function maps a server’s configuration to a rate of cost. We assume that is any function that is -Lipschitz continuous with respect to the distance for some constant and satisfies .
Performance Goal.
Our high-level goal is to design dispatch policies that minimize the number of active servers while keeping the cost rate of resource contention within a certain budget. Specifically, we consider policies that are allowed to be randomized and non-Markovian (i.e., the policies can make history-dependent decisions). We further focus on policies that induce a unique stationary distribution on the configuration process , assuming that the configuration process is embedded in a Markov chain that has a unique stationary distribution. We are interested in such policies because the resulting time averages of quantities related to the configurations are equal to the corresponding expectations under the unique stationary distribution regardless of the initial state. Let be a policy of interest, be a random element that follows the stationary distribution of the system state induced by , and be the corresponding number of servers in configuration in steady state under . Then the expected number of active servers is given by
We define the expected cost rate per expected active server as
Note that if , we have . Now our goal can be formulated as the following optimization problem, referred to as problem :
(2)    minimize over dispatch policies:    the expected number of active servers in steady state
       subject to:    the expected cost rate per expected active server is within the budget
where is a budget for the cost rate of resource contention.
Asymptotic Optimality.
We focus on the asymptotic regime where for all , the arrival rate is given by for some constant coefficient and a positive scaling factor . To define asymptotic optimality, we first define the following notion of approximation to the optimization problem in (2): a policy is said to be -optimal if and , where is the optimal objective value in (2). Now consider a family of policies indexed by the scaling factor . We say that the policy is asymptotically optimal if it is -optimal to the optimization problem with as . We will suppress the superscript for simplicity when it is clear from the context.
We note that under any policy , . This can be proven using the renowned Little’s Law (Kleinrock, 1975) in the following way. The total job arrival rate is and the expected time that a job spends in the system is . So by Little’s Law, the expected number of jobs in the system in steady state is . Since each server can accommodate a constant number of jobs, the expected number of active servers is . Given this, -optimality implies an optimality gap of in the objective value.
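Written out in generic notation (we use $r$ for the scaling factor and $\kappa$ for the per-server service limit purely for illustration; these symbols are ours), the Little's Law argument reads:
\[
\mathbb{E}[\text{number of jobs in system}] \;=\; (\text{total arrival rate}) \times \mathbb{E}[\text{time a job spends in system}] \;=\; \Theta(r)\cdot\Theta(1) \;=\; \Theta(r),
\]
\[
\mathbb{E}[\text{number of active servers}] \;\ge\; \tfrac{1}{\kappa}\,\mathbb{E}[\text{number of jobs in system}] \;=\; \Omega(r).
\]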
3. Main Result and Our Approach
3.1. Main Result
Our main result, Theorem 1, is the asymptotic optimality of our proposed policy Join-Requesting-Server (JRS), with a subroutine we call Single-OPT, as briefly discussed in Section 1. This asymptotic optimality result implies an optimality gap in the expected number of active servers. We defer the detailed descriptions of JRS and Single-OPT to Section 4 and Appendix C. Theorem 1 follows immediately from Theorems 2–4 to be introduced in Section 3.2; a short proof is included at the end of Section 3.2 for clarity.
Theorem 1 (Asymptotic Optimality).
Consider a stochastic bin-packing problem in service systems with time-varying job resource requirements. Let the arrival rates be and the cost rate budget be . Then the policy Join-Requesting-Server (JRS) with the subroutine Single-OPT is -optimal. That is, the expected number of active servers under JRS with Single-OPT is at most times the optimal value of the problem , while the cost rate incurred is at most .
Specialization to Non-Time-Varying Resource Requirements.
As mentioned in Section 1, we can specialize this result to the setting where the resource requirement of a job does not vary over time. To do that, we remove the cost constraint in , and redefine the set of feasible server configurations, , to also incorporate hard capacity constraints for each type of resources. The rest of the analysis is almost identical to that of the analysis for time-varying resource requirements; we omit the details due to the space limit. This specialization results in a policy that is -optimal in the expected number of active servers.
3.2. Our Approach
In a nutshell, our approach is to reduce the original optimization problem in an infinite-server system to an optimization problem in a single-server system, which is defined below.
A Single-Server System.
Consider a single-server system serving jobs with time-varying resource requirements. The system has an infinite supply of jobs of all types. As a result, the server can request any number of new jobs of any type at any time. Once a job is requested, it immediately enters service.
We represent the server configuration at time using a vector , whose -th entry denotes the number of jobs in phase . We assume that the single-server system has the same service limit and cost rate function as a server in the original infinite-server system. Therefore, the server configuration is also in the set , and the cost rate at time is .
A single-server policy determines when and how many jobs of each type to request. We allow the single-server policy to be randomized and assume it is Markovian, i.e., it makes decisions only based on the current configuration. Note that allowing non-Markovian policies will not change the optimal value of the single-server problem that we will consider (see Appendix C). Let be a stationary distribution of the server configuration under the policy , and let be a random variable with the distribution . When we consider a policy and its stationary distribution , we assume that the system is initialized from . The policy together with defines the request rate of type jobs , which is the expected number of type jobs requested per unit time in steady state. Note that is the throughput of type jobs since the system has a finite state space.
We consider the following single-server problem, denoted as :
(3)    minimize over single-server policies and the number of copies:    the number of copies of the single-server system
       subject to:    the number of copies times the request rate of each job type matches the corresponding arrival rate;
                      the steady-state expected cost rate of resource contention is within the budget
The single-server problem can be interpreted as follows. We can think of as the number of copies of the single-server system under needed to support the arrival rates in the infinite-server system. Minimizing it is then equivalent to maximizing the throughput of each single-server system, while maintaining their proportions as .
We remark that for the problem , we only need to consider policies that do not depend on the scaling factor . To see this, we can replace the decision variable with and the optimization problem can be equivalently formulated as follows, which does not involve :
(4)    maximize over single-server policies:    the throughput along the direction of the arrival rate vector
       subject to:    the steady-state expected cost rate of resource contention is within the budget
Lower Bound.
The single-server problem gives a lower bound to the original problem in (2) as stated in the following theorem. The proof is given in Appendix A.
Theorem 2 (Lower Bound).
Consider a stochastic bin-packing problem in service systems with time-varying job resource requirements. Let the arrival rates be and the cost rate budget be . Let be the optimal value of the original infinite-server problem in (2), and let be the optimal value of the single-server problem , then .
Converting From the Single-Server System to the Infinite-Server System.
Having established a lower bound on the infinite-server problem in terms of the optimal value of the single-server problem , next we focus on finding an asymptotically optimal policy. We will characterize the performance guarantee of a class of policies and then show that the best policy within the class is asymptotically optimal. Specifically, we consider a meta-policy called Join-Requesting-Server (JRS), which converts a Markovian single-server policy into an infinite-server policy. We call the policy resulting from the conversion a JRS policy with a subroutine . Through analyzing the meta-policy JRS, we show that the performance of each JRS policy can be characterized by the performance of its subroutine, as stated in Theorem 3 below. The proof of Theorem 3 under an irreducibility assumption is given in Section 5, and the proof for the full version is given in Appendix B.3.
Theorem 3 (Conversion Theorem).
Consider a stochastic bin-packing problem in service systems with time-varying job resource requirements. Let the arrival rates be and the cost rate budget be . Let be a feasible solution to the single-server problem . In addition, we assume that the policy is Markovian. Let the infinite-server policy be JRS with a subroutine . Then under , we have
(5)
(6)
As a result,
(7)
(8)
Optimal Single-Server Policy.
Theorem 3 together with the lower bound in Theorem 2 reduces the infinite-server problem in (2) to the single-server problem in (3). We can obtain the optimal single-server policy, Single-OPT, by solving a linear program, as stated in the theorem below.
Theorem 4 (Optimality of Single-OPT, Informal).
There exists a linear program that is equivalent to the single-server problem . In particular, we can construct an optimal Markovian policy for from the optimal solution of .
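The paper's LP is given as (61) in Appendix C.1. As a rough sketch of what such a reduction typically looks like, a uniformized, discrete-time analogue of the single-server problem can be written as an occupation-measure linear program over configuration–action pairs. The notation below ($y$, $P$, $c$, $n_i$, $\rho_i$, $\beta$, $t$) is ours, and the details may well differ from the paper's construction:
\[
\begin{aligned}
\max_{y \ge 0,\; t} \quad & t \\
\text{s.t.} \quad & \sum_{a} y(k', a) \;=\; \sum_{k, a} y(k, a)\, P(k' \mid k, a) \quad \forall\, k' && \text{(stationarity of the occupation measure)}\\
& \sum_{k, a} y(k, a) \;=\; 1 && \text{(normalization)}\\
& \sum_{k, a} y(k, a)\, c(k) \;\le\; \beta && \text{(cost-rate budget)}\\
& \sum_{k, a} y(k, a)\, n_i(a) \;=\; t\, \rho_i \quad \forall\, i && \text{(throughput along the arrival-rate direction)}
\end{aligned}
\]
Here $y(k,a)$ is the long-run fraction of time spent in configuration $k$ while taking requesting action $a$, $n_i(a)$ is the number of type-$i$ jobs that action $a$ requests, and a randomized Markovian policy can be read off as $\Pr(a \mid k) = y(k,a) / \sum_{a'} y(k,a')$ wherever the denominator is positive.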
4. Proposed meta-policy: Join-Requesting-Server (JRS)
In this section, we describe our meta-policy, Join-Requesting-Server (JRS), in full detail. For ease of presentation, we focus on the case where the subroutine policy for JRS is -irreducible, i.e., under , there exists a configuration such that the single-server system can return to from any other configuration (which is equivalent to assuming that the configuration process of the single-server system under policy forms a unichain). The algorithm for the general case is given in Section B.3.
4.1. How the Single-Server Policy Requests Jobs
Before going into the definition of JRS, we first take a closer look at how the Markovian single-server policy requests jobs, to avoid potential ambiguity caused by the fact that a single-server policy can request jobs at any time. Let denote the number of type jobs requested, and let . We say is feasible if the total number of jobs on the server does not exceed after adding the jobs. The policy performs one of the following two types of requests based on the current configuration.
• Reactive requests. A reactive request is triggered by either an internal transition or a departure. The change in the configuration when a reactive request is made can be represented by the diagram
where is due to the internal transition or departure, and happens since the policy immediately requests jobs. The policy specifies a probability distribution over all feasible when it decides to perform reactive requests for the configuration .
• Proactive requests. A proactive request is not triggered by any transition of the jobs in service; it happens on its own at a finite rate that depends on the current configuration of the server. The change of the configuration when a proactive request happens can be represented by the diagram
More specifically, suppose the policy decides to perform proactive requests for a configuration . Then for each feasible , the policy specifies a rate and runs a timer with an exponentially distributed duration with the specified rate. When a timer expires, the corresponding jobs are requested. When the configuration changes, all the timers are canceled and restarted with new rates based on the new configuration. (A small sketch of this timer mechanism is given right after this list.)
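The sketch below illustrates the timer mechanism for proactive requests. It relies on the standard equivalence between running one exponential timer per feasible request vector and waiting an exponential time with the total rate and then picking a request vector with probability proportional to its rate; the data structures and the rate values in the example are our own illustrative choices.

```python
import random

def next_proactive_request(proactive_rates, rng):
    """proactive_rates: dict mapping a feasible request vector (a tuple) to its rate
    in the current configuration. Returns the request vector that fires first and the
    waiting time until it fires. If the configuration changes before that time has
    elapsed, the result is discarded and this function is called again with new rates."""
    total = sum(proactive_rates.values())
    if total == 0.0:
        return None, float("inf")          # no proactive requests from this configuration
    wait = rng.expovariate(total)          # time until the earliest timer expires
    u, acc = rng.random() * total, 0.0     # pick a request vector proportionally to its rate
    for request, rate in proactive_rates.items():
        acc += rate
        if u <= acc:
            return request, wait
    return request, wait                    # guard against floating-point round-off

rng = random.Random(1)
# e.g., request one type-1 job at rate 0.3 or one type-2 job at rate 0.1 (made-up numbers)
print(next_proactive_request({(1, 0): 0.3, (0, 1): 0.1}, rng))
```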
4.2. Description of Join-Requesting-Server (JRS)
The inputs of JRS include: (i) the single-server policy , (ii) the objective value of in the single-server problem (3), denoted as , and (iii) the transition rates in the job model.
We first divide the infinite server pool into two sets based on the server index . Let . We call servers with index normal servers; we call servers with index backup servers. The normal servers are responsible for serving most of the jobs, while the backup servers are activated only to handle overflow jobs (jobs that are not dispatched to normal servers).
The JRS policy is specified in the following two steps; a schematic sketch of the resulting dispatcher logic is given right after the description.
• Step 1 (Job Requesting on a Normal Server): We let each normal server request jobs using its subroutine, the single-server policy . The input to the policy is what we refer to as the observed configuration of the server, which will be further explained below. When requests jobs, type tokens are generated for each to store the job requests. The server pauses the job requesting process if it already has any type of tokens, and resumes when all the tokens that it generated are removed.
• Step 2 (Arrival Dispatching):
  – Real jobs. When a type job arrives, the dispatcher chooses a type token uniformly at random, removes the token, and assigns the job to the corresponding server. When there are no type tokens, the dispatcher sends the job to an idle backup server.
  – Virtual jobs. When the total number of type tokens throughout the system exceeds the limit (called the token limit), a type virtual arrival is triggered, which causes the dispatcher to choose a type token uniformly at random, remove the token, and assign a virtual job to the corresponding server. A virtual job has the same transition dynamics as a real job but does not consume physical resources.
  – The observed configuration of a normal server in Step 1 is the configuration resulting from real and virtual jobs combined. That is, it is a vector whose -th entry represents the total number of real and virtual jobs in phase on this server. The observed configuration changes when a new real or virtual job arrival is assigned to the server, or when a real or virtual job on the server has a phase transition or departs. We update the input to the policy whenever the observed configuration changes. Whenever the observed configuration changes, the policy cancels the exponential timers in progress; but a reactive request from the policy can only be triggered when a real or virtual job on the server has a phase transition or departs.
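The sketch below illustrates the dispatcher-side token bookkeeping described in Steps 1 and 2. It is schematic rather than the paper's implementation: the class and method names are ours, the token limit is passed in as a design parameter, and the server-side pieces (maintaining observed configurations, pausing and resuming job requesting, and simulating virtual jobs) are not shown.

```python
import random
from collections import defaultdict

class JRSDispatcher:
    """Schematic token bookkeeping for JRS (names and structure are ours)."""

    def __init__(self, token_limit, idle_backup_servers, rng=None):
        self.tokens = defaultdict(list)               # job type -> server ids holding a token
        self.token_limit = token_limit                # cap on outstanding tokens per type
        self.idle_backup_servers = list(idle_backup_servers)
        self.rng = rng or random.Random(0)

    def on_token_generated(self, server_id, job_type):
        """A normal server's subroutine has requested a job of the given type."""
        self.tokens[job_type].append(server_id)
        if len(self.tokens[job_type]) > self.token_limit:
            # Too many outstanding requests: trigger a virtual arrival, i.e.,
            # convert a uniformly random token of this type into a virtual job.
            return ("assign_virtual_job", self._pop_random_token(job_type))
        return None

    def on_job_arrival(self, job_type):
        """A real job of the given type has arrived and must be dispatched."""
        if self.tokens[job_type]:
            return ("dispatch_to_normal", self._pop_random_token(job_type))
        # No outstanding request of this type: send the job to an idle backup server.
        return ("dispatch_to_backup", self.idle_backup_servers.pop())

    def _pop_random_token(self, job_type):
        idx = self.rng.randrange(len(self.tokens[job_type]))
        return self.tokens[job_type].pop(idx)
```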
Intuition behind JRS
To provide a better understanding of the main design ideas of JRS, here we give an intuitive description of how it works. Broadly, servers generate job requests and store unfulfilled requests as tokens; the dispatcher assigns jobs to servers according to the tokens to fulfill job requests. This is the mechanism for matching job arrivals with requests, which is referenced at the end of Section 1.4. However, rather than matching all tokens with job arrivals, JRS opts to convert some of the tokens into virtual jobs to keep the total number of tokens within an upper limit . By capping the number of tokens, JRS ensures that the job requests generated by each server get fulfilled quickly (either by a real job or a virtual job), and thus the observed configurations of servers maintain proximity to i.i.d. copies of the single-server systems.
The choice of the token limit balances two key considerations. On the one hand, a smaller brings the observed configurations closer to i.i.d. copies of single-server systems. On the other hand, if is overly small, the rate of generating virtual jobs becomes high and the probability for a job arrival to see no tokens is also high. As a result, the observed configurations, which include both real and virtual jobs, deviate from the real-job configurations. A more in-depth discussion on the role of tokens and virtual jobs and whether they are fundamental is in Section 5.5.
4.3. Practical considerations in implementing Join-Requesting-Server (JRS)
Computational complexity of JRS
The computational complexity of JRS consists of two components: the offline component that computes a single-server policy and its objective value , and the online component that carries out the two steps of JRS.
The offline component reduces to solving the linear program given in (61) in Appendix C.1, whose number of optimization variables is linear in the number of feasible configurations on a single server times the number of phases, i.e., . Admittedly, can be large when a single server has a large quantity of resources and there are many job phases. However, we opt for the view that a single server is not excessively large and that the system’s scale is primarily captured by the scaling factor . Therefore, it is advantageous that the computational complexity of this offline component is independent of .
In the online component, the bulk of the computation is in job requesting and virtual job simulation, which can be executed in a distributed manner on the normal servers. Specifically, each normal server monitors its observed configuration and generates tokens according to the single-server policy ; additionally, when a virtual job is assigned to the server, the server simulates the dynamics of the virtual job, i.e., generates random variables corresponding to phase transitions and job departures. Backup servers do not need to perform any computation beyond serving jobs.
The scheduler, which stores all the tokens, has two responsibilities in the online component: (i) the scheduler matches each newly arrived job to a token of the same type, chosen uniformly at random, or sends the job to a backup server when there are no tokens of the same type; (ii) the scheduler monitors the number of tokens of each type and assigns virtual jobs when the number of tokens exceeds the limit .
It is informative to compare the computational complexity of JRS with existing algorithms designed for the traditional setting of stochastic bin-packing in service systems, where the resource requirements are non-time-varying (Stolyar, 2013; Stolyar and Zhong, 2013; Ghaderi et al., 2014; Stolyar and Zhong, 2015; Stolyar, 2017; Stolyar and Zhong, 2021). At a high level, these existing algorithms function as follows: upon the arrival of a job, the scheduler checks the current configurations of all servers and assigns the job to a server whose configuration optimizes certain predefined criteria. Among these, the GRAND algorithm (Stolyar and Zhong, 2015; Stolyar, 2017; Stolyar and Zhong, 2021) stands out for its simplicity and asymptotic optimality. Under GRAND, the scheduler only needs to identify configurations that can accommodate the incoming job and then randomly assigns the job to one of these feasible servers, along with some idle servers. Compared to JRS, GRAND does not have an offline planning component, and individual servers do not perform computation beyond serving jobs. The scheduler’s role in GRAND is slightly more complex than in JRS. Consequently, when considering using JRS in settings where job resource requirements are non-time-varying, practitioners should weigh whether the additional computational complexity is warranted.
Model parameter estimation.
A limitation of JRS is its dependency on known model parameters, including job arrival rates and phase transition rates. Such dependency is not present in existing algorithms designed for the setting with non-time-varying resource requirements. In real-world applications, the model parameters can be estimated from workload traces such as (Tirmazi et al., 2020; Wilkes, 2019). Estimation errors can impact the system’s performance, an issue that merits further in-depth investigation in future work. Here, we provide a preliminary result on the performance degradation due to parameter estimation errors. Roughly speaking, suppose that the estimation errors in the job arrival rate coefficients ’s and the phase transition rates ’s and ’s are bounded by (along with an insensitivity assumption on the single-server problem). Then if we use JRS where the single-server policy is obtained by solving for the optimal single-server policy under the estimated parameters, the resulting JRS is -optimal for any , where and are positive constants independent of . The exact statement is given in Proposition 1 in Appendix D, along with a proof.
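For concreteness, here is a minimal sketch of how phase transition rates could be estimated from a workload trace using the standard maximum-likelihood estimator for a continuous-time Markov chain (the number of observed transitions divided by the total time spent in a phase). The trace format below is an assumption made for illustration, not the format of the cited traces.

```python
from collections import defaultdict

def estimate_transition_rates(trace):
    """trace: a list of (duration_in_phase, phase, next_state) records, where
    next_state is another phase or "done". Returns MLE rate estimates."""
    time_in_phase = defaultdict(float)
    transition_counts = defaultdict(int)
    for duration, phase, next_state in trace:
        time_in_phase[phase] += duration
        transition_counts[(phase, next_state)] += 1
    return {
        (phase, nxt): count / time_in_phase[phase]
        for (phase, nxt), count in transition_counts.items()
    }

# Toy trace: 4.0 total time units spent in phase "L", with one L->H transition
# and one departure, so both estimated rates are 0.25.
print(estimate_transition_rates([(1.5, "L", "H"), (2.5, "L", "done")]))
```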
Connection to practical algorithms.
Recent progress has been made in addressing the issue of low utilization due to time-varying job resource requirements, notably within Google’s datacenters, as discussed in (Bashir et al., 2021). The approach in (Bashir et al., 2021) makes predictions on the future resource requirements of jobs, which lead to a further prediction on the future peak resource requirement on a server if a newly arrived job were to be sent to that server (assuming no future job arrivals). This prediction categorizes each server as either feasible or infeasible for the new job, and this binary outcome is subsequently used by a separate scheduler for job assignment.
Our proposed JRS policy can be viewed as giving more detailed predictions on whether it is suitable for a server to take on new jobs, represented by the tokens. The predictions are optimized by taking into account future job arrivals and the stochastic dynamics of jobs.
5. Proof of Theorem 3 (Conversion Theorem) Assuming Irreducibility
In this section, we prove Theorem 3 to establish the performance guarantee of JRS. For ease of presentation, we focus on the case where the subroutine policy is -irreducible. The proof for the general case is in Section B.3.
This section is organized as follows. We first provide some preliminaries in Section 5.1. Then we outline the steps and lemmas needed for the proof in Section 5.2. In Section 5.3, we prove Theorem 3 based on the lemmas. In Section 5.4, we prove one of the lemmas, Lemma 2, where we devise a novel approach to employ Stein’s method. Finally, in Section 5.5, we discuss the role of tokens and virtual jobs and their necessity from a proof perspective. The proofs of the rest of the lemmas presented in this section are given in Appendix B.
5.1. Preliminaries
Consider an infinite-server system under the JRS policy. For each normal server , we describe its status at time using the following variables: configuration of real jobs (referred to simply as configuration in previous sections), tokens , configuration of virtual jobs , and observed configuration . We use the superscript “” to refer to a certain descriptor of all normal servers, for example, . The system under JRS is a Markov chain with a unique Markovian representation . The following lemma shows that the system has a unique stationary distribution (the proof is provided in Section B.1).
Lemma 1 (Unique Stationary Distribution).
Consider an infinite-server system under the JRS policy with as its subroutine, where is a single-server policy that is Markovian and -irreducible. Then the state of the system has a unique stationary distribution.
Let be the configuration vector of i.i.d. copies of the single-server system under . As discussed in Section 4.2, we will show that can be approximated by . In the remainder of this section, we omit the steady-state symbol for simplicity.
To rigorously discuss the approximation of the steady-state random variables, we define some metrics. Recall that is the set of feasible single-server configurations. Let be the set of feasible configurations for all normal servers. We use to denote the norm in both space and space :
For any two random variables , their closeness will be measured in terms of Wasserstein distance as follows:
where the supremum is taken over all Lipschitz-1 functions from to .
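In generic symbols of our own, this is the Kantorovich–Rubinstein dual form of the Wasserstein-1 distance:
\[
W(X, Y) \;=\; \sup_{h\,:\,\|h\|_{\mathrm{Lip}} \le 1} \big|\, \mathbb{E}[h(X)] - \mathbb{E}[h(Y)] \,\big|,
\]
where the Lipschitz constant is measured with respect to the norm defined above.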
5.2. Steps and Lemmas Needed for the Proof of Theorem 3 Assuming Irreducibility
Our goal is to show that the steady-state distribution of the normal servers’ real-job configurations is close to the steady-state distribution of i.i.d. copies of the single-server systems in Wasserstein distance, and that the backup servers are almost empty as the arrival rate gets large. More formally, we aim to show that and as . These two bounds provide the performance guarantee claimed in Theorem 3.
Instead of directly looking into the distribution of real-job configuration , we show that the distribution of each of the three sums, , , and , can be approximated by the distribution of in Wasserstein distance. The approximation result for each sum helps us derive the approximation result for the sum with one fewer term. The result that the backup servers are almost empty also follows from these approximations. This sequence of approximations is illustrated in the figure below, where recall that is the observed configuration.

A crucial observation that leads to this stepwise proof is that the process forms a Markov chain on its own. This is because real jobs and virtual jobs have the same transition dynamics and are indistinguishable by the subroutine when requesting jobs. Moreover, by the construction of JRS, the Markov chain governs the dynamics of the virtual-job configurations and the configurations on backup servers.
Our proof consists of two steps. In Step 1, we focus on the process . We show that , which immediately implies because we have limited the total number of tokens to . In Step 2, we use the approximation result for in Step 1 to show that the total number of virtual jobs, , and the total number of jobs on backup servers are both . Recall that , so we get .
Next, we state the specific lemmas.
Step 1.
Lemma 2 below bounds the Wasserstein distance between and .
Lemma 2.
Under the conditions of Theorem 3 and being -irreducible, we have
The key challenge for proving Lemma 2 is that the job dispatching decisions are based on the configurations of all normal servers, which creates dependencies among the transitions of different servers. The key idea that helps us overcome this challenge is to consider the sum , which remains unchanged under job arrivals regardless of dispatching decisions. Observe that has decoupled dynamics across servers because it is only changed by internal phase transitions, departures, and requests of new jobs, which happen independently on each server. This helps us prove , which implies Lemma 2, as argued earlier in the section.
Formally, the proof of Lemma 2 makes use of Stein’s method (see, e.g., (Braverman and Dai, 2017; Braverman et al., 2017; Braverman, 2022)) to compare with . Stein’s method usually consists of three steps: generator comparison, Stein factor bound, and moment bound. In our case, due to the finiteness of the state space , we only need to do the generator comparison and the Stein factor bound. In the generator comparison step, we show that the instantaneous transition rates of match those of ; in the Stein factor bound step, we show that a small difference in the transition rates of and does not cause much increase in the overall distance between the distributions. The detailed proof is in Section 5.4.
Step 2.
We establish Lemma 3 and Lemma 4 below, which bound the steady-state expected number of virtual jobs and the jobs on backup servers.
Lemma 3.
Under the conditions of Theorem 3 and being -irreducible, for each , the steady-state expected number of virtual jobs of type is of the order , i.e.,
Lemma 4.
Under the conditions of Theorem 3 and being -irreducible, for each , the steady-state expected number of type jobs on backup servers is of the order , i.e.,
The key idea for proving Lemma 3 and Lemma 4 is that by the characterization of in Lemma 2 and the fact that the job requests are made based on , we can show that the rate of requesting jobs is approximately equal to the arrival rate for each job type. Therefore, the number of tokens rarely reaches or . This implies the rarity of virtual jobs and jobs on backup servers. The proofs are provided in Section B.2.
5.3. Proof of Theorem 3 Assuming Irreducibility Based on Lemmas 1–4.
Proof.
First we show that Lemmas 2 and 3 imply the closeness between and . By Lemma 2, for any , we have . By Lemma 3, . Recall that , so . Therefore,
(9)
Now we prove , the bound on the expected number of the active servers, by taking a suitable in (9). Observe that
(10)
where the last term on the RHS is by Lemma 4. To show that the difference between the first two terms on the RHS of (10) is also , consider . Because
for any , we have . By (9), . Therefore, . Recall that , so we get .
Similarly, to prove (6), we observe that
(11)
The last term of (11) can be bounded as , which is by Lemma 4 and the fact that is a finite set. To show that the difference between the first two terms on the RHS of (11) is also , consider , where is the Lipschitz constant of . Because
for any , we have . By (9), . Therefore, . Recall that , so we get .
To show (7) and (8), noting that , we have
where in the last inequality we have used the fact that . This completes the proof.
∎
5.4. More Details on the System and Proof of Lemma 2
To bound the distance between and , observe that because and , it suffices to bound the Wasserstein distance between and , i.e.,
(12)
where is a valid expression because as discussed in Remark 1 below.
5.4.1. More Details on System Dynamics and Generator
To prepare for the proof, we first look into the dynamics of the two systems under study. In particular, we write out the generators of and , which are used in the Stein’s method arguments.
We first examine the dynamics of the single-server system under Markovian policy . Four types of events change a single-server system’s configuration: internal transitions, departures, reactive requests, and proactive requests (see Section 4.1). The change of configuration due to any event can be represented by the diagram
where the arrow denotes an internal transition or a departure from configuration to if ; the arrow denotes a reactive request that adds jobs to the system if , and denotes a proactive request if . We call the above change of configuration a transition, and denote its rate as . Let denote the set of possible pairs in a transition.
We define the total transition rate at configuration as and define the maximal transition rate . Since is a finite set, we have . Also, observe that the request rate of type jobs is given by
(13)
where denotes the stationary distribution of the single-server configuration under policy .
Next, we focus on the dynamics of i.i.d. copies of single-server systems. Consider the generator of the corresponding Markov chain , which is a linear operator on functions defined as:
(14) |
and we call the resulting function the drift of . Based on the transition rates defined above, we have
(15) |
where is a shorthand for , i.e., we use to omit the entries that agree with .
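For reference, in generic notation of our own, the generator of a continuous-time Markov chain with transition rates $q(k,k')$ acts on a function $f$ as
\[
(G f)(k) \;=\; \sum_{k' \ne k} q(k, k')\,\big( f(k') - f(k) \big),
\]
and for $n$ independent copies $(x_1,\dots,x_n)$ it acts coordinate by coordinate,
\[
(G^{(n)} f)(x_1,\dots,x_n) \;=\; \sum_{\ell=1}^{n} \sum_{k' \ne x_\ell} q(x_\ell, k')\,\big( f(x_1,\dots,x_{\ell-1},k',x_{\ell+1},\dots,x_n) - f(x_1,\dots,x_n) \big),
\]
which is the kind of decoupled sum referred to in (15).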
Similarly, for the infinite-server system, consider the generator of defined as
(16) |
for any function . The drift of under turns out to have a similar decoupled form as : observe that for each , the transition of from to occurs at the rate for each , and any real or virtual job arrivals do not change the sum . Consider any function and the function .
(17) |
In this context is a shorthand for . In other words, we use to omit the entries of ’s input that agree with the corresponding entries of .
Remark 1.
In (17), although is only defined on the domain , it is valid to write as its input because we always have , i.e., the total number of real jobs, virtual jobs, and tokens on a normal server never exceeds . To see why this is true, the single-server policy requests jobs only when there are no tokens on the server, and it will not request more than jobs if there are already real and virtual jobs on the server.
5.4.2. Proof of Lemma 2.
Proof.
Generator Comparison. For any , consider the Poisson equation (see, e.g., (Braverman, 2022)) that solves for :
(18) |
We let in (18) and take the expectation. This results in
(19) |
On the other hand, because is a finite-state Markov chain, we have
(20) |
where is given by . Subtracting (20) from (19), we get
(21) |
We want to show that and are close so that we can bound the RHS of (21).
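In generic notation of our own, with two stationary processes $X$ and $\bar X$ whose generators are $G$ and $\bar G$, and with $f_h$ solving the Poisson equation $\bar G f_h = \mathbb{E}[h(\bar X)] - h$, this comparison step takes the standard form
\[
\mathbb{E}[h(\bar X)] - \mathbb{E}[h(X)]
\;=\; \mathbb{E}\big[ (\bar G f_h)(X) \big]
\;=\; \mathbb{E}\big[ (\bar G f_h)(X) - (G f_h)(X) \big],
\]
where the first equality evaluates the Poisson equation at $X$ and takes expectations, and the second uses $\mathbb{E}[(G f_h)(X)] = 0$, which holds because $X$ is stationary under $G$. Bounding the difference of the two generators applied to $f_h$ therefore bounds the difference of the expectations of $h$.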
Now we plug the formula of the generators in (15) and (17) into the RHS of (21) and get
(22) |
where in we have omitted the entries that agree with . The equality (i) is true because each of the -th terms in and is equal if . For the inequalities (ii) and (iii), recall that is the total transition rate given by , and . Observe that
Therefore (22) can be further bounded by
To prove (12), it remains to show that
(23) |
Stein Factor Bound. To prove (23), observe that the following is a solution to the Poisson equation (18):
(24) |
This allows us to bound the difference of using coupling. Specifically, we define a coupling of two systems, each consisting of i.i.d. copies of the single-server system under . The two systems are initialized with configurations and that only differ at the -th server, where we omit the entries that agree with . Let be the joint configuration of the two systems, which is actually i.i.d. copies of the single-server system. As a result, we can specify the couplings for different separately. For , the corresponding servers in the two systems have the same initial configurations, so we can always keep their configurations identical. For the -th servers, we let them evolve independently following their own dynamics until a stopping time when their configurations become the same. After that, we can use coupling to keep their configurations identical. Under this coupling, it is not hard to see that
(25) |
where in the second inequality we have used the fact that is -Lipschitz continuous under the norm of the space . For each pair of , observe that because is a -irreducible policy, is finite; and because is a finite set, is uniformly bounded. All these finite quantities depend on a single-server system under a policy that is independent of . As a result, the last expression in (25) is of constant order. Moreover, because there are finitely many pairs of , the supremum is also of constant order, independent of . This proves the Stein factor bound in (23). Together with the generator comparison, we have proved
(12) |
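To give a concrete feel for the coupling argument used in the Stein factor bound, the following Python sketch estimates the expected meeting time of two independently evolving copies of a small continuous-time Markov chain; the state space and rates are hypothetical stand-ins, not the actual single-server dynamics under the single-server policy. Once the two copies meet, they can be coupled to stay identical forever, which is exactly the structure exploited in (25).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite configuration space and generator Q -- stand-in rates,
# not the paper's single-server model.
n_states = 5
Q = rng.uniform(0.5, 1.5, size=(n_states, n_states))
np.fill_diagonal(Q, 0.0)
Q[np.diag_indices(n_states)] = -Q.sum(axis=1)

def next_jump(state):
    """Sample the holding time and the next state of the CTMC from `state`."""
    rate = -Q[state, state]
    holding = rng.exponential(1.0 / rate)
    probs = np.maximum(Q[state], 0.0)
    probs /= probs.sum()
    return holding, rng.choice(n_states, p=probs)

def meeting_time(x0, y0):
    """Run two copies independently until they occupy the same state;
    after that they can be coupled to stay identical forever."""
    t, x, y = 0.0, x0, y0
    hx, nx = next_jump(x)
    hy, ny = next_jump(y)
    tx, ty = hx, hy                    # absolute times of the next jumps
    while x != y:
        if tx <= ty:
            t, x = tx, nx
            hx, nx = next_jump(x)
            tx = t + hx
        else:
            t, y = ty, ny
            hy, ny = next_jump(y)
            ty = t + hy
    return t

# Monte Carlo estimate of the expected meeting time for two copies that
# start in different configurations, mirroring the bound in (25).
taus = [meeting_time(0, 1) for _ in range(2000)]
print("estimated expected meeting time:", np.mean(taus))
```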
5.5. Role of Tokens and Virtual Jobs
This section aims to shed light on the role of tokens and virtual jobs in the proposed policy, JRS. We first outline how we devise the token-and-virtual-job mechanism from the perspective of generators. To begin with, consider the scenario where dispatch decisions are solely based on real-job configurations . In this case, the transitions of servers’ configurations would be correlated in general due to job arrivals, which are dispatched based on the joint configuration of all servers. To break this correlation, let us introduce tokens but not set an upper limit yet on the number of tokens (thus no virtual jobs). Observe that remains unchanged by job arrivals, so tokens remove the correlation brought about by job arrivals. However, because tokens lack internal phase transitions or departures, the transition dynamics of will diverge from when is large. In other words, although tokens help decouple the transitions on servers, they cannot keep the transitions of close to . To solve this issue, we finally introduce the mechanism of converting tokens into virtual jobs when the number of tokens is high, where virtual jobs can make internal phase transitions or departures just like real jobs. Now, the sum is changed by neither job arrivals nor the creation of virtual jobs, and the internal phase transitions and job departures are similar to those of . More formally, the generators of and are close to each other – their additive difference can be upper bounded by a quantity proportional to the expected number of tokens, as shown in (22). Therefore, by regulating the number of tokens, we can control the difference between the generators of and .
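As a rough illustration of the bookkeeping described above, the following minimal Python sketch (hypothetical names, a single job type, and a made-up token limit) shows how an arrival checks out a token while adding a real job, leaving the sum of real jobs, virtual jobs, and tokens unchanged, and how excess tokens are converted into virtual jobs.

```python
class ServerState:
    """Schematic per-server bookkeeping; names and the token limit are
    illustrative, not the exact quantities used in the paper."""

    def __init__(self, token_limit=3):
        self.real = 0       # real jobs in service
        self.virtual = 0    # virtual (simulated) jobs
        self.tokens = 0     # outstanding job requests
        self.token_limit = token_limit

    def invariant(self):
        # the quantity whose transitions the analysis keeps close to
        # those of the single-server system
        return self.real + self.virtual + self.tokens

    def request_job(self):
        """Server requests a job: a token is created; if the token count
        exceeds the limit, the excess token becomes a virtual job so that
        it has phase transitions and departures like a real job."""
        self.tokens += 1
        if self.tokens > self.token_limit:
            self.tokens -= 1
            self.virtual += 1

    def arrival(self):
        """A real job is dispatched to this server: it checks out a token,
        so real + virtual + tokens is unchanged."""
        if self.tokens > 0:
            self.tokens -= 1
            self.real += 1


s = ServerState()
s.request_job(); s.request_job()
before = s.invariant()
s.arrival()                      # an arrival leaves the invariant unchanged
assert s.invariant() == before
```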
Another key design component of JRS is that the subroutine requests jobs based on the observed configurations. This has been used in the proof of Lemma 2 to show that the observed configurations are close to , which consist of i.i.d. single-server systems. Recall that each single-server system in is designed to have a throughput of for each job type . Therefore, the proximity between and ensures that the job request rate mirrors the arrival rate for each job type, regardless of the real-job configurations. The fact that these two rates are approximately equal is important for proving Lemma 3 and Lemma 4. It guarantees that both the rate of generating virtual jobs (when there are too many tokens) and the rate of dispatching jobs to backup servers (when there are no tokens) are appropriate.
A natural follow-up question is whether the usage of tokens and virtual jobs is fundamental or an artifact of our analysis technique. For example, it is unclear whether removing the upper limit on the number of tokens would still yield an asymptotically optimal policy. This is an interesting question that we do not have a complete answer to. The token-and-virtual-job mechanism emerges as a natural choice under our analysis framework. Nevertheless, it is worth noting that our analysis primarily treats each server’s configuration as a generic Markov chain, without utilizing many properties specific to the stochastic bin-packing setting. An exception to this is the proofs in Section B.2, where we use the model that each job leaves the system within a constant expected time. It would be interesting to explore more properties of the problem to better understand policy designs without auxiliary state variables like tokens and virtual jobs.
6. Conclusion
In this paper, we study a new setting of stochastic bin-packing in service systems that features time-varying item sizes. Since our formulation is motivated by the problem of virtual-machine scheduling in computing systems, we use the terminology of jobs and servers, where jobs are viewed as items, whose sizes are their resource requirements, and servers as bins. The time-varying item sizes capture the emerging trend in practice that jobs’ resource requirements vary over time. Our goal is to design a job dispatch policy to minimize the expected number of active servers in steady state, subject to a constraint on resource contentions. Our main result is the design of a policy that achieves an optimality gap of , where is the scaling factor of the arrival rate. When specialized to the setting where jobs’ resource requirements remain fixed over time, this result improves upon the state-of-the-art optimality gap. Our technical approach highlights a novel policy conversion framework, Join-Requesting-Server, that reduces the policy design problem to that in a single-server system.
There are several potential directions that may be worth further exploration. One direction is to strengthen the optimality result within the current setting. Specifically, it is interesting to investigate: (i) whether it is possible to achieve an optimality gap smaller than ; and (ii) whether there exist asymptotically optimal policies whose average cost rate of resource contention satisfies the budget strictly instead of asymptotically.
We are also interested in extending our technique to the optimal control of other systems with similar structures. Intuitively, this technique could be applied to systems with many components that evolve mostly independently but are weakly coupled by certain constraints. Viewing each component as a server, we can define a suitable single-server problem and then design a policy for the original system to track the dynamics of the optimal single-server solution. Below we list several variations of our model that can potentially be analyzed using the proposed technique.
-
•
A model where jobs running on each server will be put into a local queue when there are resource contentions. The goal thus becomes finding the optimal trade-offs between the number of active servers and the waiting time of the jobs.
-
•
A model that allows each server to have a Markovian state that affects the dynamics of the jobs running on the server.
-
•
A model that allows jobs to migrate to different servers at the cost of migration delays.
-
•
A closed-system model where jobs re-enter the system after completion.
A third possible direction is to tackle the problem when the arrival rates and the parameters in the job model are unknown, as mentioned in Section 4.3. A possible approach is to develop an approximate version of the JRS framework, where the optimal single-server policy and the simulator for the virtual jobs are both learned from data. It is desirable to design such an approximate framework whose performance degrades gracefully as the approximation error increases.
Acknowledgements.
Y. Hong and W. Wang are supported in part by NSF grants CNS-200773 and ECCS-2145713. Q. Xie is supported in part by NSF grant CNS-1955997.
References
- (1)
- Ayyadevara et al. (2022) Nikhil Ayyadevara, Rajni Dabas, Arindam Khan, and K. V. N. Sreenivas. 2022. Near-Optimal Algorithms for Stochastic Online Bin Packing. In Proc. Int. Conf. Automata, Languages and Programming (ICALP), Vol. 229. 12:1–12:20.
- Bashir et al. (2021) Noman Bashir, Nan Deng, Krzysztof Rzadca, David Irwin, Sree Kodak, and Rohit Jnagal. 2021. Take It to the Limit: Peak Prediction-Driven Resource Overcommitment in Datacenters. In Proc. European Conf. Computer Systems (EuroSys). Online Event, United Kingdom, 556–573.
- Braverman (2022) Anton Braverman. 2022. The Prelimit Generator Comparison Approach of Stein’s Method. Stoch. Syst. 12, 2 (2022), 181–204.
- Braverman and Dai (2017) Anton Braverman and J. G. Dai. 2017. Stein’s method for steady-state diffusion approximations of systems. Ann. Appl. Probab. 27 (Feb. 2017), 550–581.
- Braverman et al. (2017) Anton Braverman, J. G. Dai, and Jiekun Feng. 2017. Stein’s method for steady-state diffusion approximations: an introduction through the Erlang-A and Erlang-C models. Stoch. Syst. 6, 2 (2017), 301–366.
- Buchbinder et al. (2021) Niv Buchbinder, Yaron Fairstein, Konstantina Mellou, Ishai Menache, and Joseph (Seffi) Naor. 2021. Online Virtual Machine Allocation with Lifetime and Load Predictions. ACM SIGMETRICS Perform. Eval. Rev. 49, 1 (May 2021), 9–10.
- Cloud (2023a) Google Cloud. 2023a. Overcommitting CPUs on sole-tenant VMs. https://cloud.google.com/compute/docs/nodes/overcommitting-cpus-sole-tenant-vms.
- Cloud (2023b) Google Cloud. 2023b. Virtual machine instances. https://cloud.google.com/compute/docs/instances.
- Coffman et al. (1983) E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. 1983. Dynamic Bin Packing. SIAM J. Comput. 12, 2 (1983), 227–258.
- Courcobetis and Weber (1990) Coastas Courcobetis and Richard Weber. 1990. Stability of On-Line Bin Packing with Random Arrivals and Long-Run-Average Constraints. Probab. Eng. Inf. Sci. 4, 4 (1990), 447–460.
- Courcoubetis and Weber (1986) C. Courcoubetis and R. R. Weber. 1986. Necessary and Sufficient Conditions for Stability of a Bin-Packing System. J. Appl. Probab. 23, 4 (1986), 989–999.
- Csirik et al. (2006) Janos Csirik, David S. Johnson, Claire Kenyon, James B. Orlin, Peter W. Shor, and Richard R. Weber. 2006. On the Sum-of-Squares Algorithm for Bin Packing. J. ACM 53, 1 (Jan. 2006), 1–65.
- Delimitrou and Kozyrakis (2014) Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proc. Int. Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, 127–144.
- Foundation (2023a) Apache Software Foundation. 2023a. Apache Mesos: Containerizers. https://mesos.apache.org/documentation/latest/containerizers/.
- Foundation (2023b) Apache Software Foundation. 2023b. Apache Mesos: Oversubscription. https://mesos.apache.org/documentation/latest/oversubscription/.
- Freund and Banerjee (2019) Daniel Freund and Siddhartha Banerjee. 2019. Good prophets know when the end is near. Available at SSRN: https://ssrn.com/abstract=3479189 (Nov. 2019).
- Ghaderi et al. (2014) Javad Ghaderi, Yuan Zhong, and R. Srikant. 2014. Asymptotic Optimality of BestFit for Stochastic Bin Packing. SIGMETRICS Perform. Eval. Rev. 42, 2 (Sept. 2014), 64–66.
- Gupta and Radovanović (2020) Varun Gupta and Ana Radovanović. 2020. Interior-Point-Based Online Stochastic Bin Packing. Oper. Res. 68, 5 (2020), 1474–1492.
- Kleinrock (1975) Leonard Kleinrock. 1975. Queueing Systems. John Wiley & Son.
- Li et al. (2014) Yusen Li, Xueyan Tang, and Wentong Cai. 2014. On Dynamic Bin Packing for Resource Allocation in the Cloud. In Proc. Ann. ACM Symp. Parallelism in Algorithms and Architectures (SPAA). Prague, Czech Republic, 2–11.
- Lo et al. (2015) David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. 2015. Heracles: Improving resource efficiency at scale. In Proc. ACM/IEEE Ann. Int. Symp. Computer Architecture (ISCA). Portland, OR, 450–462.
- Maguluri and Srikant (2013) Siva Theja Maguluri and R. Srikant. 2013. Scheduling jobs with unknown duration in clouds. In Proc. IEEE Int. Conf. Computer Communications (INFOCOM). Turin, Italy, 1887–1895.
- Maguluri et al. (2012) Siva Theja Maguluri, R Srikant, and Lei Ying. 2012. Stochastic Models of Load Balancing and Scheduling in Cloud Computing Clusters. In Proc. IEEE Int. Conf. Computer Communications (INFOCOM). Orlando, FL, 702–710.
- Maguluri et al. (2014) Siva Theja Maguluri, R. Srikant, and Lei Ying. 2014. Heavy traffic optimal resource allocation algorithms for cloud computing clusters. Perform. Eval. 81 (2014), 20–39.
- Meyn (2007) Sean Meyn. 2007. Control Techniques for Complex Networks (1st ed.). Cambridge University Press, USA.
- Psychas and Ghaderi (2018) Konstantinos Psychas and Javad Ghaderi. 2018. On Non-Preemptive VM Scheduling in the Cloud. In Proc. ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems. Irvine, CA, 67–69.
- Psychas and Ghaderi (2019) Konstantinos Psychas and Javad Ghaderi. 2019. Scheduling Jobs with Random Resource Requirements in Computing Clusters. In Proc. IEEE Int. Conf. Computer Communications (INFOCOM). 2269–2277.
- Psychas and Ghaderi (2021) Konstantinos Psychas and Javad Ghaderi. 2021. High-Throughput Bin Packing: Scheduling Jobs With Random Resource Demands in Clusters. IEEE/ACM Trans. Netw. 29, 1 (2021), 220–233.
- Psychas and Ghaderi (2022) Konstantinos Psychas and Javad Ghaderi. 2022. A Theory of Auto-Scaling for Resource Reservation in Cloud Services. Stoch. Syst. 12, 3 (2022), 227–252.
- Reiss et al. (2012) Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. 2012. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proc. ACM Symp. Cloud Computing (SoCC). San Jose, CA, Article 7, 13 pages.
- Rzadca et al. (2020) Krzysztof Rzadca, Pawel Findeisen, Jacek Swiderski, Przemyslaw Zych, Przemyslaw Broniek, Jarek Kusmierek, Pawel Nowak, Beata Strack, Piotr Witusowski, Steven Hand, and John Wilkes. 2020. Autopilot: Workload Autoscaling at Google. In Proc. European Conf. Computer Systems (EuroSys). Heraklion, Greece, Article 16, 16 pages.
- Stolyar (2013) Alexander L. Stolyar. 2013. An Infinite Server System with General Packing Constraints. Oper. Res. 61, 5 (2013), 1200–1217.
- Stolyar (2017) Alexander L. Stolyar. 2017. Large-scale heterogeneous service systems with general packing constraints. Adv. Appl. Probab. 49 (March 2017), 61–83. Issue 1.
- Stolyar and Zhong (2013) Alexander L. Stolyar and Yuan Zhong. 2013. A large-scale service system with packing constraints: Minimizing the number of occupied servers. ACM SIGMETRICS Perform. Eval. Rev. 41, 1 (June 2013), 41–52.
- Stolyar and Zhong (2015) Alexander L. Stolyar and Yuan Zhong. 2015. Asymptotic optimality of a greedy randomized algorithm in a large-scale service system with general packing constraints. Queueing Syst. 79 (June 2015), 117–143. Issue 2.
- Stolyar and Zhong (2021) Alexander L. Stolyar and Yuan Zhong. 2021. A Service System with Packing Constraints: Greedy Randomized Algorithm Achieving Sublinear in Scale Optimality Gap. Stoch. Syst. 11 (June 2021), 83–111. Issue 2.
- Tirmazi et al. (2020) Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. 2020. Borg: The next Generation. In Proc. European Conf. Computer Systems (EuroSys). Heraklion, Greece, Article 30, 14 pages.
- Wilkes (2019) John Wilkes. 2019. Google cluster-usage traces v3. http://github.com/google/cluster-data.
- Xie et al. (2015) Qiaomin Xie, Xiaobo Dong, Yi Lu, and R. Srikant. 2015. Power of d Choices for Large-Scale Bin Packing: A Loss Model. In Proc. ACM SIGMETRICS Int. Conf. Measurement and Modeling of Computer Systems. Portland, OR, 321–334.
Appendix A Proof of Theorem 2 (Lower Bound)
Proof.
It is sufficient to show that given an infinite-server policy for in (2), we have . To this end, we will construct a single-server policy such that the resulting system configuration in steady state satisfies:
(27)
(28)
Let be the distribution of ; then is a feasible solution to the problem in (3). As a result, we have . Note that although is actually non-Markovian, i.e., it makes decisions based on not only the current configuration but also the history, as we will show in Appendix C, is still a lower bound to the objective value that can achieve in .
The construction of the single-server policy involves simulating an infinite-server system under from the empty configuration. At time , the policy randomly chooses the -th server in the infinite-server system with probability , for . It then requests jobs for the single-server system according to a policy . The key to our policy is to make the single-server system emulate the job assignment at the -th server of the simulated infinite-server system, but without incurring idleness. We first construct the policy , and then specify the probabilities .
Let us start by introducing some useful notation. Let be the single-server system configuration under at time and be the configuration of the -th simulated server in the infinite-server system under . We define a stochastic process as follows: . The “” is well-defined because the integral is continuous in . Intuitively, gives the maximum time when the cumulative busy time of the -th server is . Note that is discontinuous only when reaches ; thus it is right-differentiable with derivative equal to at any point.
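To illustrate the time change defined above, the following hypothetical sketch computes it for a piecewise-constant busy/idle trajectory: given the busy periods of the simulated server, it returns the latest time at which a prescribed amount of busy time has been accumulated, thereby skipping the idle period that follows. The busy periods and the exact-boundary handling are illustrative only.

```python
# Hypothetical busy periods of the simulated k-th server, as (start, end) pairs.
busy_periods = [(0.0, 2.0), (3.5, 5.0), (7.0, 9.5)]

def time_change(s):
    """Largest time t at which the cumulative busy time equals s: if s falls
    exactly at the end of a busy period, we skip the idle period that follows
    (floating-point equality is tolerated here for illustration only)."""
    total = 0.0
    for i, (a, b) in enumerate(busy_periods):
        length = b - a
        if total + length > s:
            return a + (s - total)          # strictly inside a busy period
        if total + length == s:
            if i + 1 < len(busy_periods):   # skip the following idle period
                return busy_periods[i + 1][0]
            return b
        total += length
    return float("inf")                      # more busy time than ever accrued

print(time_change(2.0))   # 3.5: skips the idle period (2.0, 3.5)
print(time_change(2.5))   # 4.0: inside the second busy period
```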
We construct and the simulation of the infinite-server system under in a way such that:
(29) |
That is, we want the single-server system to have the same dynamics as the simulated -th server, except that it skips the idle periods. To this end, we couple the two systems as follows:
-
(1)
When the -th simulated server receives a type job, we let the single-server system request a type job at time . For each such job, its phase transition process in the -th simulated server is the same as that in the single-server system. That is, when we observe any internal transition or departure event in , we produce the same event on the -th simulated server .
-
(2)
The simulations of the rest of the infinite-server system under policy are driven by independently generated random seeds.
It is not hard to see that the simulated infinite-server system has the same stochastic behavior as an uncoupled system under . Moreover, since we couple all the events that happen in and , together with the facts that and are piecewise constant and , we get (29).
Next we claim that (29) implies the following relationship between the steady-state cost of the single-server system under and the steady-state cost of the -th simulated server in the infinite-server system under :
(30) |
This is because for all , we have
where and hold because long-run averages converge to steady-state expectations; is due to the fact that for any ; is due to the fact that and follows from (29).
Let be the long-run request rates of type jobs in the single-server system under , and be the throughput of type jobs of the -th simulated server under . By the construction of , under which the single-server system requests jobs based on the arrival events of the -th simulated server, we have
With the constructed policies , we are ready to define the policy . We let choose an index with probability at time , and then follow . We set the probability as
(31) |
where the second inequality uses the fact that . Then under , we have
where follows from (30) and (31). Observe that under , almost surely, we thus have
which proves (27). Moreover, for each the request rate is given by
This proves (28). By the argument presented at the beginning of the proof, we get . ∎
Appendix B The Rest of the Proofs Needed for Theorem 3 (Conversion Theorem)
B.1. Proof of Lemma 1
Proof.
We will show that under the JRS policy, the Markov chain for the system state (represented as ) has a unique stationary distribution, by first arguing that it is -irreducible (here, -irreducible means that the Markov chain has a state that can be reached from all other states through transitions; the “” in “-irreducible” does not refer to any specific state), and then using the Foster-Lyapunov theorem to show positive recurrence (see, e.g., Meyn, 2007). Combining -irreducibility with positive recurrence, we conclude that the Markov chain under study has a unique stationary distribution.
First, we show that the Markov chain is -irreducible. Specifically, observe that the Markov chain starting from any state can reach the state after experiencing a sequence of departures and arrivals that clears up all the tokens, virtual jobs, and jobs on backup servers. Further, letting be the configuration reachable from all other configurations in the single-server system under the policy , we argue that starting from any state of the form , the Markov chain can reach the state . Because for any there is a transition path from to , consider the sequence of events in which each transitions independently along the path, and the jobs arrive right after making a request, so that the tokens are checked out before has a further transition. In this way, each with can eventually reach from . This proves the -irreducibility of .
Next, we show that satisfies the Foster-Lyapunov criterion, i.e.,
(32) |
where is a non-negative function of the states, is a finite set, is a finite number, and is the infinitesimal generator of the continuous-time Markov chain. Let be the expected remaining time in the system when a job is in phase for each . According to the job model, we have the recurrence relation
(33) |
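For concreteness, the recurrence in (33) can be solved numerically as a linear system. The sketch below does this for a hypothetical three-phase job model with made-up internal transition rates mu[i][j] and departure rates nu[i]; the expected remaining time of a job in phase i equals the mean holding time in that phase plus the weighted remaining times of the phases it may jump to. The numbers are placeholders, not quantities from the paper.

```python
import numpy as np

# Hypothetical 3-phase job model: mu[i, j] is the internal transition rate
# from phase i to phase j, nu[i] is the departure rate from phase i.
mu = np.array([[0.0, 1.0, 0.5],
               [0.2, 0.0, 0.8],
               [0.0, 0.3, 0.0]])
nu = np.array([0.1, 0.4, 1.2])

total = mu.sum(axis=1) + nu          # total outflow rate of each phase
P = mu / total[:, None]              # jump probabilities to the other phases

# Recurrence in the spirit of (33), in matrix form:
#   r = 1/total + P r   =>   (I - P) r = 1/total
r = np.linalg.solve(np.eye(3) - P, 1.0 / total)
print("expected remaining times by phase:", r)
```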
We construct a Lyapunov function as follows:
(34) |
Using the relation (33), it can be verified that the drift of satisfies
where the first inequality uses the fact that virtual jobs are generated at a rate no faster than the total rate of job requests. Then the Foster-Lyapunov criterion in (32) is satisfied with and given by
By the Foster-Lyapunov theorem, is positive recurrent. ∎
B.2. Proofs of Lemma 3 and Lemma 4
In this subsection, we prove Lemma 3 and Lemma 4 together. We begin by introducing some notation. We use to represent the state of the -th server, and use to represent the joint state of the first servers. We also use the lowercase , to represent the realizations of the corresponding random variables. We denote the total number of type virtual jobs as for , and its realization as . We denote the total number of type jobs on backup servers as for , and its realization as . We also denote the total number of type tokens throughout the system as , and its realization as . Our goal can be rewritten as proving and for each .
We first give an overview of the proof. Observe that in our model, the expected time that a job stays in the system is fixed. As a result, bounding the number of virtual jobs or jobs on backup servers in the system is equivalent to bounding the rate at which they are generated, according to Little’s Law. By our construction of the policy, the rate of generating those jobs is closely related to the dynamics of the total number of type tokens .
To describe the dynamics of , we first introduce two functions and :
The function represents the increment in the number of type virtual jobs due to the event that the total number of type tokens on the first servers exceeds the token limit . The function corresponds to the increment in the total number of type jobs on backup servers due to the event that a type job arrives at the system without seeing a type token. For a function that only depends on the number of type tokens , its drift can be written as
(35) |
We abuse the notation of here. For ease of exposition, we will simply write and to represent and .
By construction, the total number of type tokens is a stochastic process constrained within . Note that increases when some servers request new tokens, and decreases when a real or virtual arrival checks out a token or when some servers have excess tokens removed. When is away from the boundaries, the average rate at which it increases is approximately equal to , and the average rate at which it decreases is given by
where we have used the approximations that , and .
As randomly moves up and down at approximately the same rates and reflects at the boundaries of and , it behaves like a reflected simple symmetric random walk. Intuitively speaking, the steady-state distribution of is approximately a uniform distribution over . Recall that and can only be non-zero when is near the boundaries. Since the length of the interval , we can expect that and diminish as .
In the proof, we first establish the relationship between , and , using Little’s Law. Then we derive bounds on and by analyzing the drift of several test functions of . This step is implicitly based on the intuition that is approximately uniformly distributed over , with the tokens being generated and eliminated at similar rates. Finally, we invoke Lemma 2 to show that the tokens are indeed generated and eliminated at similar speeds, which leads to bounds on and .
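The reflected-random-walk intuition can be checked numerically. The hypothetical simulation below runs a simple symmetric random walk reflected on {0, ..., K} and measures the fraction of time spent at the two boundaries; this fraction decays roughly like 2/(K+1), mirroring why the rates of generating virtual jobs and dispatching to backup servers diminish as the token interval grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def boundary_fraction(K, n_steps=100_000):
    """Fraction of time a +-1 symmetric walk reflected on {0, ..., K}
    spends at the two boundaries 0 and K."""
    steps = rng.choice((-1, 1), size=n_steps)
    x, hits = K // 2, 0
    for s in steps:
        x = min(K, max(0, x + s))
        hits += (x == 0) or (x == K)
    return hits / n_steps

# The boundary fraction decays roughly like 2/(K+1) as K grows.
for K in (10, 40, 160):
    print(K, round(boundary_fraction(K), 4))
```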
Finally, we make some additional remarks on the notations. First, and depend on the total number of type tokens and the number of newly requested jobs , although we omit the dependency expression for ease of exposition. Second, we abuse the notation and to denote the corresponding random variables. We also write as a shorthand for when the context is clear.
B.2.1. Proofs of Lemma 3 and Lemma 4.
Proof.
Step 1: Bounding Virtual Jobs and Jobs on Backup Servers using Little’s Law. We first apply Little’s Law to and . For each type , we let the expected time that a type job stays in the system be . Let . Because there are only finitely many types of jobs, and each job spends finite expected time in the system, is a finite constant. By Little’s law,
(36)
(37)
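As a quick numerical illustration of how Little’s Law enters here (with made-up numbers, not quantities from the model): if virtual jobs of some type are generated at a known average rate and every job stays in the system for a bounded expected time, then the steady-state expected number of such jobs is at most the product of the two.

```python
# Hypothetical numbers: virtual jobs of some type are generated at an average
# rate of 0.3 per unit time, and every job (virtual or real) stays in the
# system for at most t_max = 5 units of time in expectation.
generation_rate = 0.3
t_max = 5.0

# Little's Law: E[number in system] = (arrival rate) x (mean time in system),
# so the steady-state expected number of virtual jobs is at most:
expected_virtual_jobs = generation_rate * t_max
print(expected_virtual_jobs)   # 1.5
```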
Step 2: Drift Analysis. The above two equations (36) and (37) suggest that we can derive upper bounds on and by analyzing the following two terms:
-
•
, interpreted as the average rate that reflects on the boundary at ;
-
•
, interpreted as the average rate that reflects on the boundary at .
We establish the relationships of , and by analyzing the drift of two test functions .
Letting and taking steady-state expectation over its drift, by (B.2) and the fact that the drift is zero in steady state, we get
(38) |
Similarly, letting and taking steady-state expectation over its drift, one can verify that
(39)
(40)
(41)
Readers may refer to the complete calculation at the end of this subsection.
Step 3: Estimating the Terms Obtained from Drift Analysis. We will first focus on bounding , analyzing the two terms in (40) and (41) separately. Then we invoke (38) to bound .
The term in (41) is easy to deal with. Observe that the number of jobs requested each time should be no more than the maximal number of jobs that a server can hold, i.e., , so
(42) |
where in the second inequality we have used the fact that the total rate is uniformly bounded by , and the last step uses the facts that and .
To bound the term in (40), first observe that , which implies that
The term on the RHS of the above equation is the expected absolute difference between the rates of generating and eliminating type tokens, which can be shown to be small relative to . Specifically, we claim that
(43) |
To show (43), first notice that we can remove the indicator without introducing much error:
where the first inequality is due to the triangle inequality, the second inequality is due to the definition of , and the last inequality is because . It remains to bound the term , which can be seen as showing that the rate of generating type tokens concentrates around the type jobs’ arrival rate . It is natural to think of using some form of the Law of Large Numbers. Unfortunately, is not a sum of i.i.d. random variables due to dependencies among for different ’s. As a result, we instead invoke the Wasserstein distance bound in Lemma 2 to replace in the above expression with . We define the function as
(44) |
We claim that . For any two , ,
where the first inequality is due to the triangle inequality; the second inequality uses the fact that and ; the third inequality uses the triangle inequality, the fact that the total rate at a configuration is bounded by , and the property of the norm . Therefore, . The Lipschitz continuity of allows us to invoke Lemma 2 and get
Therefore,
(45) |
Observe that under a Markovian policy, the request rate of type jobs can be written as , so we have
(46) |
Moreover, because are i.i.d. for , we have
Therefore, by combining the arguments above, we get
(47) |
which proves (43). This implies that the term in (40) is also in .
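The step from (46) to (47) uses the fact that a sum of B i.i.d. bounded random variables deviates from its mean by O(sqrt(B)) in expectation. The small Monte Carlo check below (with uniform random variables as placeholders) illustrates this square-root scaling.

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_abs_deviation(B, n_trials=5_000):
    """Estimate E|S_B - E[S_B]| for S_B a sum of B i.i.d. Uniform(0,1) variables."""
    sums = rng.uniform(0.0, 1.0, size=(n_trials, B)).sum(axis=1)
    return np.mean(np.abs(sums - B * 0.5))   # true mean of S_B is B/2

# The ratio below stays roughly constant, i.e., the deviation grows like sqrt(B).
for B in (100, 400, 1600):
    print(B, round(mean_abs_deviation(B) / np.sqrt(B), 3))
```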
B.2.2. Deriving the equality in (39)
We show the details of the calculation deriving the following equality.
(39)
(40)
(41)
The equality is obtained by considering the drift of the function , which is zero in steady state. Recall that the drift of is given by
(B.2) |
We will first calculate , then , and finally plug them into (B.2).
The calculation of utilizes the following property of :
(48) |
This property follows from the definition . Intuitively, this is because is the “force” that pushes back when it hits the boundary at . Using the property, we have
The second last equality is due to (48), and the rest are all algebraic manipulations.
We carry out a similar calculation for :
where the last equality is due to the property that
(49) |
and the rest are all algebraic manipulations.
Putting together,
After recombining the terms, we get
This finishes the calculation.
B.3. Proof of Conversion Theorem without Assuming Irreducibility
In this subsection, we prove Theorem 3 without assuming -irreducibility of the subroutine . Specifically, suppose that we have a Markovian single-server policy and an initial distribution over its recurrent classes for . We will construct an infinite-server policy such that (5) and (6) still hold. The basic idea is to decompose a general Markovian single-server policy into multiple -irreducible Markovian policies, each of which induces one recurrent class and preserves the stationary distribution and throughput of on that recurrent class, as stated in Lemma 1 below.
Lemma 0 (Decomposing The Reducible Policy).
Let be a general single-server Markovian policy with recurrent classes for . Then for each there exists a Markovian policy such that
-
•
The induced Markov chain is -irreducible with the unique recurrent class being ;
-
•
The stationary distribution is the same as the stationary distribution under starting from a configuration in .
Proof.
For each , we define the policy as follows: when the system has configuration , the policy makes the same decisions as ; when and , the policy starts a timer whose duration follows an exponential distribution with rate and, when the timer ticks, immediately adds many jobs of each type , for some arbitrary ; when and , the policy does not request any jobs.
We show that under the new policy , is also a recurrent class of the induced Markov chain. This is because if the system starts from a configuration , then it will stay in since it makes the same decisions and has the same transitions as under the policy . Because is a recurrent class under , it is still a recurrent class under .
To show that the Markov chain induced by is -irreducible, observe that starting from any , the system state will return to . Specifically,
-
•
If the system starts from a configuration such that and , then no new jobs will be requested until either or . In the latter case, by the construction of the policy, the system jumps to a configuration in after the next transition.
-
•
If the system starts from a configuration such that and , the system jumps to a configuration in after the next transition.
The claim that the stationary distribution under is the same as the stationary distribution under starting from a configuration in is straightforward to show: when the system is initialized from any configuration in , it stays in and the transitions are exactly the same under the two policies. ∎
The JRS policy with a general Markovian single-server policy as its subroutine is constructed using the -irreducible policies ’s obtained from the decomposition of and the probabilities ’s, as follows; a schematic sketch of this construction is given after the list.
-
(1)
We divide all the servers into server pools, each with infinitely many servers. Let the -th server pool run the JRS policy with subroutine (defined in Section 4.2) for each .
-
(2)
Whenever we see an arrival of type , we route the job to the -th infinite-server system with probability for each .
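A minimal Python sketch of the two-step construction above (pool names, routing probabilities, and the pool class are placeholders; the per-pool JRS dispatch logic is not reproduced): each arrival of a given type is routed to one of the server pools at random, and each pool would run JRS with its own irreducible subroutine.

```python
import random

random.seed(0)

class JRSPool:
    """Placeholder for a server pool running JRS with one irreducible
    subroutine; the dispatch logic itself is not reproduced here."""

    def __init__(self, name):
        self.name = name
        self.dispatched = 0

    def dispatch(self, job_type):
        self.dispatched += 1


# Hypothetical routing probabilities q[l][j]: probability that a type-j
# arrival is routed to pool l (summing to one over l for each type j).
q = {1: {"A": 0.7, "B": 0.4},
     2: {"A": 0.3, "B": 0.6}}
pools = {l: JRSPool(f"pool-{l}") for l in q}

def route(job_type):
    """Send one arrival of `job_type` to a pool drawn with probabilities q."""
    u, acc = random.random(), 0.0
    for l in sorted(q):
        acc += q[l][job_type]
        if u <= acc:
            pools[l].dispatch(job_type)
            return l
    return l  # numerical safety: fall back to the last pool

counts = {1: 0, 2: 0}
for _ in range(10_000):
    counts[route("A")] += 1
print(counts)  # roughly {1: 7000, 2: 3000}
```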
To analyze the policy , let and ’s be the stationary distribution and throughput of the policy for . By Lemma 1, we have the following relationships:
(50)
(51)
Based on the above relationships, we can prove the general version of the Conversion Theorem that does not require -irreducibility.
Proof of Theorem 3..
For each , Lemma 1 implies that the policy and stationary distribution form a feasible solution to the single-server problem , and the corresponding objective value is . Consider the infinite-server system with arrival rates and budget . As we have proved Theorem 3 for the JRS policy with -irreducible subroutine , it follows that
(52)
(53)
where is the random variable representing the steady-state number of servers in configuration in the infinite-server system under the JRS policy with subroutine .
By the construction of , the arrival rate to the -th server pool is equal to , where the first equality is due to (51), and the second equality is due to the condition that . Therefore, (52) and (53) hold, and we have
Here we use (52) and the relationship between and . Similarly,
After re-indexing the servers, we get (5) and (6). The bounds on and , (7) and (8), follow from (5) and (6). They can be verified in the same way as that in the proof for the irreducible case, so we omit the argument here. ∎
Appendix C Solving the Single-Server Problem
In this section, we show the equivalence of the single-server problem in (3) and a linear program (LP) as stated in Theorem 4. The equivalence needs to be proved in two directions. In Section C.1, we first derive the linear program (61) as a relaxation of the single-server problem (3) so that the optimal value of the LP is a lower bound to the optimal value of (3). Then in Section C.2, we will construct a Markovian single-server policy that achieves the optimal value of the LP, which implies the optimality of the policy.
C.1. Lower Bound via LP Relaxation
In this subsection we derive an LP relaxation of the optimization problem (3), restated below:
(3 revisited)
subject to
Here we allow to be non-Markovian, i.e., it can make decisions based on the history, but we still require its performance metrics used in the objective and the constraints of (3) to have well-defined steady-state distributions.
Observe that both and depend on the stationary distribution , but the constraints in terms of are implicit. To derive an LP relaxation, we give an explicit characterization of the constraints that must be satisfied by the stationary distribution induced by any feasible policy .
To do this, we derive a version of the stationary equation in terms of a quantity called transition frequency. The transition frequency of type jobs is a function describing the steady-state frequency of requesting a type job when the system has configuration . To rigorously define transition frequency, we first introduce a concept called nominal transition.
Definition 0 (Nominal Transition).
Consider a single-server system under any policy. When the configuration transitions from to for some with new jobs added into service, we decompose the transition by adding intermediate configurations as illustrated below, where first goes to if , and then jobs of each type are added one by one.
where is a fixed ordering of the set of phases . We call each short transition in the diagram a nominal transition.
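The decomposition in Definition 1 can be made concrete with a small helper. In the hypothetical sketch below, a configuration is represented as a dictionary counting the jobs in each phase, and a transition that also adds a batch of newly requested jobs is broken into a chain of nominal transitions, each adding one job, following a fixed phase order. The representation and example data are illustrative only.

```python
def nominal_transitions(u, u_mid, added, phase_order):
    """Decompose a transition from configuration `u` to `u_mid` plus the
    batch `added` (a dict phase -> count of newly requested jobs) into
    nominal transitions, adding the new jobs one at a time in `phase_order`.

    Configurations are dicts mapping phase -> number of jobs in that phase.
    """
    path = [dict(u)]
    if u_mid != u:                      # internal transition or departure first
        path.append(dict(u_mid))
    current = dict(path[-1])
    for phase in phase_order:           # then add the new jobs one by one
        for _ in range(added.get(phase, 0)):
            current = dict(current)
            current[phase] = current.get(phase, 0) + 1
            path.append(current)
    return path


# Example: a departure from phase "p2" followed by two newly added "p1" jobs.
u     = {"p1": 1, "p2": 1}
u_mid = {"p1": 1, "p2": 0}
steps = nominal_transitions(u, u_mid, {"p1": 2}, phase_order=["p1", "p2"])
for a, b in zip(steps, steps[1:]):
    print(a, "->", b)
```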
For , we denote as the cumulative number of nominal transitions from to during the time interval , which is a random variable with a distribution depending on the single-server policy and initial distribution of configurations.
Note that for any and s.t. , counts the number of times that a type job departs when being in configuration . As a result,
where denotes a unit rate Poisson process. If we take expectation, divide both sides by , and let , we have
(54) |
Similarly, counts the number of times a job in phase transitions to phase when being in configuration for any , s.t. and , so
(55) |
We define transition frequency as follows.
Definition 0 (Transition Frequency).
The transition frequency of type jobs at configuration is the long-run average number of nominal transitions from configuration to per unit time,
(56) |
The transition frequencies allow us to derive the following version of the stationary equation.
Lemma 0 (Stationary Equation).
Under any policy, the stationary distribution and the transition frequency satisfy the following equation:
(57)
for any state , where and are shorthand for .
Proof.
For each configuration , if we look at the difference between the number of nominal transitions into configuration and that out of configuration by time , we have the following equation,
(58)
By Definition 2 and (54)-(55), if we divide both sides of (58) by and let , we get the stationary equation in (57). ∎
Since the stationary equation in (57) is linear in and , we can write it in matrix form:
(59) |
where and are column vectors representing , , and are matrices that make (59) equivalent to (57). Therefore, the following three conditions are necessary for any tuple to be a possible pair of stationary distribution and transition frequencies for a Markovian policy.
(60)
Based on the characterization of stationary and ’s in (60), we can now formulate a linear program . The linear program has decision variables , , and for , where is a factor that scales the throughput of each type of jobs in the direction of .
(61)
subject to
where is the vector form of the cost rate function ; is a -dimensional vector with ones in all entries except those with . In addition to the last three constraints on stationarity, the first constraint of is the resource contention constraint; the second constraint holds because the transition frequency from to is equal to the throughput of type jobs.
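Abstractly, once the configuration space is enumerated, an LP of this shape can be assembled and solved with an off-the-shelf solver. The sketch below shows only the schematic structure, with zero matrices standing in for the stationarity constraints of (59)-(60) and placeholder cost and budget data; it is not the concrete LP of (61), and the constraint linking transition frequencies to throughput is omitted.

```python
import numpy as np
from scipy.optimize import linprog

# Schematic dimensions: n configurations, m transition-frequency variables,
# J job types.  All data below are placeholders, NOT the paper's actual LP.
n, m, J = 4, 6, 2
A = np.zeros((n, n))                  # placeholder stationarity matrix
B = np.zeros((n, m))                  # placeholder (from the stationary equation (59))
g = np.array([0.0, 0.2, 0.5, 1.0])    # contention cost rate per configuration
budget = 0.4

# Decision variables z = (x, y, w): stationary distribution x, transition
# frequencies y, and per-type throughput scaling w in [0, 1].
c = np.concatenate([np.zeros(n + m), -np.ones(J)])   # maximize the sum of w

A_eq = np.zeros((n + 1, n + m + J))
A_eq[:n, :n] = A                      # stationarity: A x + B y = 0
A_eq[:n, n:n + m] = B
A_eq[n, :n] = 1.0                     # x is a probability distribution
b_eq = np.zeros(n + 1)
b_eq[n] = 1.0

A_ub = np.zeros((1, n + m + J))
A_ub[0, :n] = g                       # resource contention budget: g^T x <= budget
b_ub = np.array([budget])

bounds = [(0, None)] * (n + m) + [(0, 1)] * J
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.status, res.fun)
```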
C.2. Policy Construction
In this subsection, we describe a procedure that allows us to construct a policy that achieves the lower bound given by the LP relaxation in (61). Specifically, given a feasible solution to (61), we define an LP-based policy that requests jobs as follows:
-
•
Case 1: When the system enters a configuration with , for each , the policy starts a timer whose duration follows an exponential distribution with rate . The policy requests a type job when the -th timer ticks. When the configuration changes, all timers are canceled.
-
•
Case 2: When the system enters a configuration with and , the policy immediately requests a type job with probability .
-
•
Case 3: When the system enters a configuration with and , the policy does not request any jobs.
We denote the LP-based policy based on the solution as .
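A rough Python rendering of the three cases above, stated in the impulsive view. The functions timer_rate and accept_prob are placeholders standing in for the quantities the policy derives from the LP solution; the paper's exact rate and probability formulas are not reproduced here.

```python
import random

random.seed(1)

# Placeholders standing in for quantities derived from the LP solution;
# the paper's exact formulas are NOT reproduced here.
def timer_rate(u, j):
    """Hypothetical Case-1 timer rate for requesting a type-j job in u."""
    return 1.0

def accept_prob(u, j):
    """Hypothetical Case-2 probability of immediately requesting type j."""
    return 0.5

def on_enter_configuration(u, x_u, job_types):
    """Schematic request behavior of the LP-based policy (impulsive view)
    when the system enters configuration u with stationary mass x_u."""
    if x_u > 0:
        # Case 1: start one exponential timer per job type; a type is
        # requested when its timer ticks.  Timers are canceled if the
        # configuration changes first (not modeled in this sketch).
        return {"timers": {j: timer_rate(u, j) for j in job_types}}
    # Cases 2 and 3: if x_u = 0, either request some types immediately
    # with certain probabilities, or request nothing at all.
    immediate = [j for j in job_types if random.random() < accept_prob(u, j)]
    return {"immediate": immediate}

print(on_enter_configuration(("p1",), 0.2, ["A", "B"]))
print(on_enter_configuration(("p1",), 0.0, ["A", "B"]))
```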
Remark 2.
Note that the definition of the LP-based policy here is stated from a view different from the view in Section 4.2: here each request only adds one job to the server, and one request can happen immediately after another; while in Section 4.2 each request can add multiple jobs to the server, and there is only one request happening at a time. We refer to the view here as the impulsive view, because multiple requests happening at the same time can be thought of as having an infinite request rate. In contrast, we call the view in Section 4.2 the non-impulsive view.
The LP-based policy can be alternatively described using the non-impulsive view if we see multiple requests happening at the same time as one request that adds multiple jobs to the server. More specifically, each reactive request of the LP-based policy is initiated by an internal transition or departure and consists of one or multiple requests of Case 2; each proactive request of the LP-based policy consists of one request in Case 1 and possibly several requests in Case 2.
The following lemma characterizes the steady-state behavior of a single-server system under the LP-based policy.
Lemma 0 (Properties of LP-based Policies).
Consider a single-server system under the LP-based policy , where is a feasible solution to (61). We have that is a stationary distribution under policy , and are the transition frequencies corresponding to .
The proof of Lemma 4 is based on (58), following the same argument as the proof of Lemma 3, as well as an induction argument.
Proof.
Let be a feasible solution to the LP in (61). To show that and are also the actual stationary distribution and transition frequencies, it suffices to show that if the initial distribution follows , i.e.
then under the policy , we have
(63)
(64)
where is the cumulative number of nominal transitions under the policy and the initial distribution .
Our proof is based on the following equation:
(65)
The equation is a straightforward consequence of (58), following the same argument as the proof of Lemma 3.
We prove the following two equations by induction on .
(66)
(67)
We first consider the base case when . In this case, and we have
(68) |
for all . This reduces (65) to
(69)
We now distinguish cases based on whether . If , then by the definition of our policy, for all ,
(70) |
which is (66). Combining the above equation and the stationary equation (57) satisfied by , we conclude that the RHS of (69) is zero, i.e.
which is (67). For the case when and , because the system immediately leaves the configuration after reaching it through a nominal transition,
i.e., the LHS of (69) is . Again we compare (69) against the stationary equation (57) and get
By the definition of our policy, we have
(71) |
which is (66). For the case when and , (57) implies that , , for any . Then (69) is further reduced to
Because and , the LHS of the above expression is non-negative. However, the RHS of the above expression is non-positive. Therefore, both sides are equal to zero, thus we have and .
Having proved the base case, we proceed to the induction step. Suppose we have proved (66) and (67) for all such that for some integer . We consider with . By the induction hypothesis,
(72) |
Then we repeat the arguments after (68) of the base case verbatim. By induction, we have proved (66) and (67).
Therefore, given the policy and initial distribution, the distribution of is stationary, i.e., we always have for all . As a result, an analogue of (66) holds for all : is differentiable with respect to and
for all and all . Therefore,
This completes the proof. ∎
C.3. Proof of Theorem 4
Theorem 4 (Optimality of Single-OPT).
Given an optimal solution to the linear program , we can solve the single-server problem in (3) to achieve an optimal value , using the optimal policy and the optimal stationary distribution . Moreover, the optimal policy is a Markovian policy.
Proof.
By Lemma 4, under the policy , is a stationary distribution, and are the corresponding transition frequencies. Recall the single-server problem
(3 revisited)
subject to
Observe that under the policy , the cost rate of resource contention is and the request rate of type jobs is , so
where we have used the fact that in the first equality. Therefore, is a feasible solution to the single-server problem , achieving the objective value of , which is the optimal value because . ∎
Appendix D Performance guarantee of Join-Requesting-Server with an estimated model
D.1. Assumptions and result
In this section, we consider the performance of JRS when it is based on an estimated model. We state the performance guarantee in terms of the estimation error and give a proof sketch by pointing out which parts of the proof of Theorem 3 need to be changed accordingly.
Consider the setting where the maximal number of jobs on a server , the set of job phases , the cost rate function , and the budget are all known. However, we only have estimates of the jobs’ arrival rates, internal transition rates, and departure rates.
Specifically, for any , let be the true rate of internal transition from phase to phase ; let be the true departure rate of phase ; let be the true arrival rate of type jobs. We let ’s, ’s, and ’s be the estimated internal transition rates, departure rates, and arrival rates, respectively. We assume that there exists a small positive constant that is independent of the scaling factor such that the following assumptions hold.
Assumption 1 (-accurate estimation).
For any ,
(73)
(74)
(75)
Assumption 2 (Scaling of the single-server objective value).
Consider the single-server problem in (3). Let be a feasible solution to the single-server problem with estimated parameters that are -accurate, where for some constant . We assume that there exist constants independent of and such that
Assumption 3 (-insensitivity of the optimal value).
Consider the single-server problem in (3). Let the optimal value of the single-server problem with the estimated parameters be , where the estimated parameters are -accurate; let the optimal value of the single-server problem with true parameters be . We assume that
We also assume that JRS can accurately simulate the virtual jobs.
Assumption 4 (Accurate simulation).
The virtual jobs simulated in JRS follow the true transition dynamics.
Given the above assumptions, we have the following proposition that states the optimality gap of JRS policy with estimated model parameters, which has a similar form as Theorem 3 for JRS under true model parameters.
Proposition 0 (Optimality gap with model estimation).
Consider a stochastic bin-packing problem in service systems with time-varying job resource requirements. Let the infinite-server policy be JRS with subroutine . Suppose is specified based on an estimated model satisfying Assumptions 1, 2, 3, and 4, for such that , where is some positive constant independent of . Let be the objective value achieved by in the single-server problem with estimated parameters. Under , for any initial state, we have
(76)
(77)
where denotes the steady-state configurations of the single-server system under with the estimated parameters. If we let be the optimal policy of the single-server problem with estimated parameters, for any initial state, we have
(78)
(79)
where is a positive constant independent of . In other words, is -optimal.
Remark 3.
Note that even if the single-server system with estimated parameters is -irreducible under , it is hard to guarantee that the original system is -irreducible due to the estimation errors. Consequently, the steady-state performance metrics and could depend on the system’s initial state. Fortunately, our proof of Theorem 3 does not rely on the uniqueness of the stationary distribution, so we can adapt the proof to show the inequalities in Proposition 1 for any initial state.
Remark 4.
Note that Assumption 4 ensures that the real jobs and virtual jobs on a server are indistinguishable, so that is still a Markov chain. Recall that and and denote the observed configuration and tokens for each normal server , respectively (see Section 5.1). We suspect that Assumption 4 can be removed since our proof does not rely heavily on being a Markov chain. However, removing the assumption requires a more careful and notationally heavy analysis. We argue that in practice, this assumption is not restrictive, as one can record the traces of the jobs that arrived in the past and resample virtual jobs from those traces.
D.2. Lemmas
In this section, we give a proof sketch for Proposition 1 when the single-server system under with estimated parameters is -irreducible. The argument for extending to the general case is essentially the same as in Section B.3, so we omit it here.
At a high level, the proof of Proposition 1 is similar to that of Theorem 1. In particular, the key steps of the proof are to show that and as , where , and are i.i.d. copies of steady-state configurations of the single-server system under with the estimated parameters, and is the steady-state configuration of the infinite-server system under .
Proposition 1 is based on the three lemmas stated below. The proof of Proposition 1 using the three lemmas is essentially the same as the argument in Section 5.3.
Lemma 0.
Under the conditions of Proposition 1 and the single-server system with estimated parameters under being -irreducible, for any initial state, we have
Lemma 0.
Under the conditions of Proposition 1 and the single-server system with estimated parameters under being -irreducible, for any initial state and , the steady-state expected number of virtual jobs of type satisfies
Lemma 0.
Under the conditions of Proposition 1 and the single-server system with estimated parameters under being -irreducible, for any initial state and , the steady-state expected number of type jobs on backup servers satisfies
D.3. Proof sketch for Lemma 2
Recall from Section 5.4 that the proof of Lemma 2 is based on Stein’s method, which compares the generator of the i.i.d. copies of the single-server system with the generator of the infinite-server system. To write down the generators with the estimated model, recall that in the single-server system, each transition can be represented by the diagram
where the arrow denotes an internal transition or a departure if ; the arrow denotes a job request that is made right after reaching . For any , let be the set of such that . We define two sets of transition rates as below: for any and ,
-
•
Under the estimated parameters and the policy , we let be the rate of the transition , and let be the total transition rate at configuration .
-
•
Under the true parameters and the policy , we let be the rate of the transition , and let be the total transition rate at configuration .
Let be the generator of i.i.d. copies of single-server systems under the estimated parameters. For any , we have
(80) |
where is a shorthand for . Let be the generator of for the infinite-server system. For any function and , we have
(81) |
To prove Lemma 2, we need to show that for any , , and ,
(82) |
where is the solution to
(83) |
and . As in the proof of Lemma 2, we prove (82) in two steps: the generator comparison step and the Stein factor bound step.
In the generator comparison step, we observe that the formulas of and in (80) and (81) look almost the same as (15) and (17), except that the rates in (81) are instead of . As a result, after carrying out calculations similar to those in the proof of Lemma 2, we get an extra error term involving , which can be bounded using the lemma below. This error term results in the in (82).
Lemma 0.
Under Assumption 1, for any and , we have
(84) |
Proof.
When , and are both the rate of adding jobs via a proactive request under the policy , so they are identical.
When and , is equal to the rate of going from to via an internal transition or departure under the estimated job model, while is under the true job model. Because there are at most jobs and, by our assumption, the estimation error of each job’s transition rates is bounded by , and differ by at most .
When and , is equal to the rate of going from to via an internal transition or departure, multiplied by the probability of adding jobs via a reactive request, under the estimated job model and the policy ; is the same quantity under the true job model instead of the estimated job model, but with the same policy. Because the rate of going from to differs by at most under the two job models, and the probability of adding jobs after going from to is the same under the same policy, and differ by at most . ∎
In the Stein factor bound step, we need to show that
(85) |
This involves analyzing the i.i.d. copies of the single-server system and uses the fact that the single-server system with estimated parameters is -irreducible under . This part of the proof is identical to the corresponding part of the proof of Lemma 2.
D.4. Proof sketch for Lemma 3 and Lemma 4
The proof of Lemma 3 and Lemma 4 has a structure similar to that of the proof of Lemma 3 and Lemma 4. In the first step, we use Little’s law to bound the expectations of the number of type virtual jobs, , and the number of type jobs on backup servers, . We have two equations almost the same as (36) and (37), except that the rates and are replaced by and :
(86)
(87)
where is the maximal expected service time of any type of virtual job or real job. Because , it suffices to bound and .
In the second step, we utilize the fact that the two Lyapunov functions and have zero drift in steady state, where is the total number of type tokens. We get two equalities similar to (38) and (41):
(88) |
(89)
(90)
(91)
In the third step, we use the above two equalities to bound and . We first use the equalities in (89) to (91) to bound . Following the same argument as in the proof of Lemma 3 and Lemma 4 up to (47), we can show that
(92) |
where to get (92), we apply Lemma 2 to replace with , which causes an error. Next, we show that
(93) |
(94)
(95)
These two bounds allow us to replace and on the LHS of (93) with and at the cost of introducing error. Moreover, because are i.i.d., concentrates around its mean with error, where the mean can be shown to be
Note that the first equality in (96) holds because for each , is equal to , i.e., the long-run average rate of requesting type jobs on a single-server system with estimated parameters under ; the second equality in (96) is because , and . As a result,
(96) |
Combining (94) to (96), we get (93). Therefore,
which completes the proof of Lemma 3.