
Rightsizing Clusters for Time-Limited Tasks

Venkatesan T. Chakaravarthy, Padmanabha V. Seshadri, Pooja Aggarwal, Anamitra R. Choudhury
Ashok Pon Kumar, Yogish Sabharwal, Amith Singhee
IBM Research - India
{vechakra, seshapad, aggarwal.pooja, anamchou, ashokponkumar, ysabharwal, asinghee}@in.ibm.com
Abstract

Cloud computing has emerged as the dominant platform for application deployment on clusters of computing nodes. In conventional public clouds, designing a suitable initial cluster for a given application workload is important in converging to an appropriate computing footprint at run-time. In the case of edge or on-premise clouds, cold-start rightsizing the cluster at the time of installation is crucial in avoiding recurrent capital expenditure. In both cases, balancing the cost-performance trade-off is critical in constructing a cluster of compute nodes for hosting an application with multiple tasks, where each task can demand multiple resources, and the cloud offers nodes with different capacity and cost.

Multidimensional bin-packing algorithms can address this cold-start rightsizing problem, but they assume that every task is always active. In contrast, real-world tasks (e.g., load bursts, batch and deadlined tasks) may be active only during specific time-periods or may have very dynamic load profiles. The cluster cost can be reduced by reusing resources via time sharing and optimal packing. This motivates our generalized problem of cold-start rightsizing for time-limited tasks: given a timeline, time-periods and resource demands for tasks, the objective is to place the tasks on a minimum-cost cluster of nodes without violating node capacities at any time instance. We design a baseline two-phase algorithm that performs penalty-based mapping of tasks to node-types and then solves each node-type independently. We prove that the algorithm has an approximation ratio of O(D \cdot \min(m,T)), where D, m and T are the number of resources, node-types and timeslots, respectively. We then present an improved linear programming based mapping strategy, enhanced further with a cross-node-type filling mechanism. Our experiments on synthetic and real-world cluster traces show significant cost reduction by LP-based mapping compared to the baseline, and the filling mechanism improves further to produce solutions within 20% of (a lower-bound to) the optimal solution.

I Introduction

Cloud computing has emerged as the preferred platform due to its flexibility in planning and provisioning computing clusters with resources for application deployment and scaling. An undersized cluster degrades performance, whereas an oversized cluster causes wastage and drives up cost. Rightsizing addresses this trade-off by constructing an optimal cluster that balances cost and capacity. Rightsizing has been well-studied over a wide spectrum from pre-deployment (cold-start) cluster provisioning to dynamic node provisioning (e.g., [17, 35, 37, 34, 24]).

Cold-start rightsizing is an effective technique to bootstrap an initial cluster based on the workload characteristics, for large public clouds. The capacity of these initial clusters is then managed using dynamic provisioning techniques such as auto-scalers [21]. Thus, cold-start rightsizing and dynamic provisioning techniques are complementary in nature.

Cold-start rightsizing also plays an important role in limited-resource settings where vast resource pools might not exist to support dynamic auto-scalers. In such scenarios, cold-start rightsizing could be the most suitable approach to shape the cluster to a given workload without incurring capital expenses or logistical challenges. Examples of these scenarios include Telco clouds [1], which co-host computing clusters on base-stations, and on-premise clouds dedicated to a single organization. In such infrastructure settings, the installation cost on 5G base-stations could be 3\times higher [8] than the operational expenditure. Similar challenges exist in on-premise clouds and edge-clouds [20, 6, 4], which may not have huge resource pools or a sufficient number of intra-organization tenants to enable arbitrary on-demand scaling.

This paper deals with cold-start rightsizing that sizes a cluster at the time of installation based on projected demands of the application(s) to be hosted on it. The aim is to address the cost-performance trade-off by constructing an optimal cluster that balances cost and capacity, for a given workload of tasks.

Cold-start Rightsizing. Consider a workload (set of tasks) consisting of n tasks, where each task demands specific quantities of D resources (e.g., CPU cores, memory and disk space). The cloud environment offers m types of compute nodes that differ in cost and in capacity along the D resources; a replica of a given node-type is a node. Both demands and capacities can be conveniently represented as D-dimensional vectors. The goal is to purchase a minimum-cost cluster of nodes and place each task on a node such that the aggregate demand of the tasks placed on any node does not exceed its capacity along any dimension (resource). We denote this problem as Rightsizing. As discussed below (Prior Work), rightsizing has been well-studied in the contexts of cloud computing, virtual machine migration and bin packing.

Cold-start Rightsizing for Time-Limited Tasks. While the above formulation targets rightsizing purely based on resource-demands, real-world tasks are additionally characterized by timeline properties on when they run, how long they run and how much resources they consume at any given time.

These timeline characteristics are available as direct inputs for batch applications [29], for instance, batch jobs running at 1:00AM for a few hours or a burst in service load during peak hours. Similarly, scheduled tasks with deadlines are encountered in edge computing [10, 30]. Examples are applications that sample duty-cycled data-sources [36], or that adapt to platform activity (e.g., advanced sleep modes of 5G New Radio, which exploit stable user activity patterns [2]). Since the tasks are active only within a specific time-interval, resources could be reused by other tasks outside the interval. This time-limited nature of tasks could be exploited in rightsizing the cluster, leading to cost reduction. With the above motivation, we introduce and study the generalized cold-start rightsizing problem for time-limited tasks.

The problem, denoted TL-Rightsizing, starts with a set of tasks (with varying demands) and different node-types (with varying capacity and cost), as in Rightsizing. In addition, we assume a timeline divided into T discrete timeslots, say a day divided into T=24 hours, and each task has start and end timeslots as an interval [s,e]. The objective of TL-Rightsizing is to purchase a minimum-cost set of nodes and place the tasks on the nodes such that the aggregate demand of the tasks placed on any node does not violate its capacity at any timeslot. Figure 1 illustrates TL-Rightsizing and its benefits over Rightsizing.

Figure 1: Illustration of TL-Rightsizing. A simple example is shown with two resources (D=2), three tasks (n=3), and two node-types (m=2). The two-dimensional demands and capacities, and the node costs are shown. Part (a) shows a solution that exploits the time-limited nature and packs all the tasks in a single node of type 1 with cost $10. In contrast, Part (b) ignores the time-limited aspects and shows the best solution, which packs one node of each type with total cost $16.

For scenarios where demand, start and completion times are not directly known, we could build on cloud-centric time-series analysis [18, 13] to leverage historical traces of task resource consumption from existing production or test deployments. The resource-demand of a task for various time windows (e.g., hourly, daily, weekly) can be estimated from the traces, and each window can be treated as an independent task with a specific start and ending time. For example, a task handling stock market quotes may require a high resource-demand regime during market hours and a low resource-demand regime at other times. For a timeline that spans a week starting Monday 00:00 hours, we can model this workload as six tasks, as shown in Figure 2.

Figure 2: An original long-running task which serves stock market quotes may be modeled as six tasks, spanning a typical week: T1 models the low resource consumption regime, and T2-T6 model the additional demand during the 8-hour high resource consumption regimes during market open hours.
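The decomposition of Figure 2 can be sketched programmatically. The following Python snippet is illustrative only: the demand figures and the assumed 09:00-17:00 weekday market window are hypothetical, not taken from the paper.

```python
def weekly_stock_tasks(base_dem, peak_extra, T=168):
    """Model the stock-quote workload of Figure 2 as six time-limited tasks.

    Returns (name, demand_vector, start_slot, end_slot) tuples over a
    week of T hourly slots, with slot 1 = Monday 00:00-01:00.
    """
    tasks = [("T1", base_dem, 1, T)]  # low-demand regime, active all week
    for day in range(5):  # Monday .. Friday
        start = day * 24 + 10  # slot covering 09:00-10:00 of that day
        tasks.append((f"T{day + 2}", peak_extra, start, start + 7))
    return tasks

tasks = weekly_stock_tasks(base_dem=(2, 4), peak_extra=(6, 8))
assert len(tasks) == 6
assert tasks[0][2:] == (1, 168)  # T1 spans the whole week
assert tasks[1][2:] == (10, 17)  # T2 covers Monday market hours
```

Each of T2-T6 carries only the extra demand above the always-on base task T1, so the two regimes add up correctly whenever they overlap.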

Prior Work. While TL-Rightsizing has not been studied in prior work, two streams of special cases, Rightsizing and interval coloring, have been well explored.

Rightsizing: The Rightsizing problem corresponds to the restricted setting where all the tasks are perpetually active; in other words, there is no consideration of the timeline (T=1).

If tasks are items and nodes are bins, then the problem corresponds to multi-dimensional bin-packing over multiple bin-types. Since the problem is NP-hard, a polynomial-time optimal solution is infeasible unless NP = P. While exponential-time integer programming [15] can produce optimal solutions, meta-heuristics such as genetic algorithms [14], particle swarm [27] and ant colony optimization [11] can handle only small problem sizes due to their high running time.

Alternatively, fast and scalable heuristics have been developed by building on strategies from the bin-packing domain. Working in the context of optimizing virtual machines, Panigrahy et al. [25] study the single node-type special case (m=1), called vector bin packing. They generalize bin-packing heuristics (such as first-fit and best-fit) and demonstrate that the heuristics produce near-optimal solutions in practice. Gabay and Zaourar [12] present similar procedures for multiple node-types with uniform costs. Other approaches for special cases and variations of Rightsizing have been proposed (e.g., [16, 22]) and we refer to [28] for a detailed discussion.

We build on the work of Silva et al. [28] and Patt-Shamir and Rawitz [26], who present algorithms for the general Rightsizing problem. The latter work proposes a two-phase procedure that first maps each task to an appropriate node-type by comparing demands and capacities, and then solves for each node-type independently. They prove that the algorithm has an approximation ratio of O(D).

From the theoretical perspective, Chekuri and Khanna [7] present an O(\log D)-approximation algorithm for (single node-type) vector bin packing, and Patt-Shamir and Rawitz [26] generalize the result to multiple node-types. However, the algorithms are based on an exponential-size LP formulation and have a high running time of n^{O(D)}.

Interval Coloring (or task scheduling): Interval coloring with bandwidths [3] is equivalent to our setting with a single resource dimension (D=1) and a single node-type (m=1). In this problem, the input is a set of intervals with bandwidth requirements, and the objective is to color them with the minimum number of colors such that the aggregate bandwidth of intervals of the same color does not exceed a fixed capacity at any timeslot. In our context, the intervals are tasks and the colors are nodes. The problem is studied in both online and offline settings (see the survey [19]) and procedures with O(1) approximation ratios have been designed.

Rightsizing for time-limited tasks: The TL-Rightsizing problem generalizes and combines elements from the above two streams: the notion of time-limited tasks from task scheduling, and the concepts of multi-dimensionality and multiple node-types from rightsizing.

Our Contributions. We introduce the TL-Rightsizing problem that reduces the cluster cost by exploiting the time-limited nature of tasks, and make the following contributions:

  • We first design a two-phase algorithm, denoted PenaltyMap, that serves as our baseline. The algorithm uses ideas from prior work on Rightsizing and task scheduling. It first maps each task to an appropriate node-type via a penalty-based heuristic and then solves each node-type separately. We present a detailed analysis and prove that the algorithm has an approximation ratio of O(D \cdot \min(m,T)).

  • While the PenaltyMap algorithm performs reasonably well in practice, we identify certain deficiencies in its mapping strategy and develop an improved mapping based on linear programming, denoted LP-map.

  • The LP-map algorithm packs tasks mapped to each node-type in a maximal manner, but we observe that it may leave empty capacity that can be filled by tasks mapped to other node-types. We present a cross-node-type filling procedure that reduces the wastage and lowers the overall solution cost.

  • We design a scalable strategy (that can handle large inputs) for determining a lower-bound on the optimal cost and use it to measure the efficacy of our algorithms. Using synthetic and real-world traces, we demonstrate that: (i) compared to the baseline PenaltyMap, LP-map offers up to 150% reduction in the cost (normalized with respect to the lower-bound); (ii) cross-node-type filling provides a further improvement of up to 8%, taking the solution closer to the optimal; (iii) the final algorithm (LP-map + cross-node-type filling) generates solutions within at most 20% of the lower-bound, and closer in most cases.

II Problem Definition and Modeling

Problem Definition [TL-Rightsizing]. Let the timeline be divided into T discrete timeslots numbered 1, 2, \ldots, T. Let D be the number of resources (or dimensions). We have a set {\cal B} of m node-types. Each node-type B \in {\cal B} is associated with a price {\tt cost}(B) and offers a capacity {\tt cap}(B,d) along each dimension d \in [1,D]. The input consists of a set {\cal U} of n tasks; each task u \in {\cal U} is specified by a demand {\tt dem}(u,d) along each dimension d and a span [s(u),e(u)] \subseteq [1,T], with s(u) and e(u) being the starting and the ending timeslots of u. We say that task u is active at a timeslot t if t \in [s(u),e(u)], and denote it as 'u \sim t'.

A B-type node refers to a replica b having the same price and capacity: {\tt cost}(b) = {\tt cost}(B) and {\tt cap}(b,d) = {\tt cap}(B,d) for all d \in [1,D]. A feasible solution purchases a set of nodes S and places each task on one of the nodes in S such that, for any node b, the total demand at any timeslot along any dimension does not violate the capacity offered by the node:

\forall (t \in [1,T], d \in [1,D]): \sum_{u \sim t \,\&\, u \in b} {\tt dem}(u,d) \leq {\tt cap}(b,d),

wherein b is viewed as the set of tasks placed in it. We refer to the above as the capacity constraint. A solution may purchase multiple nodes of the same node-type. The cost of the solution is the aggregate cost of the nodes purchased: {\tt cost}(S) = \sum_{b \in S} {\tt cost}(b). The goal is to compute a solution of minimum cost.

Timeline Trimming. While T can be arbitrarily large, as is common in task scheduling (see e.g., [5]), it can be trimmed by considering only the start timeslots of tasks and ignoring the rest, so that T \leq n. It is not difficult to argue that the transformation does not change the set of feasible solutions.
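One way to realize this trimming is sketched below in Python; the paper does not prescribe an implementation, so the representation of tasks as (start, end) pairs is our own.

```python
import bisect

def trim_timeline(tasks):
    """Compress the timeline to the distinct task start slots.

    tasks: list of (start, end) integer spans on the original timeline.
    Returns the respanned (start, end) pairs on a timeline of length <= n.
    Between two consecutive start slots the active set can only shrink
    (tasks only end), so checking capacities at start slots suffices.
    """
    starts = sorted({s for s, _ in tasks})
    index = {s: i + 1 for i, s in enumerate(starts)}  # new slots 1..T'
    trimmed = []
    for s, e in tasks:
        # new end = index of the last start slot not exceeding e
        j = bisect.bisect_right(starts, e) - 1
        trimmed.append((index[s], j + 1))
    return trimmed

# Three tasks on a 100-slot timeline collapse onto a 3-slot timeline.
assert trim_timeline([(5, 40), (20, 90), (60, 100)]) == [(1, 2), (2, 3), (3, 3)]
```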

III Penalty-Based Mapping Algorithm

We design a baseline algorithm, denoted PenaltyMap, by building on heuristics from prior work on the following two special cases: interval coloring [17], wherein the number of node-types m=1 and the number of dimensions D=1, and {\sf Rightsizing} [28, 26], wherein there is no timeline (T=1).

Two Phase Framework. The intuition behind the algorithm is that each task should be placed on a node whose capacity matches the demand of the task. For instance, placing a task u having high demand along a particular dimension d on a node with low capacity along d would block us from accommodating other tasks on the node, leading to capacity wastage along the other dimensions. The algorithm addresses the issue by using a two-phase framework. First, each task u \in {\cal U} is mapped to a suitable node-type B by comparing the demand of u with the capacities of the node-types. The second phase partitions the tasks based on their node-types and solves each node-type separately, as in the single node-type setting, via a popular heuristic used in interval coloring.

Mapping Phase. For a task u, define the relative demand (or height) with respect to a node-type B as:

h_{\rm avg}(u|B) = \frac{1}{D} \sum_{d \in [1,D]} \frac{{\tt dem}(u,d)}{{\tt cap}(B,d)}

Intuitively, h_{\rm avg}(u|B) is a measure of the space occupied by u if the task were placed on a node of type B. Taking cost into consideration, define the penalty of u relative to B as:

p_{\rm avg}(u|B) = {\tt cost}(B) \cdot h_{\rm avg}(u|B)

We map u to the node-type {\tt B}^{*}_{\rm avg}(u) yielding the least penalty:

p^{*}_{\rm avg}(u) = \min_{B \in {\cal B}} p_{\rm avg}(u|B) \quad \mbox{and} \quad {\tt B}^{*}_{\rm avg}(u) = \operatorname*{argmin}_{B \in {\cal B}} p_{\rm avg}(u|B)
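As a concrete illustration, the mapping phase can be sketched in Python. The node-type names, costs and capacities below are hypothetical and chosen only to exercise the formula.

```python
def penalty_map(tasks, node_types):
    """Map each task to the node-type with the least penalty p_avg(u|B).

    tasks: {name: demand_vector}; node_types: {name: (cost, cap_vector)}.
    h_avg(u|B) = (1/D) * sum_d dem(u,d)/cap(B,d); penalty = cost(B) * h_avg.
    """
    mapping = {}
    for u, dem in tasks.items():
        D = len(dem)

        def penalty(B):
            cost, cap = node_types[B]
            return cost * sum(dem[d] / cap[d] for d in range(D)) / D

        mapping[u] = min(node_types, key=penalty)
    return mapping

types = {"small": (10, (8, 16)), "big": (25, (16, 64))}
# t1 is cheaper per unit of space on the big node; t2 on the small node.
assert penalty_map({"t1": (4, 8), "t2": (6, 6)}, types) == {"t1": "big", "t2": "small"}
```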

Placement Phase. We partition the tasks based on their node-types and process each group separately. Consider a node-type B and let V_{B} denote the set of tasks mapped to B. We process the tasks in V_{B} in the increasing order of starting timeslots. Suppose we have processed a certain number of tasks and have purchased a set of nodes to accommodate them. Let u be the next task. We check if u can fit in some node b purchased already without violating the capacity of the node. If so, among the feasible nodes, we place u in the node purchased the earliest (first-fit). Otherwise, we purchase a new B-type node and place u in it. Figure 3 shows the pseudocode.
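A minimal sketch of this first-fit placement for a single node-type follows; the tuple-based task representation and capacity bookkeeping are our own choices, not from the paper.

```python
def place_first_fit(tasks, cap, T):
    """Greedily place tasks (sorted by start slot) onto nodes of one type.

    tasks: list of (dem_vector, start, end); cap: capacity vector; slots 1..T.
    Returns (assignment, node_count), where assignment[i] is the node index
    of the i-th task in start-sorted order.
    """
    D = len(cap)
    nodes = []  # per node: rem[t][d], remaining capacity table (row 0 unused)
    assignment = []
    for dem, s, e in sorted(tasks, key=lambda x: x[1]):
        placed = None
        for i, rem in enumerate(nodes):  # first-fit: earliest purchased node
            if all(rem[t][d] >= dem[d] for t in range(s, e + 1) for d in range(D)):
                placed = i
                break
        if placed is None:  # no feasible node: purchase a new one
            nodes.append([list(cap) for _ in range(T + 1)])
            placed = len(nodes) - 1
        for t in range(s, e + 1):
            for d in range(D):
                nodes[placed][t][d] -= dem[d]
        assignment.append(placed)
    return assignment, len(nodes)

# Two overlapping tasks force a second node; the third reuses node 0.
assert place_first_fit([((6, 6), 1, 2), ((6, 6), 3, 4), ((6, 6), 2, 3)], (10, 10), 4) == ([0, 1, 0], 2)
```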

Alternative Mapping and Fitting Policies. An alternative mapping policy (see [26]) is to define the relative demand as the maximum over the dimensions (as against the average), and define the penalty and best node-type analogously:

h_{\rm max}(u|B) = \max_{d \in [1,D]} \frac{{\tt dem}(u,d)}{{\tt cap}(B,d)}.

An alternative to the first-fit policy is the more refined similarity-fit, which emulates the best-fit strategy (adapted from a dot-product strategy [25, 12]). It tries to minimize capacity wastage by selecting the node b (among the feasible nodes) whose remaining capacity is most similar (dot-product shown below) to the demand of u:

\sum_{t \in [s(u),e(u)]} \sum_{d} \frac{{\tt dem}(u,d)}{{\tt cap}(B,d)} \times \frac{{\tt rem}(b,d|t)}{{\tt cap}(B,d)},

where {\tt rem}(b,d|t) is the remaining capacity of b along dimension d at timeslot t. We can derive a cosine similarity by dividing the dot-product by the norms of the capacity-normalized demand and remaining-capacity vectors involved in the above inner product. The strategy is then to place u in the node offering the maximum cosine-similarity value.
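The cosine-similarity score can be sketched as follows. This is our reading of the formula above (normalized demand repeated over the task's span against the node's normalized remaining capacities), not the paper's reference implementation.

```python
from math import sqrt

def cosine_fit_score(dem, cap, rem, span):
    """Cosine similarity between a task's capacity-normalized demand and a
    node's capacity-normalized remaining capacity over the task's span.

    dem, cap: D-vectors; rem: table rem[t][d]; span: (s, e), inclusive slots.
    """
    s, e = span
    D = len(cap)
    nd = [dem[d] / cap[d] for d in range(D)]  # normalized demand vector
    dot = norm_r = 0.0
    for t in range(s, e + 1):
        for d in range(D):
            r = rem[t][d] / cap[d]
            dot += nd[d] * r
            norm_r += r * r
    # demand vector is repeated across the (e - s + 1) slots of the span
    norm_d = sqrt(sum(x * x for x in nd) * (e - s + 1))
    return dot / (norm_d * sqrt(norm_r)) if norm_r else 0.0

# Remaining capacity exactly parallel to the demand gives similarity 1.
score = cosine_fit_score((2, 4), (8, 16), [None, [4, 8]], (1, 1))
assert abs(score - 1.0) < 1e-9
```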

Time Complexity. The time taken for computing the mapping is O(n \cdot m). The time required to test whether a task fits in a node is O(D \cdot T). Since the number of nodes examined for a task is at most |S|, the total number of nodes purchased, the overall running time of the algorithm is O(n \cdot m + n \cdot |S| \cdot D \cdot T). Typically the last term of this bound dominates in practice.

Mapping Phase:
  For each task u and each node-type B:
    h_{\rm avg}(u|B) = \frac{1}{D} \sum_{d} \frac{{\tt dem}(u,d)}{{\tt cap}(B,d)}
    p_{\rm avg}(u|B) = {\tt cost}(B) \cdot h_{\rm avg}(u|B)
  For each task u:
    p^{*}_{\rm avg}(u) = \min_{B} p_{\rm avg}(u|B); map u to {\tt B}^{*}_{\rm avg}(u) = \operatorname*{argmin}_{B} p_{\rm avg}(u|B)
  For each node-type B: let V_{B}^{\rm pen} be the set of tasks mapped to B.

Placement Phase:
  For each node-type B:
    Initialize solution S_{B} = \emptyset.
    Sort V_{B}^{\rm pen} in the increasing order of starting timeslots.
    For each task u in the above order:
      If u can fit in some node in S_{B}: among the feasible nodes, place u in the node purchased the earliest.
      Else: purchase a new B-type node b and place u in it.

Output: S_{\rm pen} = \bigcup_{B} S_{B}.

Figure 3: Algorithm PenaltyMap

IV PenaltyMap: Analysis

For the special case of interval coloring (m=1, D=1), prior work [9] derives an O(1) approximation ratio. Similarly, for the special case of {\sf Rightsizing} (T=1), an O(D) approximation ratio is implicit in existing literature [26]. We next present an analysis for the general TL-Rightsizing setting and prove that the PenaltyMap algorithm has an approximation ratio of O(D \cdot \min(m,T)). The result matches and generalizes both the prior special cases.

For the sake of concreteness, we focus on the h_{\rm avg}-based mapping and the first-fit strategy, but the analysis can easily be adapted to the combinations involving h_{\rm max} and similarity-fit. We say that a task u is small if, for all node-types B and all dimensions d, {\tt dem}(u,d) \leq {\tt cap}(B,d)/2. We first work under the practically reasonable assumption that all the tasks are small and then discuss the general case.

Let S_{\rm opt} and S_{\rm pen} denote the optimal and PenaltyMap solutions, respectively. We derive a lower-bound on {\tt cost}(S_{\rm opt}) and an upper-bound on {\tt cost}(S_{\rm pen}) based on the notion of congestion, defined next. Given a subset of tasks U \subseteq {\cal U}, define the congestion induced by U as the maximum over the timeline of the aggregate minimum penalties of the tasks active at each timeslot:

{\tt cong}(U) = \max_{t} \sum_{u \in U, u \sim t} p^{*}_{\rm avg}(u).
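For intuition, the congestion of a task set can be computed with a standard sweep over the timeline. The sketch below uses hypothetical penalty values; each task is given directly as (p_star, start, end) with p_star its least penalty.

```python
def congestion(tasks, T):
    """cong(U): max over timeslots of the summed least penalties of
    the tasks active at that slot.

    tasks: list of (p_star, start, end), slots 1..T inclusive.
    """
    delta = [0.0] * (T + 2)
    for p, s, e in tasks:  # sweep: add p at s, remove it after e
        delta[s] += p
        delta[e + 1] -= p
    best = cur = 0.0
    for t in range(1, T + 1):
        cur += delta[t]
        best = max(best, cur)
    return best

# Two overlapping tasks and one disjoint task on a 5-slot timeline:
# slot 2 carries penalties 3.0 + 4.0, which is the maximum.
assert congestion([(3.0, 1, 2), (4.0, 2, 4), (2.0, 5, 5)], T=5) == 7.0
```

The sweep runs in O(n + T) time, so the lower-bound of Lemma 1 below is cheap to evaluate even for large inputs.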

The following lemma shows that {\tt cost}({\tt opt}) is at least the congestion induced by the entire task set {\cal U}.

Lemma 1.

{\tt cost}({\tt opt}) \geq {\tt cong}({\cal U}).

Proof.

Consider any node-type B. Let N_{B} denote the number of B-type nodes purchased by {\tt opt}, and let V_{B}^{\rm opt} denote the set of tasks placed in B-type nodes. For any timeslot t and dimension d, the aggregate demand is:

\sum_{u \in V_{B}^{\rm opt}, u \sim t} {\tt dem}(u,d)

To accommodate this demand, the number of B-type nodes N_{B} must be at least:

\left\lceil \frac{\sum_{u \in V_{B}^{\rm opt}, u \sim t} {\tt dem}(u,d)}{{\tt cap}(B,d)} \right\rceil.

Taking the maximum over all (t,d) pairs, we get that N_{B} must be at least:

\max_{t,d} \sum_{u \in V_{B}^{\rm opt}, u \sim t} \frac{{\tt dem}(u,d)}{{\tt cap}(B,d)} \geq \max_{t} \sum_{u \in V_{B}^{\rm opt}, u \sim t} h_{\rm avg}(u|B),

where we replace the maximum over d by the average. Hence,

{\tt cost}(S_{\rm opt}) = \sum_{B} {\tt cost}(B) \cdot N_{B}
  \geq \sum_{B} {\tt cost}(B) \max_{t} \sum_{u \in V_{B}^{\rm opt}, u \sim t} h_{\rm avg}(u|B)
  = \sum_{B} \max_{t} \sum_{u \in V_{B}^{\rm opt}, u \sim t} p_{\rm avg}(u|B)
  \geq \sum_{B} \max_{t} \sum_{u \in V_{B}^{\rm opt}, u \sim t} p^{*}_{\rm avg}(u)
  \geq \max_{t} \sum_{B} \sum_{u \in V_{B}^{\rm opt}, u \sim t} p^{*}_{\rm avg}(u)
  = \max_{t} \sum_{u \sim t} p^{*}_{\rm avg}(u),

where the last equality follows from the fact that every task active at t is placed in a node of some type. ∎

We next analyze the solution S_{\rm pen}. Let {\tt cost}({\cal B}) denote the sum of costs of the node-types: {\tt cost}({\cal B}) = \sum_{B} {\tt cost}(B). For a node-type B, let V_{B}^{\rm pen} denote the set of tasks mapped to B by the solution. In contrast to the lower-bound on {\tt cost}({\tt opt}), which is in terms of the congestion over the entire task set, we prove a (comparatively weaker) upper-bound on {\tt cost}(S_{\rm pen}) in terms of the sum of the congestions induced within each node-type.

Lemma 2.

Assuming all the tasks are small,

{\tt cost}(S_{\rm pen}) \leq {\tt cost}({\cal B}) + (2D) \cdot \sum_{B} {\tt cong}(V_{B}^{\rm pen}).
Proof.

Consider any node-type B and let N_{B} denote the number of B-type nodes purchased by S_{\rm pen}. We claim that:

N_{B} \leq 1 + (2D) \cdot \max_{t} \sum_{u \in V_{B}^{\rm pen}, u \sim t} h_{\rm avg}(u|B). (1)

The proof is by induction. Consider the processing of a task v and let S denote the set of nodes purchased so far. Assume by induction that |S| satisfies the bound before v is processed; we shall show that it remains true afterwards. If v fits in one of the nodes in S, then no new node is purchased and the bound remains true.

Suppose v could not fit into any of the nodes in S and a new node is purchased. This means that, for any node b \in S, placing v in b would violate the capacity along some dimension \widehat{d} at some timeslot \widehat{t} within the span of v:

{\tt dem}(v,\widehat{d}) + \sum_{u \in b, u \sim \widehat{t}} {\tt dem}(u,\widehat{d}) \geq {\tt cap}(b,\widehat{d}).

Since the tasks are processed in the increasing order of starting timeslots, all the above tasks u start no later than v and, since they overlap with v, they must all be active at the starting timeslot s(v). Consequently, we can assume without loss of generality that \widehat{t} = s(v). Thus,

{\tt dem}(v,\widehat{d}) + \sum_{u \in b, u \sim s(v)} {\tt dem}(u,\widehat{d}) \geq {\tt cap}(b,\widehat{d}).

Since the tasks are small, {\tt dem}(v,\widehat{d}) \leq {\tt cap}(b,\widehat{d})/2 and hence:

\sum_{u \in b, u \sim s(v)} {\tt dem}(u,\widehat{d}) \geq {\tt cap}(b,\widehat{d})/2,

or equivalently:

1/2 \leq \sum_{u \in b, u \sim s(v)} \frac{{\tt dem}(u,\widehat{d})}{{\tt cap}(b,\widehat{d})}
    \leq \sum_{u \in b, u \sim s(v)} \sum_{d} \frac{{\tt dem}(u,d)}{{\tt cap}(b,d)}
    = D \cdot \sum_{u \in b, u \sim s(v)} \frac{1}{D} \sum_{d} \frac{{\tt dem}(u,d)}{{\tt cap}(b,d)}
    = D \cdot \sum_{u \in b, u \sim s(v)} h_{\rm avg}(u|B).

Summing over all the nodes in S:

\frac{|S|}{2} \leq D \cdot \sum_{b \in S} \sum_{u \in b, u \sim s(v)} h_{\rm avg}(u|B).

Taking U' to be the set of tasks placed in the nodes of S (i.e., the tasks processed before v), we get:

|S| \leq (2D) \cdot \sum_{u \in U', u \sim s(v)} h_{\rm avg}(u|B).

The above inequality considers the specific timeslot s(v). We complete the induction step by relaxing it to the timeslot yielding the maximum summation, observing that U' \subseteq V_{B}^{\rm pen}, and accounting for the extra node purchased for placing v. This proves the claim in (1).

We are now ready to prove the lemma. The cost of S_{\rm pen} is:

\sum_{B} {\tt cost}(B) \cdot N_{B}
  \leq {\tt cost}({\cal B}) + (2D) \cdot \sum_{B} \max_{t} \sum_{u \in V_{B}^{\rm pen}, u \sim t} p_{\rm avg}(u|B)
  = {\tt cost}({\cal B}) + (2D) \cdot \sum_{B} \max_{t} \sum_{u \in V_{B}^{\rm pen}, u \sim t} p^{*}_{\rm avg}(u),

where the last equality holds because the algorithm maps each task u to the node-type offering the least penalty, so p_{\rm avg}(u|B) = p^{*}_{\rm avg}(u) for every u \in V_{B}^{\rm pen}. ∎

Combining the upper and lower bounds, we next prove an approximation ratio for the PenaltyMap procedure.

Theorem 3.

Assuming that all the tasks are small,

{\tt cost}(S_{\rm pen}) \leq {\tt cost}({\cal B}) + (2D) \cdot \min(m,T) \cdot {\tt cost}(S_{\rm opt})
Proof.

Referring to Lemma 2, we have that:

\sum_{B} {\tt cong}(V_{B}^{\rm pen}) \leq \sum_{B} {\tt cong}({\cal U}) = m \cdot {\tt cong}({\cal U}) \leq m \cdot {\tt cost}({\tt opt}), (2)

where the first inequality is true because V_{B}^{\rm pen} \subseteq {\cal U} and the last inequality follows from Lemma 1. Furthermore,

\sum_{B} {\tt cong}(V_{B}^{\rm pen}) = \sum_{B} \max_{t} \sum_{u \in V_{B}^{\rm pen}, u \sim t} p^{*}_{\rm avg}(u) \leq \sum_{u \in {\cal U}} p^{*}_{\rm avg}(u),

where the inequality is true because every task contributes at most once in the summation. On the other hand,

{\tt cost}(S_{\rm opt}) \geq \max_{t} \sum_{u \in {\cal U}, u \sim t} p^{*}_{\rm avg}(u) \geq \frac{1}{T} \sum_{t} \sum_{u \in {\cal U}, u \sim t} p^{*}_{\rm avg}(u) \geq \frac{1}{T} \sum_{u \in {\cal U}} p^{*}_{\rm avg}(u),

where the first inequality follows from Lemma 1 and the last inequality from the fact that every task is active at some timeslot. Hence,

\sum_{B} {\tt cong}(V_{B}^{\rm pen}) \leq \sum_{u \in {\cal U}} p^{*}_{\rm avg}(u) \leq T \cdot {\tt cost}({\tt opt}). (3)

The theorem is proved by taking the minimum of the bounds given by (2) and (3), and appealing to Lemma 2. ∎

The theorem provides an approximation ratio of O(D \cdot \min(T,m)) for the case of small tasks. To handle the general case, we segregate the large tasks, apply PenaltyMap to the two classes separately, and derive the final solution by taking the union of the two solutions. Using an analysis similar to that for small tasks, we can show that the case of large tasks also admits an approximation ratio of O(D \cdot \min(T,m)), thereby yielding the same overall ratio for the general case (but with an increased hidden constant). However, our experiments show that in practice, the solutions are much closer to the optimal, even without segregating the tasks into the two classes.

Figure 4: Illustration of deficiencies of penalty-based mapping. In part (a), tasks mapped to the first node-type are mainly active in the earlier segments of the timeline, whereas those mapped to the second are mainly active in the later segments, leading to resource wastage. In part (b), taking D=2, m=3 and T=1, resource demands/capacities along the two dimensions are shown. PenaltyMap would map tasks 1 and 2 to node-types 1 and 2, respectively, thereby utilizing two nodes, one from each type, whereas both can be accommodated in a single node of node-type 3.

V Linear Programming Based Mapping

While penalty-based mapping works reasonably well in practice, the strategy has certain deficiencies, and we design an improved mapping based on linear programming.

V-A Penalty Based Mapping: Deficiencies

The penalty-based strategy makes the node-type choices for the tasks independently of each other, optimizing on an individual basis without considering the interactions. The lack of a collective analysis may lead to wastage of resource capacities, resulting in sub-optimal packing and higher cost.

There are two aspects to the issue, pertaining to the timeline and the resource dimensions. First, tasks mapped to a node-type may be predominantly active only in certain segments of the timeline and less active in others. When the group of tasks gets placed in nodes of the given type, resource capacity is wasted in the less active segments. Second, tasks with high demand along a particular dimension d (say memory) are likely to get mapped to a node-type with higher capacity along d, wasting the capacity along the other dimensions. Figure 4 illustrates the two aspects. In addition, the node-type is selected based on normalized demands in an aggregated manner, ignoring the effect of individual dimensions.

V-B Linear Programming

We can address the above issues and find an improved mapping via linear programming (LP). Let π : 𝒰 → ℬ be any task to node-type mapping. Consider executing the two-phase algorithm (Figure 3) with the given mapping π (instead of the penalty-based mapping), followed by the same greedy placement for each node-type. Let S_π be the solution output by the algorithm. Using a proof similar to that of Lemma 1, we can show the following lower-bound, denoted LB(π):

{\tt cost}(S_{\pi})\geq\sum_{B}\max_{t,d}\sum_{u\in V_{B}^{\pi},u\sim t}\frac{{\tt dem}(u,d)}{{\tt cap}(B,d)},

where V_B^π is the set of tasks mapped to B under π. Then, cost(opt) is at least the minimum lower-bound over all possible mappings:

{\tt cost}({\tt opt})\geq\min_{\pi}{\rm LB}(\pi).
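As a concrete illustration, LB(π) for a fixed mapping π can be computed directly from the displayed formula. The sketch below is a minimal illustration; the data structures (dem, cap, active) are assumed representations, not the paper's implementation.

```python
# Sketch: computing LB(pi) for a fixed task-to-node-type mapping pi.
# dem[u][d]  : demand of task u along dimension d (assumed layout)
# cap[B][d]  : capacity of node-type B along dimension d
# active[u]  : set of timeslots during which task u is active

def lower_bound(pi, tasks, node_types, dem, cap, active, T, D):
    """LB(pi) = sum_B max_{t,d} sum_{u mapped to B, u active at t}
    dem(u,d) / cap(B,d)."""
    lb = 0.0
    for B in node_types:
        mapped = [u for u in tasks if pi[u] == B]
        peak = 0.0
        for t in range(T):
            for d in range(D):
                load = sum(dem[u][d] / cap[B][d]
                           for u in mapped if t in active[u])
                peak = max(peak, load)
        lb += peak  # contribution of node-type B
    return lb
```

Enumerating all mappings π to take the minimum is exponential in n, which is exactly why the next step formulates the minimization as an integer program.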

Our goal is to find the mapping π with the least lower-bound LB(π). The goal can be accomplished via integer programming. We introduce a variable x(u,B) for each pair of task and node-type, representing whether u is mapped to B. We add a variable α_B for each node-type B, capturing the contribution of the node-type to the lower-bound:

\max_{t,d}\sum_{u\in V_{B}^{\pi},u\sim t}\frac{{\tt dem}(u,d)}{{\tt cap}(B,d)}.

The integer program is shown below.

\min \sum_{B}{\tt cost}(B)\cdot\alpha_{B}
\forall u:\quad \sum_{B}x(u,B)=1 (4)
\forall(B,t,d):\quad \sum_{u\sim t}x(u,B)\cdot\frac{{\tt dem}(u,d)}{{\tt cap}(B,d)}\leq\alpha_{B} (5)
\forall(u,B):\quad x(u,B)\in\{0,1\}, (6)

where the first and the last constraints ensure that u is mapped to exactly one node-type. Once the values of x(u,B) are fixed, the optimal value for α_B is automatically the contribution of B. Thus, the integer program yields the desired mapping.

Unfortunately, it is prohibitively expensive to solve integer programs even on moderate size inputs. So, we resort to linear programming by relaxing the integrality constraint (6) as:

\forall(u,B):\quad 0\leq x(u,B)\leq 1. (7)

Let x*(u,B) be the optimal LP solution. The objective value provides a lower-bound on cost(opt).
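The LP relaxation can be assembled mechanically from the objective and constraints (4), (5) and (7). The sketch below builds it in a solver-agnostic matrix form (our experiments feed such an LP to the python-mip package with the CBC solver); all data-structure names are illustrative assumptions.

```python
# Sketch: assembling the LP relaxation in matrix form.
# Variables are ordered as [x(u,B) for each u, B] followed by [alpha_B].

def build_lp(tasks, node_types, dem, cap, active, cost, T, D):
    """Returns (c, A_eq, b_eq, A_ub, b_ub) as plain Python lists."""
    n, m = len(tasks), len(node_types)
    nvars = n * m + m
    xid = {(u, B): ui * m + bi
           for ui, u in enumerate(tasks) for bi, B in enumerate(node_types)}
    aid = {B: n * m + bi for bi, B in enumerate(node_types)}

    # Objective: minimize sum_B cost(B) * alpha_B.
    c = [0.0] * (n * m) + [cost[B] for B in node_types]

    # Constraint (4): each task is fully assigned, sum_B x(u,B) = 1.
    A_eq, b_eq = [], []
    for u in tasks:
        row = [0.0] * nvars
        for B in node_types:
            row[xid[u, B]] = 1.0
        A_eq.append(row); b_eq.append(1.0)

    # Constraint (5): for each (B, t, d), the load is at most alpha_B.
    A_ub, b_ub = [], []
    for B in node_types:
        for t in range(T):
            for d in range(D):
                row = [0.0] * nvars
                for u in tasks:
                    if t in active[u]:
                        row[xid[u, B]] = dem[u][d] / cap[B][d]
                row[aid[B]] = -1.0
                A_ub.append(row); b_ub.append(0.0)
    # Relaxed constraint (7), 0 <= x(u,B) <= 1, is left to variable bounds.
    return c, A_eq, b_eq, A_ub, b_ub
```

With python-mip one would instead create the variables via add_var (with bounds 0 and 1) and add these rows as constraints directly; the matrix form above is equivalent and solver-independent.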

V-C LP Rounding

In contrast to the integer program, the LP produces a fractional solution: for each task u, the solution consists of a distribution (vector) over the node-types [x*(u,B)] summing to 1, where x*(u,B) is the fractional extent to which u is assigned to B. We design a procedure for transforming (rounding) the fractional solution into an integral solution. The procedure is based on a critical observation that the solution x* is nearly integral when n is large enough.

Lemma 4.

In the solution x*, the number of variables x*(u,B) with non-integral values is at most n+mTD.

Proof.

Assume that x* is an extreme point (or basic feasible) solution. The number of variables is nm+m and so, at least as many constraints have to be tight [23]. Of these, at most n+mTD can be from the constraints (4) and (5), and the rest must come from (7). Hence, at most (n+mTD-m) of the nm variables of the form x(u,B) can be proper fractions. ∎

The lemma shows that of the n·m variables x*(u,B), most would be integral when n ≫ TD. From our experimental study, we observe that the above phenomenon manifests strongly in practice (irrespective of the values of n and TD). For a task u, let x_max(u) = max_B x*(u,B) denote the maximum extent to which the task is assigned to a node-type. In the worst case, the solution can assign u to an extent of 1/m to each node-type. However, the results show that x_max(u) = 1 or close to 1 for the vast majority of tasks, meaning that most of the tasks get assigned to a single node-type, or show a clear preference for a particular node-type. Figure 5 provides an illustration by considering a sample input from the experimental study.

Refer to caption
Figure 5: Illustration of near-integrality using a sample input from our experimental study with n=500, m=10, D=5 and T=24. The X-axis shows the tasks u and the Y-axis shows x_max(u). Tasks are sorted by their value for ease of visualization. We see that most of the tasks are assigned to a single node-type or show a clear preference for a particular node-type (the one receiving the maximum assignment).

Based on the above observation, we employ a simple, but effective, rounding heuristic: map each task u to the node-type B to which the task is assigned to the maximum extent:

\pi_{\rm LP}(u)=\operatorname*{argmax}_{B}\;x^{*}(u,B).

Given the mapping, we partition the tasks according to the node-types and treat each node-type independently, using the same greedy placement heuristic as in PenaltyMap, obtaining a solution S_LP.
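The rounding step itself is a one-liner; a minimal sketch, assuming the fractional solution x* is a dictionary keyed by (task, node-type) pairs:

```python
# Sketch: the argmax rounding heuristic and the resulting partition of
# tasks by node-type (x_star is an assumed dict representation).

def round_lp(x_star, tasks, node_types):
    # Map each task to the node-type receiving its maximum assignment;
    # ties are broken by the order of node_types.
    pi = {u: max(node_types, key=lambda B: x_star[u, B]) for u in tasks}
    partition = {B: [u for u in tasks if pi[u] == B] for B in node_types}
    return pi, partition
```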

Analysis. The analysis of the LP-map procedure is similar to that of PenaltyMap and a sketch is provided below. For any task u, there must be at least one node-type B with x*(u,B) ≥ 1/m. Thus, the rounding incurs a loss by a factor of at most m. This fact, combined with a proof similar to that of PenaltyMap, can be used to show that the solution output by LP-map satisfies:

{\tt cost}(S_{\rm LP})\leq{\tt cost}({\cal B})+(2Dm)\cdot{\tt cost}({\tt opt}),

yielding an approximation ratio of O(Dm). The solutions are much closer to the optimal in practice, with the margin being no more than 30% in our experiments.

Mapping Phase:
  Construct the linear program and derive the solution x*.
  For each task u: map u to π_LP(u) = argmax_B x*(u,B).
Placement Phase:
  R = 𝒰    // Remaining tasks.
  Sort the node-types B in decreasing order of Σ_d cap(B,d) / cost(B).
  For each node-type B in the above order:
    Initialize solution S_B = ∅.
    Let V_B^LP be the set of tasks mapped to B.
    Let U = V_B^LP ∩ R.    // Mapped to B and remaining.
    Sort U in increasing order of start timeslots.
    For each task u in the above order:
      If u can fit in some node in S_B:
        Among the feasible nodes, place u in the node purchased the earliest.
      Else:
        Purchase a new B-type node b and place u in it.
    // Cross node-type filling or piggy-backing.
    Sort the tasks u in R in increasing order of h_avg(u,B).
    For each task u in R in the above order:
      If u can fit in some node in S_B:
        Among them, place u in the node purchased the earliest.
    Delete all the tasks placed in this iteration from R.
Output: S_LP = ∪_B S_B.

Figure 6: Algorithm LP-map with cross node-type filling

V-D Cross Node-Type Filling

The placement procedure is maximal in the sense that, for any node-type B, it opens a new node for a task u only when the task cannot fit in any of the nodes purchased already. However, the heuristic may leave empty spaces that can be filled by tasks of other node-types. We present a cross node-type filling method that reduces the overall cost by placing tasks mapped to the other node-types in the above empty spaces.

Suppose we have processed node-types B_1, B_2, …, B_k and let S denote the set of nodes purchased already. Let R denote the set of remaining tasks that are yet to be placed (these would be mapped to node-types B_{k+1} onwards). We wish to place as many tasks from R as possible in the nodes found in S. We employ a greedy heuristic for this purpose. We sort the tasks u ∈ R in increasing order of h_avg(u,B_k), intuitively arranging the tasks in terms of the space they would occupy. For each task u in the above order, we check if u can be placed in some node in S without violating the feasibility constraints and if so, we place it in the node purchased the earliest. The process is repeated for the subsequent node-types until all the tasks are placed.

In the above process, tasks mapped to a node-type B get an opportunity to piggy-back on the nodes purchased for the earlier node-types. Thus, the tasks mapped to later node-types are more likely to piggy-back and so, the ordering of the node-types is important. We sort the node-types in decreasing order of capacity-to-cost ratio: Σ_d cap(B,d) / cost(B). Intuitively, the ratio measures the capacity offered per unit cost, and the sorting puts less cost-effective node-types later in the ordering, giving more opportunity for their tasks to piggy-back. A pseudo-code is presented in Figure 6. For each node-type B, the tasks that are mapped to B are first placed, and then the remaining tasks (mapped to later node-types) are accommodated as much as possible. The cross node-type filling method can also be applied to penalty-based mapping (or to any other mapping strategy).
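A minimal sketch of the filling step for one node-type, assuming each purchased node is represented by its residual capacity per timeslot and dimension (all names here are illustrative, not the paper's implementation):

```python
# Sketch: greedy cross node-type filling. node_free[t][d] is the residual
# capacity of a node at timeslot t along dimension d (assumed layout).

def fits(node_free, u, dem, active, D):
    """Task u fits if residual capacity suffices at every timeslot it is
    active, along every dimension."""
    return all(node_free[t][d] >= dem[u][d]
               for t in active[u] for d in range(D))

def fill(nodes, remaining, dem, active, h_avg, D):
    """Piggy-back remaining tasks (in increasing order of average height)
    onto already purchased nodes; `nodes` is assumed sorted by purchase
    order, so the earliest-purchased feasible node is tried first."""
    placed = []
    for u in sorted(remaining, key=lambda u: h_avg[u]):
        for node_free in nodes:
            if fits(node_free, u, dem, active, D):
                for t in active[u]:       # commit the placement
                    for d in range(D):
                        node_free[t][d] -= dem[u][d]
                placed.append(u)
                break
    return placed
```

Tasks returned in `placed` would then be deleted from R before the next node-type is processed, matching the loop in Figure 6.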

Time Complexity. The LP is designed to only find the task to node-type mapping, rather than a full solution encoding the placement of tasks in nodes. Consequently, the size of the LP is fairly small, with the number of variables and constraints being O(nm) and O(mTD), respectively. Interior point methods converge in polynomial time and the execution time depends on the solver's performance. Given the LP solution, the time complexity of the rest of the procedure is similar to that of PenaltyMap: the mapping takes time O(n·m), whereas placement takes time O(n·|S|·D·T), resulting in a total execution time of O(n·m + n·|S|·D·T).

VI Experiments

VI-A Set-up

System. A machine with a 2.5 GHz Intel Core i7, 16GB memory, running macOS version 10.15.2, was used for running the experiments, implemented in Python 3.8.7. We use the python-mip package [33] with the CBC solver for solving the LP.

Google Cloud Trace 2019 (GCT-2019). GCT-2019 is used for real-world evaluation. 10M collection events (begin and end events of groups of tasks) and all machine-types (node-types) available for a single cluster (labeled “a”) [31] were sampled using BigQuery. Demand and capacity are two-dimensional (CPU and memory) and normalized in the trace. Entries with missing fields are purged from the sampled trace. Time-stamps are converted to seconds. Task start and end times are discovered from task creation and end events. Thus, the task intervals, demands and capacities are drawn from the real trace. The node-type cost is generated synthetically (see below), using publicly available pricing coefficients [32]. The processed data contains about 13K tasks and 13 node-types. Given n and m, which define one experimental scenario, we construct an input instance by randomly sampling n tasks and m node-types from this processed data. Results are averaged across 5 such random input instances per scenario. Timeline trimming is performed as discussed in Section II.

TABLE I: Default values for parameters
Parameter  | Traces    | Value
n          | Both      | 1000
m          | Both      | 10
T          | Synthetic | 24
Capacity   | Synthetic | [0.2, 1.0]
Demand     | Synthetic | [0.01, 0.1]
Dimensions | Synthetic | 5

Synthetic Data. The GCT-2019 trace is 2-dimensional and the task demands are fixed and small compared to node capacities. In order to explore higher dimensions and different demand categories, we set up a synthetic benchmark using a random generator with inputs n, m, D, T, and intervals for demand and capacity of the form [a,b] ⊆ [0,1]. Each of the D components of demand and capacity is uniformly and independently selected from its respective interval. For each task u, [s(u), e(u)] is uniformly selected from [1,T]. We fixed T=24 and the capacity interval as [0.2, 1]. We pick D ∈ {2,5,7}, m ∈ {5,10,15}, demand intervals from {[0.01,0.05], [0.01,0.1], [0.01,0.2]}, and n as per the experiment. Unless specified otherwise, the default values used in specific experiments are shown in Table I. As in GCT-2019, for each scenario, results are averaged over 5 random inputs and timeline trimming is performed.

Cost-model. Node-type cost is computed as follows:

cost(B)=\sum_{d\in[1,D]}{c_{d}\cdot cap(B,d)^{e}} (8)

where c_d is the coefficient for the resource component d and the exponent e is a measure of cost sensitivity to changes in the resource quantities. We consider two cost models: homogeneous linear and heterogeneous. In the former, the coefficients and the exponent are set to one. In the latter, the coefficients are set heterogeneously and the exponent is varied to simulate non-linearity: (i) when e=1, the per-unit rate (cost/quantity) of a resource component remains constant; (ii) when e<1, the rate decreases with increase in the quantity; (iii) when e>1, the rate increases with increase in the quantity.
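Equation (8) translates directly into code; the sketch below uses illustrative coefficient values, not the trace-derived ones.

```python
# Sketch of the cost model in Equation (8):
# cost(B) = sum_d c_d * cap(B, d)**e.

def node_cost(cap_B, coeffs, e):
    """cap_B: capacity vector of node-type B; coeffs: per-dimension
    coefficients c_d; e: cost-sensitivity exponent."""
    return sum(c * cap_d ** e for c, cap_d in zip(coeffs, cap_B))
```

Under the homogeneous linear model (all c_d = 1, e = 1) the cost reduces to the plain sum of capacities; raising e above 1 makes larger nodes disproportionately expensive.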

Algorithms. Taking the penalty-based mapping (PenaltyMap) as the baseline, we evaluate the LP-based mapping (LP-map), LP-based mapping with cross node-type filling (LP-map-F) and penalty-based mapping with cross node-type filling (PenaltyMap-F).

For PenaltyMap and PenaltyMap-F, we report the minimum cost obtained from the four combinations of the two mapping and two fitting policies described in Section III. Similarly, for LP-map and LP-map-F, we take the minimum cost over the two fitting policies.

Linear Programming Lowerbound. We use the linear program discussed earlier (Section V-B) to derive a lowerbound (LB) on the cost of the optimal solution. The costs of the solutions S output by the algorithms are normalized as cost(S)/LB, so that a normalized cost of 1 means that S is optimal.

VI-B Homogeneous Linear Cost-Model

As noted earlier, the homogeneous cost-model computes cost(B) by setting e=1 and c_d=1 (Equation 8).

Refer to caption
(a) Varying D. m=10, demand=[0.01,0.1]
Refer to caption
(b) Varying m. D=5, demand=[0.01,0.1]
Refer to caption
(c) Varying demand. m=10, D=5
Figure 7: [Synthetic-Homogeneous]: Comparison of the algorithms when m, D and demand are scaled.

VI-B1 Synthetic Trace

We first compare the three algorithms by scaling each of D, m and the demand, fixing the other parameters.

Dimension. In Figure 7(a), D is varied from 2 to 7. Compared to PenaltyMap, the LP-map strategy decreases the normalized cost by up to 0.13 (or 13%), and LP-map-F plugs wastage by using the cross node-type filling, resulting in a further reduction of up to 8%, taking the solution closer to the optimal. Overall, LP-map-F offers up to a 17% decrease in cost compared to PenaltyMap.

We also note that the cost increases for all the mapping strategies as D increases, since the instances become harder, but LP-map-F is more resilient: it remains within 20% of the lowerbound even for higher D, and well within 10% for lower D, indicating that the solution is close to optimal.

Node-type. In Figure 7(b), m is varied from 5 to 15. Here again, LP-map-F outperforms PenaltyMap by up to 24%, with cross node-type filling contributing up to 7% of this cost reduction. As a general trend, more node-types make the mapping problem harder and the costs of all the algorithms increase. The PenaltyMap algorithm is 18% away from the lowerbound at m=5, and shoots to 41% at m=15. In contrast, the LP-map-F algorithm remains stable, resulting in only a 4% increase (1.12 to 1.16).

Task demand. In Figure 7(c), the task demand is varied from [0.01, 0.05] to [0.01, 0.2]. As in the two prior cases, LP-map-F performs the best in all the cases and stays under 25% of the lowerbound.

VI-B2 Real-trace Performance - Scaling Tasks and Node-types

Google trace results are shown in Figure 8. First, we scale n (Figure 8(a)) while keeping m=10 and D=2 (CPU and memory) constant. We see that LP-map outperforms PenaltyMap by large margins, from 39% to 156%, and produces near-optimal solutions, not more than 11% away from the lowerbound. Our second experiment (Figure 8(b)) fixes n=1000 and varies m from 4 to 13, wherein LP-map maintains 26%-119% lower cost than PenaltyMap and remains within 7% of the lowerbound.

In the GCT-2019 trace, the task demands are small and so, the LP mappings are near-integral. Further, at smaller D, it is easier to fit the tasks, causing the LP-map solutions to be very close to the optimal. The cross node-type filling has little room for improvement and hence LP-map-F exhibits performance similar to LP-map.

As n increases, PenaltyMap has more tasks to map, due to which sub-optimal mappings get averaged out, resulting in less capacity wastage and improved performance. However, as m increases, the mapping problem becomes harder and the performance of the greedy heuristic degrades.

Refer to caption
(a) Varying n. m=10
Refer to caption
(b) Varying m. n=1000
Figure 8: [GCT-2019-Homogeneous]: Quality of solution when m and n are scaled.

VI-C Heterogeneous Cost-Model

Synthetic Data. We use a median point of D=5, m=10 and demand = [0.01,0.1] for this experiment. The coefficients c_d for the cost model (Equation 8) are generated randomly in the range [0.3, 1.0]. Further, we vary e from 0.33 to 3. The results (Figure 9) show that LP-map-F offers up to an 18% gain over PenaltyMap for e=1, and up to a 12% gain for e<1. Cross node-type filling provides a constant benefit of 2%-3% over LP-map for all values of e.

When e>1, the cost is skewed towards the largest resource component, and when e<1, the model tends towards uniform cost across the node-types (irrespective of their capacity). In both cases, it is easier to map the tasks and hence, the greedy PenaltyMap gets closer to LP-map.

Refer to caption
Figure 9: [Synthetic-Heterogeneous]: Varying e for a given m.

Real-life Trace GCT-2019. For each node-type in GCT-2019, we use pricing coefficients from the Google pricing model [32], keeping the exponent e=1.

Refer to caption
Figure 10: [GCT-2019-Heterogeneous]: Varying node-types (n=1000, e=1) using the cost model with real-world coefficients.

Since the number of node-types has an important role in the quality of the solution, we vary the node-types from 4 to 13. The results are shown in Figure 10. The performance of PenaltyMap is away from the lowerbound by 23% at m=4, and degrades as m increases, reaching 53% at m=13. In contrast, LP-map-F gives high quality solutions, staying within about 5% of the lowerbound, while outperforming PenaltyMap by up to 48%. Cross node-type filling contributes up to 5% of the LP-map-F performance.

VI-D Cross Node-Type Filling

As mentioned earlier (Section V-D), cross node-type filling can be applied over PenaltyMap as well, and it is interesting to study the implications. Towards that goal, we augment PenaltyMap with the filling method to derive a PenaltyMap-F version, and evaluate it on all the data points of the GCT-2019 trace experiments (Figures 8 and 10). We can see that the filling method offers 44%-142% gains over PenaltyMap.

Refer to caption
Figure 11: [GCT-2019 All-Scenarios]: Comparing PenaltyMap-F and LP-map-F.

Figure 11 compares PenaltyMap-F and LP-map-F, where LP-map-F outperforms PenaltyMap-F consistently, with gains in the range 3%-13%, establishing that an organic combination of LP-based mapping and cross node-type filling provides better solutions.

VI-E Running Time

We profiled the running time of all the algorithms on the largest configuration of 2000 tasks and 13 node-types from the GCT-2019 trace. We find that, on average, PenaltyMap takes about one second to produce a solution. The running time of the LP-based mappings can be split into the running time of the LP solver and of the mapping phase. While the LP solver itself takes about 15 minutes, the mapping phase for both LP-map and LP-map-F takes about a second. Since our objective is cold-starting the cluster and cluster-sizing is a one-time computation, 15 minutes is a very practical running time.

VI-F No-Timeline Scenario

Figure 1 illustrates the gains of incorporating the timeline into the solution computation. To experimentally confirm these gains, we considered the configuration of 1000 tasks and 10 node-types from the GCT-2019 trace. We compared the cluster cost under timeline-aware and timeline-agnostic settings. For the former, we use the cost of the LP-map-F solution (Figure 8(a)) and for the latter, we compute a lowerbound on the cost by treating the tasks as always active. As expected, the latter produces a higher cost, and we found that the factor is as high as 2x on average.

VII Conclusions

We introduced the TL-Rightsizing problem, which reduces cluster cost by exploiting the time-limited nature of real-life tasks. We designed a baseline two-phase algorithm with an approximation ratio of O(D·min(m,T)), an improved LP-based mapping strategy and a cross node-type filling method for further cost reduction. Experiments on synthetic and real-life traces demonstrate that the final algorithm produces solutions within 20% of a lowerbound on the optimal cost. We identify two avenues of fruitful research. The first is to further bridge the gap between the lowerbound and the solutions on the hard instances. The second, of interest from a theoretical perspective, is to improve the approximation ratio. We also intend to work on enhancing the scheduler and auto-scaling algorithms to better leverage the output of TL-Rightsizing.

References

  • [1] NEC Wraps Red Hat OpenShift Into 5G Aspirations. https://www.sdxcentral.com/articles/news/nec-wraps-red-hat-openshift-into-5g-aspirations/2021/04/. Accessed: 2021-04-28.
  • [2] A technical look at 5G energy consumption and performance. https://www.ericsson.com/en/blog/2019/9/energy-consumption-5g-nr. Accessed: 2021-04-28.
  • [3] U. Adamy and T. Erlebach. Online coloring of intervals with bandwidth. In International Workshop on Approximation and Online Algorithms, 2003.
  • [4] How the U.S. Air Force Deployed Kubernetes and Istio on an F-16 in 45 days. https://thenewstack.io/how-the-u-s-air-force-deployed-kubernetes-and-istio-on-an-f-16-in-45-days/. Accessed: 2021-04-28.
  • [5] A. Bar-Noy, R. Bar-Yehuda, A. Freund, J. Naor, and B. Schieber. A unified approach to approximating resource allocation and scheduling. Journal of the ACM, 48(5):1069–1090, 2001.
  • [6] How DENSO is fueling development on the Vehicle Edge with Kubernetes. https://www.cncf.io/blog/2019/10/01/how-denso-is-fueling-development-on-the-vehicle-edge-with-kubernetes/. Accessed: 2021-04-28.
  • [7] C. Chekuri and S. Khanna. On multidimensional packing problems. SIAM Journal on Computing, 33(4):837–851, 2004.
  • [8] Cost analysis of orchestrated 5G networks for broadcasting. https://tech.ebu.ch/docs/techreview/EBU_Tech_Review_2019_Lombardo_Cost_analysis_of_orchestrated_5G_networks_for_broadcasting.pdf. Accessed: 2021-04-28.
  • [9] K. Elbassioni, N. Garg, D. Gupta, A. Kumar, V. Narula, and A. Pal. Approximation algorithms for the unsplittable flow problem on paths and trees. In Foundations of Software Technology and Theoretical Computer Science, volume 18, 2012.
  • [10] Ivan Farris, Antonino Orsino, Leonardo Militano, Antonio Iera, and Giuseppe Araniti. Federated iot services leveraging 5g technologies at the edge. Ad Hoc Networks, 68:58–69, 2018.
  • [11] E. Feller, L. Rilling, and C. Morin. Energy-aware ant colony based workload placement in clouds. In Grid Computing, 2011.
  • [12] M. Gabay and S. Zaourar. Vector bin packing with heterogeneous bins: application to the machine reassignment problem. Annals of Operations Research, 242(1):161–194, 2016.
  • [13] Daniel Gmach, Jerry Rolia, Ludmila Cherkasova, and Alfons Kemper. Workload analysis and demand prediction of enterprise data center applications. In 2007 IEEE 10th International Symposium on Workload Characterization, pages 171–180. IEEE, 2007.
  • [14] H. Goudarzi and M. Pedram. Multi-dimensional sla-based resource allocation for multi-tier cloud computing systems. In Cloud Computing, 2011.
  • [15] B. Han, G. Diehr, and Jack S J. Cook. Multiple-type, two-dimensional bin packing problems: Applications and algorithms. Annals of Operations Research, 50(1):239–261, 1994.
  • [16] E. Jeannot, G. Mercier, and F. Tessier. Process placement in multicore clusters: Algorithmic issues and practical techniques. IEEE Transactions on Parallel and Distributed Systems, 25(4):993–1002, 2013.
  • [17] B. Jennings and R. Stadler. Resource management in clouds: Survey and research challenges. Journal of Network and Systems Management, 23(3):567–619, 2015.
  • [18] Arijit Khan, Xifeng Yan, Shu Tao, and Nikos Anerousis. Workload characterization and prediction in the cloud: A multiple time series approach. In 2012 IEEE Network Operations and Management Symposium, pages 1287–1294. IEEE, 2012.
  • [19] H. Kierstead. Coloring graphs online. In A. Fiat and G. Woeginger, editors, Online algorithms: The state of the art. Springer, 1998.
  • [20] KubeEdge: An open platform to enable Edge computing. https://kubeedge.io/en/. Accessed: 2021-04-28.
  • [21] Kubernetes Auto-scaler. https://kubernetes.io/blog/2016/07/autoscaling-in-kubernetes/. Accessed: 2021-04-28.
  • [22] W. Leinberger, G. Karypis, and V. Kumar. Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints. In International Conference on Parallel Processing, 1999.
  • [23] J. Matousek and B. Gärtner. Understanding and using linear programming. Springer Science & Business Media, 2007.
  • [24] Iyswarya Narayanan, Aman Kansal, and Anand Sivasubramaniam. Right-sizing geo-distributed data centers for availability and latency. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 230–240. IEEE, 2017.
  • [25] R. Panigrahy, K. Talwar, L. Uyeda, and U. Wieder. Heuristics for vector bin packing. Technical report, Microsoft, 2011.
  • [26] B. Patt-Shamir and D. Rawitz. Vector bin packing with multiple-choice. Discrete Applied Mathematics, 160(10-11):1591–1600, 2012.
  • [27] M. Rodriguez and R. Buyya. Deadline based resource provisioning and scheduling algorithm for scientific workflows on clouds. IEEE Transactions on Cloud Computing, 2(2):222–235, 2014.
  • [28] P. Silva, C. Perez, and F. Desprez. Efficient heuristics for placing large-scale distributed applications on multiple clouds. In CCGrid, 2016.
  • [29] Spring Batch on Kubernetes: Efficient batch processing at scale. https://spring.io/blog/2021/01/27/spring-batch-on-kubernetes-efficient-batch-processing-at-scale. Accessed: 2021-04-28.
  • [30] Amir Varasteh, Basavaraj Madiwalar, Amaury Van Bemten, Wolfgang Kellerer, and Carmen Mas-Machuca. Holu: Power-aware and delay-constrained vnf placement and chaining. IEEE Transactions on Network and Service Management, 2021.
  • [31] Google Cloud Trace 2019 helper document. https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view. Accessed: 2021-02-28.
  • [32] Google pricing model. https://cloud.google.com/compute/vm-instance-pricing. Accessed: 2021-02-28.
  • [33] Python MIP solver. https://pypi.org/project/mip/1.12.0/. Accessed: 2021-02-28.
  • [34] Jie Xu, Lixing Chen, and Shaolei Ren. Online learning for offloading and autoscaling in energy harvesting mobile edge computing. IEEE Transactions on Cognitive Communications and Networking, 3(3):361–373, 2017.
  • [35] Maotong Xu, Sultan Alamro, Tian Lan, and Suresh Subramaniam. Cred: Cloud right-sizing with execution deadlines and data locality. IEEE Transactions on Parallel and Distributed Systems, 28(12):3389–3400, 2017.
  • [36] Poonam Yadav, Julie A McCann, and Tiago Pereira. Self-synchronization in duty-cycled internet of things (iot) applications. IEEE Internet of Things Journal, 4(6):2058–2069, 2017.
  • [37] Hao Yin, Xu Zhang, Hongqiang H Liu, Yan Luo, Chen Tian, Shuoyao Zhao, and Feng Li. Edge provisioning with flexible server placement. IEEE Transactions on Parallel and Distributed Systems, 28(4):1031–1045, 2016.