
Admission Control with Response Time Objectives for Low-latency Online Data Systems

Hao Xu (LinkedIn Corporation, Mountain View, CA, USA) and Juan A. Colmenares (LinkedIn Corporation, Mountain View, CA, USA)
Abstract.

To provide quick responses to users, Internet companies rely on online data systems able to answer queries in milliseconds. These systems employ complementary overload management techniques to ensure they provide continued, acceptable service throughout traffic surges, where “acceptable” partly means that serviced queries meet or closely track their response time objectives. Thus, in this paper we present Bouncer, an admission control policy aimed at keeping admitted queries under or near their service level objectives (SLOs) on percentile response times. It computes inexpensive estimates of percentile response times for every incoming query and compares the estimates against the objective values to decide whether to accept or reject the query. Bouncer allows assigning separate SLOs to different classes of queries in the workload, implements early rejections to let clients react promptly and to help data systems avoid doing useless work, and complements other load shedding policies that guard systems from exceeding their capacity. Moreover, we propose two starvation avoidance strategies that supplement Bouncer’s basic formulation and prevent query types from receiving no service (starving).

We evaluate Bouncer and its starvation-avoiding variants against other policies in simulation and on a production-grade in-memory distributed graph database. Our results show that Bouncer and its variants allow admitted queries to meet or stay close to the SLOs when the other policies do not. They also report fewer overall rejections and a small overhead, and with the given latency SLOs, they let the system reach high utilization. In addition, we observe that the proposed strategies can stop query starvation, but at the expense of a modest increase in overall rejections and SLO violations for serviced requests; however, our results suggest that the violation counts may be acceptable in practice.

1. Introduction

Social media, e-commerce, and other Internet companies rely on low-latency online data systems (DeCandia et al., 2007; Bronson et al., 2013; Corbett et al., 2013; Zhan et al., 2019; Corp., 2022b, 2018) to provide quick responses to their users. These large-scale systems answer queries from multiple clients demanding millisecond-scale response times (e.g., 1ms–100ms), and each of their servers can receive tens or even a few hundred thousand queries per second (QPS). The queries are normally of various types with different complexity and latency characteristics. Customers usually want to know the response times a system can offer for their queries to ensure their latency requirements are met. To set clear performance goals for themselves and realistic expectations for customers, system operators establish service level objectives (SLOs) (Jones et al., 2017) on query response times. These latency SLOs are typically defined in terms of percentiles (e.g., p50=10ms and p90=60ms), and having separate SLOs for different classes of queries is common.

Because users drive the inbound traffic, they can cause surges in which numerous simultaneous requests exceed the maximum load a serving system can handle with the provisioned resources (LeFebvre, 2001; Kim, 2018; Purnell, 2020). Besides legitimate, well-intended users, malicious agents launching distributed denial of service (DDoS) attacks (Jonker et al., 2017; Kopp et al., 2021; Toh et al., 2022) can exacerbate the situation. Even with the inbound traffic below the tolerable limit, there are other reasons for online data systems to become overloaded. An unplanned reduction in the system’s capacity (i.e., the maximum traffic load, in QPS, that the system is able to sustain) can result from network outages, node failures, configuration changes, and software bugs. Moreover, Internet companies typically conduct live traffic load tests (Veeraraghavan et al., 2016; Mallapur and Kehoe, 2017) on a regular basis (e.g., daily) to confirm that their system infrastructure can handle the extra load when outages in one or more data centers occur.

To ensure continued operation throughout possible load surges, online data systems (and their surrounding infrastructure) employ complementary overload management techniques (Cuervo, 2017a; Zhang et al., 2018) that tackle the problem from different angles, such as load balancing (Lewandowski, 2017; Cuervo, 2017b; Chou et al., 2019), per-client quotas and client-side throttling (Cuervo, 2017a; Szopa et al., 2016), resource scheduling (Colajanni et al., 1997; Schroeder and Harchol-Balter, 2006), and graceful degradation (Liu et al., 1994; Mittal, 2016; Agarwal et al., 2013; Ulrich, 2017). Another angle is admission control (load shedding) (Chen and Mohapatra, 2002; Welsh and Culler, 2003; Elnikety et al., 2004; Blanquer et al., 2005; Tozer et al., 2010; Xiong et al., 2011; Zhang, 2014; Ulrich, 2017; Zhou et al., 2018; Hao et al., 2017; Cho et al., 2020; Noll et al., 2019; Chapnik et al., 2022), which is the focus of this paper.

Based on the system’s load status, an admission control module decides to accept or reject client requests so that the system continues providing an acceptable service when receiving excessive traffic. Some admission control techniques (e.g., (Heiss and Wagner, 1991; Mönkeberg and Weikum, 1992; Elnikety et al., 2004)) concentrate on allowing the system to serve a sustained throughput without exceeding its capacity, protecting it from performance collapse. But for online data systems, “acceptable service” also means that admitted queries should meet or at least closely track the latency SLOs. Hence, in this paper we present Bouncer (§3), a measurement-based admission control policy whose main goal is to keep serviced queries under or close to their latency SLOs defined in terms of percentile response times. It implements early rejections (Ulrich, 2017; Hao et al., 2017; Cho et al., 2020) to help data systems avoid doing useless work and to give clients more flexibility to fail over promptly. It is also designed to complement other admission control policies that keep systems from exceeding their capacity. For workloads with diverse queries, Bouncer allows assigning SLOs per query type.

The main challenge here lies in estimating percentile response times quickly and effectively for query-by-query acceptance decisions. Bouncer computes estimates of percentile response times for each incoming query by combining inexpensive approximations of queue wait time and processing time from recent query executions, and compares the estimates with the objective values to decide whether to accept or reject the query. An additional issue is that Bouncer’s basic formulation may systematically deny service to some query types, especially under heavy load. To prevent query types from receiving no service (starving), we supplement Bouncer with two alternative starvation avoidance strategies (§4).

We evaluate Bouncer and its starvation-avoiding variants against several in-house policies, in simulation (§5.3) and on LIquid, a production-grade in-memory distributed graph database (§5.4). We describe the in-house policies in §5.2 and provide an overview of LIquid (Meyer et al., 2020a, b) in §5.1. We show that Bouncer and its variants are able to make effective decisions, and allow admitted queries to meet or stay near the latency objectives when the other policies do not. They report fewer overall rejections than the other policies (by ∼15%–30%), a small overhead (mean = 18µs) for millisecond-scale queries, and with the given latency SLOs, they let the system reach high utilization. Further, we show that the proposed strategies can stop query starvation, but at the expense of a modest increase in overall rejections and SLO violations for serviced requests; however, our results on the LIquid graph database suggest the violation counts may be acceptable in practice.

To sum up, the contributions of this paper are:

  • an admission control policy, called Bouncer, that promptly rejects queries expected to surpass their response time objectives and lets admitted queries meet or stay near theirs;

  • two starvation avoidance strategies – acceptance-allowance (§4.1) and helping-the-underserved (§4.2) – that complement Bouncer and protect query types from systemic service denial; and

  • an extensive evaluation of Bouncer and its starvation-avoiding variants in simulation and on LIquid (i.e., a real system).

The paper ends with related work (§6) and our conclusions and future work (§7). In addition, it provides a supplementary discussion of our plan to handle cold starts and traffic lulls in Bouncer (Appendix A) and of practical considerations (Appendix B) related to the selection of percentiles for the latency SLOs and Bouncer’s configuration effort.

2. Motivation

In this section, we discuss the motivation behind the main design decisions in Bouncer.

Early rejections

Online data systems often operate as shared services at the bottom of a multi-tier architecture, with microservices (Newman, 2021) in the middle and API gateways (Richardson, 2019) at the top as entry points for external mobile or web client applications. Top- and mid-tier services set expiration times (deadlines) for their requests to downstream services, and give up on requests that time out in order to offer users timely, yet lower-quality responses. For example, suppose a mid-tier service sends simultaneous requests r1, r2, and r3 with a 50ms deadline, leaving enough time for the upstream services to produce their responses and for the mobile app to render the results. If r3 takes longer than the deadline (say, 55ms), then the service, after waiting until the deadline expires, would ignore r3’s response and may choose to return only the results from r1 and r2 to its client.

When an online data system is under heavy load and operates nearly at peak capacity, its response times increase, and if no remediation is in place, queries may approach or reach their expiration times, which can have several consequences. First, longer response times exacerbate the resource usage (e.g., memory and threads) of mid-tier services because the longer a query takes, the longer the upstream services in the call chain wait and hold on to resources (until deadlines expire). Second, depending on how tight the deadlines are compared to the expected upper-bound response times, upstream services may have little time to react to expired requests and produce alternative results. Continuing with our example, if the mid-tier service is expected to respond in no more than 60ms, after waiting 50ms for r3 it would only have 10ms to provide a substitute, degraded response. Third, deadline expirations throughout the service call chains become more likely when queries take longer; when queries time out at the data system or their responses expire on their way up to the users, the system unnecessarily spends memory, CPU, and other resources – precious during overload – and any work it does to answer the queries goes to waste.

To alleviate the above issues, we adopt a fail-early-and-cheaply approach (Ulrich, 2017) to admission control on the online data system. Bouncer rejects queries expected to miss their latency objectives straight away after arrival (i.e., these queries never make it into the data system’s queue). By giving early rejections (Hao et al., 2017; Cho et al., 2020), it 1) helps reduce resource usage of upstream services, 2) offers these services more flexibility to decide the next action to obtain alternative results, and 3) lowers the chances of the data system doing useless work. Further, on the “cheaply” side, Bouncer uses inexpensive response time estimates to make acceptance decisions.

Complementing capacity-centric policies

We design Bouncer to complement, rather than replace, admission control policies aimed at guarding data systems from exceeding their capacity. The reason is that we cannot easily guarantee that a system is protected against overloading just by enforcing percentile response times, because the level of protection depends on the chosen percentile values. For example, suppose a system can answer individual queries of type A in 10ms on average, is currently dedicated to serving such queries, and can process 64 queries in parallel; the system can thus sustain 6,400 (= 1/0.01 × 64) type-A queries per second. If we adopt the customer’s latency requirement of SLO_p50 = 100ms for the A queries (rather than an SLO_p50 closer to 10ms, in anticipation of other query types soon being in production), then Bouncer will only reject queries after the system has taken on work beyond its capacity. Moreover, finding the right latency SLOs to ensure the system is protected becomes more challenging when the workload includes queries of various types with different performance characteristics. By contrast, some capacity-centric policies (e.g., (Cherkasova and Phaal, 1998) and AcceptFraction in §5.2.3) just require setting a few parameters, such as the maximum utilization threshold.

Note that being complementary to capacity-centric policies also means that Bouncer should not prevent the system from processing queries at its full capacity. But, as illustrated in our example, that is not difficult to achieve with loose latency requirements.

Query type awareness

Online data systems normally serve workloads carrying queries of various types with different complexity and latency characteristics. In graph databases, for example, simple edge queries, which return the vertices directly connected to a given vertex, are usually fast while graph distance queries, which determine the shortest distance between two vertices, can take longer. We choose to differentiate query types in Bouncer as prior research shows that 1) a query’s type often has a big impact on its response time (Elnikety et al., 2004), and 2) admission control techniques oblivious to query types tend to reject more queries than needed because they cannot determine which queries are more beneficial to reject (Tozer et al., 2010).

3. The “Bouncer” Policy

Bouncer is an admission control policy that makes acceptance decisions based on the latency SLOs of queries. It estimates percentile response times for each incoming query and compares them with the target percentile response times in the SLO to decide whether to accept or reject the query.

Figure 1. Query processing with admission control and metrics collection.

Bouncer is built atop a software framework that resembles a stage in the staged event-driven architecture (SEDA) with the multi-class overload controller (Welsh and Culler, 2003); the framework facilitates implementing thread-safe, highly concurrent policies. Figure 1 depicts the query processing in the framework. When a new query arrives, the policy examines it and, based on metrics gathered from recent executions, decides to admit or reject it. If admitted, the query is inserted into the FIFO queue to wait for its turn to be processed; otherwise, the policy drops it and instructs the server host to reply with an error response. A fixed number of query engine processes (or threads) dequeue the admitted queries and process each independently. The framework offers methods for the policies to record time intervals and metrics they need (e.g., queue wait time, processing time, queue length, and rejection counts) at several points in the process: Point 1, after the admission or rejection decision is made; Point 2, after a query is dequeued for processing; and Point 3, after a query has been processed and before the response is sent back to the client.
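To make the framework’s structure concrete, the following single-threaded Python sketch mirrors the flow in Figure 1 and its three measurement points. The class and method names (AdmissionPolicy, on_decision, etc.) are illustrative assumptions for this sketch; they are not LIquid’s actual, concurrent and lock-free implementation.

import queue
import time

class AdmissionPolicy:
    """Illustrative policy interface; a concrete policy (e.g., Bouncer) overrides these hooks."""
    def can_admit(self, query) -> bool:
        return True                                   # accept everything by default
    def on_decision(self, query, admitted, now):      # Point 1: after accept/reject
        pass
    def on_dequeue(self, query, now):                 # Point 2: after a query is dequeued
        pass
    def on_complete(self, query, now):                # Point 3: after processing completes
        pass

def serve(incoming, policy, process_fn):
    """Single-threaded stand-in for the stage: admit -> FIFO queue -> process."""
    fifo = queue.SimpleQueue()
    responses = []
    for q in incoming:
        admitted = policy.can_admit(q)
        policy.on_decision(q, admitted, time.monotonic())     # Point 1
        if admitted:
            fifo.put(q)                                        # waits for a query engine process
        else:
            responses.append((q, "rejected"))                  # early error response
    while not fifo.empty():
        q = fifo.get()
        policy.on_dequeue(q, time.monotonic())                 # Point 2
        result = process_fn(q)
        policy.on_complete(q, time.monotonic())                # Point 3
        responses.append((q, result))
    return responses

# Example: "queries" that are just sleep durations (in seconds).
out = serve([0.001, 0.002], AdmissionPolicy(), lambda q: time.sleep(q) or "ok")
print(out)   # [(0.001, 'ok'), (0.002, 'ok')]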

We assume that every request includes a short string indicating the type of the query it carries (e.g., part of the REST URL endpoint’s path or the name of a datalog-like rule (Abiteboul et al., 1995; Meyer et al., 2020b)). Thus, the policy is configured with strings denoting the query types and, for each type, a latency SLO with the target percentile response times; for example: "Fast":{p50=10ms, p90=90ms}, "Slow":{p50=60ms, p90=270ms}, and "default":{p50=30ms, p90=400ms}. (These query types are merely for illustration purposes; in practice, they often have more sensible names, e.g., GetFriends and GraphDistance.) Note that default is a “catch-all” query type.

The time a query Q spends in the FIFO queue is its queue wait time; i.e., wt(Q) = t_dequeued(Q) − t_enqueued(Q). We define a query Q’s processing time, denoted as pt(Q), as the time interval from the instant at which Q is pulled from the head of the queue to the instant at which Q has been fully processed and the response is ready to be sent back to the client; i.e., pt(Q) = t_completed(Q) − t_dequeued(Q). Then, the response time of a query Q is:

(1) rt(Q) = wt(Q) + pt(Q) + ξ

where ξ is some additional time the server host takes (e.g., in the network stack and operating system) to handle the query. In our experience, ξ is often negligible, and we assume ξ = 0.

Separate query types often have different processing time distributions that vary over time. Bouncer adopts the natural approach of maintaining approximations of these distributions in histograms, one per query type (including default); it periodically updates the histograms at run time using a dual-buffer technique: while one histogram is only read, a second histogram is being populated; at the end of a time interval the two histograms are swapped atomically, and the old histogram is reset before being populated again.
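The following Python sketch shows one way the dual-buffer technique could be realized; the bucket layout and the lock (standing in for the atomic swap) are our illustrative choices, not LIquid’s implementation.

import threading

class DualBufferHistogram:
    """Two buffers per query type: one read-only for estimates, one being populated.
    At the end of each interval the buffers are swapped; here a lock stands in for
    the atomic swap used in practice."""
    def __init__(self, bucket_ms=1.0, num_buckets=1000):
        self.bucket_ms = bucket_ms
        self.active = [0] * num_buckets      # populated with new samples
        self.frozen = [0] * num_buckets      # read by the admission policy
        self.lock = threading.Lock()

    def record(self, processing_time_ms):
        i = min(int(processing_time_ms / self.bucket_ms), len(self.active) - 1)
        with self.lock:
            self.active[i] += 1

    def swap(self):
        """Called periodically (e.g., once per second) by a background task."""
        with self.lock:
            self.active, self.frozen = self.frozen, self.active
            for i in range(len(self.active)):
                self.active[i] = 0           # reset the old histogram before reuse

    def mean(self):
        with self.lock:
            total = sum(self.frozen)
            if total == 0:
                return 0.0
            weighted = sum(c * (i + 0.5) * self.bucket_ms
                           for i, c in enumerate(self.frozen))
            return weighted / total

    def percentile(self, p):
        """Approximate p-th percentile (p in [0, 1]) from the frozen buffer, in ms."""
        with self.lock:
            total = sum(self.frozen)
            if total == 0:
                return 0.0
            rank, seen = p * total, 0
            for i, count in enumerate(self.frozen):
                seen += count
                if seen >= rank:
                    return (i + 0.5) * self.bucket_ms
        return float(len(self.frozen)) * self.bucket_ms

h = DualBufferHistogram()
h.record(3.2); h.record(7.8); h.swap()
print(h.percentile(0.5), h.percentile(0.9))   # 3.5 7.5 (bucket midpoints)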

Figure 2. Operation of Bouncer admission control policy.

Figure 2 illustrates Bouncer’s operation. For every query Q that arrives at the host, it computes an estimate of the mean queue wait time that Q will experience as:

(2) ewt_mean = Σ_{type ∈ QT} ( count(type) × pt_mean(type) ) / P

where QT is the set of query types the policy recognizes, including default; count(type) is the number of queries of the given type currently in the queue; pt_mean(type) is the mean processing time for queries of the given type, obtained from the corresponding histogram; and P is the number of query engine processes running on the host (i.e., the level of task parallelism for query processing). Bouncer maintains per-type atomic counts of the queries currently in the queue, and updates the counts as queries are enqueued and dequeued. The histograms and query counts are stored in hash maps keyed by query type for easy and quick access.
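As an illustration of Eq. 2, the sketch below computes ewt_mean assuming the per-type queue counts and mean processing times are exposed as plain Python dictionaries; the function and parameter names are ours.

def estimate_mean_queue_wait(queued_counts, mean_pt_ms, num_processes):
    """Eq. 2: expected mean wait = total expected work sitting in the queue / parallelism.

    queued_counts: dict mapping query type -> number of queries of that type in the queue
    mean_pt_ms:    dict mapping query type -> mean processing time (ms) from its histogram
    num_processes: P, the number of query engine processes on the host"""
    total_work_ms = sum(count * mean_pt_ms.get(qtype, 0.0)
                        for qtype, count in queued_counts.items())
    return total_work_ms / num_processes

# Example: 30 "Fast" (2 ms mean) and 10 "Slow" (20 ms mean) queries queued, P = 64.
ewt = estimate_mean_queue_wait({"Fast": 30, "Slow": 10}, {"Fast": 2.0, "Slow": 20.0}, 64)
print(round(ewt, 2))   # ~4.06 ms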

Bouncer estimates the percentile response times of Q as:

(3) ert_p50(Q) = ewt_mean + pt_p50(Type(Q))
(4) ert_p90(Q) = ewt_mean + pt_p90(Type(Q))

where Type(Q) returns the type of the query Q, and pt_p50(type) and pt_p90(type) are the 50th- and 90th-percentile processing times for queries of the given type, obtained from the corresponding histograms. Finally, after calculating ert_p50 and ert_p90, Bouncer decides to accept or reject Q according to Algorithm 1.

if ( ert_p50(Q) > SLO_p50(Q) || ert_p90(Q) > SLO_p90(Q) ) then Reject;
else Accept;
Algorithm 1 Decision on incoming query Q.
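Putting Eqs. 3 and 4 and Algorithm 1 together, a minimal Python sketch of the per-query decision could look as follows; the dictionary-based configuration mirrors the SLO example above but is otherwise an assumption of this sketch.

def can_admit(query_type, ewt_mean_ms, pt_percentiles_ms, slo_ms):
    """Eqs. 3-4 and Algorithm 1: accept only if both percentile estimates fit the SLO.

    ewt_mean_ms:       estimate of the mean queue wait time (Eq. 2)
    pt_percentiles_ms: dict: type -> {"p50": ..., "p90": ...} processing times from the histograms
    slo_ms:            dict: type -> {"p50": ..., "p90": ...} latency SLO
    Unknown types fall back to the "default" entry, as in Bouncer's configuration."""
    pt = pt_percentiles_ms.get(query_type, pt_percentiles_ms["default"])
    slo = slo_ms.get(query_type, slo_ms["default"])
    ert_p50 = ewt_mean_ms + pt["p50"]      # Eq. 3
    ert_p90 = ewt_mean_ms + pt["p90"]      # Eq. 4
    return ert_p50 <= slo["p50"] and ert_p90 <= slo["p90"]

# Example with the SLOs from the configuration above ("Fast": p50=10ms, p90=90ms).
pt = {"Fast": {"p50": 3.0, "p90": 12.0}, "default": {"p50": 8.0, "p90": 40.0}}
slo = {"Fast": {"p50": 10.0, "p90": 90.0}, "default": {"p50": 30.0, "p90": 400.0}}
print(can_admit("Fast", 4.0, pt, slo))   # True: 4+3 <= 10 and 4+12 <= 90
print(can_admit("Fast", 9.0, pt, slo))   # False: 9+3 > 10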

Note that Equations 2, 3, and 4 are inexpensive estimates rather than precise expressions; we adopt them to keep the policy’s overhead low. This trade-off of accuracy for speed is of practical importance because the computations Bouncer performs are in the critical path of the queries. Our experiments (§5) show, however, that the policy is effective despite this accuracy loss. Moreover, this policy formulation can be easily modified to support SLOs with other percentile response times (e.g., p99) in lieu of or in addition to p50 and p90, and to adopt different logical expressions for acceptance decision making in Algorithm 1. The evaluation of alternative formulations of Bouncer is left as future work, but Appendix B offers some practical considerations about selecting percentiles for latency SLOs.

Also note that the Bouncer policy, as formulated above, suffers from two problems. The first one is query starvation since the policy may systematically deny service to query types. The second problem is that Bouncer is susceptible to cold starts and lulls in traffic as its histograms start in a blank state and may become empty for query types with intermittent patterns. We address query starvation in the next section, but leave the issue of cold starts and traffic lulls as future work and just discuss our plan in Appendix A.

4. Avoiding Query Starvation

Some query types have latency SLOs that are tighter than those of other query types (i.e., their percentile response times are closer to the values in their SLOs). Since admitted queries share the same FIFO queue, it is possible for queries with looser SLOs to cause queries with tighter SLOs to be rejected in large numbers, especially at high QPS. Figure 3 illustrates the problem, which we reproduced by sending traffic at a high rate to an experimental cluster running the LIquid graph database (§5.1) with basic Bouncer. We selected two types of queries, identified as FAST and SLOW, and generated traffic from a set of millions of queries sampled from production. Both query types are configured with SLO_p50 = 18ms and SLO_p90 = 50ms, shown as dotted lines; thus, the SLO is tighter for the SLOW queries than for the FAST ones. The figure reports the p50 and p90 response time estimates for both query types in a one-second interval. We can see that the FAST queries make the SLOW queries “starve,” as nearly 100% of the SLOW queries exceed the SLO and get rejected in the interval. Therefore, Bouncer needs to alleviate query starvation, and we discuss two alternative strategies for that purpose next.

Figure 3. Example of query starvation. The latency SLO is the same for FAST and SLOW queries: SLO_p50 = 18ms and SLO_p90 = 50ms (dotted lines). Around 99% of SLOW queries are rejected, while less than 10% of FAST queries are rejected.

4.1. Acceptance Allowance

Input: Q: incoming query.
Data: SW: sliding window; A ∈ [0.0, 1.0]: acceptance allowance.
Output: decision: Accept or Reject.

type ← Type(Q);
aqc ← SW.GetAcceptedQueryCount(type);   // Historical counts
rqc ← SW.GetQueryCount(type);
decision ← Reject;
if rqc = 0 then decision ← Accept;
else
    AR ← aqc / rqc;   // Acceptance ratio
    if AR < A then decision ← Accept;
if decision = Reject then
    decision ← Bouncer.CanAdmit(Q);   // Ask the policy
if decision = Reject then
    if rand() < A then decision ← Accept;   // On the spot
if decision = Accept then SW.IncrementAcceptedQueryCount(type);
SW.IncrementQueryCount(type);

Algorithm 2 Acceptance allowance strategy.

This strategy ensures that a small percentage of queries of each type is always admitted. For that, it gives a little allowance to each query type to get accepted and processed by the system, hence its name. The strategy, shown in Algorithm 2, operates on a sliding window SW with duration D and time step Δ, where D ≫ Δ (e.g., D = 1s and Δ = 10ms). The sliding window tracks the number of accepted queries (aqc) and received queries (rqc; i.e., accepted plus rejected) per query type. The parameter A ∈ [0.0, 1.0] represents the acceptance allowance and is expected to be small (0.01–0.03). Setting A = 0.01 means that we are willing to give “free passes” to up to 1% of the queries of each type over the span of the sliding window. Although admission decisions are made per query type, we use the same value of A irrespective of the type so that the strategy has few configuration parameters.

The call to Bouncer splits the strategy into two parts. The first part makes decisions based on the historical query counts (aqc and rqc) and their quotient (the acceptance ratio), while the second part overrides rejection decisions “on the spot,” uniformly at random with probability equal to A.

Besides protecting query types from systemic service denial, this strategy, by always letting a few queries of each type in, ensures that the processing time histograms Bouncer uses for admission decisions get populated.
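A possible Python rendering of Algorithm 2 is sketched below; the sliding-window class is a simplified stand-in (fixed-size per-step buckets), and all names are illustrative rather than part of our implementation.

import random
from collections import deque, defaultdict

class SlidingWindowCounts:
    """Per-type [received, accepted] counts over the last num_steps time steps."""
    def __init__(self, num_steps=100):                    # e.g., D = 1s and delta = 10ms
        self.buckets = deque([defaultdict(lambda: [0, 0])], maxlen=num_steps)

    def advance(self):                                     # called every delta to age out old counts
        self.buckets.append(defaultdict(lambda: [0, 0]))

    def record(self, qtype, accepted):
        cell = self.buckets[-1][qtype]
        cell[0] += 1                                       # received
        if accepted:
            cell[1] += 1                                   # accepted

    def counts(self, qtype):
        received = sum(b[qtype][0] for b in self.buckets if qtype in b)
        accepted = sum(b[qtype][1] for b in self.buckets if qtype in b)
        return accepted, received

def admit_with_allowance(query_type, bouncer_can_admit, window, allowance=0.05):
    """Algorithm 2: guarantee each type roughly `allowance` of its traffic."""
    aqc, rqc = window.counts(query_type)
    decision = rqc == 0 or (aqc / rqc) < allowance         # historical part
    if not decision:
        decision = bouncer_can_admit()                     # ask the basic Bouncer policy
    if not decision and random.random() < allowance:       # "on the spot" free pass
        decision = True
    window.record(query_type, decision)
    return decision

# Example: Bouncer would reject everything, yet ~5% of "Slow" queries still get in.
w = SlidingWindowCounts()
admitted = sum(admit_with_allowance("Slow", lambda: False, w) for _ in range(10_000))
print(admitted)   # roughly 500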

4.2. Helping the Underserved

Figure 4. Intent of helping the underserved.
Input: Q: incoming query.
Data: SW: sliding window; QT: set of query types; α ∈ (0.0, 1.0]: scaling factor.
Output: decision: Accept or Reject.

type ← Type(Q);
decision ← Bouncer.CanAdmit(Q);   // Ask the policy
if decision = Reject then
    // Acceptance ratio for the query type
    AR ← SW.GetAcceptedQueryCount(type) / max(SW.GetQueryCount(type), 1);
    // Average acceptance ratio across query types
    AAR ← Σ_{t ∈ QT} [ SW.GetAcceptedQueryCount(t) / max(SW.GetQueryCount(t), 1) ];
    AAR ← AAR / max(|QT|, 1);
    if AR < AAR then
        x ← (AAR − AR) / AAR;
        p ← α · x / (1 + x);
        if rand() < p then decision ← Accept;
if decision = Accept then SW.IncrementAcceptedQueryCount(type);
SW.IncrementQueryCount(type);

Algorithm 3 “Helping the underserved” strategy.

Rather than assigning a fixed allowance to the query types, this strategy is more dynamic and tries to help query types that have been rejected more than others. A query type is deemed to have been treated unfavorably when its acceptance ratio (the quotient between accepted and received queries) is lower than the average acceptance ratio across query types (see Figure 4).

The strategy is presented in Algorithm 3. Like the previous one, it operates on a sliding window SW, with duration D and time step Δ, which tracks the number of accepted and received queries per query type. A key step in this strategy is the calculation of p, which represents the probability of overriding Bouncer’s decision to reject a query Q when the acceptance ratio for Q’s type is lower than the average acceptance ratio (AR < AAR). We use a heuristic to calculate p. We do not set p = (AAR − AR) / AAR because that would give unfavored query types excessive help (i.e., if AR is very small, then p ≈ 1). Instead, we use the sigmoid function p = α · x / (1 + |x|), with x = (AAR − AR) / AAR, to reduce and smooth “the help” given to query types, where α ∈ (0.0, 1.0] is a configurable scaling factor.

While this strategy may help Bouncer populate its histograms, it is not guaranteed to be effective in doing so because its admission decisions are probabilistic, based on the value of p. For example, it is possible that a series of queries of a rarely-seen type may get unlucky and all be rejected by both Bouncer and the strategy.
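For illustration, the core of Algorithm 3 can be sketched in Python as follows; for brevity the window’s per-type counts are passed as plain dictionaries, and the names are ours.

import random

def admit_helping_underserved(query_type, bouncer_can_admit,
                              accepted_counts, received_counts, alpha=1.0):
    """Algorithm 3 sketch: probabilistically override rejections for underserved types.

    accepted_counts / received_counts: per-type counts over the sliding window
    (plain dicts here for brevity)."""
    if bouncer_can_admit():
        return True
    # Acceptance ratio for this type, and the average across all known types.
    ar = accepted_counts.get(query_type, 0) / max(received_counts.get(query_type, 0), 1)
    ratios = [accepted_counts.get(t, 0) / max(received_counts.get(t, 0), 1)
              for t in received_counts]
    aar = sum(ratios) / max(len(ratios), 1)
    if ar >= aar:
        return False                       # this type is not underserved
    x = (aar - ar) / aar                   # how far below the average this type sits
    p = alpha * x / (1 + x)                # smoothed help; at most alpha/2 as x -> 1
    return random.random() < p

# Example: the "slow" type has been accepted far less often than the average type.
acc = {"fast": 950, "slow": 20}
rec = {"fast": 1000, "slow": 900}
print(admit_helping_underserved("slow", lambda: False, acc, rec))  # True with probability ~0.49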

5. Experimental Evaluation

In this section, we evaluate Bouncer (§3) and its variants with the starvation avoidance strategies discussed in §4. We conduct two studies, one using simulation (§5.3) and the other on LIquid distributed graph database (§5.4), that compare Bouncer with other policies. We first give an overview of LIquid and then describe the policies we evaluate Bouncer against. After presenting the studies, the section ends with a summary of the results (§5.6).

5.1. LIquid Graph Database

Figure 5. LIquid cluster, unit of deployment.

LIquid (Meyer et al., 2020a, b) is an in-memory distributed graph database built at LinkedIn to answer online, interactive queries at high throughput and with low latency. Among other data corpora, it serves LinkedIn’s Economic Graph (Corp., 2022a) that includes hundreds of millions of vertices and hundreds of billions of edges. To store large graphs, LIquid is deployed on a cluster of machines organized in two tiers, brokers and shards (see Figure 5). Our data centers host multiple LIquid clusters that act as replicas to serve large volumes of traffic (several millions of queries per second) with high availability.

A LIquid cluster breaks up the graph into multiple data shards and assigns them to separate shard hosts, which store and index the data in memory (Carter et al., 2019). The shard hosts load data generated by offline jobs from remote persistent storage. They also receive a continuous feed of updates (e.g., via Kafka (Kreps et al., 2011)) from source-of-truth databases, and each shard keeps the updates belonging to its slice of the graph.

The broker hosts offer REST endpoints for clients to send query requests. LIquid’s request processing is stateless, as client requests are considered independent units. When a broker receives a query from a client, the broker sends sub-queries to the shard hosts to fetch data from them. Answering a query involves one or more communication rounds between the broker and the shards. At the end of each round, the broker accumulates the shards’ responses and processes the sub-query results before starting the next round. Once the last communication round completes, the broker sends a response with the query result back to the client.

Brokers and shards implement the admission control framework described in §3. They run a configurable number of query engine processes that cycle between obtaining an admitted (sub-)query from the FIFO queue and processing it; these processes operate independently and in a lock-free manner. Brokers and shards also enforce expiration times for admitted queries. In our evaluation, shards, which do the heavy lifting in query processing, use the acceptance fraction policy (§5.2.3) governed by a utilization threshold, while brokers use Bouncer or one of the policies in §5.2.

5.2. Other Admission Control Policies

Here we describe several in-house policies available in LIquid and used in our comparative evaluation. They are built atop the framework described in §3, but use different criteria for decision making. Unlike Bouncer, these policies are oblivious to query types.

5.2.1. Maximum Queue Length (MaxQL) Policy

It simply accepts an incoming query only if the FIFO queue’s length is less than a configurable length limit (l < L_limit) (Iyer et al., 2001).

5.2.2. Maximum Queue Wait Time (MaxQWT) Policy

It admits an incoming query Q only if the estimate of Q’s mean queue wait time is less than or equal to a configurable time limit (ewt_mean ≤ T_limit). The mean queue wait time is estimated as:

(5) ewt_mean = l × pt_mavg / P

where l is the FIFO queue’s current length; pt_mavg is the moving average of query processing times over a sliding window of duration D and time step Δ, with D ≫ Δ; and P is the number of processes responsible for processing queries.

MaxQWT is an experimental policy we include to evaluate the use of the mean queue wait time as a metric for per-query admission decision making, rather than as an overload signal (Zhou et al., 2018; Cho et al., 2020). Its main configuration parameter is the maximum wait time, which it enforces without differentiating query types. We discuss the effect of lifting this limitation in §5.5. Unless stated otherwise, MaxQWT’s sliding window is configured with D = 60s and Δ = 1s.

5.2.3. Acceptance Fraction (AcceptFraction) Policy

It periodically computes the fraction of queries that the host should accept as:

f = min( 1.0, available processing capacity / demanded processing capacity )

A host’s available processing capacity is the number of processing units the host has at its disposal to process queries, while its demanded processing capacity is the number of processing units it needs to fully serve the incoming flux of queries.

This policy calculates the available processing capacity as APC = MaxUtil × |PU|, where MaxUtil ∈ (0.0, 1.0] is the maximum utilization threshold, and |PU| is the number of processing units set aside for query processing. MaxUtil and |PU| are configuration parameters, and APC, once computed, remains fixed throughout the system’s operation. The policy also periodically computes the demanded processing capacity, which generally varies over time, as dpc = qps_mavg × pt_mavg, where qps_mavg is the moving average of the incoming traffic rate in queries per second, and pt_mavg is the moving average of the query processing times. Both moving averages are obtained over a sliding window of duration D and time step Δ, with D ≫ Δ. Thus, the fraction of queries that should be accepted is given by:

f = min( 1.0, (MaxUtil × |PU|) / (qps_mavg × pt_mavg) )

Note that dpc = qps_mavg × pt_mavg may be a small positive value or even zero; we rely on standard floating-point arithmetic to handle these cases (e.g., if dpc = 0.0, then f = min(1.0, inf) = 1.0).

The policy then accepts queries with probability equal to the current value of f. If f = 1.0, it accepts every query, but if f < 1.0, it probabilistically rejects (1.0 − f) × 100% of the queries (i.e., the percentage exceeding the available capacity).

The number of processing units (|PU|) represents the level of parallelism for query processing on a host. On shards, |PU| is set to the number of CPU cores dedicated to processing sub-queries from brokers, whereas on brokers |PU| is the number of processes responsible for handling queries from clients, issuing sub-queries to and processing responses from the shards.

AcceptFraction also estimates the mean queue wait time of every query using Eq. 5 with P = |PU|, and rejects the queries expected to time out in the queue. Unless stated otherwise, this policy is configured to update the demanded processing capacity (dpc) every second and with D = 60s and Δ = 1s for the sliding window.
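The sketch below illustrates, in Python, how the acceptance fraction f could be computed and applied; the function names and the ZeroDivisionError handling (Python’s analogue of the floating-point behavior noted above) are our own simplifications.

import random

def acceptance_fraction(max_util, num_processing_units, qps_mavg, pt_mavg_sec):
    """f = min(1, available capacity / demanded capacity)."""
    available = max_util * num_processing_units           # APC
    demanded = qps_mavg * pt_mavg_sec                      # dpc
    try:
        return min(1.0, available / demanded)
    except ZeroDivisionError:                              # Python raises instead of returning inf
        return 1.0

def accept(f):
    """Probabilistically reject (1 - f) x 100% of the queries."""
    return random.random() < f

# Example: 64 processing units, 95% maximum utilization, 8,000 QPS of 10 ms queries on average.
f = acceptance_fraction(0.95, 64, qps_mavg=8000, pt_mavg_sec=0.010)
print(round(f, 3))   # 0.76 -> roughly 24% of the queries get rejected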

5.3. Simulation Study

We compare the basic behavior of the policies listed in Table 2 using a discrete event-driven simulator we wrote in Python 3. The simulator implements the framework in Figure 1. It assumes a query engine with a fixed number of processes and gives the admitted queries to the idle processes on a first-come, first-served basis.

We simulate a LIquid broker host (§5.1) with 100 query engine processes, a number of the same order of magnitude as used in practice. Table 1 lists the query types used in this study. Each type is given a fixed percentage among the generated queries (i.e., its proportion in the query mix), and its processing times follow a lognormal distribution, which approximates those of real production queries.

Table 1. Query types used in the simulation study.

Anonymized Query Type | Proportion in Query Mix | pt_mean (ms) | pt_p50 (ms) | pt_p90 (ms)
fast                  | 40%                     | 1.16         | 0.38        | 2.70
medium fast           | 20%                     | 2.53         | 2.22        | 4.27
medium slow           | 30%                     | 12.13        | 7.40        | 26.44
slow                  | 10%                     | 20.05        | 12.51       | 44.26
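Since the processing times follow a lognormal distribution, the following sketch shows one way to parameterize such a distribution from a target median and 90th percentile, using the slow row of Table 1; this particular parameterization is an assumption on our part and not necessarily what the simulator does.

import math
import random

Z90 = 1.2816                                   # standard-normal 90th-percentile quantile

def lognormal_params(pt_p50_ms, pt_p90_ms):
    """mu, sigma of a lognormal with the given median and 90th percentile."""
    mu = math.log(pt_p50_ms)
    sigma = (math.log(pt_p90_ms) - math.log(pt_p50_ms)) / Z90
    return mu, sigma

def sample_processing_time(pt_p50_ms, pt_p90_ms):
    mu, sigma = lognormal_params(pt_p50_ms, pt_p90_ms)
    return random.lognormvariate(mu, sigma)

# "slow" row of Table 1: pt_p50 = 12.51 ms and pt_p90 = 44.26 ms.
mu, sigma = lognormal_params(12.51, 44.26)
print(round(mu, 3), round(sigma, 3))                       # ~2.527 0.986
samples = [sample_processing_time(12.51, 44.26) for _ in range(200_000)]
print(round(sorted(samples)[len(samples) // 2], 1))        # sample median close to 12.5 ms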

The traffic rate that fully utilizes the query engine is given by QPS_full_load = P / pt_wmean, where P is the number of query engine processes, and pt_wmean is the mean processing time of the query types weighted by their proportions in the query mix. We evaluate the policies with traffic rates ranging from 0.9 × QPS_full_load to 1.5 × QPS_full_load. From the values in Table 1, pt_wmean = 0.4×1.16 + 0.2×2.53 + 0.3×12.13 + 0.1×20.05 = 6.614ms. Then, with P = 100, QPS_full_load = 100 / (6.614 × 10^-3) ≈ 15.1 kQPS, and the traffic rate range is [13.6 kQPS, 22.7 kQPS].

The inter-arrival times for the queries are generated from an exponential distribution to simulate traffic burstiness. We subject the policies to the same incoming traffic and compare them on three dimensions: SLO violations, rejection ratio, and system utilization. Each simulation run produces 1.5 million queries, lasting between one and two minutes of simulated time, and is preceded by a warm-up phase to avoid capturing cold start effects in our results. Table 2 lists the parameters used for the policies in this study.

Table 2. Parameters for the admission control policies in the simulation study.

Policy                            | Parameters
Bouncer                           | SLO_p50 = 18ms; SLO_p90 = 50ms
Bouncer + acceptance-allowance    | A = 0.05
Bouncer + helping-the-underserved | α = 1.0
MaxQL                             | queue length limit = 400
MaxQWT                            | wait time limit = 15ms
AcceptFraction                    | utilization threshold = 95%

Bouncer is configured with the SLO_p50 and SLO_p90 values given above, even when supplemented with a starvation avoidance strategy.

5.3.1. Basic Bouncer Policy vs. The Other Policies

Here we evaluate Bouncer’s basic formulation (without starvation avoidance) against the policies in §5.2. We observe that, when the traffic load exceeds QPS_full_load, Bouncer keeps the serviced queries within the latency SLO whereas the other policies do not. Figure 6 shows this result for the slow queries, for which the SLO is the tightest; the other query types exhibit similar behavior.

Figure 6. Median response time (rt_p50) for slow queries. The latency SLO is the tightest for these queries (see Tables 1 and 2). AcceptFraction policy’s threshold utilization is 95%.

The median response time (rt_p50) for MaxQL, although above the SLO, plateaus at ∼40ms due to the limit this policy imposes on queue length. Similarly, MaxQWT plateaus at an rt_p50 of ∼22ms because it limits the time the queries wait in the queue; yet it exceeds the SLO since it does not take into account the percentile response times of the query types. By contrast, rt_p50 for AcceptFraction grows with the traffic rate, reaching the top-right corner of Figure 6. The reason is that in our simulation this policy imposes no limits on queue length and queue wait time. (Note that this is not the case in LIquid, where AcceptFraction rejects queries expected to time out and enforces a maximum queue length to safeguard against very high response times.) Thus, when the system is overloaded, the queue length grows significantly and so does rt_p50.

Figure 7. System’s utilization. AcceptFraction policy’s threshold utilization is 95%.
Figure 8. Percentage of overall rejections.

As shown in Figure 7, all the policies but one are able to nearly reach 100% utilization when the traffic approaches and exceeds QPS_full_load; the exception is AcceptFraction, which is limited by its utilization threshold of 95%. In addition, Figure 8 shows that as the QPS increases beyond QPS_full_load, the policies reject more queries to keep the system’s load at bay. Bouncer reports the lowest percentage of rejections for the following reasons. First, the queries it rejects the most are slow queries (see Table 3). Second, compared to the other query types, the slow queries have the highest processing time (closest to the latency SLO); thus, Bouncer needs to reject fewer queries of the slow type to prevent the system from overloading. The other policies report more rejections because they make no distinction between the types of queries and treat them all as equals. In particular, AcceptFraction has the highest rejection percentage because it is bounded by a 95% utilization threshold while the other policies are not.

Our simulation results show that Bouncer’s basic formulation performs better than the other policies during overload, as it keeps the queries within their SLOs and reports the fewest overall rejections, while allowing the system to reach nearly maximal utilization.

5.3.2. Bouncer Policy with Starvation Avoidance Strategies

Table 3. Percentage of rejections reported by Bouncer with and without starvation avoidance strategies.

Policy (Strategy) / Query Type (factor of QPS_full_load) | 0.9× | 0.95× | 1.0× | 1.05× | 1.1× | 1.15× | 1.2× | 1.25× | 1.3× | 1.35× | 1.4× | 1.45× | 1.5×

Bouncer (Basic Formulation)
  fast        | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium fast | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium slow | -0-  | -0-  | -0-  | -0-   | -0-   | 0.00  | 0.00  | 0.01  | 0.05  | 0.23  | 0.82  | 2.29  | 4.86
  slow        | 0.01 | 0.53 | 5.02 | 15.89 | 29.27 | 41.84 | 53.63 | 64.37 | 74.18 | 82.88 | 90.37 | 95.68 | 98.46
  ALL         | 0.00 | 0.05 | 0.50 | 1.59  | 2.93  | 4.18  | 5.36  | 6.44  | 7.43  | 8.36  | 9.28  | 10.25 | 11.30

Bouncer (Acceptance Allowance, A=0.1)
  fast        | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium fast | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium slow | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | 0.02  | 0.07  | 0.36  | 1.29  | 3.45  | 6.86  | 10.83
  slow        | 0.01 | 0.53 | 4.97 | 15.98 | 29.31 | 41.86 | 53.58 | 64.24 | 73.56 | 80.97 | 85.63 | 87.58 | 88.12
  ALL         | 0.00 | 0.05 | 0.50 | 1.60  | 2.93  | 4.19  | 5.36  | 6.45  | 7.46  | 8.48  | 9.60  | 10.82 | 12.06

Bouncer (Helping the Underserved, α=1.0)
  fast        | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium fast | -0-  | -0-  | -0-  | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
  medium slow | -0-  | -0-  | -0-  | -0-   | -0-   | 0.04  | 0.25  | 1.35  | 4.06  | 7.96  | 12.20 | 16.29 | 20.36
  slow        | 0.01 | 0.53 | 5.03 | 15.99 | 29.34 | 41.89 | 53.21 | 61.93 | 67.08 | 69.26 | 70.21 | 70.84 | 71.37
  ALL         | 0.00 | 0.05 | 0.50 | 1.60  | 2.94  | 4.20  | 5.40  | 6.60  | 7.93  | 9.31  | 10.68 | 11.97 | 13.25

Each cell reports the average of 5 simulation runs. -0- means absolute zero rejections.

Table 3 reports the rejection percentages for Bouncer with and without the starvation avoidance strategies at increasing traffic rates. Naturally, as the load grows, the overall rejections increase for the basic policy and the strategies. Slow queries experience the most rejections and are subject to starvation at high traffic rates. Medium slow queries also experience rejections, but far fewer, whereas fast and medium fast queries are never rejected. This result is expected since the processing times of slow and medium slow queries are the first and second closest to the latency SLO (see Tables 1 and 2).

We observe that Bouncer’s basic formulation rejects more than 90% of slow queries at traffic rates ≥ 1.4 × QPS_full_load. By contrast, both strategies, acceptance-allowance (§4.1) and helping-the-underserved (§4.2), keep rejections of slow queries below 90%.

The acceptance-allowance strategy limits the rejections of slow queries by enforcing the configured allowance of 10%. Hence, as the traffic load increases, the strategy causes rejections to shift from slow to medium slow queries (up to ∼11%), when compared to basic Bouncer. The reason is that the slow queries the strategy lets in take the room of other, less costly query types, and the medium slow queries, being the second most costly, are next in line to experience the effects. The helping-the-underserved strategy behaves similarly, but only rejects up to ∼71% of slow queries. The reason is that with α = 1.0, the probability of overriding Bouncer’s rejections can get close to 0.5. Thus, by rejecting fewer slow queries, the strategy forces more rejections of medium slow queries (up to ∼20%).

Compared to Bouncer’s basic formulation, the starvation avoidance strategies report a small rise in the overall rejection percentage (up to ∼1% and ∼2% increase for acceptance-allowance and helping-the-underserved, respectively). The strategies are expected to increase overall rejections because, to make room for some additional slow queries, a larger number of less expensive queries need to be rejected. Yet, the rise is very modest because only slow and medium slow queries suffer rejections, while the much cheaper fast and medium fast queries never do. This also means that Bouncer with starvation avoidance still yields fewer rejections than MaxQL, MaxQWT, and AcceptFraction (see Figure 8).

Figure 9. Median response time (rt_p50) for slow queries.

Also as expected, the strategies cause the response times to exceed the latency SLOs since, to avoid query starvation, they must accept some requests that Bouncer’s basic formulation would otherwise reject. Our results indicate that acceptance-allowance is more advantageous than helping-the-underserved in this regard. Figure 9 shows that acceptance-allowance lets the slow queries reach a higher QPS without violating the SLO_p50 requirement, when compared to helping-the-underserved. Moreover, acceptance-allowance reports lower median response times for the slow queries at high traffic rates. Acceptance-allowance’s advantage over helping-the-underserved stems from how they detect query starvation. Acceptance-allowance considers that a query type is starving when its acceptance ratio (AR), independently of the other query types, falls under the given allowance (A). Helping-the-underserved, on the other hand, considers that queries of a given type are starving when they are treated unfavorably relative to the other query types. Our results suggest that relative unfavorable treatment among query types tends to be observed at lower traffic rates than those at which per-query-type acceptance ratios become low.

Like Bouncer’s basic formulation, both strategies allow the system to reach close to 100% utilization, which is expected as they accept requests that plain Bouncer would reject.

5.3.3. Influence of Parameters on Query Starvation Avoidance

Table 4. Percentage of rejections of Bouncer with the acceptance-allowance strategy for increasing values of parameter A and traffic rate equal to 1.5 × QPS_full_load.

A                                        | 0.01  | 0.02  | 0.03  | 0.04  | 0.05  | 0.06  | 0.07  | 0.08  | 0.09  | 0.1   | 0.2   | 0.3
[(1−A)×100%]: max rejection % enforced   | [99%] | [98%] | [97%] | [96%] | [95%] | [94%] | [93%] | [92%] | [91%] | [90%] | [80%] | [70%]
fast                                     | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
medium fast                              | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
medium slow                              | 5.56  | 6.08  | 6.64  | 7.24  | 7.72  | 8.38  | 9.04  | 9.57  | 9.96  | 10.74 | 16.49 | 22.26
slow                                     | 97.21 | 96.23 | 95.25 | 94.30 | 93.26 | 92.19 | 91.20 | 90.17 | 89.16 | 88.13 | 77.48 | 67.26
ALL                                      | 11.39 | 11.45 | 11.52 | 11.60 | 11.64 | 11.73 | 11.83 | 11.89 | 11.91 | 12.03 | 12.70 | 13.40

Cells report the average of 5 simulation runs. -0- means absolute zero rejections.

Table 5. Percentage of rejections of Bouncer with the helping-the-underserved strategy for increasing values of parameter α and traffic rate equal to 1.5 × QPS_full_load.

α                                                       | 0.1   | 0.2   | 0.3   | 0.4   | 0.5   | 0.6   | 0.7   | 0.8   | 0.9   | 1.0
[p_max×100%]: max probability of overriding a rejection | [5%]  | [10%] | [15%] | [20%] | [25%] | [30%] | [35%] | [40%] | [45%] | [50%]
fast                                                    | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
medium fast                                             | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-   | -0-
medium slow                                             | 7.07  | 9.01  | 10.98 | 12.60 | 14.19 | 15.98 | 16.97 | 17.99 | 19.10 | 20.41
slow                                                    | 94.74 | 91.32 | 88.11 | 84.81 | 82.38 | 79.47 | 77.10 | 75.01 | 72.98 | 71.15
ALL                                                     | 11.59 | 11.83 | 12.11 | 12.26 | 12.50 | 12.74 | 12.80 | 12.90 | 13.03 | 13.24

From Algorithm 3, p_max = α × 1/2. Cells report the average of 5 simulation runs. -0- means absolute zero rejections.

Here we study how A and α influence the acceptance-allowance strategy and the helping-the-underserved strategy, respectively. We simulate an overload scenario with QPS = 1.5 × QPS_full_load and vary A in the range [0.01, 0.3] and α in [0.1, 1.0].

Predictably, Table 4 shows that as A increases, acceptance-allowance rejects fewer slow queries, and the rejections shift from slow queries, which are the most expensive, to medium slow queries, the second most expensive. The rejection counts do not exceed the maximum rejection percentage enforced by the strategy.

Similarly, in the case of helping-the-underserved, Table 5 shows that by increasing α, slow queries experience fewer rejections because the chances of Bouncer’s rejection decisions being overridden (p_max) grow. Rejections also shift from slow to medium slow queries. However, this strategy rejects more than (1 − p_max) × 100% of slow queries in most cases; e.g., with α = 0.6 and p_max = 0.3, it rejects ∼79.5% of slow queries instead of 70%, and with α = 1.0 and p_max = 0.5, it rejects ∼71% instead of 50%. The reason is not only the probabilistic nature of helping-the-underserved, but the fact that the probability p of overriding rejections is not always close to its maximum because it depends on two values that vary over time – the query type’s acceptance ratio (AR) and the average acceptance ratio (AAR). By contrast, a given A in acceptance-allowance directly means having a rejection percentage no larger than (1 − A) × 100%, which makes acceptance-allowance more intuitive to use than helping-the-underserved.

Tables 4 and 5 report similar percentages of overall rejections (from 11.4% to 13.4%) for both strategies in the considered ranges of A and α. The overall rejections also increase slightly for both strategies with their corresponding parameters. As explained in §5.3.2, this is expected because when a strategy lets a number of slow queries in, they occupy the room of a larger number of cheaper queries (e.g., medium slow) that need to be rejected.

Figure 10. Median response time (rt_p50) for slow queries with different parameter values for Bouncer’s starvation avoidance strategies.

We also observe that varying A and α has no significant impact on the query response times. For example, Figure 10 shows that rt_p50 for slow queries grows very slowly with A and α (less than a 10% increase). Both strategies report similar rt_p50 (above 20ms), and as in Figure 9, rt_p50 exceeds the SLO_p50 requirement (18ms) because the strategies accept additional requests that Bouncer’s basic formulation would not. Finally, for the considered values of A and α, both strategies let the system nearly reach 100% utilization.

5.4. Study on a Real System

Previously, we evaluated the admission control policies on a simulator assuming an ideal parallel query processing engine. Now we evaluate them on LIquid graph database (§5.1). Our test platform is an experimental cluster with 16 shards and 12 brokers, serving a large portion of LinkedIn’s Economic Graph. Shards have 1.5TB of RAM and brokers 256GB, and each host is equipped with two Intel® Xeon® Platinum 8171M CPUs (52 cores) and a 25-Gbps NIC.

We compare Bouncer, supplemented with the two starvation avoidance strategies presented in §4, against the policies in §5.2. We vary the policy on the brokers, while the shards always run AcceptFraction (§5.2.3). The brokers are configured with the same policy for each test run. Since the brokers are the queries’ entry point to the LIquid cluster (see Figure 5), this setup enables the policies to reject queries early, while letting AcceptFraction guard against excessive CPU utilization on the shards, where CPU is the limiting resource.

We send traffic to the LIquid cluster with a load generator program we built by modifying wrk2 (Tene et al.), an open-source HTTP benchmarking tool capable of producing significant load. Our load generator sends HTTPS requests at an average rate given by the user, and emulates traffic burstiness with inter-departure times following an exponential distribution. It draws queries from one or more query sets, each containing queries of a specific type, and generates traffic according to a query mix, which indicates the proportions per query type. The query sets and query mix are provided in input files.

The queries used in this study were obtained from online production traffic. The query mix is given by the following list of query types and their proportions: { "QT1": 11.56%, "QT2": 0.04%, "QT3": 0.04%, "QT4": 2.34%, "QT5": 13.44%, "QT6": 13.44%, "QT7": 0.42%, "QT8": 0.09%, "QT9": 26.35%, "QT10": 4.49%, "QT11": 27.80% }. While anonymized, the query types are sorted by cost in ascending order. To create the query sets in the mix, we sampled the traffic of a LIquid cluster during 2 hours of daily peak load on 2 weekdays, totaling 5.5 million queries. Our selection of query types, out of several dozen, captures those with the largest representation as well as their diversity in processing time.

We evaluate the policies at 36K, 72K, 108K, 144K, and 180K QPS. Shards report high CPU utilization at rates ≥ 108K QPS. Each test run lasts 10 minutes, during which our load generator issues queries to the cluster at one of these traffic rates and with the load evenly divided among the brokers. We give queries generous expiration times to ensure they do not time out. To avoid cold start effects, we warm up the cluster before each run by sending traffic to it at the expected rate for one (extra) minute.

As in the simulation study, we configure Bouncer with SLO_p50 = 18ms and SLO_p90 = 50ms for both starvation avoidance strategies; these values are similar to our latency objectives in production for most queries in the mix. The acceptance-allowance strategy uses A = 0.05 and the helping-the-underserved strategy uses α = 1.0. AcceptFraction is configured with a maximum utilization of 80%, and MaxQWT is given 12ms as its wait time limit. In LIquid, not only MaxQL but also the other policies can enforce a limit on the queue’s length to safeguard against its unbounded growth. We set the maximum queue length (L_limit) to 800 for all the policies.

Results

Figure 11. Percentage of overall rejections on LIquid graph database (real system).

Figure 11 shows that the overall rejections increase with the traffic load. As expected, Bouncer’s starvation-avoiding variants record similar rejection counts. More importantly, they report lower percentages of overall rejections (between ∼15% and ∼30% less) when compared to the other policies. With a similar result from simulation (Figure 8), we confirm our observation that Bouncer rejects fewer requests because it targets query types whose processing times are higher and closer to the latency SLO, such as QT11. By contrast, the other policies, oblivious to query types and their individual costs, refuse service to cheap and expensive queries alike. Since rejecting a number of less expensive queries has a similar effect to rejecting a smaller number of costlier ones, these policies report larger rejection counts. Figure 11 reports similar rejection percentages for the policies MaxQL and MaxQWT, and AcceptFraction yields the highest rejection rates because its maximum utilization was conservatively set to 80%. We inspected the system’s logs and confirmed that in the case of MaxQL, MaxQWT, and Bouncer, the brokers, not the shards, produced the vast majority of rejections, and in the case of MaxQWT, AcceptFraction, and Bouncer, the brokers did not reach the queue length limit (L_limit).

Figure 12. Response times for serviced QT11 queries on LIquid graph database (real system). (a) 50th-percentile response time (rt_p50). (b) 90th-percentile response time (rt_p90).

We now focus on the SLO violations for the QT11 queries as they exhibit the largest processing time (i.e., the SLO is the tightest for them) and have the largest representation in the mix. As in our simulation, Figure 12 shows that Bouncer, with both starvation avoidance strategies, and MaxQWT perform better than MaxQL and AcceptFraction at meeting the latency requirements. Bouncer and MaxQWT keep the response times of the serviced QT11 queries close to SLO_p50 and comfortably under SLO_p90. By contrast, MaxQL and AcceptFraction let them exceed SLO_p50 (>4×) and SLO_p90 (>2×) at traffic rates ≥108K QPS.

In this experiment, Bouncer rejects QT11 queries more than any other type; thus, they are the only queries that suffer from starvation. Figure 12(a) shows that at 144K and 180K QPS, helping-the-underserved permits QT11 queries to slightly exceed SLO_p50 because it takes in some extra queries of this type to make up for having treated them unfavorably. By contrast, acceptance-allowance (with A = 0.05) never acts since QT11's rejection fraction remains below 95%; hence, the median response time (rt_p50) for these queries stays under SLO_p50.
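Our reading of the acceptance-allowance rule, restated as a minimal Python sketch: with allowance A, the strategy only overrides rejections for a query type once that type's recent rejection fraction exceeds 1 - A (95% here). This is an illustrative approximation of §4.1, not the policy's actual code.

def acceptance_allowance_override(rejected_recent, total_recent, A=0.05):
    """Return True if a would-be rejection should be admitted to keep this
    query type from starving (sketch of our reading of the strategy)."""
    if total_recent == 0:
        return False
    rejection_fraction = rejected_recent / total_recent
    return rejection_fraction > (1.0 - A)  # e.g., > 95% when A = 0.05

# QT11's rejection fraction stays below 95% in this experiment, so the
# override never triggers and rt_p50 remains under SLO_p50 (Figure 12(a)).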

Figure 13. Median processing time (pt_p50) for serviced QT11 queries vs. their median response time (rt_p50) under the MaxQWT policy and Bouncer with starvation avoidance, on the LIquid graph database (real system).

Figure 12(a) also shows that under MaxQWT, the QT11 queries exceed SLO_p50 at 144K and 180K QPS. We got this result despite tuning the policy's wait time limit to reduce rejection counts and SLO violations, setting it to 12ms (< SLO_p50 = 18ms). The reason is that the query processing time observed by the brokers in the cluster increases with the traffic load. As shown in Figure 13, QT11's median processing time (pt_p50) rises with the QPS and reaches ~15ms at 180K QPS, just 3ms below SLO_p50. This behavior of the real system differs from what we assumed in our simulation study. Unlike an ideal parallel query engine, the LIquid cluster's processing tier, depicted in Figure 5, comprises shard hosts with their own FIFO queues (among other queues in the NIC, OS, and implementation software) and is therefore subject to queueing effects too. Figure 13 also shows that MaxQWT lets QT11's rt_p50 depart from pt_p50. By contrast, under Bouncer, which considers both the queue wait time and the percentile processing times, QT11's rt_p50 stays below or barely exceeds SLO_p50 and tracks pt_p50 much more closely. Thus, limiting the queue wait time alone is not enough to ensure that serviced queries meet the latency SLO on the real system.
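To make the contrast concrete, here is a minimal sketch of the two decision rules in Python. The queue wait estimate passed in is a simplified stand-in for the paper's approximation, and the helper names and default values are ours.

def maxqwt_admit(queued_wait_ms, wait_limit_ms=12.0):
    """MaxQWT-style check (sketch): admit while the estimated queue wait
    time stays under a fixed limit, regardless of query type."""
    return queued_wait_ms <= wait_limit_ms

def bouncer_admit(queued_wait_ms, pt_p50_ms, pt_p90_ms,
                  slo_p50_ms=18.0, slo_p90_ms=50.0):
    """Bouncer-style check (sketch): add the query type's percentile
    processing times to the queue wait estimate and compare to the SLOs."""
    return (queued_wait_ms + pt_p50_ms <= slo_p50_ms and
            queued_wait_ms + pt_p90_ms <= slo_p90_ms)

# At 180K QPS, QT11's pt_p50 is ~15ms; with an estimated 12ms of queue wait,
# MaxQWT still admits the query (12 <= 12) even though its expected response
# time (~27ms) exceeds SLO_p50 = 18ms, whereas Bouncer rejects it.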

Finally, our implementation of Bouncer reports a small overhead (mean = 18μs, p50 = 15μs, and p99 = 87μs) relative to millisecond-scale response times.

5.5. Bouncer vs. MaxQWT with parameter settings per query type

(a) Median response time (rt_p50) for slow queries from Table 1.
(b) Percentage of overall rejections.
Figure 14. Bouncer (without starvation avoidance) vs. MaxQWT with different wait time limits per query type.

As mentioned in §5.2.2, the MaxQWT policy evaluated in this paper is experimental and receives a single parameter value – the wait time limit, which is enforced without differentiating query types. Given this implementation limitation and the relation between queue wait time and response time (Eq. 1), we ask ourselves: How does Bouncer, which allows individual SLOs per query type, compare to MaxQWT when wait time limits are also assigned per query type? We answer this question via simulation; Figure 14 shows the median response times for slow queries and the overall rejections.

We observe that with properly chosen wait time limits per query type, MaxQWT can match Bouncer's behavior in terms of serviced queries meeting latency SLOs and overall rejections. However, finding the right values is a time-consuming exercise in experimental tuning that must be repeated for different workloads and their variations over time. Besides, the selected values may still be unreliable because, as shown in Figure 13, query processing times in real systems can increase with the load. By contrast, Bouncer requires no such tuning since it receives the latency requirements explicitly in the SLOs and, by design, ensures that serviced queries meet these requirements even when the system is overloaded. Hence, from a practical standpoint, Bouncer is more advantageous than MaxQWT.

5.6. Summary of Experimental Results

Our results show, in simulation and on a real system, that under Bouncer serviced queries meet or, when starvation avoidance is in use, track closely their latency SLOs. The reason is that Bouncer uses percentile response time estimates obtained by combining measurement-based approximations of the queue wait time and the percentile query processing times (see Equations 2, 3, and 4). Despite their inherent accuracy loss, these estimates have proven sufficient for effective admission control decisions.

Our results also indicate that under MaxQWT serviced queries are susceptible to violating SLOs expressed as percentile response times (SLO_p50 and SLO_p90). The reason is that this policy makes decisions based on queue wait time estimates calculated with the moving average processing time (pt_mavg) across queries of different types and costs (see Eq. 5), and pt_mavg differs substantially from the pt_p50 and pt_p90 values of individual query types. MaxQWT can be configured with carefully chosen wait time limits so that admitted queries meet their latency SLOs, but doing so is generally impractical due to the laborious tuning involved, and it may even be ineffective since query processing times in real database systems tend to increase with the load (see Figure 13).
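As a small worked example with made-up numbers: if 70% of the traffic is a cheap query type with processing times around 1ms and 30% is an expensive type around 20ms, then pt_mavg ≈ 0.7·1ms + 0.3·20ms = 6.7ms, while pt_p50 and pt_p90 for the expensive type sit near 20ms. A wait time limit derived from the 6.7ms average is therefore far too optimistic for the expensive queries, which helps explain the SLO violations observed under MaxQWT.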

MaxQL and AcceptFraction, in contrast, do not enforce latency SLOs because that is not their design goal. The former only considers the number of queries in the queue (a coarse-grained queuing metric), and the latter rejects queries to prevent the system from exceeding a utilization threshold.

We have also shown that both strategies, acceptance-allowance (§4.1) and helping-the-underserved (§4.2), are able to avoid query starvation at the expense of 1) a very modest increase in overall rejections (still lower than the other policies' rejection counts), and 2) causing serviced queries to exceed the SLOs at high loads. Queries exceed the SLOs by a fair margin in simulation (Figure 9), but track them very closely in our experiments on LIquid (Figure 12); this suggests that the starvation avoidance strategies may cause acceptable counts of SLO violations in practice. Moreover, we find acceptance-allowance more advantageous than helping-the-underserved because the allowance parameter A of the former is intuitive whereas the parameter α of the latter is not. In addition, in simulation acceptance-allowance activated at higher traffic rates than helping-the-underserved did.

Compared to MaxQL, MaxQWT, and AcceptFraction, Bouncer reported the fewest overall rejections (between ~15% and ~30% fewer) by targeting the query types with the highest cost and tightest latency SLO. It also incurred a small overhead (mean = 18μs) for millisecond-scale queries and, with the given latency SLOs, let the system reach nearly maximal utilization.

In our experiments on the LIquid cluster (§5.4), we ran Bouncer on the brokers and AcceptFraction on the shards. This pairing is reasonable because AcceptFraction guards against excessive CPU usage on shards, where CPU is the limiting resource, while Bouncer guards against violations of latency SLOs on the brokers, offering early rejections to client requests. Thus, Bouncer is a viable complement to AcceptFraction and other policies that protect the system from exceeding its capacity.

6. Related Work

Admission control has been applied to networking (Jamin et al., 1997; Nam et al., 2008; Yang et al., 2016), operating systems (Hao et al., 2017), web services (Chen and Mohapatra, 2002; Welsh and Culler, 2003; Blanquer et al., 2005; Zhou et al., 2018), complex event processing over data streams (Slo et al., 2020; Zhao et al., 2020; Chapnik et al., 2022), and databases (Elnikety et al., 2004; Tozer et al., 2010; Xiong et al., 2011; Zhang, 2014; Noll et al., 2019; SAP, 2022; IBM, 2022; Doran et al., 2022; Corp., 2022c). Here we discuss prior work closely related to Bouncer.

Admission control techniques can be divided into two groups: 1) those that distinguish between request types (e.g., (Elnikety et al., 2004; Blanquer et al., 2005; Tozer et al., 2010; Xiong et al., 2011)), and 2) those that do not (e.g., (Mönkeberg and Weikum, 1992; Schroeder et al., 2006; Zhou and Yang, 2006; Bartolini et al., 2009)). Like Bouncer, Gatekeeper (Elnikety et al., 2004) belongs to the first group and implements a measurement-based approach, but with different goals and response time estimates. Gatekeeper lets the system serve a sustained throughput without exceeding its capacity, and uses moving averages to estimate mean response times. Bouncer instead ensures that serviced requests meet or track closely their latency SLOs and maintains histograms to estimate percentile response times. While Bouncer strives for early rejections, Gatekeeper does not, as it lets requests time out in the scheduling queue. Gatekeeper implements a shortest-job-first (SJF) scheduler with an aging mechanism (Cherkasova, 1998) to reduce the overall average response time and prevent longer-running queries from starving. We also supplement Bouncer with starvation avoidance strategies that, rather than deferring query executions, make decisions on the spot at query arrival. LIquid currently processes queries in FIFO order; evaluating other scheduling disciplines is left as future work.

Quorum (Blanquer et al., 2005) also differentiates request types and implements a non-invasive, measurement-based approach to quality of service for Internet services. Quorum and Bouncer perform admission control at the system's entrance point, but for Quorum the system is an Internet site whereas for Bouncer it is a two-tier database. Quorum ensures that serviced requests of different types meet their response time guarantees, expressed as averages or percentiles. Quorum selectively drops requests after they are dequeued; by contrast, Bouncer rejects requests before they are enqueued. Quorum prevents request types from starving by ensuring they get a large enough fraction of capacity to guarantee their required minimum throughput at all times. In contrast, Bouncer does not control the capacity given to query types and applies a different kind of strategy to avoid query starvation.

Q-Cop (Tozer et al., 2010), like Bouncer, focuses on admission control, but seeks to minimize the number of request timeouts. Q-Cop builds a linear regression model from offline experiments, and uses the model to predict the processing time of a newly arriving query based on the query’s type and the mix of queries being executed by the database system. Q-Cop’s model and Bouncer’s response time estimates have similar formulations, and their decision processes are alike too. But, rather than relying on a model trained offline, Bouncer estimates percentile response times based on recent processing times and the number of queries in the queue at the time. Besides, Q-Cop’s model only captures average execution times; hence, it cannot enforce latency SLOs defined in terms of percentiles. Another prediction-based approach is ActiveSLA (Xiong et al., 2011), which makes acceptance decisions to improve the profit of database-as-a-service providers while considering the service level agreements (SLA) with their clients. It uses a non-linear classification model that considers the query types and mix, among other features, to estimate the probability of a new query meeting its deadline. Then, based on the probability, ActiveSLA decides whether to admit the query with a profit optimization objective, with the expected profit derived from the SLA. In contrast, Bouncer has no notion of profit and does not seek to maximize it. Unfortunately, ActiveSLA’s overhead is much (1000x) higher than Bouncer’s and becomes prohibitive for systems like LIquid offering millisecond-scale response times. In addition, ActiveSLA continually rebuilds the model to reduce the human effort on model retraining. But, for production data systems like LIquid, queries are regularly added and retired, altering the query mix. Hence, ActiveSLA’s model, as well as Q-Cop’s, would need retraining more often than their authors anticipate, not only when the database or hardware configurations change. For Bouncer, model retraining is not a concern.

Bouncer's overhead is low for millisecond-scale queries, which is key for request-by-request acceptance decisions. Similarly, but at the operating-system level, MittOS (Hao et al., 2017) computes latency estimates per IO operation in a few μs or less, and embraces quick rejections for IOs that cannot be served by a deadline. Built within the storage stack, it exposes an SLO-aware interface to help reduce IO completion time for data-parallel applications. MittOS receives latency deadlines from applications and requires white-box knowledge of devices and resources, whereas Bouncer adopts a black-box approach and its latency SLOs are percentiles. MittOS neither considers call types nor tackles starvation, but applications could do so since IO calls are given individual deadlines. Targeting microsecond-scale remote procedure calls, Breakwater (Cho et al., 2020) implements a credit-based admission control scheme with demand speculation and overcommitment handling. While offering fast rejections, it is oblivious to query types and only considers queue wait times to determine the credits to distribute among clients. Instead, Bouncer considers both queue wait times and processing times to estimate percentile response times for queries.

7. Conclusion and Future Work

We presented Bouncer, an admission control policy that makes query-by-query decisions based on latency SLOs defined in terms of percentile response times. It combines inexpensive approximations of queue wait time and processing time to estimate percentile response times for each query, and compares the estimates with the objective values to decide whether to accept or reject the query. It takes into account that production workloads often include various types of queries with different latency characteristics, and allows assigning SLOs per query type. In addition, we studied two starvation avoidance strategies (acceptance-allowance and helping-the-underserved) that supplement Bouncer's basic formulation to prevent query types from systematically receiving no service. We also conducted an extensive evaluation of Bouncer and its starvation-avoiding variants against several in-house policies in simulation and on the LIquid graph database. A summary of our experimental results is given in §5.6.

As part of our future work, we plan to evaluate alternative formulations for Bouncer that use higher-order percentile response times, apply different logical expressions for decision making, and update processing time histograms in a sliding window, instead of non-overlapping windows. We also intend to extend Bouncer to support queries served based on priorities (rather than in FIFO order), and will consider adapting it to other query scheduling disciplines.

Bouncer is currently susceptible to cold starts and lulls in traffic because its decision making relies on per-query-type processing time histograms, which are initially in a blank state and become empty when queries of certain types stop coming for some time. We expect Bouncer to handle these issues by itself, without supplementary infrastructure and with no extra burden on system operators. Appendix A discusses our plan in this regard.

Finally, we are interested in evaluating Bouncer against other policies in the literature as well as its applicability within more general overload management solutions, such as Quorum (Blanquer et al., 2005) (in its selective dropping module) and LinkedIn’s Hodor (Barkley, 2022).

References

  • Abiteboul et al. (1995) Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.
  • Agarwal et al. (2013) Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In Proceedings of the 8th ACM European Conference on Computer Systems. 29–42.
  • Barkley (2022) Bryan Barkley. 2022. Hodor: Detecting and addressing overload in LinkedIn microservices. https://engineering.linkedin.com/blog/2022/hodor--detecting-and-addressing-overload-in-linkedin-microservic. (February 2022). [Accessed: Feb 2023].
  • Bartolini et al. (2009) Novella Bartolini, Giancarlo Bongiovanni, and Simone Silvestri. 2009. Self-* through Self-Learning: Overload Control for Distributed Web Systems. Computer Networks 53, 5 (April 2009), 727–743.
  • Blanquer et al. (2005) Josep M. Blanquer, Antoni Batchelli, Klaus E. Schauser, and Richard Wolski. 2005. Quorum: Flexible Quality of Service for Internet Services. In Proceedings of the 2nd USENIX Symposium on Networked Systems Design and Implementation. 159–174.
  • Bronson et al. (2013) Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. 2013. TAO: Facebook’s Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference. 49–60.
  • Carter et al. (2019) Andrew Carter, Andrew Rodriguez, Yiming Yang, and Scott Meyer. 2019. Nanosecond Indexing of Graph Data With Hash Maps and VLists. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD’19). ACM, 623–635.
  • Chapnik et al. (2022) Koral Chapnik, Ilya Kolchinsky, and Assaf Schuster. 2022. DARLING: Data-Aware Load Shedding in Complex Event Processing Systems. Proceedings of the VLDB Endowment 15, 3 (2022), 541–554.
  • Chen and Mohapatra (2002) Huamin Chen and Prasant Mohapatra. 2002. Session-based Overload Control in QoS-aware Web Servers. In Proceedings of the Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 2. 516–524.
  • Cherkasova (1998) Ludmila Cherkasova. 1998. Scheduling Strategy to Improve Response Time for Web Applications. In High-Performance Computing and Networking. Springer Berlin Heidelberg, 305–314.
  • Cherkasova and Phaal (1998) Ludmila Cherkasova and Peter Phaal. 1998. Session Based Admission Control: A Mechanism for Improving the Performance of an Overloaded Web Server. Technical Report HPL-98-119. Computer Systems Laboratory. Hewlett-Packard.
  • Cho et al. (2020) Inho Cho, Ahmed Saeed, Joshua Fried, Seo Jin Park, Mohammad Alizadeh, and Adam Belay. 2020. Overload Control for μ\mus-scale RPCs with Breakwater. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation. 299–314.
  • Chou et al. (2019) David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, and Dmitri Perelman. 2019. Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 430––446.
  • Colajanni et al. (1997) Michele Colajanni, Philip S. Yu, and Daniel M. Dias. 1997. Scheduling Algorithms for Distributed Web Servers. In Proceedings of 17th International Conference on Distributed Computing Systems. IEEE, 169–176.
  • Corbett et al. (2013) James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2013. Spanner: Google’s Globally Distributed Database. ACM Transactions on Computer Systems 31, 3 (2013), 8:1–8:22.
  • Corp. (2018) LinkedIn Corp. 2018. The graph team at LinkedIn. https://engineering.linkedin.com/teams/data/data-infrastructure/graph. (2018). [Accessed: Feb 2023].
  • Corp. (2022a) LinkedIn Corp. 2022a. LinkedIn’s Economic Graph. https://economicgraph.linkedin.com. (2022). [Accessed: Feb 2023].
  • Corp. (2022b) Microsoft Corp. 2022b. Azure Cosmos DB. https://azure.microsoft.com/en-us/services/cosmos-db/. (2022). [Accessed: Feb 2023].
  • Corp. (2022c) Microsoft Corp. 2022c. SQL Server Resource Governor. https://learn.microsoft.com/en-us/sql/relational-databases/resource-governor/resource-governor?view=sql-server-ver16. (2022). [Accessed: Feb 2023].
  • Cuervo (2017a) Alejandro Forero Cuervo. 2017a. Handling Overload. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media Inc., Chapter 21. https://sre.google/sre-book/handling-overload/.
  • Cuervo (2017b) Alejandro Forero Cuervo. 2017b. Load Balancing in the Datacenter. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media Inc., Chapter 20. https://sre.google/sre-book/load-balancing-datacenter/.
  • DeCandia et al. (2007) Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s Highly Available Key-Value Store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles. 205–220.
  • Doran et al. (2022) Mark Doran, Padmaja Potineni, and Rajesh Bhatiya. 2022. Managing Resources with Oracle Database Resource Manager. Oracle Database: Database Administrator’s Guide, 21c. Chapter 26. https://docs.oracle.com/en/database/oracle/oracle-database/21/admin/index.html.
  • Elnikety et al. (2004) Sameh Elnikety, Erich Nahum, John Tracey, and Willy Zwaenepoel. 2004. A Method for Transparent Admission Control and Request Scheduling in E-Commerce Web Sites. In Proceedings of the 13th International Conference on World Wide Web. ACM, 276–286.
  • Hao et al. (2017) Mingzhe Hao, Huaicheng Li, Michael Hao Tong, Chrisma Pakha, Riza O. Suminto, Cesar A. Stuardo, Andrew A. Chien, and Haryadi S. Gunawi. 2017. MittOS: Supporting Millisecond Tail Tolerance with Fast Rejecting SLO-aware OS Interface. In Proceedings of the 26th Symposium on Operating Systems Principles. 168–183.
  • Heiss and Wagner (1991) Hans-Ulrich Heiss and Roger Wagner. 1991. Adaptive Load Control in Transaction Processing Systems. In Proceedings of the 17th International Conference on Very Large Data Bases. 47–54.
  • IBM (2022) IBM. 2022. Db2 Adaptive workload manager. https://www.ibm.com/docs/en/db2/11.5?topic=management-adaptive-workload-manager. (2022). [Accessed: Feb 2023].
  • Iyer et al. (2001) Ravi Iyer, Vijay Tewari, and Krishna Kant. 2001. Overload Control Mechanisms for Web Servers. In Proceedings of the International Conference on the Performance and QoS of Next Generation Networking. Springer, 225–244.
  • Jamin et al. (1997) Sugih Jamin, Peter B. Danzig, Scott J. Shenker, and Lixia Zhang. 1997. A Measurement-based Admission Control Algorithm for Integrated Service Packet Networks. IEEE/ACM Transactions on Networking 5, 1 (1997), 56–70.
  • Jones et al. (2017) Chris Jones, John Wilkes, Niall Murphy, and Cody Smith. 2017. Service Level Objectives. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media Inc., Chapter 4. https://sre.google/sre-book/service-level-objectives/.
  • Jonker et al. (2017) Mattijs Jonker, Alistair King, Johannes Krupp, Christian Rossow, Anna Sperotto, and Alberto Dainotti. 2017. Millions of Targets under Attack: A Macroscopic Characterization of the DoS Ecosystem. In Proceedings of the 2017 Internet Measurement Conference. ACM, 100––113.
  • Kim (2018) Eugene Kim. 2018. Internal documents show how Amazon scrambled to fix Prime Day glitches. https://www.cnbc.com/2018/07/19/amazon-internal-documents-what-caused-prime-day-crash-company-scramble.html. (2018). [Accessed: Feb 2023].
  • Kopp et al. (2021) Daniel Kopp, Christoph Dietzel, and Oliver Hohlfeld. 2021. DDoS Never Dies? An IXP Perspective on DDoS Amplification Attacks. In Proceedings of the 22nd International Conference on Passive and Active Measurement (Lecture Notes in Computer Science), Oliver Hohlfeld, Andra Lutu, and Dave Levin (Eds.), Vol. 12671. Springer, 284–301.
  • Kreps et al. (2011) Jay Kreps, Neha Narkhede, and Jun Rao. 2011. Kafka: A Distributed Messaging System for Log Processing. In Proceedings of the 6th International Workshop on Networking Meets Database (NetDB’11). ACM, 1–7.
  • LeFebvre (2001) William LeFebvre. 2001. CNN.com: Facing a World Crisis. In 15th Systems Administration Conference (LISA 2001). USENIX Association, San Diego, CA. https://www.usenix.org/conference/lisa-2001/cnncom-facing-world-crisis
  • Lewandowski (2017) Piotr Lewandowski. 2017. Load Balancing at the Frontend. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media Inc., Chapter 19. https://sre.google/sre-book/load-balancing-frontend/.
  • Liu et al. (1994) J.W.S. Liu, Wei-Kuan Shih, Kwei-Jay Lin, R. Bettati, and Jen-Yao Chung. 1994. Imprecise Computations. Proceedings of the IEEE 82, 1 (1994), 83–94.
  • Mallapur and Kehoe (2017) Anil Mallapur and Michael Kehoe. 2017. TrafficShift: Load Testing at Scale. https://engineering.linkedin.com/blog/2017/05/trafficshift--load-testing-at-scale. (2017). [Accessed: Feb 2023].
  • Meyer et al. (2020a) Scott Meyer, Andrew Carter, and Andrew Rodriguez. 2020a. LIquid: The soul of a new graph database, Part 1. https://engineering.linkedin.com/blog/2020/liquid-the-soul-of-a-new-graph-database-part-1. (2020). [Accessed: Feb 2023].
  • Meyer et al. (2020b) Scott Meyer, Andrew Carter, and Andrew Rodriguez. 2020b. LIquid: The soul of a new graph database, Part 2. https://engineering.linkedin.com/blog/2020/liquid–the-soul-of-a-new-graph-database–part-2. (2020). [Accessed: Feb 2023].
  • Mittal (2016) Sparsh Mittal. 2016. A Survey of Techniques for Approximate Computing. ACM Computing Surveys 48, 4 (May 2016).
  • Mönkeberg and Weikum (1992) Axel Mönkeberg and Gerhard Weikum. 1992. Performance Evaluation of an Adaptive and Robust Load Control Method for the Avoidance of Data-Contention Thrashing. In Proceedings of the 18th International Conference on Very Large Data Bases. 432––443.
  • Nam et al. (2008) Seung Yeob Nam, Sunggon Kim, and Dan Keun Sung. 2008. Measurement-Based Admission Control at Edge Routers. IEEE/ACM Transactions on Networking 16, 2 (April 2008), 410–423.
  • Newman (2021) Sam Newman. 2021. Building Microservices: Designing Fine-Grained Systems (2 ed.). O’Reilly Media.
  • Noll et al. (2019) Stefan Noll, Norman May, Alexander Böhm, Jan Mühlig, and Jens Teubner. 2019. From the Application to the CPU: Holistic Resource Management for Modern Database Management Systems. IEEE Data Engineering Bulletin 42, 1 (2019), 10–21. http://sites.computer.org/debull/A19mar/p10.pdf
  • Purnell (2020) Spence Purnell. 2020. State Unemployment Websites Crash as COVID-19 Shines Light on Government Technology Failures. https://shorturl.at/BNS29. (2020). [Accessed: Feb 2023].
  • Richardson (2019) Chris Richardson. 2019. Microservices Patterns: With examples in Java (1 ed.). Manning, Chapter 8, 253–291.
  • SAP (2022) SAP. 2022. Admission Control. Monitoring View. SAP HANA Administration with SAP HANA Cockpit (2.15.0 ed.). Chapter 7.5. https://help.sap.com/docs/SAP_HANA_COCKPIT/afa922439b204e9caf22c78b6b69e4f2/ce46dcceaef045cb85f6fdf694789ea0.html.
  • Schroeder and Harchol-Balter (2006) Bianca Schroeder and Mor Harchol-Balter. 2006. Web Servers under Overload: How Scheduling Can Help. ACM Transactions on Internet Technology 6, 1 (Feb. 2006), 20–52.
  • Schroeder et al. (2006) B. Schroeder, M. Harchol-Balter, A. Iyengar, E. Nahum, and A. Wierman. 2006. How to Determine a Good Multi-Programming Level for External Scheduling. In Proceedings of the 22nd International Conference on Data Engineering. 60–71.
  • Slo et al. (2020) Ahmad Slo, Sukanya Bhowmik, and Kurt Rothermel. 2020. hSPICE: State-aware Event Shedding in Complex Event Processing. In Proceedings of the 14th ACM International Conference on Distributed and Event-based Systems (DEBS’20). 109–120.
  • Szopa et al. (2016) Ryszard Szopa and others. 2016. Doorman: Global Distributed Client Side Rate Limiting. https://github.com/youtube/doorman. (2016). [Accessed: Feb 2023].
  • Tene et al. Gil Tene and others. wrk2: an HTTP benchmarking tool based mostly on wrk. https://github.com/giltene/wrk2. [Accessed: Feb 2023].
  • Toh et al. (2022) Alethea Toh, Anupam Vij, and Syed Pasha. 2022. Azure DDoS Protection - 2021 Q3 and Q4 DDoS attack trends. https://azure.microsoft.com/en-us/blog/azure-ddos-protection-2021-q3-and-q4-ddos-attack-trends/. (January 2022). [Accessed: Feb 2023].
  • Tozer et al. (2010) Sean Tozer, Tim Brecht, and Ashraf Aboulnaga. 2010. Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads. In Proceedings of the IEEE 26th International Conference on Data Engineering. 397–408.
  • Ulrich (2017) Mike Ulrich. 2017. Addressing Cascading Failures. Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media Inc., Chapter 22. https://sre.google/sre-book/addressing-cascading-failures/.
  • Veeraraghavan et al. (2016) Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. 2016. Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. 635–651.
  • Welsh and Culler (2003) Matt Welsh and David Culler. 2003. Adaptive Overload Control for Busy Internet Servers. In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems - Volume 4 (USITS’03). 1:1–1:15.
  • Xiong et al. (2011) Pengcheng Xiong, Yun Chi, Shenghuo Zhu, Junichi Tatemura, Calton Pu, and Hakan HacigümüŞ. 2011. ActiveSLA: A Profit-Oriented Admission Control Framework for Database-as-a-Service Providers. In Proceedings of the 2nd ACM Symposium on Cloud Computing. Article 15, 14 pages.
  • Yang et al. (2016) Jian Yang, Kunjie Zhu, Yongyi Ran, Weizhe Cai, and Enzhong Yang. 2016. Joint Admission Control and Routing via Approximate Dynamic Programming for Streaming Video Over Software-defined Networking. IEEE Transactions on Multimedia 19, 3 (2016), 619–631.
  • Zhan et al. (2019) Chaoqun Zhan, Maomeng Su, Chuangxian Wei, Xiaoqiang Peng, Liang Lin, Sheng Wang, Zhe Chen, Feifei Li, Yue Pan, Fang Zheng, and Chengliang Chai. 2019. AnalyticDB: Real-time OLAP Database System at Alibaba Cloud. Proceedings of the VLDB Endowment 12, 12 (2019), 2059–2070.
  • Zhang (2014) Mingyi Zhang. 2014. Autonomic Workload Management for Database Management Systems. Ph.D. Dissertation. Queen’s University. http://hdl.handle.net/1974/12181.
  • Zhang et al. (2018) Mingyi Zhang, Patrick Martin, Wendy Powley, and Jianjun Chen. 2018. Workload Management in Database Management Systems: A Taxonomy. IEEE Transactions on Knowledge and Data Engineering 30, 7 (2018), 1386–1402.
  • Zhao et al. (2020) Bo Zhao, Nguyen Quoc Viet Hung, and Matthias Weidlich. 2020. Load Shedding for Complex Event Processing: Input-based and State-based Techniques. In Proceedings of the IEEE 36th International Conference on Data Engineering (ICDE’20). 1093–1104.
  • Zhou et al. (2018) Hao Zhou, Ming Chen, Qian Lin, Yong Wang, Xiaobin She, Sifan Liu, Rui Gu, Beng Chin Ooi, and Junfeng Yang. 2018. Overload Control for Scaling WeChat Microservices. In Proceedings of the ACM Symposium on Cloud Computing. ACM, 149–161.
  • Zhou and Yang (2006) Jingyu Zhou and Tao Yang. 2006. Selective Early Request Termination for Busy Internet Services. In Proceedings of the 15th International Conference on World Wide Web. ACM, 605–614.
Acknowledgements.
We thank LinkedIn’s graph team for their support and valuable feedback. Special thanks to Andrew Carter for the AcceptFraction policy (§5.2.3), SungJu (Joe) Cho for the MaxQWT policy (§5.2.2), and Roman Averbukh for improving the AcceptFraction policy and LIquid’s admission control framework.

Appendix A Handling cold starts

Like other measurement-based techniques, Bouncer suffers from the “cold start” syndrome because its decision making relies on per-query-type processing time histograms, which are initially in a blank state. Note that different query types experience the problem independently, and the time they need to warm up depends on when queries begin to arrive, at what pace, and whether they come continuously or sporadically – aspects out of the policy’s control. For instance, queries of popular types may eagerly arrive in continuous streams immediately after the system starts, prompting the histograms to get filled early; by contrast, other query types may arrive sporadically long after the system started, causing the histograms to become populated at a later time. In this section, we discuss alternative solutions and our plan to handle cold starts in Bouncer.

A possible solution to this problem is to warm up the system, before serving live traffic, with sampled production queries amply representative of the different query types (as we did in §5.4). However, it has practical implications. Every installation of the system (e.g., dozens of LIquid clusters per data center) must be warmed up right after being deployed, which occurs often under continuous integration and deployment (CI/CD). This solution thus requires an ancillary software infrastructure, arguably part of the CI/CD pipeline, that samples production traffic, coordinates warm-up executions, and issues sampled queries to newly deployed installations. It also imposes an additional burden on system operators, who must manage this extra piece of infrastructure and ensure that the query types in production are well represented by the sampled queries. Alternatively, one could deploy the system along with pre-populated histograms containing query processing times from previous installations. But, besides needing mechanisms to capture, store, and redeploy histograms, this solution assumes that past histograms remain representative of the queries’ performance across versions of the system, which is generally not true. The above solutions, which require supplementary infrastructure to handle cold starts, hinder Bouncer’s usability due to the added complexity. For that reason, we plan to adopt a solution in which Bouncer deals with cold starts by itself.

In our preferred solution, each query type goes through a warm-up phase. During this transitory phase, Bouncer lets queries in, possibly with some leniency, until the histogram gets sufficiently populated, and once that happens the policy switches to normal operation for the warmed-up query type. The question is then: How should Bouncer operate during the warm-up phase? Bouncer includes a general histogram where it stores the processing times of queries regardless of their types. Thus, when the policy receives a query and finds that the corresponding histogram is not sufficiently populated, it gets the mean, 50th-, and 90th-percentile processing times (pt_mean, pt_p50, and pt_p90) from the general histogram, and decides to admit or reject the query based on these values and the latency SLOs for the default (catch-all) query type. We also choose this solution because it is simple to implement and well aligned with the acceptance-allowance strategy (§4.1), which helps quickly fill the general and per-query-type histograms.
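The following Python sketch illustrates this fallback under two assumptions of ours: that "sufficiently populated" is a simple sample-count threshold, and that histograms expose count, mean, and percentile accessors. Neither assumption reflects Bouncer's actual implementation.

WARMUP_MIN_SAMPLES = 1000  # assumed threshold for "sufficiently populated"

def processing_time_stats(query_type, per_type_hists, general_hist):
    """Return (pt_mean, pt_p50, pt_p90) plus the SLO key Bouncer should use,
    falling back to the general histogram and default SLO during warm-up."""
    hist = per_type_hists.get(query_type)
    if hist is not None and hist.count >= WARMUP_MIN_SAMPLES:
        return (hist.mean(), hist.percentile(50), hist.percentile(90)), query_type
    g = general_hist  # cold start: all-types histogram, default (catch-all) SLO
    return (g.mean(), g.percentile(50), g.percentile(90)), "default"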

In addition, there is a related issue with queries that follow intermittent patterns. When queries of a given type stop coming for some time, the corresponding histogram may be 1) replaced by an empty one, rendering it ineffective as it returns to the initial blank state, or 2) retained, even though it becomes stale and its effectiveness may decay over time. In this case we prefer stale data to no data; thus, we choose to retain the histograms when the query counts fall below a threshold.

Our solution is under development and its evaluation is left as future work.

Appendix B Other Practical Considerations

Here we discuss some aspects related to the use of Bouncer.

B.1. Choosing percentiles for latency SLOs

As discussed in §3, Bouncer can be easily modified to support one, two, or more percentile response times as objectives. Then, the natural question is: What percentiles should we use in our latency SLOs? That decision depends on multiple factors, such as the characteristics of the system being considered, its workload (mix of queries and traffic volumes), and the percentile latencies the system’s operators monitor and report. But, one guiding principle is the stability of the selected percentile values over time.

In this paper, we use the 50th- and 90th-percentile response times to express our latency objectives (SLO_p50 and SLO_p90). We chose them because, besides being commonly used to specify latency requirements, our experience with LIquid shows that the 50th- and 90th-percentile processing times (pt_p50 and pt_p90) observed by the brokers are more stable than pt_p99. Shards (as well as brokers) run a front-end component written in Java, and garbage collection pauses regularly cause a relatively high pt_p99. When a query type’s histogram stores an elevated pt_p99 (i.e., close to or larger than SLO_p99), most of the queries of this type will be rejected in the next time interval until the histogram is updated. In contrast, we found pt_p50 and pt_p90 to be less susceptible to garbage collection stalls.

B.2. Bouncer’s configuration effort

This is an aspect of practical importance because system operators prefer easy-to-configure policies to simplify their duties. Since latency objectives under Bouncer are defined in terms of percentile response times (e.g., SLO_p50 and SLO_p90), configuring it comprises determining such percentiles for the query types. This task, however, is often necessary regardless of the admission control policy in use. The reason is that prospective and current customers of an online data system typically want to know the response time for their queries and whether the system is able to meet their latency requirements, especially at peak traffic load. Getting this information usually involves empirically evaluating the queries’ performance under realistic conditions (e.g., with actual production traffic). Thus, since it is already part of the duties system operators need to carry out, obtaining the queries’ response time percentiles often requires no extra effort in practice.

Effort is also devoted to setting and maintaining the response time SLOs. At the extreme, Bouncer allows having an SLO setting (e.g., SLO_p50 and SLO_p90) per query type. Imagine being responsible for making sure that the individual SLOs for 100 query types in production are properly set. That seems excessive when compared to using a utilization-centric policy, like AcceptFraction, with a single key configuration parameter: the maximum utilization limit. In practice, however, the configuration effort is generally acceptable because multiple query types often share the same SLO. We have seen ratios as high as 20:1, and our evaluation (§5) exemplifies a scenario with an 11:1 ratio. Such ratios suggest that operators can establish a manageably sized set of SLOs and assign each SLO to multiple query types, effectively grouping queries into classes (reminiscent of quality-of-service classes).

In addition, Bouncer includes a default SLO setting for queries with an unrecognized type. By setting permissive latency values in the default SLO, the system can serve new queries with no declared type, and the operator can later add the new query types and their SLOs to the policy’s configuration. Hence, the default SLO not only helps reduce the configuration effort involved in testing new queries, but also avoids hindering their onboarding. An illustrative configuration combining SLO classes and the default SLO is sketched below.
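As an illustration of such grouping (with hypothetical class names, values, and assignments), an operator-facing configuration might look like the following Python sketch.

# Illustrative SLO configuration: a few SLO classes shared by many query
# types, plus a permissive default for queries with no declared type.
SLO_CLASSES = {
    "fast":     {"p50_ms": 10,  "p90_ms": 30},
    "standard": {"p50_ms": 18,  "p90_ms": 50},
    "default":  {"p50_ms": 100, "p90_ms": 250},  # permissive catch-all
}

QUERY_TYPE_TO_CLASS = {
    "QT1": "fast", "QT5": "fast", "QT9": "standard", "QT11": "standard",
    # Query types not listed here fall back to the "default" class.
}

def slo_for(query_type):
    return SLO_CLASSES[QUERY_TYPE_TO_CLASS.get(query_type, "default")]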