
NetFC: Enabling Accurate Floating-point Arithmetic on Programmable Switches

Penglai Cui, Heng Pan, Zhenyu Li, Jiaoren Wu§
Shengzhuo Zhang§, Xingwu Yang§, Hongtao Guan, Gaogang Xie
ICT, CAS, China    Purple Mountain Laboratories    §Kuaishou    CNIC, CAS, China
Abstract

In-network computation has been widely used to accelerate data-intensive distributed applications. Some computational tasks, traditionally performed on servers, are offloaded to programmable switches in the network. However, the computational capacity of programmable switches is limited to simple integer arithmetic operations, while many applications require on-the-fly floating-point operations. To address this issue, prior approaches either adopt a float-to-integer method or directly offload computational tasks to the local CPUs of switches, incurring accuracy loss and delayed processing respectively.

To this end, we propose NetFC, a table-lookup method that achieves on-the-fly in-network floating-point arithmetic operations nearly without accuracy loss. NetFC adopts a divide-and-conquer mechanism that converts the original huge table into several much smaller tables together with some integer operations. NetFC further leverages a scaling-factor mechanism to improve computational accuracy, and a prefix-based lossless table compression method to reduce memory consumption. We use both synthetic and real-life datasets to evaluate NetFC. The experimental results show that the average accuracy of NetFC is 99.94% even in the worst case, with only 448KB of memory consumption. Furthermore, we integrate NetFC into Sonata [14] for detecting the Slowloris attack, yielding a significant decrease in detection delay.

Index Terms:
In-network computation, floating-point calculation, programmable switch

I Introduction

In modern data centers, many applications are data-intensive, such as big data analysis [12], distributed deep learning [28], graph processing [26] and real-time stream processing [17, 5]. Due to frequent data exchange, these applications suffer performance degradation from network overhead. For example, in some Facebook MapReduce jobs, network communication can occupy up to 70% of the execution time [6]; in distributed reinforcement learning, network overhead can be over 80% of the total cost of each iteration [25]. Thus, cutting down network traffic to accelerate network communication has become a key factor in improving overall performance.

Recently, a new direction to accelerate data-intensive applications, i.e. in-network computation, has been explored. The intuition behind it is that the network is now equipped with many new devices (e.g. programmable switches [4] and smart network interface cards [24]); that is, the network has become capable of providing computational capacity. Consequently, some computational tasks, traditionally performed on the host side, can be offloaded to the network devices, and network traffic can be intercepted and processed by the network devices on the fly, based on the pre-offloaded computational logic, before it reaches the hosts. Indeed, researchers in the community have already utilized in-network computation to accelerate different distributed applications and achieved remarkable performance improvements. For example, NetCache [21] is co-designed with programmable switches to achieve more than 2 billion queries per second; ATP [23] performs in-network gradient aggregation, which can improve the training throughput of existing distributed deep learning systems by up to 66%; Linear-Road [19], a stream processing benchmark, demonstrates the capability of processing 4 billion events per second; Sonata [14] utilizes programmable switches to achieve fast in-band network telemetry.

Unfortunately, the computational capacity of the network is actually very limited: even the state-of-the-art programmable switches (e.g. Barefoot Tofino [4]) only support simple integer arithmetic operations (e.g. addition and subtraction). This becomes a barrier to in-network acceleration of applications, because many of them require processing sophisticated floating-point data and arithmetic operations (e.g. multiplication and division). For example, ATP [23] accelerates distributed deep learning via in-network gradient aggregation, which needs to perform floating-point calculation on programmable switches; Sonata [14] measures the state of the network, where some measurement tasks (e.g. Slowloris attack detection [1]) require in-network multiplication or division.

To overcome this barrier, prior works adopt one of two approaches. One is to convert floating-point numbers into integers on the host side in advance, so that only integer operations need to be performed on programmable switches. This approach has been adopted by ATP to achieve in-network gradient aggregation; however, it may incur non-negligible accuracy loss (see Section V). The other is to offload the computational tasks to the local CPUs of switches (as Sonata does). Nevertheless, this solution introduces significant additional latency (see Section VI). In summary, to the best of our knowledge, there is no solution that achieves on-the-fly in-network floating-point arithmetic operations nearly without accuracy loss on programmable switches.

To address this gap, we design and implement a novel approach, NetFC. NetFC adopts a table-lookup method to achieve floating-point arithmetic operations on programmable switches. Intuitively, a simple and direct way is to use a table that enumerates all possible calculation cases ahead of time. In this way, for one arithmetic operation, we can use its two operands as a key to look up the table, and the corresponding value is the result. However, the generated table is huge, since it needs to traverse all operands and enumerate their combinations. For example, for two 16-bit floating-point operands, it would cost almost 8 GB of memory, which is unaffordable for the on-chip memory of programmable switches (e.g. Barefoot Tofino switches). NetFC adopts a divide-and-conquer method to address this memory issue. Specifically, it utilizes logarithm projection and transformation to convert the original large table into several much smaller tables together with simple integer operations (i.e. addition and subtraction). Furthermore, NetFC adopts a scaling-factor mechanism to improve computational accuracy, and designs a prefix-based lossless compression method to further reduce on-chip memory usage. Experimental results on different types of datasets demonstrate that the average accuracy of NetFC is 99.94% even in the worst case, with only 448KB of memory consumption, which is much better than the state-of-the-art approach (i.e. float-to-integer). In addition, we integrate NetFC into a network telemetry system, Sonata, for detecting and defending against the Slowloris attack in real time; with NetFC, the detection latency is reduced from 43.16ms to 0.046ms.

To sum up, the contributions of this paper are three-fold:

  • We design a table-lookup approach, NetFC, to achieve in-network floating-point arithmetic operations nearly without accuracy loss. It adopts a divide-and-conquer method to address the on-chip memory consumption problem.

  • We propose a scaling-factor method to improve the computational accuracy of NetFC, and design a prefix-based lossless compression mechanism to further reduce memory consumption.

  • We implement NetFC on Barefoot Tofino switches. Extensive experiments show that NetFC enables floating-point arithmetic operations on programmable switches with low on-chip memory usage. In addition, we integrate NetFC into Sonata [14] to detect and defend against the Slowloris attack in real time.

The remainder of this paper is organized as follows. Section II describes the background and the motivation. We present the design of NetFC and its underlying theory in Section III. Section IV shows the implementation details and our optimization mechanisms. We evaluate NetFC in Section V and present a use case equipped with NetFC in Section VI. We finally discuss related work in Section VII and conclude in Section VIII.

II Background and Motivation

In this section, we first briefly describe in-network computation, and then introduce the details of floating-point standards. Finally, we give two typical examples to illustrate the limitations of prior art and motivate our work.

II-A Emerging Trends of In-network computation

In-network computation, built on top of programmable switches, exploits the computational capacity of the switches to offload part of the computational tasks from the server side to the network. Barefoot Networks' Tofino switches [4] are among the popular programmable switches that have been widely used in academia and industry. The chip of the Tofino switch has a flexible parser and a customized match-action forwarding engine. With the provided programming language and interfaces, network programmers are able to dynamically configure the switch to program the network. Tofino switches have two multi-stage pipelines: ingress pipelines and egress pipelines. Each pipeline stage has a fixed amount of time to process packets in memory (TCAM and SRAM). The switches also support boolean and simple arithmetic operations (e.g. integer addition and subtraction) using a set of ALUs. However, the switches do not support complex operations (e.g. multiplication and division) or data types (e.g. floating-point numbers).

In-network computation is appealing for several reasons: i) many packets can be consumed and processed during data transmission, which significantly reduces network overhead (e.g. queuing latency and I/O overhead); ii) the computational workloads offloaded to the network alleviate the burden on server CPUs. For example, ATP [23], SwitchML [30] and iSwitch [25] target accelerating distributed deep learning via in-network gradient aggregation; NetCache [21] caches data in the network; NetSHa [35] accelerates LSH-based distributed search. We also see some traditional network algorithms [18] (e.g. string matching) and network telemetry tasks [36, 32] (e.g. sketches) deployed in the network.

II-B Floating-point Arithmetic

Figure 1: Data format for 16-bit floating points: (a) 16-bit float; (b) 16-bit posit.

In general, two popular technical standards for floating-point computation have been widely adopted: the traditional IEEE 754 standard [2] and the posit standard [15]. Here, we briefly introduce the two standards.

IEEE 754 floating point.  IEEE 754 float is the most widely used data type. An IEEE 754 16-bit float representation consists of three parts, as shown in Figure 1(a): a sign bit, 5 exponent bits and 10 fraction bits. Let e be the unsigned integer represented by the exponent field. If the fraction bits are f1 f2 ... fs, then f = 1.f1 f2 ... fs. The value p of a 16-bit floating-point number that does not fall into any exception case is given by:

p = sign × 2^{e-15} × f    (1)
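To make Eq. (1) concrete, the following is a minimal host-side decoder for the 16-bit format (our own illustrative sketch; the function name and the subnormal/exception handling are standard IEEE 754 behavior, not part of NetFC):

import struct

def decode_half(bits: int) -> float:
    """Decode a 16-bit IEEE 754 pattern per Eq. (1): p = sign * 2^(e-15) * f."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    e = (bits >> 10) & 0x1F        # 5 exponent bits, biased by 15
    frac = bits & 0x3FF            # 10 fraction bits
    if e == 0x1F:                  # exception cases: infinity / NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if e == 0:                     # subnormals: no implicit leading 1
        return sign * 2.0 ** -14 * (frac / 1024.0)
    f = 1.0 + frac / 1024.0        # f = 1.f1 f2 ... f10
    return sign * 2.0 ** (e - 15) * f

# Cross-checks against Python's native half-precision decoder:
assert decode_half(0x3C00) == 1.0
assert decode_half(0xC500) == struct.unpack("<e", b"\x00\xc5")[0]   # -5.0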

Modern FPUs (floating-point units) implement floating-point addition, subtraction, multiplication and division as follows. For addition and subtraction, the FPU expresses the operands with the same exponent and shifts the mantissas accordingly; the shifted mantissas are then added together. For multiplication, the FPU adds the exponents of the two numbers and multiplies their mantissas. Likewise, for division, the FPU subtracts the divisor's exponent from the dividend's exponent, and divides the dividend's mantissa by the divisor's mantissa. The final outputs of the above floating-point arithmetic are obtained by rounding and normalizing the results.

Posit floating point.  Posit is designed as a direct drop-in replacement for IEEE 754 floating-point numbers (floats) [15]. Figure 1(b) shows its data format when 16 bits are used to represent the numbers. Compared to float, the posit data type has an additional regime field. Let es be the width of the exponent field (es = 1 in Figure 1(b)), k be the integer represented by the regime field, and e be the unsigned integer represented by the exponent field. If the fraction bits are f1 f2 ... fs, then f = 1.f1 f2 ... fs. The value of a posit number p can be represented as:

p = sign × 2^{2^{es}·k} × 2^{e} × f    (2)
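Likewise, a host-side sketch that decodes a 16-bit posit per Eq. (2), assuming es = 1 as in Figure 1(b); the function and its two's-complement handling of negatives follow the posit definition [15], and the code is our own illustration rather than part of NetFC:

def decode_posit16(bits: int, es: int = 1) -> float:
    """Decode a 16-bit posit per Eq. (2): p = sign * 2^(2^es * k) * 2^e * f."""
    if bits == 0:
        return 0.0
    if bits == 0x8000:                       # the single NaN/infinity pattern
        return float("nan")
    sign = 1.0
    if bits & 0x8000:                        # negatives are two's complement
        sign, bits = -1.0, (-bits) & 0xFFFF
    body = [(bits >> i) & 1 for i in range(14, -1, -1)]   # bits after the sign
    m = 1                                    # regime: run of identical bits
    while m < 15 and body[m] == body[0]:
        m += 1
    k = m - 1 if body[0] == 1 else -m
    rest = body[m + 1:]                      # skip the regime terminator
    e = 0
    for b in rest[:es]:                      # truncated exponent bits read as 0
        e = (e << 1) | b
    f = 1.0
    for idx, b in enumerate(rest[es:]):      # fraction: f = 1.f1 f2 ...
        f += b * 2.0 ** -(idx + 1)
    return sign * 2.0 ** (2 ** es * k + e) * f

assert decode_posit16(0x4000) == 1.0
assert decode_posit16(0x7FFF) == 2.0 ** 28   # max positive posit16 (~2.7e8)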

Compared to the IEEE 754 float, posit floating point has the following advantages [15, 11, 7].

  • Larger dynamic range. The dynamic range is the range from the minimum positive number to the maximum positive number that a number system can express. The dynamic range of 16-bit float is 6×10^{-8} to 7×10^{4}, while the dynamic range of 16-bit posit is 4×10^{-9} to 3×10^{8}. That is, with the same bit width, posit has a larger dynamic range.

  • No representations wasted on NaN or infinity. For the 16-bit float, whenever the exponent bits are 11111, the representation denotes NaN or infinity; that is, 2,048 representations are used for NaN or infinity. For the 16-bit posit, however, only the single representation 0x8000 denotes NaN or infinity.

  • Tapered accuracy. Posit numbers near 1 in magnitude have more accuracy than extremely large or extremely small numbers. This phenomenon is called the "golden zone" in [11], in which the accuracy of posit is higher than that of float. For example, Posit32 has more fraction bits than Float32 for numbers whose magnitude ranges from 10^{-6} to 10^{6}.

Due to the above characteristics, posit is considered very advantageous for deep learning [22, 34, 7]. Both of these popular standards for floating-point calculation are widely used by different kinds of applications, and NetFC works well with both of them. In the following sections, we use float or floating-point (resp. posit) to refer to IEEE 754 floating point (resp. posit floating point).

II-C Motivating NetFC

We are motivated by the fact that modern programmable switches only support simple integer arithmetic operations (addition and subtraction). As a result, in-network computation tasks have to rely on other indirect 'layers' to implement floating-point arithmetic operations (either addition and subtraction, or multiplication and division) when needed. These indirect 'layers' may lose accuracy or increase the delay of the tasks. We give two examples to illustrate this.

In-network gradient aggregation.  In-network gradient aggregation is used to accelerate distributed machine learning systems [23, 30]. The basic idea is to cache gradients from training workers on programmable switches, and accumulate them when certain conditions are met to obtain aggregated gradients. In this way, the number of gradient packets sent to the parameter servers is decreased, mitigating the communication overhead. Gradients are floating-point numbers, which require the support of floating-point arithmetic operations during aggregation. As programmable switches are unable to support these operations, the current solution in [23, 30] introduces a 'shim layer' on end hosts that converts floating-point numbers to integers by multiplying them by a scaling factor (sf) on the workers; after aggregation on programmable switches (using integer arithmetic operations), the aggregated results are forwarded to servers, where they are restored to floating-point representation by dividing by sf. This conversion approach leads to significant accuracy loss. For example, consider a floating-point number x = 1.000654 and sf = 10000. The conversion turns x into 10006, losing the last two digits.
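A three-line sketch makes this loss concrete (sf and x are the illustrative values above, not constants taken from ATP's implementation):

# Host-side sketch of the float-to-integer 'shim layer'.
sf = 10000
x = 1.000654
as_int = int(x * sf)     # 10006: everything beyond the 4th decimal is lost
restored = as_int / sf   # 1.0006 after the servers divide by sf again
print(as_int, restored)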

Inband Network Telemetry.  Programmable switches are the 'sweet spot' for implementing network telemetry, as they sit in the middle of network paths. We have seen plenty of network telemetry systems built on programmable switches [14, 32, 16]. Many measurement tasks (e.g. Slowloris attack detection) require floating-point arithmetic operations (even multiplication and division). Because programmable switches cannot support these operations, these systems adopt a slow-path solution, where the intermediate results that require floating-point arithmetic are sent to local CPUs for processing. This solution inevitably introduces large delays. Let us consider the detection of Slowloris attacks [1] in Sonata [14] as a detailed example. The task monitors the traffic of all connections belonging to individual hosts to check whether the average traffic volume per connection of a host is less than a predefined threshold. This requires floating-point arithmetic operations and has to be implemented on switch local CPUs through the slow path, delaying detection. In other words, without the support of floating-point arithmetic operations on programmable switches, online detection and defense against attacks like Slowloris cannot be implemented in current inband network telemetry.

Summary.  The support of floating-point arithmetic operations on programmable switches is important and essential for enabling practical in-network computation. To the best of our knowledge, such support has been overlooked by prior art, which motivates our work, NetFC.

III Design of NetFC

NetFC aims at enabling sophisticated floating-point arithmetic on programmable switches nearly without accuracy loss or additional latency. In this section, we first describe the basic idea, then discuss the challenges of NetFC, and finally detail the design.

III-A Design Choice

To fix ideas, we assume that 16-bit floating-point numbers follow the IEEE 754 standard, but we will relax this assumption in Section III-D. We also assume the use of Barefoot Tofino switches. Intuitively, there are two potential ways to implement floating-point arithmetic on the data plane of programmable switches.

  • FPU method. A floating-point number is represented as three portions: sign, mantissa and exponent. For floating-point addition (resp. subtraction), the FPU shifts the mantissa of one floating-point operand so that its exponent is identical to that of the other operand, and then adds (resp. subtracts) the two operand mantissas. For floating-point multiplication (resp. division), it performs multiplication (resp. division) between the two operand mantissas and addition (resp. subtraction) between the two operand exponents.

  • Table-lookup method. Let's use addition to illustrate this method. For any two 16-bit floating-point operands a and b, we perform the addition z = a + b in advance, obtaining a key-value pair whose key is (a, b) and whose value is z. We then traverse all possible values of a and b to generate the key-value pairs that constitute an addition table. Subsequently, to compute the sum of any two 16-bit floating-point numbers, we only need to use the two operands as a key to look up the table; the value of the matched table entry is the result. Of course, this method can also be generalized to other arithmetic operations. (A host-side sketch of this naive table appears below.)
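The sketch below (our own illustration in Python) shows the shape of such a naive table; only a tiny slice is materialized, since the full table would need roughly 8 GB:

import struct

def f16(bits: int) -> float:
    """Bits of a 16-bit IEEE 754 float -> Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def to_bits(v: float) -> int:
    return struct.unpack("<H", struct.pack("<e", v))[0]

# The full table enumerates all (a, b) pairs: 2^16 * 2^16 two-byte entries.
add_table = {(a, b): to_bits(f16(a) + f16(b))
             for a in range(0x3C00, 0x3C04)       # a few keys near 1.0
             for b in range(0x3C00, 0x3C04)}

# One addition is then a single exact-match lookup on the operand bit patterns.
print(f16(add_table[(0x3C00, 0x3C00)]))           # 1.0 + 1.0 -> 2.0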

However, the question remains: can either of the above methods be deployed to programmable switches directly? To explore this, we analyze the capacity limitations of programmable switches as follows:

  i. Limited computation capacity. Programmable switches (e.g. Barefoot Tofino) only support simple integer arithmetic. That is, floating-point numbers and the arithmetic operations of multiplication and division exceed the switch capacity.

  ii. Scarce on-chip memory. Switch on-chip memory is very small (e.g. tens of megabytes in Barefoot Tofino), so it cannot provide huge memory for floating-point arithmetic. Note that a portion of the memory has to be reserved for forwarding-rule storage and lookup, further aggravating this problem.

  iii. Limited pipeline stages. The switch data plane often consists of a pipeline of stages, each of which is a packet processing unit equipped with some computing and storage resources. However, the number of stages is small (e.g. at most 32 stages in Barefoot Tofino), and any two dependent packet processing operations cannot be assigned to the same stage.

Now, let us return to the FPU and table-lookup methods. The FPU requires on-the-fly variable shifting operations and multiplication/division arithmetic, and thus cannot be implemented on programmable switches. We therefore turn to the table-lookup method.

The table-lookup method fits programmable switches, which abstract packet processing on the data plane as tables consisting of match-action tuples. However, it does not work directly either, due to its large memory consumption: a 16-bit floating-point addition table would consume about 8 GB (2^{16} × 2^{16} × 2B) of memory. Thus, the implementation boils down to how to reduce the memory consumption.

III-B Basic idea: Divide and Conquer

Figure 2: Example of NetFC.

The original table-lookup method traverses all possible operands and their combinations, which constitutes a very large table. For example, two 16-bit operands generate 2^{16} × 2^{16} (i.e. the Cartesian product) table entries. NetFC adopts a divide-and-conquer approach to address this issue. Specifically, it utilizes logarithm projection and transformation to convert the original large table into several much smaller tables together with some simple integer arithmetic operations; we provide the details in the subsequent sections. It is noteworthy that different types of floating-point arithmetic generate different numbers of small tables. For example, as shown in Figure 2, two small tables replace the original large arithmetic table, reducing the total number of table entries from 2^{16} × 2^{16} to 2^{16} + 2^{16}. Consequently, NetFC achieves floating-point arithmetic operations by looking up these small tables in sequence and performing some integer arithmetic operations.

Readers might wonder whether looking up multiple tables and performing some arithmetic operations would degrade the performance of programmable switches. In fact, packet-processing pipelines have an all-or-nothing characteristic [13]: as long as a program fits the pipeline, it runs at the line rate of the switch. Next we introduce the details of NetFC for the different arithmetic types.

III-B1 Addition and Subtraction

We assume that the two floating-point numbers, x and y, are positive; we will relax this assumption later. Note that we mainly discuss the addition operation, as subtraction can be viewed as a type of addition. Let i (resp. j) denote ⌊log2(x)⌋ (resp. ⌊log2(y)⌋). The round-down operations degrade computing accuracy, and we propose a novel approach to make up for this accuracy loss (see Section IV-B). Logically, x + y can be obtained as follows.

x + y = 2^{log2(x+y)}
      = 2^{log2(x) + log2(1 + y/x)}
      = 2^{log2(x) + log2(1 + 2^{log2(y) - log2(x)})}
      = 2^{i + log2(1 + 2^{j-i})}    (3)

To achieve the above addition, we need to set up three tables (see Figure 3). The first table, logTable, records the logarithm values of all possible keys; looking up logTable directly yields the values of i and j. The second table, miTable, gives the value σ(θ) = log2(1 + 2^{θ}) for a given θ; we use j − i to look up miTable, and then add the result to i, obtaining the value of i + log2(1 + 2^{j-i}). The last table, expTable, gives the exponential value for a given key; with this table, we can find the value of 2^{i + log2(1 + 2^{j-i})} (i.e. x + y).

Figure 3: Floating-point arithmetic on switches.
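The following host-side sketch (our own simplification: positive integer operands and no scaling factor yet, so the floors are deliberately coarse) mimics the three-table pipeline of Eq. (3):

import math

logTable = {v: math.floor(math.log2(v)) for v in range(1, 1 << 16)}  # key -> floor(log2)
miTable  = {t: math.floor(math.log2(1 + 2.0 ** t)) for t in range(-15, 16)}
expTable = {k: 2.0 ** k for k in range(-30, 31)}

def add(x: int, y: int) -> float:
    i, j = logTable[x], logTable[y]
    m = miTable[j - i]            # log2(1 + 2^(j-i)), floored
    return expTable[i + m]        # 2^(i + log2(1 + 2^(j-i))) ~ x + y

print(add(512, 512))   # 1024.0, exact (j - i = 0 hits miTable exactly)
print(add(1000, 24))   # 512.0 vs. the true 1024: floors are coarse,
                       # motivating the scaling factor of Section IV-B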

Next we consider the more general condition that x ≠ 0 and y ≠ 0. Then x + y equals |x+y| or −|x+y|. Likewise, we use i (resp. j) to denote ⌊log2(|x|)⌋ (resp. ⌊log2(|y|)⌋). x + y can be obtained as follows.

x + y = ±2^{log2(|x+y|)}
      = ±2^{log2(|x|) + log2(|x+y| / |x|)}    (4)

Depending on the signs of the operands, x + y equals one of four expressions: |x|+|y|, |x|−|y|, |y|−|x|, or −|x|−|y|. Finally, we get Eq. 5.

x + y = ±2^{i + log(±1 ± 2^{j-i})}    (5)

Indeed, due to the ± signs, Eq. 5 has eight possible cases, which are decided by the following three conditions: 1) x > 0; 2) y > 0; 3) |x| > |y|. The details are shown in Table I.

TABLE I: Decision table.
x>0   y>0   |x|>|y|   formula
T     T     T         2^{i+log(1+2^{j-i})}
T     T     F         2^{i+log(1+2^{j-i})}
T     F     T         2^{i+log(1-2^{j-i})}
T     F     F         -2^{i+log(-1+2^{j-i})}
F     T     T         -2^{i+log(1-2^{j-i})}
F     T     F         2^{i+log(-1+2^{j-i})}
F     F     T         -2^{i+log(1+2^{j-i})}
F     F     F         -2^{i+log(1+2^{j-i})}

Similarly, we still need logTable, miTable and expTable. Three variants of miTable are generated: σ(θ) = log2(1 + 2^{θ}), σ(θ) = log2(1 − 2^{θ}) and σ(θ) = log2(−1 + 2^{θ}). In summary, NetFC maintains one logTable, three miTables and one expTable to achieve the floating-point addition operation.

Corner cases.  There are some corner cases, however. If x (resp. y) equals 0, NetFC returns y (resp. x) directly. In addition, in the case of j − i > 15 (resp. j − i < −15), y (resp. x) is more than 15 binary orders of magnitude larger than x (resp. y), so the sum of x and y approximately equals y (resp. x).

Algorithm 1 In-network floating-point addition.
Require: p, an input data packet.
1:  parse floating-point operands x, y from p.
2:  if x (or y) = 0 then
3:      return y (or x)
4:  end if
5:  get i = ⌊log2(|x|)⌋, j = ⌊log2(|y|)⌋ by looking up logTable.
6:  n = j − i.
7:  if n > 15 then
8:      return y
9:  else if n < −15 then
10:     return x
11: else
12:     select miTable based on Table I.
13:     get m = ⌊log(±1 ± 2^{n})⌋ by looking up miTable.
14:     k = i + m.
15:     get |x + y| = 2^{k} by looking up expTable.
16:     set sign bit according to Table I.
17: end if

Algorithm 1 summarizes how NetFC performs floating-point addition/subtraction operations on programmable switches. First, it parses an incoming packet to obtain the two operands x and y (line 1), checks their values (lines 2 to 4) and projects them into logarithm space by looking up logTable (line 5). After processing the corner cases, it decides which miTable to use based on Table I. Finally, it looks up the selected miTable and expTable to calculate the result (lines 13 to 16). Figure 4 shows the data-plane implementation of the addition/subtraction operations.

Figure 4: Implementation of floating-point addition in NetFC.

III-B2 Multiplication and Division

Consider two non-zero floating-point numbers, x and y. We still use i and j to denote ⌊log2(|x|)⌋ and ⌊log2(|y|)⌋ respectively. Note that x = −2^{log2(|x|)} = −2^{i} (resp. x = 2^{i}) when x < 0 (resp. x > 0). Likewise, y = −2^{j} (resp. y = 2^{j}) when y < 0 (resp. y > 0). Thus the multiplication of x and y can be transformed to x*y = ±2^{i+j}, while their division is x/y = ±2^{i-j}.

Figure 5: Implementation of floating-point multiplication in NetFC.

In summary, NetFC only needs two types of tables, logTable and expTable, to achieve the floating-point multiplication and division operations.

Corner cases.  Similarly, some corner cases should be dealt with separately. One case is that one or both operands equal 0, in which case NetFC returns 0 or NaN (Not a Number, the IEEE 754 representation of exceptions) directly. In the other case, the result exceeds the representation range of IEEE 754 floating point, and NetFC returns INFINITY (all exponent bits set to 1 and all fraction bits set to 0) or 0 directly.

Algorithm 2 In-network floating-point multiplication.
Require: p, an input data packet.
1:  parse floating-point operands x, y from p.
2:  if x = 0 or y = 0 then
3:      return 0.
4:  end if
5:  get i = ⌊log(|x|)⌋, j = ⌊log(|y|)⌋ by looking up logTable.
6:  sign(x*y) = XOR(sign(x), sign(y)).
7:  n = i + j.
8:  if n > 15 then
9:      return INFINITY
10: else if n < −24 then
11:     return 0.
12: else
13:     get |x*y| = 2^{n} by looking up expTable.
14:     set the sign bit.
15: end if

Algorithm 2 shows how NetFC performs floating-point multiplication. NetFC first parses an input packet to obtain the two operands x and y, and checks whether either operand equals 0 (lines 2 to 4). It then looks up logTable to find the values of i and j, whose sum is used to detect the corner cases (lines 5 to 11). After processing the corner cases, NetFC uses the sum i + j to look up expTable and decides the sign of the result (lines 13 to 14). The processing logic of floating-point division is similar to Algorithm 2; the major differences lie in the corner cases and in line 7 (i.e., n = i − j). Figure 5 shows the implementation of multiplication and division for floating-point numbers on programmable switches.
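A corresponding host-side sketch of Algorithm 2 follows (again unscaled, so results are exact only when operand magnitudes are powers of two):

import math

def fp_mul(x: float, y: float) -> float:
    """Host-side sketch of Algorithm 2 (unscaled floors for brevity)."""
    if x == 0 or y == 0:
        return 0.0
    i = math.floor(math.log2(abs(x)))       # logTable lookups (line 5)
    j = math.floor(math.log2(abs(y)))
    sign = 1.0 if (x > 0) == (y > 0) else -1.0   # XOR of the sign bits
    n = i + j
    if n > 15:  return sign * float("inf")  # overflows the 16-bit float range
    if n < -24: return 0.0                  # underflows to zero
    return sign * 2.0 ** n                  # expTable lookup (line 13)

print(fp_mul(4.0, -8.0))    # -32.0, exact for powers of two
print(fp_mul(3.0, 3.0))     # 4.0 vs. the true 9.0: coarse without the
                            # scaling factor of Section IV-B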

III-C Overhead analysis

NetFC requires several tables on the switch data plane to implement floating-point arithmetic operations. This does consume the on-chip memory of programmable switches. We next discuss its overhead.

NetFC generates at most 5 tables: two logTables (in practice, each operand looks up an exclusive logTable; 2 × 2^{15} 16-bit entries in total, ≈128KB), three miTables (2^{15} 16-bit entries each, ≈192KB in total) and one expTable (2^{16} 16-bit entries, ≈128KB). As a result, NetFC consumes about 448KB of on-chip memory in total, which is reasonable considering that our low-end Barefoot Tofino switches are equipped with 20MB of memory. Nevertheless, we propose an optimization approach to further reduce the memory usage (see Section IV). As to pipeline usage, our experiments show that NetFC only consumes 5 pipeline stages, so it easily runs on the data plane of switches.

III-D Working with Posit

So far we have built NetFC for the IEEE 754 standard; next we discuss support for posit. Overall, there are two major differences between IEEE 754 float and posit floating point when implemented in NetFC: 1) to guarantee that both the input and the output data are posit, the keys of logTable and the values of expTable need to be converted to posit floating points; 2) since posit has a larger dynamic range than IEEE 754 float (discussed in Section II-B), NetFC has to consider different corner cases. For example, for float addition the corner case is |j − i| > 15 (discussed in Section III-B1), but for posit addition the corner case becomes |j − i| > 64. Overall, the implementation for posit does not differ much from that for float in principle. We believe that NetFC is universal and can be generalized to other data types.

IV Implementation and Optimization

In this section, we first present implementation details on programmable switches, and then introduce a few optimizations to improve the computational accuracy and reduce the overhead.

IV-A Implementation

We implement NetFC on a Barefoot Tofino switch (3.2Tb/s) using the P4_16 language. The switch has some restrictions on resource usage; a particular restriction is that it cannot support complicated programming logic due to the limited pipeline stages. However, NetFC requires some if-else conditions to decide which miTable to use. To address this issue, NetFC uses a separate table with pre-issued entries covering all cases shown in Table I to choose the miTable. This eliminates multi-layer nested if-else conditions and reduces the usage of pipeline stages.

IV-B Optimization: scaling factor

As mentioned before, NetFC uses ⌊log2(x)⌋ to approximate log2(x), which inevitably incurs accuracy loss since the decimal fraction of log2(x) is discarded. To cope with this problem, NetFC multiplies log2(x) by a scaling factor k to amplify its decimal fraction so that it is not lost to rounding. NetFC then divides by this scaling factor in subsequent steps to guarantee the correctness of the floating-point arithmetic operation.

To illustrate this, consider two operands x and y. We first get i = ⌊log(x)·k⌋ and j = ⌊log(y)·k⌋ by looking up logTable, and then use θ = i − j to look up miTable, obtaining γ = ⌊log(±1 ± 2^{θ/k})·k⌋. Finally, we look up expTable for the result ⌊±2^{(i+γ)/k}⌋. Note that every scaling-up operation (multiplying by k) is followed by a scaling-down operation (dividing by k). A larger k brings higher accuracy, but also consumes more table entries (i.e. memory). This tradeoff between accuracy and memory is evaluated in Section V.
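The following sketch simulates this tradeoff for the addition path (our own host-side simulation; for brevity we use positive operands and the j − i convention of Section III-B1):

import math

def build_tables(k: int, B: int = 15):
    # Scaled tables store floor(log2(v) * k); miTable keys are scaled the same way.
    def log(v): return math.floor(math.log2(v) * k)
    mi = {t: math.floor(math.log2(1 + 2.0 ** (t / k)) * k)
          for t in range(-B * k, B * k + 1)}
    return log, mi

def scaled_add(x: float, y: float, k: int) -> float:  # positive operands only
    log, mi = build_tables(k)
    i, j = log(x), log(y)
    g = mi[j - i]                  # gamma = floor(log2(1 + 2^(theta/k)) * k)
    return 2.0 ** ((i + g) / k)    # expTable entry: scale back down by k

for k in (1, 16, 256):
    print(k, scaled_add(1000.0, 24.0, k))  # 512.0, ~980.1, ~1021.2: approaches 1024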

IV-C Optimization: Prefix-Based Lossless Compression

We next discuss the optimization of NetFC's memory overhead. For a table in NetFC, it is possible that many consecutive entries have the same value, so their corresponding keys can be merged. We thus propose a prefix-based compression mechanism.

Figure 6: Example of the prefix-based compression mechanism.

Figure 6 shows an example in which NetFC compresses four keys (i.e. table entries) into one key via a wildcard representation, which can be loaded into the switch TCAM. It is noteworthy that this compression method is lossless. Our experimental results show that it saves about 25% of the memory consumption.
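The sketch below shows one way such merging can be done (our own greedy implementation over a toy 4-bit key space; the actual TCAM encoding on the switch may differ):

def compress(table, width=16):
    """Greedily merge runs of equal-valued consecutive keys into aligned
    prefix (wildcard) entries, as they would be loaded into TCAM."""
    entries, key = [], 0
    while key < (1 << width):                   # assumes a dense key space
        span = 1
        while (key % (span * 2) == 0 and key + span * 2 <= (1 << width)
               and all(table[key + o] == table[key] for o in range(span * 2))):
            span *= 2                           # grow an aligned block of equal values
        bits = width - span.bit_length() + 1    # exact-match bits kept in the prefix
        prefix = format(key >> (width - bits), f"0{bits}b") + "*" * (width - bits)
        entries.append((prefix, table[key]))
        key += span
    return entries

# Toy 4-bit table: keys 0-3 share one value and collapse into a single
# '00**' entry, mirroring Figure 6.
toy = {k: (7 if k < 4 else k) for k in range(16)}
for prefix, value in compress(toy, width=4):
    print(prefix, "->", value)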

V Evaluation

Our evaluation answers the following questions about the computational accuracy and the overhead of NetFC:

  • Q1: In terms of floating-point addition/subtraction, how does NetFC perform compared to the state-of-the-art approach (i.e. float-to-integer [23]) (Section V-B)?

  • Q2: How does NetFC perform for floating-point multiplication/division (Section V-C)?

  • Q3: How does the scaling factor affect NetFC (Section V-D)?

  • Q4: Does NetFC work well for the posit standard (Section V-E)?

V-A Methodology

Experimental setup.  We evaluate NetFC on a testbed with two commodity servers, each of which is equipped with 32 cores of Intel(R) Xeon(R) E5-2682 CPU @ 2.5GHz and 256GB RAM, running Ubuntu 16.04 with Linux kernel 4.15.0-132. Both servers are directly connected to a Barefoot Tofino switch (3.2 Tbps). We run NetFC on the programmable switch, and deploy a "sender" on one server and a "receiver" on the other. The sender reads datasets and constructs floating-point operands and operators, which constitute NetFC packets forwarded to the switch. The switch identifies NetFC packets and performs the floating-point arithmetic operations, while the receiver receives, parses and checks the results from the switch.

Baseline.  To the best of our knowledge, the state-of-the-art approach that performs floating-point arithmetic on the switch data plane is the float-to-integer method, which has been widely used to accelerate distributed machine learning systems [23, 30]. However, this method only supports floating-point addition and subtraction. Thus we use it as the baseline when evaluating NetFC on addition and subtraction. For floating-point multiplication and division, we compare NetFC against the exact results (e.g. calculated by CPUs).

Benchmarks.  We use two random (synthetic) datasets and one real dataset to evaluate NetFC. One random dataset, denoted Dataset I, consists of ten thousand randomly generated 16-bit floating-point numbers; the other random dataset, denoted Dataset II, contains ten thousand randomly generated 16-bit floating-point decimals. The real dataset, denoted Dataset III, is made up of gradient updates (fifty thousand records) from real distributed deep learning model training [20].

Metrics.  Let ⊗ denote +, −, × or ÷. We use the following formulas to quantify the accuracy:

expect_result = x ⊗ y
result = NetFC(x ⊗ y)
accuracy = e^{-|expect_result − result| / |expect_result|}    (6)

where expect_result is the exact result calculated by the receiver CPUs, and accuracy reflects the proportion of the error to expect_result. Thus accuracy lies between 0 and 1, and a higher accuracy indicates that result is closer to expect_result. To see this, assume that accuracy is close to 1. Then |expect_result − result| ≈ (1 − accuracy) · |expect_result|, which is easily proved via Taylor series [3]. Our experiments show that this approximation holds as long as accuracy is larger than 0.95.

Furthermore, we also use MSE (Mean Square Error) to measure the deviation of expect_resultexpect\_result and resultresult:

MSE = (1/n) Σ_{i=1}^{n} (expect_result_i − result_i)^2    (7)

where n represents the total number of pairs in the dataset. MSE measures the average squared difference between expect_result and result; smaller MSE values mean better accuracy.
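Both metrics are straightforward to compute on the receiver; a small sketch with our own helper names:

import math

def accuracy(expect: float, result: float) -> float:
    """Eq. (6): accuracy = e^{-|expect - result| / |expect|}."""
    return math.exp(-abs(expect - result) / abs(expect))

def mse(expects, results):
    """Eq. (7): mean squared error over n (expect, result) pairs."""
    return sum((e - r) ** 2 for e, r in zip(expects, results)) / len(expects)

print(accuracy(1024.0, 1023.4))   # ~0.99941: within 0.06% of the exact value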

TABLE II: Accuracy Table
                                IEEE 754 floating point                        posit floating point
metric          dataset   +,- (NetFC)  +,- (Float-to-integer)   ×       ÷       +,-     ×       ÷
average         I           99.94        48.19                 99.96   99.96    99.86   99.91   99.91
accuracy (%)    II          99.95        84.83                 99.94   99.98    99.86   99.87   99.95
                III         99.96        95.28                 99.97   99.98    -       -       -
median          I           99.94        36.79                 100     100      99.91   99.94   99.94
accuracy (%)    II          100          99.93                 100     100      99.94   99.87   99.96
                III         100          99.24                 100     100      -       -       -
minimum         I           77.88        13.72                 60.65   95.56    36.79   93.11   89.48
accuracy (%)    II          68.55        4.54                  71.65   99.85    36.79   93.11   98.20
                III         76.74        0.01                  36.79   99.88    -       -       -

V-B Addition/Subtraction Performance

Figure 7: Floating-point addition/subtraction accuracy heatmaps: (a) Dataset I; (b) Dataset II; (c) Dataset III. (Note that we adopt the logarithm of the values in the grids.)

For each of the three datasets, we plot a heatmap to compare the accuracy of NetFC and the float-to-integer method. The vertical axis represents the tenths digit of the accuracy and the horizontal axis the hundredths digit. When the vertical-axis value is i and the horizontal-axis value is j, the value in the corresponding grid is the logarithm of the number of data points with accuracy between 0.1·i + 0.01·j and 0.1·i + 0.01·(j+1). For example, if the value at (6,3) is 12 (see Figure 7(a) left), the number of data points with accuracy between 0.36 and 0.37 is about 4,096 (2^{12}).

As shown in Figure 7(a), the accuracy of the float-to-integer method is mainly distributed near (6,3), while the accuracy of NetFC concentrates around (9,9). Overflow is the main reason for the poor accuracy of the float-to-integer method: an arithmetic operation attempts to create a numeric value outside the range that can be represented with the given number of bits. For example, consider two 16-bit floating-point numbers x = 4.765625 and y = −0.005203 with a scaling factor of 10,000. The float-to-integer method first converts x to 47656 and y to −52. Unfortunately, 47656 is out of the range of 16-bit integers and will be mistaken for −17880 by the switch. This example reveals the drawback (significant errors) of the float-to-integer method: overflow occurs when the factor is large, while loss of decimal parts occurs when the factor is small. Due to this property, the float-to-integer method performs well only on decimal datasets.

Figure 7(b) shows the performance of the two methods on Dataset II. As shown in Figure 7(b) left, the accuracy of the float-to-integer method is concentrated in the range of 0.97 to 1.0. This is much better than the result in Figure 7(a), but still not as good as NetFC (see Figure 7(b) right). For NetFC, only 20 out of 10,000 data points have an accuracy of less than 99%.

In Figure 7(c), we compare the performance of the two methods on the dataset sampled from a distributed machine learning system. Again, NetFC achieves better accuracy. It is noteworthy that this dataset takes values in the range of −0.01 to 0.01, which is detrimental to our approach because the non-decimal part of the tables is completely wasted. We can thus remove those table entries whose value is larger than 1, since they can never be matched. On this basis, NetFC can save more on-chip memory and further improve computational accuracy by increasing the scaling factor.

These results demonstrate the accuracy advantage of NetFC over the float-to-integer method. Table III compares the two methods in terms of MSE. NetFC performs well on all three datasets; the float-to-integer method is acceptable on Dataset III, but performs poorly on Datasets I and II.

TABLE III: MSE Table
                        Dataset I       Dataset II      Dataset III
+,- (NetFC)             91.32           2.60×10^{-8}    1.95×10^{-13}
+,- (Float-to-integer)  198106067.16    0.11            4.17×10^{-11}
×                       28.13           1.52×10^{-9}    2.72×10^{-16}
÷                       27.29           0.98            1.12×10^{-4}

V-C Multiplication/Division Performance

Figure 8: Dataset II accuracy heatmap for multiplication and division.

Figure 8 shows the accuracy distribution of NetFC for multiplication and division using Dataset II; similar results were found on the other two datasets. We see high accuracy no matter which dataset is used: almost all computations achieve an accuracy over 99% for division, and only 50 out of 10,000 computations have an accuracy of less than 99% for multiplication. Table III demonstrates that the deviation of NetFC's result from expect_result is very low, and Table II shows that the median accuracy of NetFC for these operations reaches 100%.

V-D Sensitiveness to Scaling Factor

Figure 9: Relation between median accuracy and memory usage for floating-point addition.

Recall that the scaling factor is used to compensate for the accuracy loss of the round-down operation: a larger scaling factor increases the accuracy but also the memory consumption. Figure 9 shows the memory usage and median accuracy as the scaling factor increases for floating-point addition. We can see that the memory usage is proportional to the value of the scaling factor. The accuracy increases significantly with the scaling factor when the factor is below 256; beyond this point the improvement becomes marginal. That is, NetFC can achieve a significant accuracy improvement at limited extra cost.

V-E Posit Performance

Table II shows the accuracy of NetFC for floating points of the posit standard (scaling factor = 512). Again, we see very high accuracy, with an average accuracy above 99.86% and a median accuracy above 99.87% for all four types of operations. These results demonstrate NetFC's good support of floating-point arithmetic operations for both the IEEE 754 standard and the posit standard.

We see a slightly lower accuracy for the posit standard than for the IEEE 754 standard in Table II. The reason is that the posit standard has a larger dynamic range and thus, with the same amount of memory as the IEEE 754 standard, it has to use a smaller scaling factor, namely 512 (versus 1,024 for the IEEE 754 standard in our experiments). A smaller scaling factor leads to a loss in accuracy.

VI Use Case: Online detection of Slowloris attacks

Figure 10: Slowloris attack detection measurement framework with NetFC.

To verify the feasibility of NetFC in real applications, we integrated NetFC into an online Slowloris attack detection framework, shown in Figure 10.

Prior Art.  Without the support of floating-point multiplication and division on programmable switches, Sonata, a typical in-network telemetry system, implements Slowloris attack detection as follows: (1) the inquirer sends the data plane a query about whether there exists a Slowloris attack associated with a destination IP; (2) the system gets the number of connections and the total bytes corresponding to the IP, and forwards them to the control plane; (3) the controller calculates the average number of connections per byte by dividing the number of connections by the total bytes, and then sends the obtained number to the data plane, which compares the value with the predefined threshold and answers the inquirer. This detour to the local CPUs increases the detection delay; the fundamental reason is the lack of support for division on the data plane of programmable switches.

Online Slowloris attack detection with NetFC.  NetFC provides good support of floating-point division on the data plane, and thus enables online attack detection and defense, as shown in Figure 10. With NetFC, detection no longer requires the involvement of the control plane. Specifically, we add a NetFC module and a CPB (connections per byte) register to Sonata's implementation. The NetFC module completes the division operation and writes the result to the CPB register.

Evaluation Setup.  To measure the performance of our implementation, we set up a cluster consisting of five servers and a Barefoot Tofino switch. Each server has 32 CPU cores and a 100G NIC, and is directly connected to the switch. We use four of the five servers as hosts and one as the inquirer.

The experiment is divided into two phases. In the first phase, hosts communicate through the switch. When a packet arrives, our online version gets the number of corresponding connections and the total number of bytes for the destination IP, which are stored in two registers; it then computes the average number of connections per byte through NetFC, and stores the outcome in the CPB register.

In the second phase, the inquirer sends a query packet, which specifies the IP to be queried, to the switch. The switch queries the CPB register directly to get the result, and then sends the answer packet back to the inquirer. The switch can also drop subsequent packets to this destination IP for early defense.

Results.  By capturing packets on the inquirer's NIC, we measured the time elapsed from sending the query packet to receiving the answer packet. Experiments show that the time to complete a query decreases from 43.156ms to 0.046ms, demonstrating the benefit of removing the switch's local CPU from the detection loop. With such a short response time, network operators can quickly detect possible attacks and effectively protect the network from them.

VII Related Work

FPUs. The FPU is a relatively conventional topic in the community. Recent research mainly focuses on improving the performance of floating-point calculation or increasing the computational accuracy [31, 29, 10, 27]. However, due to the specific programming model and limited computing capacity, it is not easy to deploy FPUs on programmable switches.

Logarithmic Number System Processors. The Logarithmic Number System (LNS) is another numeric representation. The community has proposed different approaches to optimize LNS processors [33, 8, 9]. LNS mainly uses the logarithmic representation to help its microprocessors achieve higher computational performance; in contrast, NetFC borrows this representation to achieve on-the-fly floating-point operations on programmable switches.

In-network computation. There are many works focusing on in-network computation that achieve significant performance improvements. For example, [23, 30] perform in-network gradient aggregation to accelerate distributed deep learning. NetCache [21] implements a high-performance key-value cache in programmable switches. Sonata [14] utilizes programmable switches to achieve fast in-band network telemetry. However, these works do not fundamentally solve the floating-point calculation issue on the data plane of programmable switches.

VIII Conclusion

In-network computation is an emerging trend to reduce network overhead and accelerate data-intensive applications by offloading some computational tasks to the network (i.e. programmable switches). However, programmable switches only support simple integer arithmetic operations, while other sophisticated floating-point operations exceed their capacity. To address this issue, prior arts propose a float-to-integer mechanism or offload such computational tasks to the local CPUs of switches, which may either lead to accuracy loss or incur additional latency. To this end, we design and implement NetFC, a table-lookup method, to achieve on-the-fly in-network floating-point operations nearly without accuracy loss. NetFC adopts a divide-and-conquer mechanism and a prefix-based lossless compression method to reduce memory consumption, and designs a scaling-factor approach to improve computational accuracy. Our experimental results show that the average accuracy of NetFC is 99.94% even in the worst case, with only 448KB of memory consumption. Furthermore, we integrate NetFC into Sonata for detecting the Slowloris attack, yielding a significant decrease in detection delay. We believe that NetFC can be a building block for in-network computation.

References

  • [1] Slowloris HTTP DoS. https://web.archive.org/web/20150426090206/http://ha.ckers.org/slowloris, 2009.
  • [2] IEEE 754 standard. http://mathcenter.oxford.emory.edu/site/cs170/ieee754/, 2021.
  • [3] Taylor series. https://simple.wikipedia.org/wiki/Taylor_series, 2021.
  • [4] Tofino switch. https://www.intel.com/content/www/us/en/products/network-io/programmable-ethernet-switch/tofino-series/tofino.html, 2021.
  • [5] Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In Proceedings of the 2013 ACM SIGMOD international conference on Management of data, pages 725–736, 2013.
  • [6] Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I Jordan, and Ion Stoica. Managing data transfers in computer clusters with orchestra. ACM SIGCOMM Computer Communication Review, 41(4):98–109, 2011.
  • [7] Marco Cococcioni, Emanuele Ruffaldi, and Sergio Saponara. Exploiting posit arithmetic for deep neural networks in autonomous driving applications. In 2018 International Conference of Electrical and Electronic Technologies for Automotive, pages 1–6. IEEE, 2018.
  • [8] John N Coleman and EI Chester. A 32 bit logarithmic arithmetic unit and its performance compared to floating-point. In Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No. 99CB36336), pages 142–151. IEEE, 1999.
  • [9] John N Coleman, EI Chester, Christopher I Softley, and Jiri Kadlec. Arithmetic on the european logarithmic microprocessor. IEEE Transactions on Computers, 49(7):702–715, 2000.
  • [10] Thierry Cretegny, Thierry Dauxois, Stefano Ruffo, and Alessandro Torcini. Localization and equipartition of energy in the β-FPU chain: Chaotic breathers. Physica D: Nonlinear Phenomena, 121(1-2):109–126, 1998.
  • [11] Florent De Dinechin, Luc Forget, Jean-Michel Muller, and Yohann Uguen. Posits: the good, the bad and the ugly. In Proceedings of the Conference for Next Generation Arithmetic 2019, pages 1–10, 2019.
  • [12] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
  • [13] Xiangyu Gao, Taegyun Kim, Michael D Wong, Divya Raghunathan, Aatish Kishan Varma, Pravein Govindan Kannan, Anirudh Sivaraman, Srinivas Narayana, and Aarti Gupta. Switch code generation using program synthesis. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pages 44–61, 2020.
  • [14] Arpit Gupta, Rob Harrison, Marco Canini, Nick Feamster, Jennifer Rexford, and Walter Willinger. Sonata: Query-driven streaming network telemetry. In Proceedings of the 2018 conference of the ACM special interest group on data communication, pages 357–371, 2018.
  • [15] John L Gustafson and Isaac T Yonemoto. Beating floating point at its own game: Posit arithmetic. Supercomputing Frontiers and Innovations, 4(2):71–86, 2017.
  • [16] Qun Huang, Patrick PC Lee, and Yungang Bao. Sketchlearn: relieving user burdens in approximate measurement with automated statistical inference. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 576–590, 2018.
  • [17] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and Chenyu Yan. Speeding up distributed request-response workflows. ACM SIGCOMM Computer Communication Review, 43(4):219–230, 2013.
  • [18] Theo Jepsen, Daniel Alvarez, Nate Foster, Changhoon Kim, Jeongkeun Lee, Masoud Moshref, and Robert Soulé. Fast string searching on PISA. In Proceedings of the 2019 ACM Symposium on SDN Research, pages 21–28, 2019.
  • [19] Theo Jepsen, Masoud Moshref, Antonio Carzaniga, Nate Foster, and Robert Soulé. Life in the fast lane: A line-rate linear road. In Proceedings of the Symposium on SDN Research, pages 1–7, 2018.
  • [20] Biye Jiang, Chao Deng, Huimin Yi, Zelin Hu, Guorui Zhou, Yang Zheng, Sui Huang, Xinyang Guo, Dongyue Wang, Yue Song, et al. XDL: An industrial deep learning framework for high-dimensional sparse data. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data, pages 1–9, 2019.
  • [21] Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé, Jeongkeun Lee, Nate Foster, Changhoon Kim, and Ion Stoica. Netcache: Balancing key-value stores with fast in-network caching. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 121–136, 2017.
  • [22] Seyed Hamed Fatemi Langroudi, Tej Pandit, and Dhireesha Kudithipudi. Deep learning inference on embedded devices: Fixed-point vs posit. In 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pages 19–23. IEEE, 2018.
  • [23] ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael Swift. ATP: In-network aggregation for multi-tenant learning. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 741–761, 2021.
  • [24] Yanfang Le, Hyunseok Chang, Sarit Mukherjee, Limin Wang, Aditya Akella, Michael M Swift, and TV Lakshman. UNO: Unifying host and smart NIC offload for flexible packet processing. In Proceedings of the 2017 Symposium on Cloud Computing, pages 506–519, 2017.
  • [25] Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing, and Jian Huang. Accelerating distributed reinforcement learning with in-switch computing. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pages 279–291. IEEE, 2019.
  • [26] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146, 2010.
  • [27] Stuart Franklin Oberman. Design Issues in High Performance Floating Point Arithmatic Units. PhD thesis, Citeseer, 1996.
  • [28] Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. A generic communication scheduler for distributed dnn training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 16–29, 2019.
  • [29] Bob Rink. Symmetry and resonance in periodic fpu chains. Communications in Mathematical Physics, 218(3):665–685, 2001.
  • [30] Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, and Peter Richtárik. Scaling distributed machine learning with in-network aggregation. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021.
  • [31] Julian Stecklina and Thomas Prescher. Lazyfp: Leaking fpu register state using microarchitectural side-channels. arXiv preprint arXiv:1806.07480, 2018.
  • [32] Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 561–575, 2018.
  • [33] LK Yu and David M Lewis. A 30-b integrated logarithmic number system processor. IEEE journal of solid-state circuits, 26(10):1433–1440, 1991.
  • [34] Hao Zhang, Jiongrui He, and Seok-Bum Ko. Efficient posit multiply-accumulate unit generator for deep learning applications. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2019.
  • [35] Penghao Zhang, Heng Pan, Zhenyu Li, Peng He, Zhibin Zhang, Gareth Tyson, and Gaogang Xie. Accelerating LSH-based distributed search with in-network computation. 2021.
  • [36] Yikai Zhao, Kaicheng Yang, Zirui Liu, Tong Yang, Li Chen, Shiyi Liu, Naiqian Zheng, Ruixin Wang, Hanbo Wu, Yi Wang, et al. LightGuardian: A full-visibility, lightweight, in-band telemetry system using sketchlets. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 991–1010, 2021.