Towards Fast and Scalable Private Inference
Abstract.
Privacy and security have rapidly emerged as first-order design constraints. Users now demand more protection over who can see their data (confidentiality) as well as how it is used (control). Here, existing cryptographic techniques for security fall short: they secure data when stored or communicated but must decrypt it for computation. Fortunately, a new paradigm of computing exists, which we refer to as privacy-preserving computation (PPC). Emerging PPC technologies can be leveraged for secure outsourced computation or to enable two parties to compute without revealing either party's secret data. Despite their phenomenal potential to revolutionize user protection in the digital age, their realization has been limited by exorbitant computational, communication, and storage overheads.
This paper reviews recent efforts to address various PPC overheads using private inference (PI) in neural networks as a motivating application. First, the problem and the relevant technologies, including homomorphic encryption (HE), secret sharing (SS), garbled circuits (GCs), and oblivious transfer (OT), are introduced. Next, a characterization of their overheads when used to implement PI is presented. The characterization motivates the need for both GC and HE accelerators. Two solutions are then presented: HAAC for accelerating GCs and the RPU for accelerating HE. To conclude, results are reported along with a discussion of the future work needed to overcome the remaining overheads of PI.
1. INTRODUCTION
As privacy concerns continue to rise, users are demanding more protections for their data, including confidentiality and control over how it is used. Current security techniques, which only protect communication and storage, do not provide sufficient guarantees during computation. In other words, existing cryptographic techniques secure data when stored or communicated but must decrypt it for computation, as Figure 1 shows. PPC addresses this issue by extending security guarantees to the entire execution life cycle. By using PPC, users can access online services with stronger guarantees for their own data. The challenge, however, is that PPC incurs extremely high performance overheads compared to plaintext execution, and the gap is significant: protected storage and secure communication can be made efficient and cheap, but PPC currently cannot. Our goal is to overcome these overheads and realize PPC as a practical computational paradigm through system-level optimizations and novel hardware accelerators.
Deep learning is a natural starting point for demonstrating and benchmarking PPC. It relies heavily on the client-cloud model and requires access to user data. Additionally, neural networks account for a significant amount of private user data processing, with some companies (e.g., Meta) processing trillions of inferences per day (FacebookTrillions, 26). Ensuring the privacy of essential neural network functions, such as convolution and ReLU, would therefore safeguard a disproportionate amount of users' data. PPC is still a developing field, and, as we will show, private inference remains far from practical. In this paper, we focus on private inference (PI) and leave training to future work.
At a high level, PI involves encrypting a user's input and (optionally, depending on the threat model) the server's neural network. The encrypted data and model are processed using PPC methods to execute an inference on the protected user data. At the end of the execution, the intended party (or parties) learns the inference result, and neither party learns anything else about the other's input. There exists a large body of research on PI using both homomorphic encryption and secure multi-party computation (MPC), including secret sharing and garbled circuits. Specifically, most protocols use HE and SS for linear functions (convolutions and fully-connected layers) and GCs for nonlinear functions, including ReLUs (Sphynx, 5, 6, 21, 20, 29). Non-hybrid approaches have also been proposed; however, they often sacrifice accuracy by approximating activation functions with polynomials and are not considered here (Cryptonets, 14, 38, 11).
Recent studies have focused on enhancing hybrid protocols through improved neural architecture design (Sphynx, 5, 20, 13, 6), protocol optimization (SNL, 6, 27, 21, 29), and hardware acceleration (Cheetah, 36, 38, 23, 22, 18). State-of-the-art PI protocols break the entire PI process into two phases, a pre-processing (or offline) phase and an online phase, to move expensive computations offline and improve online inference latency. Although there are various optimizations aiming to solve specific aspects of the hybrid PI protocols, it is not yet clear how they combine to enhance the end goal of fast, end-to-end PI.
To understand the problem, and where we as a community stand, this paper reviews characterizations of both the online and offline phases, accounting for the compute, storage, and communication overheads of the system. First, we consider inference request arrival rates in PI, whereas most prior work considers individual inferences in isolation. Results show this is a vital distinction for computation latency: the offline costs are so large that they do not always remain offline, even at low arrival rates. The offline phase involves HE and GC computations that typically take minutes to complete, and this latency is hard to hide. Storage also comes under significant offline pressure, which can amount to tens of GBs per inference on the client. This storage requirement constrains the number of PI pre-computations that can be buffered and can quickly exceed the capacities of limited client devices. Meanwhile, ReLU activations introduce significant computation and communication overheads for nonlinear function evaluation.
We then give the insights behind, and introduce, our solutions to these challenges in Section 4. At the system level, to overcome the limits of client storage we present the Client-Garbler protocol: reversing the client and cloud roles in GCs enables pre-computed GCs to be stored on the server instead of a client's smartphone. To address HE's computation overhead, Layer-parallel HE (LPHE) exploits embarrassing parallelism across neural network layers to significantly reduce the offline cost, taking advantage of the fact that each layer's offline HE computation is independent and can be run completely in parallel. Finally, Wireless Slot Allocation (WSA) optimizes the default provisioning of wireless bandwidth between upload and download for PI. The combination of the proposed optimizations improves the mean inference latency by 1.8× over the state of the art (hybrid_PI, 10). Besides the system-level optimizations, we also review two hardware accelerators, HAAC (HAAC, 30) and the RPU (RPU, 40), to accelerate GCs and HE, respectively. HAAC reduces the computation latency of garbled circuit generation/evaluation, while the RPU targets the offline HE computation overhead. Our results show that HAAC achieves an average speedup of 589× with DDR4 (2627× with HBM2), and the RPU provides a speedup of 1485× over a CPU.
The rest of the paper is organized as follows. We discuss the background of cryptographic primitives and private inference in Section 2 and characterize the problem in Section 3. In Section 4, we introduce performance solutions from three perspectives. Related work is reviewed in Section 5. Finally, we discuss our thoughts on the state and future of systems for PI in Section 6.
2. BACKGROUND
A common convolutional neural network (CNN) contains two types of operations: linear (convolutional or fully-connected layers) and nonlinear (e.g., ReLU). In this section, we introduce the basics of how to compute the linear and nonlinear functions privately, as well as how to compose them into a complete private inference.
2.1. Private Linear Computation (HE, SS)
Two common linear computations in neural network inference are fully-connected and convolution layers, both of which can be expressed as matrix-vector multiplication. Here, the matrix represents the trainable parameters of the linear layer and the vector represents the input data. Given that linear transformations can be efficiently expressed as arithmetic circuits with bounded depth, a natural cryptographic primitive for privately computing the linear layers on encrypted inputs is HE.
HE is an encryption scheme that allows computation directly on encrypted data without the need for decryption, thus preserving data confidentiality during the computation itself. Given two parties, a client (with some data) and a server (with a linear function), the client can first encrypt their data under an HE scheme and send the encrypted data to the server. The server can then homomorphically evaluate the linear function on the client's ciphertext without ever decrypting the client's data server-side. The result is a ciphertext that is sent back to the client, who decrypts it locally.
HE introduces a large computational overhead, with a 4-6 order of magnitude slowdown over plaintext execution (1080 seconds for a single ResNet-18 inference (hybrid_PI, 10, 38)). Rather than using HE directly for computing linear layers, it is common to combine HE with additive secret sharing (SS), another cryptographic building block that supports plaintext-level speeds for computing linear layers. In this HE+SS setting, linear layers are processed in two steps: a pre-processing (or offline) phase independent of the client's input and an online phase dependent on it. In the pre-processing phase, the server homomorphically evaluates the linear layer on a randomly sampled and encrypted input to generate the client's additive secret share. This pre-processing step enables the server to perform a simple secret-share evaluation of the linear layer on the client's actual data during the online phase.
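To make the offline/online split concrete, here is a minimal sketch of the arithmetic with plaintext NumPy standing in for HE; in the real protocol the quantity $\mathbf{W}\mathbf{r} - \mathbf{s}$ is computed homomorphically on an encryption of $\mathbf{r}$ and the arithmetic is over a finite field, both of which are elided here, and all variable names and shapes are illustrative.

```python
# Minimal sketch of the HE+SS split for one linear layer. Plain NumPy
# stands in for HE: in the real protocol, W @ r - s is computed
# homomorphically on an encryption of r, and all arithmetic is over a
# finite field (the modulus is omitted here for clarity).
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(0, 10, size=(4, 8))    # server's layer weights
x = rng.integers(0, 10, size=8)         # client's input (arrives online)

# ---- Offline phase (independent of x) ----
r = rng.integers(0, 10, size=8)         # client's random mask
s = rng.integers(0, 10, size=4)         # server's random share
client_share = W @ r - s                # produced under HE in practice

# ---- Online phase (cheap, plaintext-speed SS) ----
masked_input = x - r                    # client -> server
server_share = W @ masked_input + s     # server's local computation

# The two shares reconstruct the true linear-layer output.
assert np.array_equal(client_share + server_share, W @ x)
```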
2.2. Private Nonlinear Computation (GCs)
Nonlinear layers are a crucial component of deep learning models and include either an activation function or a pooling function. The activation function is typically chosen from several types of nonlinear functions, such as the widely used rectified linear unit (ReLU). In convolutional neural networks, max-pooling functions are also frequently employed.
Generally, HE and additive SS enable private arithmetic computations, i.e., additions and multiplications over integer values. Garbled circuits (Yao1986GC, 44) enable two parties to compute a Boolean function on their private inputs without revealing those inputs to each other. Unlike SS and HE, operating over Boolean gates allows the parties to jointly compute arbitrary functions, making GCs a natural solution for privately computing nonlinear layers. We note that binary constructions of HE exist (e.g., TFHE (2020TFHE, 4)); however, they incur extremely high performance overheads in prior work, where evaluating a single Boolean gate with HE can take 75-600ms (Stochastic_TFHE, 16, 28).
In GCs, a function is represented as a Boolean circuit. One party (the garbler) assigns random labels representing {0, 1} to each wire and, for each gate, generates an encrypted truth table that maps the gate's input labels to its output label. The evaluator uses Oblivious Transfer (ObliviousTransfer, 33) to obtain the labels corresponding to its inputs without revealing those values to the garbler, and then executes the circuit gate by gate. Finally, the evaluator shares the output labels with the garbler, who maps them back to plaintext values. Recent algorithmic optimizations are used to construct high-performance GCs (FreeXOR, 24, 45); more details can be found in (Half_Gate_rekey, 15, 31, 43).
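As a toy illustration of the garbling idea (not the optimized free-XOR/half-gates constructions cited above), the sketch below garbles a single AND gate with a hash-based one-time pad. The 16-byte labels, the all-zero tag used to spot the decryptable row, and the omission of OT and point-and-permute are all simplifications of our own.

```python
# Toy garbling of a single AND gate (illustrative only; real GCs use
# free-XOR, half-gates, point-and-permute, and OT to deliver input labels).
import hashlib, secrets

LABEL = 16  # bytes per wire label

def H(a, b):
    # Hash two input labels into a 32-byte pad masking (output label || tag).
    return hashlib.sha256(a + b).digest()

def garble_and():
    # One random label per wire per logical value (0 and 1).
    A = [secrets.token_bytes(LABEL) for _ in range(2)]
    B = [secrets.token_bytes(LABEL) for _ in range(2)]
    C = [secrets.token_bytes(LABEL) for _ in range(2)]
    table = []
    for a in (0, 1):
        for b in (0, 1):
            pad = H(A[a], B[b])
            row = bytes(p ^ m for p, m in zip(pad, C[a & b] + b"\x00" * LABEL))
            table.append(row)
    secrets.SystemRandom().shuffle(table)     # hide which row is which
    return A, B, C, table

def evaluate(table, label_a, label_b):
    # The evaluator holds exactly one label per input wire; the all-zero
    # tag identifies the single row it can decrypt.
    pad = H(label_a, label_b)
    for row in table:
        plain = bytes(p ^ m for p, m in zip(pad, row))
        out, tag = plain[:LABEL], plain[LABEL:]
        if tag == b"\x00" * LABEL:
            return out
    raise ValueError("no decryptable row")

A, B, C, table = garble_and()
out = evaluate(table, A[1], B[1])             # evaluator's inputs: a=1, b=1
assert out == C[1]                            # garbler maps this label to 1
```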
2.3. Private Inference
Starting point. We briefly describe DELPHI (delphi, 29), a baseline hybrid PI protocol that uses HE and SS for linear layers and GCs for ReLU layers. In this two-party-computation setting, a client runs inference on her data using the server's proprietary model, without learning the server's model parameters and without revealing her input to the server. The DELPHI protocol consists of an offline phase, which depends only on the network architecture and parameters, and an online phase, which is performed once the client's input is available. In this way, many expensive computations are moved offline to improve online inference latency.
Offline Phase. First, the client sends their public keys to the server. At the $i$-th layer, the client and server sample random vectors $\mathbf{r}_i$ and $\mathbf{s}_i$, respectively. The client then sends the encrypted random vector $\mathsf{Enc}(\mathbf{r}_i)$ to the server. The server uses its model parameters $\mathbf{W}_i$ to homomorphically compute $\mathsf{Enc}(\mathbf{W}_i \mathbf{r}_i - \mathbf{s}_i)$ and returns it to the client; thus the client holds the share $\mathbf{W}_i \mathbf{r}_i - \mathbf{s}_i$. The server also constructs GCs for each nonlinear ReLU operation and sends them to the client. The client obtains the labels corresponding to its $\mathbf{W}_i \mathbf{r}_i - \mathbf{s}_i$ and $\mathbf{r}_{i+1}$ using OT, where $\mathbf{r}_{i+1}$ is prepared for the next, $(i+1)$-th, layer.
Online Phase. Once the client's input $\mathbf{x}_1$ is available, the client sends $\mathbf{x}_1 - \mathbf{r}_1$ to the server. At the beginning of the $i$-th layer, the server uses $\mathbf{x}_i - \mathbf{r}_i$ to calculate its share of the layer output, $\mathbf{W}_i(\mathbf{x}_i - \mathbf{r}_i) + \mathbf{s}_i$. To evaluate the ReLU layer, the server (garbler) sends the labels corresponding to its $\mathbf{W}_i(\mathbf{x}_i - \mathbf{r}_i) + \mathbf{s}_i$ to the client. The client uses its labels for $\mathbf{W}_i \mathbf{r}_i - \mathbf{s}_i$ and $\mathbf{r}_{i+1}$, together with the server's labels, to evaluate the garbled ReLU. The GCs' output labels are sent to the server, as they represent $\mathbf{x}_{i+1} - \mathbf{r}_{i+1}$ with $\mathbf{x}_{i+1} = \mathrm{ReLU}(\mathbf{W}_i \mathbf{x}_i)$. At this point, the client holds the share $\mathbf{r}_{i+1}$ and the server holds $\mathbf{x}_{i+1} - \mathbf{r}_{i+1}$, and the parties evaluate subsequent layers similarly.
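The per-layer bookkeeping is easier to follow in code. The following plaintext mock threads shares through one linear-plus-ReLU layer, with ordinary Python functions standing in for HE, the garbled ReLU, and OT, and without the finite-field arithmetic the real protocol uses; it models DELPHI's data flow, not its cryptography.

```python
# Plaintext mock of DELPHI's per-layer data flow (no real HE/GC/OT, and
# ordinary integers instead of finite-field elements).
import numpy as np

def relu(v):
    return np.maximum(v, 0)

rng = np.random.default_rng(1)
W = rng.integers(-5, 5, size=(4, 8))     # server's weights for layer i
x = rng.integers(-5, 5, size=8)          # layer input x_i (shared, never seen in the clear)

# Offline: client samples r_i and r_{i+1}; server samples s_i.
r_i = rng.integers(-5, 5, size=8)
r_next = rng.integers(-5, 5, size=4)
s_i = rng.integers(-5, 5, size=4)
client_share = W @ r_i - s_i             # W_i r_i - s_i, produced via HE offline

# Online: server receives x_i - r_i and computes its linear share.
server_share = W @ (x - r_i) + s_i       # W_i (x_i - r_i) + s_i

# Stand-in for the garbled ReLU: reconstruct the pre-activation from the two
# shares, apply ReLU, and re-mask with r_{i+1}; only the masked result
# (the output labels) goes to the server.
def garbled_relu_stub(srv_share, cli_share, mask_next):
    return relu(srv_share + cli_share) - mask_next

to_server = garbled_relu_stub(server_share, client_share, r_next)

# Invariant for the next layer: the server holds x_{i+1} - r_{i+1}, the
# client holds r_{i+1}, and together they share x_{i+1} = ReLU(W_i x_i).
assert np.array_equal(to_server + r_next, relu(W @ x))
```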
3. THE PI PROBLEM
We identify three system-level bottlenecks that prevent private inference protocols from scaling and being deployed: high client-side storage costs, client- and server-side compute latency, and communication costs.
Storage: We observe that ReLU GCs are the primary storage bottleneck in PI. The common setup is to generate garbled ReLUs server-side and store them on the client device during the offline phase. We benchmark the fancy-garbling library (swanky_fancyGC, 3) to understand the costs: for each scalar ReLU operation, the garbler (the server) must store 3.5KB, while the evaluator (the client) pays a penalty of 18.2KB. To put this in perspective, performing a single inference with a high-performance network (ResNet-18) on TinyImageNet requires the client to store 41GB of garbled ReLUs. For the same network on a larger dataset (ImageNet), the client must store 498GB of ReLU GCs.
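These figures follow from simple per-ReLU accounting. The back-of-envelope below derives the ReLU count implied by the numbers above; the count itself is not stated here and is only our estimate.

```python
# Back-of-envelope: how many garbled ReLUs does 41 GB of client storage imply?
EVALUATOR_BYTES_PER_RELU = 18.2e3        # 18.2 KB per scalar ReLU (client side)
CLIENT_STORAGE_BYTES = 41e9              # 41 GB for ResNet-18 on TinyImageNet

relus = CLIENT_STORAGE_BYTES / EVALUATOR_BYTES_PER_RELU
print(f"~{relus / 1e6:.2f} million ReLUs per inference")   # roughly 2.25 million
```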
Compute: In the baseline protocol described above, the server performs ReLU GC garbling and both the HE and SS evaluations of linear layers, while the client is responsible for ReLU GC evaluation. Secret-share generation via HE and GC garbling are performed in the offline phase, leaving only GC evaluation and the server's SS evaluation as computation that must happen online. Our analysis reveals that the server-side HE evaluations dominate the compute time of PI protocols. Furthermore, both GC garbling and evaluation have non-negligible compute latency. For example, for ResNet-18 on TinyImageNet, the HE portion of these PI protocols takes 1080 seconds while the GC portion takes 225 seconds in total. Crucially, we find that GC garbling and evaluation latency depends significantly on the device performing the computation: for the above protocol, garbling takes only 25 seconds on the server while evaluation takes 200 seconds (using an Intel Atom board as the client and an AMD 32-core machine as the server).
Communication: Hybrid PI protocols require several rounds of communication in both the offline and online phases, with the number of rounds growing with network depth. The transmission of ReLU GCs from the garbler to the evaluator during the offline phase dominates the communication latency in hybrid PI. Transmitting the 41GB of garbled ReLUs needed for ResNet-18 on TinyImageNet over 5G (1 Gbps) takes roughly 747 seconds. Figure 3 (Baseline) shows the total end-to-end latency of performing a single private inference on ResNet-18 with a TinyImageNet input.
4. OUR SOLUTIONS
This section presents our solutions to the bottlenecks characterized above. In particular, we mitigate the three main system-level bottlenecks that prevent PI from scaling: storage, computation, and communication. Even after our system-level optimizations, compute latency remains high enough to necessitate custom architecture. We discuss two accelerators: one for garbled circuits (HAAC) and one for homomorphic encryption (RPU).
4.1. System-Level Design
Addressing Storage: Storing the garbled ReLUs client-side during the offline phase means the client can, at best, buffer only a few pre-computed inferences, especially for deeper and larger networks. To overcome this limitation, the Client-Garbler protocol switches the roles of garbler and evaluator so that the server (now the evaluator) stores the ReLU GCs, while providing the exact same security guarantees. This reduces the client's storage cost by roughly 5×: rather than storing 41GB for a TinyImageNet inference on ResNet-18, the client now stores only 8GB per inference. Beyond storage, Client-Garbler also reduces online-phase latency, since the powerful server now evaluates the garbled ReLUs, while offline-phase latency increases because the client must now perform garbling. This latency tradeoff is reflected in Figure 3 (+SysOpt).
Addressing Compute: Server-side HE evaluations of the linear layers during the offline phase account for the majority of the compute cost. As discussed in Section 2.3, the client and server iterate through each linear layer to generate the secret shares used for fast online linear evaluations. These secret shares are generated independently for each linear layer, meaning the per-layer HE evaluations can be run in an embarrassingly parallel manner on the server, which we call Layer-Parallel Homomorphic Evaluation (LPHE). Rather than performing each independent HE evaluation sequentially on a single core, each evaluation is allocated to a separate core server-side. For ResNet-18 on TinyImageNet, LPHE reduces the HE evaluation run-time from 1080s to just 141s, and it delivers similar speedups across all datasets and networks.
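Because each layer's offline HE job is independent, LPHE boils down to fanning the per-layer evaluations out across cores. The sketch below shows the scheduling pattern with Python's concurrent.futures; he_evaluate_layer is a placeholder we introduce for illustration, not the actual HE code.

```python
# Layer-parallel offline HE evaluation: one independent job per linear layer.
# he_evaluate_layer is a hypothetical placeholder for the real homomorphic work.
from concurrent.futures import ProcessPoolExecutor

def he_evaluate_layer(layer_id):
    # Placeholder: homomorphically evaluate layer `layer_id` on an encrypted
    # random vector and return the resulting secret share.
    return f"share for layer {layer_id}"

def offline_phase(num_layers, workers=None):
    # A sequential baseline would loop over layers one at a time; LPHE
    # submits every layer at once and lets the server's cores run in parallel.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(he_evaluate_layer, range(num_layers)))

if __name__ == "__main__":
    shares = offline_phase(num_layers=20)
```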
Addressing Communication: The transmission of garbled ReLUs from the garbler to the evaluator in the offline phase dominates the communication latency of the entire PI protocol. Furthermore, the large per-ReLU evaluator cost (18.2KB) results in an asymmetry in the amount of data sent between the two parties. For example, in the Client-Garbler protocol, 83.5% of the total data transferred is uploaded to the server. Current 5G wireless standards allow flexibility in the amount of bandwidth allocated to uplink versus downlink. By optimally allocating this bandwidth split (Wireless Slot Allocation), the communication latency of PI protocols can be reduced by 35%.
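A simple way to see why re-provisioning helps: if upload and download occur in separate phases, total communication time is $U/b_u + D/b_d$ with $b_u + b_d = B$ fixed, which is minimized by splitting bandwidth in proportion to the square roots of the traffic volumes. The sketch below works through this idealized model; it is our simplification for intuition, not necessarily the exact formulation behind WSA, and the absolute traffic volumes are invented.

```python
# Idealized bandwidth-split model: minimize U/b_u + D/b_d with b_u + b_d = B.
# The optimum allocates bandwidth in proportion to the square root of traffic.
from math import sqrt

def comm_time(up_gb, down_gb, up_gbps, down_gbps):
    return 8 * up_gb / up_gbps + 8 * down_gb / down_gbps    # seconds

def optimal_split(up_gb, down_gb, total_gbps):
    up_bw = total_gbps * sqrt(up_gb) / (sqrt(up_gb) + sqrt(down_gb))
    return up_bw, total_gbps - up_bw

# Example with the 83.5% / 16.5% upload-heavy mix from the text
# (the absolute volumes here are invented for illustration).
up, down, total_bw = 83.5, 16.5, 1.0                        # GB, GB, Gbps
even = comm_time(up, down, total_bw / 2, total_bw / 2)
up_bw, down_bw = optimal_split(up, down, total_bw)
tuned = comm_time(up, down, up_bw, down_bw)
print(f"even split: {even:.0f}s   tuned split: {tuned:.0f}s")
```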
Putting these system-level optimizations together not only reduces the client's storage cost but also reduces both offline- and online-phase latency. Figure 3 (+SysOpt) shows the latency breakdown with our optimized protocol. Compared to Figure 3 (Baseline), these optimizations reduce end-to-end latency from 2050 seconds to 1052 seconds. Even after these optimizations, high compute costs from both the HE and GC portions of the protocol dominate the runtime and necessitate custom architecture to further reduce latency.
4.2. Accelerating GCs with HAAC
With the system-level optimizations above, the client-side GC storage pressure is largely relieved, but the pressure on GC computation grows. A GC accelerator should be not only fast but also lightweight and energy efficient, able to process large amounts of data while maintaining high throughput. Here, we present HAAC (HAAC, 30), a hardware-software co-design comprising a compiler and a hardware accelerator that combine to improve GC performance and efficiency, making PI more practical.
Employing hardware-software co-design for GCs is appropriate, as GC programs, including all dependences, memory accesses, and control flow, are known and fixed at compile time. Potential benefits in hardware can therefore be realized through optimization on the software side. HAAC's design philosophy is to keep the hardware simple and efficient, maximizing the area devoted to custom execution units and other circuits essential for high performance. Our approach leverages the properties of GCs to express arbitrary programs as streams, which simplifies the hardware and enables complete memory-compute decoupling. The evaluation results demonstrate the effectiveness of the proposed approach in accelerating GCs and mitigating performance overheads.
HAAC's hardware comprises Gate Engine (GE) execution units; see Figure 2. Each GE is a deep, in-order pipeline that ideally computes one gate per cycle, offering high performance potential. However, exploiting GE parallelism while keeping the hardware efficient is a challenge, and this presents a prime opportunity for hardware-software co-design. HAAC's compiler leverages instruction-level parallelism (ILP) to improve intra- and inter-GE parallel processing, with instructions and constants streamed to each GE using queues. The compiler analyzes the leveled data-dependence graph of the entire baseline program to expose all available ILP. This strategy, called full-reordering, works well to resolve data hazards and significantly increases parallel GE performance, but it can reduce on-chip data reuse and increase off-chip traffic. To better balance data reuse and ILP, HAAC also proposes another scheme called segment-reordering: rather than computing the ILP graph for the entire program, we partition the program into contiguous parts (segments) and reorder instructions within each segment. This provides more instruction parallelism than the baseline program order and generally captures more data reuse than full-reordering.
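A minimal version of the two schedules can be written as a reordering pass over a gate list: full-reordering levels the entire dependence graph, while segment-reordering applies the same leveling within fixed-size windows. The sketch below assumes a simple (output wire, input wires) gate representation of our own and only illustrates the reuse-versus-ILP trade-off the compiler navigates, not HAAC's actual compiler.

```python
# Sketch of HAAC-style gate reordering. Each gate is (out_wire, in_wires),
# listed in program order, and every wire is produced exactly once.

def level_order(gates):
    """Full-reordering: sort gates by dependence level (ASAP schedule).
    Gates in the same level are independent and can run on parallel GEs."""
    level = {}
    annotated = []
    for out, ins in gates:
        lvl = 1 + max((level.get(w, 0) for w in ins), default=0)
        level[out] = lvl
        annotated.append((lvl, out, ins))
    annotated.sort(key=lambda g: g[0])        # stable sort keeps legality
    return [(out, ins) for _, out, ins in annotated]

def segment_order(gates, segment=4):
    """Segment-reordering: apply the same leveling within contiguous windows,
    trading some ILP for better on-chip data reuse."""
    reordered = []
    for i in range(0, len(gates), segment):
        reordered.extend(level_order(gates[i:i + segment]))
    return reordered

# Tiny example: wires 0-3 are primary inputs.
circuit = [(4, (0, 1)), (5, (4, 2)), (6, (0, 3)), (7, (5, 6))]
print(level_order(circuit))                   # gates 4 and 6 share a level
print(segment_order(circuit, segment=2))
```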
With the help of HAAC's distinct on-chip memory subsystem, the GEs' parallelism can actually be realized rather than being stalled by input/output data latency. The gate inputs and outputs, called wire data, are harder to manage as their accesses do not follow a regular pattern. HAAC uses a scratchpad with multiple memory banks to capture wire reuse by tracking program execution, eliminating the need for costly hardware-managed caches and tagging logic. This on-chip scratchpad, called the sliding wire window (SWW), stores a contiguous address range of wires, and the range slides forward as the program executes. We also leverage GC input/output patterns and stride wires across SWW banks to reduce conflicts. Renaming is a complementary compiler pass that sequentializes gate output wire addresses according to program order.
The SWW and renaming combine to filter off-chip accesses, as recently generated wires are usually reused soon, while they are still on-chip, providing the performance benefits of a cache with the efficiency of a scratchpad. HAAC's co-design also enables complete decoupling of gate execution and off-chip accesses. Spent wires, marked by the compiler, never go off-chip thanks to Eliminating Spent Wires (ESW), which elides redundant writes to off-chip memory. Wires that fall outside the SWW's address range, called out-of-range wires (OoRW), are stored in off-chip memory; the compiler knows which wires will be out of range and can push them on-chip to an OoRW queue ahead of time, and the GEs check the OoRW queue when their inputs are not on-chip. These optimizations allow computation and off-chip accesses to overlap, hiding the latency of data movement.
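The renaming idea itself is easy to sketch: walk the (reordered) program, give each gate's output wire the next sequential address, and rewrite later uses, so that a contiguous sliding window naturally captures most reuse. The code below is a hedged illustration over the same toy gate representation as above, with a last-use pass standing in for the compiler's spent-wire marking; it is not HAAC's compiler.

```python
# Sketch of wire renaming plus last-use ("spent wire") marking over the same
# (out_wire, in_wires) gate representation as above.

def rename_wires(gates, num_inputs):
    """Give gate outputs sequential addresses in program order so recently
    produced wires occupy a contiguous, recent address range (what the SWW holds)."""
    remap = {w: w for w in range(num_inputs)}   # primary inputs keep addresses
    next_addr = num_inputs
    renamed = []
    for out, ins in gates:
        new_ins = tuple(remap[w] for w in ins)
        remap[out] = next_addr
        renamed.append((next_addr, new_ins))
        next_addr += 1
    return renamed

def mark_last_use(gates):
    """Record each wire's final use; past that point its value never needs to
    be written off-chip, which is the idea behind ESW."""
    last_use = {}
    for idx, (_, ins) in enumerate(gates):
        for w in ins:
            last_use[w] = idx
    return last_use

circuit = [(9, (0, 1)), (4, (9, 2)), (7, (0, 3)), (8, (4, 7))]
renamed = rename_wires(circuit, num_inputs=4)
print(renamed)                                  # outputs become wires 4, 5, 6, 7
print(mark_last_use(renamed))
```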
We evaluate HAAC thoroughly with benchmarks from VIP-Bench (VIP-Bench, 2). With 16 GEs, a 2MB SWW on-chip memory, and DDR4, HAAC provides an average speedup of 589× over an Intel Core i7-10700K CPU (2627× with HBM2) in 4.3mm². Figure 3 (+HAAC) shows that accelerating GCs with HAAC further reduces the total latency of a private inference on ResNet-18 by 39%. GCs now account for a very small fraction of the compute latency, leaving HE evaluation as the dominant component.
4.3. Accelerating HE with the RPU
We now turn to the final optimization: accelerating the expensive HE operations required for linear layers in the offline phase. As with ReLUs, simple plaintext operations see large overheads in the encrypted domain. Specifically, linear transformations map to a series of complex ciphertext rotations and key-switching operations, which dominate the runtime of HE and require frequent applications of the (inverse) number theoretic transform, or (i)NTT. Fortunately, this implies high degrees of data-level parallelism, which can be efficiently exploited with the right architectural design.
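To ground what the (i)NTT computes, the sketch below implements a textbook cyclic NTT and its inverse over toy parameters ($n = 8$, $q = 17$, $\omega = 2$). Production HE libraries use negacyclic variants, much larger parameters, and $O(n \log n)$ butterfly implementations, so this only shows the modular structure that dominates HE runtimes.

```python
# Textbook (quadratic-time) cyclic NTT over Z_q with toy parameters:
# n = 8, q = 17, and omega = 2 is a primitive 8th root of unity mod 17.
N, Q, OMEGA = 8, 17, 2

def ntt(a):
    return [sum(a[j] * pow(OMEGA, j * k, Q) for j in range(N)) % Q
            for k in range(N)]

def intt(A):
    w_inv = pow(OMEGA, -1, Q)          # modular inverse of omega
    n_inv = pow(N, -1, Q)              # modular inverse of n
    return [n_inv * sum(A[k] * pow(w_inv, j * k, Q) for k in range(N)) % Q
            for j in range(N)]

a = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(a)) == a               # round-trip recovers the input

# The point of the transform: cyclic polynomial multiplication becomes cheap
# pointwise multiplication in the NTT domain, which HE workloads exploit heavily.
b = [2, 7, 1, 8, 2, 8, 1, 8]
cyclic_product = intt([(x * y) % Q for x, y in zip(ntt(a), ntt(b))])
```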
With this in mind, we now discuss our Ring Processing Unit (RPU) (RPU, 40). The RPU implements B512, a vector ISA that has been specifically designed to cater to common HE operations while also remaining well-suited to general-purpose programming. The decision to develop a vector architecture and ISA, rather than fixed ASIC hardware, gives us the reprogrammability needed to remain viable as HE algorithms continue to develop and mature.
The vector architecture state includes 64 128-bit vector registers, 64 128-bit scalar registers, a 4 MiB vector data memory (VDM) expandable up to 32 MiB, and a 2 MiB scalar data memory (SDM). Additionally, there is an Address Register File (ARF) for indirect memory access, a Modulus Register File (MRF), and several control registers for increased flexibility. We operate on fixed vector lengths of 512 elements to balance microarchitectural concerns with the power-of-two input sizes common to both (i)NTT and HE kernels. Instructions are grouped into four categories: compute, memory, shuffle, and control. Compute instructions perform point-wise modular computations between two vectors or a vector and a scalar value. Memory operations interact with the VDM and SDM to bring data to and from the vector and scalar register files respectively. We support four addressing modes in memory instructions to efficiently handle the complex access patterns in the (i)NTT and other HE workloads. Shuffle instructions are heavily used by (i)NTT and are efficiently supported in the microarchitecture. Control instructions act on the Control Register File (CRF) and drastically reduce our required code size. Overall, B512 introduces 28 instructions to meet the needs of HE-specific applications without sacrificing the generality that traditional vector operations provide.
The RPU is designed for general ring processing with high performance by taking advantage of regularity and data parallelism. We achieve this balance by designing explicitly managed hardware to elide the high costs and complexity of caches, dynamic scheduling logic, and prediction, and task the compiler with handling scheduling and data movement at compile time. Figure 2 presents the RPU microarchitecture, which consists of a front-end to handle instruction fetching, decoding, and control logic, and a backend, which provides the high-performance hardware needed to efficiently perform our HE-tailored vector operations. The front-end includes three decoupled queues that operate independently on compute, memory, and shuffle instructions. Once an instruction is in its respective queue, it can run in parallel with other instruction types, and the microarchitecture guarantees that data hazards are avoided. The parallel execution via these decoupled pipelines is key to achieving high performance with general-purpose processing as it masks much of the data movement time.
Compute instructions are passed to computational units that we call high-performance large arithmetic word (LAW) engines (HPLEs). Each HPLE operates on a slice of the 512-element vector, whose size is determined by the number of parallel HPLEs (i.e., lanes) in our vector architecture. Each HPLE consists of a 128-bit modular multiplier, adder, subtractor, and two comparator units. HPLEs can be modified to support bit-serial computation, trading off performance for area.
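Modular multipliers like those in the HPLEs can be built several ways, and the RPU work explores a range of multiplier designs; the sketch below shows Barrett reduction as one representative approach to computing a·b mod q without a hardware divide, and is not claimed to be the RPU's actual circuit.

```python
# Barrett reduction: one common way to compute (a * b) mod q without a divider.
def barrett_setup(q):
    k = q.bit_length()
    mu = (1 << (2 * k)) // q           # precomputed constant per modulus
    return k, mu

def barrett_mulmod(a, b, q, k, mu):
    x = a * b                          # x < q*q <= 2^(2k)
    t = (x * mu) >> (2 * k)            # approximate quotient (never too large)
    r = x - t * q
    while r >= q:                      # at most a couple of corrections
        r -= q
    return r

q = (1 << 61) - 1                      # a large prime modulus (2^61 - 1)
k, mu = barrett_setup(q)
a, b = 123456789123456789 % q, 987654321987654321 % q
assert barrett_mulmod(a, b, q, k, mu) == (a * b) % q
```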
We use SPIRAL (Spiral, 8) to map instructions onto the RPU and automatically generate high-performance B512 programs. SPIRAL has a rich library of transformations and optimizations that can expertly generate high-performance code across various platforms and kernels, especially for linear transforms such as the (i)NTT. We evaluate and characterize the RPU using a detailed cycle-level simulator that is parameterized to consider a range of IP, namely modular multiplier designs, the number of HPLEs, and VDM partitioning strategies. This enables rapid design space exploration to quantify design decisions. The simulator is further verified against a complete RTL implementation of the RPU. We show that our most efficient parameter choices can accelerate varying-size (i)NTTs by up to 1485× over a traditional 32-core 2.5GHz AMD EPYC 7502 CPU.
Figure 3 (+RPU) shows that the HE evaluation latency is largely eliminated. Overall, the system-level optimizations together with the HAAC and RPU hardware accelerators reduce end-to-end private inference latency by 76% relative to the baseline. Computation overhead is now small compared to communication, which we expect to improve as higher-bandwidth network technologies mature.
5. RELATED WORKS
PI: Prior work has explored using HE only, which is convenient as the privacy primitives are not changed (Cryptonets, 14, 38). However, these protocols cannot leverage LPHE and introduce accuracy loss via the approximation of ReLU, even with complex training (sisyphus, 11). Additionally, some prior work includes a trusted third party, which assumes a weaker security model in exchange for higher performance (Falcon, 42, 25, 41). The machine learning community has started exploring ways to design neural networks with fewer nonlinear functions, such as pruning ReLUs from networks (SNL, 6, 20) and approximating ReLU computations for cheaper GC implementations (Circa, 12). DELPHI (delphi, 29) and AESPA (AESPA, 32) replace ReLUs with polynomial activation functions that are processed using Beaver triples (BeaverTriples, 1), which are cheaper in both compute and communication than GCs. However, replacing ReLUs with low-degree polynomials reduces test accuracy, especially for deeper networks (sisyphus, 11). Therefore, we consider only highly accurate, state-of-the-art ReLU-based deep learning models.
GCs: Prior work has accelerated GCs with GPUs (GPU_GC, 19, 9) and FPGAs (Maxelerator, 18, 17, 7, 39). We note that prior work uses the less secure fixed-key GC setup (Half_Gate_rekey, 15) or uses SHA-1 instead of AES, which is simpler but less secure (GC_FPGA_Overlay, 7). Moreover, most prior work uses small benchmarks that do not stress off-chip bandwidth, which is one of HAAC's primary contributions. What distinguishes HAAC from prior work is that it considers parallel processing and pipelining together and also optimizes for off-chip communication. To the best of our knowledge, HAAC is the first ASIC GC accelerator, and it outperforms all prior accelerator and GPU work, as shown in the paper (HAAC, 30).
HE: Several accelerators have been developed to improve the performance of HE primitives. F1 (F1_2021, 37) designs specialized functional units to accelerate primitive computations, such as the NTT. However, F1 supports a maximum polynomial degree of only 16K, whereas the RPU has no such limitation. More recent HE accelerators such as CraterLake (Craterlake, 38), BTS (BTS, 23), and ARK (ARK, 22) target high-multiplicative-depth applications but require large on-chip memories, e.g., 256MB for CraterLake. CraterLake supports up to a 64K polynomial degree; supporting 128K ciphertexts requires hardware modifications, which result in an additional area of 27.4 and make it larger than the RPU. Unlike CraterLake, the RPU is flexible enough to support larger polynomials without hardware changes.
6. DISCUSSION AND CONCLUSION
This paper presents solutions for improving privacy and security via PPC. The characterization identifies the need to completely redesign computing stacks, from algorithms to storage, in order to enable practical private inference. Point solutions demonstrate how well custom hardware can overcome many of the computational overheads by leveraging classic architectural mechanisms: vectors (RPU) and VLIW (HAAC). We conclude with three predictions about what the future of PPC, and of systems for PPC, will look like.
First, after rigorously studying private inference for multiple years, we feel hybrid protocols will be most valuable. All PPC technologies (e.g., HE, GCs, SS, and OT) have strengths and critical weaknesses. By effectively combining them and tailoring their use to workloads and threat models, one can often leverage their strengths while overcoming their weaknesses. A prime example is ReLU and HE: HE is a necessity for computing linear layers in ML, whether used directly online to process inputs or offline to compute secrets for SS, but the popular integer/fixed-point schemes amenable to linear-layer processing cannot process nonlinear functions. Thus, HE can be combined with GCs to properly execute ReLU and preserve network accuracy. We hope this inspires the community to consider accelerators beyond HE.
Second, by understanding the degree of slowdown in all aspects of a system (compute, communication, and storage), it is clear that the problem extends beyond custom hardware. Hardware accelerators are certainly a necessity for overcoming the computational slowdowns in PPCs like GCs and HE, but they are not sufficient on their own. To truly realize real-time private inference, systems will need to leverage extremely dense storage technologies on the client device. Communication between client and cloud will have to take place over the highest-bandwidth wireless protocols, e.g., up to 100 Gbps (ted2019wireless, 34, 35). Finally, even this may not be enough, and we need to re-think how we express problems as programs. Looking again at PI, this can be seen in the need to re-think neural architectures to minimize ReLU counts, also known as the ReLU budget (CryptoNAS, 13). Prior work has shown promising solutions that other application domains can learn from (Sphynx, 5, 6, 13, 20).
Finally, we feel PPCs are the perfect application for co-design techniques. In PPCs, all program behavior and information is known at compile time, i.e., programs are data oblivious. This gives the compiler an ideal view of execution and the opportunity to schedule computation and data movement just as well as any dynamic hardware could. Programmability matters here: we feel PPC accelerators need to be programmable and software driven, as the protocols will continue to evolve and new applications will continually be ported. An ISA can address these issues. At the same time, the performance and efficiency problems of ISAs for general-purpose computing can largely be overcome. By pushing all scheduling to the compiler (VLIW), leveraging highly decoupled pipelines (DAE), and using classic parallel computing techniques (vectors), we believe the efficiency and performance of ASICs can be achieved without over-fitting hardware to a particular implementation.
Acknowledgements.
We acknowledge support from the ADA Research Center, SRC, and DARPA. Additionally, this research was supported under the DPRIVE program, contract HR0011-21-9-0003. The views expressed are those of the authors and do not necessarily reflect those of the sponsors.

References
- (1) Donald Beaver “Precomputing Oblivious Transfer” In Advances in Cryptology — CRYPT0’ 95 Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, pp. 97–109
- (2) Lauren Biernacki et al. “VIP-Bench: A Benchmark Suite for Evaluating Privacy-Enhanced Computation Frameworks” In 2021 International Symposium on Secure and Private Execution Environment Design (SEED), 2021, pp. 139–149 IEEE
- (3) Brent Carmer, Alex J Malozemoff and Marc Rosen “swanky: A suite of rust libraries for secure multi-party computation”, 2019
- (4) Ilaria Chillotti, Nicolas Gama, Mariya Georgieva and Malika Izabachène “TFHE: fast fully homomorphic encryption over the torus” In Journal of Cryptology 33.1 Springer, 2020, pp. 34–91
- (5) Minsu Cho et al. “Sphynx: A Deep Neural Network Design for Private Inference” In IEEE Security & Privacy 20.5 IEEE, 2022, pp. 22–34
- (6) Minsu Cho et al. “Selective network linearization for efficient private inference” In International Conference on Machine Learning, 2022, pp. 3947–3961 PMLR
- (7) Xin Fang, Stratis Ioannidis and Miriam Leeser “Secure function evaluation using an fpga overlay architecture” In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017, pp. 257–266
- (8) Franz Franchetti et al. “SPIRAL: Extreme Performance Portability” In Proceedings of the IEEE 106.11, 2018, pp. 1935–1968 DOI: 10.1109/JPROC.2018.2873289
- (9) Tore Kasper Frederiksen, Thomas P Jakobsen and Jesper Buus Nielsen “Faster maliciously secure two-party computation using the GPU” In Security and Cryptography for Networks: 9th International Conference, SCN 2014, Amalfi, Italy, September 3-5, 2014. Proceedings 9, 2014, pp. 358–379 Springer
- (10) Karthik Garimella et al. “Characterizing and Optimizing End-to-End Systems for Private Inference”, ASPLOS ’23, 2023 DOI: 10.1145/3582016.3582065
- (11) Karthik Garimella, Nandan Kumar Jha and Brandon Reagen “Sisyphus: A cautionary tale of using low-degree polynomial activations in privacy-preserving deep learning” In arXiv preprint arXiv:2107.12342, 2021
- (12) Zahra Ghodsi, Nandan Kumar Jha, Brandon Reagen and Siddharth Garg “Circa: Stochastic relus for private deep learning” In Advances in Neural Information Processing Systems 34, 2021, pp. 2241–2252
- (13) Zahra Ghodsi, Akshaj Kumar Veldanda, Brandon Reagen and Siddharth Garg “Cryptonas: Private inference on a relu budget” In Advances in Neural Information Processing Systems 33, 2020, pp. 16961–16971
- (14) Ran Gilad-Bachrach et al. “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy” In International conference on machine learning, 2016, pp. 201–210 PMLR
- (15) Chun Guo et al. “Better concrete security for half-gates garbling (in the multi-instance setting)” In Advances in Cryptology–CRYPTO 2020: 40th Annual International Cryptology Conference, CRYPTO 2020, Santa Barbara, CA, USA, August 17–21, 2020, Proceedings, Part II, 2020, pp. 793–822 Springer
- (16) Hsuan Hsiao, Vincent Lee, Brandon Reagen and Armin Alaghi “Homomorphically Encrypted Computation using Stochastic Encodings” In arXiv preprint arXiv:2203.02547, 2022
- (17) Siam U Hussain and Farinaz Koushanfar “FASE: FPGA acceleration of secure function evaluation” In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2019, pp. 280–288 IEEE
- (18) Siam U Hussain, Bita Darvish Rouhani, Mohammad Ghasemzadeh and Farinaz Koushanfar “Maxelerator: FPGA accelerator for privacy preserving multiply-accumulate (MAC) on cloud servers” In Proceedings of the 55th Annual Design Automation Conference, 2018, pp. 1–6
- (19) Nathaniel Husted, Steven Myers, Abhi Shelat and Paul Grubbs “GPU and CPU parallelization of honest-but-curious secure two-party computation” In Proceedings of the 29th Annual Computer Security Applications Conference, 2013, pp. 169–178
- (20) Nandan Kumar Jha, Zahra Ghodsi, Siddharth Garg and Brandon Reagen “DeepReDuce: Relu reduction for fast private inference” In International Conference on Machine Learning, 2021, pp. 4839–4849 PMLR
- (21) Chiraag Juvekar, Vinod Vaikuntanathan and Anantha Chandrakasan “GAZELLE: A low latency framework for secure neural network inference” In 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1651–1669
- (22) Jongmin Kim et al. “Ark: Fully homomorphic encryption accelerator with runtime data generation and inter-operation key reuse” In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, pp. 1237–1254 IEEE
- (23) Sangpyo Kim et al. “BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption”, ISCA ’22 New York, New York: Association for Computing Machinery, 2022, pp. 711–725 DOI: 10.1145/3470496.3527415
- (24) Vladimir Kolesnikov and Thomas Schneider “Improved garbled circuit: Free XOR gates and applications” In Automata, Languages and Programming: 35th International Colloquium, ICALP 2008, Reykjavik, Iceland, July 7-11, 2008, Proceedings, Part II 35, 2008, pp. 486–498 Springer
- (25) Nishant Kumar et al. “Cryptflow: Secure tensorflow inference” In 2020 IEEE Symposium on Security and Privacy (SP), 2020, pp. 336–353 IEEE
- (26) Kevin Lee, Vijay Rao and William Christie Arnold “Accelerating facebook’s infrastructure with application-specific hardware” In Facebook. Retrieved August 20, 2019, pp. 2020
- (27) Qian Lou, Yilin Shen, Hongxia Jin and Lei Jiang “SAFENet: A secure, accurate and fast neural network inference” In International Conference on Learning Representations, 2021
- (28) Daniele Micciancio and Yuriy Polyakov “Bootstrapping in FHEW-like cryptosystems” In Proceedings of the 9th on Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2021, pp. 17–28
- (29) Pratyush Mishra et al. “Delphi: a cryptographic inference system for neural networks” In Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice, 2020, pp. 27–30
- (30) Jianqiao Mo, Jayanth Gopinath and Brandon Reagen “HAAC: A Hardware-Software Co-Design to Accelerate Garbled Circuits” In Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023
- (31) Ignacio Navarro “On Garbled Circuits”, 2018
- (32) Jaiyoung Park, Michael Jaemin Kim, Wonkyung Jung and Jung Ho Ahn “AESPA: Accuracy preserving low-degree polynomial activation for fast private inference” In arXiv preprint arXiv:2201.06699, 2022
- (33) Michael O Rabin “How to exchange secrets with oblivious transfer” In Cryptology ePrint Archive, 2005
- (34) Theodore Scott Rappaport et al. “Wireless communications and applications above 100 GHz: Opportunities and challenges for 6G and beyond” In IEEE access 7 IEEE, 2019, pp. 78729–78757
- (35) TS Rappaport “5G’s Killer App Will Be 6G: Massive MIMO Millimeter Waves and Small Cell Infrastructure Will Pay Off for Future Tech Generations” In IEEE Spectrum OP-ED, 2021
- (36) Brandon Reagen et al. “Cheetah: Optimizing and accelerating homomorphic encryption for private inference” In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 26–39 IEEE
- (37) Nikola Samardzic et al. “F1: A fast and programmable accelerator for fully homomorphic encryption” In MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 238–252
- (38) Nikola Samardzic et al. “Craterlake: a hardware accelerator for efficient unbounded computation on encrypted data” In Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 173–187
- (39) Ebrahim M Songhori et al. “Garbledcpu: a mips processor for secure computation in hardware” In Proceedings of the 53rd Annual Design Automation Conference, 2016, pp. 1–6
- (40) Deepraj Soni et al. “RPU: The Ring Processing Unit” In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2023
- (41) Sameer Wagh, Divya Gupta and Nishanth Chandran “SecureNN: 3-Party Secure Computation for Neural Network Training.” In Proc. Priv. Enhancing Technol. 2019.3, 2019, pp. 26–49
- (42) Sameer Wagh et al. “Falcon: Honest-Majority Maliciously Secure Framework for Private Deep Learning” In Proceedings on Privacy Enhancing Technologies 2021, 2021, pp. 188–208 DOI: 10.2478/popets-2021-0011
- (43) Sophia Yakoubov "A Gentle Introduction to Yao's Garbled Circuits" In preprint on webpage at https://web.mit.edu/sonka89/www/papers/2017ygc.pdf, 2017
- (44) Andrew Chi-Chih Yao “How to generate and exchange secrets” In 27th annual symposium on foundations of computer science (Sfcs 1986), 1986, pp. 162–167 IEEE
- (45) Samee Zahur, Mike Rosulek and David Evans “Two halves make a whole: Reducing data transfer in garbled circuits using half gates” In Advances in Cryptology-EUROCRYPT 2015: 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April 26-30, 2015, Proceedings, Part II 34, 2015, pp. 220–250 Springer