Encrypted Large Model Inference: The Equivariant Encryption Paradigm

James Buban, Hongyang Zhang, Claudio Angione, Harry Yang, Ahmad Farhan, Seyfal Sultanov, Michael Du, Xuran Ma, Zihao Wang, Yue Zhao, Arria Owlia, Fielding Johnston, Patrick Colangelo Nesa Research
[email protected]

1 Introduction

Artificial Intelligence (AI), particularly machine learning (ML), has grown significantly in recent decades [1]. Since the introduction of large language models (LLMs) such as ChatGPT [2], Claude [3], Gemini [4], and LLaMA [5], as well as diffusion models [6] like DALLE-3 [7] and Sora [8], these foundation models have attracted significant interest [9]. They exhibit advanced capabilities such as in-context learning [10] and chain-of-thought reasoning [11], yet privacy challenges arise when these models are deployed across distributed or decentralized infrastructures [12].

In many scenarios—especially in healthcare [13], finance, or other regulated domains [14]—data privacy is a central requirement. Users often need to ensure that sensitive data (e.g., medical images, personal identifiers, or transaction records) are not visible to untrusted nodes in a distributed inference pipeline. Existing methods like secure multiparty computation (SMPC) [15], homomorphic encryption (HE) [16], and differential privacy (DP) [17] can help, but each involves trade-offs in communication overhead, computational latency, or accuracy.

To address these limitations, we propose Equivariant Encryption (EE), a technique that enables large-scale model inference on encrypted data while maintaining near-zero performance overhead. By transforming internal representations so that the model can operate on ciphertext as if it were plaintext, EE eliminates the high computational costs typically associated with fully homomorphic approaches. In this work, we:

•

Review background approaches like SMPC, HE, and DP, emphasizing their strengths and shortcomings for large-model inference (§2.1–§2.3).
•

Introduce Equivariant Encryption (§2.4) as a new framework for preserving data confidentiality throughout neural network pipelines.
•

Demonstrate a decentralized infrastructure example where EE can protect queries and outputs from untrusted nodes (§2.5).
•

Analyze potential attack vectors and strategies adversaries might employ to invert or compromise EE, and discuss how to counter them (§3).

Overall, we show that Equivariant Encryption preserves both the functionality and throughput of large models in distributed or untrusted environments, bridging a gap between security guarantees and practical latency requirements.

2 Equivariant Encryption: A Middle Ground for Secure Model Inference

Before detailing our new method, Equivariant Encryption (EE), we will briefly recap three key tools in privacy-preserving data processing—differential privacy (DP) (§2.1), secure multi-party computation (SMPC) (§2.2), and homomorphic encryption (HE) (§2.3). DP manages privacy at the dataset level by adding noise, thereby limiting how much an attacker can deduce about any single record, yet does not encrypt intermediate states during inference. SMPC splits data and computation across multiple participants, reducing exposure but often demanding complex protocols. HE allows computations on encrypted data at all times, though it can impose substantial overhead and may struggle with non-linear network layers. Our Equivariant Encryption (§2.4) seeks a balanced approach: rather than fully encrypting every component or depending solely on noise or multi-party flows, EE selectively obfuscates crucial internal representations within LLMs and more, retaining strong confidentiality while minimizing performance cost.

2.1 Background: Differential Privacy (DP)

Differential Privacy (DP) is a statistical framework designed to protect individual data records in a dataset, while still allowing meaningful aggregate computations or analyses. Formally, let $D$ and $D^{\prime}$ be two neighboring datasets differing by a single record. A randomized algorithm $\mathcal{M}$ is said to satisfy $(\varepsilon,\delta)$ -DP [18] if, for any measurable set $S$ ,

\Pr[\mathcal{M}(D)\in S]\;\leq\;e^{\varepsilon}\,\Pr[\mathcal{M}(D^{\prime})\in S]+\delta.

Intuitively, altering one individual’s record does not significantly change the distribution of the algorithm’s outputs, thus limiting privacy risks for each participant.

Classical Mechanisms.

Several mechanisms can ensure DP under different assumptions:

•

Laplace Mechanism: Injects noise drawn from a Laplace distribution whose scale depends on the function’s sensitivity, thereby hiding individual contributions.
•

Gaussian Mechanism: Uses Gaussian noise to achieve $(\varepsilon,\delta)$ -DP in settings where high-dimensional outputs are required.
•

Exponential Mechanism: Chooses outputs with probabilities proportional to a utility function, balancing usefulness with DP constraints.

Noise level tuning (e.g., the variance of the distribution) controls the trade-off between privacy strength and accuracy.

Practical Considerations and Composition.

A notable feature of DP is its handling of sequential queries on the same dataset. Multiple runs of DP-protected algorithms incur a composed privacy cost, which can be bounded using additive or more refined composition theorems [17]. In machine learning, differentially private stochastic gradient descent (DP-SGD) [19] clips gradients and adds noise at each update, preserving DP at the expense of some accuracy loss—often more pronounced in large-scale models or complex tasks.

Security Model and Limitations.

DP restricts what can be inferred about any single record by observing the final outputs or aggregated statistics of an algorithm. However, DP does not encrypt intermediate model activations at inference time, leaving room for leaks if raw data are exposed to an untrusted service during predictions.

Connection to EE.

DP and EE (§2.4) solve different but compatible facets of privacy. While DP reduces the risk of exposing individual training samples through aggregate statistics or model parameters, EE ensures that the inference pipeline itself never processes raw plaintext data. I n practice, one might train a model with DP for statistical protection of the training set, then deploy EE to keep inference inputs confidential against adversarial observers. This combination can safeguard both training and inference in a layered privacy architecture.

2.2 Background: Secure Multi-Party Computation (SMPC)

Secure Multi-Party Computation (SMPC) is a cryptographic approach that enables multiple parties, each holding private inputs, to compute a joint function without revealing these inputs to one another. Formally, suppose there are $n$ parties $\{P_{1},P_{2},\dots,P_{n}\}$ with private inputs $x_{1},x_{2},\dots,x_{n}$ , and they wish to compute a deterministic function

f(x_{1},x_{2},\dots,x_{n})=y,

where $y$ is the output revealed to some or all of the parties, but each $x_{i}$ remains hidden.

Classical Constructions.

SMPC can be realized through various protocols, each with different security assumptions and performance characteristics:

•

Yao’s Garbled Circuits: Originating with Yao [20], this approach encrypts a Boolean circuit such that each party learns nothing beyond its own inputs and the final output.
•

Secret-Sharing Protocols (BGW): Introduced by Ben-Or, Goldwasser, and Wigderson (BGW) [21], each input is split into multiple shares distributed among parties. Intermediate computations proceed on these shares, ensuring no single share reveals the original input.

A hallmark of such constructions is that all parties learn the correct final result $y$ , while intermediate values remain masked.

Secret Sharing and Arithmetic Operations.

A common variant of secret sharing is additive sharing, where a secret $x$ over a ring $\mathbb{Z}_{q}$ is divided into $n$ shares $(x_{1},x_{2},\dots,x_{n})$ such that

x=\sum_{i=1}^{n}x_{i}\quad(\bmod\,q).

Each party receives one $x_{i}$ . Adding two secrets can be done locally on each party’s shares, whereas multiplication often requires additional steps. The BGW model and later protocols such as SPDZ [22] use multiplication triplets and integrity checks to allow correct evaluation of products, even in the presence of malicious adversaries.

Security Models.

SMPC protocols typically consider:

•

Semi-Honest Adversaries: Parties follow the protocol correctly but try to infer extra information from received messages.
•

Malicious Adversaries: Parties can deviate arbitrarily to extract data or alter the outcome.

Security proofs guarantee that any subset of corrupted parties learns nothing beyond the legitimate final output.

Practical Considerations.

While SMPC obviates the need for a fully trusted server, it often introduces higher computational and communication overhead than a single trusted third party [23]. Large-scale SMPC can involve frequent message exchanges, especially for complex operations like matrix multiplication in neural networks. Nonetheless, specialized circuit optimizations and precomputation (e.g., random-beaver triplets in SPDZ) have improved the practicality of SMPC for certain machine learning workloads [24].

Connection to EE.

Although SMPC conceals inputs from other parties, it does not necessarily hide internal computations from the machine performing those computations. By contrast, EE (see §2.4) encrypts the internal representations used within neural network layers. In scenarios where partial computations are offloaded to untrusted infrastructure, SMPC ensures data are shared among multiple parties without revealing secrets, and EE obfuscates the intermediate states of the network. Combined, they form a multi-layered approach, with SMPC covering multi-party input privacy and EE preventing visibility into intermediate neural activations or parameters.

2.3 Background: Homomorphic Encryption (HE)

Homomorphic Encryption (HE) is a cryptographic framework that keeps data encrypted while still allowing meaningful computations on it. This capability supports many secure outsourcing and cloud computation scenarios [16, 25], though practical applications often face significant performance challenges. Understanding the basics of HE clarifies why EE focuses on a more targeted approach for neural networks.

Motivation and Basic Setup.

Consider a user with private data $m\in\mathcal{M}$ that must be processed by an untrusted server. Rather than sending $m$ in plaintext, the user encrypts $m$ to produce $c=E(m)$ , where

E:\mathcal{M}\;\rightarrow\;\mathcal{C}.

The server then operates on $c$ to yield some output $\tilde{c}$ . Crucially, the homomorphic property ensures:

D\bigl{(}f^{\prime}(c_{1},c_{2},\dots)\bigr{)}=f\bigl{(}D(c_{1}),D(c_{2}),\dots\bigr{)},

where $D$ is the corresponding decryption function, $f(\cdot)$ is the desired plaintext operation, and $f^{\prime}(\cdot)$ is its encrypted analog. This principle lets the server process encrypted data without learning $m$ [26, 27].

Types of HE Schemes.

HE systems are commonly categorized by how many operations on ciphertexts they support:

•

Partial HE (PHE): Permits repeated use of one operation—addition or multiplication. For instance, RSA-based schemes support multiplicative homomorphism [26], whereas the Paillier cryptosystem supports additive homomorphism [27].
•

Somewhat or Leveled HE: Allows both addition and multiplication up to a certain depth, controlled by noise management. This depth determines how many multiplied ciphertexts can be handled before decryption becomes invalid.
•

Fully HE (FHE): Provides unlimited additions and multiplications, often through “bootstrapping” to periodically refresh ciphertexts and limit noise [16].

Ring-Based Construction and Polynomial Representation.

Modern FHE schemes (e.g., BFV [28], CKKS [25]) typically use polynomial rings for computational efficiency. A cyclotomic polynomial ring

\mathcal{R}=\mathbb{Z}_{q}[x]\big{/}\langle x^{N}+1\rangle

serves as the plaintext space, with additional polynomials denoting ciphertexts. Security derives from adding controlled “noise” that grows with each operation. If not managed, excessive noise can invalidate decryption.

Computational Overheads and Trade-Offs.

Despite extensive research and optimizations, HE can remain much more resource-intensive than plaintext processing [29]. Ciphertext sizes and polynomial arithmetic introduce overhead, and advanced batching or leveled HE schemes [30] partially mitigate but do not eliminate these costs. In particular, LLMs or deep neural architectures demand numerous matrix multiplications across many layers, challenging HE’s performance in real-time or large-scale settings. Parameter tuning, relinearization, and ciphertext expansion can increase both latency and memory usage.

Connection to EE.

EE leverages the concept of secure computation over transformed data but confines encryption to certain high-risk network layers, rather than fully encrypting the entire computational graph. By restricting complex or noise-sensitive operations to plaintext, EE dramatically reduces the overhead commonly associated with HE, yet still prevents exposure of critical internal representations. As we discuss in the next sections (§2.4), this selective encryption yields a more manageable trade-off between runtime performance and data confidentiality in modern neural networks.

2.4 Equivariant Encryption: A Practical Solution for Blind Inference

Equivariant Encryption (EE) is presented here as a selective encryption technique for neural network inference, avoiding the high overhead of fully HE and circumventing the limitations of trusted execution environments (TEEs) or DP. EE keeps inputs and outputs confidential while preserving near-zero additional latency, making it suitable for large-scale models or time-critical applications.

Refer to caption — Figure 1: A concise illustration of Equivariant Encryption’s workflow. A one-time setup (*top*) applies EE transformations to the model on a secure server, and the runtime environment (*middle*) handles encrypted model artifacts along with user queries. This ensures requests and responses remain unreadable by any untrusted infrastructure.

Key Characteristics and Advantages.

EE has the following advantages:

•

Complete Server Blindness: In an EE-based pipeline, raw data, queries, and intermediate activations never appear in plaintext on the server.
•

Negligible Latency: EE sidesteps the typical performance pitfalls of full HE, allowing inference speeds comparable to standard unencrypted processing.
•

Broad Model Applicability: From CNNs to LLMs with attention blocks, EE can accommodate a variety of deep-learning architectures, including multi-modal pipelines.
•

Cost-Effectiveness: By eliminating the need for specialized hardware (as in TEEs) or complex parameter setups (as in HE), EE can lower operating expenses for on-prem or cloud-based deployments.
•

RAG and Beyond: Retrieval-augmented generation workflows remain encrypted end to end, preserving both queries and retrieved documents from external inspection.
•

Simple Integration: EE typically requires minimal changes in code, such as replacing specific layer operations with “encrypted” equivalents.

Motivation: “Blind AI” Without Performance Penalties.

Safeguarding privacy during inference poses significant challenges, particularly for large-scale models and real-time systems. Existing methods have notable drawbacks:

•

HE: Encrypts all operations but struggles with non-linear layers and can incur large runtime expansions.
•

TEEs: Rely on hardware trust, granting potential backdoor privileges to system administrators.
•

DP: Obscures individual contributions through noise but may not secure intermediate activations from a malicious inference server.

Equivariant Encryption addresses these gaps by focusing on layer-specific transformations, retaining strong data confidentiality with minimal overhead.

Overview of EE.

EE works by converting data and selecting neural operators into a specialized “encrypted domain” (Figure 1). Rather than encrypting every operation via polynomial-based homomorphisms, EE tailors transformations to each layer’s structure. This customization permits the network to handle encrypted vectors nearly as if they were plaintext, without the computational blowup seen in fully homomorphic approaches.

Formally, we have the following definition for EE:

Definition 1 (Equivariant Encryption)

Given any plaintext $p$ , EE is an encrypt-decrypt algorithm such that

•

Recoverability:

$p=\mathrm{decrypt}(\mathrm{encrypt}(p)),$ (1)
•

Equivariance:

$\mathrm{decrypt}(F(\mathrm{encrypt}(p)))=F(p),$ (2)

where $F$ represents any linear operations and a specific set of supported non-linear operations.

Currently, our framework directly supports the following set of activation and processing functions: ReLU, GeLU, SiLU, RMS Normalization, and Layer Normalization. The framework can also support other non-linear functions without requiring any modifications.

Comparison with HE.

While both EE and HE enable computations on encrypted data, they differ in overhead and flexibility:

Table 1: Equivariant Encryption (EE) vs. Homomorphic Encryption (HE) for Neural Inference.

Property	EE	HE
Latency Overhead	Near-zero	High
Handling of Non-linear Ops	Exact	Often approximations
Key Management	User-defined	Tied to HE scheme
Security Basis	Large combinatorial space	Lattice / number theory
Scalability to Large Models	Straightforward	Resource-intensive
Accuracy	Matches plaintext	Potential approximation loss
Integration Complexity	Layer-by-layer transforms	Major re-engineering

EE in Practice: Minimal Overheads and Realistic Security.

All transformations are applied once, offline, ensuring the final “encrypted model” maintains the same order of multiplications and additions as an unencrypted version. Consequently, runtime latencies mirror those of plaintext inference. Compromising the data would require inverting $T$ —frequently a high-dimensional transform—rendering brute-force or direct linear-algebraic attacks computationally infeasible.

Deployment Scenarios.

•

LLMs and Conversational Systems: Token embeddings become encrypted embeddings so no plaintext tokens ever appear on the server.
•

Vision Models: Encrypted feature maps flow through convolution and activation layers with minimal overhead.
•

RAG Pipelines: Queries and retrieved content remain enciphered, preventing servers from inspecting user context or knowledge sources.

Summary.

Equivariant Encryption represents a pragmatic, high-performance alternative to fully homomorphic encryption for blind inference. By using a selective approach, encrypting only layers at the highest risk of leaking information, EE achieves robust privacy without sacrificing speed. In large-scale deployments, from LLM serving to real-time analytics, it provides a compelling solution for “always-encrypted” inference that remains both practical and secure.

2.5 Use Case: A Decentralized Infrastructure Example of EE

Although EE applies broadly to any scenario requiring private model inference, this section presents a concrete decentralized infrastructure example, inspired by frameworks that split model execution among multiple nodes or shards. Figure 2 illustrates a setting where:

•

A query enters the system through a decentralized application (dApp) [31] and a wallet mechanism.
•

The query, along with relevant state, is dispatched across a blockchain-like infrastructure, performing a distributed hash table (DHT) lookup for transactions.
•

A message-broker subsystem manages job routing to multiple nodes, each responsible for processing a portion (shard) of a large model [32].
•

Activations and partial outputs flow through gRPC-based links, and final results are stitched together for the user.

Such distributed systems are attractive for scalability and fault-tolerance but can raise privacy questions: intermediate activations, user queries, or model outputs may be visible to untrusted parties at each node. Equivariant Encryption addresses this challenge by encrypting the internal representations, ensuring that no node—except the original client—can interpret the raw data or glean sensitive information. As described in §2.4, EE focuses on carefully chosen transformations that maintain the correctness of computations while preventing adversaries from reconstructing user inputs or outputs. In this sense, it complements existing decentralized methods by preserving high performance without sacrificing privacy.

3 Threat Analysis and Attack Models

Having introduced EE as a general technique for secure model inference, we now focus on potential attacks against such systems. This section formalizes how attackers might attempt to invert or bypass EE when data are transmitted (and processed) in an encrypted form. Although the following examples refer to a network context inspired by decentralized inference and token-based LLM protocols, these considerations apply broadly wherever EE is used to conceal intermediate representations or token mappings.

3.1 Attack Vector Background

We focus on the scenario in which requests and responses are transmitted via HTTP in an equivariantly encrypted form. Specifically, the tokens that represent inputs and outputs for a large language model (LLM) are permuted or transformed according to an unknown mapping. Bad actors intercepting these encrypted token IDs gain access only to a transformed sequence; the legitimate user or trusted client alone knows the key(s) or mapping required to recover the original token IDs.

For concreteness, assume the attacker obtains input-output pairs over some duration. Each pair $(I_{i},O_{i})$ is represented by sequences of token IDs that have been scrambled through EE. The adversary’s goal is to reconstruct or guess the original plaintext tokens used by the standard tokenizer. This setting highlights the difference between observing encrypted token sequences and actually inverting them.

3.2 A Unified Analytical Framework

To systematically study potential attacks, we consider a mathematical optimization viewpoint. Consider a target LLM, such as a Llama-family model, which implements a function

f:\mathcal{C}\to\mathcal{C},

mapping a token-sequence input in some dictionary $\mathcal{V}$ (where $|\mathcal{V}|$ may be up to 128K tokens) to a token-sequence output. Depending on the sampling mechanism, $f$ can be deterministic (greedy decoding) or stochastic (temperature-based or top- $k$ sampling).

An attacker observes $n$ pairs of encrypted input-output sequences $\{(I_{i},O_{i})\}_{i=1}^{n}$ , with each $I_{i},O_{i}\in\mathcal{C}$ after scrambling by EE. The adversary knows the vocabulary set $\mathcal{V}$ but not the specific permutation or mapping $\mathbb{P}$ that recovers plaintext tokens. To mount an attack, the adversary tries to find a mapping $\mathbb{P}:\mathcal{V}\to\mathcal{V}$ such that:

\mathbb{P}(O_{i})\;\approx\;f\bigl{(}\mathbb{P}(I_{i})\bigr{)},

and the decrypted sequences $\bigl{(}\mathbb{P}(I_{i}),\mathbb{P}(O_{i})\bigr{)}$ form a semantically valid question-answer or prompt-response pair. Formally, one might frame this as:

\min_{\mathbb{P}}\;\;\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}\bigl{(}\mathbb{P}(I_{i}),\mathbb{P}(O_{i})\bigr{)}\quad\text{s.t.}\quad\mathbb{P}(O_{i})=f\bigl{(}\mathbb{P}(I_{i})\bigr{)}\;\forall i,

(3)

where $\mathcal{L}\bigl{(}\cdot,\cdot\bigr{)}$ is a loss function that captures how well the decrypted pairs match valid natural language usage and plausible model responses.

Challenges.

We witness the following challenges for solving Equation (3):

•

Loss Function Design: What semantic or linguistic constraints best reflect the adversary’s prior knowledge? For instance, knowledge of frequency distribution (e.g., tokens like “the,” “of,” “and” occur frequently) or grammar structure might be integrated into $\mathcal{L}$ .
•

Discrete Optimization: Finding a permutation $\mathbb{P}$ that satisfies the above constraints is a high-dimensional combinatorial problem on the order of $|\mathcal{V}|!$ , which is intractable to solve exactly for large vocabularies.

3.3 Baseline Attacks

In practice, adversaries often resort to heuristic or partial methods for solving (3). Below, we outline several baseline approaches.

3.3.1 Designing a Loss Function

LLM-as-a-Judge.

One concept is to leverage a powerful reference model (e.g., GPT-4 or another advanced LLM) to score how consistent a decrypted output $\mathbb{P}(O_{i})$ is with the corresponding input $\mathbb{P}(I_{i})$ . For instance, the attacker can prompt the reference LLM to rate the coherence or correctness of the response from 0 to 10, assigning a lower loss for better Q&A alignment. This approach effectively uses a large model’s own understanding to guess whether a proposed permutation is valid.

Linguistic Domain Knowledge.

Alternatively, the adversary can incorporate domain expertise or statistical cues. For example, the frequency of certain tokens (e.g., “the,” “is,” “and”) might be recognized in plaintext language, and grammar rules (e.g., subject-verb-object sequences) can guide guesses about which tokens appear in typical positions. These heuristics inform $\mathcal{L}\bigl{(}\mathbb{P}(I_{i}),\mathbb{P}(O_{i})\bigr{)}$ to penalize permutations that fail to produce plausible word frequencies or syntactic structures.

3.3.2 Designing an Optimizer

Even with a well-defined $\mathcal{L}$ , solving for a global or local minimum in (3) can be difficult. We outline three heuristic attacks:

Brute Force.

The naive method enumerates all permutations of $\mathcal{V}$ , computing the loss each time. With complexity $|\mathcal{V}|!$ , this is clearly infeasible beyond very small vocabularies.

Random Sampling.

A more tractable (though still limited) approach randomly draws $M$ permutations from the space. The attacker then evaluates $\mathcal{L}$ and chooses the lowest-loss candidate. Genetic algorithms or other population-based methods can improve upon pure random sampling by “breeding” permutations that yield better fitness scores.

Hill-Climbing.

Starting from a random or heuristic permutation, an attacker iteratively searches for local improvements by swapping two token mappings at a time. If a swap lowers $\mathcal{L}$ , the permutation is updated. This process continues until no improving swaps are found or computational resources are exhausted. While the algorithm may get stuck in local minima, it can be more effective than random guessing for moderate vocabulary sizes.

Summary.

These baseline attacks demonstrate how an adversary might attempt to invert or weaken Equivariant Encryption by exploiting partial linguistic cues or iterative search heuristics. In large-scale LLM scenarios—with extensive vocabularies and highly varied text inputs—the complexity of inverting the token transformations remains considerable. Nonetheless, these methods highlight the importance of carefully choosing transformations and ensuring sufficient dimensional and combinatorial complexity in EE, so that feasible attacks remain prohibitively expensive in practice.

4 Benchmarking

4.1 Language Models

4.1.1 Fidelity Score

The fidelity score measures the similarity of confidence values for the generated logits between two inference runs. It is defined as:

\text{Fidelity}=1-\frac{\sum_{i=1}^{n}\frac{|s_{i}^{EE}-s_{i}^{VI}|}{\max(s_{i}^{EE},s_{i}^{VI})}}{n},

(4)

where:

•

$n$ is the total number of samples.
•

$s_{i}^{VI}$ and $s_{i}^{EE}$ are the class/first token confidence scores for the $i$ -th sample from Vanilla Inference (VI) and Equivariant Encryption (EE), respectively.

A higher fidelity score indicates that the EE model produces confidence values that are more similar to the VI model. Our benchmarking for text models is as follows. For IMDB dataset, we sampled 5000 entries. For LLMs, we used MT-Bench plus 2000 entries sampled from ShareGPT repeated twice.

Table 2: Comparison of inference latency, fidelity, and output consistency between Vanilla Inference (VI) and Equivariant Encryption (EE) across various language models. Latency is measured in seconds, while fidelity is evaluated on IMDb (for BERT models) and ShareGPT (for LLMs). Standard deviation in inference time is also provided as a percentage in inference time.

Model	vLLM?	bs	VI (s)	EE (s)	$\Delta$ T (%)	Fid (%)	$\Delta$ T Std (%)
BERT-base	No	1	39.02	39.27	$-0.64\%$	92.38	$\pm 0.89\%$
Sentiment-BERT	No	1	39.16	39.08	$+0.20\%$	88.35	$\pm 0.91\%$
RoBERTa-base	No	1	40.31	39.98	$+0.82\%$	99.85	$\pm 0.87\%$
Llama 3.1-8B	No	1	418.58	455.68	$+8.88\%$	99.999	$\pm 0.78\%$
Llama 3.1-8B	Yes	256	293.88	292.33	$-0.53\%$	99.999	$\pm 1.18\%$

References

[1] Michael I Jordan and Tom M Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
[2] OpenAI. ChatGPT: Optimizing language models for dialogue, 2023.
[3] Anthropic. Claude: An AI assistant, 2023.
[4] Google DeepMind. Gemini: A language model by Google DeepMind, 2023.
[5] Llama Team. The Llama 3 herd of models. Llama 3 Technical Report, 2024.
[6] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[7] OpenAI. DALLE-3: Creating images from text, 2023.
[8] OpenAI. Sora: Creating video from text. 2024.
[9] Rishi Bommasani et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[10] Tom Brown et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[11] Jason Wei et al. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022.
[12] Peter Kairouz et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1–210, 2021.
[13] D Ghosh, A Abecassis, and J Loveridge. Privacy and the pandemic: Time for a digital bill of rights. Foreign Policy. Retrieved from: https://foreignpolicy. com/2020/04/20/coronavirus-pandemic-privacy-digital-rights-democracy, 2020.
[14] Yang Liu et al. Privacy-preserving machine learning: Methods, challenges and solutions. IEEE Communications Surveys & Tutorials, 23(2):1178–1209, 2021.
[15] David Evans, Vladimir Kolesnikov, and Mike Rosulek. A pragmatic introduction to secure multi-party computation. Foundations and Trends in Privacy and Security, 2(2-3):70–246, 2018.
[16] Craig Gentry. Fully homomorphic encryption using ideal lattices. Proceedings of the 41st Annual ACM Symposium on Theory of Computing, pages 169–178, 2009.
[17] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Now Publishers Inc, 2014.
[18] Cynthia Dwork. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), pages 265–284. Springer, 2006.
[19] Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318, Vienna, Austria, 2016. ACM.
[20] Andrew C. Yao. Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (FOCS), pages 160–164, 1982.
[21] Michael Ben-Or, Shafi Goldwasser, and Avi Wigderson. Completeness theorems for non-cryptographic fault-tolerant distributed computation. Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (STOC), pages 1–10, 1988.
[22] Ivan Damgård, Valerio Pastro, Nigel Smart, and Sarah Zakarias. Multiparty computation from somewhat homomorphic encryption. In Annual Cryptology Conference, pages 643–662. Springer, 2012.
[23] Ronald Cramer, Ivan Bjerre Damgård, et al. Secure multiparty computation. Cambridge University Press, 2015.
[24] Payman Mohassel and Yupeng Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE Symposium on Security and Privacy (SP), pages 19–38, 2017.
[25] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Homomorphic encryption for arithmetic of approximate numbers. In Advances in Cryptology—ASIACRYPT 2017, pages 409–437. Springer, 2017.
[26] Ronald L Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures and public-key cryptosystems. In Communications of the ACM, volume 21, pages 120–126, 1978.
[27] Pascal Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in cryptology—EUROCRYPT’99, pages 223–238. Springer, 1999.
[28] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully homomorphic encryption. In IACR Cryptol. ePrint Arch., volume 2012, page 144, 2012.
[29] Darko Hrestak and Stjepan Picek. Homomorphic encryption in the cloud. In 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1400–1404. IEEE, 2014.
[30] Jung Hee Cheon, Wootae Kim, and Jai Hyun Park. Efficient homomorphic evaluation on large intervals. IEEE Transactions on Information Forensics and Security, 17:2553–2568, 2022.
[31] Zibin Zheng et al. An overview of blockchain technology: Architecture, consensus, and future trends. IEEE International Congress on Big Data, pages 557–564, 2020.
[32] Samyam Rajbhandari et al. ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2020.