Leaking Secrets through Modern Branch Predictors in the Speculative World

Md Hafizul Islam Chowdhuryy and Fan Yao

M. Chowdhuryy and F. Yao are with the Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL.
E-mail: [email protected], [email protected]

Abstract

Transient execution attacks that exploit speculation have raised significant concerns in computer systems. Typically, branch predictors are leveraged to trigger mis-speculation in transient execution attacks. In this work, we demonstrate a new class of speculation-based attacks that targets the branch prediction unit (BPU). We find that speculative resolution of conditional branches (i.e., in nested speculation) alters the states of the pattern history table (PHT) in modern processors, and that these states are not restored after the corresponding branches are later squashed. This characteristic allows attackers to exploit the BPU as the secret transmitting medium in transient execution attacks. To evaluate the discovered vulnerability, we build a novel attack framework, BranchSpectre, that enables exfiltration of unintended secrets through observing speculative PHT updates (in the form of covert and side channels). We further investigate the PHT collision mechanism in the history-based predictor and the branch prediction mode transitions in Intel processors. Built upon such knowledge, we implement an ultra-high-speed covert channel (BranchSpectre-cc) as well as two side channels (i.e., BranchSpectre-v1 and BranchSpectre-v2) that rely solely on the BPU for mis-speculation triggering and secret inference in the speculative domain. Notably, BranchSpectre side channels can take advantage of much simpler code patterns than those used in Spectre attacks. We present an extensive BranchSpectre code gadget analysis on a set of popular real-world application code bases, followed by a demonstration of a side channel attack on OpenSSL. The evaluation results show substantially wider existence and higher exploitability of BranchSpectre code patterns in real-world software. Finally, we discuss several secure branch prediction mechanisms that can mitigate transient execution attacks exploiting modern branch predictors.

Index Terms:

Branch Predictor, Transient Execution Attacks, Nested Speculation, Pattern History, Side Channels.

1 Introduction

As end users are increasingly demanding higher performance from computer systems, processor vendors have been looking for every possible source of optimization and improvement in microarchitecture design. Modern processors heavily rely on speculation to offer high instruction level parallelism. Under speculation, the processor executes instructions based on certain predicted paths, which may potentially resolve to be the wrong executions (i.e., transient execution). As transient execution can defy program software semantics, the underlying speculation engine is carefully designed to ensure that executions of undesired instructions are squashed, and no architectural state changes are in effect for mis-speculation.

Recent advances in transient execution attacks have demonstrated the possibility of exploiting speculative execution to construct dangerous information leakage attacks. As the branch prediction unit (BPU) plays a key role in determining instructions to be fetched in the speculative path, it has been heavily exploited in these attacks to trigger mis-speculation. Particularly, in Spectre V1, the attacker can induce speculative access of unintended data through branch direction mistraining, which enables the later leakage of secrets through a microarchitectural side channel [1]. Although hardware-based side channels in the non-speculative domain are widely studied [2, 3, 4, 5, 6], transient execution attacks considerably empower these classical information leakage threats by expanding the attack surface to the speculative domain. While several system-level mitigation techniques have been proposed [7, 8, 9], recent studies show that these techniques are either ineffective against certain attack variants or rarely employed in userspace due to performance concerns [10, 5].

In this work, we demonstrate a new class of hardware-based information leakage that exploits BPU state updates within the speculative domain. Our key observation is that resolutions of conditional branch instructions in the speculative path (e.g., nested speculation) alter the states of the branch pattern history, the pattern history table (PHT) in particular. More critically, these speculative updates of PHT states are not restored even after the squash of speculatively executed branches in modern processors. As branch instruction outcomes in the speculation domain can depend on data accessed in a domain beyond the programmer’s original intention, speculative branch execution can potentially be exploited to perform transient execution attacks in which the BPU is utilized as the secret transmitting hardware. We systematically explore the aforementioned security vulnerability and implement a new form of BPU side/covert channel in the speculative domain, which we term BranchSpectre. Similar to how Spectre fuels traditional cache timing channels, BranchSpectre reveals a more severe security concern for branch predictors, with the attack manifesting in the speculative world, as compared to prior BPU side channels [11, 12, 5]. Furthermore, BranchSpectre exhibits two characteristics that distinguish it from existing speculation-based exploits: (i) BranchSpectre relies completely on the BPU for transient execution triggering and speculation-domain secret transmission, minimizing the hardware footprint for attackers. It can bypass the bulk of existing defense techniques, which mostly target protection in the cache hierarchy [13, 14]; (ii) Unlike Spectre attacks, which depend on relatively rare code gadgets (e.g., memory access indirection [15]), BranchSpectre can utilize much simpler code patterns that are more common (e.g., a branch whose condition is based on a speculatively loaded value). These characteristics enable even higher exploitability for BranchSpectre as a transient execution attack in real systems.

This article considerably extends our prior work [16] in the following aspects: i) We provide new insights about mode transitions and the PHT collision mechanism for hybrid branch predictors in commercial off-the-shelf processors, which enable substantially higher efficiency in speculative secret transmission. ii) We introduce a new variant of BranchSpectre side channel exploitation (i.e., BranchSpectre-v2) that chains arbitrary BranchSpectre gadgets through branch target poisoning, which further enhances the attack flexibility and capability over BranchSpectre-v1. iii) We conduct a comprehensive investigation of code patterns in commonly used application binaries to quantify the existence of BranchSpectre gadgets and demonstrate a real-world BranchSpectre attack against OpenSSL on Intel processors. Our new findings further broaden the scope of the previous work and highlight the need for rethinking branch predictor designs that are secure in speculative executions. In summary, the key contributions are as follows (our PoC source is released at http://tiny.cc/hfgutz):

• We find that speculative update of PHT states in modern processors creates a new information leakage threat that can leverage branch predictors as the transmitting medium in the speculative domain.

• We systematically explore the prediction mode transition in three recent generations of Intel processors and reverse-engineer the PHT collision mechanism under the history-based predictor. The discovery enables probing of the PHT state perturbations by an attacker with extremely high efficiency.

• We present a novel transient execution attack framework, BranchSpectre, that enables information leakage through the BPU in various forms: BranchSpectre-cc, an ultra-fast covert channel that achieves up to 1.3 Mbps transmission bit rate; and the BranchSpectre-v1 and BranchSpectre-v2 side channels, which leverage conditional branch mistraining and branch target poisoning, respectively, to induce speculative PHT updates in nested speculation.

• We perform extensive analysis on code bases of 10 popular open-source applications/libraries, which shows wider existence and stronger leakage capability of BranchSpectre gadgets in real systems. Further, our case study demonstrates a real-world BranchSpectre side channel on OpenSSL that achieves 97.3% bit accuracy.

• We discuss potential speculation-secure branch predictor designs that can mitigate transient execution attacks exploiting modern branch predictors.

2 Background

2.1 Branch Prediction Structure

The branch predictor is a critical per-core structure that directs the control flow of speculative execution. At a high level, the BPU performs two major tasks: direction prediction, which speculatively decides whether a conditional branch is taken or not, and destination prediction, which predicts the branch target address. The BPU enables the processor to continue execution on the predicted path before the branch’s outcome is resolved, minimizing pipeline stalls. The underlying speculation engine ensures that instructions eventually retire in order using the re-order buffer and that only committed instructions can change the architecturally visible states (e.g., architectural registers and memory).

Figure 1: A high-level illustration of a modern branch predictor.

Typically, the BPU takes advantage of the Pattern History Table (PHT) for direction prediction. Each entry of the PHT incorporates a state machine using a saturating counter. For instance, in a 2-bit saturating counter, four possible states are associated with each PHT entry: Strongly Taken (ST), Weakly Taken (WT), Weakly Not Taken (WN) and Strongly Not Taken (SN). To predict branch directions, the BPU can operate in different modes [5]. Specifically, in the one-level prediction mode, the branch address is the only information source used to index the PHT. As a result, each branch maps to a single PHT entry. One-level prediction excels in fast training, but it performs poorly for branches whose outcomes depend on the program execution context (e.g., the execution history of recently executed branches). In contrast, history-based prediction (i.e., two-level prediction) maintains the history of prior branch executions in a branch history buffer. It leverages both the branch history and the branch address to access the PHT and can train multiple PHT entries, each of which corresponds to a unique branching context [16]. The history-based prediction mechanism can predict branches with complex patterns with a substantially higher degree of accuracy at the expense of longer training time.
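The four-state machine described above can be sketched in a few lines of C. This is a minimal textbook model, not a description of any vendor's implementation; the numeric encoding of the states and the function names `pht_predict` and `pht_update` are assumptions made for illustration.

```c
#include <stdbool.h>

/* States of a textbook 2-bit saturating counter. The numeric encoding
   (upper two values predict "taken") is an assumption of this sketch. */
typedef enum { SN = 0, WN = 1, WT = 2, ST = 3 } pht_state;

/* Direction predicted from the current state. */
bool pht_predict(pht_state s) {
    return s >= WT;
}

/* Move one step toward the resolved outcome, saturating at SN and ST. */
pht_state pht_update(pht_state s, bool taken) {
    if (taken)
        return (s == ST) ? ST : (pht_state)(s + 1);
    return (s == SN) ? SN : (pht_state)(s - 1);
}
```

In this model a single contrary outcome moves a strong state to the weak state on the same side, so the prediction only flips after two consecutive contrary outcomes.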

Modern processors (e.g., from Intel) generally use a hybrid design that leverages both one-level and history-based prediction in tournament mode, as illustrated by the Selection logic in Figure 1. In particular, the one-level prediction uses a predetermined number of bits from the branch address to select a PHT entry. The history-based prediction instead combines the global history register (which represents the branch history state) with the branch address to index into the PHT. The global history register (GHR) is a shift register that keeps track of the most recent history among all branches executed on the core. Note that since the number of entries in the PHT is limited, some branches will unavoidably map to the same PHT entry, leading to PHT collisions that create interference in predictions for those branches.
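To make the two indexing schemes concrete, the following sketch contrasts them. It is illustrative only: the table size (`PHT_BITS`) is a made-up value, and the XOR combination is the classic gshare-style hash, swapped in here as a stand-in because the actual index function of Intel processors is not given in the text.

```c
#include <stdint.h>

#define PHT_BITS 14u                      /* hypothetical table size: 2^14 entries */
#define PHT_MASK ((1u << PHT_BITS) - 1u)

/* One-level mode: only low-order branch address bits select the entry. */
uint32_t pht_index_one_level(uint64_t branch_addr) {
    return (uint32_t)(branch_addr & PHT_MASK);
}

/* History-based mode (gshare-style assumption): fold the global history
   register into the index so each branching context gets its own entry. */
uint32_t pht_index_history(uint64_t branch_addr, uint64_t ghr) {
    return (uint32_t)((branch_addr ^ ghr) & PHT_MASK);
}
```

Under this model, two branches whose addresses share the low-order bits always collide in one-level mode, whereas in history-based mode they collide only when the history state matches as well.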

2.2 Microarchitectural Timing Channel Attacks

Microarchitectural attacks are a class of information leakage threats where a malicious process manages to receive or infer secrets via a stealthy communication channel using microarchitectural components as the transmitting medium. Among various attack variants, timing channels that modulate access latency to hardware resources are most widely exploited. These attacks can either manifest as covert channels that allow two isolated domains to willingly transmit data illegitimately or as side channels in which a spy process illicitly steals secrets from an unknowing victim process. Prior works have demonstrated timing channels on various hardware components in modern processors such as function units [2], caches [17, 1], and memory bus [18]. Recent works [5, 12] show that attackers can infer program-defined secrets that are used as branch conditionals by observing the perturbation in BPU microarchitectural states. To mitigate these classical timing channels, hardware-based techniques such as partitioning that avoid resource sharing [17, 19, 20], obfuscating timing observations [21, 22] and randomizing hardware access patterns [23] are proposed.

2.3 Transient Execution Attacks

Transient execution attacks augment classical side channels by exploiting the effect of speculation in modern processors [1]. They leverage the fact that the speculative execution path driven by the speculation engine can defy the program semantics in case of a branch misprediction. The temporarily erroneous execution flow can lead to unintended memory accesses that cross security boundaries. A successful transient execution attack depends on two factors: 1) the unintentionally accessed data is propagated to taint certain microarchitectural states and 2) the microarchitectural states remain unpurged after mis-speculation is detected. Figure 2 illustrates the high-level comparison between the transient execution attack and a non-speculative side channel. While non-speculative side channels typically rely on the victim directly using secret-dependent control flow or data flow, transient execution attacks are even more dangerous as they substantially broaden the data reachable by an attacker.

Spectre attacks abuse branch predictors to trigger transient execution of instructions that access restricted data. These attacks widely harness caches as the target hardware component for emitting secrets, as memory blocks accessed by speculative loads/stores remain in the cache even after speculation is rolled back. Particularly, the V1 variant mistrains the BPU to predict the wrong branch direction, while in V2, the attacker performs branch target injection to hijack the speculative control flow. To mitigate Spectre attacks, system-level defenses are employed that aim to either limit speculation through software patches (e.g., adding fences or using retpoline [7]) or restrain branch target poisoning (e.g., IBRS and STIBP [24, 8]). While these techniques can mitigate security breaches due to speculation, recent studies reveal that they either do not defeat all attack vectors or may introduce non-trivial performance overhead, which hinders their adoption in userspace applications [10, 5].

3 Threat Model

Similar to previously demonstrated transient execution attacks [10, 1], we assume the attacker can run a process on the same core as the victim. These two processes run either on the same hardware context in a round-robin fashion or on individual virtual cores under simultaneous multi-threading. Since the PHT is a per-core structure, the victim’s perturbations on the PHT can potentially be observed by the attacker. The attacker process only has userspace privileges, and a malicious OS is not required. Further, we assume that the attacker has knowledge about the code/binary of the victim application and can trigger the victim’s execution.

4 PHT Update in Speculative Path

Speculative execution can be triggered by a multitude of on- and off-chip events with varying resolution window sizes. For instance, speculation induced by contention in functional units may be resolved within only a few cycles, while it can take thousands of cycles if triggered by a last-level cache (LLC) miss followed by a row-buffer conflict in memory [15]. As branches are among the most common instructions in programs, multiple branch instructions may be encountered in the speculation path of a program. To avoid pipeline stalls in the front-end, modern processors perform nested speculation, where branch predictions continue to be made for branch instructions executed in the speculative path within a certain speculation window [25]. If the earlier branch that triggered the speculation resolves and a misprediction is detected, the processor will squash all dependent instructions, including the subsequent branches that were fetched along the speculative path. Note that if a speculatively executed branch is resolved before it is squashed, the processor can potentially update the branch predictor’s states (e.g., PHT) based on its branch outcome.

1 if (x < bound) // Outer branch
2   // Inner branch in loop structure
3   for (int i = 0; i < iterator; ++i)
4     <some_operations>;

Listing 1: Sample code with potential nested speculation.

The code example in Listing 1 shows why it may be beneficial to allow speculative PHT state updates. In this example, depending on the availability of bound and iterator, the inner branch (line 3) can be resolved very quickly. In contrast, the outer branch (line 1) may remain unresolved for several iterations of the inner branch. If the outer branch is predicted such that execution enters the loop rather than skipping it, the inner branch may be executed for multiple iterations. Once the outer branch is resolved and the actual direction of the branch is known, the processor’s states are rolled back if the prediction was wrong. If the PHT entry for the inner branch is not updated in the speculative path, the BPU would not be trained properly for predicting the loop behavior. For instance, the inner branch could be mostly $taken$ while the initial PHT state for the inner branch is in the $not\ taken$ state. In this case, the inner branch will continue to be mispredicted, leading to degraded performance if the entire speculation path turns out to be useful. In contrast, if the PHT is allowed to update in the speculative path, the PHT entry will converge to $taken$ after a few iterations of the inner branch.

1 bool control;
2 if (i < bound) {      // Parent branch (b_p)
3   start = rdtsc();
4   if (control)        // Child branch (b_c)
5     <some_operations>;
6   end = rdtsc();
7 }

Listing 2: Testing PHT updates in speculative path.

Based on the above discussion, we can see that updating the PHT based on nested speculation can bring potential performance benefits. However, once the dependent branch is resolved and the validity of the entire speculation is refuted, the processor needs to decide how to deal with those speculative PHT updates. In particular, the processor may choose to restore the PHT to the states before speculation started. This can annul the impact of speculation with respect to the PHT perturbations by speculative branch executions in the wrong path. However, prior academic studies have shown that recovering speculative updates of the PHT when mis-speculation is detected brings negligible performance advantage [26]. In order to figure out the PHT update mechanism in real processors, we design a microbenchmark that monitors the PHT perturbations for branch executions within mis-speculated paths. The core code snippet is shown in Listing 2 where a child branch $b_{c}$ can be speculatively executed when speculation is triggered by the parent branch $b_{p}$ . The experiment performs the following steps:

➊ Initialization: In the first step, we execute a sequence of branch instructions with randomized outcomes [5] that forces the BPU to use the one-level prediction (See Section 5 below for more discussions). The one-level prediction utilizes branch address exclusively to index PHT, thus making it easier to control collision in order to infer the state of a particular PHT entry later.

➋ Triggering $b_{p}$ misprediction: In this step, we first train the $b_{p}$ branch with not taken outcomes and subsequently trigger mis-speculation with an out-of-bound i value. Based on the value of $control$ , $b_{c}$ will be taken (or not taken). The code segment in Listing 2 is executed multiple times so that the PHT state of $b_{c}$ will converge to taken (or not taken) if BPU updates are not squashed for the transient executions of $b_{c}$ . Note we load $control$ in cache and flush $bound$ out of cache to ensure $b_{c}$ is resolved before $b_{p}$ .

➌ Infer the outcome of $b_{c}$ : In this step, we set an in-range value for $i$ , and preload $i$ and $bound$ in cache. We then execute the code again with $control$ value set to 1 (i.e., $b_{c}$ should be taken) and measure the latency for executing the code block in Line 4-5.

We run this experiment on machines with three generations of Intel processors: Skylake, Coffee Lake and Cascade Lake. On each of the machines, we execute the aforementioned experiment 1000 times for each of the following configurations: 1) $b_{c}$ in step ➋ is not taken, and $b_{c}$ in step ➌ is taken; 2) $b_{c}$ is taken in both steps ➋ and ➌. Note that this can be easily set up by controlling the value of $control$ . In Figure 3, we show the latency distributions for executing the code block in lines 4-5 in step ➌. As we can clearly see, the execution time of $b_{c}$ in step ➌ is consistently shorter (i.e., correct prediction) if the outcome of branch $b_{c}$ is the same as the speculatively resolved outcome of that branch in ➋. In contrast, when the outcome of $b_{c}$ in step ➌ is different from step ➋, we observe longer execution time, indicating that a misprediction has occurred. These results show that conditional branches resolved in transient execution do influence later branch predictions even after they are squashed.

Based on the investigation results, we make the key observation that conditional branches executed on the speculative path change the PHT state when the branch outcome is resolved, and these alterations are not restored regardless of whether the branch is eventually committed or not. Moreover, this observation is consistent across all processors we have evaluated. We note that the microarchitectural footprint in the BPU due to speculation can create a new avenue for transient execution attacks, essentially making it possible to exploit the BPU as the secret transmitting medium in the speculative domain. To the best of our knowledge, we are the first to investigate the exploitation of the PHT state changes in branch predictors for speculative branches in transient execution attacks. Note that while we mainly explore the speculation behavior of BPUs in Intel processors, our observation could also be applicable to chips from other vendors. Particularly, any processor that allows speculative updates of the PHT can be vulnerable to this exploitation.

Figure 3: Execution latency distribution for the $b_{c}$ branch in step ➌ corresponding to the Taken and Not taken outcome of $b_{c}$
5 Understanding Modern Branch Prediction Mechanisms

5.1 History-based Predictor Triggering Mechanism

Understanding the branch prediction mode in operation and the conditions that trigger the BPU to transition from one prediction mode to another is critical for an attacker to create PHT collisions. BranchScope [5] triggers the one-level predictor mode by running a large sequence of branches (more than 100K) with random outcomes that cannot be well predicted by the history-based predictor. While one-level prediction simplifies the procedure of PHT collision, it incurs substantial runtime overhead for the attacker as the randomization procedure is needed frequently. This is because one-level prediction is typically only active for a short period before the BPU resumes the history-based prediction mode, which generally exhibits better prediction accuracy. A recent study [12] has proposed an extension of the BranchScope attack by exploiting the history-based predictor. It shows that history-based activation can be enforced by running an empirically found sequence of conditional branches with a certain length. However, the exact details of the triggering mechanism have not yet been fully explored. We systematically reverse-engineer the history-based prediction transitioning mechanism, enabling efficient control of the prediction mode in the BPU.

According to prior studies [12], the saturating counters in the PHT have different sizes in different prediction modes. Particularly, 2-bit counters (4 states) are used in the one-level prediction, and 3-bit ones (8 states) are utilized in the history-based prediction. For an $n$ -bit saturating counter, values within [ $0$ , $2^{n-1}-1$ ] represent $taken$ states and [ $2^{n-1}$ , $2^{n}-1$ ] are the not taken states (or vice versa). If the branch predictor is first initialized with one-level prediction, it is possible to determine whether it has transitioned to the history-based prediction based on the misprediction behavior of a certain PHT entry. Specifically, when the target PHT entry for a branch is set to strongly taken, executions of the same branch with not taken outcomes (NNN...N) will result in 2 mispredictions in one-level prediction but 4 mispredictions in history-based prediction before they start to be predicted correctly. Based on this observation, we employ a two-step procedure to determine the current prediction mode: First, a conditional branch is executed a sufficient number of times in one fixed direction (i.e., either taken or not taken), which trains the corresponding PHT entry to the strong state (i.e., all ’1’s or ’0’s in the counter). Second, the same branch is executed $K$ ( $K>4$ ) times with the opposite branch outcome, and the execution latency of the basic block of the branch is measured. Note that while executing the same branch guarantees that the same PHT entry is accessed in the one-level predictor, different PHT entries may be exercised for the same branch in history-based mode depending on the GHR state. Therefore, to ensure only one PHT entry is used (in the case of history-based prediction), we run a sufficiently long sequence of predetermined branches before executing the target branch to preset the GHR. Figure 4 shows the distinctive misprediction patterns that can be used to identify whether the current prediction mode is one-level or history-based.
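The 2-versus-4 misprediction signature follows directly from the counter width, which the sketch below reproduces in a simple model. It assumes an idealized $n$ -bit saturating counter in which the upper half of the values predicts taken (one of the two encodings mentioned above); the function names are hypothetical, and real measurements would infer mispredictions from latency rather than from counter state.

```c
#include <assert.h>

/* Count mispredictions when a branch whose n-bit counter sits in the
   strongest "taken" state is executed k times with not-taken outcomes.
   Assumed encoding: values >= 2^(n-1) predict taken. */
int mispredictions_after_flip(int n_bits, int k) {
    assert(n_bits >= 1 && n_bits < 16 && k >= 0);
    int threshold = 1 << (n_bits - 1);
    int counter = (1 << n_bits) - 1;      /* strongly taken */
    int mispredicted = 0;
    for (int i = 0; i < k; ++i) {
        if (counter >= threshold)
            ++mispredicted;               /* predicted taken, resolved not taken */
        if (counter > 0)
            --counter;                    /* train toward not taken */
    }
    return mispredicted;
}

/* Classify an observed misprediction count for K = 6 probes:
   returns 1 if it matches the 3-bit (history-based) signature. */
int looks_history_based(int observed_mispredictions) {
    return observed_mispredictions == mispredictions_after_flip(3, 6);
}
```

With $K=6$ , the model yields 2 mispredictions for a 2-bit counter and 4 for a 3-bit counter, which is why any $K>4$ is enough to separate the two modes.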

Algorithm 1: Determining the triggering of history-based prediction

Input: t_branch, seq_length, mispred_rates
Output: pred_mode

for r ∈ {mispred_rates} do
    // Generate outcome sequence S_r with expected misprediction rate r
    S_r = gen_seq(r, seq_length)
    // Execute t_branch sequence with S_r outcome
    exec(t_branch, S_r)
    // Check if history-based prediction is active
    pred_mode = chk_mode(t_branch)

Function chk_mode(t_branch):
    // Set the test outcome sequence of the target branch
    S_t = {TTTTTTTT, NNNN}
    exec(t_branch, S_t)
    // Check misprediction times (e.g., K set to 6)
    num_mispred = mispredictions on last K executions
    if num_mispred == 4 then
        return History-based
    else
        return One-level

We hypothesize that modern processors employ a tournament-style design where the prediction accuracy of each prediction entity is dynamically monitored, and the best-performing prediction mode for each branch (or a set of branches) is selected. To evaluate the transition criteria, we create a microbenchmark that executes a target branch instruction many times with a predetermined outcome sequence. This sequence can be configured such that the execution will result in a certain misprediction rate under the one-level prediction. We call such a sequence the exercising sequence. Note that the exercising sequence for a certain misprediction rate can be generated by first initializing the PHT entry state. For example, the branch sequence TNTNTN will lead to 50% misprediction rate if the initial PHT state is set to WN. After the exercising sequence is executed, the benchmark tests the current prediction mode by executing another instruction sequence with the same target branch, the testing sequence. The branch outcomes during the testing sequence are set to distinguish the prediction patterns as shown in Figure 4. The overall procedure is shown in Algorithm 1.

For each exercising sequence at a specific misprediction rate (i.e., $S_{r}$ ), we execute the microbenchmark 100 times. The misprediction rate is confirmed by reading performance counters. At the end of each run, the program checks the current effective prediction mode. We then compute the success rate of enabling the history-based predictor. We perform the same experiment with varying lengths of the exercising sequence. Figure 5 illustrates the minimum required length for each target misprediction rate. The results reveal that the required number of executions of the target branch decreases as the misprediction rate increases. Specifically, we believe that the accumulated number of mispredictions is used as the triggering criterion: when three mispredictions occur under the one-level prediction, the BPU transitions to the history-based prediction. We note that this phenomenon can potentially be attributed to the internal selection logic that uses a confidence counter to choose a winning prediction mode [27]. Using this knowledge, we can swiftly trigger the history-based predictor by executing the target branch only six times with the outcomes TNTNTN. Note that this sequence induces either three or six mispredictions under one-level prediction, depending on the initial state of the PHT entry.
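The claim that TNTNTN always accumulates at least three mispredictions can be checked against a simple model. The sketch below assumes a textbook 2-bit saturating counter (values 0 to 3, with 2 and 3 predicting taken) and treats the three-misprediction trigger as the hypothesis stated above, not as a confirmed hardware rule; both function names are hypothetical.

```c
#include <stddef.h>

/* Run an outcome string ('T' = taken, 'N' = not taken) through a textbook
   2-bit saturating counter starting at `state` (0..3) and count how many
   outcomes are mispredicted. */
int count_mispredictions(int state, const char *outcomes) {
    int mispredicted = 0;
    for (size_t i = 0; outcomes[i] != '\0'; ++i) {
        int taken = (outcomes[i] == 'T');
        if ((state >= 2) != taken)
            ++mispredicted;
        if (taken) {
            if (state < 3) ++state;
        } else {
            if (state > 0) --state;
        }
    }
    return mispredicted;
}

/* Hypothesized trigger: three accumulated mispredictions under
   one-level prediction switch the BPU to history-based mode. */
int triggers_history_mode(int state, const char *outcomes) {
    return count_mispredictions(state, outcomes) >= 3;
}
```

In this model every one of the four starting states yields either three or six mispredictions for TNTNTN, so the hypothesized threshold is reached without the attacker needing to know the initial counter value.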

5.2 Creating PHT Collisions in History-based Predictor

With the history-based prediction, the PHT entry of a conditional branch depends on the state of the branch pattern history stored in the GHR. As a result, multiple PHT entries may be trained for predicting one branch in this mode. To create a PHT collision in the history-based prediction, using a congruent branch (as in one-level prediction) is no longer sufficient. Particularly, the GHR has to be properly set for both the observee and observer branches (i.e., the branch of the victim and the attacker in a side channel). One possible way to do this is to execute an excessive number of conditional branches to ensure the GHR is flushed. However, such a mechanism undermines the efficiency of PHT state inference. A more optimized approach is to precisely configure the GHR, which requires knowledge about the size of the GHR and how it is populated. Classical BPU designs implement the GHR as a shift register where ’1’ or ’0’ is inserted when a conditional branch is resolved as taken or not taken, respectively [28]. However, prior studies have shown that the GHR in Intel processors is populated with partial bits from the target addresses of taken branches [29]. To determine the exact size of the GHR, the following experiment is performed: 1) activate the history-based prediction for the target branch (as discussed in Section 5.1), 2) preset a certain PHT entry to strongly taken by executing $N$ distinct $taken$ branches followed by a target branch with $taken$ outcome, 3) detect PHT collision by executing the same $N$ $taken$ branches followed by the execution of the target branch with $not\ taken$ outcome.

We vary $N$ and measure the PHT collision detection accuracy under each setting. This experiment is run 1000 times for each $N$ value and the results are shown in Figure 6. We can see that executing 12 taken branches before the target branch is sufficient to preset the state of the GHR for direction prediction and ensure PHT entry collision. This observation is consistent among all processors we tested. We therefore conjecture that the size of the GHR used for history-based prediction is $12\times B_{t}$ bits, where $B_{t}$ is the number of bits from the target address of a taken branch that are shifted into the GHR.
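The reasoning behind the $12\times B_{t}$ conjecture is that a shift register of that width is fully overwritten after 12 taken branches, so its prior contents no longer matter. The sketch below illustrates this under loudly stated assumptions: $B_{t}$ is set to a made-up value of 2, the low-order target bits are used (which bits real hardware selects is not stated in the text), and the function names are hypothetical.

```c
#include <stdint.h>

#define TAKEN_BRANCHES  12u   /* from the measurement reported in the text */
#define BITS_PER_TARGET 2u    /* hypothetical value for B_t */
#define GHR_BITS (TAKEN_BRANCHES * BITS_PER_TARGET)
#define GHR_MASK ((1u << GHR_BITS) - 1u)

/* On a taken branch, shift B_t target-address bits into the history. */
uint32_t ghr_on_taken(uint32_t ghr, uint64_t target) {
    uint32_t bits = (uint32_t)(target & ((1u << BITS_PER_TARGET) - 1u));
    return ((ghr << BITS_PER_TARGET) | bits) & GHR_MASK;
}

/* Replay a run of taken-branch targets starting from an arbitrary history. */
uint32_t ghr_after(uint32_t ghr, const uint64_t *targets, unsigned count) {
    for (unsigned i = 0; i < count; ++i)
        ghr = ghr_on_taken(ghr, targets[i]);
    return ghr;
}
```

In this model, replaying the same 12 targets from two different starting histories ends in the same GHR value, whereas 11 targets leave the oldest $B_{t}$ bits dependent on the prior state, mirroring the threshold seen in the experiment.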

6 Overview of Exploitation

In this section, we show the overview of the BranchSpectre attack design which performs information leakage by inferring secrets from BPU state updates in transient execution.

As discussed in Section 5, a PHT entry with an $n$-bit saturating counter has $2^{n}$ possible values: 1 Strongly Taken state, $2^{n-1}-1$ Weakly Taken states, $2^{n-1}-1$ Weakly Not-taken states, and 1 Strongly Not-taken state. Once a PHT collision is achieved, the attacker can infer secrets by observing the sequence of prediction outcomes made for a colliding branch. Specifically, to infer the outcome of a victim branch $b_{v}$ when executed speculatively, we first use a colliding branch $b_{a}$ from the attacker’s address space to initialize the target PHT entry to a strong state. (Note that from this point on, executing a target branch under the history-based predictor implicitly means that the GHR is properly preset to ensure collision.) The attacker then triggers the execution of the victim’s branch $b_{v}$ speculatively. Finally, we execute $b_{a}$ again to infer the state of the PHT entry that has already been perturbed by the victim. If the outcome of $b_{v}$ depends on a secret value (i.e., an unintended secret), the attacker can reveal that value after the mis-speculation is corrected. Figure 7 shows a high-level implementation of the attack. Note that while we use side channel terminology in this description, the same techniques can also be applied to covert channels. We now discuss how the attacker achieves each of these steps:

Step 1:

PHT initialization for victim’s branch. The attacker has two goals in this stage. First, the attacker trains the PHT entry of the victim’s branch ( $PHT_{t}$ ) so that it is pushed to a deterministic state (e.g., either ST or SN). For $n$-bit counters, this can be achieved by executing a branch $b_{a}$ in the attacker’s address space that is congruent to $b_{v}$ $2^{n}-1$ times with the taken (or not taken) outcome. That is, $b_{a}$ is executed 3 times for the one-level predictor and 7 times for the history-based predictor (corresponding to $n=2$ and $n=3$, respectively). Second, the attacker either mistrains the branch direction prediction (in the case of a conditional branch) or poisons the target address (in the case of an indirect jump) of the parent instruction $b_{v0}$ in the victim’s process. This triggers transient execution past $b_{v0}$ along the path that contains the $b_{v}$ instruction. Finally, the attacker triggers the victim process execution and waits until $b_{v}$ is executed speculatively.

Step 2:

Victim execution in speculative path. When the victim runs, branch $b_{v}$ is first speculatively executed and later squashed. The attacker can control the speculation window of $b_{v0}$ to make the speculation sufficiently long so that $b_{v}$ is resolved first in the speculative path. The speculative resolution of $b_{v}$ alters the state of $PHT_{t}$ depending on its condition operands (likely unintended data). Essentially, after the victim’s execution, $PHT_{t}$ is tainted with the out-of-bound value.

Step 3:

Infer secret by probing $\mathbf{PHT_{t}}$ state. In this step, the attacker aims to infer the secret value in the victim’s address space by probing the state of $PHT_{t}$ . To do so, the attacker executes the branch $b_{a}$ that is congruent to $b_{v}$ with the outcome in the opposite direction from Step 1. To observe the difference, the attacker executes $b_{a}$ $2^{n-1}$ times and records the latency of each execution.

Figure 8 demonstrates the generalized state transition of $PHT_{t}$ for $n$-bit counters after the attacker presets the $PHT_{t}$ state to taken in Step 1. We can see that the value of $secret$ is directly correlated with the prediction of the $2^{n-1}$-th inference operation in Step 3. In particular, if $b_{v}$ is resolved as not taken speculatively, the $2^{n-1}$-th branch of the attacker will be correctly predicted; otherwise, a misprediction occurs. The attacker can then infer the secret used as the condition of $b_{v}$ based on timing, as shown in Figure 3. Similar PHT state diagrams can be generated for branches with not taken outcomes in Step 1 and taken outcomes in Step 3.
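
The three steps can be checked against a toy model of a single PHT entry. The following is a minimal sketch, assuming an idealized $n$-bit saturating counter that predicts taken whenever its value is at least $2^{n-1}$ and that retains the speculative update from $b_{v}$. The class `SaturatingCounter` and the function `probe_mispredicts` are hypothetical names, and the model reports mispredictions directly instead of measuring latency.

```python
class SaturatingCounter:
    """Toy n-bit saturating counter standing in for one PHT entry."""

    def __init__(self, n_bits):
        self.max_value = (1 << n_bits) - 1   # Strongly Taken
        self.threshold = 1 << (n_bits - 1)   # values >= threshold predict taken
        self.value = 0                       # start at Strongly Not-taken

    def predict(self):
        """Return True if the entry currently predicts taken."""
        return self.value >= self.threshold

    def update(self, taken):
        """Move one state toward the resolved outcome, saturating at the ends."""
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


def probe_mispredicts(n_bits, victim_taken):
    """Return True if the 2^(n-1)-th probe in Step 3 is mispredicted."""
    entry = SaturatingCounter(n_bits)

    # Step 1: b_a runs 2^n - 1 times as taken, forcing Strongly Taken.
    for _ in range((1 << n_bits) - 1):
        entry.update(taken=True)

    # Step 2: b_v resolves once in the speculative path; the update is kept.
    entry.update(taken=victim_taken)

    # Step 3: b_a runs 2^(n-1) times as not taken; keep the last prediction.
    mispredicted = False
    for _ in range(1 << (n_bits - 1)):
        mispredicted = entry.predict()   # predicted taken, resolved not taken
        entry.update(taken=False)
    return mispredicted


if __name__ == "__main__":
    for n in (2, 3):
        print(n, probe_mispredicts(n, False), probe_mispredicts(n, True))
```

For both $n=2$ and $n=3$, the model yields a correct final prediction when $b_{v}$ resolves as not taken and a misprediction when it resolves as taken, which matches the behavior described above.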

7 BranchSpectre Covert Channel Attack

To demonstrate the information leakage threat posed by speculative PHT updates, we investigate a covert channel in which a trojan and a spy exploit transient branch execution to establish covert communication. We call it BranchSpectre-cc. Unlike previous covert channel attacks in the non-speculative domain [12, 11, 5], covert channels using speculation can be more stealthy and remain undetected even in the presence of dynamic software analysis techniques [30].

To construct the BranchSpectre-cc attack, the spy first executes Step 1 from Section 6 and then waits for the trojan’s execution (Step 2) before inferring the secret in Step 3. The code gadget for the trojan’s exploitation can be any value-dependent conditional branch executed in the speculative path. After activating a certain branch prediction mode, the attacker follows this sequence of actions: Spy initialization: preset $PHT_{t}$ to the deterministic state by executing $b_{a}$ $\rightarrow$ Trojan training: execute $b_{v}$ speculatively with a condition that depends on sensitive data $\rightarrow$ Spy inference: infer the state of $PHT_{t}$ after the trojan’s execution. Since the trojan and the spy are colluding, a PHT collision between them can be achieved by simply using branches (both $b_{a}$ and $b_{v}$ ) with the same address (for one-level prediction) or by executing the same set of 12 taken branches to preset the GHR state before the execution of $b_{a}$ and $b_{v}$ (for history-based prediction). Figure 9 shows the communication protocol of the covert channel and illustrates how the trojan transmits bits ’010’. Figure 10 illustrates the latency traces observed by the spy corresponding to the $2^{n-1}$-th execution in the inference phase for a snippet of a 50-bit transmission. The spy observes a clear pattern differentiating bit ’0’ and bit ’1’.

There are several ways to improve the transmission rate of the covert channel. A common optimization is to reuse the inference operations for one bit as the initialization operations for the next bit. Specifically, we increase the number of branch executions in the inference phase to $2^{n}-1$ . This way, at the end of the inference, the target PHT entry is already set to a strong state (which is the purpose of the initialization step for the next bit). Note that the spy still infers the secret by observing the prediction for the $2^{n-1}$-th inference branch. We can thus improve the bit rate by removing the spy’s separate initialization and alternating the PHT entry of $b_{a}$ between $ST$ and $SN$ . We also apply speed-enhancing techniques specific to each prediction mode. For instance, under one-level prediction, we can increase the bit rate by tuning the number of bits transmitted between reinforcements of the one-level predictor.
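
The alternation between $ST$ and $SN$ can be sketched with the same idealized counter model. The code below is an illustration under stated assumptions, not the evaluated implementation: the bit encoding (bit 1 resolves $b_{v}$ against the current strong state, bit 0 resolves it in agreement) is an assumption made for this example, the function name `transmit` is hypothetical, and mispredictions are read directly instead of being inferred from latency.

```python
def transmit(bits, n_bits=2):
    """Simulate trojan-to-spy transfer over one shared n-bit PHT counter."""
    max_value = (1 << n_bits) - 1
    threshold = 1 << (n_bits - 1)
    counter = max_value      # one-time initialization to Strongly Taken
    strong_taken = True      # direction of the current strong state
    received = []

    def step(taken):
        """Predict, update the counter, and report whether it mispredicted."""
        nonlocal counter
        predicted_taken = counter >= threshold
        if taken:
            counter = min(counter + 1, max_value)
        else:
            counter = max(counter - 1, 0)
        return predicted_taken != taken

    for bit in bits:
        # Trojan (assumed encoding): bit 1 pushes against the strong state,
        # bit 0 resolves in the same direction and leaves it saturated.
        step(strong_taken if bit == 0 else not strong_taken)

        # Spy: 2^n - 1 probes in the opposite direction. The 2^(n-1)-th
        # probe carries the bit; the remaining probes saturate the counter
        # in the new direction, which serves as the next initialization.
        probe_direction = not strong_taken
        outcomes = [step(probe_direction) for _ in range(max_value)]
        mispredicted = outcomes[threshold - 1]
        received.append(0 if mispredicted else 1)
        strong_taken = probe_direction

    return received


if __name__ == "__main__":
    print(transmit([0, 1, 0]))   # prints [0, 1, 0]
```

Because every inference phase ends with the counter saturated in the probe direction, the spy never needs a dedicated initialization after the first bit.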

Figure 11 illustrates the raw transmission rate of BranchSpectre-cc and the corresponding error rate under each prediction mode. With the one-level predictor, we observe that the dominant bit rate improvement comes from coalescing the transmission of multiple bits per randomization operation. The attacker can achieve a peak transmission rate of 131Kbps with a bit error rate within 5% (shown in Figure 11a). We find that further increasing the bit rate leads to a considerable drop in bit accuracy due to transitions of the prediction mode. On the other hand, Figure 11b shows that under history-based prediction, the adversary can achieve up to 1.3Mbps with a bit error rate below 4%, which is an order of magnitude faster than in one-level prediction mode. We note that BranchSpectre-cc with the history-based predictor is considerably more efficient since it eliminates the expensive additional operations needed to keep the one-level predictor active. Compared to the existing BPU-based covert channel leveraging coarse-grained pattern history manipulations [11], BranchSpectre-cc exhibits a much higher transmission rate due to precise control of collisions on a single PHT entry in history-based prediction mode.

8 BranchSpectre Side Channel Attack

In this section, we demonstrate BranchSpectre side channels that enable inferring speculation-domain secrets from a victim process through branch predictor exploitation. To leverage this vulnerability, the attacker needs to find appropriate gadgets in the victim application. Particularly, BranchSpectre depends on the presence of two types of gadgets in the victim application: a trigger gadget to start mis-speculation and a transmitter gadget that perturbs a target PHT entry with the information of speculative secret.

Trigger gadget. Generally, any speculation-inducing instruction that diverts control flow can be a trigger gadget. However, to be exploitable, the trigger gadget needs to fulfill two goals: (i) speculative execution in the wrong path that leads to the transmitter gadget and (ii) preparation of the speculatively accessed secret (e.g., propagating the secret to the instruction operands in the transmitter).

Transmitter gadget. The transmitter gadget can be as simple as a conditional branch that uses data from an out-of-bound access or, more generally, a register or memory location (tainted by a secret in the speculative domain) as part of its condition. Note that one such conditional jump using the tainted register is sufficient, as it alters the PHT state for that branch according to the secret value.

Our proposed attack can manifest in ways similar to either Spectre V1 or V2, and we call our attacks BranchSpectre-v1 and BranchSpectre-v2 respectively. The BranchSpectre-v1 attack leverages a conditional branch to induce mis-speculation. Its speculative execution path, which traverses the trigger gadget and the transmitter gadget, follows the static control flow graph of the victim program. In contrast, BranchSpectre-v2 harnesses branch target poisoning as the triggering mechanism. The exploited path is driven by chaining the attack gadgets at potentially arbitrary locations. As a result, the speculative path in BranchSpectre-v2 is not constrained by the victim’s static control flow. Since the execution path in a v2-type attack can use instruction sequences throughout the entire address space, it provides adversaries with higher attack flexibility as well as greater code gadget availability.

8.1 Side Channel Implementation

Using the attack methodology discussed in Section 6, we can now build the two variants of BranchSpectre side channels. Specifically, the adversary locates a code sequence that corresponds to a transient execution path covering both a trigger gadget and a transmitter gadget. For BranchSpectre-v1, the code sequence contains two conditional branches where the first branch (e.g., CMP <conditions> $\rightarrow$ Conditional Jump <LABEL>) induces mis-speculation that leads to the execution of the second one with speculative conditional values. In BranchSpectre-v2, the trigger gadget ends with an indirect jump/call (e.g., a virtual function invocation), and its target address is made to point to the transmitter gadget (i.e., the speculative conditional branch) through branch target buffer (BTB) poisoning. Figure 12 illustrates the primary steps of side channel exploitation for the BranchSpectre-v1 and BranchSpectre-v2 variants. Once the victim branch is identified and its corresponding PHT entry is determined, the attacker locates a branch in its own address space that collides with the same PHT entry ( $PHT_{t}$ ). As shown in Figure 12, $PHT_{t}$ is first initialized by the attacker to a predetermined state. The attacker then triggers transient execution of the victim’s target branch $b_{v}$ in the speculative path, which resolves before the earlier branch it depends on is squashed. Lastly, the attacker infers the secrets by observing the $PHT_{t}$ state changes made by the speculative victim branch using the technique shown in Section 6.

Achieving PHT collision in side channels. Under the one-level predictor, the attacker can achieve PHT collision by using an attack branch ( $b_{a}$ ) with the same address as, or an address congruent to, that of the victim branch $b_{v}$ . For history-based prediction, the attacker has to ensure that the GHR state (filled by the last 12 taken branches) before the execution of $b_{a}$ is exactly the same as the one before the execution of $b_{v}$ in the victim process. A privileged attacker can achieve this by interrupting the victim before the execution of $b_{v}$ and preparing a predetermined GHR value. However, this is potentially more challenging for unprivileged attackers since they cannot arbitrarily control context switches of the victim. Fortunately, we observe that it is not uncommon for the branches leading to $b_{v}$ to be secret independent and to exhibit persistent prediction behaviors. As a result, the attacker can perform off-line profiling of the victim binary with sample inputs to replicate the branch history in its inference stage. To ensure collision, gadgets exploitable under history-based prediction have the additional constraint that sufficient deterministic conditional branches must precede the transmitter gadget. Note that if most (but not all) of the branches that impact the GHR state during the execution of the transmitter gadget can be determined, the attacker may still observe the perturbation of $PHT_{t}$ by $b_{v}$ by probing all possible PHT entries that would be touched.
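
The last point, probing every PHT entry that the victim could have touched, can be illustrated as follows. This is a simplified sketch under several assumptions: the PHT is modeled as a table keyed directly by the 12-entry branch history, each taken branch contributes a placeholder 2 bits to the GHR, and counters are idealized $n$-bit saturating counters. The functions `candidate_histories` and `leak_with_partial_history` are hypothetical names.

```python
from itertools import product


def candidate_histories(known, bits_per_target=2):
    """Enumerate full histories; None marks a branch the attacker cannot predict."""
    choices = [range(1 << bits_per_target) if h is None else (h,)
               for h in known]
    return list(product(*choices))


def leak_with_partial_history(known, actual, victim_taken,
                              n_bits=2, bits_per_target=2):
    """Return the candidate entries that show a perturbation by b_v."""
    max_value = (1 << n_bits) - 1
    threshold = 1 << (n_bits - 1)
    candidates = candidate_histories(known, bits_per_target)

    # Step 1: preset every candidate entry to Strongly Taken.
    pht = {history: max_value for history in candidates}

    # Step 2: the victim touches only the entry selected by its real history.
    # A taken outcome leaves the saturated counter unchanged.
    if not victim_taken:
        pht[actual] -= 1

    # Step 3: probe each candidate with 2^(n-1) not-taken executions and
    # keep those whose final probe was predicted correctly (not taken).
    perturbed = []
    for history in candidates:
        value = pht[history]
        for _ in range(threshold):
            predicted_taken = value >= threshold
            value = max(value - 1, 0)
        if not predicted_taken:
            perturbed.append(history)
    return perturbed


if __name__ == "__main__":
    known = (1,) * 10 + (None, None)   # two of twelve branches are unknown
    actual = (1,) * 10 + (3, 0)        # history the victim really had
    print(len(candidate_histories(known)))
    print(leak_with_partial_history(known, actual, victim_taken=False))
    print(leak_with_partial_history(known, actual, victim_taken=True))
```

In this model the number of entries to probe grows as $2^{k B_{t}}$ for $k$ undetermined branches, so the approach stays practical only when few history bits are unknown.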