Mithril: Cooperative Row Hammer Protection on Commodity DRAM Leveraging Managed Refresh
Abstract
Since its public introduction in the mid-2010s, the Row Hammer (RH) phenomenon has drawn significant attention from the research community due to its security implications. Although many RH-protection schemes have been proposed by processor vendors, DRAM manufacturers, and academia, they still have shortcomings. Solutions implemented in the memory controller (MC) incur increasingly higher costs due to their conservative design for the worst case in terms of the number of DRAM banks and RH threshold to support. Meanwhile, DRAM-side implementation either has a limited time margin for RH-protection measures or requires extensive modifications to the standard DRAM interface. Recently, a new command for RH-protection has been introduced in the DDR5/LPDDR5 standards, referred to as refresh management (RFM). RFM enables the separation of the tasks for RH-protection to both MC and DRAM by having the former generate an RFM command at a specific activation frequency and the latter take proper RH-protection measures within a given time window. Although promising, no existing study presents and analyzes RFM-based solutions for RH-protection. In this paper, we propose Mithril, the first RFM interface-compatible, DRAM-MC cooperative RH-protection scheme providing deterministic protection guarantees. Mithril has minimal energy overheads for common use cases without adversarial memory access patterns. We also introduce Mithril+, an optional extension to provide minimal performance overheads at the expense of a tiny modification to the MC, while utilizing existing DRAM commands.
I Introduction
Row Hammer (RH) is a critical DRAM reliability and security vulnerability that has troubled the industry for almost a decade. It refers to a phenomenon in which frequent activations of a certain row (the aggressor) cause bit flips in its adjacent rows (the victims). In particular, RH occurs when the activation count exceeds the RH threshold. RH is especially dangerous as it breaks the basic integrity guarantee of the computer system and can be abused in various attack scenarios [1, 18, 13, 46, 12, 55, 59].
The criticality of this problem has motivated many RH-protection solutions. There exist several software-based solutions [4, 8, 55, 20, 31], but these typically incur a high performance cost and have limited coverage (i.e., they are only effective against specific attack scenarios). For these reasons, architectural solutions have emerged as promising alternatives.
One of the important design decisions for an architectural RH-protection scheme is to determine where to implement the proposed solution within the system. In practice, most RH-protection solutions are either implemented in an on-die memory controller (MC) or a DRAM device. For example, Graphene [43], BlockHammer [56], and PARA [30] have been proposed for implementation on the processor-side MC, whereas TWiCe [32] and industry-oriented RH-protection schemes [40, 15] are implemented in DRAM. Unfortunately, both choices have their own drawbacks.
First, the MC-side implementation must provision RH-protection resources for the worst-case scenario, in which the expected RH threshold is very low and the processor is connected to the maximum number of DRAM banks it supports. As a result, this strategy tends to require a large extra area for the counter structures utilized by the RH-protection mechanism. DRAM-side implementations are free from such concerns because the RH threshold of a specific DRAM device is estimated more accurately by its vendor, and the resource usage is proportional to the number of DRAM banks because on-DRAM RH-protection schemes are often deployed on a per-bank or per-DIMM basis. However, such on-DRAM protection schemes have interface issues. To secure the time margin for the extra operations on potential RH victim rows, DRAM-side schemes must either request that the MC generate non-standard adjacent row refresh (ARR) commands or perform the extra operations during the auto-refresh process (an ordinary DRAM operation) in a way transparent to the MC. The former mechanism breaks the abstraction that DRAM is a passive device, whereas the latter [15], referred to as the time-margin-stealing method, is not always possible depending on DRAM characteristics such as the time margin available during the auto-refresh process.
Refresh Management (RFM) is a newly added extension for the latest DDR5 and LPDDR5 interfaces [23, 22], allowing the DRAM-side implementation of an RH-protection solution to cooperate smoothly with an MC. An MC sends an RFM command at a specific activation frequency to a target DRAM bank without specifying a target row. The DRAM-side RH-protection scheme exploits the time margin provided by the RFM command to undertake necessary operations. This cooperation between the MC and DRAM effectively avoids the critical drawbacks of MC- or DRAM-side only implementations.
Table I: Classification of prior architectural RH-protection schemes and Mithril.
Mitigation Scheme | Protection Guarantee | Remedy | Location | Tracking Mechanism
PARA [30] | Probabilistic | ARR | MC | Probabilistic approach
CBT [49, 48] | Deterministic | ARR | MC | Grouped-counter approach
TWiCe [32, 33] | Deterministic | ARR (feedback-augmented) | DRAM (buffer chip) | Streaming algo. (Lossy Counting)
Graphene [43] | Deterministic | ARR | MC | Streaming algo. (counter-based summary)
BlockHammer [56] | Deterministic | Throttling | MC | Streaming algo. (Count-min Sketch)
Mithril | Deterministic | RFM | DRAM (co-op with MC) | Streaming algo. (counter-based summary)
Despite its promising traits, the applicability of RFM as an RH-protection scheme has not been publicly verified or properly evaluated to the best of our knowledge. A prior probabilistic scheme [30] can be trivially applied. However, prior deterministic schemes (which guarantee that no aggressor row exceeds the RH threshold) cannot be directly applied to the RFM interface. Prior ARR-based schemes reactively issue a command targeting a specific row when its activation count reaches a scheme-specific predefined threshold. The RFM interface, in contrast, is periodic, which makes it prone to the worst-case scenario in which a large number of rows simultaneously require a preventive refresh within a short time period, unlike the ARR-based schemes. Thus, prior approaches are not compatible with the RFM interface.
In this paper, we propose Mithril, a novel RFM-compatible, deterministic RH-protection scheme that exploits MC and DRAM in a cooperative manner. To avoid the aforementioned concentration of rows to refresh for RH-protection, we utilize a greedy approach when selecting the target row to refresh upon every RFM command. We investigate the effective use of streaming algorithms [38] (Section III) and provide a new mathematical proof through which we guarantee deterministic protection by maintaining the greedy selection scheme (Section IV and Appendix).
Finally, we propose 1) a hardware scheme to obviate the need for counter table resets, which were mandatory in prior studies; 2) an algorithmic optimization for energy savings; and 3) an extension to the RFM interface to mitigate the performance overhead by exploiting the memory access patterns of ordinary workloads.
The key contributions of this paper are as follows:
• We propose Mithril, the first RFM-interface-compatible, deterministic RH-protection scheme, which greedily selects the target row to preventively refresh at every RFM command.
• We provide a rigorous mathematical proof of the modified algorithm and the RH safety of Mithril.
• We suggest energy and performance optimization techniques that exploit the memory access patterns of common, non-adversarial workloads.
II Background
II-A DRAM Refresh
DRAM stores a single bit in a cell composed of one capacitor and one access transistor [41]. These cells are organized into rows and columns. A DRAM row, whose cells share a wordline, is the granularity of activation (ACT) and precharge (PRE), which respectively allow and disallow read or write operations on the row. Read and write operations access a certain number of columns in an activated row. A DRAM device is composed of multiple banks, each of which allows independent ACT, PRE, read, and write operations. Multiple banks form a rank, which shares the memory channel with other ranks and the memory controller (MC) on the host side.
Due to the inherent characteristic of a DRAM cell capacitor, by which the stored charge leaks over time, the cell value must be restored periodically [7, 5]. This periodic restoration, referred to as auto-refresh, is initiated by every refresh (REF) command and completes within the tRFC (refresh cycle time) period. Every DRAM row must be refreshed at least once during every refresh window (tREFW) to be safe from this charge retention problem. In modern DRAM devices (e.g., DDR5 [23]), all rows in a single bank are typically divided into 8,192 groups, and one group is refreshed every refresh interval (tREFI).
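As a quick arithmetic illustration (using 32 ms and 64 ms as example tREFW values), the average spacing between REF commands follows directly from the group count:

```latex
t_{\mathrm{REFI}} = \frac{t_{\mathrm{REFW}}}{8192}
\approx 3.9\,\mu\mathrm{s}\ (t_{\mathrm{REFW}} = 32\,\mathrm{ms})
\quad\text{or}\quad
7.8\,\mu\mathrm{s}\ (t_{\mathrm{REFW}} = 64\,\mathrm{ms})
```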
II-B Row Hammer Phenomenon
Row Hammer (RH) refers to a phenomenon in which repetitive activations of a specific row (the aggressor) lead to bit flips in physically nearby rows (the victims) [39, 30, 42, 57]. A bit flip becomes observable when the ACT count reaches the RH threshold without the victim being refreshed inside a tREFW window. Because two aggressors can simultaneously affect a single victim, half as many ACTs on each aggressor can cause a bit flip (a double-sided attack). The RH threshold varies across chips, generations, and DRAM manufacturers [29]. The RH problem has worsened with the continued scale-down of fabrication technology due to intensified inter-cell interference. Recent studies [15, 29] report that the threshold has been reduced to a mere several thousand ACTs. It has also been observed that non-adjacent rows affect the victim rows when activated frequently, which further degrades the effective RH threshold.
II-C Classifying Prior RH Mitigation Schemes
As shown in Table I, existing architectural RH-protection schemes can be characterized by four important criteria: 1) the protection guarantee, 2) the type of remedy, 3) the implementation location, and 4) the tracking mechanism.
II-C1 Protection Guarantee
There exist two different types of RH-protection guarantees: deterministic and probabilistic. The deterministic guarantee ensures RH-protection by guaranteeing that a victim row is always refreshed before the number of ACTs on its aggressors exceeds the RH threshold, either by an extra preventive refresh or by the normal auto-refresh. This type utilizes a counter structure to track aggressor rows and deals with them by applying a certain remedy. The main drawback of a deterministic scheme is its higher area overhead due to the large counter structure.
The probabilistic guarantee prevents RH with a certain probability. The probabilistic approach has its strength in minimal area overhead. However, its performance overhead is severely exacerbated when the target RH threshold is lowered or when the number of DRAM devices in the system increases. It also does not provide a deterministic protection guarantee.
II-C2 Remedies of Prior RH-protection Schemes
Prior works exploited one of two remedies: adjacent row refresh (ARR) or throttling. ARR refers to a type of command that the MC issues to DRAM with an explicit target row address (either aggressor or victim) at the required moment. It triggers an extra preventive refresh of the potential RH victim rows within the time margin provided by the command. This differs from the normal REF command, which is row-agnostic and periodic. Prior RH-protection schemes that exploited ARR issued the command either with some probability [30, 52, 58] or when the ACT count of a certain aggressor exceeded a scheme-specific predefined threshold assumed to be hazardous. However, ARR is not practically applicable because it either requires a new interface that breaks the abstraction of a passive DRAM device or requires the MC to become the sole manager of RH-protection. In fact, a command with a similar concept was once proposed for DDR4 but is now deprecated.
Throttling is a method by which the MC limits the activation rate of an aggressor from the moment of identification for a defined time. The duration and intensity of the delay are adjusted to guarantee RH-protection. After the initial suggestion of this methodology [17], a deterministic RH-protection scheme utilizing throttling was proposed [56]. However, leveraging throttling requires system-level support along with more complex MC scheduling, and it makes the system vulnerable to adversarial patterns (details in Section VI).
II-C3 Implementation Location
Prior RH-protection schemes are all located either on the MC side or on the DRAM side. MC-side implementation has the strength of utilizing a superior logic process with a larger area budget. However, it has the following major drawbacks. First, it requires a conservatively large number of counter structures. The counter table of a deterministic scheme is typically allocated per DRAM bank, and the latest server CPUs, such as Intel Ice Lake, support up to 1,024 banks per socket (8 channels × 8 ranks × 16 banks). This number could increase further with 3D-stacked DRAM devices or future generations. Even though fully populating 1,024 banks may be unlikely, the counter structures must be designed to support the worst case. Second, MC-side implementation must protect against a conservatively low target RH threshold. The threshold varies greatly depending on the manufacturer, the generation, or even the individual device. Considering that most deterministic schemes must be tuned to the target RH threshold at design time, they must protect against pessimistic values.
DRAM-side implementation typically relies on an extra preventive refresh of a potential RH victim row. However, it is difficult to secure adequate uninterrupted time to execute preventive refreshes under the conventional MC-DRAM interface. Previous DRAM-side RH-protection schemes attempted to address this problem with either a feedback-augmented ARR command [32] or the auto-refresh time-margin-stealing method [15]. The former is similar to the normal ARR command issued by MCs but requires that DRAM halt the MC for a certain amount of time. There exist some feedback paths from DRAM to the MC, such as the ALERT_n signal, but delivering additional alert types to support a DRAM-side RH-protection scheme requires more pins. The latter method, auto-refresh time-margin stealing, invisibly executes a preventive refresh during the normal auto-refresh operation. Although it requires no feedback path, the time margin that can be stolen is limited, so it cannot be scaled to a low RH threshold.
Table II: Symbols related to DRAM refresh, RH, and RFM.
Term | Description
tREFW | Per-row auto-refresh interval (e.g., 32 ms or 64 ms)
RH threshold | Number of ACTs on aggressor rows beyond which a victim row may experience bit flips
RFM threshold | Number of ACTs to a bank after which the MC issues an RFM command
Preventive refresh | Extra refresh of potential RH victim rows; executed during an ARR or RFM command, or hidden under auto-refresh
II-C4 Tracking Mechanism and Streaming Algorithms
Each RH-protection scheme has its own tracking mechanism to identify the aggressor or victim rows with high ACT counts. The tracking mechanism of a probabilistic scheme is often negligible. For a deterministic scheme, however, it is crucial to choose an effective tracking mechanism to minimize the area overhead of the counter structure. One class of tracking mechanisms is based on streaming algorithms [38], which are effective at estimating per-row ACT counts when the counter table size is limited. Multiple prior works [32, 43, 56] explicitly leverage, or can be interpreted as being based on, such streaming algorithms.
Streaming algorithms were originally invented and developed, independently of the RH problem, in the field of data mining to analyze fast and dense data streams with limited memory. A certain subset of these algorithms estimates the total number of occurrences of each input element. Considering that ACT commands with addresses are "streamed" from the MC to DRAM, this subset of streaming algorithms can be utilized to estimate the ACT count per row address; thus, they are suitable as effective tracking mechanisms for an RH-protection scheme. They report an approximate number of occurrences for each element (address), referred to as the estimated count, instead of the actual count. Generally, the resolution of the algorithm improves (its error decreases) as more memory is used.
Several other works [49, 48, 26] use the different approach of a grouped counter. They allocate multiple rows to a single counter to reduce the area overhead of the tracking mechanism. They optimize further by dynamically adjusting the allocation or by utilizing the characteristics of DRAM.
II-D RFM Interface as a New Remedy
The RFM interface has been newly introduced as an alternative remedy that allows for DRAM-MC cooperation. It is suggested as the primary means of RH-protection by the JEDEC committee [24, 25]. The RH-protection scheme resides on the DRAM side while the MC provides a periodic but DRAM-row-agnostic time margin to the DRAM bank. Periodic here is based not on time but on the number of ACTs to a single DRAM bank. Figure 1 shows an example of a main-memory organization using the RFM interface and the RFM issue logic. The MC has a Rolling Accumulated ACT (RAA) counter per bank that keeps track of the number of ACTs to that bank. When the RAA count reaches the RFM threshold set by the DRAM device, the MC issues an RFM command only to the corresponding bank and resets the RAA counter for the target bank. The larger the RFM threshold, the lower the frequency of RFM commands, which reduces the effect on system performance. At every RFM command, the recipient bank receives a time margin (tRFM) during which no disturbance from any other regular operation is guaranteed.
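As an illustration, the MC-side issue logic described above reduces to a per-bank counter check. The sketch below is a minimal behavioral model (the class and method names are ours, and the RFM threshold of 64 is only an example value, not one mandated by the standard):

```python
class RfmIssueLogic:
    """Minimal model of the per-bank RFM issue logic in the memory controller."""

    def __init__(self, num_banks: int, rfm_threshold: int):
        self.rfm_threshold = rfm_threshold      # ACTs per bank between RFM commands
        self.raa = [0] * num_banks              # Rolling Accumulated ACT counter per bank

    def on_activate(self, bank: int) -> bool:
        """Count an ACT to `bank`; return True if an RFM command must be issued."""
        self.raa[bank] += 1
        if self.raa[bank] >= self.rfm_threshold:
            self.raa[bank] = 0                  # reset RAA for the target bank
            return True                         # issue RFM to this bank only
        return False


# Example: with a threshold of 64, every 64th ACT to a bank triggers one RFM command.
mc = RfmIssueLogic(num_banks=32, rfm_threshold=64)
issued = sum(mc.on_activate(bank=0) for _ in range(640))
assert issued == 10
```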
A key difference with regard to the prior ARR command is that RFM is row-agnostic and periodic (i.e., it cannot be issued in a bursty way). In a sense, it can be seen as an extension of the time-margin-stealing method. The format of an RFM command is similar to that of a per-bank REF command [23, 22], specifying the bank to which RFM applies but not a certain row. Therefore, it adds minimal complexity to the MC. The symbols related to DRAM refresh, RH, and RFM are summarized in Table II.
III Investigating RFM-based Schemes
RFM as a remedy for RH-protection allows for DRAM-side implementation with MC cooperation, eliminating multiple drawbacks of MC-side-only or DRAM-side-only implementations. First, RFM can minimize the aforementioned overprovisioning of the MC-side-only implementation because the DRAM vendor can use an accurate prediction of the RH threshold and can even set the RFM threshold after testing the manufactured DRAM chip. It also scales with the number of DRAM devices actually attached to the host. Second, RFM provides a standard interface that a DRAM-side RH-protection scheme can utilize to gain an additional time margin for RH preventive refreshes. The ARR command assumed in many prior works is not supported in the recent DDR interfaces. RFM is newly adopted and is now recommended as the primary method for RH-protection [25, 24].
III-A Incompatibility of Prior Approaches
Although promising, prior approaches based on ARR are not effective with RFM because RFM is vulnerable to a concentration of victim rows that require a preventive refresh. An ARR-based scheme has its own predefined threshold value directly related to the target RH threshold. When its tracking mechanism detects the ACT count of an aggressor row reaching the predefined threshold, it immediately issues an ARR command and executes preventive refreshes to guarantee deterministic RH safety. For example, Graphene with ARR can provide safety for an RH threshold that is linear in the predefined threshold (red line in Figure 2). Even if the predefined threshold is low, the relationship between the predefined threshold and the protected RH threshold does not change.
However, when this ARR-based approach is applied to the RFM interface, there is a limit to the RH threshold that can be guaranteed safe, regardless of how low the predefined threshold is set (see Figure 2). Following the same prior approach, a scheme could set a predefined threshold and buffer the aggressor rows that reach it. Then, when a subsequent RFM command is issued, the postponed preventive refresh can be executed on the corresponding adjacent victim rows. However, such a scheme is vulnerable when multiple aggressor rows reach the predefined threshold within a short period. For example, when the predefined threshold is 2K and the RFM threshold is reasonably set to 64 (see Section VI), the safe RH threshold becomes 20K, not 10K. This occurs because 310 rows can each reach 2K ACTs in a single tREFW period; thus, the last buffered row must wait through (310 × 64) ACTs before its victims are refreshed.
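Spelling out the worst-case arithmetic of this example (310 being roughly the number of distinct rows that can each accumulate 2K ACTs within one tREFW):

```latex
\underbrace{310}_{\text{rows reaching 2K ACTs}} \times \underbrace{64}_{\text{ACTs per RFM interval}}
= 19{,}840 \approx 20\text{K ACTs that the last buffered row must wait through}
```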
III-B Greedy Selection
To prevent the concentration of victim rows requiring a preventive refresh in an RFM-based scheme, it is necessary to properly select a target row and refresh its victims at every opportunity, even if the ACT count of the row has not reached the RH threshold or any other predefined threshold. In particular, we propose greedy selection of a target row upon every RFM command for the RFM-based scheme.
An intuitive method for properly selecting a row at every RFM command is to greedily choose the row with the highest estimated ACT count based on the tracking mechanism. Also, after choosing the row and refreshing its victims, it is logical to reset or minimize the estimated ACT count of the selected row to assist with the decision at the next RFM command, as the actual ACT count is effectively zero after the refresh. Based on this simple principle, we search for a proper tracking mechanism.
III-C Counter-based Summary
We choose to use a variant of streaming algorithms for the RFM-based RH-protection scheme. While the grouped-counter approach was effective in ARR-based work, it is no longer efficient under RFM (Section III-D). To support the greedy selection policy properly, the streaming algorithm must relate the actual ACT count to both a lower and an upper bound of the estimated ACT count. We explain this in detail with an example.
The Counter-based Summary (CbS) algorithm [37, 36, 2] is a representative streaming algorithm that matches these needs. The CbS algorithm maintains a table with a fixed number of entries, each holding an address and a counter. When the queried address hits an entry in the table (on-table), the counter of the corresponding entry is incremented by one. When it misses (off-table), the address of the entry with the minimum counter value in the table is replaced with the queried address, and its counter is then incremented by one (see Figure 3). Due to its monotonically increasing nature and the swapping, the counter value accumulated above the table minimum belongs to the currently written address, whereas the portion below the minimum cannot be attributed to a specific source.
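A minimal sketch of this update rule (our own illustration of the generic CbS step with a Python dictionary, not Mithril's exact hardware structure) is:

```python
def cbs_update(table: dict[int, int], capacity: int, row: int) -> None:
    """One Counter-based Summary update for an activated `row`.

    `table` maps a tracked row address to its estimated ACT count and holds
    at most `capacity` entries.
    """
    if row in table:                       # on-table: increment the matching counter
        table[row] += 1
    elif len(table) < capacity:            # free entry: start tracking the row at 1
        table[row] = 1
    else:                                  # off-table: evict the minimum-count entry,
        victim = min(table, key=table.get)
        min_count = table.pop(victim)      # inherit its count, then increment by one
        table[row] = min_count + 1


# The estimate of an on-table row is its counter; an off-table row is estimated
# by the minimum counter in the table (its possible undercount is at most that minimum).
```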
The CbS algorithm reports the estimated (ACT) count of an on-table address as its stored counter value, whereas the count of an off-table address is estimated by the minimum counter value in the entire table. For any row address a, with Min denoting the minimum counter value in the table, the estimated count is bounded with respect to the actual (ACT) count as follows:

actual count(a) ≤ estimated count(a)   (1)
estimated count(a) ≤ actual count(a) + Min   (2)

Inequality (1) gives the lower bound of the estimated count in relation to the actual count, and inequality (2) gives the upper bound.
First, based on the lower bound (inequality (1)) of the estimated count, the RH-protection scheme can act upon an inaccurate yet conservatively large ACT value. This allows the scheme to provide deterministic safety [32, 43, 56]. Second, the upper bound (inequality (2)) of the estimated count is also necessary to decrement the estimated count of the greedily selected row at the RFM command, whose actual ACT count is now zero. Without this upper bound, the estimated count could not be decremented safely. The Lossy-Counting algorithm used in TWiCe [32] also has both lower and upper bounds on the estimated counts, but it is algorithmically less efficient (as later shown in Figure 6): it trades a higher area overhead for fewer preventive refreshes. Thus, we choose the CbS algorithm as the basic building block of our tracking mechanism.
There exist other streaming algorithms that provide only a lower bound on the estimated count, such as Count-min Sketch [11]; these can only be used in throttling-based works such as BlockHammer [56]. Others that lack the lower bound, such as Sticky Sampling [35] or Count Sketch [11], cannot provide deterministic safety.
III-D Grouped Counter Approach
The grouped-counter approach is another type of tracking mechanism used in ARR-based works. However, prior works adopting this methodology are not compatible with, or efficient under, the RFM interface. CBT [49, 48] is the representative scheme of this type. First, it cannot utilize the RFM opportunities during its tree construction phase: if it chooses to refresh a group prematurely, before the group is fully split, it has to refresh many rows too conservatively. Second, even after the tree is constructed, a leaf node covering more than eight rows cannot be handled within a single tRFM period, leading to a pile-up of refresh work. CAT-TWO [26], which extends CBT, may guarantee that a leaf is small enough (covering a single row), but only at the cost of a higher area overhead.
III-E Probabilistic RFM-based Scheme
An RFM-compatible probabilistic RH-protection scheme (henceforth PARFM) can be built in a manner similar to PARA [30]. Whenever an RFM command arrives, PARFM randomly samples a single aggressor row among the ACTs issued since the previous RFM command. PARFM's protection capability depends solely on the RFM threshold. By adjusting the RFM threshold properly, PARFM can provide probabilistic safety at the target RH threshold. However, as the RH threshold decreases, PARFM requires a lower RFM threshold than deterministic RFM-based schemes to maintain a high safety probability, leading to greater performance and energy overhead. We discuss this further in Section VI.
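One way to realize the sampling PARFM needs is reservoir sampling of size one over the ACTs seen since the previous RFM command; the sketch below is our own illustration, not a mechanism specified by [30] or the standard:

```python
import random


class Parfm:
    """Per-bank PARFM model: uniformly sample one aggressor among the ACTs
    observed since the last RFM command (reservoir sampling with k = 1)."""

    def __init__(self) -> None:
        self.seen = 0              # ACTs since the last RFM command
        self.sampled_row = None

    def on_activate(self, row: int) -> None:
        self.seen += 1
        if random.randrange(self.seen) == 0:   # keep this ACT with probability 1/seen
            self.sampled_row = row

    def on_rfm(self):
        """Return the row whose adjacent victims receive the preventive refresh."""
        target, self.seen, self.sampled_row = self.sampled_row, 0, None
        return target
```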
IV Mithril
Based on the investigation of the RFM-based RH-protection schemes in Section III, we present Mithril, the first RFM-interface-compatible RH-protection scheme providing a deterministic protection guarantee. It exploits a modified CbS algorithm for counter management.
IV-A Organization
The Mithril logic in each DRAM bank is composed of a counter structure (henceforth the Mithril table), two pointers (a max pointer and a min pointer), and the control logic (Figure 4). More specifically, the Mithril table comprises two CAM structures, one storing the row addresses and the other the ACT counts. Each ACT counter is directly associated with a single row address. The max and min pointers are implemented as index registers. The Mithril structure, including the CAMs and logic, must be equipped in every bank of every DRAM chip (Figure 4).
IV-B Operation
Figure 5 illustrates how Mithril manages the Mithril table and the two pointers. The Mithril logic of the corresponding DRAM bank is notified of every ACT command (with an address) and every RFM command (without an address). If the Mithril logic receives an ACT command, the count CAM and the max and min pointers are updated. Specifically, Mithril first checks whether the address CAM already tracks the activated row address. If so, the associated ACT counter is incremented by one. If the row address misses, the address of the entry indicated by the min pointer is replaced with the requesting row address, and its counter is incremented by one. If affected, the max and min pointers are updated at each step to point to the correct maximum and minimum entries. Up to this point, the operation is identical to that of the original CbS algorithm.
When the Mithril logic instead receives an RFM command, Mithril selects the entry indicated by the max pointer (greedy selection). It performs a preventive refresh for the two victim rows adjacent to this entry's row, which is identified as the prime aggressor candidate. Then, the selected entry's counter value is decremented to the table's minimum value indicated by the min pointer, and the max pointer is updated correspondingly. The new maximum must be found within the RFM time window.
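For clarity, the RFM-side behavior just described can be summarized in a short behavioral sketch (our own Python model; the ACT-side update follows the CbS sketch in Section III-C, and real hardware uses the max/min index registers instead of dictionary scans):

```python
def mithril_on_rfm(count: dict[int, int]):
    """Greedy RFM handling for one bank's Mithril table: pick the row with the
    largest estimated count, schedule a preventive refresh of its two adjacent
    victim rows, and demote its count to the current table minimum (the roles
    played by the max and min pointers in hardware)."""
    if not count:
        return None
    target = max(count, key=count.get)     # entry indicated by the max pointer
    table_min = min(count.values())        # value indicated by the min pointer
    count[target] = table_min              # decrement the selected entry to the minimum
    return target                          # victims: the two rows adjacent to `target`
```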
IV-C Mathematical Proof of Protection Guarantee
Mithril guarantees RH safety by preventing the ACT count of any row from reaching the RH threshold through the continued greedy selection and preventive refresh process. This contrasts with prior works, which trigger a preventive refresh at the exact hazardous moment when a row reaches a predefined threshold ACT value. To prove the deterministic safety of Mithril, we first prove that continuously applying the greedy selection and preventive refresh process bounds the rate at which the estimated ACT count of any row can increase during tREFW. That upper bound is a function of the number of Mithril counter entries and the RFM threshold, as follows:
Theorem 1. Within any tREFW, the increase in the estimated count of any single row is bounded above by a value that is a function of the number of Mithril table entries and the RFM threshold.
Then, by setting the number of table entries and the RFM threshold so that this bound is less than half the RH threshold, Mithril can deterministically prevent RH even under double-sided attacks. The detailed proof of Theorem 1 is provided in the Appendix (Section IX).
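Stated as a configuration condition (using our own symbols, since the paper's original notation is not preserved in this text: B for the bound of Theorem 1, k for the number of Mithril table entries, and N_RFM and N_RH for the RFM and RH thresholds):

```latex
B(k, N_{\mathrm{RFM}}) \;<\; \frac{N_{\mathrm{RH}}}{2}
\qquad \text{(deterministic safety under double-sided attacks)}
```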
IV-D Configuring the Table Size and the RFM Threshold
There are multiple possible Mithril configurations for a single target RH threshold because both the table size and the RFM threshold can change while still satisfying the safety condition. Figure 6 plots the (table size, RFM threshold) pairs that satisfy this condition for various RH threshold values (e.g., 1.5K, 3.125K, …, 50K). First, a trade-off between the table size and the RFM threshold is visible regardless of the RH threshold. A decreased table size implies less area usage but results in a lower RFM threshold, incurring more performance and energy overhead due to more frequent RFM commands. This trade-off exists for all RH thresholds, but the shape of the curve differs across threshold values. A scheme similar to Mithril but based on the Lossy-Counting algorithm is also plotted for RH thresholds of 50K and 25K, which clearly demonstrates a larger table for a given RFM threshold.
When the RH threshold is sufficiently high (e.g., larger than 12.5K), it is possible to set the RFM threshold to approximately 256 with a relatively small table. Then, Mithril can achieve RH-protection with relatively low area, performance, and energy overhead. In contrast, when the RH threshold is low, maintaining low performance/energy overhead (i.e., a sufficiently large RFM threshold) requires a substantially larger table. Overall, this is a trade-off that a DRAM vendor must consider when determining the RFM threshold. The protected RH threshold can be adjusted by tweaking the RFM threshold even if the table size is fixed. This flexibility can be handy when the scheme must be built around a predicted RH threshold and thus a fixed area, as it can avoid excessive performance/energy overhead.
IV-E Wrapping Mithril Counters
The absolute counter values of the Mithril table can grow without bound at run time, which complicates the hardware implementation. Prior works solved this issue by periodically resetting the entire table [43, 32] or by using a duplicate counter table in an interleaved fashion [56]; these two strategies respectively halve the effective predefined threshold and double the area. However, Mithril can avoid both. Unlike prior approaches, Mithril does not require the absolute value of the estimated count; it only requires the difference of each estimated count relative to the minimum estimated count in the Mithril table. Moreover, due to the operational behavior of Mithril, the difference between the maximum and minimum counter values is always bounded. Therefore, we adopt a wrapping counter for the Mithril table implementation. If we provision enough bits to express a value larger than the maximum difference in the table, the wrapping counter can always correctly identify the relative ordering of the Mithril table entries. Through this implementation, we obtain a two-fold benefit.
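One common way to realize such a wrapping counter is modular arithmetic with a half-range comparison, as used for sequence numbers; the sketch below is our own illustration of the idea and assumes the counter width is provisioned to exceed the bounded maximum difference.

```python
WIDTH = 16                     # example width; must exceed the bounded max difference
MOD = 1 << WIDTH
HALF = 1 << (WIDTH - 1)


def wrap_inc(counter: int) -> int:
    """Increment a wrapping counter modulo 2**WIDTH."""
    return (counter + 1) % MOD


def wrap_less(a: int, b: int) -> bool:
    """True if counter `a` is logically smaller than `b`; valid whenever the
    true difference between the two counters is less than 2**(WIDTH - 1)."""
    return a != b and (b - a) % MOD < HALF


# Example: a counter that incremented past 0xFFFF and wrapped to 0x0003 is
# logically larger than one still holding 0xFFFF.
assert wrap_less(0xFFFF, 0x0003)
assert not wrap_less(0x0003, 0xFFFF)
```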
V Enhancing Mithril Further
V-A Adaptive Refresh
Section IV assumed that Mithril performs a preventive refresh at every RFM command. However, if Mithril can distinguish a benign memory access pattern from an RH attack pattern, some RFM commands can be skipped. We find that the difference between the maximum and minimum count values is an effective identifier of such patterns. Thus, we propose performing a preventive refresh only when this difference exceeds a certain adaptive threshold. We refer to this as the adaptive refresh policy.
The difference between the maximum and minimum count values serves as a decent proxy for possible RH attacks, as a large difference implies a high concentration of memory accesses on a small number of rows. Therefore, if the adaptive threshold is set large enough, Mithril with the adaptive refresh policy can effectively filter out the ACT patterns of normal workloads. Figure 7 shows the effectiveness of the adaptive refresh policy, which nearly eliminates the additional energy overhead for benign workloads (see Section VI for the details of the experimental setup).
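In software terms, the adaptive decision made at each RFM command reduces to a single comparison; the function below is our own sketch, with 200 being the default adaptive threshold used in Section VI.

```python
def should_preventive_refresh(max_count: int, min_count: int,
                              adaptive_threshold: int = 200) -> bool:
    """Adaptive refresh policy: perform the preventive refresh only when the
    spread between the largest and smallest estimated counts suggests that
    ACTs are concentrated on a few rows (a possible RH pattern)."""
    return (max_count - min_count) >= adaptive_threshold
```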
Across multiple adaptive threshold values, we find that the adaptive refresh policy is effective in the range of 100 to 200 in all cases. We attribute this to the interplay between the memory access patterns of ordinary workloads and the DRAM row size. Multithreaded or memory-intensive workloads often exhibit large-object-sweep behavior that results in main-memory accesses (Figure 8(a)). In such cases, memory accesses are concentrated on a small number of rows over a short time period (Figure 8(b)) while being rather evenly distributed over the entire footprint overall. Although such an access pattern has high DRAM row locality, inter-process/thread conflicts can cause a high rate of ACTs per memory access (Figure 8(c)). Here, the number of concentrated ACTs is similar to the number of streaming reads/writes, which would be 128 for an 8KB DRAM row and a 64B cache line. This matches the range of effective adaptive threshold values, although the exact value must be determined empirically.
The adaptive refresh policy slightly deteriorates the bound of Theorem 1, thus inducing a higher area or performance cost to ensure the same protection as the baseline. However, this effect is minimal unless the adaptive threshold is very high. Figure 7 shows only a small increase, at most 12%, and only at a very low RH threshold. The adjusted bound is derived from Theorem 1 in the Appendix (Section IX-B).
V-B Mithril+
The adaptive refresh policy allows Mithril to skip a preventive refresh even when the RFM command is issued by the memory controller. By doing so, Mithril can reduce the energy overhead but not the performance overhead: regardless of whether the DRAM device actually performs refreshes, the MC continues to issue RFM commands whenever its RAA counter reaches the RFM threshold.
Motivated by this limitation, we propose an optional, more invasive extension of Mithril, termed Mithril+, which prevents the MC from issuing unnecessary RFM commands. Mithril+ utilizes a mode register in the DRAM device, which is flagged when the difference between the maximum and minimum count values is smaller than the adaptive threshold. Whenever its RAA counter reaches the RFM threshold, the MC reads the flag using the JEDEC-standard MRR (Mode Register Read) command and determines whether or not to issue the RFM command. With this interface, Mithril+ can substantially reduce the performance overhead in the common case of ordinary workloads at the expense of a small modification to the RFM interface.
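A minimal sketch of the MC-side Mithril+ flow is shown below; `read_mode_register` stands in for the standard MRR transaction, the flag meaning is a hypothetical placeholder, and resetting the RAA counter even when the RFM command is skipped is our assumption rather than something the text specifies.

```python
def maybe_issue_rfm(bank: int, raa_counter: int, rfm_threshold: int,
                    read_mode_register, issue_rfm) -> int:
    """Mithril+ MC-side flow: when the RAA counter reaches the RFM threshold,
    first read the DRAM-side flag via MRR and issue the RFM command only if
    the device reports that a preventive refresh is actually needed.

    `read_mode_register(bank) -> bool` and `issue_rfm(bank)` are callbacks
    standing in for the real command scheduler; returns the new RAA value."""
    if raa_counter < rfm_threshold:
        return raa_counter
    refresh_not_needed = read_mode_register(bank)   # MRR of the Mithril+ flag
    if not refresh_not_needed:
        issue_rfm(bank)                             # DRAM performs the preventive refresh
    return 0                                        # RAA reset either way (our assumption)
```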
V-C Non-adjacent Row Hammer
Mithril can follow approaches similar to those of prior works [43, 56] to handle non-adjacent RH, by adjusting the bound and the number of rows that receive a preventive refresh. When the range of the RH effect is one row (the double-sided attack we have assumed thus far for Mithril), a bound smaller than half the RH threshold is safe. However, when the range is broader, the bound must be smaller still to account for the aggregated effect of non-adjacent aggressors. Within a range of 3, the aggregated RH effect is 3.5 [56], and six victim rows must receive a preventive refresh.
VI Evaluation
We evaluate the performance, energy, and area overhead of Mithril and Mithril+ in comparison with the RFM-interface-compatible PARFM and BlockHammer, as well as the RFM-interface-non-compatible PARA, CBT, TWiCe, and Graphene.
VI-A Experimental Setup
Methodology: The performance overhead is evaluated with McSimA+ [3]. Table III summarizes the experimental setup. We use the normalized aggregate IPC as the performance metric, where the baseline is the aggregate IPC of each workload without any RH-protection scheme. We count the number of ACTs, PREs, and executed preventive refreshes to calculate the dynamic energy dissipation. For area, we synthesize the RTL implementation of the Mithril module using the TSMC 40 nm standard cell library with the Synopsys Design Compiler. The synthesized area is scaled down to a 20 nm DRAM node and then scaled up by 10× [14] to conservatively account for the inferior DRAM process. The hardware energy consumption of Mithril is also derived from this synthesis.
Workloads: We use 1) normal, 2) multi-sided RH, and 3) BlockHammer-performance-adversarial workloads for evaluation. We use both multi-programmed and multi-threaded workloads for normal workloads, reporting their geo-mean values. From SPEC CPU2017, we extract 100M instruction traces [51] and render two different workloads, mix-high and mix-blend, each of which comprises 16 traces of memory-intensive and randomly selected workloads, respectively. We execute 400M instructions in total. We also evaluate three different multi-threaded benchmarks (FFT and RADIX from SPLASH-2 [44] and PageRank from GAP [6]).
We configure a multi-sided RH attack that targets multiple victims [15, 16], typically 32 in total. The performance-adversarial pattern against BlockHammer is configured to activate specific profiled rows that share CBF (counting Bloom filter) entries with the benign threads, each activated just enough to reach the blacklist threshold. This effectively throttles benign workloads, especially memory-intensive ones. Each RH attack or adversarial pattern runs simultaneously with the 15 other benign workloads.
Configurations: We select up to three different Mithril and Mithril+ (table size, RFM threshold) configurations for each RH threshold, ranging from 50K to 1.5K. Recently observed [29] values are approximately 5K, but 1.5K is reachable considering the continued scaling of process technology and non-adjacent RH. At the high RH thresholds of 50K and 25K, the RFM threshold is fixed to 256 given that the table size is already small. At the lowest RH threshold of 1.5K, the RFM threshold is fixed at 32 because a higher value results in an overly large table. We use 200 as the default adaptive threshold. For PARFM, the RFM threshold is fixed to satisfy a failure probability below a typical consumer-memory reliability target [56, 9, 10, 21, 34, 45] for 64 banks within a 32 ms period (tREFW) at each RH threshold. The probability degrades if the number of banks to support increases.
We reconfigure BlockHammer to match our simulation environment and our target RH thresholds. (BlockHammer uses a pair of interleaved counting Bloom filters (CBFs), similar to the Count-min Sketch algorithm. Each CBF is reset at every CBF lifetime, which typically matches tREFW. A row whose ACT count surpasses a blacklist threshold is delayed, with the delay chosen so that a blacklisted row cannot reach the RH threshold within the CBF lifetime. Thread-level scheduling support is built on top of these to throttle the aggressor thread itself.) For the (CBF size, blacklist threshold) pairs, we used (1K, 17.1K), (1K, 8.6K), (1K, 4.3K), (2K, 2.1K), (4K, 1.1K), and (8K, 0.49K) for RH thresholds from 50K to 1.5K. Under our system of four banks per thread, the number of ACTs per row easily exceeds 700 (as opposed to 109 ACTs in the original BlockHammer system with more banks per thread [56]), especially for memory-intensive workloads. Because the blacklist threshold must be lower than half the RH threshold (750 at an RH threshold of 1.5K), it is difficult to set a value that distinguishes benign accesses from aggressor accesses and still fulfills RH-protection at an RH threshold of 1.5K while incurring minimal performance overhead.
Other prior schemes not compatible with RFM are also configured for a fair comparison with Mithril. TWiCe and Graphene are configured using the equations provided in each work, applied to the DDR5 specification. PARA is configured to satisfy the same failure probability target. CBT follows the configuration in the original work [49, 48].
VI-B The Overheads of Mithril and Mithril+
Mithril+ shows nearly zero performance overhead at all RH-threshold levels. The performance of Mithril degrades by an amount that depends on the target RH threshold and the chosen configuration. There exists a performance-area trade-off for every RH threshold, and it is amplified as the threshold becomes smaller.
Mithril can support the recently observed RH thresholds of approximately 6.25K [29] with an RFM threshold of 128, which results in a performance overhead of less than 0.5% and a table size of 1KB per bank. Mithril can also support lower RH thresholds, though at the cost of around 2% performance overhead and 4KB of area. The area overhead of Mithril+ is identical to that of Mithril, with only negligible performance overhead.
VI-C Comparison with Other Interface-Compatible Schemes
Figure 10 shows the performance and energy overheads of the RFM-interface-compatible schemes PARFM and BlockHammer on multiple workloads for RH thresholds ranging from 50K to 1.5K. First, on normal workloads (Figure 10(a)), both Mithril+ and Mithril show a small performance degradation of less than 2%, superior to both PARFM and BlockHammer. BlockHammer is particularly vulnerable at the low RH threshold of 1.5K because it is prone to misidentifying benign threads and throttling them under such a condition.
Second, under the multi-sided RH attack (Figure 10(b)), BlockHammer exhibits a better aggregate IPC, by up to 5%, at higher RH thresholds, but it degrades again at a low threshold. This occurs because when BlockHammer successfully identifies RH-attacking threads and throttles them, benign threads benefit in return. However, this again leads to vulnerability to misclassification when the RH threshold is as low as, for instance, 1.5K. The performance of Mithril and PARFM is agnostic to the access patterns.
Lastly, with the adversarial pattern targeting BlockHammer (Figure 10(c)), the performance of BlockHammer degrades severely, with as much as a 17% drop in aggregate IPC. This implies the possibility of a critical performance (not RH) attack on systems equipped with BlockHammer, as its throttling feature acts as a double-edged sword depending on how accurately it identifies RH-attacking threads.
The energy overheads of Mithril and Mithril+ are less than 0.4%, even when the RH threshold is 1.5K. These values are much smaller than that of PARFM and only slightly higher than that of BlockHammer (Figure 10(d)). This is because the adaptive refresh policy successfully identifies ordinary workloads, skipping many of the RFM commands without triggering additional preventive refreshes. PARFM shows a higher energy overhead because every RFM command triggers a preventive refresh. BlockHammer incurs only minimal logic energy because it is a throttling-based scheme.
The table size overhead of Mithril is much smaller than that of BlockHammer at all RH-threshold levels. Figure 10(e) shows the table size overhead for each scheme. PARFM is omitted due to its negligible overhead, and Mithril+ is identical to Mithril. The table size of Mithril is up to 60× and at least 4× smaller than that of BlockHammer across all RH-threshold levels. The table size comparison is discussed further in Section VI-E.
VI-D Comparison with Interface Non-Compatible Schemes
Mithril and Mithril+ also show competitive performance and energy overheads compared to the RFM-non-compatible prior works PARA, CBT, TWiCe, and Graphene. Under both normal workloads and a multi-sided RH attack (Figure 11(a), (b)), Mithril+ shows a performance degradation of less than 0.2%, comparable to TWiCe, Graphene, and CBT. The performance degradation of Mithril is worse than those of the other schemes but is limited to less than 2% even at the low RH threshold of 1.5K. The energy overhead of Mithril is comparable to those of TWiCe and Graphene at less than 1%, even when the RH threshold is 1.5K (Figure 11(c)).
VI-E Table Size Overhead
Table IV: Counter table size (KB per bank) of each scheme across target RH thresholds (50K to 1.5K).
Scheme (location) | 50K | 25K | 12.5K | 6.25K | 3.125K | 1.5K
CBT @ MC | 0.47 | 0.97 | 2.0 | 4.12 | 8.5 | 17.5
Graphene @ MC | 0.14 | 0.21 | 0.51 | 0.99 | 1.92 | 3.7
BlockHammer @ MC | 3.75 | 3.5 | 3.25 | 6.0 | 11.0 | 20.0
TWiCe @ buffer chip | 2.79 | 5.08 | 9.54 | 18.27 | 35.29 | 71.26
Mithril-256 @ DRAM | 0.08 | 0.17 | 0.41 | 1.45 | - | -
Mithril-128 @ DRAM | 0.07 | 0.15 | 0.34 | 0.84 | 3.76 | -
Mithril-64 @ DRAM | 0.07 | 0.14 | 0.3 | 0.68 | 1.78 | -
Mithril-32 @ DRAM | 0.06 | 0.13 | 0.27 | 0.57 | 1.38 | 4.64
Mithril-(256/128/64/32) denote different RFM threshold values ranging from 256 to 32.
We report the counter table size of each scheme in units of KB per bank (see Table IV). While MC-side schemes benefit from faster transistors, abundant wiring resources, and a relaxed area budget, the total number of banks they must support is much higher (1,024), and the target RH threshold must be pessimistic. DRAM-side schemes benefit from fewer banks (32) to support per device and more accurate RH threshold values, but they are hindered by slower transistors and a tighter area/wiring budget.
Mithril shows a lower or competitive area overhead in terms of KB per bank, reaching 0.024 mm² of synthesized area when the RH threshold equals 6.25K. This represents 1% of a single DDR5 chip [28] when multiplied by 32 to cover the 32 banks per chip. While both Graphene and Mithril share fundamentally the same CbS algorithm as their tracking mechanism, their table size overheads differ for several reasons. First, to Mithril's advantage, it does not require a table reset thanks to the wrapping counter scheme, yielding a two-fold reduction. Second, the per-entry bit width of the counter CAM is smaller in Mithril because the maximum counter difference is bounded (Theorem 1), whereas Graphene's counters must cover the maximum number of ACTs in a tREFW window. At an RH threshold of 1.5K, we keep the RFM threshold from becoming too small in order to limit the performance drop, which results in an increased table size and area overhead.
VII Related Work
Row Hammer (RH) on Real Systems: RH has been shown to be able to bypass all system memory protection schemes, allowing adversaries to compromise the confidentiality and integrity of actual systems. In 2015, Google [47] demonstrated that a user-level program could breach the system-level security of a typical PC by exploiting the RH vulnerability of the system. A number of successful attacks followed [47], including those compromising mobile devices [54, 55] and servers [19, 13, 46], thus breaking the authentication process and damaging the entire system, even when a system protects memory locations near sensitive data [59]. Because RH undermines the fundamental principle of memory isolation, it has been regarded as a serious threat, drawing mitigation proposals from software, architecture, and hardware levels.
Architectural Proposals to Mitigate RH: There have been deterministic [50, 26, 32, 43, 56] and probabilistic [30, 52, 58] schemes proposed to mitigate RH attacks at the architecture level. Among these, [58, 52, 50] are susceptible to adversarial DRAM access patterns. TWiCe [32] and CAT-TWO [26] are relatively free from this susceptibility but require an order of magnitude more storage to track aggressor rows compared to Graphene [43]. PARA [30] incurs low performance and energy overheads and is extremely area-efficient, as it does not require counters to track aggressor rows. Yet, its protection is probabilistic in nature; even if the probability is quite small, there is a non-zero probability that a victim row will not be refreshed before its aggressors reach the RH threshold. BlockHammer [56] uses a throttling approach backed by thread-level MC scheduling.
VIII Conclusion
In this paper, we proposed Mithril, a DRAM-side, RFM-compatible, efficient scheme that provides deterministic safety against Row Hammer attacks. First, we showed that the conventional algorithms and methodologies used in previous architectural RH-prevention schemes are not compatible with the RFM command introduced in the latest DRAM specifications, such as DDR5 and LPDDR5. By mathematically bounding the activation count a row can accumulate without a refresh within a tREFW window, we guarantee safety at a specific RH threshold. The devised adaptive refresh policy decreases the energy overhead by exploiting the row activation patterns of ordinary workloads. Moreover, we proposed Mithril+, which requires a slight modification of the RFM interface: it utilizes an existing DRAM command to skip the sending of RFM commands, which significantly reduces the performance overhead of Mithril. Our evaluation demonstrates that Mithril achieves a significantly lower energy overhead than PARFM in all cases, while incurring a slightly higher performance overhead. Mithril+ shows not only a low energy overhead but also a significantly lower performance overhead, comparable to Graphene, a state-of-the-art RH-prevention scheme that does not support RFM.
Acknowledgment
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01300, Development of AI-specific Parallel High-speed Memory Interface, and 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)). Jung Ho Ahn is the corresponding author.
IX Appendix
IX-A Proof for Theorem 1
Theorem 1. Within any tREFW, the increase in the estimated count of any single row is bounded above by a value that is a function of the number of Mithril table entries and the RFM threshold.
Henceforth, we use shorthand symbols for the quantities defined above. The maximum number of RFM intervals (an RFM interval being the period between two consecutive RFM commands) within a tREFW equals the maximum number of ACTs that fit in a tREFW divided by the RFM threshold.
Suppose one quantity denotes the j-th largest estimated count in the Mithril table at the beginning of the i-th RFM interval, and another denotes the j-th largest estimated count in the table at the end of the i-th RFM interval. Figure 12 illustrates an example of this notation. Then, the following Lemmas hold true:
Lemma 1. Proof: At the end of each RFM interval, one of the entries with the largest estimated count becomes the target of the RFM refresh, and its estimated count is reset to the minimum count in the table. Thus, the ranks of all other entries increase by one after the RFM refresh.
Lemma 2. Proof: Considering that exactly an RFM threshold's worth of ACTs occurs within each RFM interval, and that, by definition, each end-of-interval count is at least the corresponding beginning-of-interval count, the following holds true:
Lemma 3. Proof: This is a direct extension of Lemma 2. In other words, an RFM refresh always decreases the sum of the top counter values in the table.
Lemma 4. Proof: Using Lemma 1, Lemma 2, and the ordering of the estimated counts, the following holds true:
With these Lemmas, we are ready to prove Theorem 1. Proving Theorem 1 is equivalent to proving the following:
This works because any row’s estimated count that is increased during the RFM intervals is obviously less than the difference between the largest estimated count at the end () and the smallest estimated count at the beginning (). Accordingly, we can obtain the upper bound for as follows:
Repeatedly applying Lemma 4 for a total of times, we obtain the following inequality:
At this point, we can no longer apply Lemma 4 and instead apply Lemma 3 () times.
Earlier, we showed that proving Theorem 1 is equivalent to proving . With the above equation, proving the following is the only step left to prove Theorem 1:
Here, the left-hand side can be represented as follows.
The upper bound of for any -th RFM interval can be obtained by contradiction. We assume that is maximized when is and that the difference between and is greater than . Then, the following holds:
(3)
At the end of the (-1)-th RFM interval, is reduced to by RFM. Therefore
This contradicts the contention that is maximized when is . Therefore, if is maximized when is , the difference between and is less than or equal to . Then, we obtain the following inequality:
IX-B Finding the New Bound for Adaptive Refresh
If the adaptive refresh policy (Section V-A) is applied to Mithril, Lemmas 1 and 4 in Section IX-A no longer hold because a preventive refresh may not occur at the end of every RFM interval. The modified bound for the adaptive refresh is as follows.
Theorem 2. When the adaptive refresh is applied to Mithril, the increase in the estimated count of any single row within any tREFW is bounded above by a modified value that is a function of the number of Mithril table entries, the RFM threshold, and the adaptive threshold.
At the end of any RFM interval, the preventive refresh does not occur if the difference between the maximum and minimum counts is less than the adaptive threshold. Considering that the preventive refresh may not occur, we modify Lemma 4 into Lemmas 5 and 6 as follows:
Lemma 5. If , then for
Proof: If , the preventive refresh occurs at the end of the (-1)-th RFM interval, so we can derive the same result as Lemma 4.
Lemma 6. If , then for
Proof: Because RFM does not occur at the -th RFM interval, for . Then, the following holds true.
Similar to Theorem 1, proving Theorem 2 is equivalent to proving . For some arbitrary number smaller than , assume that the preventive refresh does not occur at the -th last RFM interval (i.e., the ()-th RFM interval) and that the preventive refresh occurs at all the subsequent RFM intervals. Even with the adaptive refresh, Lemmas 2 and 3 are still true.
If is greater than , the upper bound of is equivalent to (the result of Theorem 1). We can obtain this result by applying Lemma 5 (equivalent to Lemma 4) for times and applying Lemma 3 for times to .
Otherwise, if is less than or equal to , we first repeatedly apply Lemma 5 for a total of times to obtain the upper bound of :
At this point, we have to apply Lemma 6.
Then, we apply Lemma 3 () for times.
The maximum value of is equivalent to that in the proof of Theorem 1. Then, the following holds true.
Suppose is the right side of the inequality (4) for . Note that , which is the upper bound of when is greater than (which is the same as the value when the adaptive refresh is not applied). Now, we need to find the value maximizing . For , the difference between and is as follows:
is a decreasing function with respect to . Thus, the largest (i.e., ) satisfying is given by , and it maximizes . Finally, we can prove Theorem 2 as follows:
IX-C PARFM Probability of Failure
A PARA-inspired, intuitive form of a probabilistic prevention scheme, PARFM, is deployable under the RFM interface. Whenever an RFM command arrives, PARFM randomly samples a single aggressor row among the activations since the previous RFM command and executes the preventive refresh on its victims. The probability of a row being selected depends on the fraction of those activations that target it. PARFM's protection capability depends on the RFM threshold, which determines the sampling rate.
Deriving the failure probability of PARFM requires two major modifications of the original PARA analysis: the worst-case ACT pattern and the mathematical formulation. First, the number of rows to activate depends on the given RFM threshold, whereas the worst-case ACT pattern for PARA was to activate a single row continuously. Suppose only a single row is activated under PARFM. In that case, it will always be selected at the next RFM command, and its victims will receive the preventive refresh. From the attacker's perspective, the cost-effectiveness of minimizing the chance of selection while quickly reaching the RH threshold is expressed as follows (where the variable denotes the number of ACTs to a single row within a single RFM interval):
(5)
Because this is a monotonically decreasing function, and any period in which the row is not activated can be ignored (it does not contribute to reaching the RH threshold), activating a row only a single time in every RFM interval is the most cost-effective pattern. We thus base the remaining formulation on the assumption that the maximal number of different rows are each activated once per RFM interval.
Compared to PARA, the mathematical formulation must also differ. The following formula gives the exact probability of failure for a single DRAM bank within a tREFW window, expressed as the sum of the probability that exactly one row fails, the probability that two different rows fail within a tREFW window, and so on. Based on Equation (5), a uniform distribution of ACTs over the activated rows is assumed.
The single-row failure probability can be calculated using the following recurrence, where the i-th term denotes the failure probability at the i-th RFM command:
The initial condition is as follows.
We obtain the single-row failure probability by evaluating the recurrence up to the last RFM command in a tREFW window. The higher-order terms are much smaller because the RH threshold exceeds 1K even under the most pessimistic RH vulnerability, and the probability of more than one row reaching the RH threshold without being refreshed is far lower than that of a single row. Therefore, we estimate the failure probability (an upper bound) using only the first term of the bank failure probability. From the bank failure probability, we can obtain the system failure probability based on the number of banks that can be attacked simultaneously.
In our experimental system of two ranks of 32 banks each, a total of 22 banks can be activated while satisfying the tFAW constraints. In Section VI-A, we set the RFM threshold appropriately for each target RH threshold; therefore, our system failure probability is lower than the target.
References
- [1] M. T. Aga, Z. B. Aweke, and T. Austin, “When Good Protections Go Bad: Exploiting Anti-DoS Measures to Accelerate Rowhammer Attacks,” in IEEE International Symposium on Hardware Oriented Security and Trust (HOST), 2017.
- [2] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi, “Mergeable Summaries,” ACM Transactions on Database Systems (TODS), vol. 38, no. 4, 2013.
- [3] J. Ahn, S. Li, S. O, and N. P. Jouppi, “McSimA+: A Manycore Simulator with Application-level+ Simulation and Detailed Microarchitecture Modeling,” in ISPASS, 2013.
- [4] Z. B. Aweke, S. F. Yitbarek, R. Qiao, R. Das, M. Hicks, Y. Oren, and T. Austin, “ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks,” in ASPLOS, 2016.
- [5] R. Balasubramonian, Innovations in the Memory System. Morgan & Claypool Publishers, 2019.
- [6] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP Benchmark Suite,” CoRR, vol. abs/1508.03619, 2015.
- [7] I. Bhati, M. Chang, Z. Chishti, S. Lu, and B. Jacob, “DRAM Refresh Mechanisms, Penalties, and Trade-Offs,” IEEE Transactions on Computers, vol. 65, no. 1, 2016.
- [8] F. Brasser, L. Davi, D. Gens, C. Liebchen, and A. R. Sadeghi, “CAn’T Touch This: Software-only Mitigation Against Rowhammer Attacks Targeting Kernel Memory,” in 26th USENIX Conference on Security Symposium, 2017.
- [9] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization, Mitigation, and Recovery in Flash-memory-based Solid-state Drives,” Proceedings of the IEEE, vol. 105, no. 9, 2017.
- [10] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis,” in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012.
- [11] M. Charikar, K. Chen, and M. Farach-Colton, “Finding Frequent Items in Data Streams,” in Proceedings of the 29th International Colloquium on Automata, Languages and Programming, 2002.
- [12] L. Cojocar, J. Kim, M. Patel, L. Tsai, S. Saroiu, A. Wolman, and O. Mutlu, “Are We Susceptible to Rowhammer? An End-to-End Methodology for Cloud Providers,” in IEEE Symposium on Security and Privacy (S&P), 2020.
- [13] L. Cojocar, K. Razavi, C. Giuffrida, and H. Bos, “Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks,” in IEEE Symposium on Security and Privacy (S&P), 2019.
- [14] F. Devaux, “The True Processing in Memory Accelerator,” in IEEE Hot Chips 31 Symposium. IEEE Computer Society, 2019.
- [15] P. Frigo, E. Vannacci, H. Hassan, V. van der Veen, O. Mutlu, C. Giuffrida, H. Bos, and K. Razavi, “TRRespass: Exploiting the Many Sides of Target Row Refresh,” in IEEE Symposium on Security and Privacy (S&P), 2020.
- [16] ““Half-Double”: Next-Row-Over Assisted Rowhammer,” https://github.com/google/hammer-kit/blob/main/20210525_half_double.pdf, Google, 2021.
- [17] Z. Greenfield and L. Tomer, “Throttling Support for Row-hammer Counters,” U.S. Patent 9251885, Feb. 2016.
- [18] D. Gruss, M. Lipp, M. Schwarz, D. Genkin, J. Juffinger, S. O’Connell, W. Schoechl, and Y. Yarom, “Another Flip in the Wall of Rowhammer Defenses,” in IEEE Symposium on Security and Privacy (S&P), 2017.
- [19] D. Gruss, C. Maurice, and S. Mangard, “Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript,” in Detection of Intrusions and Malware, and Vulnerability Assessment, 2016.
- [20] G. Irazoqui, T. Eisenbarth, and B. Sunar, “MASCAT: Preventing Microarchitectural Attacks Before Distribution,” in Proceedings of the 8th ACM Conference on Data and Application Security and Privacy, 2018.
- [21] JEDEC, “Failure Mechanisms and Models for Semiconductor Devices,” 2019.
- [22] JEDEC, “LPDDR5 Standard JESD209-5,” 2019.
- [23] JEDEC, “DDR5 SDRAM,” 2020.
- [24] JEDEC, “Near-Term DRAM Level RowHammer Mitigation,” 2021.
- [25] JEDEC, “System Level RowHammer Mitigation,” 2021.
- [26] I. Kang, E. Lee, and J. Ahn, “CAT-TWO: Counter-Based Adaptive Tree, Time Window Optimized for DRAM Row-Hammer Prevention,” IEEE Access, vol. 8, 2020.
- [27] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era,” in MICRO, 2011.
- [28] D. Kim, M. Park, S. Jang, J.-Y. Song, H. Chi, G. Choi, S. Choi, J. Kim, C. Kim, K. Kim, K. Koo, S. Song, Y. Kim, D. U. Lee, J. Lee, D. Kim, K. Kwon, M. Han, B. Choi, H. Kim, S. Ku, Y. Kim, J. Kim, S. Kim, Y. Seo, S. Oh, D. Im, H. Kim, J. Choi, J. Chung, C. Lee, Y. Lee, J.-H. Cho, J. Chun, and J. Oh, “A 1.1V 1ynm 6.4 Gb/s/pin 16Gb DDR5 SDRAM with a Phase-Rotator-Based DLL, High-Speed SerDes and RX/TX Equalization Scheme,” in IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 380–382.
- [29] J. Kim, M. Patel, A. G. Yaglikçi, H. Hassan, R. Azizi, L. Orosa, and O. Mutlu, “Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques,” in ISCA, 2020.
- [30] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” in ISCA, 2014.
- [31] R. K. Konoth, M. Oliverio, A. Tatar, D. Andriesse, H. Bos, C. Giuffrida, and K. Razavi, “ZebRAM: Comprehensive and Compatible Software Protection Against Rowhammer Attacks,” in 13th USENIX Symposium on Operating Systems Design and Implementation, 2018.
- [32] E. Lee, I. Kang, S. Lee, G. E. Suh, and J. Ahn, “TWiCe: Preventing Row-hammering by Exploiting Time Window Counters,” in ISCA, 2019.
- [33] E. Lee, S. Lee, G. E. Suh, and J. Ahn, “TWiCe: Time Window Counter Based Row Refresh to Prevent Row-Hammering,” IEEE Computer Architecture Letters, vol. 17, no. 1, 2018.
- [34] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory,” IEEE Journal on Selected Areas in Communications, vol. 34, no. 9, 2016.
- [35] G. S. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams,” in Proceedings of the 28th International Conference on Very Large Data Bases, 2002.
- [36] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient Computation of Frequent and Top-k Elements in Data Streams,” in Proceedings of the 10th International Conference on Database Theory, 2005.
- [37] J. Misra and D. Gries, “Finding Repeated Elements,” Science of Computer Programming, vol. 2, no. 2, 1982.
- [38] S. Muthukrishnan, Data Streams: Algorithms and Applications. Now Publishers Inc., 2005.
- [39] O. Mutlu and J. S. Kim, “RowHammer: A Retrospective,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 8, 2019.
- [40] B. Nale and C. E. Cox, “Refresh Command Control for Host Assist of Row Hammer Mitigation,” U.S. Patent 10950288B2, Mar. 2021.
- [41] S. O, Y. H. Son, N. S. Kim, and J. Ahn, “Row-buffer Decoupling: A Case for Low-latency DRAM Microarchitecture,” in ISCA, 2014.
- [42] K. Park, C. Lim, D. Yun, and S. Baeg, “Experiments and Root Cause Analysis for Active-precharge Hammering Fault in DDR3 SDRAM under 3x nm Technology,” Microelectronics Reliability, vol. 57, 2016.
- [43] Y. Park, W. Kwon, E. Lee, T. J. Ham, J. Ahn, and J. W. Lee, “Graphene: Strong yet Lightweight Row Hammer Protection,” in MICRO, 2020.
- [44] PARSEC Group, “A Memo on Exploration of SPLASH-2 Input Sets,” in Princeton University, 2011.
- [45] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
- [46] K. Razavi, B. Gras, E. Bosman, B. Preneel, C. Giuffrida, and H. Bos, “Flip Feng Shui: Hammering a Needle in the Software Stack,” in 25th USENIX Conference on Security Symposium, 2016.
- [47] M. Seaborn and T. Dullien, “Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges,” https://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html, 2015.
- [48] S. M. Seyedzadeh, A. K. Jones, and R. Melhem, “Counter-Based Tree Structure for Row Hammering Mitigation in DRAM,” IEEE Computer Architecture Letters, vol. 16, no. 1, 2017.
- [49] S. M. Seyedzadeh, A. K. Jones, and R. Melhem, “Mitigating Wordline Crosstalk using Adaptive Trees of Counters,” in ISCA, 2018.
- [50] S. M. Seyedzadeh, A. K. Jones, and R. Melhem, “Mitigating Wordline Crosstalk Using Adaptive Trees of Counters,” in ISCA, 2018.
- [51] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically Characterizing Large Scale Program Behavior,” in ASPLOS, 2002.
- [52] M. Son, H. Park, J. Ahn, and S. Yoo, “Making DRAM Stronger Against Row Hammering,” in Proceedings of the 54th Annual Design Automation Conference, 2017.
- [53] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 10, 2016.
- [54] V. van der Veen, Y. Fratantonio, M. Lindorfer, D. Gruss, C. Maurice, G. Vigna, H. Bos, K. Razavi, and C. Giuffrida, “Drammer: Deterministic Rowhammer Attacks on Mobile Platforms,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
- [55] V. van der Veen, M. Lindorfer, Y. Fratantonio, H. Pillai, G. Vigna, C. Kruegel, H. Bos, and K. Razavi, “GuardiON: Practical Mitigation of DMA-Based Rowhammer Attacks on ARM,” in 15th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), 2018.
- [56] A. G. Yaglikci, M. Patel, J. Kim, R. AziziBarzoki, J. Park, H. Hassan, A. Olgun, L. Orosa, K. Kanellopoulos, T. Shahroodi, S. Ghose, and O. Mutlu, “BlockHammer: Preventing RowHammer at Low Cost by Blacklisting Rapidly-Accessed DRAM Rows,” in HPCA, 2021.
- [57] T. Yang and X. Lin, “Trap-Assisted DRAM Row Hammer Effect,” IEEE Electron Device Letters, vol. 40, no. 3, 2019.
- [58] J. M. You and J.-S. Yang, “MRLoc: Mitigating Row-hammering Based on Memory Locality,” in Proceedings of the 56th Annual Design Automation Conference, 2019.
- [59] Z. Zhang, Y. Cheng, D. Liu, S. Nepal, Z. Wang, and Y. Yarom, “PThammer: Cross-User-Kernel-Boundary Rowhammer through Implicit Accesses,” in MICRO, 2020.