Parallelized sequential composition, pipelines, and hardware weak memory models

Robert J. Colvin Defence Science and Technology Group, Australia The University of Queensland [email protected]

Abstract.

Since the introduction of the CDC 6600 in 1965 and its “scoreboarding” technique processors have not (necessarily) executed instructions in program order. Programmers of high-level code may sequence independent instructions in arbitrary order, and it is a matter of significant programming abstraction and computational efficiency that the processor can be relied upon to make sensible parallelizations/reorderings of low-level instructions to take advantage of, for instance, multiple arithmetic units. At the architectural level such reordering is typically implemented via a per-processor pipeline, into which instructions are fetched in order, but possibly committed out of order depending on local considerations, provided any reordering preserves sequential semantics from that processor’s perspective. However, multiprocessing and multicore architectures, where several pipelines run in parallel, can expose these processor-level reorderings as unexpected, or “weak”, behaviours. Such weak behaviours are hard to reason about, and (via speculative execution) underlie at least one class of widespread security vulnerability.

In this paper we introduce a novel program operator, parallelized sequential composition, which can be instantiated with a function m that controls the reordering of atomic instructions. The operator generalises both standard sequential composition and parallel composition, and when appropriately instantiated exhibits many of the weak behaviours of the well-known hardware weak memory models TSO, Release Consistency, ARM, and RISC-V. We show that the use of this program-level operator is equivalent to sequential execution with reordering via a pipeline. Encoding reordering as a programming language operator admits the application of established compositional techniques (such as Owicki-Gries) for reasoning about weak behaviours, and is convenient for abstractly expressing properties from the literature such as sequential consistency. The semantics and theory is encoded and verified in the Isabelle/HOL theorem prover, and we give an implementation of the pipeline semantics in the Maude rewriting engine and use it empirically to show conformance of the semantics against established models of ARM and RISC-V, and elucidate some stereotypical weak behaviours from the literature.

Semantics, pipelines, weak memory models

^†^†copyright: none^†^†ccs: Software and its engineering Semantics^†^†ccs: Software and its engineering Concurrent programming languages^†^†ccs: Computing methodologies Computer algebra systems

1. Introduction

The 1960s saw significant improvements in processor efficiency, including allowing out-of-order instruction execution in cases where program semantics would not be lost (e.g., the “scoreboarding” technique of the CDC 6600 (Thornton, 1964)) and maximising use of multiple computation units (e.g., Tomasulo’s algorithm (Tomasulo, 1967), implemented in the IBM System360/91 (Anderson et al., 1967)). These advances meant that instructions could be distributed in parallel among several subunits and their results combined, provided the computation of one did not depend on an incomplete result of another. Furthermore, interactions with main memory can be relatively slow in comparison to local computation, and so allowing independent loads and stores to proceed in parallel also improved throughput. Even more complex is speculative execution, in the sense of guessing the result of a branch condition evaluation and transiently executing down that path. These features had the effect of greatly increasing the speed of processors, and without any visible intrusion on programmers: the conditions under which parallelization could take place ensured the sequential semantics of any computation was maintained.

When concurrency is used, either explicitly as threads sharing a single processor or via multicore architectures, the effect of out-of-order execution may be exposed, as the reordering of accesses of shared memory can dramatically change concurrent behaviour. This has provided a challenge for developing efficient, correct and secure concurrent software for modern processors that communicate via shared memory (Adve and Gharachorloo, 1996; Hill, 1998; Batty et al., 2015). Order can be restored by injecting artificial dependencies between instructions (usually called fences or barriers), but the performance cost is significant; for instance, performance concerns hamper the widespread mitigation of the Spectre class of security vulnerabilities, despite their seriousness (Kocher et al., 2019; Trippel et al., 2018).

To reason about the impact of reorderings on program behaviour we introduce parallelized sequential composition as a primitive operator of an imperative language. The program ‘ $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ ’, for some function on instructions m, may execute $c_{\_}1$ and $c_{\_}2$ in order, or it may interleave actions of $c_{\_}2$ with those of $c_{\_}1$ provided m allows it. We give the weakest m such that $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ preserves the intended sequential semantics of a single process despite executing some instructions out of order, and show how modern memory models are strengthenings of this concept, with the possible addition of further instruction types to restore order. Based on restrictions built into processors as early as the 1960s, we show that this concept of thread-local reorderings explains many of the behaviours observed on today’s multicore architectures, such as TSO (Intel, 2019) and ARM (Arm Ltd, 2020), and conforms to RISC-V (RISC-V International, 2017). We derive language-level properties of parallelized sequential composition, and use these to explain possibly unexpected behaviours algebraically. In particular, under some circumstances the possible reorderings can be reduced to a nondeterministic choice between sequential behaviours, and then existing techniques for reasoning about concurrency, such as the Owicki-Gries method (Owicki and Gries, 1976), can be employed directly. The language, its semantics, and the properties in the paper are encoded and machine-checked in the Isabelle/HOL theorem prover (Nipkow et al., 2002; Paulson, 1994) (see supplementary material).

In Sect. 2 we provide a foundation for instruction reordering, and provide a range of theoretical memory models. In Sect. 3 we give a straightforward abstract semantics for a hardware pipeline with reordering. In Sect. 4 we fully define the syntax and semantics of an imperative language, IMP+pseq, that includes conditionals, loops, and parallelized sequential composition. In Sect. 5 we show how standard Hoare logic and weakest preconditions can be used to reason about IMP+pseq. We then define TSO (Sect. 6), Release Consistency (Sect. 7), ARM (Sect. 8) and RISC-V (Sect. 9) as instances of parallelized sequential composition, showing conformance via both general properties and empirically (ARM and RISC-V). We discuss related work in Sect. 10.

2. Foundations of instruction reordering

In this section we describe instruction reordering in a simple language containing a minimal set of instruction types and the novel operator parallelized sequential composition (we give a richer language in Sect. 4). Instances of this operator are parameterized by a function on instructions, which for convenience we call a memory model.¹¹1 Memory models may embrace global features in addition to weak behaviours due to pipelining; since many behaviours of weak memory models are explained by pipeline reordering we use this more general term. We use this foundation to explore theoretically significant memory models that underlie modern processors.

We start with a basic imperative language containing just assignments and guards as actions (type $Instr$ ) with parallelized sequential composition as combinator.

\alpha\hskip 5.69054pt\mathbin{:\!:\!=}\hskip 5.69054ptx\mathbin{\mathtt{:\!=}}e\mathrel{~{}|~{}}\llparenthesis e\rrparenthesis\qquad\qquad c\hskip 5.69054pt\mathbin{:\!:\!=}\hskip 5.69054pt\mathbf{nil}\mathrel{~{}|~{}}\alpha\mathrel{~{}|~{}}c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2

An assignment $x\mathbin{\mathtt{:\!=}}e$ is the typical notion of an update, encompassing stores and loads, while a guard action $\llparenthesis e\rrparenthesis$ represents a test on the state (does expression $e$ evaluate to $True$ ) which can be used to model conditionals/branches. A command is either the terminated command $\mathbf{nil}$ (corresponding to a no-op), an action $\alpha$ , or the composition of two commands according to some memory model m.

The intention is that a simple command $\alpha\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\beta$ is free to execute instruction $\beta$ before instruction $\alpha$ (possibly with modifications due to forwarding, described below) provided the constraints of m are obeyed. Clearly this potential reordering of instructions may destroy the programmer’s intention if unconstrained; however the reordering (or parallelization) of independent instructions can potentially be more efficient than the possibly arbitrary order specified by the programmer. For example, consider a load followed by an update $r\mathbin{\mathtt{:\!=}}x\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}y\mathbin{\mathtt{:\!=}}1$ . If $x$ is a shared variable then retrieving its value from main memory may take many processor cycles. Rather than idle the independent instruction $y\mathbin{\mathtt{:\!=}}1$ can be immediately issued without compromising the programmer’s intention assuming a single-threaded context. In general, two assignments can be reordered if they preserve the sequential semantics on a single thread.

A memory model m is of type $Instr\rightarrow\mathop{\mathstrut{\mathbb P}}\nolimits(Instr\times Instr)$ . However we will typically express a memory model as a binary relation on instructions, the reordering relation, with the implicit application of a forwarding function, which is either a straightforward variable substitution or the identity function. We first consider the reordering relation and then forwarding, which significantly complicates matters.

Reorderings

First consider the base of a memory model, the “reordering relation”, which is a binary relation on instructions. We write $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}\beta$ if $\beta$ may be reordered before instruction $\alpha$ according to the reordering relation of memory model m, and $\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}$}}}}\beta$ otherwise. We define the special case of the sequentially consistent memory model as that where no reordering is possible.

Model 1 (sc).

For all $\alpha,\beta\in Instr$ , $\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{sc}}}}$}}}}\beta$ .

We may now explicitly state the notion of a sequential model (Lamport, 1979), which is the minimal property that any practical memory model m should establish. Letting the ‘effect’ function ${\sf eff}$ return the relation between pre- and post-states for a program (see Sect. 5) we can state this property as follows.

Definition 2.1.

m is sequential if ${\sf eff}(c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2)\subseteq{\sf eff}(c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{sc}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2)$ .

That is, m is sequential if its reorderings give the same results (on a single thread) as when executed in program order. The weakest sequential memory model is one that allows reordering exactly when sequential semantics is maintained.

Model 2 (eff).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{eff}}}}}}\beta\hskip 5.69054ptiff\hskip 5.69054pt{\sf eff}(\beta\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{sc}}}}}$}}{\mathchar 24635\relax}$}}\alpha)\subseteq{\sf eff}(\alpha\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{sc}}}}}$}}{\mathchar 24635\relax}$}}\beta)$

Theorem 2.2.

eff is sequential.

Proof.

The property $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{eff}}}}}}\beta$ lifts to commands.

eff is impractical since processors cannot make semantic judgements about the overall effect of two instructions dynamically; however there are some simple syntactic constraints which guarantee sequential semantics. As such we propose the following memory model as the weakest possible that is of practical use.

Model 3 ( ${{\textsc{g}}}_{0}$ ).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}}}\beta~{}iff~{}\mbox{${\sf wv}(\alpha)\mathbin{\not\!\!\hskip 0.5pt\cap}{\sf fv}(\beta)$ ~{} and ~{} ${\sf wv}(\beta)\mathbin{\not\!\!\hskip 0.5pt\cap}{\sf fv}(\alpha)$ }$

Model 3 (abbreviated M3) allows instructions to be reordered based on a simple syntactic test, namely, is any variable that $\beta$ references ( ${\sf fv}(\beta)$ ) modified by $\alpha$ , and vice versa, using naming conventions below.

(1)	$\displaystyle{\sf fv}(e),{\sf fv}(\alpha)$	Free variables in expr. $e$ or action $\alpha$
(2)	$\displaystyle{\sf wv}(\alpha),{\sf rv}(\alpha)$	Write, read variables
(3)	$\displaystyle s_{\_}1\mathbin{\not\!\!\hskip 0.5pt\cap}s_{\_}2\mathrel{\mathstrut{\widehat{=}}}$	$s_{\_}1\cap s_{\_}2=\varnothing$ (mutual exclusion)

Reorderings eliminated by M3 include $(x\mathbin{\mathtt{:\!=}}1\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}$}}}}x\mathbin{\mathtt{:\!=}}2)$ , $(x\mathbin{\mathtt{:\!=}}1\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}$}}}}r\mathbin{\mathtt{:\!=}}x)$ and $(r\mathbin{\mathtt{:\!=}}x\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}$}}}}x\mathbin{\mathtt{:\!=}}1)$ . In general, if ${\sf wv}(\alpha)\subseteq{\sf rv}(\beta)$ , for $\alpha,\beta\in Instr$ , then there is a data dependency between $\alpha$ and $\beta$ , that is, the value of a variable that $\beta$ depends on is being computed by $\alpha$ . It is straightforward that $(\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}}}\beta)\Rightarrow(\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{eff}}}}}}\beta)$ i.e., ${{\textsc{g}}}_{0}$ is stronger than eff. We can therefore infer that ${{\textsc{g}}}_{0}$ is sequential.

Theorem 2.3.

If m is stronger than eff then m is sequential.

Proof.

A stronger model admits fewer behaviours (24).

The memory model ${{\textsc{g}}}_{0}$ is lacking in the age of multicore processors because it does not require two consecutive loads of the same shared variable to be performed in order. For instance, consider the following concurrent processes.

(4)

r_{\_}1\mathbin{\mathtt{:\!=}}x\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{g}}}_{0}}}$}}{\mathchar 24635\relax}$}}r_{\_}2\mathbin{\mathtt{:\!=}}x\quad\parallel\quad x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{g}}}_{0}}}$}}{\mathchar 24635\relax}$}}x\mathbin{\mathtt{:\!=}}2

The two loads should read values of $x$ in a globally “coherent” manner, that is, the value for $x$ loaded into $r_{\_}1$ must occur no later than that loaded by $r_{\_}2$ . Hence program (4) should not reach the final state $r_{\_}1=2\mathrel{\mathstrut{\wedge}}r_{\_}2=1$ . However, although $x\mathbin{\mathtt{:\!=}}1\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}$}}}}x\mathbin{\mathtt{:\!=}}2$ , we have $r_{\_}1\mathbin{\mathtt{:\!=}}x\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}}}r_{\_}2\mathbin{\mathtt{:\!=}}x$ in the first thread.

To cope with shared variables and coherence we divide the set of variables, ${\sf Var}$ , into mutually exclusive and exhaustive sets ${\sf Shared}$ and ${\sf Local}$ , with specialised definitions below.

(5)	$\displaystyle{\sf sv}(\alpha),{\sf rsv}(\alpha),$	$\displaystyle{\sf wsv}(\alpha)\hskip 5.69054pt\mbox{As (\ref{eqn:defn-fve}), (\ref{eqn:defn-wv}), restricted to ${\sf Shared}$}$
(6)	$\displaystyle\mathsf{isStore}(\alpha)\mathrel{\mathstrut{\widehat{=}}}$	$\displaystyle\hskip 5.69054pt{\sf wsv}(\alpha)\neq\varnothing\mathrel{\mathstrut{\wedge}}{\sf rsv}(\alpha)=\varnothing$
(7)	$\displaystyle\mathsf{isLoad}(\alpha)\mathrel{\mathstrut{\widehat{=}}}$	$\displaystyle\hskip 5.69054pt{\sf wsv}(\alpha)=\varnothing\mathrel{\mathstrut{\wedge}}{\sf rsv}(\alpha)\neq\varnothing$
(8)	$\displaystyle\mathsf{isReg}(\alpha)~{}\mbox{iff}$	$\alpha$ is an assign. or guard and ${\sf sv}(\alpha)=\varnothing$

To maintain “coherence per location” we extend ${{\textsc{g}}}_{0}$ to g by adding a constraint on the loaded shared variables. Additionally, since we are now explicitly concerned with concurrent behaviour, we add a fence instruction type to restore order, i.e., $\alpha\mathbin{:\!:\!=}\ldots\mathrel{~{}|~{}}\mathbf{fence}$ . We call this an “artificial” constraint, since it is not based on “natural” constraints arising from the preservation of sequential semantics.

Model 4 (g).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}}}}\beta\ \ ~{}\mbox{iff~{}\ $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}_{0}}}}\beta\mathrel{\mathstrut{\wedge}}{\sf rsv}(\alpha)\mathbin{\not\!\!\hskip 0.5pt\cap}{\sf rsv}(\beta)$}$ , except
$\ \qquad\hskip 5.69054pt\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}}$}}}}\mathbf{fence}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}}$}}}}\alpha$ .

When specifying a memory model we typically give the base relation first, and then list the “exceptions”, which take precedence. M4 strengthens the condition of M3 to require loads from main memory to be kept in program order per shared variable. In addition fences block reordering, reinstating program-order execution explicitly if desired by the programmer (at the potential cost of efficiency). We let $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}\beta\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}\gamma$ abbreviate $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}\beta\mathrel{\mathstrut{\wedge}}\beta\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}\gamma$ , and similarly for $\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}$}}}}$ .

We consider g to be the weakest memory model of practical consideration in a concurrent system as it maintains both coherence-per-location and sequential semantics.

Definition 2.4.

Model m is coherent if it is stronger than g.

Most modern processors are coherent. A memory model that is not coherent, and is the obvious weakest dual of M1, is one that allows any instructions to be reordered under any circumstances. If we disallow forwarding in this model (discussed in the next section), this weakest memory model corresponds to parallel composition.

Model 5 (par).

For all $\alpha,\beta\in Instr$ , $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{par}}}}}}\beta$

We may now define $c\parallel d\mathrel{\mathstrut{\widehat{=}}}c\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{par}}}}}$}}{\mathchar 24635\relax}$}}d$ , lifting instruction-level parallelism to thread-level parallelism.

The key point about the sc memory model (M1) is that reasoning is “straightforward”, or classical, in that all the accepted techniques work. This is the property of sequential consistency (Lamport, 1979), formalised below.

Definition 2.5.

Command ${c}_{\langle{{\textsc{m}}}\rangle}$ is structurally identical to $c$ but every parallelized sequential composition, except for instances of par, is parameterized by m.

Definition 2.6 (Sequentially consistent).

A memory model m is sequentially consistent if, for any programs $c$ and $d$ , ignoring local variables, ${c}_{\langle{{\textsc{m}}}\rangle}\parallel{d}_{\langle{{\textsc{m}}}\rangle}~{}~{}~{}=~{}~{}~{}{c}_{\langle{{\textsc{sc}}}\rangle}\parallel{d}_{\langle{{\textsc{sc}}}\rangle}$ .

By definition sc is sequentially consistent, however even sequentially consistent uniprocessors are not as strong as M1, for instance, some speculate loads and reissue them if a change is detected (relatively quickly via the cache (Yeager, 1996)). Note the difference between sequential and sequentially consistent: a sequentially consistent memory model is sequential, but not vice versa. None of TSO, ARM or RISC-V are sequentially consistent in general, but are for programs in a particular form, e.g., where shared variables are accessed according to a lock-based programming discipline.

Forwarding

We now complicate matters significantly by considering forwarding, where the effect of an earlier operation can be taken into account when deciding if instructions can be reordered.²²2We use the term “forwarding” from ARM and POWER (Alglave et al., 2014), sometimes referred to as “bypassing” in TSO (Sewell et al., 2010). For instance, given a program $x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{g}}}}}$}}{\mathchar 24635\relax}$}}r\mathbin{\mathtt{:\!=}}x$ , we have $x\mathbin{\mathtt{:\!=}}1\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}}$}}}}r\mathbin{\mathtt{:\!=}}x$ because ${\sf wv}(x\mathbin{\mathtt{:\!=}}1)=\{x\}\subseteq{\sf fv}(r\mathbin{\mathtt{:\!=}}x)$ , violating M3. In practice however it is possible to forward the new value of $x$ to the later instruction – it is clear that the value assigned to $r$ will be 1 if $x$ is local, and in any case is a valid possible assignment to $r$ even if $x$ is shared. We define ${\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha}$ , representing the effect of forwarding the (assignment) instruction $\alpha$ to $\beta$ , where the expression ${f}_{[x\backslash e]}$ is $f$ with references to $x$ replaced by $e$ .

Definition 2.7 (Forwarding).

${\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha}=\beta$ , except

\displaystyle{(y\mathbin{\mathtt{:\!=}}f)}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}x\mathrel{\,\mathtt{:=}\,}e}=y\mathbin{\mathtt{:\!=}}({f}_{[x\backslash e]})\qquad{\llparenthesis f\rrparenthesis}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}x\mathrel{\,\mathtt{:=}\,}e}=\llparenthesis{f}_{[x\backslash e]}\rrparenthesis\qquad

Forwarding and reordering are combined to form a memory model as follows, where the effect of forwarding is taken into account before calculating reordering.

(9)		$\displaystyle{{\textsc{m}}}\hskip 5.69054pt\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054pt\mathop{\mathstrut{\lambda}}\nolimits\alpha.\{({\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha},\beta)\|\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}{\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha}\}$
(10)		$\displaystyle\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}\alpha\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta\hskip 5.69054pt\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054pt(\beta^{\prime},\beta)\in{{\textsc{m}}}(\alpha)$

Thus a memory model m for a given action $\alpha$ returns a set of pairs $(\beta^{\prime},\beta)$ where $\beta$ reorders with $\alpha$ , after the effect of forwarding $\alpha$ to $\beta$ ( $\beta^{\prime}$ ) is taken into account. For convenience we sometimes use the notation $\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}\alpha\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta$ which notationally conveys the bringing forward of $\beta$ with respect $\alpha$ . For example, since ${(r\mathbin{\mathtt{:\!=}}x)}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}x\mathop{:=}1}=(r\mathbin{\mathtt{:\!=}}({x}_{[x\backslash 1]}))=r\mathbin{\mathtt{:\!=}}1$ , the load $r\mathbin{\mathtt{:\!=}}x$ “reorders” with $x\mathbin{\mathtt{:\!=}}1$ , becoming $r\mathbin{\mathtt{:\!=}}1$ .

(11)

r\mathbin{\mathtt{:\!=}}1\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}x\mathbin{\mathtt{:\!=}}1\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{g}}}_{0}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}r\mathbin{\mathtt{:\!=}}x

Forwarding is significant because it can change the orderings allowed in a non-standard way, since a later instruction that was blocked by $r\mathbin{\mathtt{:\!=}}x$ may no longer be blocked, and potentially can now also be reordered before $x\mathbin{\mathtt{:\!=}}1$ . Of course, this is potentially a significant efficiency gain, because local computation can proceed using the value 1 for $x$ without waiting for the update to propagate to the rest of the system.

Memory models with out-of-order execution typically use forwarding. (As noted above, one exception is par for interleaving parallel). In g and ${{\textsc{g}}}_{0}$ reordering is symmetric, however when calculated after the effect of forwarding is applied there are instructions that may be reordered in one direction but not the other. In general a reordering relation is neither reflexive nor transitive.

Memory model refinement

We define model refinement as a strengthening per action.

Definition 2.8.

${{\textsc{m}}}_{\_}1\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{m}}}_{\_}2\mathrel{\mathstrut{\widehat{=}}}\mathop{\mathstrut{\forall}}\nolimits\alpha\mathrel{\mathstrut{\bullet}}{{\textsc{m}}}_{\_}2(\alpha)\subseteq{{\textsc{m}}}_{\_}1(\alpha)$

As we explore in the rest of the paper, the Total Store Order memory model (tso) strengthens g considerably (or alternatively, weakens sc for the particular case of stores and loads), while arm strengthens g to prevent stores from coming before branches. arm, risc-v, and the release consistency models rc_pc and rc_sc are related as below, focusing on their common instruction types; since each introduces unique instruction types a direct comparison is not possible.

Theorem 2.9 (Hierarchy of models).

${{\textsc{eff}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{g}}}_{0}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{g}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{rc${}_{pc}$}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{risc-v}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{rc${}_{sc}$}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{arm}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{tso}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{sc}}}$ .

Proof.

Straightforward from definitions.

This result is similar to other classifications (Alglave, 2012; Frigo and Luchangco, 1998; Gharachorloo et al., 1990). Note that par does not fit into the hierarchy because it does not allow forwarding.

Well-behaved models

A memory model could theoretically allow arbitrary reorderings and effects of forwarding; however from a reasoning perspective we make the following definition of well-behaved memory models.

Definition 2.10.

A memory model m is well-behaved if it is sc or if: i) the result of reordering is deterministic; ii) any resulting action $\beta^{\prime}$ must arise from the application of the forwarding function (Definition 2.7), or have no change at all, i.e., we do not allow arbitrary effects of forwarding; and iii) if an action allows any reordering then it must allow reordering with internal (“silent”) steps.

Conditions i) and ii) ensure determinacy and sequential semantics, while condition iii) simplifies reasoning (silent steps are defined in in Sect. 4). All the memory models we consider are well-behaved, with the exception of par.

3. Pipeline semantics

Before defining a full language we consider how a reordering relation can be used to define the semantics of a processor pipeline where instructions can be reordered. The command $(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}{\tt t}:~{}c)$ represents the execution of $c$ with ${\tt t}$ a sequence of instructions that are currently (in) the pipeline; instructions in ${\tt t}$ can be issued to the wider system in order, or out-of-order according to m. Assume that $c$ is executed sequentially, (i.e., is of the form ${c}_{\langle{{\textsc{sc}}}\rangle}$ ), then the behaviour of the pipeline is as follows.

(12)

\begin{array}[]{c}c\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha}}{{\longrightarrow}}}c^{\prime}\\ \cline{1-1}\cr(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}{\tt t}:~{}c)\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}{\tt t}{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}\alpha:~{}c^{\prime})\end{array}

(13)

\begin{array}[]{c}\alpha^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}{\tt t}_{\_}1\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\alpha\\ \cline{1-1}\cr(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}{\tt t}_{\_}1{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}\alpha{{\mathbin{\raise 1.99332pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt t}_{\_}2:~{}c)\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha^{\prime}}}{{\longrightarrow}}}(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}{\tt t}_{\_}1{{\mathbin{\raise 1.66109pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt t}_{\_}2:~{}c)\end{array}

Rule (12) states that the next instruction in $c$ in sequential order can be “fetched” and placed at the end of the pipeline; $c$ is effectively the “code base”. We use ‘ ${{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}$ ’ for appending lists, and for convenience allow it to apply to individual elements as well. The promoted step $\tau$ represents an internal action of the processor, which is ignored by the rest of the system. Rule (13) states that an instruction $\alpha$ somewhere in the pipeline can be “committed” out of order, provided it can be reordered with all prior instructions currently in the pipeline. The notation $\alpha^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}{\tt t}\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\alpha$ (cf. (10)) is a shorthand for $(\alpha^{\prime},\alpha)\in{{\textsc{m}}}({\tt t})$ , i.e., lifting m from a function on instructions to sequences of instructions, that is, ${{\textsc{m}}}([])=\mathop{\mathstrut{\mathrm{id}}}\nolimits$ and ${{\textsc{m}}}(\alpha{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt t})={{\textsc{m}}}(\alpha)\mathbin{\raise 2.15279pt\hbox{\oalign{\hfil$\scriptscriptstyle\mathrm{o}$\hfil\cr\hfil$\scriptscriptstyle\mathrm{9}$\hfil}}}{{\textsc{m}}}({\tt t})$ , where ‘ $[]$ ’ is the empty list, ‘ $\mathop{\mathstrut{\mathrm{id}}}\nolimits$ ’ is the identity relation, and ‘ $\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$\scriptscriptstyle\mathrm{o}$\hfil\cr\hfil$\scriptscriptstyle\mathrm{9}$\hfil}}}$ ’ is relational composition.

Consider executing the program $r_{\_}1\mathbin{\mathtt{:\!=}}x\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}c$ in a pipeline. Both loads can be fetched (in order) into the pipeline by (12), but then, assuming $r_{\_}1\mathbin{\mathtt{:\!=}}x\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}}}r_{\_}2\mathbin{\mathtt{:\!=}}y$ , $r_{\_}2\mathbin{\mathtt{:\!=}}y$ may be committed before $r_{\_}1\mathbin{\mathtt{:\!=}}x$ by (13) (or further instructions from $c$ could be fetched and potentially committed). The fetch/commit rules abstract from other stages in a typical pipeline (see, e.g., (Lustig et al., 2014)), for instance, the two loads above would be issued to the system in order, with the out-of-order commit corresponding to the second load being serviced earlier by the memory system.

We encoded this semantics straightforwardly in the Maude rewriting engine (Clavel et al., 2002; Verdejo and Martí-Oliet, 2006) as a model checker (see supplementary material). We find this processor-level view to be convenient for showing equivalence with other models of reordering, for instance, the store buffer model of TSO (Sect. 6.1) and an axiomatic specification of ARM (Sect. 8.1). However it is not directly amenable to established techniques for reasoning about programs such as Hoare Logic (Hoare, 1969), Owicki-Gries (OG) (Owicki and Gries, 1976) or rely/guarantee (Jones, 1983a), which are over the structure of a program. As such we now consider embedding reordering into a typical imperative language, where the following theorem holds (recall Definition 2.5).

Theorem 3.1.

For well-behaved m, ${c}_{\langle{{\textsc{m}}}\rangle}~{}=~{}(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}[]:~{}{c}_{\langle{{\textsc{sc}}}\rangle})$

Proof.

By induction on traces.

4. An imperative language with parallelized sequential composition

In this section we give the syntax and semantics for an imperative programming language, “IMP+pseq”, which uses parallelized sequential composition (extending that of Sect. 2).

Syntax

The syntax of IMP+pseq is given below.

	$\alpha\hskip 5.69054pt\mathbin{:\!:\!=}\hskip 5.69054ptx\mathbin{\mathtt{:\!=}}e\mathbin{~{}\|~{}}\llparenthesis e\rrparenthesis\mathbin{~{}\|~{}}\mathbf{barrier}(\mathsf{f})$
	$c\hskip 5.69054pt\mathbin{:\!:\!=}\hskip 5.69054pt\mathbf{nil}\mathbin{~{}\|~{}}\alpha\mathbin{~{}\|~{}}c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\mathbin{~{}\|~{}}c_{\_}1\sqcap c_{\_}2\mathbin{~{}\|~{}}\body{c}{{{\textsc{m}}}}$
(14)		$\displaystyle c_{\_}1\mathbin{.{}\mathchar 24635\relax}c_{\_}2\mathrel{\mathstrut{\widehat{=}}}c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{sc}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\hskip 5.69054pt$	$\displaystyle\quad\hskip 5.69054ptc_{\_}1\parallel c_{\_}2\mathrel{\mathstrut{\widehat{=}}}c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{par}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$
(15)		$\displaystyle c^{0}_{{{\textsc{m}}}}\mathrel{\mathstrut{\widehat{=}}}\mathop{\bf nil}\hskip 5.69054pt$	$\displaystyle\quad\hskip 5.69054ptc^{n+1}_{{{\textsc{m}}}}\mathrel{\mathstrut{\widehat{=}}}c\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c^{n}_{{{\textsc{m}}}}$
(16)		$\displaystyle({\mathrel{\mathbf{if}}b\mathrel{\mathbf{then}}c_{1}\mathrel{\mathbf{else}}c_{2}})_{\_}{{\textsc{m}}}\hskip 5.69054pt$	$\displaystyle\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054pt\llparenthesis b\rrparenthesis\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}1\ \ \sqcap\ \ \llparenthesis\neg b\rrparenthesis\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$
(17)		$\displaystyle(\mathrel{\mathbf{while}}b~{}{\mathrel{\mathbf{do}}}~{}c)_{\_}{{\textsc{m}}}\hskip 5.69054pt$	$\displaystyle\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054pt\body{(\llparenthesis b\rrparenthesis\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c)}{{{\textsc{m}}}}\ \mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\ \llparenthesis\neg b\rrparenthesis$

The basic actions of a weak memory model are an update $x\mathbin{\mathtt{:\!=}}e$ , a guard $\llparenthesis e\rrparenthesis$ , and a barrier. (We give an atomic expression evaluation semantics for assignments and guards, which is typically reasonable for assembler-level instructions.) The barrier instruction type can be instantiated for some data type which we leave unspecified at this point; particular memory models typically introduce their own barrier/fence types and we define them in later sections. We assume that every model has at least a “full” fence, and define $\mathbf{fence}\mathrel{\mathstrut{\widehat{=}}}\mathbf{barrier}(\mathsf{full})$ . The special instruction $\tau\mathrel{\mathstrut{\widehat{=}}}\llparenthesis True\rrparenthesis$ is a silent step (defined later), having no effect on the state, possibly corresponding to some internal actions of a microprocessor.

A command $c$ may be the terminated command $\mathbf{nil}$ , a single instruction, the parallelized sequential composition of two commands (parameterised by a memory model), a binary choice, or an iteration. Iterations are parameterised by a memory model, as they implicitly contain sequencing.

In (14) we use parallelized sequential composition to define ‘ $\mathbin{.{}\mathchar 24635\relax}$ ’ as the usual notion of sequential composition (see M1), and ‘ $\parallel$ ’ as the usual notion of parallel composition (see M5). Finite iteration of a command, $c^{n}_{{{\textsc{m}}}}$ , is the $n$ -fold composition of $c$ with reordering according to m (15). Conditionals are modelled using guards and choice (where false branches are never executed) (16). By allowing instructions in $c_{\_}1$ or $c_{\_}2$ to be reordered before the guards one can model speculative execution, i.e., early execution of instructions which occur after a branch point (Sorin et al., 2011). We define a while loop using iteration (17) following (Fischer and Ladner, 1979; Kozen, 2000). Both conditionals and loops are parameterised by a memory model since they include a parallelized sequential composition.

Operational semantics

The meaning of IMP+pseq is formalised using an operational semantics, given below. The operational semantics generates a sequence of syntactic instructions (as opposed to Plotkin-style pairs-of-states (Plotkin, 2004)), allowing syntactic analysis of instructions to decide on reorderings. We show how to convert straightforwardly to pairs-of-states style in Sect. 5.

(18)

\alpha\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha}}{{\longrightarrow}}}\mathbf{nil}

(19)

\body{c}{{{\textsc{m}}}}\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}c^{n}_{{{\textsc{m}}}}

(20)

c\sqcap d\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}c

(21)

c\sqcap d\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}d

(22)

\begin{array}[]{c}c_{\_}1\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha}}{{\longrightarrow}}}c_{\_}1^{\prime}\\ \cline{1-1}\cr c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha}}{{\longrightarrow}}}c_{\_}1^{\prime}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\end{array}

(23)

\mathop{\bf nil}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}c_{\_}2

(24)

\begin{array}[]{c}c_{\_}2\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\beta}}{{\longrightarrow}}}c_{\_}2^{\prime}\qquad\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}c_{\_}1\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta\\ \cline{1-1}\cr c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\beta^{\prime}}}{{\longrightarrow}}}c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2^{\prime}\end{array}

Sequential fragment

The operational semantics of an instruction is simply a step labelled by the instruction itself (18). After executing the corresponding step an instruction $\alpha$ is terminated, i.e., becomes $\mathbf{nil}$ . The semantics of loops is given by unfolding a nondeterministically-chosen finite number of times (19) (recall (15)).³³3 We ignore infinite loops to avoid the usual complications they introduce, since they do not add anything to the discussion of instruction reordering. A nondeterministic choice (the internal choice of CSP (Hoare, 1985)) can choose either branch (20, 21). A parallelized sequential composition $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ can take a step if $c_{\_}1$ can take a step (22), and continues with $c_{\_}2$ when $c_{\_}1$ has terminated (23), as in standard sequential composition. Together these rules give a standard sequential semantics for imperative programs, and we refer to them as the “sequential fragment”.

Parallelized sequential composition

We now consider a further rule, unique to IMP+pseq, that allows reordering of instructions. Rule (24) states that given a program $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ , an instruction $\beta$ of $c_{\_}2$ can happen before the instructions of $c_{\_}1$ , provided that $\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}c_{\_}1\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta$ , i.e., $\beta$ is not dependent on instructions in $c_{\_}1$ (according to the rules of model m), and the result of (accumulated) forwarding of instructions in $c_{\_}1$ to $\beta$ results in $\beta^{\prime}$ . This is given by lifting a model m to commands, defined inductively below (recall (9) and (10)).

(27)		$\displaystyle{{\textsc{m}}}(\mathop{\bf nil})=\mathop{\mathstrut{\mathrm{id}}}\nolimits\quad{{\textsc{m}}}_{\_}1(c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}_{\_}2}}$}}{\mathchar 24635\relax}$}}c_{\_}2)={{\textsc{m}}}_{\_}1(c_{\_}1)\mathbin{\raise 2.58334pt\hbox{\oalign{\hfil$\scriptscriptstyle\mathrm{o}$\hfil\cr\hfil$\scriptscriptstyle\mathrm{9}$\hfil}}}{{\textsc{m}}}_{\_}1(c_{\_}2)$
(28)		$\displaystyle{{\textsc{m}}}(c_{\_}1\mathrel{\sqcap}c_{\_}2)={{\textsc{m}}}(c_{\_}1)\mathbin{\mathstrut{\cap}}{{\textsc{m}}}(c_{\_}2)\quad{{\textsc{m}}}(\body{c}{{{\textsc{m}}}})=\underset{n}{{\textstyle\bigcap}}{{\textsc{m}}}(c^{n}_{{{\textsc{m}}}})$

Any instruction may reorder with the empty command $\mathop{\bf nil}$ . Reordering according to memory model ${{\textsc{m}}}_{\_}1$ over $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}_{\_}2}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ is the relational composition of the orderings of $c_{\_}1$ and $c_{\_}2$ with respect to ${{\textsc{m}}}_{\_}1$ (independent of ${{\textsc{m}}}_{\_}2$ ) (27). Reordering over a choice is possible only if reordering can occur over both branches (28) (but choices can be resolved via (20, 21)). Reordering over an iteration is possible only if reordering is possible over all possible unrollings.

Trace semantics

Given a program $c$ the operational semantics generates a trace, that is, a finite sequence of steps $c_{\_}0\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha_{\_}1}}{{\longrightarrow}}}c_{\_}1\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha_{\_}2}}{{\longrightarrow}}}\ldots$ where the labels in the trace are actions. We write $c\ext@arrow 0359\Rightarrowfill@{}{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{\,t\,}$}}c^{\prime}$ to say that $c$ executes the actions in trace $t$ and evolves to $c^{\prime}$ . Traces of visible actions are accumulated into the trace, and silent actions (such as $\tau$ ) are discarded, i.e., we have a “weak” notion of equivalence (Milner, 1982). A visible action is any action with a visible effect, for instance, fences, assignments, and guards with free variables. Silent actions include any guard which is $True$ in any state and contains no free variables; for instance, $\llparenthesis 0=0\rrparenthesis$ is silent while $\llparenthesis x=x\rrparenthesis$ is not. A third category of actions, ${\sf infeasible}~{}\alpha$ , includes exactly those guards $\llparenthesis b\rrparenthesis$ where $b$ evaluates to $False$ in every state. This includes actions such as $\llparenthesis False\rrparenthesis$ and $\llparenthesis x\neq x\rrparenthesis$ . The meaning of a command $c$ is its set of all terminating behaviours, written $\llbracket c\rrbracket$ , with behaviours containing infeasible actions being excluded from consideration.

Refinement

We take the usual (reverse) subset inclusion definition of refinement, i.e., $c\mathrel{\sqsubseteq}d$ if every behaviour of $d$ is a behaviour of $c$ ; our notion of command equivalence is refinement in both direction. From this definition we can derive expected properties for the standard operators such as $\llbracket c\sqcap d\rrbracket=\llbracket c\rrbracket\mathbin{\mathstrut{\cup}}\llbracket d\rrbracket$ and $\llbracket c\mathbin{.{}\mathchar 24635\relax}d\rrbracket=\llbracket c\rrbracket{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}\llbracket d\rrbracket$ (overloading ‘ ${{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}$ ’ to mean pairwise concatenation of sets of lists).

Properties for parallelized sequential composition are, of course, more interesting, and we provide some below that we make use of in the rest of the paper.

(29)	$\displaystyle c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$	$\displaystyle\mathrel{\sqsubseteq}$	$\displaystyle c_{\_}1\mathbin{.{}\mathchar 24635\relax}c_{\_}2$
(30)	$\displaystyle c\sqcap d$	$\displaystyle\mathrel{\sqsubseteq}$	$\displaystyle c$
(31)	$\displaystyle(c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2)\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}3$	$\displaystyle=$	$\displaystyle c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}(c_{\_}2\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}3)$
(32)	$\displaystyle(\alpha\mathbin{.{}\mathchar 24635\relax}c)\parallel d$	$\displaystyle\mathrel{\sqsubseteq}$	$\displaystyle\alpha\mathbin{.{}\mathchar 24635\relax}(c\parallel d)$
(33)	$\displaystyle(c_{\_}1\sqcap c_{\_}2)\parallel d$	$\displaystyle=$	$\displaystyle(c_{\_}1\parallel d)\sqcap(c_{\_}2\parallel d)$

Law 29 states that sequential composition is always a refinement of parallelized sequential composition. Law 30 is the standard resolution of a choice to its left operand (a symmetric law holds for the right operand). Parallelized sequential composition is associative by Law 31, provided both instances are parameterised by the same model m. The interleaving semantics allows us to derive Law 32, a typical interleaving law, and Law 33 for distributing choice over parallel composition; these laws are important for reasoning about the effects of different instruction reorderings.

Now consider the case of two instructions in sequence.

(36)	$\displaystyle\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{m}}}}$}}}}\beta\Rrightarrow\hskip 5.69054pt\hskip 5.69054pt\alpha\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\beta$	$\displaystyle=$	$\displaystyle\alpha\mathbin{.{}\mathchar 24635\relax}\beta$
(37)	$\displaystyle\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}\alpha\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta\Rrightarrow\hskip 5.69054pt\hskip 5.69054pt\alpha\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\beta$	$\displaystyle\mathrel{\sqsubseteq}$	$\displaystyle\beta^{\prime}\mathbin{.{}\mathchar 24635\relax}\alpha$
(38)	$\displaystyle\beta^{\prime}\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}\alpha\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{m}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}\beta\Rrightarrow\hskip 5.69054pt\hskip 5.69054pt\alpha\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\beta$	$\displaystyle=$	$\displaystyle(\alpha\mathbin{.{}\mathchar 24635\relax}\beta)\sqcap(\beta^{\prime}\mathbin{.{}\mathchar 24635\relax}\alpha)$
(39)	$\displaystyle c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\mathbf{fence}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$	$\displaystyle=$	$\displaystyle c_{\_}1\mathbin{.{}\mathchar 24635\relax}\mathbf{fence}\mathbin{.{}\mathchar 24635\relax}c_{\_}2$

Law 36 states that if $\beta$ cannot be reordered according to m then they are executed sequentially. Law 37 states that if reordering is allowed then that is one possible behaviour. Law 38 composes these two rules to reduce parallelized sequential composition to a choice over sequential compositions, eliminating the memory model. Similarly, a full fence restores order and hence sequential reasoning as in Law 39.

Monotonicity

Monotonicity (congruence) holds for the standard operators of IMP+pseq, but monotonicity of parallelized sequential composition contains a subtlety in that the allowed traces of $c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2$ are dependent on the reorderings allowed by $c_{\_}1$ with respect to m (Rule (24)). To handle this we need a stronger notion of refinement, written $c\overset{{{\textsc{m}}}}{\mathrel{\sqsubseteq}}c^{\prime}$ , where traces are augmented to track the reorderings allowed,⁴⁴4 Similarly to refusal sets in CSP’s failures/divergences model (Roscoe, 1998). allowing strengthening only.

Theorem 4.1.

$c\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}d\mathrel{\sqsubseteq}c^{\prime}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}^{\prime}}}$}}{\mathchar 24635\relax}$}}d^{\prime}$ if $c\overset{{{\textsc{m}}}}{\mathrel{\sqsubseteq}}c^{\prime}$ , $d\mathrel{\sqsubseteq}d^{\prime}$ , and ${{\textsc{m}}}\mathbin{{\color[rgb]{0.,0.,1.}\mathrel{\sqsubseteq}}}{{\textsc{m}}}^{\prime}$ .

Proof.

By induction on traces: the requirement for the left argument is a consequence of (24); and strengthening of the memory model reduces the number of possible traces.

5. Reasoning about IMP+pseq

So far we have considered trace-level properties; now we turn attention to state- and predicate-based reasoning. The action-trace semantics can be converted into a typical pairs-of-states semantics straightforwardly.

(40)		$\displaystyle{\sf eff}(x\mathbin{\mathtt{:\!=}}e)~{}~{}=~{}~{}\{(\sigma,{\sigma}_{[x\mathop{:=}{e}_{\sigma}]})\}\quad$
(41)		$\displaystyle{\sf eff}(\llparenthesis e\rrparenthesis)=\{(\sigma,\sigma)\|\sigma\in e\}\qquad{\sf eff}(\mathbf{barrier}(\mathsf{f}))~{}~{}=~{}~{}\mathop{\mathstrut{\mathrm{id}}}\nolimits$
(44)		$\displaystyle{\sf eff}([])~{}~{}=~{}~{}\mathop{\mathstrut{\mathrm{id}}}\nolimits\qquad\quad{\sf eff}(a{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}t)~{}~{}=~{}~{}{\sf eff}(\alpha)\mathbin{\raise 2.15279pt\hbox{\oalign{\hfil$\scriptscriptstyle\mathrm{o}$\hfil\cr\hfil$\scriptscriptstyle\mathrm{9}$\hfil}}}{\sf eff}(t)$
(45)		$\displaystyle{\sf eff}(c)={\textstyle\bigcup}\{{\sf eff}(t)\|t\in\llbracket c\rrbracket\}$
(46)		$\displaystyle wp(c)~{}~{}\mathrel{\mathstrut{\widehat{=}}}~{}~{}\mathop{\mathstrut{\lambda}}\nolimits q.\{\sigma\|\mathop{\mathstrut{\forall}}\nolimits\sigma^{\prime}\mathrel{\mathstrut{\bullet}}(\sigma,\sigma^{\prime})\in{\sf eff}(c)\Rightarrow\sigma^{\prime}\in q\}$
(47)		$\displaystyle\{p\}\,c\,\{q\}~{}~{}\mathrel{\mathstrut{\widehat{=}}}~{}~{}p\Rightarrow wp(c){q}$

Let the type $\Sigma$ be the set of total mappings from variables to values, and let the effect function ${\sf eff}:Instr\rightarrow\mathop{\mathstrut{\mathbb P}}\nolimits(\Sigma\times\Sigma)$ return a relation on states given an instruction. We let ‘ $\mathop{\mathstrut{\mathrm{id}}}\nolimits$ ’ be the identity relation on states, and given a Boolean expression $e$ we write $\sigma\in e$ if $e$ is $True$ in state $\sigma$ . The effect of actions is straightforward from (40) and (41), giving the trivial case ${\sf eff}(\tau)=\mathop{\mathstrut{\mathrm{id}}}\nolimits$ . The relationship with standard Plotkin style operational semantics (Plotkin, 2004) is straightforward.

(48)

c\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\alpha}}{{\longrightarrow}}}c^{\prime}\mathrel{\mathstrut{\wedge}}(\sigma,\sigma^{\prime})\in{\sf eff}(\alpha)\hskip 5.69054pt\Rrightarrow\hskip 5.69054pt\langle c,\sigma\rangle\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}}}{{\longrightarrow}}}\langle c^{\prime},\sigma^{\prime}\rangle

The advantage of our approach is that syntax of the action $\alpha$ can be used to reason about allowed reorderings using (13, 24), whereas in general one cannot reconstruct the action from a pair of states. Mapping ${\sf eff}$ onto a trace $t$ , $\mathsf{map}({\sf eff},t)$ , yields the sequence of relations corresponding to the set of sequences of pairs of states in a Plotkin-style trace. We can lift ${\sf eff}$ to traces by inductively composing such a sequence of relations (44), and we define the overall effect of a command by the union of the effect of its traces (45).

The predicate transformer for weakest precondition semantics is given in (46). A predicate is a set of states, so that given a command $c$ and predicate $q$ , $wp(c)(q)$ returns the set of (pre) states $\sigma$ where every post-state related to $\sigma$ by ${\sf eff}(c)$ satisfies $q$ (following, e.g., (Dijkstra and Scholten, 1990)). Given these definitions we can show the following.

Theorem 5.1.

For Sequential m, $wp(c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2)=wp(c_{\_}1\mathbin{.{}\mathchar 24635\relax}c_{\_}2)$

Proof.

By Definition 2.1, Theorem 2.2 and (46).

We define Hoare logic judgements using weakest preconditions (47) (note that we deal only with partial correctness as we consider only finite traces). From these definitions we can derive the standard rules of weakest preconditions and Hoare logic for commands such as nondeterministic choice and sequential composition, but there are no general compositional rules for parallelized sequential composition.

(49)		$\displaystyle\{{q}_{[x\backslash e]}\}\,x\mathbin{\mathtt{:\!=}}e\,\{q\}\quad\{q\mathrel{\mathstrut{\wedge}}e\}\,\llparenthesis e\rrparenthesis\,\{q\}\quad\{q\}\,\mathbf{fence}\,\{q\}$
(50)		$\displaystyle\{p\}\,c_{\_}1\mathbin{.{}\mathchar 24635\relax}c_{\_}2\,\{q\}\Leftrightarrow\{p\}\,c_{\_}1\,\{r\}\mathrel{\mathstrut{\wedge}}\{r\}\,c_{\_}2\,\{q\}$
(51)		$\displaystyle\{p\}\,c_{\_}1\sqcap c_{\_}2\,\{q\}\Leftrightarrow\{p\}\,c_{\_}1\,\{q\}\mathrel{\mathstrut{\wedge}}\{p\}\,c_{\_}2\,\{q\}$
(52)		$\displaystyle\{p\}\,c_{\_}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}\mathbf{fence}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{m}}}}}$}}{\mathchar 24635\relax}$}}c_{\_}2\,\{q\}\Leftrightarrow\{p\}\,c_{\_}1\,\{r\}\mathrel{\mathstrut{\wedge}}\{r\}\,c_{\_}2\,\{q\}$

Law 49 follows from (41) and (46), while Laws 50 and 51 are straightforward by definition. Law 52 is a key rule that shows how, for any m where $\mathbf{fence}$ acts as a full fence, inserting a fence restores sequential reasoning in that a mid-point (predicate $r$ ) can be used for compositional reasoning.

Theorem 5.2.

If $c\mathrel{\sqsubseteq}c^{\prime}$ then ${\sf eff}(c^{\prime})\subseteq{\sf eff}(c)$ , $wp(c)\subseteq wp(c^{\prime})$ , and $\{p\}\,c\,\{q\}\Rightarrow\{p\}\,c^{\prime}\,\{q\}$

Proof.

Straightforward from definitions.

We use two key theorems to establish (or deny) properties of programs executing under weak memory models.

Theorem 5.3 (Verification).

	$\displaystyle(i)~{}~{}\mbox{If}\hskip 5.69054ptc={c^{\prime}}_{\langle{{\textsc{sc}}}\rangle}\hskip 5.69054pt\mbox{then}\hskip 5.69054pt\{p\}\,c\,\{q\}\Leftrightarrow\{p\}\,{c^{\prime}}_{\langle{{\textsc{sc}}}\rangle}\,\{q\}$
	$\displaystyle(ii)~{}~{}\mbox{If}\hskip 5.69054ptc\mathrel{\sqsubseteq}c^{\prime}\hskip 5.69054pt\mbox{and}\hskip 5.69054pt\{p\}\,c^{\prime}\,\{\neg q\}\hskip 5.69054pt\mbox{then}\hskip 5.69054pt\neg\{p\}\,c\,\{q\}$

Proof.

Straightforward from definition and Theorem 5.2.

Theorem 5.3(i) is simply monotonicity of a Hoare triple, but we make the reduction to a sequential form explicit; once this has happened standard compositional rules from Hoare logic can be applied to establish a property of the original program. Theorem 5.3(ii) applies when a particular reordering of instructions in $c$ (typically $c^{\prime}$ will be a sequence of instructions) contradicts some postcondition, from which we can determine that the original program does not establish that postcondition (for a given $p$ ). Hoare logic is used as the basis for reasoning about concurrent programs in the Owicki-Gries method (Owicki and Gries, 1976), and so Theorem 5.3(i) enables the application of standard techniques for concurrent programs.

We now encode some well-known memory models in our framework and show how properties and behaviours can be derived from the base we have provided.

6. Total store order (TSO)

The “total store order” memory model (used in Intel, AMD and SPARC processors; a history of its development is given in (Sewell et al., 2010)) maintains program order on stores but is relaxed in the sense that it allows loads to come before stores.

Model 6 (tso).

$\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{tso}}}}$}}}}\beta$ except, for $x\in{\sf Shared},r\in{\sf Local}$ ,

(57)

\displaystyle x\mathbin{\mathtt{:\!=}}e\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{tso}}}}}}r\mathbin{\mathtt{:\!=}}f\quad\mbox{if\ \ $x\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\sf sv}(f)$ and $r\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\sf fv}(e)$}

TSO allows loads to come before independent stores, and, due to forwarding, for dependent loads to “bypass” (recall (1) and (5)). That is, even though by M6 we have $x\mathbin{\mathtt{:\!=}}1\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{tso}}}}$}}}}r\mathbin{\mathtt{:\!=}}x$ , due to forwarding we have $r\mathbin{\mathtt{:\!=}}1\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}x\mathbin{\mathtt{:\!=}}1\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{tso}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}r\mathbin{\mathtt{:\!=}}x$ . Note that tso allows independent register operations to also be reordered before stores. We define $\mathtt{mfence}\mathrel{\mathstrut{\widehat{=}}}\mathbf{fence}$ , which is x86-TSO’s primary way to restore order.⁵⁵5 x86-TSO also has store and load fences, which we discuss in the context of later memory models, but these are effectively no-ops for TSO; however TSO’s $\mathtt{lfence}$ blocks speculative execution (Intel, 2019).

The classic behaviours defining tso as opposed to sc can be summarised by the equations below.

(58)	$\displaystyle x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{tso}}}}}$}}{\mathchar 24635\relax}$}}y\mathbin{\mathtt{:\!=}}1\ ~{}~{}~{}=~{}~{}~{}$	$\displaystyle\ x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1\qquad\qquad$
(59)	$\displaystyle x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{tso}}}}}$}}{\mathchar 24635\relax}$}}r\mathbin{\mathtt{:\!=}}x~{}\mathrel{\sqsubseteq}~{}$	$\displaystyle\ r\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1$
(60)	$\displaystyle x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{tso}}}}}$}}{\mathchar 24635\relax}$}}r\mathbin{\mathtt{:\!=}}y~{}=~{}$	$\displaystyle\ (x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}r\mathbin{\mathtt{:\!=}}y)\sqcap(r\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1)$

Stores are kept in program order by tso (58) (an instance of Law 36). A load of $x$ preceded by a store can use the stored value immediately (59) (an instance of Law 37); only later will the store become visible to the rest of the system – the classic bypassing behaviour. A load of $y$ preceded by a store of $x$ , for $x\neq y$ , could be executed in either order (60). Perhaps the simplest system which can observe this behaviour is the classic “store buffer” ( ${\tt SB}$ ) test (Adve and Hill, 1993).

{\tt SB}\hskip 5.69054pt\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054ptx\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}r_{\_}1\mathbin{\mathtt{:\!=}}y\hskip 5.69054pt\parallel\hskip 5.69054pty\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}x

First note that in a sequential system at least one register must read the value 1.

Theorem 6.1.

$\{x=y=0\}\,{\tt SB}\,\{\neg(r_{\_}1=r_{\_}2=0)\}$ .

Proof.

Lahav and Vafeaidis (Lahav and Vafeiadis, 2015) provide an Owicki-Gries proof, which we replicated using Isabelle/HOL (Nipkow and Nieto, 1999).

However this behaviour is not ruled out under tso.

Theorem 6.2.

$\neg\{x=y=0\}\,{{\tt SB}}_{\langle{{\textsc{tso}}}\rangle}\,\{\neg(r_{\_}1=r_{\_}2=0)\}$ .

Proof.

Abbreviate $c_{\_}1\mathrel{\mathstrut{\widehat{=}}}r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1$ and $c_{\_}2\mathrel{\mathstrut{\widehat{=}}}r_{\_}2\mathbin{\mathtt{:\!=}}x\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1$ , and hence ${{\tt SB}}_{\langle{{\textsc{tso}}}\rangle}={c_{\_}{1}}_{\langle{{\textsc{tso}}}\rangle}\parallel{c_{\_}{2}}_{\langle{{\textsc{tso}}}\rangle}$ . Also let $\overset{\curvearrowleft}{c_{\_}{i}}$ represent $c_{\_}i$ with its instructions reordered.
${{\tt SB}}_{\langle{{\textsc{tso}}}\rangle}\hskip 5.69054pt=\hskip 5.69054pt{c_{\_}{1}}_{\langle{{\textsc{tso}}}\rangle}\parallel{c_{\_}{2}}_{\langle{{\textsc{tso}}}\rangle}$ $=$ $(c_{\_}1\sqcap\overset{\curvearrowleft}{c_{\_}{1}})\parallel(c_{\_}2\sqcap\overset{\curvearrowleft}{c_{\_}{2}})$ By (60) $=$ $(c_{\_}1\parallel c_{\_}2)\sqcap(\overset{\curvearrowleft}{c_{\_}{1}}\parallel c_{\_}2)\sqcap(c_{\_}1\parallel\overset{\curvearrowleft}{c_{\_}{2}})\sqcap(\overset{\curvearrowleft}{c_{\_}{1}}\parallel\overset{\curvearrowleft}{c_{\_}{2}})$ Law 33

We have reduced ${{\tt SB}}_{\langle{{\textsc{tso}}}\rangle}$ to four concurrent, sequential programs representing each possible combination of reorderings. By Law 51 we can complete the proof by showing any one of the four violates the postcondition; we already know that the first reordering does establish the pre/post condition by Theorem 6.1, however the other three all violate it, as we demonstrate below for the fourth case.

$\overset{\curvearrowleft}{c_{\_}{1}}\parallel\overset{\curvearrowleft}{c_{\_}{2}}$ $=$	$r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1\parallel r_{\_}2\mathbin{\mathtt{:\!=}}x\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1$	Def.
$\mathrel{\sqsubseteq}$	$r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}x\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1$	Law 32

Hoare logic (Laws 49 and 50), gives the following.

\halign to=433.62pt{\zstrut\z@left\z@startline\ignorespaces$\@lign#$\z@stopline\hfil&\hbox to 0pt{\hss\@lign#}\cr\hbox{\hskip 0.0pt\hbox{\hbox{}\hbox{\ignorespaces\ignorespaces$\{x=y=0\}\,r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}x\mathbin{.{}\mathchar 24635\relax}x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1\,\{r_{\_}1=r_{\_}2=0\}$}}}\crcr}

The proof is completed by Theorem 5.3(ii) – a possible reordering and interleaving contradicts the postcondition.

To reinstate sequential behaviour under tso, fences can be inserted in both branches.

Theorem 6.3.

Let ${\tt SB}^{+\mathtt{mfence}}$ be ${\tt SB}$ with fences inserted into each branch; then $\{x=y=0\}\,{{\tt SB}}_{\langle{{\textsc{tso}}}\rangle}^{+\mathtt{mfence}}\,\{\neg(r_{\_}1=r_{\_}2=0)\}$

Proof.

By M6, Law 52 and the reasoning of Theorem 6.1.

Note that reasoning is relatively direct in this framework: we can use properties of the model and the structure of the program to reduce reasoning to sequential cases where established techniques can be applied (Theorem 6.3), or a partcular case that violates a desired property can be enumerated (Theorem 6.2). Other reasoning frameworks typically monitor reorderings with respect to global abstract (graph) data structures, requiring custom assertion languages and judgements.

6.1. Equivalence to an explicit store buffer model

One of the best-known formalisations of a weak memory model is the operational model of x86-TSO (Sewell et al., 2010; Owens et al., 2009). In that model the code is executed sequentially, but interacts with a store buffer that temporarily holds stores before sending them to the storage system, allowing loads that occur in the meantime to use values found in the buffer. Below we give an extension of IMP+pseq to add an explicit store buffer ${\tt s}$ , written $(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)$ , following the semantics in (Owens et al., 2009).⁶⁶6 We give a per-process buffer, whereas (Owens et al., 2009) uses a single global buffer, with each write in the buffer tagged by the originating process’s id.

(61)	$\displaystyle c\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{x\mathop{:=}v}$}}c^{\prime}\Rightarrow$	$\displaystyle~{}~{}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}(\mathbf{buf}~{}{\tt s}{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}(x,v)\mathrel{\mathstrut{\bullet}}c^{\prime})$
(62)		$\displaystyle~{}~{}(\mathbf{buf}~{}(x,v){{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt s}\mathrel{\mathstrut{\bullet}}c)\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{x\mathop{:=}v}$}}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)$
(63)	$\displaystyle c\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{\mathtt{mfence}}$}}c^{\prime}\Rightarrow$	$\displaystyle~{}~{}(\mathbf{buf}~{}[]\mathrel{\mathstrut{\bullet}}c)\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{\mathtt{mfence}}$}}(\mathbf{buf}~{}[]\mathrel{\mathstrut{\bullet}}c^{\prime})$
(64)	$\displaystyle c\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}c^{\prime}\Rightarrow$	$\displaystyle~{}~{}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)\mathrel{\stackrel{{\scriptstyle\raisebox{5.69054pt}{\hbox{}}\tau}}{{\longrightarrow}}}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c^{\prime})$
(67)	$c\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{r\mathop{:=}x}$}}c^{\prime}\mathrel{\mathstrut{\wedge}}{\tt s}~{}=~{}{\tt s}_{\_}1{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}(x,v){{\mathbin{\raise 1.99332pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt s}_{\_}2\mathrel{\mathstrut{\wedge}}x\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\tt s}_{\_}2\Rightarrow$
(68)		$\displaystyle~{}~{}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{r\mathop{:=}v}$}}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c^{\prime})$
(71)	$c\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{r\mathop{:=}x}$}}c^{\prime}\mathrel{\mathstrut{\wedge}}x\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\tt s}\Rightarrow$
(72)		$\displaystyle~{}~{}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c)\xrightarrow{\raisebox{5.69054pt}{\hbox{}}\hbox{${}_{r\mathop{:=}x}$}}(\mathbf{buf}~{}{\tt s}\mathrel{\mathstrut{\bullet}}c^{\prime})$

A store is a variable/value pair, $(x,v)$ , and $x\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\tt s}$ means there is no store to $x$ in ${\tt s}$ . In all rules $c$ is executed using the sequential fragment of the semantics only, and we assume $x\in{\sf Shared}$ and $r\in{\sf Local}$ . If $c$ issues a store, it is placed at the end of the store buffer (the system sees only a silent ( $\tau$ ) step) (61). The first store in the buffer can be flushed to the system at any time (62). A fence can only proceed when the buffer is empty (63), while internal steps of $c$ can proceed independently of the state of the buffer (64). The interesting rules are for loads: if $c$ issues a load $r\mathbin{\mathtt{:\!=}}x$ then this can be serviced by the buffer using the most recent value for $x$ (say $v$ ) resulting in a step $r\mathbin{\mathtt{:\!=}}v$ , and no interaction with the global system (68). If $c$ issues a load of $x$ that is not in the buffer then the load is promoted to the system level (72).

Theorem 6.4.

For any command $c$ , issuing only assembler-level instructions (stores, loads, fences and register-only operations), $(\mathbf{buf}~{}[]\mathrel{\mathstrut{\bullet}}c)=(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}[]:~{}{c}_{\langle{{\textsc{sc}}}\rangle})$

Proof.

The main difference between the store buffer semantics and a pipeline is that only stores may be fetched, while the rules for loads combine fetching and committing in one step. All of the preconditions for the store buffer rules reduce to an equivalent form of the reordering relation over sequences of stores. For instance, for (68) to apply for a load $r\mathbin{\mathtt{:\!=}}x$ , the most recently buffered (fetched) store $(x,v)$ is used, and the promoted label is $r\mathbin{\mathtt{:\!=}}v$ . This is exactly the condition for a load to reorder with the equivalent trace via (13), that is, $r\mathbin{\mathtt{:\!=}}v\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}t\mathbin{\overset{{\color[rgb]{0.2.,0.5,0.2}{{\textsc{tso}}}}}{\mathop{\textstyle{\raisebox{1.0pt}{${\color[rgb]{0.2.,0.5,0.2}\mbox{\guillemetleft}}$}}}}}r\mathbin{\mathtt{:\!=}}x$ iff trace (pipeline) $t$ is formed from stores only and is of the form $t_{\_}1{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}x\mathbin{\mathtt{:\!=}}v{{\mathbin{\raise 1.99332pt\hbox{$\mathchar 0\relax\@@cat$}}}}t_{\_}2$ , with $x\mathbin{\ooalign{\hss{\hskip 0.7pt/\hss}\cr{$\in$}}}{\sf wv}(t_{\_}2)$ , which follows from M6 and lifting tso to traces as in Sect. 3. The other (simpler) cases similarly reduce.

Theorem 6.5.

For any command $c$ , issuing only assembler-level instructions, $(\mathbf{buf}~{}[]\mathrel{\mathstrut{\bullet}}c)={c}_{\langle{{\textsc{tso}}}\rangle}$

Proof.

By Theorem 6.4 and Theorem 3.1.

That is, the semantics of parallelized sequential composition instantiated with reordering (and forwarding) given by tso gives precisely those behaviours obtained by sequential execution with an (initially empty) store buffer.

7. Release consistency

The release consistency memory model (Gharachorloo et al., 1990) has been highly influential, having been implemented in the Dash processor (Lenoski et al., 1992), guided the development of the C language memory model (Boehm and Adve, 2008), and the concepts incorporated into ARM (Pulte et al., 2017) and RISC-V (RISC-V International, 2017). The key concept revolves around release writes and acquire loads: a release write is stereotypically used to set a flag to indicate a block of computation has ended, and and an acquire load is correspondingly used to observe a release write. Code before the release should happen before, and code after the acquire should happen after; conceptually these are weaker (one-way) fences. Release consistency’s motivation was finding an easy-to-implement mechanism for interprocess communication that is feasible and inexpensive computationally, and relatively straightforward for programmers.

We extend the action syntax of IMP+pseq to include ordering constraints (oc) as annotations to any action, though as noted above release store and acquire load are the most commonly used.

(73)		$\displaystyle{\textsc{oc}}~{}~{}\mathbin{:\!:\!=}~{}~{}{\textsc{rel}}\mathrel{~{}\|~{}}{\textsc{acq}}\qquad\alpha~{}~{}\mathbin{:\!:\!=}~{}~{}\ldots\mathrel{~{}\|~{}}\alpha^{{\textsc{oc}}}$
(74)		$\displaystyle{(\beta^{{\textsc{oc}}})}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha}=({\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha})^{{\textsc{oc}}}\qquad{\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}(\alpha^{{\textsc{oc}}})}={\beta}_{\mbox{\raisebox{-0.5pt}{\guillemetleft}}\alpha}$

Forwarding for the new annotated actions is defined inductively so that the base actions take effect and ignore the annotations (74); and we define ${\sf eff}(\alpha^{{\textsc{oc}}})={\sf eff}(\alpha)$ .

Following (Gharachorloo et al., 1990) we distinguish two models, rc_pc (where $pc$ stands for “processor consistency”) and rc_sc (where $sc$ stands for “sequential consistency”), the latter of which is a strengthening of the former; an alternative would be to distinguish $pc/sc$ in the annotations themselves, allowing mixing of the two types in one model (cf. ARM’s ldar/ldapr instructions). For simplicity we assume that g (M4) controls reordering outside of annotation considerations, although in the theory of (Gharachorloo et al., 1990) stronger constraints are possible.

Model 7 (rc_pc).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\beta~{}\mbox{iff~{} $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{g}}}}}}\beta$ }$ except

(77)			$\displaystyle\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}$}}}}\beta^{{\textsc{rel}}}\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\gamma$	iff $\beta\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\gamma$
(80)			$\displaystyle\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\beta^{{\textsc{acq}}}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}$}}}}\gamma$	iff $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\beta$

Model 8 (rc_sc).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{sc}$}}}}}}\beta~{}\mbox{iff~{} $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\beta$ }$ except $\alpha^{{\textsc{rel}}}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{sc}$}}}}$}}}}\beta^{{\textsc{acq}}}$ .

rc_pc straightforwardly follows the intuition of (Gharachorloo et al., 1990), where a release action $\beta^{{\textsc{rel}}}$ is always blocked from reordering and hence all earlier instructions must be complete before it can execute, but it does not block later instructions from happening early (77) (provided $\beta$ does not on its own block later instructions, calculated by recursively applying the reordering relation). An acquire action is the converse (80). rc_sc strengthens rc_pc by additionally requiring order between release and acquire actions in the one thread (the reverse direction is already implied). Consider the behaviour of the classic “message passing” pattern ( ${\tt MP}$ ).

(81)

{\tt MP}\hskip 5.69054pt\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054ptx\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1\hskip 5.69054pt\parallel\hskip 5.69054ptr_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}x

Theorem 7.1.

$\{x=y=0\}\,{\tt MP}\,\{r_{\_}1=1\Rightarrow r_{\_}2=1\}$

Proof.

Straightforward by Owicki-Gries reasoning: the stores are executed in the order $x$ , $y$ , and read in reverse order, hence if the latter is observed the former must have taken effect.

Consider using the weaker rc_pc model with annotations.

{\tt MP}^{+}\hskip 5.69054pt\mathrel{\mathstrut{\widehat{=}}}\hskip 5.69054ptx\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{rc${}_{pc}$}}}}}$}}{\mathchar 24635\relax}$}}(y\mathbin{\mathtt{:\!=}}1)^{{\textsc{rel}}}\hskip 5.69054pt\parallel\hskip 5.69054pt(r_{\_}1\mathbin{\mathtt{:\!=}}y)^{{\textsc{acq}}}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{rc${}_{pc}$}}}}}$}}{\mathchar 24635\relax}$}}r_{\_}2\mathbin{\mathtt{:\!=}}x

Here the release annotation on the write to $y$ means that $y$ acts as a flag that $x$ has been written, and so if the other process sees the modification to $y$ via an acquire it must also see the write to $x$ .

Theorem 7.2.

$\{x=y=0\}\,{\tt MP}^{+}\,\{r_{\_}1=1\Rightarrow r_{\_}2=1\}$

Proof.

Using the definition of ${\tt MP}^{+}$ ,
$x\mathbin{\mathtt{:\!=}}1\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{rc${}_{pc}$}}}}}$}}{\mathchar 24635\relax}$}}(y\mathbin{\mathtt{:\!=}}1)^{{\textsc{rel}}}\hskip 5.69054pt\parallel\hskip 5.69054pt(r_{\_}1\mathbin{\mathtt{:\!=}}y)^{{\textsc{acq}}}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{rc${}_{pc}$}}}}}$}}{\mathchar 24635\relax}$}}r_{\_}2\mathbin{\mathtt{:\!=}}x$ $=$ $x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}(y\mathbin{\mathtt{:\!=}}1)^{{\textsc{rel}}}\hskip 5.69054pt\parallel\hskip 5.69054pt(r_{\_}1\mathbin{\mathtt{:\!=}}y)^{{\textsc{acq}}}\mathbin{.{}\mathchar 24635\relax}r_{\_}2\mathbin{\mathtt{:\!=}}x$

The equality holds by applying Law 36 in each process from (77) and (80). Now the proof follows using the same reasoning as Theorem 7.1 (annotations have no effect on sequential semantics, only reorderings).

Note that without the annotations the instructions in each process could be reordered according to ${\bf M\ref{mm:G}}$ , under which conditions it is straightforward to find a behaviour that contradicts $r_{\_}1=1\Rightarrow r_{\_}2=1$ .

8. ARM version 8

In this section we consider the latest version of ARM v8, which is simpler than earlier versions due to it being “multi-copy atomic” (Pulte et al., 2017). ARM’s instruction set has artificial barriers including a “control fence” ${\tt isb}\mathrel{\mathstrut{\widehat{=}}}\mathbf{barrier}(\mathsf{ctrl})$ , a write barrier ${\tt dsb.st}\mathrel{\mathstrut{\widehat{=}}}\mathbf{barrier}(\mathsf{ww})$ , and a full fence ${\tt dsb}\mathrel{\mathstrut{\widehat{=}}}\mathbf{fence}$ .

Model 9 (arm).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}}}\beta\quad\mbox{if\ \ $\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{sc}$}}}}}}\beta$ }$ except

(86)	$\displaystyle\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}{\tt dsb.st}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}\alpha$	if $\mathsf{isStore}(\alpha)$
(91)	$\displaystyle\llparenthesis b\rrparenthesis\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}{\tt isb}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}\alpha$	if $\mathsf{isLoad}(\alpha)$
(94)	$\displaystyle\llparenthesis b\rrparenthesis\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}\alpha$	if $\mathsf{isStore}(\alpha)$

Store fences maintain order between stores (86) (recall (6)), while control fences are blocked by branches and correspondingly block loads (91) (recall (7)); when taken in conjunction a control fence enforces order between loads within and before a branch, preventing the observable effects of speculative execution. Branches block stores, including independent stores (94); this is a practical consideration to do with speculating down branches: one cannot commit stores until it is known that the branch will be taken. Other than these exceptions, arm behaves as rc_sc for release/acquire annotations,⁷⁷7As mentioned in Sect. 7, ARM’s LDAPR explicitly weakens the ordering between release/acquire instructions, which can be handled by distinguishing annotations syntactically rather than within the memory model definition. fundamentally behaving as g (M4).

As an example of the weak nature of ARM, i.e., issuing loads before the branch condition for the load is evaluated, consider the following behaviour of a variant of the reader process of ${\tt MP}$ (81), where the second load is guarded. Define ${\tt MP}_{\_}w\mathrel{\mathstrut{\widehat{=}}}x\mathbin{\mathtt{:\!=}}1\mathbin{.{}\mathchar 24635\relax}y\mathbin{\mathtt{:\!=}}1$ and ${\tt MP}_{\_}r\mathrel{\mathstrut{\widehat{=}}}r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}(\mathrel{\mathbf{if}}r_{\_}1=1\mathrel{\mathbf{then}}r_{\_}2\mathbin{\mathtt{:\!=}}x)$ , where for brevity we leave the arm parameter implicit on conditionals.

Theorem 8.1.

$\neg\{x=y=0\}\,{\tt MP}_{\_}w\parallel{\tt MP}_{\_}r\,\{r_{\_}1=1\Rightarrow r_{\_}2=1\}$

Proof.

Consider the following behaviour of ${\tt MP}_{\_}r$ .
${\tt MP}_{\_}r\mathrel{\mathstrut{\widehat{=}}}$ $r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}(\mathrel{\mathbf{if}}r_{\_}1=1\mathrel{\mathbf{then}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x})$ $\mathrel{\sqsubseteq}$ $r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\llparenthesis r_{\_}1=1\rrparenthesis\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x}$ (16), Law 30 $\mathrel{\sqsubseteq}$ $r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x}\mathbin{.{}\mathchar 24635\relax}\llparenthesis r_{\_}1=1\rrparenthesis$ Law 37 by M4 $\mathrel{\sqsubseteq}$ $\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x}\mathbin{.{}\mathchar 24635\relax}r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}\llparenthesis r_{\_}1=1\rrparenthesis$ Law 37 by M4
The load of $x$ (underlined) may be reordered before the branch point, and subsequently before the load of $y$ . Even with the stores to $x$ and $y$ being strictly ordered in ${\tt MP}_{\_}w$ we can interleave this ordering so that the postcondition is invalidated, and complete the proof by Theorem 5.3(ii).

Hence under arm conditionals do not guarantee sequential order. Placing an ${\tt isb}$ instruction inside the branch, before the second load, however, prevents this behaviour. Define ${\tt MP}_{\_}{\tt isb}\mathrel{\mathstrut{\widehat{=}}}r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}(\mathrel{\mathbf{if}}r_{\_}1=1\mathrel{\mathbf{then}}{\tt isb}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x})$ .

Theorem 8.2.

$\{x=y=0\}\,{\tt MP}_{\_}w\parallel{\tt MP}_{\_}{\tt isb}\,\{r_{\_}1=1\Rightarrow r_{\_}2=1\}$

Proof.

Consider the following behaviour of ${\tt MP}_{\_}{\tt isb}$ .
$r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}(\mathrel{\mathbf{if}}r_{\_}1=1\mathrel{\mathbf{then}}{\tt isb}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x})$ $\mathrel{\sqsubseteq}$ $r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\llparenthesis r_{\_}1=1\rrparenthesis\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}{\tt isb}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{arm}}}}}$}}{\mathchar 24635\relax}$}}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x}$ (16); Law 30 $=$ $r_{\_}1\mathbin{\mathtt{:\!=}}y\mathbin{.{}\mathchar 24635\relax}\llparenthesis r_{\_}1=1\rrparenthesis\mathbin{.{}\mathchar 24635\relax}{\tt isb}\mathbin{.{}\mathchar 24635\relax}\underline{r_{\_}2\mathbin{\mathtt{:\!=}}x}$ Law 36 by (91)
The loads are strictly ordered and so the proof is completed straightforwardly using OG reasoning.

Conformance.

We validate our model using litmus tests (Alglave et al., 2011; Sarkar et al., 2011; Mador-Haim et al., 2010; Lustig et al., 2017; Bornholt and Torlak, 2017; Maranget et al., 2012). ARM has released an official axiomatic model using the herd tool (Alglave et al., 2014) available online via the herdtools7 application (Deacon and Alglave, 2016) (see (Arm Ltd, 2020), Sect. B2.3). Using the diy7 tool and the official model (Alglave, 2020) we generated a set of 99,881 litmus tests covering forbidden behaviours of up to 4 processes using the instruction types covered in M9. In addition we used a further 5757 litmus tests covering allowed and forbidden behaviours using the tests for an earlier version of ARM (Alglave et al., 2014) and a set covering more recent features (Maranget, [n.d.]). We ran these tests using the model checking tool based on the pipeline semantics in Sect. 3. In each case (approximately 105,000 tests) M9 agreed with the published model.

8.1. An axiomatic specification

Perhaps the best known way of describing memory models is via axioms over global traces. Below we give an axiomatic model using a straightforward translation of the reordering relationship M9. We refer to this model as ${\tt ax}_{\_}{ro}$ .

\displaystyle\begin{array}[]{r@{~}l}\texttt{{let}}~{}{\tt ARM}=~{}&[{\tt W}]\mathchar 24635\relax\;\!\!{\tt po}\mathchar 24635\relax\;\!\![{\tt dsb.st}]\mathchar 24635\relax\;\!\!{\tt po}\mathchar 24635\relax\;\!\![{\tt W}]~{}|~{}{\tt ctrl}\mathchar 24635\relax\;\!\![{\tt isb}]~{}|{}\\ ~{}&[{\tt isb}]\mathchar 24635\relax\;\!\!{\tt po}\mathchar 24635\relax\;\!\![{\tt R}]~{}|~{}{\tt ctrl}\mathchar 24635\relax\;\!\![{\tt W}]~{}|~{}{\tt po}\mathchar 24635\relax\;\!\![{\tt dsb}]\mathchar 24635\relax\;\!\!{\tt po}\\ \texttt{{let}}~{}{\tt RC}=~{}&{\tt[A]}\mathchar 24635\relax\;\!\!{\tt po}~{}|~{}{\tt po}\mathchar 24635\relax\;\!\!{\tt[L]}~{}|~{}{\tt[L]}\mathchar 24635\relax\;\!\!{\tt po}\mathchar 24635\relax\;\!\!{\tt[A]}\\ \texttt{{let}}~{}\texttt{{rec}}~{}{\tt ob}=~{}&{\tt ARM}~{}|~{}{\tt RC}~{}|~{}{\tt data}~{}|~{}{\tt data}\mathchar 24635\relax\;\!\!{\tt rfi}~{}|{}\\ ~{}&{\tt rfe}~{}|~{}{\tt fre}~{}|~{}{\tt co}~{}|~{}{\tt ob}\mathchar 24635\relax\;\!\!{\tt ob}\\ \lx@intercol\texttt{{acyclic}}~{}{\tt po\text{-}loc}~{}|~{}{\tt fr}~{}|~{}{\tt co}~{}|~{}{\tt rf}~{}\texttt{{as}}~{}\texttt{{internal}}{}\hfil\lx@intercol\\ \vskip 3.0pt\cr\lx@intercol\texttt{{irreflexive}}~{}{\tt ob}~{}\texttt{{as}}~{}\texttt{{external}}{}\hfil\lx@intercol\end{array}

An axiomatic specification is formed from relations over event traces, typically with acyclic or irreflexive constraints on the defined relations. Union is represented by ‘—’ and relational composition by ‘;’. The ARM relation is essentially a straight translation from M9, along with the fence constraint from M4. For instance, the po relation relates instructions in textual program order, and dsb is the set of dsb instructions in the program, with the square brackets denoting the identity relation on that set. Hence the constraint “ ${\tt po}\mathchar 24635\relax\;\!\![{\tt dsb}]\mathchar 24635\relax\;\!\!{\tt po}$ ” (the last in ARM) states a requirement that instructions before and after a fence must appear in that order in any trace. The remaining constraints in ARM are similarly translated, noting W and R are the store (write) and load (read) instructions, and ctrl relates instructions before and after a branch point. The RC relation captures release/acquire constraints ([L]/[A]) from M7/M8. The ob relation (observation) is recursively defined to include data dependencies, including forwarding (via ${\tt rfi}$ , “reads-from internal”), corresponding to M3, and the “reads-from” and “from-reads” relations, relating loads to corresponding and earlier stores, and the global coherence order $({\tt co})$ on stores. Note that these relations arise directly from our small step operational semantics. The definition of ${\tt ob}$ and the acyclic and irreflexive constraints, which govern internal (local) and external (global) views of the system, are based on the pattern of (Deacon and Alglave, 2016; Pulte et al., 2017) – see (Alglave et al., 2014) for more details on axiomatic specifications.

The ${\tt ax}_{\_}{ro}$ model (available in the supplementary material) agrees with the official model (Deacon and Alglave, 2016) on all 100,000+ litmus tests using herd7. Following the lead of (Pulte et al., 2017) we give a by-hand proof that the traces of the axiomatic model in Sect. 8.1 are the same as the traces obtained by application of the operational semantics.

Theorem 8.3.

The traces of a program $c$ allowed by ${\tt ax}_{\_}{ro}$ are exactly the traces of $(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}[]:~{}c)$ .

Proof.

Consider a trace $t$ of $(\mathbf{pline}_{\_}{{{\textsc{m}}}}~{}[]:~{}c)$ . This trace must be obtained by some original sequential trace $t^{\prime}$ of $c$ , fetched into the pipeline via (12), and then reordered by successive applications of (13). Without loss of generality consider the case where $t^{\prime}$ is fetched into the pipeline in its entirety before any commits, and is nontrivial, i.e., contains two or more actions. Then the pipeline is exactly $t^{\prime}$ and of the form ${\tt t}_{\_}1{{\mathbin{\raise 2.39197pt\hbox{$\mathchar 0\relax\@@cat$}}}}\alpha{{\mathbin{\raise 1.99332pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt t}_{\_}2{{\mathbin{\raise 1.66109pt\hbox{$\mathchar 0\relax\@@cat$}}}}\beta{{\mathbin{\raise 1.38423pt\hbox{$\mathchar 0\relax\@@cat$}}}}{\tt t}_{\_}3$ , with $\alpha$ occurring earlier than $\beta$ in program order in $c$ . If $\alpha$ and $\beta$ are related by ARM or RC then they must appear in order in any axiomatic trace of ${\tt ax}_{\_}{ro}$ ; and also they cannot be reordered using (13) (commit), which follows from the straightforward relationship between ARM & RC and M9 & M8, respectively. Forwarding is covered by the internal /external division in axiomatic models: the internal (local) constraints are more strict, meaning locally sequential semantics is maintained, but externally (globally) actions may appear to occur out of order. The fundamental constraints of M3 and M4 are captured by ${\tt data}$ and ${\tt po\text{-}loc}$ , with full fences captured by ${\tt po}\mathchar 24635\relax\;\!\![{\tt dsb}]\mathchar 24635\relax\;\!\!{\tt po}$ . The acyclic and irreflexive constraints are implicit in an operational semantics – a trace is always strictly ordered, and in particular loads can only access previous stores, it is not possible to access “future” stores. Hence the ${\tt rf}$ , ${\tt fr}$ and ${\tt co}$ constraints are implicitly enforced – these govern the interaction between loads and stores in a trace. In summary, the pointwise description of arm translates straightforwardly to axioms over traces, where the program order ( ${\tt po}$ ) relation captures the intervening actions.

9. RISC-V

The RISC-V memory model (RISC-V International, 2017; Armstrong et al., 2019) is influenced by ARM’s weak ordering on loads and stores (corresponding to g), but has release consistency annotations using the weaker rc_pc (M7) rather than the stronger rc_sc (M8). It also defines six different types of artificial barriers (more are technically possible but their use is not recommended (Armstrong et al., 2019)): a full fence given by ${\tt fence~{}{\tt rw,rw}}\mathrel{\mathstrut{\widehat{=}}}\mathbf{fence}$ ; a store fence given by ${\tt fence~{}{\tt w,w}}\mathrel{\mathstrut{\widehat{=}}}\mathbf{barrier}(\mathsf{ww})$ (identical to ARM’s ${\tt dsb.st}$ ); a corresponding load fence ${\tt fence~{}{\tt r,r}}$ ; two new types ${\tt fence~{}{\tt rw,w}}$ and ${\tt fence~{}{\tt r,rw}}$ described below; and a barrier used to mimic TSO’s in-built weakening where loads can come before stores, which we define as ${\tt fence{\tt.tso}}\mathrel{\mathstrut{\widehat{=}}}{\tt fence~{}{\tt r,rw}}\mathbin{\raisebox{0.0pt}{$\overset{\raisebox{0.0pt}{${{{{\textsc{risc-v}}}}}$}}{\mathchar 24635\relax}$}}{\tt fence~{}{\tt rw,w}}$ . Additionally RISC-V has a barrier ${\tt fence{\tt.i}}$ which has a technical specification beyond what is considered here, and so it is defined as a no-op ( $\tau$ ).

Model 10 (risc-v).

$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}}}\beta~{}\mbox{iff~{}$\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{rc${}_{pc}$}}}}}}\beta$ }$ except

(99)	$\displaystyle\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}{\tt fence~{}{\tt r,r}}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{arm}}}}$}}}}\alpha$	iff $\mathsf{isLoad}(\alpha)$
(102)	$\displaystyle\alpha\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}$}}}}{\tt fence~{}{\tt rw,w}}\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}}}\beta$	iff $\mathsf{isLoad}(\beta)$
(105)	$\displaystyle\alpha\mathop{{\color[rgb]{0.2.,0.5,0.2}\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}}}{\tt fence~{}{\tt r,rw}}\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}$}}}}\beta$	iff $\mathsf{isStore}(\alpha)$
(108)	$\displaystyle\llparenthesis b\rrparenthesis\mathrel{{\color[rgb]{1.,0.,0.}\ooalign{\hss{/\hss}\cr{$\ext@arrow 0055{\Leftarrowfill@}{}{\,{{\textsc{risc-v}}}}$}}}}\alpha$	iff $\mathsf{isStore}(\alpha)$

RISC-V’s load fence, ${\tt fence~{}{\tt r,r}}$ , restricts ordering with loads (99), and is is the straightforward dual of ARM’s store fence ( ${\tt dsb.st}$ , (86)). RISC-V’s ${\tt fence~{}{\tt rw,w}}$ barrier is intended to maintain order between loads and stores and later stores only, allowing later loads to potentially come earlier; it therefore allows reordering of loads, but blocks everything else (102). Similarly the ${\tt fence~{}{\tt r,rw}}$ barrier ensures order between loads and later loads and stores, and hence can ‘jump’ over stores but is blocked by loads (105), which therefore are strictly ordered with later loads and stores. Like ARM, RISC-V prevents stores from taking effect before branches are resolved (108) (see (RISC-V International, 2017)[Rule 11, A.3.8]).

Conformance

We tested our model against the litmus tests outlined in the RISC-V manual (RISC-V International, 2017) and made available online with expected results (Flur and Maranget, 2019). Restricting attention to those tests involving instructions we consider in M10 (and M7) our tests agree with the official model in all 3937 cases, covering the rc_pc behaviours and the six barrier types defined above, with M4 controlling interactions between stores and loads.

10. Related work

There has been significant work in defining the semantics of processor-level instruction reordering since the 1980s (Lamport, 1979; Dubois et al., 1986; Shasha and Snir, 1988) and more recently under the umbrella of weak memory models (Lahav et al., 2016; Atig et al., 2012; Boudol and Petri, 2009; Boudol et al., 2012; Jagadeesan et al., 2012; Ridge, 2010; Crary and Sullivan, 2015; Batty et al., 2015; Abe and Maeda, 2016; Hou et al., 2016; Kavanagh and Brookes, 2018; Doherty et al., 2019). To the best of our knowledge we are the first to encode the basis for instruction reordering as a parameter to the language, rather than as a parameter to the semantics. This has allowed us to describe relevant properties and relationships at the program level, based on per-process properties about how individual cores manage their pipeline, and supporting program-structure-based reasoning techniques. In the literature weak memory model specifications are typically described with respect to properties of the global system. The most well-used framework in this style is the axiomatic approach (Steinke and Nutt, 2004; Mador-Haim et al., 2012; Wickerson et al., 2017), perhaps best exemplified by Alglave et al. (Alglave et al., 2014), in which many of the behaviours of diverse processors such as ARM, IBM’s POWER, and Intel’s x86 are described, and common properties elucidated, and whose approach we compared with in Sect. 8.1. Another common approach to formalisation is with a semantics that is closer to the behaviour of a real microarchitecture, e.g., (Sarkar et al., 2011; Flur et al., 2016; Pulte et al., 2017) (we gave a direct semantic comparison to the operational model of (Sewell et al., 2010; Owens et al., 2009) in Sect. 6). In both the axiomatic and the concrete style it is more difficult to derive abstract properties and to reason about a particular system over the structure of the program, however both give rise to efficient tools for model checking (Atig et al., 2010; Abdulla et al., 2017; Abdulla et al., 2016; Atig et al., 2011; Abd Alrahman et al., 2014; Kokologiannakis and Vafeiadis, 2020; Kokologiannakis et al., 2017; Abdulla et al., 2019a, b).

The Promising semantics (Kang et al., 2017; Pulte et al., 2019; Lee et al., 2020) is operational and can be instantiated with different memory models (including software memory models), and a proof framework has been developed for reasoning about programs. Weak behaviours are governed by abstract global data structures that maintain a partial order on events, and hence the semantics is defined and reasoning is performed with respect to these global data structures.

This paper supersedes (Colvin and Smith, 2018), which defines only a simple prefixing command for actions (a special case of parallelized sequential composition). That paper does not consider a general theory for memory models (Sect. 2), nor consider pipelines, and does not address TSO, Release Consistency, or RISC-V (but does consider POWER), and showed conformance for ARM against an older version without release/acquire atomics, against a much smaller set of litmus tests (approximately 400 vs over 100,000 in this paper). That theory is not machine-checked, contains only a few simple refinement rules, and does not employ Owicki-Gries reasoning.

The “PipeCheck” framework of Lustig, Martonosi et al. (Lustig et al., 2014; Manerkar et al., 2018; Trippel et al., 2018, 2019) is designed to validate that processors faithfully implement their intended memory model, using a detailed pipeline semantics based on an axiomatic specification. Given that our approach has an underlying link to the behaviour of pipelines it may be possible to extend our framework so that it can make use of those existing tools for processor validation.

Our operational approach based on out-of-order instruction execution follows work such as Arvind et al. (Xiaowei Shen and Arvind, 1999; Arvind and Maessen, 2006; Zhang et al., 2018), and the development of the Release Consistency and related models (Gharachorloo et al., 1990; Adve and Hill, 1990; Gharachorloo et al., 1991; Adve and Hill, 1993; Adve and Boehm, 2010; Zhang et al., 2017). The algebraic approach we adopt to reducing programs is similar in style to the Concurrent Kleene Algebra (Hoare et al., 2009), where sequential and parallel composition contribute to the event ordering.

11. Conclusion

In this paper we have formalised instruction-level parallelism (ILP), a feature of processors since the 1960s, and a major factor in the weak behaviours associated with modern memory consistency models. We showed how modern memory models build on generic properties of instruction reordering with respect to preservation of sequential semantics, calculated pointwise on instructions. We gave a straightforward abstract semantics of a pipeline, lifting pointwise comparison to sequences of instructions. We also lifted the comparison to the command level, and hence defined a program operator (parallelized sequential composition) which is behaviourally equivalent to using a pipeline, but supports compositional reasoning about behaviours over the structure of parallel processes. We empirically validated the models for large sets of litmus tests for ARM and RISC, showed that the reordering semantics for TSO is equivalent to the established store buffer semantics, and showed how sterotypical results emerge across a range of models, for instance, the store buffer pattern of TSO where loads can come before stores, the message passing paradigm from release consistency using release/acquire flags to control interprocess communication, and load speculation from ARM. We provide in the supplementary material a model checker based on the pipeline semantics in Maude (Clavel et al., 2002; Verdejo and Martí-Oliet, 2006) and encoded and machine-checked the theory for the IMP+pseq language in Isabelle (Nipkow et al., 2002; Paulson, 1994).

The reasoning style in this framework is more direct than many in the literature, in that the nondeterminism due to ILP can be made explicit in the structure of the program, and either shown to have no effect on desired properties, or a specific trace that contradicts a desired property can be elucidated. In the literature reasoning about memory models is often with respect to orderings over global event traces, rather than per-process reorderings. Our approach, e.g., to prepare for the application of the Owicki-Gries method, appears to be simpler than techniques that incorporate memory model constraints directly (Lahav and Vafeiadis, 2015; Alglave and Cousot, 2017; Dalvandi et al., 2020). Furthermore, we could have instead chosen to apply rely/guarantee reasoning (Jones, 1983a, b) once the programs were reduced to concurrent sequential processes. The reduction of commands involving parallelized sequential composition to a sequential form (e.g., Laws 38 and 39) might be infeasible for large programs with few memory barriers, but such programs are not typically where one is concerned with memory models: more commonly memory models apply to library data structures and low-level applications where shared-variable communication is tightly controlled (Lê et al., 2013; Kang, [n.d.]). An advantage of our language-level encoding is that other program properties can be tackled in a familiar setting and at the program level, for instance, information flow logic for security or progress properties (Smith et al., 2019; Colvin and Winter, 2020), making use of existing frameworks directly.

As future work we intend to extend the semantics to cover other features of modern processors that contribute to their memory models, for instance, POWER’s cache system that lacks multicopy atomicity, TSO’s global locks, and ARM’s global monitor for controlling load-linked/store-conditional instructions. As these features are global they cannot be captured directly as behaviours of per-processor pipelines. A proof technique for this extended framework will incorporate features from existing work, perhaps as a hybrid separating local reordering from global behaviours (Alglave et al., 2013; Kang et al., 2017). Our relatively simple semantics for the pipeline contains only two stages, fetch and commit, neglecting in particular retirement; this could be crucial to include in security vulnerability analysis; for instance, it is the retirement of loads where invalid memory accesses are detected, and this in association with early load speculation leads to the Meltdown vulnerability (Lipp et al., 2018; Trippel et al., 2018).

References

(1)
Abd Alrahman et al. (2014) Yehia Abd Alrahman, Marina Andric, Alessandro Beggiato, and Alberto Lluch Lafuente. 2014. Can We Efficiently Check Concurrent Programs Under Relaxed Memory Models in Maude?. In Rewriting Logic and Its Applications, Santiago Escobar (Ed.). Springer International Publishing, Cham, 21–41.
Abdulla et al. (2019a) Parosh Aziz Abdulla, Jatin Arora, Mohamed Faouzi Atig, and Shankaranarayanan Krishna. 2019a. Verification of Programs under the Release-Acquire Semantics. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 1117–1132. https://doi.org/10.1145/3314221.3314649
Abdulla et al. (2017) Parosh Aziz Abdulla, Mohamed Faouzi Atig, Ahmed Bouajjani, and Tuan Phong Ngo. 2017. Context-Bounded Analysis for POWER. In Proceedings, Part II, of the 23rd International Conference on Tools and Algorithms for the Construction and Analysis of Systems - Volume 10206. Springer-Verlag, Berlin, Heidelberg, 56–74. https://doi.org/10.1007/978-3-662-54580-5_4
Abdulla et al. (2016) Parosh Aziz Abdulla, Mohamed Faouzi Atig, Bengt Jonsson, and Carl Leonardsson. 2016. Stateless Model Checking for POWER. In Computer Aided Verification, Swarat Chaudhuri and Azadeh Farzan (Eds.). Springer International Publishing, Cham, 134–156.
Abdulla et al. (2019b) Parosh Aziz Abdulla, Mohamed Faouzi Atig, and Rojin Rezvan. 2019b. Parameterized Verification under TSO is PSPACE-Complete. Proc. ACM Program. Lang. 4, POPL, Article 26 (Dec. 2019), 29 pages. https://doi.org/10.1145/3371094
Abe and Maeda (2016) Tatsuya Abe and Toshiyuki Maeda. 2016. Observation-Based Concurrent Program Logic for Relaxed Memory Consistency Models. In Programming Languages and Systems, Atsushi Igarashi (Ed.). Springer International Publishing, Cham, 63–84.
Adve and Boehm (2010) Sarita V. Adve and Hans-J. Boehm. 2010. Memory Models: A Case for Rethinking Parallel Languages and Hardware. Commun. ACM 53, 8 (Aug. 2010), 90–101. https://doi.org/10.1145/1787234.1787255
Adve and Gharachorloo (1996) S. V. Adve and K. Gharachorloo. 1996. Shared memory consistency models: a tutorial. Computer 29, 12 (Dec 1996), 66–76. https://doi.org/10.1109/2.546611
Adve and Hill (1990) Sarita V. Adve and Mark D. Hill. 1990. Weak Ordering—a New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, USA) (ISCA ’90). Association for Computing Machinery, New York, NY, USA, 2–14. https://doi.org/10.1145/325164.325100
Adve and Hill (1993) S. V. Adve and M. D. Hill. 1993. A unified formalization of four shared-memory models. IEEE Transactions on Parallel and Distributed Systems 4, 6 (1993), 613–624.
Alglave (2012) Jade Alglave. 2012. A formal hierarchy of weak memory models. Formal Methods in System Design 41, 2 (01 Oct 2012), 178–210. https://doi.org/10.1007/s10703-012-0161-5
Alglave (2020) Jade Alglave. 2020. How to generate litmus tests automatically with the diy7 tool. https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/generate-litmus-tests-automatically-diy7-tool (Accessed June 2020).
Alglave and Cousot (2017) Jade Alglave and Patrick Cousot. 2017. Ogre and Pythia: An Invariance Proof Method for Weak Consistency Models. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (Paris, France) (POPL 2017). ACM, New York, NY, USA, 3–18. https://doi.org/10.1145/3009837.3009883
Alglave et al. (2013) Jade Alglave, Daniel Kroening, Vincent Nimal, and Michael Tautschnig. 2013. Software Verification for Weak Memory via Program Transformation. In Programming Languages and Systems, Matthias Felleisen and Philippa Gardner (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 512–532.
Alglave et al. (2011) Jade Alglave, Luc Maranget, Susmit Sarkar, and Peter Sewell. 2011. Litmus: Running Tests against Hardware. In Tools and Algorithms for the Construction and Analysis of Systems: 17th International Conference, TACAS 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011, Saarbrücken, Germany, March 26–April 3, 2011. Proceedings, Parosh Aziz Abdulla and K. Rustan M. Leino (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 41–44. https://doi.org/10.1007/978-3-642-19835-9_5
Alglave et al. (2014) Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
Anderson et al. (1967) D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. 1967. The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling. IBM Journal of Research and Development 11, 1 (1967), 8–24.
Arm Ltd (2020) Arm Ltd 2020. Arm® Architecture Reference Manual, for the Armv8-A architecture profile. Arm Ltd. https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile
Armstrong et al. (2019) Alasdair Armstrong, Thomas Bauereiss, Brian Campbell, Alastair Reid, Kathryn E. Gray, Robert M. Norton, Prashanth Mundkur, Mark Wassell, Jon French, Christopher Pulte, Shaked Flur, Ian Stark, Neel Krishnaswami, and Peter Sewell. 2019. ISA Semantics for ARMv8-a, RISC-V, and CHERI-MIPS. Proc. ACM Program. Lang. 3, POPL, Article 71 (Jan. 2019), 31 pages. https://doi.org/10.1145/3290384
Arvind and Maessen (2006) Arvind and Jan-Willem Maessen. 2006. Memory Model = Instruction Reordering + Store Atomicity. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA ’06). IEEE Computer Society, USA, 29–40. https://doi.org/10.1109/ISCA.2006.26
Atig et al. (2010) Mohamed Faouzi Atig, Ahmed Bouajjani, Sebastian Burckhardt, and Madanlal Musuvathi. 2010. On the Verification Problem for Weak Memory Models. In Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Madrid, Spain) (POPL ’10). Association for Computing Machinery, New York, NY, USA, 7–18. https://doi.org/10.1145/1706299.1706303
Atig et al. (2012) Mohamed Faouzi Atig, Ahmed Bouajjani, Sebastian Burckhardt, and Madanlal Musuvathi. 2012. What’s Decidable about Weak Memory Models?. In Programming Languages and Systems, Helmut Seidl (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 26–46.
Atig et al. (2011) Mohamed Faouzi Atig, Ahmed Bouajjani, and Gennaro Parlato. 2011. Getting Rid of Store-Buffers in TSO Analysis. In Computer Aided Verification, Ganesh Gopalakrishnan and Shaz Qadeer (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 99–115.
Batty et al. (2015) Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. 2015. The Problem of Programming Language Concurrency Semantics. In Programming Languages and Systems, Jan Vitek (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 283–307.
Boehm and Adve (2008) Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ Concurrency Memory Model. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (Tucson, AZ, USA) (PLDI ’08). ACM, New York, NY, USA, 68–78. https://doi.org/10.1145/1375581.1375591
Bornholt and Torlak (2017) James Bornholt and Emina Torlak. 2017. Synthesizing Memory Models from Framework Sketches and Litmus Tests. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (Barcelona, Spain) (PLDI 2017). Association for Computing Machinery, New York, NY, USA, 467–481. https://doi.org/10.1145/3062341.3062353
Boudol and Petri (2009) Gérard Boudol and Gustavo Petri. 2009. Relaxed Memory Models: An Operational Approach. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Savannah, GA, USA) (POPL ’09). Association for Computing Machinery, New York, NY, USA, 392–403. https://doi.org/10.1145/1480881.1480930
Boudol et al. (2012) Gérard Boudol, Gustavo Petri, and Bernard Serpette. 2012. Relaxed Operational Semantics of Concurrent Programming Languages. Electronic Proceedings in Theoretical Computer Science 89 (Aug 2012), 19–33. https://doi.org/10.4204/eptcs.89.3
Clavel et al. (2002) Manuel Clavel, Francisco Duran, Steven Eker, Patrick Lincoln, Narciso Marti-Oliet, José Meseguer, and José F. Quesada. 2002. Maude: specification and programming in rewriting logic. Theoretical Computer Science 285, 2 (2002), 187 – 243. https://doi.org/10.1016/S0304-3975(01)00359-0
Colvin and Smith (2018) Robert J. Colvin and Graeme Smith. 2018. A Wide-Spectrum Language for Verification of Programs on Weak Memory Models. In Formal Methods, Klaus Havelund, Jan Peleska, Bill Roscoe, and Erik de Vink (Eds.). Springer International Publishing, Cham, 240–257. https://doi.org/10.1007/978-3-319-95582-7_14
Colvin and Winter (2020) Robert J. Colvin and Kirsten Winter. 2020. An Abstract Semantics of Speculative Execution for Reasoning About Security Vulnerabilities. In Formal Methods. FM 2019 International Workshops, Emil Sekerinski, Nelma Moreira, José N. Oliveira, Daniel Ratiu, Riccardo Guidotti, Marie Farrell, Matt Luckcuck, Diego Marmsoler, José Campos, Troy Astarte, Laure Gonnord, Antonio Cerone, Luis Couto, Brijesh Dongol, Martin Kutrib, Pedro Monteiro, and David Delmas (Eds.). Springer International Publishing, Cham, 323–341.
Crary and Sullivan (2015) Karl Crary and Michael J. Sullivan. 2015. A Calculus for Relaxed Memory. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Mumbai, India) (POPL ’15). Association for Computing Machinery, New York, NY, USA, 623–636. https://doi.org/10.1145/2676726.2676984
Dalvandi et al. (2020) Sadegh Dalvandi, Simon Doherty, Brijesh Dongol, and Heike Wehrheim. 2020. Owicki-Gries Reasoning for C11 RAR. In 34th European Conference on Object-Oriented Programming (ECOOP 2020) (Leibniz International Proceedings in Informatics (LIPIcs), Vol. 166), Robert Hirschfeld and Tobias Pape (Eds.). Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 11:1–11:26. https://doi.org/10.4230/LIPIcs.ECOOP.2020.11
Deacon and Alglave (2016) Will Deacon and Jade Alglave. 2016. The herd ARMv8 model. https://github.com/herd/herdtools7/blob/master/herd/libdir/aarch64.cat (Accessed June 2020).
Dijkstra and Scholten (1990) Edsger W. Dijkstra and Carel S. Scholten. 1990. Predicate Calculus and Program Semantics. Springer-Verlag, Berlin, Heidelberg.
Doherty et al. (2019) Simon Doherty, Brijesh Dongol, Heike Wehrheim, and John Derrick. 2019. Verifying C11 Programs Operationally. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming (Washington, District of Columbia) (PPoPP ’19). ACM, New York, NY, USA, 355–365. https://doi.org/10.1145/3293883.3295702
Dubois et al. (1986) M. Dubois, C. Scheurich, and F. Briggs. 1986. Memory Access Buffering in Multiprocessors. In Proceedings of the 13th Annual International Symposium on Computer Architecture (Tokyo, Japan) (ISCA ’86). IEEE Computer Society Press, Washington, DC, USA, 434–442.
Fischer and Ladner (1979) Michael J. Fischer and Richard E. Ladner. 1979. Propositional dynamic logic of regular programs. J. Comput. System Sci. 18, 2 (1979), 194 – 211. https://doi.org/10.1016/0022-0000(79)90046-1
Flur et al. (2016) Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL ’16). ACM, New York, NY, USA, 608–621. https://doi.org/10.1145/2837614.2837615
Flur and Maranget (2019) Shaked Flur and Luc Maranget. 2019. RISC-V architecture concurrency model litmus tests. https://github.com/litmus-tests/litmus-tests-riscv (Accessed June 2020).
Frigo and Luchangco (1998) Matteo Frigo and Victor Luchangco. 1998. Computation-Centric Memory Models. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (Puerto Vallarta, Mexico) (SPAA ’98). Association for Computing Machinery, New York, NY, USA, 240–249. https://doi.org/10.1145/277651.277690
Gharachorloo et al. (1991) Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. 1991. Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, California, USA) (ASPLOS IV). Association for Computing Machinery, New York, NY, USA, 245–257. https://doi.org/10.1145/106972.106997
Gharachorloo et al. (1990) Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture (Seattle, Washington, USA) (ISCA ’90). Association for Computing Machinery, New York, NY, USA, 15–26. https://doi.org/10.1145/325164.325102
Hill (1998) M. D. Hill. 1998. Multiprocessors should support simple memory consistency models. Computer 31, 8 (1998), 28–34.
Hoare (1969) C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576–580. https://doi.org/10.1145/363235.363259
Hoare (1985) C. A. R. Hoare. 1985. Communicating Sequential Processes. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Hoare et al. (2009) C. A. R. Tony Hoare, Bernhard Möller, Georg Struth, and Ian Wehrman. 2009. Concurrent Kleene Algebra. In CONCUR 2009 - Concurrency Theory: 20th International Conference, CONCUR 2009, Bologna, Italy, September 1-4, 2009. Proceedings, Mario Bravetti and Gianluigi Zavattaro (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 399–414. https://doi.org/10.1007/978-3-642-04081-8_27
Hou et al. (2016) Zhe Hou, David Sanan, Alwen Tiu, Yang Liu, and Koh Chuen Hoa. 2016. An Executable Formalisation of the SPARCv8 Instruction Set Architecture: A Case Study for the LEON3 Processor. In FM 2016: Formal Methods, John Fitzgerald, Constance Heitmeyer, Stefania Gnesi, and Anna Philippou (Eds.). Springer International Publishing, Cham, 388–405.
Intel (2019) Intel 2019. Intel 64 and IA-32 Architectures Software Developer’s Manual. Intel.
Jagadeesan et al. (2012) Radha Jagadeesan, Gustavo Petri, and James Riely. 2012. Brookes Is Relaxed, Almost!. In Foundations of Software Science and Computational Structures, Lars Birkedal (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 180–194.
Jones (1983a) Cliff B. Jones. 1983a. Specification and Design of (Parallel) Programs. In IFIP Congress. 321–332.
Jones (1983b) Cliff B. Jones. 1983b. Tentative steps toward a development method for interfering programs. ACM Trans. Program. Lang. Syst. 5 (October 1983), 596–619. Issue 4. https://doi.org/10.1145/69575.69577
Kang ([n.d.]) Jeehoon Kang. [n.d.]. An implementation of concurrent work-stealing double-ended queue (deque) on Crossbeam, and its correctness proof. https://github.com/jeehoonkang/crossbeam-rfcs/blob/deque-proof/text/2018-01-07-deque-proof.md. Accessed: 16-01-2020.
Kang et al. (2017) Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A Promising Semantics for Relaxed-memory Concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (Paris, France) (POPL 2017). ACM, New York, NY, USA, 175–189. https://doi.org/10.1145/3009837.3009850
Kavanagh and Brookes (2018) Ryan Kavanagh and Stephen Brookes. 2018. A Denotational Semantics for SPARC TSO. Electronic Notes in Theoretical Computer Science 336 (2018), 223 – 239. https://doi.org/10.1016/j.entcs.2018.03.025 The Thirty-third Conference on the Mathematical Foundations of Programming Semantics (MFPS XXXIII).
Kocher et al. (2019) P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom. 2019. Spectre Attacks: Exploiting Speculative Execution. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1–19.
Kokologiannakis et al. (2017) Michalis Kokologiannakis, Ori Lahav, Konstantinos Sagonas, and Viktor Vafeiadis. 2017. Effective Stateless Model Checking for C/C++ Concurrency. Proc. ACM Program. Lang. 2, POPL, Article 17 (Dec. 2017), 32 pages. https://doi.org/10.1145/3158105
Kokologiannakis and Vafeiadis (2020) Michalis Kokologiannakis and Viktor Vafeiadis. 2020. HMC: Model Checking for Hardware Memory Models. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (ASPLOS ’20). Association for Computing Machinery, New York, NY, USA, 1157–1171. https://doi.org/10.1145/3373376.3378480
Kozen (2000) Dexter Kozen. 2000. On Hoare Logic and Kleene Algebra with Tests. ACM Trans. Comput. Logic 1, 1 (July 2000), 60–76. https://doi.org/10.1145/343369.343378
Lahav et al. (2016) Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. 2016. Taming Release-Acquire Consistency. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL ’16). Association for Computing Machinery, New York, NY, USA, 649–662. https://doi.org/10.1145/2837614.2837643
Lahav and Vafeiadis (2015) Ori Lahav and Viktor Vafeiadis. 2015. Owicki-Gries Reasoning for Weak Memory Models. In Automata, Languages, and Programming, Magnús M. Halldórsson, Kazuo Iwama, Naoki Kobayashi, and Bettina Speckmann (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 311–323.
Lamport (1979) Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Comput. C-28, 9 (1979), 690–691.
Lê et al. (2013) Nhat Minh Lê, Antoniu Pop, Albert Cohen, and Francesco Zappa Nardelli. 2013. Correct and Efficient Work-stealing for Weak Memory Models. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Shenzhen, China) (PPoPP ’13). ACM, New York, NY, USA, 69–80. https://doi.org/10.1145/2442516.2442524
Lee et al. (2020) Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: Global Optimizations in Relaxed Memory Concurrency. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI 2020). Association for Computing Machinery, New York, NY, USA, 362–376. https://doi.org/10.1145/3385412.3386010
Lenoski et al. (1992) D. Lenoski, J. Laudon, K. Gharachorloo, W. . Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam. 1992. The Stanford Dash multiprocessor. Computer 25, 3 (1992), 63–79.
Lipp et al. (2018) Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Anders Fogh, Jann Horn, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Michael Hamburg. 2018. Meltdown: Reading Kernel Memory from User Space. In USENIX Security Symposium.
Lustig et al. (2014) D. Lustig, M. Pellauer, and M. Martonosi. 2014. PipeCheck: Specifying and Verifying Microarchitectural Enforcement of Memory Consistency Models. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. 635–646.
Lustig et al. (2017) Daniel Lustig, Andrew Wright, Alexandros Papakonstantinou, and Olivier Giroux. 2017. Automated Synthesis of Comprehensive Memory Model Litmus Test Suites. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17). Association for Computing Machinery, New York, NY, USA, 661–675. https://doi.org/10.1145/3037697.3037723
Mador-Haim et al. (2010) Sela Mador-Haim, Rajeev Alur, and Milo M. K. Martin. 2010. Generating Litmus Tests for Contrasting Memory Consistency Models. In Computer Aided Verification: 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings, Tayssir Touili, Byron Cook, and Paul Jackson (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 273–287. https://doi.org/10.1007/978-3-642-14295-6_26
Mador-Haim et al. (2012) Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek Williams. 2012. An Axiomatic Memory Model for POWER Multiprocessors. In Proceedings of the 24th International Conference on Computer Aided Verification (Berkeley, CA) (CAV’12). Springer-Verlag, Berlin, Heidelberg, 495–512. https://doi.org/10.1007/978-3-642-31424-7_36
Manerkar et al. (2018) Y. A. Manerkar, D. Lustig, M. Martonosi, and A. Gupta. 2018. PipeProof: Automated Memory Consistency Proofs for Microarchitectural Specifications. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 788–801.
Maranget ([n.d.]) Luc Maranget. [n.d.]. AArch64 Model vs. Hardware. http://moscova.inria.fr/~maranget/cats7/model-aarch64/ (Accessed January 2020).
Maranget et al. (2012) Luc Maranget, Susmit Sarkar, and Peter Sewell. 2012. A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
Milner (1982) Robin Milner. 1982. A Calculus of Communicating Systems. Springer-Verlag New York, Inc.
Nipkow and Nieto (1999) Tobias Nipkow and Leonor Prensa Nieto. 1999. Owicki/Gries in Isabelle/HOL. In Fundamental Approaches to Software Engineering, Jean-Pierre Finance (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 188–203.
Nipkow et al. (2002) Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. 2002. Isabelle/HOL — A Proof Assistant for Higher-Order Logic. LNCS, Vol. 2283. Springer.
Owens et al. (2009) Scott Owens, Susmit Sarkar, and Peter Sewell. 2009. A Better x86 Memory Model: x86-TSO. In Theorem Proving in Higher Order Logics, Stefan Berghofer, Tobias Nipkow, Christian Urban, and Makarius Wenzel (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 391–407.
Owicki and Gries (1976) Susan Owicki and David Gries. 1976. An Axiomatic Proof Technique for Parallel Programs I. Acta Inf. 6, 4 (Dec. 1976), 319–340. https://doi.org/10.1007/BF00268134
Paulson (1994) Lawrence C. Paulson. 1994. Isabelle: a generic theorem prover. Springer Verlag.
Plotkin (2004) Gordon D. Plotkin. 2004. A structural approach to operational semantics. J. Log. Algebr. Program. 60-61 (2004), 17–139.
Pulte et al. (2017) Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2017. Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8. Proc. ACM Program. Lang. 2, POPL, Article 19 (Dec. 2017), 29 pages. https://doi.org/10.1145/3158107
Pulte et al. (2019) Christopher Pulte, Jean Pichon-Pharabod, Jeehoon Kang, Sung-Hwan Lee, and Chung-Kil Hur. 2019. Promising-ARM/RISC-V: A Simpler and Faster Operational Concurrency Model. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3314221.3314624
Ridge (2010) Tom Ridge. 2010. A Rely-Guarantee Proof System for x86-TSO. In Verified Software: Theories, Tools, Experiments, Gary T. Leavens, Peter O’Hearn, and Sriram K. Rajamani (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 55–70.
RISC-V International (2017) RISC-V International 2017. The RISC-V Instruction Set Manual. Volume I: User-Level ISA; Volume II: Privileged Architecture. RISC-V International. https://riscv.org/specifications/
Roscoe (1998) A.W. Roscoe. 1998. The Theory and Practice of Concurrency. Prentice Hall.
Sarkar et al. (2011) Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 2011. Understanding POWER Multiprocessors. SIGPLAN Not. 46, 6 (June 2011), 175–186. https://doi.org/10.1145/1993316.1993520
Sewell et al. (2010) Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. 2010. x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors. Commun. ACM 53, 7 (July 2010), 89–97. https://doi.org/10.1145/1785414.1785443
Shasha and Snir (1988) Dennis Shasha and Marc Snir. 1988. Efficient and Correct Execution of Parallel Programs That Share Memory. ACM Trans. Program. Lang. Syst. 10, 2 (April 1988), 282–312. https://doi.org/10.1145/42190.42277
Smith et al. (2019) Graeme Smith, Nicholas Coughlin, and Toby Murray. 2019. Value-Dependent Information-Flow Security on Weak Memory Models. In Formal Methods – The Next 30 Years, Maurice H. ter Beek, Annabelle McIver, and José N. Oliveira (Eds.). Springer International Publishing, Cham, 539–555.
Sorin et al. (2011) Daniel J. Sorin, Mark D. Hill, and David A. Wood. 2011. A Primer on Memory Consistency and Cache Coherence (1st ed.). Morgan & Claypool Publishers.
Steinke and Nutt (2004) Robert C. Steinke and Gary J. Nutt. 2004. A Unified Theory of Shared Memory Consistency. J. ACM 51, 5 (Sept. 2004), 800–849. https://doi.org/10.1145/1017460.1017464
Thornton (1964) James E. Thornton. 1964. Parallel Operation in the Control Data 6600. In Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems (San Francisco, California) (AFIPS ’64 (Fall, part II)). Association for Computing Machinery, New York, NY, USA, 33–40. https://doi.org/10.1145/1464039.1464045
Tomasulo (1967) R. M. Tomasulo. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11, 1 (1967), 25–33.
Trippel et al. (2018) Caroline Trippel, Daniel Lustig, and Margaret Martonosi. 2018. MeltdownPrime and SpectrePrime: Automatically-Synthesized Attacks Exploiting Invalidation-Based Coherence Protocols. CoRR abs/1802.03802 (2018). arXiv:1802.03802 http://arxiv.org/abs/1802.03802
Trippel et al. (2019) C. Trippel, D. Lustig, and M. Martonosi. 2019. Security Verification via Automatic Hardware-Aware Exploit Synthesis: The CheckMate Approach. IEEE Micro 39, 3 (2019), 84–93. https://doi.org/10.1109/MM.2019.2910010
Trippel et al. (2018) C. Trippel, Y. A. Manerkar, D. Lustig, M. Pellauer, and M. Martonosi. 2018. Full-Stack Memory Model Verification with TriCheck. IEEE Micro 38, 3 (2018), 58–68. https://doi.org/10.1109/MM.2018.032271062
Verdejo and Martí-Oliet (2006) Alberto Verdejo and Narciso Martí-Oliet. 2006. Executable structural operational semantics in Maude. Journal of Logic and Algebraic Programming 67, 1-2 (2006), 226 – 293. https://doi.org/10.1016/j.jlap.2005.09.008
Wickerson et al. (2017) John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically Comparing Memory Consistency Models. SIGPLAN Not. 52, 1 (Jan. 2017), 190–204. https://doi.org/10.1145/3093333.3009838
Xiaowei Shen and Arvind (1999) Xiaowei Shen and L. R. Arvind. 1999. Commit-Reconcile and Fences (CRF): a new memory model for architects and compiler writers. In Proceedings of the 26th International Symposium on Computer Architecture. 150–161.
Yeager (1996) K. C. Yeager. 1996. The MIPS R10000 superscalar microprocessor. IEEE Micro 16, 2 (1996), 28–41.
Zhang et al. (2017) S. Zhang, M. Vijayaraghavan, and Arvind. 2017. Weak Memory Models: Balancing Definitional Simplicity and Implementation Flexibility. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 288–302.
Zhang et al. (2018) Sizhuo Zhang, Muralidaran Vijayaraghavan, Andrew Wright, Mehdi Alipour, and Arvind. 2018. Constructing a Weak Memory Model. In Proceedings of the 45th Annual International Symposium on Computer Architecture (Los Angeles, California) (ISCA ’18). IEEE Press, 124–137. https://doi.org/10.1109/ISCA.2018.00021

Parallelized sequential composition, pipelines, and hardware weak memory models

Abstract.

1. Introduction

2. Foundations of instruction reordering

Reorderings

Model 1 (sc).

Definition 2.1.

Model 2 (eff).

Theorem 2.2.

Proof.

Model 3 (g0{{\textsc{g}}}_{0}).

Theorem 2.3.

Proof.

Model 4 (g).

Definition 2.4.

Model 5 (par).

Definition 2.5.

Definition 2.6 (Sequentially consistent).

Forwarding

Definition 2.7 (Forwarding).

Memory model refinement

Definition 2.8.

Theorem 2.9 (Hierarchy of models).

Proof.

Well-behaved models

Definition 2.10.

3. Pipeline semantics

Theorem 3.1.

Proof.

4. An imperative language with parallelized sequential composition

Syntax

Operational semantics

Sequential fragment

Parallelized sequential composition

Trace semantics

Refinement

Monotonicity

Theorem 4.1.

Proof.

5. Reasoning about IMP+pseq

Theorem 5.1.

Proof.

Theorem 5.2.

Proof.

Theorem 5.3 (Verification).

Proof.

6. Total store order (TSO)

Model 6 (tso).

Theorem 6.1.

Proof.

Theorem 6.2.

Proof.

Theorem 6.3.

Proof.

6.1. Equivalence to an explicit store buffer model

Theorem 6.4.

Proof.

Theorem 6.5.

Proof.

7. Release consistency

Model 7 (rcpc).

Model 8 (rcsc).

Theorem 7.1.

Proof.

Theorem 7.2.

Proof.

8. ARM version 8

Model 9 (arm).

Theorem 8.1.

Proof.

Theorem 8.2.

Proof.

Conformance.

8.1. An axiomatic specification

Theorem 8.3.

Proof.

9. RISC-V

Model 10 (risc-v).

Conformance

10. Related work

Model 3 ( ${{\textsc{g}}}_{0}$ ).

Model 7 (rc_pc).

Model 8 (rc_sc).