Separation of concerning things: a simpler basis for defining and programming with the C/C++ memory model (extended version)
Abstract.
The C/C++ memory model provides an interface and execution model for programmers of concurrent (shared-variable) code. It provides a range of mechanisms that abstract from underlying hardware memory models – that govern how multicore architectures handle concurrent accesses to main memory – as well as abstracting from compiler transformations. The C standard describes the memory model in terms of cross-thread relationships between events, and has been influenced by several research works that are similarly based. In this paper we provide a thread-local definition of the fundamental principles of the C memory model, which, for concise concurrent code, serves as a basis for relatively straightforward reasoning about the effects of the C ordering mechanisms. We argue that this definition is more practical from a programming perspective and is amenable to analysis by already established techniques for concurrent code. The key aspect is that the memory model definition is separate to other considerations of a rich programming language such as C, in particular, expression evaluation and optimisations, though we show how to reason about those considerations in the presence of C concurrency. A major simplification of our framework compared to the description in the C standard and related work in the literature is separating out considerations around the “lack of multicopy atomicity”, a concept that is in any case irrelevant to developers of code for x86, Arm, RISC-V or SPARC architectures. We show how the framework is convenient for reasoning about well-structured code, and for formally addressing unintuitive behaviours such as “out-of-thin-air” writes.
1. Introduction
C/C++ is one of the most widely used programming languages, including for low-level concurrent code where efficiency is at a premium. The C (weak) memory model, governed by an ISO standard, provides an interface (stdatomic.h) for instrumenting shared-variable concurrency that abstracts from the multitude of multicore architectures it may be compiled to, each with its own guarantees about and mechanisms for controlling accesses to shared variables.
The C memory model standard is described in terms of a cross-thread “happens before” relationship, relating stores and loads within and between threads, and “release sequences”. The fundamentals of this approach were established by Boehm & Adve (Boehm and Adve, 2008), and the standard has been influenced and improved by various research works (e.g., (Batty et al., 2011; Vafeiadis et al., 2015)). However, because it is cross-thread, verification techniques are often complex to apply, and the resulting formal semantics are often highly specialised, involve global data structures capturing a partial order on events, and rarely cover the full range of features available in C’s stdatomic.h (Chakraborty and Vafeiadis, 2019; Lahav et al., 2017). In many cases this is because such formalisms attempt to explain how reordering can occur by mixing considerations such as expression optimisations and Power’s cache coherence system, alongside local compiler-induced and processor pipeline reorderings. We instead take a separation-of-concerns approach, where the three fundamental principles involved in C concurrent code – data dependencies, fences, and memory ordering constraints – are specified separately from other aspects of C which may complicate reasoning, such as expression evaluation, expression optimisations, and arbitrary compiler transformations. Provided a programmer steers clear of concurrent code that is subject to these extra factors, reasoning about their code is relatively straightforward in our framework, but if the full richness of C is insisted upon, our framework is also applicable.
Syntactically and semantically the key aspect is the parallelized sequential composition operator, formalising a processor-pipeline concept that has been around since the 1960s, and which has been previously shown to have good explanatory power for most behaviours observed on hardware weak memory models (Colvin, 2021a, b). Reasoning in this setting involves making explicit at the program level what effects the memory model has, either reducing to a sequential form where the use of atomics.h features prevents the compiler or hardware from making problematic reorderings, or making the residual parallelism explicit; in either case, standard techniques (such as Owicki-Gries or rely/guarantee) apply to the reduced program. We cover a significant portion of the C weak memory model, including release/acquire and release/consume synchronisation, and sequentially consistent accesses and fences. We demonstrate verification of some simple behaviours (litmus tests), a spin lock implementation, and also explain how two types of related problematic behaviours – out-of-thin-air stores and read-from-untaken-branch – can be analysed and addressed. We argue that this foundation provides a simpler and more direct basis for discussion of the consequences of choices within the C memory model, and in particular for analysing the soundness of compiler transformations for concurrent code.
We explain the main aspects of the C memory model using a simple list-of-instructions language in Sect. 2, covering all relevant aspects of the C memory model’s principles. We then give the syntax and semantics of a more realistic imperative language (with conditionals and loops) in Sect. 3. We give a set of reduction rules for reasoning about the effects of the memory model in Sect. 4, and explain how standard reasoning techniques can be applied in Sect. 5. We show some illustrative examples in Sect. 6, including the often-discussed “out-of-thin-air” behaviours, showing how in our framework an allowed version of the pattern arises naturally, and a disallowed version is similarly disallowed. In Sections 7, 8, and 9 we extend the language of Sect. 3 with other features of programming in C such as incremental (usually called non-atomic) expression evaluation and instruction execution, expression optimisations, and forwarding of values from earlier instructions to later ones. Crucially, despite the complexities these features introduce, the fundamental principles of the C memory model from Sect. 2 do not change. In Sect. 10 we give a formal discussion of the “read-from-untaken branch behaviour” which exposes the often problematic interactions between standard compiler optimisations and controlling shared-variable concurrency.
2. A simple language with thread-local reordering
In this section we give a simple language formed from primitive actions and “parallelized sequential prefixing”, which serves to explain the crucial parts of reordering due to the C memory model. In Sect. 3 we extend the language to include standard imperative constructs such as conditionals, loops, and composite actions.
2.1. Syntax and semantics of a language with instruction reordering
To focus on the semantics of reordering we introduce a very basic language formed from primitive instructions representing assignments, branch points (guards), and fences, which are composed solely by a prefix operator that allows reordering (early execution) of later instructions.
Expressions e can be base values v, variables x, or involve the usual unary (⊖e) or binary (e1 ⊕ e2) operators. A primitive action in this language is either an assignment x := e, where x is a variable and e an expression, a guard [b], where b is a Boolean expression, or a fence, of one of several fence types described below. A command can be the terminated command nil, or a simple prefixing of a primitive instruction α before command c, parameterised by some memory model m, written α ;m c. As we show elsewhere (Colvin and Smith, 2018; Colvin, 2021b, a), the parameter m can be instantiated to give the behaviour of hardware weak memory models, but in this paper we focus mostly on C’s memory model, denoted formally by ‘c’, and the special cases of sequential and parallel composition. The ‘;m’ operator essentially allows the construction of a sequence of instructions that may be reordered under some circumstances, similar to a hardware pipeline.
The semantics of the language is given operationally below. All primitive actions are executed in a single indivisible step. (Such steps are normally called atomic, but we avoid that term to keep the notion separate from C’s notion of atomics.)
A command α ;m c may either immediately execute α (the prefix rule), or it may execute some action β of c provided that β may reorder with α with respect to m, written α <~m β (the reordering rule). The conditions under which reordering may occur are specific to the memory model under consideration, and we define these below for C. The memory model parameter is defined pointwise on instruction types. This is a relatively convenient way to express reorderings, especially as it is agnostic about global traces and behaviours. As shown empirically in (Colvin, 2021b) it is suitable for explaining observed behaviours on architectures such as Arm, x86 and RISC-V. It specialises to the notions of sequential and parallel composition straightforwardly.
As an example, consider a memory model m such that assignments of values to different variables can be executed in either order (as on Arm, but not on x86, for instance), that is, x := v <~m y := w for distinct variables x and y. Then we have two possible terminating traces (traces ending in nil) for the program x := 1 ;m y := 2, namely ⟨x := 1, y := 2⟩ and ⟨y := 2, x := 1⟩.
The first behaviour results from two applications of the prefix rule above (as in prefixing in CSP (Hoare, 1985) or CCS (Milner, 1982)). The second behaviour results from applying the reordering rule, noting that by assumption x := 1 <~m y := 2, and then the prefix rule.
We define a full fence instruction, fence, and subsequently treat it as a full fence in m by defining α <~/m fence and fence <~/m α for all α (where α <~/m β abbreviates ¬(α <~m β)). Then we have exactly one possible trace in the following circumstance. For convenience below we omit the trailing nil.

(2.1)  x := 1 ;m fence ;m y := 2  has the single trace ⟨x := 1, fence, y := 2⟩
The fence has prevented application of the reordering rule (since by definition both x := 1 <~/m fence and fence <~/m y := 2) and hence restored sequential order on the instructions.
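For intuition, the analogous situation in C (anticipating the mapping of Sect. 2.3) places a fence between two relaxed stores; a minimal sketch, assuming atomic variables x and y:

#include <stdatomic.h>

atomic_int x, y;

void thread0(void) {
  // two stores to distinct variables, which hardware or compiler
  // may otherwise reorder (the two traces above)
  atomic_store_explicit(&x, 1, memory_order_relaxed);
  // a full fence blocks the reordering, restoring the single trace
  atomic_thread_fence(memory_order_seq_cst);
  atomic_store_explicit(&y, 2, memory_order_relaxed);
}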
The framework admits the definition of standard sequential and parallel composition as (extreme) memory models. Let the “sequential consistency” model sc be the model that prevents all reordering, for which we introduce the usual sequential composition operator ‘;’. We define the complement of sc to be par, i.e., the memory model that allows all reordering, which corresponds to parallel execution ‘∥’.
(2.2)  α <~/sc β  (no reordering, for all α, β)
(2.3)  α <~par β  (all reordering, for all α, β)
(2.4)  α ; c ≙ α ;sc c   and   α ∥ c ≙ α ;par c
Writing ≡ for trace equality (defined formally later),

(2.5)  x := 1 ;m fence ;m y := 2  ≡  x := 1 ; fence ; y := 2

Without the fence, x := 1 and y := 2 effectively execute in parallel (under m), that is,

(2.6)  x := 1 ;m y := 2  ≡  x := 1 ∥ y := 2

Whether x := 1 ;m y := 2 satisfies some property depends exactly on whether the parallel execution x := 1 ∥ y := 2 satisfies the property.
2.2. Reordering in C
We now consider the specifics of reordering in the c memory model, which involves three aspects: (i) variable (data) dependencies; (ii) fences; and (iii) “memory ordering constraints”, which can be used to annotate variables or fences. We cover each of these three aspects in turn.
2.2.1. Data dependencies/respecting sequential semantics
A key concept underpinning both processor pipelines and compiler transformations is that of data dependencies, where one instruction depends on a value being calculated by another. To capture this we write α ⟶ β if instruction β depends on a value that instruction α writes to. We define a range of foundational syntactic and semantic concepts below. In a concurrent setting we distinguish between local and shared variables, that is, the set of variables Var is divided into two mutually exclusive and exhaustive sets Shared and Local. By convention we let x, y, z be shared variables and r, r1, r2 be local variables. For convenience we write X # Y to mean that sets X and Y are mutually exclusive.
(2.7)  fv(v) ≙ {}   fv(x) ≙ {x}   fv(⊖e) ≙ fv(e)   fv(e1 ⊕ e2) ≙ fv(e1) ∪ fv(e2)
(2.8)  wv(x := e) ≙ {x}   wv([b]) ≙ {}   wv(fence) ≙ {}
(2.9)  rv(x := e) ≙ fv(e)   rv([b]) ≙ fv(b)   rv(fence) ≙ {}
(2.10)  fv(α) ≙ wv(α) ∪ rv(α)
(2.11)  swv(α) ≙ wv(α) ∩ Shared   srv(α) ≙ rv(α) ∩ Shared
(2.12)  store(α) ≙ swv(α) ≠ {}   load(α) ≙ srv(α) ≠ {}
(2.13)  α ⟶ β ≙ wv(α) ∩ rv(β) ≠ {}
(2.14)  α ⟶̸ β ≙ ¬(α ⟶ β)
(2.15)  interference-free(α, β) ≙ α ⟶̸ β ∧ β ⟶̸ α ∧ wv(α) ∩ wv(β) = {}
(2.16)  load-indep(α, β) ≙ srv(α) ∩ srv(β) = {}
(2.21)  order-indep(α, β) ≙ eff(α) ; eff(β) = eff(β) ; eff(α)
The write variables of instructions (written wv(α)) are collected syntactically (2.8), as are the read variables (written rv(α)) (2.9), which depend on the usual notion of the free variables in an expression (written fv(e), defined straightforwardly over the syntax of expressions in (2.7)). The free variables of an instruction are the union of the write and read variables (2.10). Shared and local variables have different requirements from a reordering perspective and so we introduce specialisations of these concepts to just the shared variables (2.11). We define “store” instructions (store(α)) as those that write to a shared variable, and “load” instructions (load(α)) as those that read shared variables (2.12) (and hence an instruction such as x := x + 1, for shared x, is both a store and a load).

Using these definitions we can describe various relationships between actions. One of the key notions is that of “data dependence”, where we write α ⟶ β if instruction β references a variable being modified by instruction α (2.13) (and similarly we write α ⟶̸ β if there is no data dependence (2.14)). For instance, x := 1 ⟶ r := x but x := 1 ⟶̸ r := y. The former can be expressed as “x := 1 carries a dependency into r := x”. Two instructions are interference-free if there is no data dependence in either direction, and they write to different variables (2.15). (This is a well-known property, the earliest example being Hoare’s disjointness (Hoare, 1972, 2002; Apt and Olderog, 2019), and it is also called non-interference in separation logic (Brookes, 2007). This condition, formally discussed since the 1960s, is remarkably powerful for explaining the majority of observed behaviours of basic instruction types on modern hardware, although this author did not find evidence of cross-over in older references ((Thornton, 1964; Tomasulo, 1967), etc).) Note that instructions that are interference-free may still load the same variables; we say they are load independent if they load distinct shared variables (2.16). Finally, two instructions are “order independent” if the effect of executing them is independent of the execution order (2.21). The “effect” function eff denotes actions as a set of pairs of states in the usual imperative style (defined later in Fig. 4), using ‘;’ for relational composition. Note that order-independence is a weaker condition than non-interference, for example, x := 1 and x := 1 are order-independent but not interference-free. All of these definitions or concepts defined for instructions can be lifted straightforwardly to commands by induction on the syntax (see Appendix A).
The key aspect of interference-freedom is the following.
Theorem 2.1 (Disjointness).

If α and β are interference-free then they are order-independent, that is, eff(α) ; eff(β) = eff(β) ; eff(α).

Proof.
Straightforward: independent actions do not change each other’s outcomes.
We say β may be reordered before α with respect to sequential semantics, written α <~g β, if they are interference-free and load-independent; the latter constraint maintains coherence of loads.

Definition 2.2 (g).

For instructions α and β,

(2.22)  α <~g β ≙ interference-free(α, β) ∧ load-indep(α, β)
A syntactic check may be used to validate α <~g β. Of course one could instead do a semantic check to establish the weaker property of order-independence, however this is typically not done on a case-by-case basis but rather is used to justify compiler transformations. It is not feasible to check order independence directly for all cases.
We can derive the following, where we assume x, y ∈ Shared are distinct, r, r1, r2 ∈ Local are distinct, and v, w are values. We write α <~/g β if ¬(α <~g β).

(2.25)  x := v <~g y := w   but   x := v <~/g x := w
(2.28)  x := v <~g r1 := y   but   x := v <~/g r1 := x
(2.31)  r1 := x <~g r2 := y   but   r1 := x <~/g r2 := x
(2.32)  r1 := r <~g r2 := r
(2.33)  [b] <~g x := v,  provided x ∉ fv(b)
This shows that independent stores can be reordered, but not stores to the same variable (2.25); similarly independent stores and loads can be reordered, as can loads, provided they are not referencing the same (shared) variable ((2.28) and (2.31)). Accesses of the same local variable, however, can be reordered (2.32), since no interference is possible. Finally, stores can be reordered before (independent) guards (2.33). Since, as we show later, guards model branch points, this is a significant aspect of c. For hardware weak memory models (2.33) is not allowed, as it implies a “speculative” write that may not be valid if b is eventually found not to hold; however at compile time it may be possible to determine that a branch condition will eventually hold.
2.2.2. Respecting fences
Since the reorderings allowed by weak memory models may be problematic for establishing communication protocols between processes, such models typically have their own “fence” instruction types, which are artificial constraints on reordering (as opposed to the “natural” data-dependence constraint). C has fences specifically to establish order between stores, between loads, or between either type. We define below a relation f ⊲ α, stating when fence type f blocks instruction α.
(2.36)  sfence ⊲ α ≙ store(α)
(2.39)  lfence ⊲ α ≙ load(α)
(2.42)  fence ⊲ α,  for all α
Eq. (2.36) states that a store fence (sfence) blocks store instructions (recall (2.12)), while (2.39) similarly states that a load fence (lfence) blocks loads. Eq. (2.42) states that a “full” fence blocks all instruction types.
We use this base definition to define when two instructions can be reordered according to their respective fences (lifting the relation to <~f).

Definition 2.3 (Fence reorderings).

α <~f β ≙ (∀ f ∈ fences(α) • ¬(f ⊲ β)) ∧ (∀ f ∈ fences(β) • ¬(f ⊲ α))

where fences(α) extracts the fences present in α, i.e.,

(2.43)  fences(f) ≙ {f} for a fence f, and fences(α) ≙ {} otherwise
Hence α and β can be reordered (considering fences only) if they each respect the other’s fences. Note that all C fences are defined symmetrically so we simply check the pair in each direction. (For asymmetric fences, such as Arm’s isb instruction, both directions need to be specified (Colvin, 2021b).)
Based on these definitions we can determine the following special cases, where store(α) and load(β).

(2.48)  α <~/f sfence
(2.53)  β <~/f lfence
(2.58)  γ <~/f fence,  for all γ

Statements of the form α <~/f sfence should be read as a shorthand for α <~/f sfence and sfence <~/f α. Inserting fences limits the number of possible traces of a program, an outcome which may restore intended relationships between variables.
2.3. Memory ordering constraints
C introduces “atomics” in the stdatomic library, which includes several types of “memory ordering constraints” (which we henceforth call ordering constraints), which can tag loads and stores of variables declared to be “atomic” (e.g., type atomic_int for a shared integer). These ordering constraints control how atomic variables interact. We start with an informal overview of how ordering constraints are intended to work, and then show how we incorporate them into the syntax of the language and define a reordering relation over them.
For the discussions below we use a more compact notation for stores and loads, as exemplified by the following C equivalents. (In C++, r = y.load(std::memory_order_relaxed) and y.store(3, std::memory_order_seq_cst).)

(2.59)  r := y^rlx  corresponds to  r = atomic_load_explicit(&y, memory_order_relaxed)
(2.60)  y^sc := 3  corresponds to  atomic_store_explicit(&y, 3, memory_order_seq_cst)
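For reference, a minimal compilable context for these two accesses (a sketch; names as above):

#include <stdatomic.h>

atomic_int y;   // an atomic shared variable
int r;          // a thread-local variable

void example(void) {
  r = atomic_load_explicit(&y, memory_order_relaxed);   // r := y^rlx
  atomic_store_explicit(&y, 3, memory_order_seq_cst);   // y^sc := 3
}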
2.3.1. Informal overview of ordering constraints
We describe the six ordering constraints defined in type memory_order below.
•  Relaxed (rlx, memory_order_relaxed). A relaxed access is the simplest, not adding any extra constraints to the variable access, allowing both the hardware and compiler to potentially reorder independent relaxed accesses. For instance, consider the following program snippet (where ‘⨟’ is C’s semicolon).

(2.61)  d := 1 ⨟ f := 1

If the programmer’s intention is that the flag variable f is set to 1 after the data d is set to 1, then they have erred: either the compiler or the hardware may reorder the apparently independent stores to d and f. (A C rendering of these constraints is sketched after this list.)
•  Release (rel, memory_order_release). The release tag can be applied to stores, indicating the end of a set of shared accesses (and hence control can be “released” to the system). Typically this tag is used on the setting of a flag variable, e.g., modifying the above case,

(2.62)  d := 1 ⨟ f^rel := 1

now ensures that any other process that sees f = 1 knows that the update (d := 1) preceding the change to f has taken effect.
•  Acquire (acq, memory_order_acquire). The acquire tag can be applied to loads, and is the reciprocal of a release: any subsequent loads will see everything the acquire can see. For example, continuing from above, a simple process that reads the flag f in parallel with the above process may be written as follows:

(2.63)  r1 := f^acq ⨟ r2 := d

In this case the loads are kept in order by the acquire constraint, and hence at the end of the program, in the absence of any other interference, r1 = 1 ⇒ r2 = 1.
•  Consume (con, memory_order_consume). The consume tag is similar to, but weaker than, acq, in that it is intended to be partnered with a rel, but only subsequent loads that are data-dependent on the loaded value are guaranteed to see the change. Hence,

(2.64)  r1 := f^con ⨟ r2 := d

does not give r1 = 1 ⇒ r2 = 1. However, if the second load is made data-dependent on r1, as in

(2.65)  r1 := f^con ⨟ r2 := d + (r1 − r1)

then the loads are kept in order and r1 = 1 ⇒ r2 = 1. Data-dependence is maintained by the <~g relation, and in that sense the con constraint provides no extra reordering information to that of rlx. However the intention is that a con load indicates to the compiler that it must not lose, via optimisations, data dependencies to later instructions. We return to con in Sect. 8.2, in the context of expression optimisations, but for concision in the rest of the paper we omit consideration of con, which otherwise behaves as a rlx constraint.
•  Sequentially consistent (sc, memory_order_seq_cst). The sequentially consistent constraint is the strongest, forcing order between sc-tagged instructions and any other instructions. For example, the snippet

(2.66)  d^sc := 1 ⨟ f^sc := 1

ensures the store to d takes effect before the store to f (in fact only one instruction needs the constraint for this to hold). This is considered a more “heavyweight” method for enforcing order than the acquire/release constraints.
•  Acquire-release (acqrel, memory_order_acq_rel). This constraint is used on instructions that have both an acquire and release component, and as will be seen it is straightforward to combine them in our syntax.
Non-atomics
Additionally shared data can be “non-atomic”, i.e., variables that are not declared of an atomic type, and as such cannot be directly associated with the ordering constraints above. Programs which attempt to concurrently access shared data without any of the ordering mechanisms of stdatomic.h essentially have no guarantees about their behaviour, and so we ignore such programs. Shared non-atomic variables which are correctly synchronised, i.e., are associated with some ordering mechanism (e.g., barriers or acquire/release flag variables), can be treated as if relaxed, and are subject to potential optimisations as outlined in Sect. 8. Distinguishing correctly synchronised shared non-atomic variables is a syntactic check that compilers carry out, which could be carried over into our syntactic framework, but this is not directly relevant to the question of reorderings.
2.3.2. Formalising memory order constraints
The relationship between the ordering constraints can be thought of in terms of a reordering relationship, <~oc, as in the previous sections.
Definition 2.4 (Memory ordering constraints).
We write oc1 <~oc oc2 if constraint oc2 may be reordered before constraint oc1, and as before oc1 <~/oc oc2 otherwise. Expressing the relation as a negative is perhaps more intuitive, i.e., for all constraints oc, acq <~/oc oc, oc <~/oc rel, and oc <~/oc sc (and symmetrically sc <~/oc oc). Additionally, the con constraint is equal to a rlx constraint for the purposes of ordering (but see Sect. 8.2), and acqrel is the combination of both acq and rel. An alternative presentation of the relationship <~oc is as a grid, where a ✓ in row oc1, column oc2 indicates oc1 <~oc oc2:

(2.67)
         rlx   acq   rel   sc
  rlx     ✓     ✓     ✗    ✗
  acq     ✗     ✗     ✗    ✗
  rel     ✓     ✓     ✗    ✗
  sc      ✗     ✗     ✗    ✗
Note that acquire loads may come before release stores (row rel, column acq), that is, C follows the rcpc model rather than the stronger rcsc model in (Gharachorloo et al., 1990) (it is of course straightforward to accommodate either; see also (Colvin, 2021b)).
We extend the syntax of instructions to incorporate ordering constraints, allowing variables and fences to be annotated with a set thereof, as shown in Fig. 1. The reordering relation for instructions, considering only ordering constraints, can be defined as below.
Definition 2.5.

Reordering instructions with respect to ordering constraints

(2.68)  α <~oc β ≙ ∀ oc1 ∈ ocs(α), oc2 ∈ ocs(β) • oc1 <~oc oc2

where the function ocs extracts the ordering constraints from the expression and instruction syntax, e.g.,

(2.69)  ocs(x^OC := e) ≙ OC ∪ ocs(e)   ocs(r := e) ≙ ocs(e)   ocs(fence^OC) ≙ OC

For example, ocs(f^rel := 1) = {rel}. Thus α <~oc β is checked by comparing point-wise each pair of ordering constraints in α and β. If α or β has no ordering constraints then α <~oc β vacuously holds.
For example, x := 1 <~/oc f^rel := 1 (a release store may not be reordered before an earlier access), whereas x := 1 <~oc r := y^acq holds (an acquire load may itself move earlier, although nothing may move before it). Note that a locals-only instruction, such as r := 1, does not have any ordering constraints, and so is not affected by them, that is ocs(r := 1) = {} and hence α <~oc r := 1 and r := 1 <~oc α for all α.
oc ::= rlx | con | acq | rel | sc
acqrel ≙ {acq, rel}
for x ∈ Shared, x^oc abbreviates x^{oc}
for x ∈ Shared, x abbreviates x^rlx, and
for r ∈ Local, r abbreviates r^{}
for a fence, fence^oc abbreviates fence^{oc}

Figure 1. Instruction syntax with ordering constraints, and abbreviations
For convenience we define some abbreviations and conventions at the bottom of Fig. 1: we require every reference to a shared variable to have (at least one) ordering constraint, with the default being rlx; when there is exactly one ordering constraint we omit the set brackets in the syntax, and typically, when the types are clear, we abbreviate x^rlx to plain x as rlx is the default. Local variables are by definition never declared “atomic” and hence their set of ordering constraints is always empty; hence, when r is a local variable, we abbreviate r^{} to r. Similarly we abbreviate ordering constraints on fences. We can now define release, acquire and sequentially consistent fences as the combination of a fence and an ordering constraint. A “release fence” (fence^rel) operates according to the rel semantics above, and in addition blocks stores, hence is a combination of an sfence and rel ordering, and similarly an “acquire fence” (fence^acq) acts as an lfence and acq ordering. A “sequentially consistent” fence (fence^sc) is also defined straightforwardly; these fences map to C’s atomic_thread_fence(...) definition.
To complete the syntax extension we update the syntax-based definitions for extracting variables from expressions and instructions by defining fv(x^OC) ≙ {x} and wv(x^OC := e) ≙ {x}, that is, read/write variables do not include the ordering constraints (and fv(fence^OC) ≙ {}).
2.4. The complete reordering relation
We can now define the C memory model as the combination of the three aspects above.
Definition 2.6 (Reordering of instructions in c).
(2.70)  α <~c β ≙ (i) α <~g β, (ii) α <~f β, and (iii) α <~oc β
Hence reordering of instructions within a C program can occur provided the sequential semantics, fences, and ordering constraints are respected. We show in later sections how this principle does not change for more complex language features, though, of course, the semantics and hence the analysis is correspondingly more complex.
As examples, for distinct shared variables x and y, x := 1 <~c y := 1 but x := 1 <~/c y^rel := 1. Hence, following the earlier definitions, we have various ways of enforcing program order using the flag set/check (message passing) pattern from earlier. We leave rlx constraints on accesses of shared variables implicit.

(2.75)  d := 1 ⨟ f^rel := 1      d := 1 ⨟ fence ⨟ f := 1      d := 1 ⨟ f^sc := 1
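In C, the three orderings in (2.75) might be rendered as follows (a sketch; the function names are ours, d and f as above):

#include <stdatomic.h>

atomic_int d, f;

void via_release_store(void) {
  atomic_store_explicit(&d, 1, memory_order_relaxed);
  atomic_store_explicit(&f, 1, memory_order_release);
}

void via_fence(void) {
  atomic_store_explicit(&d, 1, memory_order_relaxed);
  atomic_thread_fence(memory_order_seq_cst);   // a full fence between the stores
  atomic_store_explicit(&f, 1, memory_order_relaxed);
}

void via_seq_cst_store(void) {
  atomic_store_explicit(&d, 1, memory_order_relaxed);
  atomic_store_explicit(&f, 1, memory_order_seq_cst);
}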
3. An imperative language with reordering
We now show how reordering according to the C memory model can be embedded into a more realistic imperative language that has conditionals and loops, based on the previously described wide-spectrum language IMP+pseq (Colvin, 2021a; Colvin and Smith, 2018; Colvin, 2021b). We give a small-step operational semantics and define trace equivalence for its notion of correctness.
3.1. Syntax
(3.1)  c ::= nil | ⟨α1, …, αn⟩ | c1 ;m c2 | c1 ⊓ c2 | c^{m*}
(3.2)  [b] ≙ guard b
(3.3)  c1 ⨟ c2 ≙ c1 ;c c2
(3.4)  c1 ; c2 ≙ c1 ;sc c2
(3.5)  c1 ∥ c2 ≙ c1 ;par c2
(3.6)  c^{m,0} ≙ nil   c^{m,n+1} ≙ c ;m c^{m,n}
(3.7)  if^m b then c1 else c2 ≙ ([b] ;m c1) ⊓ ([¬b] ;m c2)
(3.8)  while^m b do c ≙ ([b] ;m c)^{m*} ;m [¬b]
(3.9)  CAS(x, v, w) ≙ ⟨[x = v], x := w⟩ ⊓ ⟨[x ≠ v]⟩
(3.10)  CAS^acqrel(x, v, w) ≙ ⟨[x = v], x^acqrel := w⟩ ⊓ ⟨[x ≠ v]⟩
(3.11)  CASr(x, v, w, r) ≙ ⟨[x = v], x := w, r := true⟩ ⊓ ⟨[x ≠ v], r := false⟩
(3.12)  FAA(x, e, r) ≙ ⟨r := x, x := x + e⟩

Figure 2. Syntax of commands in IMP+pseqC
The syntax of IMP+pseqC is given in Fig. 2, with expressions and instructions remaining as shown in Fig. 1.
Commands
The command syntax (defn. (3.1)) includes the terminated command, nil, a sequence of instructions, ⟨α1, …, αn⟩ (allowing composite actions to be defined), the parallelized sequential composition of two commands according to some memory model m, c1 ;m c2, a choice between two commands, c1 ⊓ c2, or a parallelized iteration of a command according to some memory model m, c^{m*}. From this language we can build an imperative language with conditionals and loops following algebraic patterns (Fischer and Ladner, 1979; Kozen, 2000). We define the special action type [b] as a guard (3.2); this action has no observable effect and is not considered for the purposes of determining (trace) equivalence. We allow the abbreviation ‘⨟’ for the case where the model parameter is c (3.3). We also introduce abbreviations for strict (traditional) order (3.4) and parallel composition (3.5), based on the memory models sc and par introduced earlier (defns. (2.2) and (2.3)). The (parallelized) iteration of a command a finite number of times is defined inductively in defn. (3.6). Conditionals (defn. (3.7)) and while loops (3.8) can be constructed in the usual way using guards and iteration. We define an empty-branch conditional, if^m b then c, as if^m b then c else nil.
Guards
The use of a guard action type allows established encodings of conditionals and loops as described above. However treating a guard as a separate action is useful in considering reorderings as well, and in particular in understanding the interaction of conditionals with compiler optimisations: the fundamentals of reorderings involving guards are based on the principles of preserving sequential semantics (on a single thread) as in Sect. 2.2.1, and these can be lifted to the conditional and loop command types straightforwardly, without needing to deal with them monolithically. Note that if a guard evaluates to false this represents a behaviour that cannot occur.
Composite actions
Since we allow sequences of instructions to be the basic building block of the language, with the intention that all instructions in the sequence are executed, in order, as a single indivisible step, we can straightforwardly define complex instruction types such as “compare-and-swap”. For lists we write ⟨⟩ for the empty list, ⌢ for concatenation, and ⟨α1, …, αn⟩ as the list constructor. For notational ease we let a singleton sequence of actions ⟨α⟩ just be written α where the intended type is clear from the context. For brevity we allow ⌢ to accept single elements in place of singleton lists. Note that an instruction such as x := y happens as a single indivisible step, that is, the value for y is fetched and written into x in one step. This is not realistic for C, and as such we later (Sect. 7, see also Appendix B) show how instructions can be incrementally executed (in the above case, with the value for y fetched, and only later updating x to that value). For now we assume that anything evaluated incrementally is written out to make the granularity explicit, for example, the above assignment becomes r := y ⨟ x := r, for some fresh identifier r; see further discussion in Sect. 7.3.
The composite compare-and-swap command is defined as a choice between determining that x = v and updating x to w in a single indivisible step, or determining that x ≠ v (defn. (3.9)). This can be generalised to include ordering constraints, for instance, in the acqrel case (3.10). Updates to a local variable can be included to record the result (3.11). It is of course straightforward to define other composite commands, such as fetch-and-add (3.12), which map to C’s inbuilt functions such as atomic_compare_exchange_strong_explicit(…) and atomic_fetch_add_explicit(…). We show these to emphasise that the definition of reordering does not have to change whenever a new instruction type is added; we can easily syntactically extract the required elements. For instance, given the above definitions, we can determine the following.
(3.13)  CAS(x, v, w) <~c y := 1

since the instructions are interference-free and load-independent, and neither carries a blocking fence or ordering constraint. We lift the write/read variables of instructions to commands straightforwardly (see Appendix A), and as such reordering on commands can be calculated, for example,

(3.14)  (if r = 1 then y := 1 else y := 2) <~c x := 1

since the assignment to x is not dependent on anything in the conditional statement.
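As a hedged illustration of the correspondence with C’s built-ins (the wrapper names cas and faa are ours):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int x;

// Corresponds to the CASr pattern (3.11): succeed (writing w) only if
// x currently holds v, recording the outcome.
bool cas(int v, int w) {
  return atomic_compare_exchange_strong_explicit(
      &x, &v, w, memory_order_acq_rel, memory_order_relaxed);
}

// Corresponds to FAA (3.12): atomically add e to x, returning the
// previous value of x.
int faa(int e) {
  return atomic_fetch_add_explicit(&x, e, memory_order_acq_rel);
}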
3.2. Small-step operational semantics
Rule 3.15 (Action).  ⟨α1, …, αn⟩ —⟨α1, …, αn⟩→ nil

Rule 3.16 (Choice).  c1 ⊓ c2 —τ→ c1   and   c1 ⊓ c2 —τ→ c2

Rule 3.17 (Iterate).  c^{m*} —τ→ c^{m,n}   for any n ∈ ℕ

Rule 3.18 (Parallelized sequential composition).
  if c1 —a→ c1′ then c1 ;m c2 —a→ c1′ ;m c2
  if c2 —a→ c2′ then nil ;m c2 —a→ c2′
  if c2 —a→ c2′ and c1 <~m a then c1 ;m c2 —a→ c1 ;m c2′

Figure 3. Operational semantics of IMP+pseq
The semantics of IMP+pseq is given in Fig. 3 (an adaptation of (Colvin, 2021b)). A step (action) is a sequence of instructions which are considered to be executed together, without interference; in the majority of cases the sequences (actions) are singletons. Rule 3.15 places a list of instructions into the trace as a single list (so a trace is a list of lists). Rule 3.16 states that a nondeterministic choice is resolved silently to either branch (in the rules we let τ abbreviate the silent action [true]; alternatively we could define τ as the empty sequence). Rule 3.17 nondeterministically picks a finite number (n) of times to iterate c, where finite iteration is defined in defn. (3.6).
Rule 3.18 is the interesting rule, which generalises the earlier rule for prefixing: a command c1 ;m c2 can take a step of c1, or begin execution of c2 if c1 is terminated, or execute a step of c2 if that step reorders with c1. Reordering of an action (list of instructions) with a command is lifted from reordering on instructions straightforwardly (Appendix A).
As an example, from (3.14) we can deduce the following.

(if r = 1 then y := 1 else y := 2) ⨟ x := 1  —x := 1→  (if r = 1 then y := 1 else y := 2)

That is, the assignment to x can occur before the conditional; this represents the compiler deciding to move the store before the test since it will happen on either path.
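In C terms this is the familiar store-hoisting transformation; a schematic sketch (the names x, y, r follow the example above, and we assume relaxed or non-atomic shared data for illustration):

int r;
int x, y;   // shared data

void before(void) {
  if (r == 1) { y = 1; } else { y = 2; }
  x = 1;
}

// The model permits executing the store to x first, since it occurs
// after both paths and is independent of the condition and branches:
void after(void) {
  x = 1;
  if (r == 1) { y = 1; } else { y = 2; }
}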
3.3. Trace semantics
Given a program the operational semantics generates a trace, that is, a finite sequence of steps where the labels in the trace are actions. (Since infinite traces do not add anything of special interest to the discussion of weak memory models over and above finite traces, we focus on finite traces only, to avoid the usual extra complications that infinite traces introduce.) We write c ==t==> c′ to say that c executes the actions in trace t and evolves to c′, inductively constructed below. The base case for the induction is given by c ==⟨⟩==> c.
(3.19)  if c —a→ c′ and c′ ==t==> c″ and a is visible, then c ==⟨a⟩ ⌢ t==> c″
(3.20)  if c —a→ c′ and c′ ==t==> c″ and a is silent, then c ==t==> c″
(3.21)  c1 ≡ c2 ≙ c1 ⊑ c2 ∧ c2 ⊑ c1,  where c1 ⊑ c2 ≙ beh(c2) ⊆ beh(c1)
Traces of visible actions are accumulated into the trace (using ‘⌢’ for list concatenation) (3.19), and silent actions (such as τ) are discarded (3.20), i.e., we have a “weak” notion of equivalence (Milner, 1982). A visible action is any action with a visible effect, for instance, fences, assignments, and guards with free variables. Silent actions include any guard which is true in any state and contains no free variables; for instance, [1 = 1] is silent while [x = 1] is not. A third category of actions, the infeasible actions, includes exactly those guards whose condition evaluates to false in every state. This includes actions such as [1 = 2], with the simplest example being [false], which we abbreviate to magic (Morgan, 1994). Any behaviour of a command in which an infeasible action occurs does not result in a finite terminating trace, and hence is excluded from consideration. Such behaviours include those where a branch is taken whose condition eventually evaluates to false.
The meaning of a command c is its set of all possible terminating behaviours beh(c), leading to the usual (reverse) subset inclusion notion of refinement, where c ⊑ c′ if every behaviour of c′ is a behaviour of c; our notion of command equivalence is refinement in both directions (3.21).
From the semantics we can derive the usual properties such as beh(c1 ; c2) = beh(c1) ⌢ beh(c2) and beh(c1 ⊓ c2) = beh(c1) ∪ beh(c2) (overloading ‘⌢’ to mean pairwise concatenation of sets of lists). We can use trace equivalence to define a set of rules for manipulating a program under refinement or equivalence; we elucidate a general set of these in the following section.
4. Reduction rules
Using the notion of trace equivalence the following properties can be derived for the language, and verified in Isabelle/HOL (Colvin, 2021b). The usual properties of commutativity, associativity, etc., for the standard operators of the language hold, and so we focus below on properties involving parallelized sequential composition.
(4.1)  c1 ;m c2 ⊑ c1 ; c2
(4.2)  c1 ⊓ c2 ⊑ c1
(4.3)  (α ; c1) ∥ c2 ⊑ α ; (c1 ∥ c2)
(4.4)  c1 ∥ (c2 ⊓ c3) ≡ (c1 ∥ c2) ⊓ (c1 ∥ c3)
(4.5)  (c1 ;m c2) ;m c3 ≡ c1 ;m (c2 ;m c3)
(4.8)  α ;m β ≡ α ; β,  if α <~/m β
(4.9)  α ;m β ≡ α ∥ β,  if α <~m β
(4.10)  α ;m (β ;m c) ⊑ β ; (α ;m c),  if α <~m β
Law 4.1 states that a parallelized sequential composition can always be refined to a strict ordering. Law 4.2 states that a choice can be refined to its left branch (a symmetric rule holds for the right branch). Law 4.3 says that the first instruction of either process in a parallel composition can be the first step of the composition as a whole. Such refinement rules are useful for elucidating specific reorderings and interleavings of parallel processes that lead to particular behaviours (essentially reducing to a particular trace). Law 4.4 is an equality, which states that if there is nondeterminism in a parallel process the effect can be understood by lifting the nondeterminism to the top level; such a rule is useful for the application of, for instance, Owicki-Gries reasoning (Sect. 5.4). Law 4.5 states that parallelized sequential composition is associative (provided the same model m is used on both instances). Laws 4.8 and 4.9 are special cases where, given two actions α and β, if they cannot reorder then they are executed in order, and if they can it is as if they are executed in parallel. Law 4.10 straightforwardly promotes action β before α, and depending on the structure of c, further of its actions may be reordered before α.
We can extend these rules to more complex structures.
(4.11)  c1 ;m fence ;m c2 ≡ c1 ; fence ; c2
(4.12)  if^m b then c1 else c2 ≡ ([b] ; c1) ⊓ ([¬b] ; c2),  if every instruction of c1 and c2 depends on the shared variables of b
(4.13)  if^m b then c1 else c2 ≡ ([b] ∥ c1) ⊓ ([¬b] ∥ c2),  if the guards reorder with the respective branches
Law 4.11 shows that (full) fences enforce ordering. Law 4.12 gives a special case of a conditional where the branch depends on a shared variable in the condition, in which case the command is executed in-order (assuming model m respects data dependencies). Law 4.13 elucidates the potentially complex case of reasoning about conditionals in which there are no dependencies: theoretically the compiler could allow inner instructions to appear to be executed before the evaluation of the condition.
Assuming [b] <~m x := 1 and [¬b] <~m x := 1, and that x := 1 is independent of commands c1 and c2 (i.e., c1 <~m x := 1 and c2 <~m x := 1), we can derive the following.

(4.14)  (if^m b then c1 else c2) ;m x := 1 ⊑ x := 1 ; (if^m b then c1 else c2)
(4.15)  (if^m b then c1 else c2) ;m x := 1 ⊑ (if^m b then c1 else c2) ; x := 1
These sorts of structural rules help elucidate consequences of the memory model for programmers at a level that is easily understood.
The next few laws allow reasoning about “await”-style loops, as used in some lock implementations.
(4.16)  α^{m*} ≡ α^{sc*}
(4.17)  (α ;m β)^{m*} ≡ (α ; β)^{sc*},  if α <~/m β
(4.18)  while^m ¬r do r := x ≡ while^sc ¬r do r := x
Law 4.16 states that a sequence of repeated instructions in order, according to any model m, can be treated as executing in order. Using this property, and others like it (we omit such trivial laws that do not involve reordering), we can deduce Law 4.18, which states that a spin-loop that polls a shared variable can be treated as if executed in strict order (sc).
Lifting to commands
So far we have considered reduction laws that apply to relatively simple cases involving individual actions. Lifting the concepts to commands is nontrivial in general, that is, for c1 ;m c2 there could be arbitrary dependencies between c1 and c2 which mean they partially overlap, perhaps with pre- or post-sequences of non-overlapping instructions. Here associativity (Law 4.5) may help, allowing rearrangement of the text to split a composition into sections, for instance,

(4.19)  c1 ;m (c2 ;m c3) ≡ (c1 ;m c2) ;m c3
We now consider the cases where two commands are completely independent, and where one is always blocked by the other. Independence can be established by straightforwardly lifting <~m from instructions to commands, using the lifting conventions in Appendix A. The key property is that

(4.20)  c1 <~m c2 implies c1 ;m c2 ≡ c1 ∥ c2

This follows from partial execution of the semantics: at no point is there an instruction within c1 that prevents an instruction in c2 from executing (a trivial case is where c1 = nil).
To define the converse case, where c1 always prevents c2 from executing, consider the following. We write c1 ⊳⊳ c2 to indicate that at no point can c2 reorder during the execution of c1 (a trivial case is where c1 is a full fence). If c1 ⊳⊳ c2 then one can treat them as sequentially composed. We define these concepts with respect to the operational semantics.

(4.25)  c1 ⊳ c2 ≙ ∀ a, c2′ • (c2 —a→ c2′) implies ¬(c1 <~m a)
(4.30)  c1 ⊳⊳ c2 ≙ ∀ t, c1′ • (c1 ==t==> c1′ and c1′ is unfinished) implies c1′ ⊳ c2

We write c1 ⊳ c2 when all immediate next possible steps of c2 are blocked by c1 (4.25). Thus c1 ⊳⊳ c2 holds when, after any unfinished partial execution of c1 via some trace t resulting in c1′, c1′ continues to block c2 (4.30). We exclude from the set of partial executions the cases where execution is effectively finished, i.e., when c1′ is nil or equivalent (otherwise c1 ⊳⊳ c2 would never hold as the final nil allows reordering). From these we can derive:

(4.33)  c1 ⊳⊳ c2 implies c1 ;m c2 ≡ c1 ; c2
Such reduction may lift to more complex structures, for instance, the following law is useful for sequentialising loops when each iteration has no overlap with the preceding or succeeding ones.

(4.36)  c ⊳⊳ c implies c^{m*} ≡ c^{sc*}
We now show how the application of reduction laws to eliminate (elucidate) the allowed reorderings in terms of the familiar sequential and parallel composition enables the application of standard techniques for analysis.
5. Applying concurrent reasoning techniques
In this section we show how the reduction rules in the previous section can be used as the precursor to the application of already established techniques for proving properties of concurrent programs. For correct, real algorithms influenced by weak memory models there will be no reordering except where it does not violate whatever the desired property is (be that a postcondition, or a condition controlling allowed interference). In such circumstances our framework supports making the allowed reorderings explicit in the structure of the program, with a corresponding influence on the proofs of correctness. Where a violation of a desired property occurs due to reordering, the framework also supports the construction of the particular reordering that leads to the problematic behaviour.
The IMP+pseq language includes several aspects which do not exist in a standard imperative language, namely, fences and the ordering constraints that annotate variables and fences. We start with a simple syntactic notion of equivalence modulo these features, reducing a program in IMP+pseq into its underlying ‘plain’ equivalent (fences and ordering constraints have no effect on sequential semantics directly). We then explain how techniques such as Owicki-Gries and rely/guarantee may be applied.
5.1. Predicate transformer semantics and weakest preconditions
The action-trace semantics of Sect. 3.3 can be converted into a typical pairs-of-states semantics straightforwardly, as shown in Fig. 4. Let Σ be the set of total mappings from variables to values, and let the effect function eff return a relation on states given an instruction. We let ‘id’ be the identity relation on states, and given a Boolean expression b we write σ ⊨ b if b is true in state σ, and σ(e) for the evaluation of e within state σ (note that ordering constraints are ignored for the purposes of evaluation). The effect of an assignment x := e is a straightforward update of σ (defn. (5.1)), where σ is overwritten so that x maps to σ(e). A guard is interpreted as a set of pairs of identical states that satisfy b (defn. (5.2)), giving trivial cases eff([true]) = id and eff([false]) = {}. A fence has no effect on the state (defn. (5.3)). Conceptually, mapping eff onto an action trace yields a sequence of relations corresponding to a set of sequences of pairs of states in a standard Plotkin-style treatment (Plotkin, 2004). We can lift eff to traces by composing such a sequence of relations, which is defined recursively in (5.7), and the effect of a command is given by the union of the effect of its traces (5.8).
(5.1)  eff(x := e) ≙ {(σ, σ[x ↦ σ(e)]) | σ ∈ Σ}
(5.2)  eff([b]) ≙ {(σ, σ) | σ ⊨ b}
(5.3)  eff(fence) ≙ id
(5.4)  eff(⟨⟩) ≙ id   eff(⟨α⟩ ⌢ αs) ≙ eff(α) ; eff(αs)   (lifting to actions)
(5.7)  eff(⟨⟩) ≙ id   eff(⟨a⟩ ⌢ t) ≙ eff(a) ; eff(t)   (lifting to traces)
(5.8)  eff(c) ≙ ⋃ {eff(t) | t ∈ beh(c)}
(5.9)  wp(c, P) ≙ {σ | ∀ σ′ • (σ, σ′) ∈ eff(c) implies σ′ ∈ P}

Figure 4. The effect function eff, lifted to actions, traces and commands, and the weakest precondition transformer
The predicate transformer for weakest precondition semantics is given in defn. (5.9). A predicate is a set of states, so that given a command c and predicate P, wp(c, P) returns the set of (pre) states where every post-state related by eff(c) satisfies P (following, e.g., (Dijkstra and Scholten, 1990)). We define Hoare logic judgements with respect to this definition (note that we deal only with partial correctness as we consider only finite traces). From these definitions we can derive the standard rules of weakest preconditions and Hoare logic for commands such as nondeterministic choice and sequential composition, but there are no general compositional rules for parallelized sequential composition.
Trace refinement is related to these notions as follows.
Theorem 5.1 (Refinement preserves sequential semantics).
Assuming c ⊑ c′ then wp(c, P) ⊆ wp(c′, P) for any predicate P.
Proof.
Straightforward from definitions.
The action-trace semantics can be converted into a typical pairs-of-states semantics straightforwardly, based on the effect function. The relationship with a standard Plotkin-style operational semantics (Plotkin, 2004) is straightforward, i.e.,

if c —a→ c′ and (σ, σ′) ∈ eff(a) then (c, σ) → (c′, σ′)
The advantage of our approach is that the syntax of the action can be used to reason about allowed reorderings using Rule 3.18, whereas in general one cannot reconstruct or deduce an action from a pair of states.
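As a small worked example under these definitions (our own illustration, not from the paper):

\begin{align*}
\mathit{eff}(x := 1) &= \{(\sigma, \sigma[x \mapsto 1]) \mid \sigma \in \Sigma\}\\
\mathit{eff}(\langle x := 1 \rangle \frown \langle y := x \rangle)
  &= \mathit{eff}(x := 1) \mathbin{;} \mathit{eff}(y := x)
   = \{(\sigma, \sigma[x \mapsto 1][y \mapsto 1]) \mid \sigma \in \Sigma\}\\
\mathit{wp}(x := 1 ; y := x,\; y = 1) &= \Sigma
  \qquad \text{(every initial state establishes } y = 1\text{)}
\end{align*}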
5.2. Relating to standard imperative languages
The language IMP+pseq extends a typical imperative language with four aspects which allow it to consider weak memory models: ordering constraints on variables; fence instructions; parallelized sequential composition; and parallelized iteration. The first two have no direct effect on the values of variables (the sequential semantics essentially ignores them), while the latter two affect the allowed traces and hence indirectly affect the values of variables. However, both parallelized sequential composition and parallelized iteration can be instantiated to correspond to usual notions of execution; hence we consider a plain subset of IMP+pseq which maps to a typical imperative program.
Definition 5.2 (Plain imperative programs).
A command c of IMP+pseq is plain if:

(i) all instances of parallelized sequential composition in c are parameterised by either sc or par, i.e., correspond to standard sequential or parallel composition; and

(ii) all instances of parallelized iteration in c are parameterised by sc, i.e., loops are executed sequentially.
Definition 5.3 (Imperative syntax equivalence).

Given a plain command c in IMP+pseq, a command c′ in a standard imperative language (which does not contain memory ordering constraints on variables or fences) is imperative-syntax-equivalent to c if c and c′ are structurally and syntactically equivalent except that:

(i) all variable references in c (which are of the form x^OC for some set of ordering constraints OC) appear in c′ as simply x; and

(ii) no fence instructions are present in c′, that is, they can be thought of as no-ops and removed.

We write c ≋ c′ if c′ is imperative-syntax-equivalent to c, or if there exists a c″ where c ≡ c″ and c′ is imperative-syntax-equivalent to c″.
For example, the following two programs are imperative-syntax-equivalent (where ‘;’ in the program on the right should be interpreted as standard sequential composition, i.e., ‘;sc’ in this paper).

(5.10)  (d := 1 ; fence ; f^rel := 1)  ≋  (d := 1 ; f := 1)

The program on the left is plain because all instances of parallelized sequential composition correspond to sequential or parallel execution.
Note that there is no imperative-syntax-equivalent form unless all instances of parallelized sequential composition have been eliminated/reduced to sequential or parallel forms. A structurally typical case is the following, where c reduces to a straightforward imperative-syntax-equivalent program.

(5.11)  c ≡ (c1 ; fence ; c2) ∥ (d1 ; fence ; d2)
(5.12)  c ≋ (c1 ; c2) ∥ (d1 ; d2)

For reasoning purposes one can use the latter program rather than the former. (Strictly speaking we should not use c1, etc., on both sides of ≋ because the plain version of c1 does not contain ordering constraints on variables; for brevity we ignore the straightforward modifications required in this case.) This is because for a plain program in IMP+pseq the semantics correspond to the usual semantic interpretation of imperative programs, ignoring ordering constraints and treating fences as no-ops.
Theorem 5.4 (Reduction to standard imperative constructs).

If c ≋ c′, then eff(c) is exactly equal to the usual denotation of c′ as a relation (or sequence of relations) on states; and hence any state-based property of c, expressed via eff, can be established using any method that uses a standard denotation for programs.
Proof.
From Defn. (5.3) and Fig. 4 we can see that i) ordering constraints have no effect on states; ii) fence instructions are equivalent to a no-op (in terms of states) and hence can be ignored; iii) Rule 3.18 reduces to the usual rule for sequential composition when reordering is never allowed, and for parallel composition when reordering is always allowed; and iv) Rule 3.17 and defn. (3.6) reduce to the usual sequential execution of a loop when instantiated with sc.
We introduce some syntax to help with describing examples.
Definition 5.5 (Plain interpretation).

Given a command c in IMP+pseq, which need not be plain, we let plain(c) be the plain interpretation of c, that is, where i) all instances of parallelized sequential composition and parallelized iteration in c are instantiated by sc, except for instances of parallelized sequential composition instantiated by par (corresponding to parallel composition), which remain as-is; and ii) fences in c do not appear in plain(c).
Establishing properties about programs may proceed as outlined below. Assume for some program c in IMP+pseq that its plain interpretation plain(c) satisfies a property φ under the usual imperative program semantics according to some method M. There are three approaches:

(1) Reduction to plain equivalence. Show that the ordering constraints (due to variables or fences) within c are such that it reduces equivalently to plain(c) (i.e., no reordering is possible within c). Hence c satisfies φ using M in C.

(2) Reduction to some plain form. Reduce c equivalently to some plain command d and apply M (or some other known method) to d to show that φ holds.

(3) Reduce and deny. Refine c to some plain command d and apply M (or some other known method) to d to show that φ does not hold. This corresponds to finding some new behaviour (due to a new ordering of instructions) that breaks the original property φ; in this paper d tends to be a particular path and we straightforwardly apply Hoare logic to deny the original property φ.
We now explain how standard notions of correctness can be applied in our framework.
5.3. Hoare logic
We define a Hoare triple using weakest preconditions (because we only consider finite traces we do not deal with potential non-termination): {P} c {Q} ≙ P ⊆ wp(c, Q).

As with Theorem 5.1, refinement preserves Hoare-triple inferences.

(5.13)  if c ⊑ c′ and {P} c {Q} then {P} c′ {Q}

Given some plain c′ that is equivalent to c we can apply Hoare logic.
Traditional Hoare logic is used for checking that every final state satisfies a postcondition, but it is also useful to consider reachable states, which can be defined using the conjugate pattern ⟨P⟩ c ⟨Q⟩ (see, e.g., (Hoare, 1978; Morgan, 1990; Winter et al., 2013)); it is related to, but different from, O’Hearn’s concept of incorrectness logic (O’Hearn, 2019) (which is stronger except in a special case of the post-state).
Given these definitions we naturally derive the following properties in terms of a relational interpretation of programs.

(5.14)  {P} c {Q} iff ∀ (σ, σ′) ∈ eff(c) • σ ∈ P implies σ′ ∈ Q
(5.15)  ⟨P⟩ c ⟨Q⟩ iff ∃ (σ, σ′) ∈ eff(c) • σ ∈ P ∧ σ′ ∈ Q
Due to its familiarity we use Hoare logic for top-level specifications, although other choices could be made. However Hoare logic is of course lacking in the presence of concurrency (due to parallelism or reordering), and hence we later consider concurrent verification techniques. At the top level we use the following theorems to formally express our desired (or undesired) behaviours.
Theorem 5.6 (Verification).
(5.16)  if c ≡ d and {P} d {Q} then {P} c {Q}
(5.17)  if c ⊑ d and ⟨P⟩ d ⟨¬Q⟩ then ¬({P} c {Q})
(5.18)  if c ⊑ d and ⟨P⟩ d ⟨Q⟩ then ⟨P⟩ c ⟨Q⟩
Theorem 5.16 allows properties of a reduced program (d) to carry over to the original program (c). Alternatively by Theorem 5.17 if any behaviour (of d) is found to violate a property then that property cannot hold for the original (c). Finally by Theorem 5.18 if any behaviour (of d) is found to satisfy a property then it is a possible behaviour of the original (c).
5.4. Owicki-Gries
The Owicki-Gries method (Owicki and Gries, 1976) can be used to prove a top-level Hoare-triple specification of a parallel program c1 ∥ c2. If the program involves reordering, then reducing it to some plain form allows the application of Owicki-Gries; several examples of this approach appear in (Colvin, 2021a).
For instance, recalling (5.11), given a program of the form (c1 ; fence ; c2) ∥ (d1 ; fence ; d2) for which one wants to establish some property specified as a Hoare triple, one can use Owicki-Gries on the plain program (c1 ; c2) ∥ (d1 ; d2) by Theorem 5.4: the fences enforce program order, but can be ignored from the perspective of correctness.
Now consider the following slightly more complex case with nested parallelism, where one process executes two independent actions α and β in parallel, alongside a process that uses a form of critical section.

Unfortunately the Owicki-Gries method is not directly applicable due to the nested parallelism, but in this simple case we can use the fact that execution of two (atomic) actions in parallel means that either order can be chosen.
(5.20)  α ∥ β ≡ (α ; β) ⊓ (β ; α)
Given an Owicki-Gries rule for nondeterministic choice this can be reasoned about directly, or alternatively we can lift this choice to the top level (Law 4.4). Continuing from (5.4):

(5.21)  c ∥ (α ∥ β) ≡ (c ∥ (α ; β)) ⊓ (c ∥ (β ; α))
Now Owicki-Gries can be applied to both top-level possibilities. Clearly this is not desirable in general, however, and in the next section we show how the compositional rely/guarantee method can be used to handle nested parallelism.
5.5. Rely/guarantee
Rely/guarantee (Jones, 1983a, b) (see also (Coleman and Jones, 2007; Hayes et al., 2012, 2014)) is a compositional proof method for parallel programs. A rely/guarantee quintuple {P, R} c {G, Q} states that program c establishes postcondition Q provided it is executed from a state satisfying P and within an environment that satisfies (rely) relation R on each step; in addition c also satisfies (guarantee) relation G on each of its own steps. For instance, {x ≥ 0, x ≤ x′} c {y′ = y, Q} states that c establishes Q when it finishes execution, and guarantees that it will not modify y, provided x ≥ 0 holds initially and the environment only ever increases x. In the relations above we have used the common convention that primed variables (x′) refer to their value in the post-state and unprimed variables (x) refer to their value in the pre-state. A top-level Hoare-triple specification can be related to a rely/guarantee quintuple by noting {P} c {Q} = {P, id} c {Univ, Q}, that is, if P holds initially in some top-level context that does not modify any variables, then c establishes Q (with the weakest possible guarantee).
Reasoning using the rely/guarantee method is compositional over parallel composition; for instance, without going into detail about rely/guarantee inference rules, consider the plain program from (5.4). We can show a rely/guarantee quintuple holds provided we can find rely/guarantee relations that control the communication between the two subprocesses; we abstract from these nested relations using different names.
This is a straightforward application of standard rely/guarantee inference rules (where we leave the format of the predicates and relations unspecified). Reasoning may proceed from this point, in particular, analysing the quintuple containing the nested parallel composition α ∥ β. The question becomes whether the guarantee and the intermediate relation are satisfied regardless of the order in which α and β are executed, and if not, a fence or ordering constraints will need to be introduced to enforce order and eliminate the undesirable proof obligation. While completing the proof may or may not be straightforward, the point is that it will be a matter of the specifics of the example in question, and not of a tailor-made inference system to manage a complex semantic representation. The initial work (as shown in (5.4)) is to elucidate the allowed reorderings as either sequential or parallel composition.
5.6. Linearisability
Linearisability is a correctness condition for concurrent objects (Herlihy and Wing, 1990), which is not an “end-to-end” property such as Hoare triples and rely/guarantee quintuples, but rather requires that operations on some data structure appear to take place atomically, with weakened liveness guarantees. The following abstract program op follows a common pattern for operations on lock-free, linearisable data structures (in this case on a shared variable s), where there may be other processes also executing op.

(5.22)  op ≙ repeat (r := s ⨟ n := f(r) ⨟ CASr(s, r, n, done)) until done
The algorithm repeatedly reads the value of s (into local variable r), calculates a new value for s (applying some function f, and storing the result in local variable n), and attempts to atomically update s to n; if interference is detected (s has changed since being read into r) then the loop is retried, otherwise the operation may complete.
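A hedged C sketch of this pattern (the function f and the names s, op are abstractions following (5.22)):

#include <stdatomic.h>

atomic_int s;   // the shared location being operated on
int f(int v);   // computes the new value from a snapshot (assumed given)

void op(void) {
  int r, n;
  do {
    r = atomic_load_explicit(&s, memory_order_relaxed);  // r := s
    n = f(r);                                            // n := f(r)
    // the CAS fails, and the loop retries, if s changed since it was
    // read into r (interference from a concurrent operation)
  } while (!atomic_compare_exchange_strong_explicit(
               &s, &r, n, memory_order_acq_rel, memory_order_relaxed));
}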
The natural dependencies arising from the above structure maintain order, that is, each instruction depends on a value calculated by its predecessor, hence op ≡ plain(op), and any reasoning about op executing under sequential consistency – in a vacuum – applies to the C version of op. (Other algorithms may of course have fewer dependencies, which will manifest as more complex parallel structures, which must be elucidated via reduction.) Alternative approaches to reasoning about linearisability under weak memory models typically modify the original definition in some way to account for reorderings (Derrick and Smith, 2017; Derrick et al., 2014).
However when considering a calling context reordering must also be taken into account. That is, assume c is some program which contains a call to op, and that c is just one of many parallel processes which may be calling op. Multiple calls to op (or other operations on s) within c are unlikely to cause a problem because the natural dependencies on s will prevent problematic reordering; however one must consider the possibility of unrelated instructions in c reordering with (internal) instructions of op. For instance, consider a case where s implements a queue and op places a new value into the queue, and within c a flag is set immediately after calling op to indicate the enqueue has happened, i.e., op ⨟ f := 1.
The assignment f := 1 can be reordered with instructions within op and destroy any reasoning that uses f to determine when op has been called. Placing a release constraint on the flag update resolves the issue (which can be shown straightforwardly in our framework since nothing earlier may be reordered after the release store, or in other words, op ⨟ f^rel := 1 ≡ op ; f^rel := 1), but more generally this leaves a question about how to capture dependencies within the specification of op. Hence, while one can argue that op is “linearisable” in the sense that if it is executed in parallel with other linearisable operations on s it will operate correctly, whether or not a calling context works correctly will still depend on reordering (Batty et al., 2013; Smith and Groves, 2020); this can also be addressed in our framework, using reduction and applying an established technique (e.g., (Schellhorn et al., 2014; Smith, 2016; Filipović et al., 2010)).
6. Examples
In this section we explore the behaviour of small illustrative examples from the literature, specifically classic “litmus test” patterns that readily show the weak behaviours of memory models, and also examples taken from the C standard.
To save space we define the following initialisation predicate for a list of variables.
(6.1)
6.1. Message passing pattern
Consider the message passing (“MP”) communication pattern.
(6.2)
The question is whether in the final state , i.e., if the “flag” is observed to have been set, can one assume that the “data” () has been transferred? This is of course expected under a plain interpretation (Defn. (5.5)).
Theorem 6.1 (Message passing under sequential consistency).
(6.3)
Proof.
The proof is checked in Isabelle/HOL using Nieto’s encoding of rely/guarantee inference (Nieto, 2003); the key part of the proof is the left process guarantees and (translated into relational form).
If we naively code this pattern in C this property no longer holds.
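For concreteness, a naive C rendering of the pattern might use relaxed atomics throughout; the names data and flag are illustrative.

```c
#include <stdatomic.h>

atomic_int data = 0, flag = 0;

void producer(void) {
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);  /* may be reordered with the data store */
}

void consumer(void) {
    int f = atomic_load_explicit(&flag, memory_order_relaxed);
    int d = atomic_load_explicit(&data, memory_order_relaxed);  /* may be reordered with the flag load */
    (void)f; (void)d;   /* f == 1 no longer guarantees d == 1 */
}
```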
Theorem 6.2 (Naive message passing fails under c).
Proof.
By definition of c (Model 2.6) we have both and .
(6.4) defn. (6.2)
(6.5) by Law 4.9
(6.6) by Law 4.3
All instructions effectively execute in parallel; in the final step we have picked one particular interleaving that breaks the expected postcondition, that is,
(6.7)
Since we complete the proof by Theorem 5.17.
C’s release/acquire atomics are the recommended way to instrument message passing; we make this explicit below using release-acquire constraints on (leaving relaxed).
(6.8)
The new constraints prevent reordering in each branch and therefore the expected outcome is reached.
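In C syntax the constrained version of (6.8) might read as follows, with only the flag accesses strengthened (again, the names are illustrative).

```c
#include <stdatomic.h>

extern atomic_int data, flag;

void producer(void) {
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);  /* the data store cannot pass the rel store */
}

void consumer(void) {
    int f = atomic_load_explicit(&flag, memory_order_acquire);  /* the data load cannot precede the acq load */
    int d = atomic_load_explicit(&data, memory_order_relaxed);
    (void)f; (void)d;   /* now f == 1 implies d == 1 */
}
```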
Theorem 6.3 (Release/acquire message passing maintains sequential consistency).
(6.9)
Proof.
An alternative approach to restoring order is to insert fences, e.g.,
(6.11)
(6.12)
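A sketch of the fence-based variants in C, assuming the fences of (6.11)/(6.12) correspond to atomic_thread_fence with release and acquire orderings respectively:

```c
#include <stdatomic.h>

extern atomic_int data, flag;

void producer_fenced(void) {
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* keeps the data store before the flag store */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void consumer_fenced(void) {
    int f = atomic_load_explicit(&flag, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* keeps the flag load before the data load */
    int d = atomic_load_explicit(&data, memory_order_relaxed);
    (void)f; (void)d;
}
```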
No reordering is possible within each thread and hence these cases also reduce to . We emphasise that these proofs are as straightforward as one would expect: the programmer has enforced order and so properties of the sequential version of the program carry over. There is no need to appeal to complex global data types that maintain information about fences, orderings, etc.
6.2. Test-and-set lock
We now consider the application of the framework to more realistic code, in this case a lock implementation taken from (Herlihy and Shavit, 2011, Sect. 7.3). Conceptually the shared lock is represented as a boolean which is when some process holds the lock, and otherwise.
(6.13)
(6.14)
(6.15)
(6.16)
A “repeat-until” command repeatedly executes (at least once) until is true, under memory model m, which we encode using parallelized iteration (defn. (6.13)). A “get-and-set” command returns the initial value of into and updates to the value of , as a single step (defn. (6.14)). The load of is defined as an acq access and the update is a rel write. Using these the concurrent procedure is defined to repeatedly set to , and finish when a value for is read. If the lock is already held by another process then the get-and-set has no effect and the loop continues, until the case where is (as recorded in the local variable ), when is updated to to indicate the current process now has the lock. Unlocking is implemented by simply setting to .
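A C sketch of the lock is given below; atomic_exchange_explicit plays the role of the get-and-set of defn. (6.14), and the acq_rel ordering reflects our reading of the definitions (an acq load combined with a rel write). The names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool held = false;   /* true when some process holds the lock */

void lock(void) {
    bool was_held;
    do {
        /* get-and-set: return the old value and store true, as one atomic step */
        was_held = atomic_exchange_explicit(&held, true, memory_order_acq_rel);
    } while (was_held);     /* retry while the lock was already held */
}

void unlock(void) {
    atomic_store_explicit(&held, false, memory_order_release);
}
```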
We have the following general rule for spin loops of the form in lock().
(6.17)
Intuitively, since each iteration of the loop updates a variable that is then checked in the guard, no reordering is possible, and it is as if the loop is executed in sequential order. This holds by the following reasoning.
Recalling that is the plain version of , i.e., ignoring the ordering constraints and using sequential composition only, we have the following.
Theorem 6.4.
Proof.
Therefore this c implementation is ‘correct’ if the original is. However, as discussed in Sect. 5.6, one must consider the calling context. Theorem 6.4 does not depend on the ordering constraints in the definition of lock(), as its natural data dependencies maintain order. However, this does not also guarantee, for instance, that , which may be a desirable property if the behaviour of assumes the lock is held. By defn. (6.15) we have and , which in many cases will be enough to allow sequential reasoning in the calling context; alternatively, fences could be inserted around .
6.3. Out-of-thin-air behaviours
We now turn our attention to the “out of thin air” problem, where some memory model specifications allow values to be assigned where those values do not appear in the program. Firstly consider the following program, which appears in the standard.
(6.18)
Under a plain interpretation neither store ever happens: one of the loads must occur first, and its subsequent test must fail, preventing the condition in the other process from succeeding.
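A C sketch of (6.18), following the shape of the example in the standard; the names x, y, r1, r2 and the relaxed constraints are illustrative.

```c
#include <stdatomic.h>

atomic_int x = 0, y = 0;

void thread1(void) {
    int r1 = atomic_load_explicit(&y, memory_order_relaxed);
    if (r1 == 42)
        atomic_store_explicit(&x, 42, memory_order_relaxed);
}

void thread2(void) {
    int r2 = atomic_load_explicit(&x, memory_order_relaxed);
    if (r2 == 42)
        atomic_store_explicit(&y, 42, memory_order_relaxed);
}
/* the specification permits the outcome r1 == r2 == 42, although
   implementations are discouraged from exhibiting it */
```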
Theorem 6.5.
.
Proof.
Trivial using Owicki-Gries reasoning, checked in Isabelle/HOL (Nipkow and Nieto, 1999).
However, this behaviour is allowed under the C memory model according to the specification (although compiler writers are discouraged from implementing it!). The behaviour is allowed in our framework, that is, it is possible for both and to read 42. This is because (unlike hardware memory models (Colvin, 2021b)) we allow stores to come before guards (via the relation, e.g., (2.33)).
Theorem 6.6.
.
Proof.
Under hardware weak memory models (the observable effects of) writes cannot happen before branch points, and so out-of-thin-air behaviours are not possible.
Consider the following variant of .
(6.20)
The inner assignments have changed from (resp. ) to (resp. ). Arguably the compiler knows that within the true branch of the conditional it must be the case that , and thus the assignment can be treated as , reducing to the original . But this outcome is expressly disallowed, by both the standard and naturally in our framework. That is, we can show that every possible final state satisfies .
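In C-like form, the variant replaces the stored constants with the values just loaded (names again illustrative):

```c
#include <stdatomic.h>

extern atomic_int x, y;

void thread1(void) {
    int r1 = atomic_load_explicit(&y, memory_order_relaxed);
    if (r1 == 42)
        atomic_store_explicit(&x, r1, memory_order_relaxed);  /* was: store the constant 42 */
}

void thread2(void) {
    int r2 = atomic_load_explicit(&x, memory_order_relaxed);
    if (r2 == 42)
        atomic_store_explicit(&y, r2, memory_order_relaxed);  /* was: store the constant 42 */
}
/* here r1 == r2 == 42 would be a genuine out-of-thin-air outcome,
   which the standard forbids */
```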
Theorem 6.7.
.
Proof.
The initial load into creates a dependency with the rest of the code, i.e., . Hence we can sequence the initial load of (into ) with the remaining code.
(6.21)
Although the guard and the assignment in the conditional may be reordered with each other (i.e., ), the fact that the initial load must happen first means that, similarly to Theorem 6.5, there is no execution of in which a non-zero value is written to any variable.
Note that the C standard allows the behaviour of and forbids the behaviour of , and both results arise naturally in our semantics framework without the introduction of any ad-hoc mechanisms. We return to the question of whether guards should be allowed to simplify instructions in Sect. 10.
7. Incremental evaluation of code
In this and subsequent sections we consider more complex aspects of the C language and execution where they relate to concurrent behaviour: in this section, non-atomic evaluation of instructions (until now we have assumed all instructions appearing in the text of the program are executed in a single indivisible step); in Sect. 8, optimisations of expressions (reducing expressions to improve efficiency); and in Sect. 9, forwarding (using earlier program text to simplify later instructions). We show how each can be incorporated into the framework straightforwardly without any need for change to the underlying definition of the c memory model, although the consequences for reasoning about particular programs may not be straightforward.
7.1. Incremental evaluation of expressions
In C one cannot assume expressions are evaluated in a single state. That is, programmers are allowed to write “complex” assignments and conditions, but these may be compiled into multiple (indivisible) assembler instructions. For instance, the assignment may compile into (at least) three separate instructions, one to load the value of , one to load the value of , and one to finally store the result to .
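For example, assuming an assignment x = y + z over shared variables, the compiled code might take the following shape, with r1 and r2 standing for hypothetical registers or temporaries.

```c
int x, y, z;        /* shared */

void assign(void) {
    int r1 = y;     /* load y */
    int r2 = z;     /* load z; y or z may be updated concurrently between the loads */
    x = r1 + r2;    /* compute locally and store the result to x */
}
```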
We give an operational semantics for incremental expression evaluation in Fig. 5. (We use the term “incremental” rather than the more usual “non-atomic” to avoid terminology clash with C’s atomics.) Recall the syntax of an expression from Fig. 1. Each evaluation step of either loads a value of a free variable or reduces the expression in some way, and evaluation stops when a single value remains. Rule 7.1 states that a variable access is evaluated to a value via the guard . Any choice of that is not the value of will result in a false guard, i.e., leads to an infeasible behaviour, which can be ignored. Only the correct value for leads to a feasible behaviour. Note that the set of ordering constraints on also appears in the label. Rule 7.4 evaluates a unary expression by simply inductively evaluating , while Rule 7.9 similarly evaluates each operand of a binary expression, in either order (it is of course straightforward to instead insist on left-to-right evaluation). Rule 7.14 uses a meta-level functor application method to calculate the final value of an expression once all variables have been evaluated.
(7.1)
(7.4)
(7.9)
(7.14)
Figure 5. Operational semantics for incremental evaluation of expressions.
As an example, consider the following possible evaluation of the expression .
(7.15)
The first step applies Rule 7.9 and (inductively) Rule 7.1, and the second step is similar. The choice of the values 3 and 2 is arbitrary, and any values could be chosen, reflecting the behaviour of the code in any state. The third and final step follows from (the assumed interpretation) . Note that the (relaxed) load of is not restricted by the acquire of appearing within the same expression, since we have given a nondeterministic expression evaluation order. Of course, one can change the rule for evaluating binary expressions to evaluate left-to-right and consider constraints appearing “earlier” in the expression.
7.2. Incremental execution of instructions
Now consider the incremental execution of a single instruction (for brevity, in this section we assume a single instruction is the base action of the language, rather than a list as in previous sections (defn. (3.1)); we give a full semantics for incremental execution of lists of instructions in Appendix B). We have the concept of indivisible () actions, which are the only instructions that may be executed directly in the operational semantics. We define an indivisible instruction as one where there are no shared variables to be read, i.e.,
From this we can derive
(7.16)
For instance, is indivisible, while is not – must be separately loaded and the result later stored into .
(7.19)
(7.22)
(7.25)
Figure 6. Incremental execution rules for instructions.
Incremental execution rules for instructions are given in Fig. 6. Rule 7.19 states that an assignment instruction is evaluated by first evaluating the assigned expression, and similarly Rule 7.22 states that a guard is executed by first incrementally evaluating the expression. Rule 7.25 states that directly executable instructions can be executed as a single action. This rule applies for fences, and when evaluation of assignment or guard instructions has reduced them to an indivisible form. Note that we allow the (final) evaluation steps to include an arbitrary number of local variables. We insist only on shared variables being evaluated in separate steps, as these involve interactions with the memory system.
As an example of instruction evaluation, recalling (7.15), we place this expression evaluation in a release write.
The first two steps are inherited from the expression evaluation semantics (via Rule 7.19). The final step is via Rule 7.25, noting . The individual evaluation of local variables is not affected by the shared memory system and hence the following incremental execution is also allowed (where we leave rlx accesses implicit),
Incremental execution makes explicit what a compiler might do with C code involving shared variables. Note that, importantly, the reordering relation did not have to change or be reconsidered at all, even though the scope of C considered was increased (and made more complex to reason about, as is always the case when considering incremental evaluation).
7.3. Reasoning about incremental evaluation
Syntax-directed, general inference rules (e.g., in an Owicki-Gries or rely/guarantee framework) are rare for concurrent languages with incremental evaluation of expressions and instructions, irrespective of weak memory models. For instance, a simple program like can result in final states where when executed incrementally, but this is not immediately derivable from typical compositional proof rules (Hayes et al., 2013). The problem is that, in general, syntax-directed proof rules do not directly apply as there are many places where interference may occur, which do not neatly align with syntactic terms. The situation is more complicated again with non-deterministic evaluation order, as specified by the C standard.
From a practical perspective the issue is typically resolved by making the atomicity explicit (possibly requiring the introduction of new temporary variables), i.e., we may rewrite the above program as follows.
(7.26)
In this format one can apply standard (non-incrementally evaluated) syntax-directed proof rules, such as Owicki-Gries or rely/guarantee, and show that in the final state (possibly requiring the further introduction of auxiliary variables or other techniques). Of course, specialised rules to handle particular forms of incrementally-evaluated instructions can be derived, and these may be applied in some cases, but in general the intent is to deal precisely with communication/behaviour as written in the text of the program.
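The concrete program of (7.26) is elided above; as an illustrative stand-in, assuming a statement y = x + x over a shared x, the explicit-atomicity rewrite introduces one temporary per shared access:

```c
int x;              /* shared */
int y;              /* local result */

void rewritten(void) {
    int r1 = x;     /* first shared access, now an explicit atomic step */
    int r2 = x;     /* second shared access; x may have changed between */
    y = r1 + r2;    /* purely local computation and store */
}
/* e.g., with x initially 0 and a parallel thread executing x = 1,
   standard Owicki-Gries reasoning on this form derives the possible
   (odd) outcome y == 1, which the original y = x + x obscures */
```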
As the difficulty of handling possibly dynamically changing code structure impacts on reasoning about C programs, especially with reordering and incremental evaluation, we make this clearer by expressing it in terms of a definition and some remarks.
Definition 7.1 (Atomic-syntax-structured code).
A command in standard imperative programming syntax, where all basic building blocks (conditions, assignments) of the command are evaluated/executed atomically, and execution proceeds in program order, is atomic-syntax-structured code.
Note that a subset of IMP+pseq can be atomic-syntax-structured, especially if taking the plain subset (Defn. (5.2)) and using the semantics of Sect. 3.2.
Remark 7.1.
Most inference systems for concurrent code work on the basis that the code is atomic-syntax-structured; it is non-trivial to apply syntax-based approaches if the syntax does not directly map to execution. Often the atomicity is made explicit by introducing temporary variables, or a non-syntax-based approach is used for verification, e.g., translating into automaton systems where, again, the atomicity is explicit.
Remark 7.2.
Many other approaches are still applicable to non-atomic-syntax-structured code, for instance, model checking.
We emphasise that C programs are in general not atomic-syntax-structured, which complicates analysis by some techniques, regardless of whether or not the c memory model is taken into account.
It is beyond the scope of this paper to develop rules that handle non-atomic-syntax-structured code but, as before, one may apply such rules after reduction. We argue that the level of granularity should be made explicit, or in other words, programmers (who wish to do analysis) should restrict themselves to instructions that are directly executable; for instance, normal assignments that reference at most one shared variable (see, e.g., (Hayes et al., 2021) for rules coping with such situations).
If the developer insists on reasoning about code that is not atomic-syntax-structured then some of the reduction rules need provisos under incremental execution. For instance, Law 4.8 holds only when both instructions are indivisible, i.e., it must be updated to ensure the relevant instructions are .
If either is not indivisible then there may be parts of that can be incrementally evaluated before ; for instance, consider:
Although due to , the load of can come before , i.e., it is not the case that under incremental execution. The reference to can be incrementally evaluated and reordered before the store to .
(7.27)
As with proof methods, specific reduction rules to handle incremental evaluation can be derived (possibly using a program-level encoding of evaluation as given in (Colvin et al., 2017; Hayes et al., 2019)).
8. Expression optimisations
We now consider a further important factor influencing execution of programs under the C memory model: expression optimisations (we consider structural optimisations, for instance changing loop structures, in Sect. 11.2). There are three principles to consider when “optimising” expression to :
• Value equality. Expression must be equal to in all states. However, as we see below, extra contextual information can be used.
• Lexicographic simplification. We say if is a “more optimised” expression than , in the sense that it is less computationally intensive to evaluate. A precise definition of for C is beyond the scope of this work, but we assume it is irreflexive and transitive, and that intuitive properties such as , and hold. An important aspect is that may allow the removal of variables (as in ), and this could have an effect on allowed reorderings according to Model 2.2, g (i.e., one cannot rely on “false dependencies” (Alglave et al., 2014)).
• Memory ordering constraint simplification. We say if does not lose any significant memory ordering constraints. For instance, it may be the case that compilers should not “optimise away” an explicit sc constraint, even if valid according to the other optimisations. Again, the precise definition of this is a matter for the C committee, but we explore some of the options and their consequences below in Sect. 8.1.
These three constraints must be satisfied before an expression is ‘optimised’ to expression , written .
(8.1)
The following operational rule allows optimisations as an expression evaluation step, superseding Rule 7.14 and in some cases removing the need for Rule 7.9, etc.
Rule 8.2 (Optimise expression).
Rule 8.2 states that an expression can be optimised to some expression , in the process emitting a guard , where provides the context which makes the optimisation valid (the expression is used only to show in defn. (8.1)). The guard acts as a check that the optimisation is valid in the current state; for many optimisations will simply be .
As an example, assuming a definition of where holds provided contains only relaxed or no ordering constraints, and has a subset of those, then the following steps are allowed by Rule 8.2.
Since
Since
Since
On the other hand, given the above assumption about , cannot be optimised to as this would lose a significant reordering constraint ( and hence ). We consider other examples in the subsequent section.
8.1. Defining allowed changes to ordering constraints
One of the tensions in the development of the C memory model is how accepted compiler optimisations interact with memory ordering constraints. Rather than take a particular position, or try to be exhaustive, we show how different options can be expressed and their consequences enumerated formally and (relatively) straightforwardly.
Consider the following five possible definitions for . Recall that extracts the memory ordering constraints from expression (defn. (2.69)). We abbreviate , as these are the significant ordering constraints that may appear in expressions (rel constraints appear only on the left-hand side of assignments and thus are not subject to expression optimisation).
(8.8)
• Option (a) is the most conservative option and simply says the compiler must not change the constraints in at all. This would be simple to implement and reason about, but possibly prevents some sensible/expected optimisations.
• Option (b) says that only relaxed or non-atomic expressions may be removed, and the optimised expression can either remain relaxed or become non-atomic itself. Stronger constraints (acq, con, and sc) will not be “optimised away”.
• Option (c) says that only relaxed or non-atomic expressions can be optimised; however, any such constraints can be strengthened (e.g., a rlx access can be strengthened to sc). While more subtle than the other options, it imposes fewer constraints on the compiler writer, and would only have the effect of reducing the number of behaviours of code.
• Option (d) requires acq and sc constraints to be maintained, if they occur, but other parts of a complex expression can be optimised.
• Option (e) allows full freedom to the compiler, leaving the programmer unable to rely entirely on memory ordering constraints to enforce order.
We give some examples in tabular form to understand the consequences of these choices.
As with incremental evaluation of expressions, while it is straightforward to incorporate optimisations into the semantics, programs may not be atomic-syntax-structured (Defn. (7.1)), and hence a programmer interested in serious analysis is well advised to avoid potential simplification of expressions that a compiler may or may not choose to make.
8.2. The consume ordering constraint
Because C’s consume (con) constraint imposes no reordering constraint beyond that of data dependencies, we have for brevity not included it in earlier sections. The intent of a con load is to indicate to the compiler not to lose data dependencies during optimisations. Options (b)–(e) for allow rlx variable accesses to be removed, which may also remove a data dependency to some earlier load; this is precisely the situation that a con load is intended to avoid. For instance, the following optimisation should not be allowed.
The flow-on effect is that now is independent of and may be reordered before it (as con is equivalent to rlx in calculating ). To faithfully implement this a compiler must track data dependencies, and apparently this has never been done; as such, all known compilers translate con constraints directly to the stronger acq constraint (which on many architectures results in a more computationally expensive mechanism than necessary). Such syntactic tracking is straightforward, if tedious, to implement in a formal setting: for instance, by tracing data dependencies (via write and read variables) from all con loads, marking them with some special ordering constraint ‘condep’ (consume-dependent), and requiring the definition of to never allow condep constraints to be removed. As always, a programmer is well-advised to minimise the use of con constraints with later code that may allow expression optimisations to break data dependencies.
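To illustrate the problem in C syntax (names hypothetical): the second statement below is syntactically dependent on the con load, and that dependency is what the constraint is meant to preserve.

```c
#include <stdatomic.h>

extern atomic_int x;
extern int y;

void reader(void) {
    int r = atomic_load_explicit(&x, memory_order_consume);
    y = r * 0;   /* data-dependent on r, hence ordered after the con load */
    /* optimising this to  y = 0;  removes the dependency, after which the
       store to y may be reordered before the load of x */
}
```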
8.3. Examples
We show how the compiler may “optimise-away” an ordering constraint and hence open up more behaviours than the programmer may expect. In the following assume Option (b) above for the definition of . A programmer may choose to enforce order in the program (defn. (6.2)) using a data dependency.
Although and thus it seems the assignment to must occur after the assignment to , after optimising the second instruction the updates can occur in the reverse order. That is, by Rule 8.2 (within Rule 7.19), , and thus we have the following behaviour.
(8.9)
As a consequence, for reasoning about the original code, one must accept the following refinement.
Alternatively a programmer may choose to avoid the dependence on data and instead enforce order using an sc constraint in the flag expression on an unrelated variable. Under option (e) this would be erroneous, since , and thus as above,
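To make the first scenario concrete in C syntax, assuming Option (b) and illustrative names:

```c
#include <stdatomic.h>

extern atomic_int data, flag;

void producer(void) {
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    int d = atomic_load_explicit(&data, memory_order_relaxed);
    /* the flag expression mentions data, apparently ordering the two stores ... */
    atomic_store_explicit(&flag, d - d + 1, memory_order_relaxed);
    /* ... but the compiler may optimise  d - d + 1  to the constant 1,
       removing the (false) dependency and allowing the stores to reorder */
}
```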
9. Forwarding
A key aspect of hardware pipelines is that instructions later in the pipeline can read values from earlier instructions (under certain circumstances). At the microarchitectural level, rather than waiting for an earlier instruction to commit and write a value to a register before reading that value from the register, it may be quicker to read an “in-flight” value directly from an earlier instruction before it commits. This behaviour may be implemented by so-called “reservation stations” or related mechanisms (Tomasulo, 1967). Fortunately there is a straightforward way to capture this in a structured program, by allowing later instructions that are reordered to pick up and use values in the text of earlier instructions. For instance, the rules of the earlier section forbid reordering before , i.e., , which is conceptually prevented because reading an earlier value of (than 1) is prohibited by coherence-per-location (formally, because ). However processors will commonly just use the value 1 and assign it to as well, before any other process has seen the write to .
In hardware the mechanism of using earlier values in later instructions is called forwarding. Notationally we write to mean the effect of forwarding (assignment) action to , which is just simple substitution (ignoring memory ordering constraints). For instance, in the above situation the value 1 assigned to can be “forwarded” to , written , avoiding the need to access main memory.
Definition 9.1 (Forwarding).
Given an assignment instruction of the form , forwarding of to an expression (written ) is standard replacement of instances of by within , ignoring ordering constraints. This is lifted to forwarding to instructions as below.
Forwarding to/from commands and traces is similarly straightforward; see Appendix A.2. Note that , i.e., forwarding is relevant only if there is a data dependence. The following examples show the relatively straightforward application of forwarding.
(9.1)
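In C-like form, the motivating case reads as follows (a sketch; x and y are shared):

```c
int x, y;           /* shared */

void original(void) {
    x = 1;
    y = x;          /* reads x: cannot simply be reordered before the store */
}

void forwarded(void) {
    x = 1;
    y = 1;          /* the in-flight value 1 forwarded into the load of x;
                       the two stores are now independent and may reorder */
}
```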
To capture the potential for an earlier instruction to affect a later one we generalise reordering to a triple, . Whereas earlier we considered whether directly, we now allow to affect via forwarding, and for the resulting instruction () to be considered for the purposes of calculating reordering. More precisely, for instructions and ,
(9.2)
The reordering triple notationally gives the idea of executing earlier with respect to , with possible modifications due to forwarding. The new instruction is the result of forwarding to , and it must be reorderable with (note, therefore, that reordering is calculated after applying forwarding). For instance, expresses that can reorder with , after forwarding is taken into account. Additionally the effect of forwarding must not have significantly altered the ordering constraints, i.e., (recall the options in Sect. 8.1). For instance, depending on the definition of , it may or may not be the case that , as this depends on whether . The reordering triple lifts to traces and commands straightforwardly; see Appendix A.2.
We use this more general triple in place of binary reordering in Rule 3.18.
Rule 9.3 (Reorder with forwarding).
Now, given command , and that can execute step , then the composition can execute the modified ,
where takes into account any forwarding that may occur from instructions in to .
Consider the simple statement , which can be optimised to provided . That is, by Rule 8.2, . Let us consider this statement immediately following the assignment . By Defn. (9.1) and by (9.1) we have . Hence by applying Rule 9.3 this program can take an initial silent step, representing an optimisation of the compiler, to simplify the assignment to .
(9.4)
From here the assignment to can proceed first, despite the data dependence , that is,
9.1. Reduction with forwarding
We update some of the reduction rules from Sect. 4 to include forwarding via Rule 9.3, essentially replacing with the more general . We assume and . Note that if then , and hence in many cases the original rules apply.
(9.5)
(9.6)
(9.7)
(9.10)
Law 9.5 expresses the reordering of earlier than but with any forwarding from to taken into account in the promoted instruction, while Law 9.6 is the corollary of Law 4.9. Law 9.7 is the generalisation of Law 9.5 to a command on the left. Law 9.10 applies when reordering is not possible even taking into account the effects of forwarding, replacing Law 4.8. The presence of forwarding therefore complicates reduction; however, the derived properties for reasoning (Sect. 5) still apply to any reduced program.
A subtlety of a chain of dependent instructions with forwarding is that associativity can be lost. For instance, consider the program . The behaviours are different depending on how this is bracketed: or . We have by Law 9.7, however because . Hence , and so associativity has been broken. (A more complete discussion of associativity and monotonicity is given in (Colvin, 2021b).)
The addition of forwarding also explains why a compiler transformation such as “sequentialisation” (Kang et al., 2017a) is not valid. While it is straightforward that , i.e., enforcing a strict order between two parallel processes, it is not the case in general that , due to forwarding. The simplest case is , which has exactly two possible traces (interleavings); however, by Law 9.5, where receives the value 1 before it is assigned to , which is not possible when executing in parallel.
9.2. Example
Forwarding admits some perhaps unexpected behaviours, for instance, in the following code, although and are strictly ordered locally, it is possible for to receive that later value, ostensibly breaking local coherence.
Theorem 9.2.
Proof.
Call the left and right programs and , respectively. Note that in although precedes the assignment (the only occurrence of the value 1 in the program), and cannot be reordered before , this theorem states that can read that value. Firstly note that , that is, can read and be reordered before the preceding instructions, and hence by Law 9.7 we have
Hence interleaving and so that in becomes the second instruction executed gives the following.
This reordering and interleaving satisfies , and thus the outcome is a possible final state of by Theorem 5.18.
It appears that coherence has been broken (indirectly through a concurrent process). However, the intuition is that the compiler (and/or the processor) decides that the programmer intended when they wrote , and it is this value that is read by via the second process. To make this clearer, consider changing the trailing assignment in the first process to . In this case a possible final state is (but not ), meaning that the initial load of has read the (arguably independent) write to , not the write to .
Forwarding of assignments is considered standard, and behaviours such as that shown by this example are accepted (in real code a programmer will typically try to avoid loading a variable that has already been locally calculated, so this pattern, while certainly valid and useful, is not necessarily widespread). However, the situation is more complicated if one wishes to forward guards as well as assignments to later instructions, which we explore in Sect. 10.
9.3. Combining forwarding with optimisation/simplification
To keep the discussion of forwarding relatively simple we separated it from optimisation, but we can generalise defn. (9.2) to incorporate optimisations as well.
(9.11)
This definition allows any “optimised” version of to be reordered before , provided forwarding is taken into account. It can become the basis for the application of Rule 9.3, and thus the derived reduction rules in Sect. 9.1. For instance, in comparison with the example in Sect. 9, we can reorder, forward and optimise in a single step since, by defn. (9.11), . The corresponding deduction rules allow us to show, for instance, the following allowed reordering and simplification of code by the compiler.
(9.12)
because . This answers the question “can be reordered before ?”: provided all references are relaxed then can be executed earlier than the update to . Of course, structured reasoning about code which is susceptible to such compiler transformations may be nontrivial as it is not atomic-syntax-structured.
10. Read-from-untaken-branch (RFUB)/Self-fulfilling prophecies (SFPs)
In this section we separately consider a problematic situation which is debated by the C committee, where what is considered reasonable compiler optimisation leads to complex behaviour which is difficult to reason about. We show how the problematic behaviour is disallowed by the semantics we have given so far, and a small modification – which we call allowing self-fulfilling prophecies (SFP) – can be made which then allows the problematic behaviour; and we also show that allowing SFPs contradicts other, simpler cases which are expressly forbidden. We believe this provides a firm, accessible basis on which to assess which compiler optimisations should be allowed in the presence of shared-variable concurrency (i.e., C’s atomics). Importantly the framework gives a step-based, relatively intuitive, explanation for the different possible behaviours, and flexibility to accommodate different decisions.
10.1. Read-from-untaken-branch behaviour
Consider the following program, which, for particular compiler optimisations, exposes a “read from untaken branch” (rfub) behaviour.
(10.1)
Taking a plain, sequential execution of , starting with and all other variables 0, then both and are 42 in the final state, and is either 0 or 42 depending on when the right-hand process interleaves. The branch of the conditional is always executed.
Theorem 10.1.
Proof.
Straightforward: the only assignment of 42 is within the conditional, after the read of , and hence at that point cannot be 42. The condition always holds and the path is taken, setting to 42 and to , and finally setting to 42. The final value of depends on whether the assignment occurs before or after the assignment to . By corollary it is therefore not possible to reach a final state where all variables are 42 and .
Given this, in a sequential setting a compiler may choose to optimise the conditional as follows, letting ‘’ stand for a compiler transformation.
(10.2)
The conditional is eliminated, and now will be if when the branch would have been entered, and in the final state. More precisely, if we let be the original conditional and be the optimised version, they both satisfy the following, assuming initially, and is the initial value of .
The transformed code preserves the expected behaviour of .
Theorem 10.2.
Let be using the transformed conditional from (10.2). Then
Proof.
Straightforward reasoning as in Theorem 10.1.
If we consider reordering according to the c model and semantics we have given so far, the above state is still not reachable.
Theorem 10.3.
Proof.
More behaviours are possible, but it is still straightforward in our step-based semantics: if the branch is never taken there is no way for the value 42 to be forwarded to and so . Making the paths explicit (defn. (3.7)) the left process of rfub is equal to:
The first case allows , following reasoning from Theorem 10.1, but clearly holds in any such final state. In the second case is blocked by (but not by ), and since there are no (out-of-thin-air) assignments of 42 to in the concurrent process the guard will never be , and so this branch can never be taken. Although reordering before the guard is possible in the first branch (due to forwarding/optimisations), the guard stays in the program as an “oracle”, preventing inconsistent behaviour. The compiler/hardware can allow the store to go ahead, but it must be checked for validity later. The assignment to depends on which, in the branch, depends on , and the only way that can receive 42 in that case is if receives 42, but this is not possible as the only instance of 42 is in the branch, which is already ruled out.
However, the compiler transformation makes the state reachable.
Theorem 10.4.
Recalling is using the transformed conditional in (10.2),
Proof.
In the left-hand process of the assignment can be reordered to be the first instruction executed, which by forwarding becomes . More formally, by Law 9.7 we get the following.
(10.3)
Now consider the effect of interleaving from the second process immediately after this assignment.
Because reads 42 the value assigned to is , and thus holds in the final state. The proof is completed by Theorem 5.18.
The difficulty with this result is that, since in the final state, the branch of the original conditional from could not have been taken (since it contains ), therefore the branch was taken, i.e., the condition fails, and hence . But the only way for to hold is if the branch is/was executed, containing , but this has already been ruled out.
Since has a behaviour which does not, it cannot be the case that , which suggests the compiler transformation is not valid. Essentially, the transformation eliminates a guard and hence breaks a dependency-cycle check.
The question remains, however, whether the compiler transformation (10.2) is reasonable in the presence of the c memory model, and if so what the implications for the memory model are. We show how we can tweak the framework (specifically, increasing the circumstances under which the concept of forwarding can apply) to allow the possible state using the original version of (thus justifying the compiler transformation). However, there are other consequences. We can straightforwardly accommodate either outcome in this framework, and can do so with a clear choice that has enumerable consequences to other code: can guards be used to simplify expressions?
10.2. Self-fulfilling prophecies (forwarding guards)
In Defn. (9.1) we defined , for an assignment, to update expressions in according to the assignment. If is a guard (or fence) then . However, it is reasonable, in the sequential world at least, to allow guards to assist in transformations.
We introduce an extended version of forwarding, where returns a set of possible outcomes.
(10.4)
(10.5)
(10.6)
If is an assignment it returns a singleton set containing just (10.4), e.g., . However, if is a guard then can be used as context to modify (optimise) , e.g., by (10.5) and (10.6).
We modify the reordering triple defn. (9.2) to accommodate in place of .
(10.7)
Thus we incorporate using the guards in conditionals to justify reordering via Rule 9.3. For instance, following from above, .
The use of defn. (10.7) for the reordering triple used in Rule 9.3, and the derived laws such as Law 9.7, allows behaviours that weren’t possible before, for instance,
Since we also have
(which holds under defn. (9.2) as well as defn. (10.7)) we can show that a trailing assignment can use information from a preceding conditional for simplification.
(10.8)
The trailing assignment can be executed first, using the value 42 for , since both branches imply that will always be the value assigned to . We call this use of a guard to simplify a later assignment a self-fulfilling prophecy, for reasons which will become clear below.
10.3. Behaviour of read-from-untaken-branch with self-fulfilling prophecies
We now return to the behaviour of , this time allowing self-fulfilling prophecies via forwarding (defn. (10.7)). To highlight the impact this has we note refinement steps that are allowed under SFPs (using defn. (10.7) for the antecedent of Rule 9.3) using , and steps that are allowed normally (using defn. (9.2) for the antecedent of Rule 9.3) as . The following theorem stands in contradiction to Theorem 10.3.
Theorem 10.5.
Allowing self-fulfilling prophecies,
(10.9)
Proof.
This theorem shows that the debated outcome of rfub, which is possible under a seemingly reasonable compiler transformation, arises naturally if conditionals are allowed to modify future instructions. Indeed, this is essentially what is used to justify the transformation itself: if the branch is taken then already , so any later use of can be assumed to use the value 42. However, in a concurrent, reordering setting, there may be unexpected consequences. (Footnote: forbidding branches from simplifying later calculations does not prevent all reasonable compiler optimisations, for instance, ; this is because there is a definite assignment to in both branches.)
We do not intend to take a stand on what is ultimately the best choice for C; rather we show that whichever choice is taken can be linked to clear principles about whether or not guards should be allowed to simplify assignments in a concurrent setting, and that either choice can be explained in a step-based semantics.
10.4. Behaviour of out-of-thin-air with self-fulfilling prophecies
Recall the out-of-thin-air example (defn. (6.20)).
(10.15)
Its behaviour, which is intended to be forbidden, is allowed under SFPs, in contrast to Theorem 6.7.
Theorem 10.6.
Under SFPs, .
Proof.
Focus on the left-hand process, and in particular the behaviour when the branch is taken. The key step depends on , that is, the guard can be used to simplify the inner assignment.
(10.16)
(10.17)
(10.18)
Using symmetric reasoning in the second process and interleaving we have the following behaviour.
(10.19)
Here both stores have been promoted to the first instruction executed, which are then read into local registers and subsequently satisfy the guards (a “satisfaction cycle” (Batty et al., 2013)). The postcondition is straightforwardly reached.
(10.20)
The proof is completed by Theorem 5.18.
This demonstrates the problematic nature of allowing SFPs, and unifies the known underlying problem with these two related patterns ( and rfub).
11. Further extensions
In this section we discuss some other extensions to the model that may be of interest in some domains; however we emphasise that the definition of the memory model does not need to change, we simply make the language and its execution model richer.
11.1. Power and non-multicopy atomicity
IBM’s multicore Power architecture (Maranget et al., 2012; Sarkar et al., 2011; Mador-Haim et al., 2012) (one of the first commercially available multicore architectures) has processor pipeline reorderings similar to Arm (Flur et al., 2016), but in addition has a cache coherence system that provides weaker guarantees than those of Arm (and x86): (cached) memory operations can appear in different orders to different processes. For instance, although on Arm the instructions and may be reordered in the pipeline of a particular processor, whichever order they occur in will be the same for every other process in the system. On Power, however, one process may see the modifications in one order and another may see them in the reverse order (assuming no fences are used). This extra layer of complexity is introduced through cache interactions which allow one processor that shares a cache with another to read writes early, before the memory operation is fully committed to central storage (Sarkar et al., 2011).
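The standard litmus test for this weakness is “independent reads of independent writes” (IRIW), sketched below in C with illustrative names. Even if each reader’s two loads are kept in program order (here via acq constraints), on Power the two readers may disagree on the order of the two writes; sc accesses are required to forbid this.

```c
#include <stdatomic.h>

atomic_int x = 0, y = 0;

void writer_x(void) { atomic_store_explicit(&x, 1, memory_order_relaxed); }
void writer_y(void) { atomic_store_explicit(&y, 1, memory_order_relaxed); }

void reader1(void) {
    int r1 = atomic_load_explicit(&x, memory_order_acquire);
    int r2 = atomic_load_explicit(&y, memory_order_acquire);
    (void)r1; (void)r2;   /* may observe x == 1, y == 0 */
}

void reader2(void) {
    int r3 = atomic_load_explicit(&y, memory_order_acquire);
    int r4 = atomic_load_explicit(&x, memory_order_acquire);
    (void)r3; (void)r4;   /* may observe y == 1, x == 0: on Power both
                             observations can occur in the same execution */
}
```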
Programmers targeting implementations on Power architectures must take these extra behaviours into account (C accommodates the lack of multicopy atomicity to be compatible with this hardware). A formalisation of the Power cache system, separate from the processor-level reordering and compatible with the framework in this paper, is given in (Colvin and Smith, 2018), which is based on an earlier formalisation of the Power memory subsystem given in (Sarkar et al., 2011). The cache subsystem, and its formalisation in (Colvin and Smith, 2018), sits conceptually above the processes within the system, and stores and loads interact with it rather than with the traditional variable-to-value mapping. As such, local reorderings, and the influence of ordering constraints and fences, are still relevant. However, under the influence of such a system no code can be assumed to be atomic-syntax-structured (Defn. (7.1)), and thus traditional syntax-based reasoning techniques will not directly apply; by instituting the system of (Colvin and Smith, 2018) the consequences on behaviours can be determined. We do not devote more time to this feature of the C model in this paper, for several reasons.
• Power is still subject to processor reorderings which are captured by the concepts herein; the memory subsystem is governed by separate mechanisms;
• Power is the only commercially available hardware that exhibits such behaviours. Arm once allowed them, but no manufacturer ever implemented them on a real chip, and Arm has since removed these behaviours from its model (Pulte et al., 2017). A large reason for omitting the possibility of such weak behaviours was the complexity of the induced axiomatic models (Alglave et al., 2014), which were difficult to specify and reason about. Arm is generally considered one of the faster architectures, and does not appear to suffer a significant penalty for lacking a similarly weak cache coherence system. (Intel’s Itanium architecture, and mooted memory model (e.g., (Yang et al., 2003)), was also intended to allow these weak behaviours; however, it ceased production in 2020.)
• In practice, Power’s weak behaviours are so difficult to reason about that developers make heavy use of fences to eliminate the effects; as a result, whatever performance gains there may have been are eliminated.
• The behaviours that arise are entirely due to that micro-architectural feature and not due to compiler transformations or processor reorderings. Reasoning about C memory model features that affect local code (for, say, x86 architectures) carries over identically to reasoning about local code on Power.
• Although, theoretically, under the C memory model a compiler could instrument code transformations that mimic the behaviour of Power’s cache system on architectures that do not exhibit it natively, this seems highly unlikely; therefore, only developers targeting Power will ever experience the resulting complexity (which arguably can and must be removed by the insertion of fences).
Outside of such considerations, the C model as we have presented it is relatively straightforward; the behaviour introduced by Power’s cache system is completely separate to pipeline reordering and compiler transformations. It seems unfortunate to complicate the entirety of the C memory model to accommodate one architecture, when its effects can be captured in a separate memory subsystem specification (which can be ignored by developers targetting different architectures). Hence we recommend keeping the extra behaviours induced by this separate mechanism, peculiar to a single currently-in-production architecture, separately specified within the model; we point to (Sarkar et al., 2011; Pichon-Pharabod and Sewell, 2016; Colvin and Smith, 2018) as examples of how this can be done, which fulfil similar roles to time-stamped event structures in (Kang et al., 2017a) and the shared action graph of (Wright et al., 2021).
11.2. Incorporating compiler transformations
Any possible compiler transformation can be straightforwardly incorporated into our framework as an operational rule. For instance, if is some generic code pattern that can be transformed into then this can become a rule , i.e., a silent transformation that can occur at any time (an example of such a transformation might be to refactor a loop). The downside of including such transformations within the semantics is, of course, that depending on the structural similarity of and , one cannot expect to reason about using syntax-directed proof rules (one has broken the principle of atomic-syntax-structured code, Defn. (7.1)). Of course, trace-based approaches can still be applied.
As an example, consider a transformation discussed in Sect. 10.1, which simplifies and removes a conditional. This can be expressed as an operational rule that applies in some circumstances.
Rule 11.1 (Eliminate conditional).
As discussed in Sect. 10.1, allowing this transformation has significant effects on behaviours in a wider context. Consider also transformations to eliminate redundant loads or stores, the simplest instrumentation of which are given by the following rules.
Rule 11.2 (Load coalescing).
Rule 11.3 (Write coalescing).
These transformations eliminate an interaction with main memory. Rule 11.2 reduces the number of behaviours of the program since will always take the same final value as , whereas in the original code it is possible for them to receive different final values. The transformation encoded in Rule 11.3 reduces the overall behaviours of the system as a parallel process can never receive the value of (of course, these transformations could be made dependent on memory ordering constraints using, for instance, the relation). We believe it is outside of the scope of the definition of the C memory model, and certainly outside the scope of this work, to consider every possible transformation of every possible compiler; however, we have provided a framework in which the consequences of a particular transformation can be relatively straightforwardly assessed, separately to the specification and principles of the memory model itself. In particular this may feed into the development of verified compilers, and the development of a bespoke set of transformations that are valid within concurrent systems.
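Sketches of the two coalescing transformations in C-like form (x shared, r1 and r2 local):

```c
int x;      /* shared */

void load_coalescing(void) {        /* Rule 11.2 */
    int r1 = x;
    int r2 = x;           /* ~> transformed to:  int r2 = r1;  */
    (void)r1; (void)r2;   /* r2 can then never differ from r1, removing the
                             possibility of observing an intervening write */
}

void write_coalescing(void) {       /* Rule 11.3 */
    x = 1;      /* ~> eliminated: a parallel process can then never */
    x = 2;      /*    observe the value 1                           */
}
```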
12. Related work
The development of the C (and C++) memory model is ongoing and governed by an ISO committee (but heavily influenced by compiler writers (Memarian et al., 2016)), covering the vast range of features of the language itself and including a semi-formal description of the memory model. Boehm and Adve (Boehm and Adve, 2008) were highly influential initially, building on earlier work on memory model specifications (e.g., (Lamport, 1979; Dubois et al., 1986; Shasha and Snir, 1988; Gharachorloo et al., 1990; Jones et al., 1998; Arvind and Maessen, 2006; Fox and Harman, 2000)). Since then many formal approaches have been taken to further understand the implications of the model and tighten its specification, especially the works of Batty et al. (Batty et al., 2011, 2013; Jeffrey et al., 2022) and Vafeiadis et al. (Vafeiadis et al., 2015; Kang et al., 2017b; Lahav et al., 2016, 2017). The model abstracts away from the various hardware memory models that C is designed to run on (e.g., Arm, x86, Power, RISC-V, SPARC), leaving it to the compiler writer to translate the higher-level program code into assembler-level primitives that will provide the behaviour specified by the C model. This behaviour is described with respect to a cross-thread “happens-before” order, which is influenced by so-called “release sequences” of events within the system as a whole. As a formalism this is difficult to reason about, and in particular it removes the ability to think thread-locally – whether or not enough fences or constraints have been placed in a thread requires an analysis of all other events in the system, even though, with the exception of the Power architecture (see Sect. 11.1), any weak behaviours are due to local compiler transformations or instruction-level parallelisation at the assembler level. Our observation from a myriad of discussions on programming discussion boards online is that programmers tend to think in terms of local reorderings, and appealing to cross-thread concepts is not a convenient abstraction. The framework we present here is based on a close abstraction of instruction-level parallelisation that occurs in real pipelined processors, or instruction shuffling that a compiler may perform. For simple programs, where enough synchronisations have been inserted to maintain order on the key instructions, reasoning can be shown to reduce to a well-known sequential form; or specific reorderings that are problematic can be elucidated. The underlying semantic model is the standard interleaving of relations on states (mappings from variables to values). We argue this is simpler and more intuitive – as far as instruction reorderings can be considered so – than the current description in the C standard. Part of the complication of the standard arises from handling the complexity of the Power cache coherence system, which cannot be encoded in a thread-local manner; however those complications can be treated separately (Sarkar et al., 2011; Pichon-Pharabod and Sewell, 2016; Colvin and Smith, 2018), and in any case, the Power model also involves instruction-level parallelism governed by the same principles as the other major architectures (Arm, x86, RISC-V, SPARC).
In comparison with other formalisms in the literature (Chakraborty and Vafeiadis, 2019; Ou and Demsky, 2018; Nienhuis et al., 2016), many use axiomatic approaches (as exemplified by (Alglave et al., 2014)), which are necessarily cross-thread specifications based on the happens-before relationship, many use a complex operational semantics, and many combine the two. The Promising semantics (Kang et al., 2017a; Lee et al., 2020) is operational in flavour but has a complex semantics (Chakraborty and Vafeiadis, 2019) involving time-stamped operations and several abstract global data structures for managing events. In these frameworks reasoning about even simple cases is problematic, whereas in our framework the effects of reordering can be analysed by reduction rules at the program level (Sect. 4), and our underlying model is that of the typical relations on states and so admits standard techniques (Sect. 5). A recent advance in reasoning for the C memory model is that of Wright et al. (Wright et al., 2021), but that framework appeals to a shared event graph structure, and does not consider memory fences or sc constraints. Very few of the formalisms surveyed have a simple way of addressing consume (con) constraints (Sect. 8.2); we also argue that our formal approach to understanding pathological behaviours such as out-of-thin-air (Sect. 6.3) and read-from-untaken-branch (Sect. 10.1) provides a clearer framework for understanding and addressing their consequences.
An overarching philosophy for our approach has been that of separation of concerns, meaning that a programmer can consider which mechanisms for enforcing order should suffice for their concurrent code; of course, if their ordering mechanisms are embedded in complex expressions that may be optimised (Sect. 8) or incrementally evaluated (Sect. 7) then the picture becomes more complex, but this can be analysed separately, and without regard to how such features interact with cross-thread, complex, abstract data structures controlling events. The reordering framework we present here is based on earlier work (Colvin, 2021a, b; Colvin and Smith, 2018), which provides a model checker (written in Maude (Clavel et al., 2002)) and machine-checked refinement rules (in Isabelle/HOL (Paulson, 1994; Nipkow et al., 2002)) for the language in Sect. 3.2 (i.e., ignoring the possibility of incremental evaluation and optimisations) with forwarding (Sect. 9). We straightforwardly encoded the definition of the C memory model (Model 2.6) in the model checker and used it on the examples in this paper as well as those provided by the Cerberus project (et. al, 2022). The framework has provided the basis for other analyses involving security (Colvin and Winter, 2020; Smith et al., 2019; Winter et al., 2021) and automated techniques (Coughlin et al., 2021; Singh et al., 2021).
13. Conclusions
We have given a definition of the C memory model which keeps the fundamental concepts involved (fences, ordering constraints, and data dependencies) separated from other aspects of the language, such as expression evaluation and optimisations, which are present regardless of whether or not the memory model is considered. Provided programmers keep to a reasonable discipline of programming that aids analysis of concurrent programs, i.e., program statements are generally indivisible (at most one shared variable per expression), our framework facilitates structured reasoning about the code. This involves elucidating the effect of orderings on the code as written, and then applying existing techniques for establishing the desired properties. We argue that our framework more closely expresses how programmers of concurrent code think about memory model effects than the abstraction given in the current standard (cross-thread happens-before relationships and release sequences). This is largely because the effects that a C programmer needs to consider are compiler reorderings of instructions for efficiency reasons, or architecture-level reorderings in instruction pipelines. The C language is rich in features, and any claim to a full semantics of arbitrary C code with concurrency requires a full semantics of C in general; as far as we are aware this has not yet been fully realised, but we intend that our approach can be relatively straightforwardly incorporated into such a semantics by virtue of its separation of concerns – the fundamental properties of the memory model are universal and consistent even in the presence of complicating factors.
We note that the difficulties that arise in the attempt to formalise the C memory model stem from the tension between, on one hand, well-established compiler transformations and the need to seamlessly support a multitude of hardware-level memory models, and, on the other, the well-known intricacies of programming correct shared-variable algorithms (Lau et al., 2019). This will be an ongoing balancing act that involves many competing factors, especially and including efficiency, and, increasingly, safety and security (Liu et al., 2021); if we were to take a position, it would be that sections of code – hopefully, relatively small and localised – can be protected from arbitrary transformations from compiler theory and practice. For instance, a C scoping construct concurrent { } which is compiled with minimal optimisations or reorderings, regardless of command-line flags such as -O. The C standard may then be able to provide guarantees about executions for such delimited code, while also allowing the programmer flexibility to make use of optimisations where they have determined they are applicable.
References
- Alglave et al. (2014) Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
- Apt and Olderog (2019) Krzysztof R Apt and Ernst-Rüdiger Olderog. 2019. Fifty years of Hoare’s logic. Formal Aspects of Computing 31, 6 (2019), 751–807.
- Arvind and Maessen (2006) Arvind and Jan-Willem Maessen. 2006. Memory Model = Instruction Reordering + Store Atomicity. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA ’06). IEEE Computer Society, USA, 29–40. https://doi.org/10.1109/ISCA.2006.26
- Batty et al. (2013) Mark Batty, Mike Dodds, and Alexey Gotsman. 2013. Library Abstraction for C/C++ Concurrency. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Rome, Italy) (POPL ’13). Association for Computing Machinery, New York, NY, USA, 235–248. https://doi.org/10.1145/2429069.2429099
- Batty et al. (2011) Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL ’11). Association for Computing Machinery, New York, NY, USA, 55–66. https://doi.org/10.1145/1926385.1926394
- Boehm and Adve (2008) Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ Concurrency Memory Model (PLDI ’08). ACM, 68–78. https://doi.org/10.1145/1375581.1375591
- Brookes (2007) Stephen Brookes. 2007. A semantics for concurrent separation logic. Theoretical Computer Science 375, 1-3 (2007), 227 – 270. https://doi.org/10.1016/j.tcs.2006.12.034
- Chakraborty and Vafeiadis (2019) Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding Thin-Air Reads with Event Structures. Proc. ACM Program. Lang. 3, POPL, Article 70 (Jan. 2019), 28 pages. https://doi.org/10.1145/3290383
- Clavel et al. (2002) Manuel Clavel, Francisco Duran, Steven Eker, Patrick Lincoln, Narciso Marti-Oliet, José Meseguer, and José F. Quesada. 2002. Maude: specification and programming in rewriting logic. Theoretical Computer Science 285, 2 (2002), 187 – 243. https://doi.org/10.1016/S0304-3975(01)00359-0
- Coleman and Jones (2007) Joey W. Coleman and Cliff B. Jones. 2007. A Structural Proof of the Soundness of Rely/guarantee Rules. J. Log. Comput. 17, 4 (2007), 807–841.
- Colvin (2021a) Robert J. Colvin. 2021a. Parallelized Sequential Composition and Hardware Weak Memory Models. In Software Engineering and Formal Methods, Radu Calinescu and Corina S. Păsăreanu (Eds.). Springer, Cham, 201–221.
- Colvin (2021b) Robert J. Colvin. 2021b. Parallelized sequential composition, pipelines, and hardware weak memory models. CoRR abs/2105.02444 (2021). arXiv:2105.02444 https://arxiv.org/abs/2105.02444
- Colvin et al. (2017) Robert J. Colvin, Ian J. Hayes, and Larissa A. Meinicke. 2017. Designing a semantic model for a wide-spectrum language with concurrency. Formal Aspects of Computing 29, 5 (01 Sep 2017), 853–875. https://doi.org/10.1007/s00165-017-0416-4
- Colvin and Smith (2018) Robert J. Colvin and Graeme Smith. 2018. A Wide-Spectrum Language for Verification of Programs on Weak Memory Models. In Formal Methods, Klaus Havelund, Jan Peleska, Bill Roscoe, and Erik de Vink (Eds.). Springer, Cham, 240–257. https://doi.org/10.1007/978-3-319-95582-7_14
- Colvin and Winter (2020) Robert J. Colvin and Kirsten Winter. 2020. An Abstract Semantics of Speculative Execution for Reasoning About Security Vulnerabilities. In Formal Methods. FM 2019 International Workshops, Emil Sekerinski et al. (Eds.). Springer, 323–341.
- Coughlin et al. (2021) Nicholas Coughlin, Kirsten Winter, and Graeme Smith. 2021. Rely/Guarantee Reasoning for Multicopy Atomic Weak Memory Models. In Formal Methods, M. Huisman, C. Păsăreanu, and N. Zhan (Eds.). Springer, 292–310.
- Derrick and Smith (2017) John Derrick and Graeme Smith. 2017. An Observational Approach to Defining Linearizability on Weak Memory Models. In Formal Techniques for Distributed Objects, Components, and Systems, Ahmed Bouajjani and Alexandra Silva (Eds.). Springer International Publishing, Cham, 108–123.
- Derrick et al. (2014) John Derrick, Graeme Smith, and Brijesh Dongol. 2014. Verifying Linearizability on TSO Architectures. In Integrated Formal Methods: 11th International Conference, IFM 2014, Bertinoro, Italy, September 9-11, 2014, Proceedings, Elvira Albert and Emil Sekerinski (Eds.). Springer International Publishing, Cham, 341–356. https://doi.org/10.1007/978-3-319-10181-1_21
- Dijkstra and Scholten (1990) Edsger W. Dijkstra and Carel S. Scholten. 1990. Predicate Calculus and Program Semantics. Springer-Verlag, Berlin, Heidelberg.
- Dubois et al. (1986) M. Dubois, C. Scheurich, and F. Briggs. 1986. Memory Access Buffering in Multiprocessors. In Proceedings of the 13th Annual International Symposium on Computer Architecture (Tokyo, Japan) (ISCA ’86). IEEE Computer Society Press, 434–442.
- et. al (2022) Peter Sewell et al. 2022. The Cerberus project. https://www.cl.cam.ac.uk/~pes20/cerberus/ Accessed March 2022.
- Filipović et al. (2010) Ivana Filipović, Peter O’Hearn, Noam Rinetzky, and Hongseok Yang. 2010. Abstraction for concurrent objects. Theoretical Computer Science 411, 51 (2010), 4379 – 4398. https://doi.org/10.1016/j.tcs.2010.09.021 European Symposium on Programming 2009.
- Fischer and Ladner (1979) Michael J. Fischer and Richard E. Ladner. 1979. Propositional dynamic logic of regular programs. J. Comput. System Sci. 18, 2 (1979), 194 – 211. https://doi.org/10.1016/0022-0000(79)90046-1
- Flur et al. (2016) Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 Architecture, Operationally: Concurrency and ISA (POPL 2016). ACM, New York, NY, USA, 608–621. https://doi.org/10.1145/2837614.2837615
- Fox and Harman (2000) A. C. J. Fox and N. A. Harman. 2000. Algebraic Models of Correctness for Microprocessors. Formal Aspects of Computing 12, 4 (2000), 298–312. https://doi.org/10.1007/PL00003936
- Gharachorloo et al. (1990) Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy. 1990. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors (ISCA 1990). ACM, 15–26. https://doi.org/10.1145/325164.325102
- Hayes et al. (2012) Ian Hayes, Cliff Jones, and Robert Colvin. 2012. Refining rely-guarantee thinking. Technical Report CS-TR-1334. Newcastle University. http://www.cs.ncl.ac.uk/research/pubs/trs/papers/1334.pdf
- Hayes et al. (2013) Ian J. Hayes, Alan Burns, Brijesh Dongol, and Cliff B. Jones. 2013. Comparing Degrees of Non-Determinism in Expression Evaluation. Comput. J. 56, 6 (Feb. 2013), 741–755. https://doi.org/10.1093/comjnl/bxt005
- Hayes et al. (2014) Ian J. Hayes, Cliff B. Jones, and Robert J. Colvin. 2014. Laws and semantics for rely-guarantee refinement. Technical Report CS-TR-1425. Newcastle University.
- Hayes et al. (2021) Ian J. Hayes, Larissa A. Meinicke, and Patrick A. Meiring. 2021. Deriving Laws for Developing Concurrent Programs in a Rely-Guarantee Style. CoRR abs/2103.15292 (2021). arXiv:2103.15292 https://arxiv.org/abs/2103.15292
- Hayes et al. (2019) Ian J. Hayes, Larissa A. Meinicke, Kirsten Winter, and Robert J. Colvin. 2019. A synchronous program algebra: a basis for reasoning about shared-memory and event-based concurrency. Formal Aspects of Computing 31, 2 (2019), 133–163. https://doi.org/10.1007/s00165-018-0464-4
- Herlihy and Shavit (2011) Maurice Herlihy and Nir Shavit. 2011. The Art of Multiprocessor Programming. Morgan Kaufmann.
- Herlihy and Wing (1990) Maurice P. Herlihy and Jeannette M. Wing. 1990. Linearizability: a correctness condition for concurrent objects. TOPLAS 12, 3 (1990), 463 – 492.
- Hoare (1972) C. A. R. Hoare. 1972. Towards a Theory of Parallel Programming. In Operating System Techniques. Academic Press, 61–71. Proceedings of Seminar at Queen’s University, Belfast, Northern Ireland, August-September 1971.
- Hoare (1978) C. A. R. Hoare. 1978. Some Properties of Predicate Transformers. J. ACM 25, 3 (July 1978), 461–480. https://doi.org/10.1145/322077.322088
- Hoare (1985) C. A. R. Hoare. 1985. Communicating Sequential Processes. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
- Hoare (2002) C. A. R. Hoare. 2002. Towards a Theory of Parallel Programming. In The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls, Per Brinch Hansen (Ed.). Springer New York, New York, NY, 231–244. https://doi.org/10.1007/978-1-4757-3472-0_6
- Jeffrey et al. (2022) Alan Jeffrey, James Riely, Mark Batty, Simon Cooksey, Ilya Kaysin, and Anton Podkopaev. 2022. The Leaky Semicolon: Compositional Semantic Dependencies for Relaxed-Memory Concurrency. Proc. ACM Program. Lang. 6, POPL, Article 54 (Jan. 2022), 30 pages. https://doi.org/10.1145/3498716
- Jones (1983a) Cliff B. Jones. 1983a. Specification and Design of (Parallel) Programs. In IFIP Congress. 321–332.
- Jones (1983b) Cliff B. Jones. 1983b. Tentative steps toward a development method for interfering programs. ACM Trans. Program. Lang. Syst. 5 (October 1983), 596–619. Issue 4. https://doi.org/10.1145/69575.69577
- Jones et al. (1998) Robert B. Jones, Jens U. Skakkebæk, and David L. Dill. 1998. Reducing Manual Abstraction in Formal Verification of Out-of-Order Execution. In Formal Methods in Computer-Aided Design, Ganesh Gopalakrishnan and Phillip Windley (Eds.). Springer, 2–17.
- Kang et al. (2017a) Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017a. A Promising Semantics for Relaxed-memory Concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (Paris, France) (POPL 2017). ACM, New York, NY, USA, 175–189. https://doi.org/10.1145/3009837.3009850
- Kang et al. (2017b) Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017b. A Promising Semantics for Relaxed-memory Concurrency (POPL 2017). ACM, 175–189. https://doi.org/10.1145/3009837.3009850
- Kozen (2000) Dexter Kozen. 2000. On Hoare Logic and Kleene Algebra with Tests. ACM Trans. Comput. Logic 1, 1 (July 2000), 60–76. https://doi.org/10.1145/343369.343378
- Lahav et al. (2016) Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. 2016. Taming Release-Acquire Consistency (POPL 2016). Association for Computing Machinery, 649–662. https://doi.org/10.1145/2837614.2837643
- Lahav et al. (2017) Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing Sequential Consistency in C/C++11. SIGPLAN Not. 52, 6 (June 2017), 618–632. https://doi.org/10.1145/3140587.3062352
- Lamport (1979) Leslie Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. IEEE Trans. Comput. C-28, 9 (1979), 690–691.
- Lau et al. (2019) Stella Lau, Victor B. F. Gomes, Kayvan Memarian, Jean Pichon-Pharabod, and Peter Sewell. 2019. Cerberus-BMC: A Principled Reference Semantics and Exploration Tool for Concurrent and Sequential C. In Computer Aided Verification, I. Dillig and S. Tasiran (Eds.). Springer, 387–397.
- Lee et al. (2020) Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: Global Optimizations in Relaxed Memory Concurrency (PLDI 2020). ACM, 362–376. https://doi.org/10.1145/3385412.3386010
- Liu et al. (2021) Lun Liu, Todd Millstein, and Madanlal Musuvathi. 2021. Safe-by-Default Concurrency for Modern Programming Languages. ACM Trans. Program. Lang. Syst. 43, 3, Article 10 (Sept. 2021), 50 pages. https://doi.org/10.1145/3462206
- Mador-Haim et al. (2012) Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek Williams. 2012. An Axiomatic Memory Model for POWER Multiprocessors (CAV 2012). Springer, 495–512. https://doi.org/10.1007/978-3-642-31424-7_36
- Maranget et al. (2012) Luc Maranget, Susmit Sarkar, and Peter Sewell. 2012. A Tutorial Introduction to the ARM and POWER Relaxed Memory Models. https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
- Memarian et al. (2016) Kayvan Memarian, Justus Matthiesen, James Lingard, Kyndylan Nienhuis, David Chisnall, Robert N. M. Watson, and Peter Sewell. 2016. Into the Depths of C: Elaborating the de Facto Standards. SIGPLAN Not. 51, 6 (June 2016), 1–15. https://doi.org/10.1145/2980983.2908081
- Milner (1982) Robin Milner. 1982. A Calculus of Communicating Systems. Springer-Verlag.
- Morgan (1990) Carroll Morgan. 1990. Of wp and CSP. In Beauty Is Our Business: A Birthday Salute to Edsger W. Dijkstra, W. H. J. Feijen, A. J. M. van Gasteren, D. Gries, and J. Misra (Eds.). Springer New York, 319–326. https://doi.org/10.1007/978-1-4612-4476-9_37
- Morgan (1994) Carroll Morgan. 1994. Programming from Specifications (second ed.). Prentice Hall.
- Nienhuis et al. (2016) Kyndylan Nienhuis, Kayvan Memarian, and Peter Sewell. 2016. An Operational Semantics for C/C++11 Concurrency. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (Amsterdam, Netherlands) (OOPSLA 2016). Association for Computing Machinery, New York, NY, USA, 111–128. https://doi.org/10.1145/2983990.2983997
- Nieto (2003) Leonor Prensa Nieto. 2003. The Rely-Guarantee Method in Isabelle/HOL. In Programming Languages and Systems, Pierpaolo Degano (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 348–362.
- Nipkow and Nieto (1999) Tobias Nipkow and Leonor Prensa Nieto. 1999. Owicki/Gries in Isabelle/HOL. In Fundamental Approaches to Software Engineering, Jean-Pierre Finance (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 188–203.
- Nipkow et al. (2002) Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. 2002. Isabelle/HOL — A Proof Assistant for Higher-Order Logic. LNCS, Vol. 2283. Springer.
- O’Hearn (2019) Peter W. O’Hearn. 2019. Incorrectness Logic. Proc. ACM Program. Lang. 4, POPL, Article 10 (Dec. 2019), 32 pages. https://doi.org/10.1145/3371078
- Ou and Demsky (2018) Peizhao Ou and Brian Demsky. 2018. Towards Understanding the Costs of Avoiding Out-of-Thin-Air Results. Proc. ACM Program. Lang. 2, OOPSLA, Article 136 (Oct. 2018), 29 pages. https://doi.org/10.1145/3276506
- Owicki and Gries (1976) Susan Owicki and David Gries. 1976. An Axiomatic Proof Technique for Parallel Programs I. Acta Inf. 6, 4 (Dec. 1976), 319–340. https://doi.org/10.1007/BF00268134
- Paulson (1994) Lawrence C. Paulson. 1994. Isabelle: a generic theorem prover. Springer Verlag.
- Pichon-Pharabod and Sewell (2016) Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-Air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL ’16). Association for Computing Machinery, New York, NY, USA, 622–633. https://doi.org/10.1145/2837614.2837616
- Plotkin (2004) Gordon D. Plotkin. 2004. A structural approach to operational semantics. J. Log. Algebr. Program. 60-61 (2004), 17–139.
- Pulte et al. (2017) Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2017. Simplifying ARM Concurrency: Multicopy-Atomic Axiomatic and Operational Models for ARMv8. Proc. ACM Program. Lang. 2, POPL, Article 19 (Dec. 2017), 29 pages. https://doi.org/10.1145/3158107
- Sarkar et al. (2011) Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 2011. Understanding POWER Multiprocessors. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (San Jose, California, USA) (PLDI ’11). Association for Computing Machinery, New York, NY, USA, 175–186. https://doi.org/10.1145/1993498.1993520
- Schellhorn et al. (2014) Gerhard Schellhorn, John Derrick, and Heike Wehrheim. 2014. A Sound and Complete Proof Technique for Linearizability of Concurrent Data Structures. ACM Trans. Comput. Logic 15, 4, Article 31 (Sept. 2014), 37 pages. https://doi.org/10.1145/2629496
- Shasha and Snir (1988) Dennis Shasha and Marc Snir. 1988. Efficient and Correct Execution of Parallel Programs That Share Memory. ACM Trans. Program. Lang. Syst. 10, 2 (April 1988), 282–312. https://doi.org/10.1145/42190.42277
- Singh et al. (2021) Sanjana Singh, Divyanjali Sharma, and Subodh Sharma. 2021. Dynamic Verification of C11 Concurrency over Multi Copy Atomics. In 2021 International Symposium on Theoretical Aspects of Software Engineering (TASE). 39–46. https://doi.org/10.1109/TASE52547.2021.00023
- Smith (2016) Graeme Smith. 2016. Model Checking Simulation Rules for Linearizability. In Software Engineering and Formal Methods, Rocco De Nicola and Eva Kühn (Eds.). Springer International Publishing, Cham, 188–203.
- Smith et al. (2019) Graeme Smith, Nicholas Coughlin, and Toby Murray. 2019. Value-Dependent Information-Flow Security on Weak Memory Models. In Formal Methods, M. H. ter Beek, A. McIver, and J. N. Oliveira (Eds.). Springer, 539–555.
- Smith and Groves (2020) Graeme Smith and Lindsay Groves. 2020. Weakening Correctness and Linearizability for Concurrent Objects on Multicore Processors. In Formal Methods. FM 2019 International Workshops, Emil Sekerinski, Nelma Moreira, José N. Oliveira, Daniel Ratiu, Riccardo Guidotti, Marie Farrell, Matt Luckcuck, Diego Marmsoler, José Campos, Troy Astarte, Laure Gonnord, Antonio Cerone, Luis Couto, Brijesh Dongol, Martin Kutrib, Pedro Monteiro, and David Delmas (Eds.). Springer International Publishing, Cham, 342–357.
- Thornton (1964) James E. Thornton. 1964. Parallel Operation in the Control Data 6600. In Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part II: Very High Speed Computer Systems (San Francisco, California) (AFIPS ’64). ACM, 33–40. https://doi.org/10.1145/1464039.1464045
- Tomasulo (1967) R. M. Tomasulo. 1967. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development 11, 1 (1967), 25–33.
- Vafeiadis et al. (2015) Viktor Vafeiadis, Thibaut Balabonski, Soham Chakraborty, Robin Morisset, and Francesco Zappa Nardelli. 2015. Common Compiler Optimisations Are Invalid in the C11 Memory Model and What We Can Do About It. In Proceedings of the 42Nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Mumbai, India) (POPL ’15). ACM, New York, NY, USA, 209–220. https://doi.org/10.1145/2676726.2676995
- Winter et al. (2021) Kirsten Winter, Nicholas Coughlin, and Graeme Smith. 2021. Backwards-directed information flow analysis for concurrent programs. In 2021 IEEE 34th Computer Security Foundations Symposium (CSF). 1–16. https://doi.org/10.1109/CSF51468.2021.00017
- Winter et al. (2013) Kirsten Winter, Chenyi Zhang, Ian J. Hayes, Nathan Keynes, Cristina Cifuentes, and Lian Li. 2013. Path-Sensitive Data Flow Analysis Simplified. In Formal Methods and Software Engineering, Lindsay Groves and Jing Sun (Eds.). Springer, 415–430.
- Wright et al. (2021) Daniel Wright, Mark Batty, and Brijesh Dongol. 2021. Owicki-Gries Reasoning for C11 Programs with Relaxed Dependencies. In Formal Methods, Marieke Huisman, Corina Păsăreanu, and Naijun Zhan (Eds.). Springer, Cham, 237–254.
- Yang et al. (2003) Yue Yang, Ganesh Gopalakrishnan, Gary Lindstrom, and Konrad Slind. 2003. Analyzing the Intel Itanium Memory Ordering Rules Using Logic Programming and SAT. In Correct Hardware Design and Verification Methods, Daniel Geist and Enrico Tronci (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 81–95.
Appendix A Lifting from actions to traces and commands
Throughout the paper we have used several functions and definitions based on actions, which for the most part are straightforwardly lifted to commands and traces. For completeness we give these below.
A.1. Extracting variables
Any function $f$ that simply collects syntax elements into a set (such as the variables written or read) and is defined over instructions can be lifted to actions, traces and commands in the following generic pattern, where $\alpha$ is an action (a list of instructions $\langle i_1, \ldots, i_n \rangle$), $t$ a trace, and $c, c_1, c_2$ commands.

(A.1) $f(\langle i_1, \ldots, i_n \rangle) = f(i_1) \cup \ldots \cup f(i_n)$
(A.2) $f(\langle\,\rangle) = \emptyset$
(A.3) $f(\alpha \frown t) = f(\alpha) \cup f(t)$
(A.4) $f(\mathbf{nil}) = \emptyset$
(A.5) $f(c_1 \mathbin{;_m} c_2) = f(c_1) \cup f(c_2)$
(A.6) $f(c_1 \sqcap c_2) = f(c_1) \cup f(c_2)$
(A.7) $f(c^{\omega}) = f(c)$
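For example, taking $\mathsf{rv}$ to be the function collecting the variables read by an instruction (the name $\mathsf{rv}$ and the instance below are illustrative), the lifting gives

$\mathsf{rv}(\langle x := y \rangle \mathbin{;_m} \langle r := z \rangle) = \mathsf{rv}(x := y) \cup \mathsf{rv}(r := z) = \{y\} \cup \{z\} = \{y, z\}$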
A.2. Lifting forwarding/reordering triples
Forwarding (see Sect. 9) an action $\alpha$ to a later action $\beta$ is defined below; assume $\alpha$ is an assignment $x := e$. Forwarding then essentially replaces references to $x$ in $\beta$ (with any constraints) by $e$.
The reordering relation can be lifted from actions to commands straightforwardly as below.
Defns. (A.8)–(A.12) extend the relation through each form of command: an action reorders over $\mathbf{nil}$ unchanged; over a single action according to the action-level reordering relation (with forwarding applied); over a sequential composition if it reorders over each component in turn (accumulating any forwarding); over a choice if it reorders over both branches; and over an iteration if it reorders over the body.
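As a small worked instance (writing $\beta_{\langle \alpha \rangle}$ for the result of forwarding $\alpha$ to $\beta$; the notation and the instance are ours for illustration): forwarding the store $x := 1$ to the load $r := x$ substitutes $1$ for $x$,

$(r := x)_{\langle x := 1 \rangle} = (r := 1)$

and the resulting action no longer reads $x$, and so is free to reorder before the store.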
Appendix B Mixing incremental and non-incremental evaluation
In Sect. 7 we gave a semantics for evaluating instructions incrementally, in contrast to Sect. 3.2, where instructions are treated as indivisible. In this section, for completeness, we show how to mix both possibilities within the syntax of IMP+pseq.
We define an instruction to be one of the three basic types: assignment, guard, or fence; and an action to be a list of instructions. Actions are the basic type of a step in the operational semantics. We define a “specification instruction” as a basic instruction paired with a tag designating it as divisible or indivisible, i.e., whether it is to be executed incrementally or as a single indivisible step. Finally, within the syntax of IMP+pseq, instead of a list of actions as the base type (defn. (3.1)), we allow a statement, that is, a list of specification instructions.
This gives a significant amount of flexibility in describing the execution mode of composite actions: for instance, a statement may contain a guard that is calculated indivisibly in the current state, followed by an assignment whose right-hand side is incrementally evaluated before the result is assigned. Of course, this level of flexibility is not necessary in the majority of cases, and syntactic sugar can cover the commonly occurring ones, in particular, letting a singleton list of specification instructions be written as a single specification instruction, and adopting conventions for distinguishing divisible and indivisible versions of instructions.
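For example, tagging instructions with a superscript (illustrative notation only), the statement

$\langle\, (x > 0)^{\mathsf{indiv}},\ (y := x + z)^{\mathsf{div}} \,\rangle$

evaluates the guard $x > 0$ as a single indivisible step, and then incrementally evaluates $x + z$ before assigning the result to $y$.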
We lift defn. (7.16) for indivisible instructions to the new types.
Any specification instruction tagged as indivisible is indivisible, while a specification instruction tagged as divisible is divisible if its underlying instruction is, but is otherwise indivisible.
The relevant operational semantics for specification instructions and statements is given by Rules B.3 and B.8, described next.
Rule B.3 states that any instruction tagged divisible can take an incremental execution step according to the evaluation rules in Sect. 7. This is used to build Rule B.8 for statements, where the specification instructions within a statement are executed incrementally from left to right: the first divisible specification instruction in the statement may take a step, which becomes a step of the statement. When all instructions within the statement are indivisible, a final, single, indivisible step is taken; the action performed is formed by simply stripping the tags from the specification instructions, applied elementwise across the statement.
A compare and swap command (defn. (3.9)) can be redefined to incrementally evaluate its arguments before executing an indivisible (“atomic”) test-and-set step.
The redefinition is given as defn. (B.9). In the successful case the first argument is evaluated to a value, then the second, and finally the indivisible test-and-set action is executed (with the tags stripped). This means that the arguments can be incrementally evaluated, but the final test/update remains atomic.
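The same shape arises in C itself (a minimal sketch, with our own illustrative names; atomic_compare_exchange_strong_explicit is from stdatomic.h): the argument expressions are evaluated as ordinary, possibly incremental, computations, and only the compare-and-exchange itself is a single indivisible step.

    #include <stdatomic.h>

    atomic_int head;    /* shared counter (illustrative) */

    _Bool try_advance(int delta) {
        /* Argument evaluation: ordinary (divisible) computation. */
        int expected = atomic_load_explicit(&head, memory_order_relaxed);
        int desired = expected + delta;
        /* Only this test/update is a single indivisible step. */
        return atomic_compare_exchange_strong_explicit(
            &head, &expected, desired,
            memory_order_acq_rel, memory_order_relaxed);
    }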