Consistency vs. Availability in
Distributed Real-Time Systems* ^†^†thanks: ^* This work was supported by the iCyPhy Research Center at UC Berkeley, supported by Denso, Siemens, and Toyota.

Edward A. Lee EECS, UC Berkeley
Berkeley, USA Ravi Akella Denso NA
San Jose, USA Soroush Bateni
EECS, UC Berkeley
Berkeley, USA Shaokai Lin EECS, UC Berkeley
Berkeley, USA Marten Lohstroh EECS, UC Berkeley
Berkeley, USA Christian Menard TU Dresden
Dresden, Germany

Abstract

In distributed applications, Brewer’s CAP theorem tells us that when networks become partitioned (P), one must give up either consistency (C) or availability (A). Consistency is agreement on the values of shared variables; availability is the ability to respond to reads and writes accessing those shared variables. Availability is a real-time property whereas consistency is a logical property. We have extended the CAP theorem to relate quantitative measures of these two properties to quantitative measures of communication and computation latency (L), obtaining a relation called the CAL theorem that is linear in a max-plus algebra. This paper shows how to use the CAL theorem in various ways to help design real-time systems. We develop a methodology for systematically trading off availability and consistency in application-specific ways and to guide the system designer when putting functionality in end devices, in edge computers, or in the cloud. We build on the Lingua Franca coordination language to provide system designers with concrete analysis and design tools to make the required tradeoffs in deployable software.

Index Terms:

real-time systems, cyber-physical systems, embedded systems, networking, verification

I Introduction

Brewer’s well-known CAP Theorem states that in the presence of network partitioning (P), a distributed system must sacrifice at least one of availability (A) or consistency (C) [6, 7]. Consistency is where distributed components agree on the value of shared state, and availability is the ability to respond to user requests using and/or modifying that shared state. Gilbert and Lynch [13] prove two variants of this theorem, one strong result for asynchronous networks [27, chapter 8] and one weaker result for partially synchronous networks. The CAP theorem has helped the research and development of distributed database systems by clarifying a fundamental limit and suggesting application-dependent tradeoffs. For some database applications, availability is more important than consistency, while for others it is the other way around. The purpose of this paper is to apply and adapt the CAP theorem to distributed real-time systems to derive similar benefits.

Real-time systems are cyber-physical systems (CPSs) with timing constraints. In distributed real-time systems, the state information shared between software components is often information about the physical world in which the software is operating. In autonomous vehicles, for example, the state of an intersection is shared among all the vehicles contending for access to that intersection. Even within a single vehicle, where software components may be distributed across an onboard network, many of these software components will share state information about the vehicle and its environment. We therefore generalize the notion of consistency to include physical state rather than just variables in software.

In real-time systems, the time it takes for a software subsystem to respond through an actuator to stimulus from a sensor is a critical property of the system. We therefore generalize the notion of availability to include this time, not just the system’s response time to human users. A software subsystem where sensor-to-actuator response time is large is less available than one for which it is small.

Brewer’s CAP theorem, then, immediately applies in an obvious way. When the network becomes partitioned, one of availability or consistency must be sacrificed. However, network partitioning is just the limiting case of network latency, as pointed out by Brewer [7] and Abadi [1]. Moreover, network latency is not the only latency that can force this compromise. Long execution times can be just as damaging as long network latencies, and so can large clock synchronization discrepancies, as we will show. In our formulation, network latencies, execution times, and clock synchronization errors get lumped together into a single measurable quantity that we call, simply, “latency” (or “apparent latency” to emphasize that the quantity we work with is measurable).

We have recently discovered that consistency, availability, and latency can all be quantified, and that they have a simple algebraic relationship between them. We call this relationship the CAL theorem, replacing “Partitioning” with “Latency.” The relation is a linear system of equations in a max-plus algebra, where the structure of the equations reflects the communication topology of an application.

The only prior attempt we are aware of to quantify the CAP theorem was done by Yu and Vahdat, who quantified availability and consistency and show a tradeoff between them [35]. Their quantifications, however, are in terms of fractions of satisfied accesses (availability) and fractions of out-of-order writes (inconsistency), and they show that finding the availability as a function of consistency is NP-hard. In contrast, the CAL theorem defines these quantities as time intervals and gives a strikingly simpler relationship, one that is linear in a max-plus algebra.

How to trade off consistency, availability, and latency against one another is application specific. Consider for example a four-way intersection, access to which is regulated by a distributed algorithm running in software on autonomous vehicles that contend for the intersection. For this application, and specifically for the state of the intersection, consistency is paramount. All vehicles must agree on the state of the intersection (strong consistency) before any one vehicle can enter the intersection. Hence, for this application, when latencies get large for any reason, we choose to sacrifice availability (vehicles do not enter the intersection) rather than consistency (vehicles crash).

Consider, however, a complementary application. Suppose a vehicle has a computer-vision-based automatic braking system as part of an ADAS (advanced driver assistance system) as well as an ordinary brake pedal. Suppose the vision-based system has significant latency (it may even be computed in the cloud). Should the system delay responses to pushes of the brake pedal until the vision system has reported the state of the world at the time of the brake pedal push? The answer is most certainly “no.” The system should respond immediately to brake pedal pushes, thereby maintaining high availability, even at the cost of consistency.

The CAL theorem also easily accommodates tiered, heterogeneous networks, where end devices may connect to edge computers over wired or wireless links, and edge computers may connect to cloud-based services that enable wide area aggregation and scalability, for example for machine learning. The various networks involved may have widely varying characteristics, yielding enormously different latencies and latency variability. A time-sensitive network (TSN) [23] on a factory floor, for example, may yield reliable latencies on the scale of microseconds between edge computers, whereas wide-area networks (WAN) may yield highly variable latencies that can extend up to tens of seconds [37]. Moreover, any of these networks can fail, yielding unbounded latencies, and systems need to be designed to handle such failures gracefully.

The CAL theorem will allow us to model a heterogeneous network topology interconnecting a wide variety of nodes. In particular, the matrix form of the equations enables compact modeling of heterogeneous networks, where the latencies between pairs of nodes can vary considerably.

We will use the Lingua Franca (LF) coordination language [26] to specify programs that explicitly define availability and consistency requirements for a distributed CPS application. We can then use the CAL theorem to derive the network latency bounds that make meeting the requirements possible. This can be used to guide decisions about which services must be placed in the end devices, which can be placed on an edge computer, and which can be put in the cloud. Moreover, we will show how, once such a system is deployed, violations of the network latency requirements, which will make it impossible to meet the consistency and availability requirements, can be detected. System designers can build in to the application fault handlers that handle such failures.

System designers can use the CAL theorem in at least two complementary ways. They can derive networking requirements from availability and/or consistency requirements, or they can derive availability and/or consistency properties from assumptions about the network behavior.

The CAP theorem itself is rather obvious and very much part of the folklore in distributed computing. By quantifying it and relating it to individual point-to-point latencies, the CAL theorem elevates the phenomenon from folklore to an engineering principle, enabling rigorous design with clearly stated assumptions. Moreover, by quantifying consistency and availability, the CAL theorem makes the concept applicable to real-time systems. In this paper, we show how to carry out such rigorous design using LF, which supports explicit representations of availability and consistency requirements. Moreover, we demonstrate how to detect situations where the networking requirements that are implied by the availability and consistency requirements cannot be met, for example when the network fails or has excessive latency. We describe how LF can provide exception handlers that enable the designer to explicitly choose how to handle such fault conditions, for example, by reducing accuracy [37] or by switching to failsafe modes of operation.

The contributions of this paper are as follows:

•

We show how availability, a real-time property of a system, and consistency, a logical property, relate numerically to clock synchronization and latencies introduced by networks and computation.
•

We derive how the deadlines commonly used to specify real-time requirements in real-time systems are availability requirements and therefore are subject to this relation. Specifically, as latencies increase, it becomes impossible to meet deadlines without sacrificing consistency.
•

We propose a methodology that allows a system designer to explicitly define availability and consistency requirements using the Lingua Franca coordination language.
•

We illustrate how a system designer can choose how to handle runtime violations of these requirements by explicitly choosing whether to further relax consistency or availability and how to provide fault handlers to be invoked when violations are detected.
•

We give practical real-time systems examples that show that the choice of whether to sacrifice availability or consistency when faults occur is application dependent.

The paper is organized as follows. Section II explains the underlying model of time. Section III formally defines the terms and derives the CAL theorem. Section IV introduces the Lingua Franca coordination language, shows how it can explicitly specify availability and consistency requirement, and shows how to use the CAL theorem to analyze a program. Section V gives two practical distributed real-time system examples and shows how one needs to prioritize availability while the other needs to prioritize consistency in the presence of faults. Section VI draws some conclusions.

II Logical and Physical Time

Central to our ability to quantify both consistency and availability is the use of two distinct notions of time, logical and physical. A physical time $T\in\mathbb{T}$ is an imperfect measurement of time taken from some clock somewhere the system. The set $\mathbb{T}$ contains all the possible times that a physical clock can report. We assume that $\mathbb{T}$ is totally ordered and includes two special members: $\infty,-\infty\in\mathbb{T}$ , larger and smaller than any time any clock can report. We will occasionally make a distinction between the set $\mathbb{I}$ of time intervals (differences between two times) and time values $T\in\mathbb{T}$ . It is often convenient to have the set $\mathbb{T}$ represent a common definition of physical time, such as Coordinated Universal Time (UTC) so that times correlate with physical reality.

In LF, which we will use to specify our real-time systems, $\mathbb{T}$ and $\mathbb{I}$ are both the set of 64-bit integers (for all targets built to date). Following the POSIX standard, $T\in\mathbb{T}$ is the number of nanoseconds that have elapsed since midnight, January 1, 1970, Greenwich mean time. The largest and smallest 64-bit integers represent $\infty$ and $-\infty$ , respectively. As a practical matter, these numbers will overflow in systems running near the year 2270.

For logical time, we use an element that we call a tag $g$ of a totally-ordered set $\mathbb{G}$ . Each event in a distributed system is associated with a tag $g\in\mathbb{G}$ . From the perspective of any component of a distributed system, the order in which events occur is defined by the order of their tags. If two distinct events have the same tag, we say that they are logically simultaneous. We assume the tag set $\mathbb{G}$ also has largest and smallest elements. Moreover, we assume a metric that measures the distance between two tags. This metric is what we will use to quantify consistency.

In LF, $\mathbb{G}=\mathbb{T}\times\mathbb{U}$ , where $\mathbb{U}$ is the set of 32-bit unsigned integers representing the microstep of a superdense time system [28, 8, 11]. Following the tagged-signal model [20], we use the term tag rather than timestamp to allow for such a richer model of logical time. For the purposes of this paper, however, the microsteps will not matter, and hence you can think of a tag as a timestamp and ignore the microstep. In particular, the metric we will use to measure the distance between tags ignores the microstep.

We will consistently denote tags with a lower case $g\in\mathbb{G}$ or a lower-case tuple $(t,m)\in\mathbb{G}$ and measurements of physical time $T\in\mathbb{T}$ with upper case.

To combine tags with physical times, we assume a monotonically nondecreasing function $\mathcal{T}\colon\mathbb{G}\to\mathbb{T}$ that gives a physical time interpretation to any tag. For any tag $g$ , we call $\mathcal{T}(g)$ its timestamp. In LF, for any tag $g=(t,m)\in\mathbb{G}$ , $\mathcal{T}(g)=t$ . Hence, to get a timestamp from a tag, you just have to ignore the microstep. The set $\mathbb{G}$ also includes largest and smallest elements such that $\mathcal{T}(\infty_{\mathbb{G}})=\infty_{\mathbb{T}}$ and $\mathcal{T}(-\infty_{\mathbb{G}})=-\infty_{\mathbb{T}}$ , where the subscripts disambiguate the infinities.

An external input, such as a user input or query, will be assigned a tag $g$ such that $\mathcal{T}(g)=T$ , where $T$ is a measurement of physical time taken from the local clock where the input first enters the software system. In LF, this tag is normally given microstep 0, $g=(T,0)$ .

For any tag $g$ , the time $\mathcal{T}(g)$ is a logical time. It may be derived from a physical time, as it is for an externally triggered event, but once the tagged event enters the system, its relationship to physical time becomes incidental. The only requirement is that software components process events in tag order, irrespective of physical time. We will use the tags $g$ to specify consistency requirements. In Lingua Franca, however, to get real-time behavior, it is possible to associate deadline with the processing of an event. A deadline $d\in\mathbb{I}$ is a declaration that the event must be processed before physical time $T$ exceeds the logical time $\mathcal{T}(g)+d$ . We will use these deadlines to specify availability requirements.

III The CAL Theorem

Following Schwartz and Mattern [33], assume we are given a trace of an execution of a distributed system consisting of $N$ sequential processes, where each process is an unbounded sequence of (tagged) events. Although the theory is developed for traces, the CAL theorem can be used for programs, not just traces because a program is formally a family of traces. The $k$ -th event of a process is associated with a tag $g_{k}$ and a physical time $T_{k}$ . The physical time $T_{k}$ is the reading on a local clock at the time where the event starts being processed. The events in a process are required to have nondecreasing tags and increasing physical times. That is, if $g_{k}$ is the tag and $T_{k}$ is the physical time of the $k$ -th event, then $g_{k}\leq g_{k+1}$ and $T_{k}<T_{k+1}$ .

Within each process, a read event with tag $g$ yields the value of a shared variable $x$ . The shared variable $x$ is stored as a local copy, which has previously acquired a value via a write event or an accept event in the same process. An accept event receives an updated value of the variable from the network. A read event with tag $g$ will yield the value assigned by the write or accept event with the largest tag $g^{\prime}$ where $g^{\prime}\leq g$ . If $g^{\prime}=g$ , we require that $T^{\prime}<T$ , where $T^{\prime}$ is the physical time of the write or accept event and $T$ is the physical time of the read event. This requirement ensures that a read event reads a value that was written at an earlier physical time.

A send event is where a process launches into the network an update to a shared variable $x$ . (Here, the “network” is whatever medium is being used for communication between processes.) Like a read event, the send event has a tag greater than or equal to that of the write or accept event that it is reporting and a physical time greater than that of the write or accept event. An accept event that receives the update sent by the send event has a tag greater than or equal to that of the send event. The physical time of the accept event relative to the originating send event is unconstrained, however, because these times likely come from distinct physical clocks.

III-A Consistency

Definition 1

For each write event on process $j$ with tag $g_{j}$ , let $g_{i}$ be the tag of the corresponding accept event on process $i$ or $\infty$ if there is no corresponding accept event. The inconsistency $\bar{C}_{ij}\in\mathbb{I}$ from $j$ to $i$ is defined to be

\bar{C}_{ij}=\max(\mathcal{T}(g_{i})-\mathcal{T}(g_{j})),

(1)

where the maximization is over all write events on process $j$ . If there are no write events on $j$ , then we define $\bar{C}_{ij}=0$ .

It is clear that $\bar{C}_{ij}\geq 0$ . If $\bar{C}=0$ , we have strong consistency. We will see that this strong consistency comes at a price in availability, and that network failures can result in unbounded unavailability. If $\bar{C}$ is finite, we have eventual consistency, and $\bar{C}$ quantifies “eventual.”

Notice that inconsistency measures the difference between two logical times. We will show how Lingua Franca enables manipulation of the tags of events to relax consistency requirements in order to gain availability. It does so without sacrificing determinacy.

III-B Availability

In database systems, unavailability, $\bar{A}$ , is a measure of the time it takes for a system to respond to user requests [15]. A user request is an external event that originates from outside the distributed system. We generalize the external events to include sensor inputs, not just user requests, and the responses to include actuations.

Assume that a user request or sensor input triggers a read event in process $i$ with tag $g_{i}$ such that its timestamp $\mathcal{T}(g_{i})$ is the reading of a local clock when the external event occurs. Let $T_{i}$ be the physical time of the read event, i.e., the physical time at which the read is processed. Hence, $T_{i}\geq\mathcal{T}(g_{i})$ .

Definition 2

For each read event on process $i$ , let $g_{i}$ be its tag and $T_{i}$ be the physical time at which it is processed. The unavailability $\bar{A}_{i}\in\mathbb{I}$ at process $i$ is defined to be

\bar{A}_{i}=\max(T_{i}-\mathcal{T}(g_{i})),

(2)

where the maximization is over all read events on process $i$ that are triggered by user requests. If there are no such read events on process $i$ , then $\bar{A}_{i}=0$ .

Because we are considering only read events that are triggered from outside the software system, $\mathcal{T}(g_{i})\leq T_{i}$ , so $\bar{A}_{i}\geq 0$ . If $\bar{A}_{i}=0$ , then we have maximum availability (minimum unavailability). This situation arises when external triggers cause immediate reactions.

III-C Processing Offsets

We require that each process handle events in tag order. This gives the overall program a formal property known as causal consistency, which is analyzed in depth by Schwartz and Mattern. They define a causality relation, written $e_{1}\rightarrow e_{2}$ , between events $e_{1}$ and $e_{2}$ to mean that $e_{1}$ can causally affect $e_{2}$ . The phrase “causally affect” is rather difficult to pin down (see Lee [21, Chapter 11] for the subtleties around the notion of causation), but, intuitively, $e_{1}\rightarrow e_{2}$ means $e_{2}$ cannot behave as if $e_{1}$ had not occurred. Put another way, if the effect of an event is reflected in the state of a local replica of a variable $x$ , then any cause of the event must also be reflected. Put yet another way, an observer must never observe an effect before its cause.

Formally, the causality relation of Schwartz and Mattern is the smallest transitive relation such that $e_{1}\rightarrow e_{2}$ if $e_{1}$ precedes $e_{2}$ in a process, or $e_{1}$ is the sending of a value in one process and $e_{2}$ is the acceptance of the value in another process. If neither $e_{1}\rightarrow e_{2}$ nor $e_{2}\rightarrow e_{1}$ holds, then we write $e_{1}||e_{2}$ or $e_{2}||e_{1}$ and say that $e_{1}$ and $e_{2}$ are incomparable. The causality relation is identical to the “happened before” relation of Lamport [17], but Schwartz and Mattern prefer the term “causality relation” because even if $e_{1}$ occurs unambiguously earlier than $e_{2}$ in physical time, they may nevertheless be incomparable, $e_{1}||e_{2}$ .

The causality relation is a strict partial order. Schwartz and Mattern use their causality relation to define a “consistent global snapshot” of a distributed computation to be a subset $S$ of all the events $E$ in the execution that is a downset, meaning that if $e^{\prime}\in S$ and $e\rightarrow e^{\prime}$ , then $e\in S$ (this was previously called a “consistent cut” by Mattern [29]).

To maintain causal consistency, the requirement that a process have nondecreasing tags means that, in a trace, a read or write event triggered by an external input may have a physical time $T$ that is significantly larger than its tag’s timestamp $\mathcal{T}(g)$ . While $\mathcal{T}(g)$ is determined by the physical clock at the time the external input appears, the physical time at which the event is actually processed may have to be later to ensure that all events with earlier tags have been processed. This motivates the following definition:

Definition 3

For process $i$ , the processing offset $O_{i}\in\mathbb{I}$ is

O_{i}=\max(T_{i}-\mathcal{T}(g_{i}))

(3)

where $T_{i}$ and $g_{i}$ are the physical time and tag, respectively, of a write event on process $i$ that is triggered by a local external input (and hence assigned a timestamp drawn from the local clock). The maximization is over all such write events in process $i$ . If there are no such write events, then $O_{i}=0$ .

The processing offset closely resembles the unavailability of Definition 2, but the former refers to write events and the latter refers to read events. The processing offset, by definition, is greater than or equal to zero.

III-D Apparent Latency

When a write to a shared variable occurs in process $j$ , some time will elapse before a corresponding accept event on process $i$ triggers a write to its local copy of the shared variable. This motivates the following definition:

Definition 4

Let $g_{j}$ be the tag of a write event in process $j$ that is triggered by an external input at $j$ (so $\mathcal{T}(g_{j})$ is the physical time of that external input). Let $T_{i}$ be the physical time of the corresponding accept event in process $i$ (or $\infty$ if there is no such event). (If $i=j$ , we assume $T_{i}$ is the same as the physical time of the write event.) The apparent latency or just latency $\mathcal{L}_{ij}\in\mathbb{I}$ for communication from $j$ to $i$ is

\mathcal{L}_{ij}=\max(T_{i}-\mathcal{T}(g_{j})),

(4)

where maximization is over all such write events in process $j$ . If there are no such write events, then $\mathcal{L}_{ij}=0$ .

Note that $T_{i}$ and $\mathcal{T}(g_{j})$ are physical times on two different clocks if $i\neq j$ , so this apparent latency is an actual latency only if those clocks are perfectly synchronized. Unless the two processes are actually using the same physical clock, they will never be perfectly synchronized. Hence, the apparent latency may even be negative. Note that despite these numbers coming from different clocks, if tags are sent along with messages, this apparent latency is measurable.

The apparent latency is a sum of four components,

\mathcal{L}_{ij}=O_{j}+X_{ij}+L_{ij}+E_{ij},

(5)

where $X_{ij}$ is execution time overhead at node $j$ for sending a message to node $i$ , $L_{ij}$ is the network latency from $j$ to $i$ , and $E_{ij}$ is the clock synchronization error. The three latter quantities are indistinguishable and always appear summed together, so there is no point in breaking apparent latency down in this way. Moreover, these latter three quantities would have to be measured with some physical clock, and it is not clear what clock to use. The apparent latency requires no problematic measurement since it explicitly refers to local clocks and tags.

The clock synchronization error can be positive or negative, whereas $O_{j}$ , $X_{ij}$ , and $L_{ij}$ are always nonnegative. If $E_{ij}$ is a sufficiently large negative number, the apparent latency will itself also be negative. Because of the use of local clocks, the accept event will appear to have occurred before the user input that triggered it. This possibility is unavoidable with imperfect clocks.

When $i=j$ , $\mathcal{L}_{ij}=O_{j}=O_{i}$ . The apparent latency at any node due to itself is just its processing offset.

III-E The CAL Theorem

The above definitions lead immediately to the following theorem:

Theorem 1

Given a trace, the unavailability at process $i$ is, in the worst case,

\bar{A}_{i}=\max\left(O_{i},\max_{j\in N}(\mathcal{L}_{ij}-\bar{C}_{ij})\right),

(6)

where $O_{i}$ is the processing offset, $\mathcal{L}_{ij}$ is apparent latency (which includes $O_{j}$ ), and $\bar{C}_{ij}$ is the inconsistency.

This can be put in an elegant form using max-plus algebra [3]. Let $N$ be the number of processes, and define an $N\times N$ matrix $\Gamma$ such that its elements are given by

\Gamma_{ij}=\mathcal{L}_{ij}-\bar{C}_{ij}-O_{j}=X_{ij}+L_{ij}+E_{ij}-\bar{C}_{ij}.

(7)

That is, from (5), the $i$ , $j$ -th entry in the matrix is an assumed bound on $X_{ij}+L_{ij}+E_{ij}$ (execution time, network latency, and clock synchronization error), adjusted downwards by the specified tolerance for inconsistency $\bar{C}_{ij}$ .

Let $\bm{A}$ be a column vector with elements equal to the unavailabilities $\bar{A}_{i}$ , and $\bm{O}$ be a column vector with elements equal to the processing offsets $O_{i}$ . Then the CAL theorem (6) can be written as

\bm{A}=\bm{O}\oplus\Gamma\bm{O},

(8)

where the matrix multiplication is in the max-plus algebra. This can be rewritten as

\bm{A}=(\bm{I}\oplus\Gamma)\bm{O},

(9)

where $\bm{I}$ is the identity matrix in max-plus, which has zeros along the diagonal and $-\infty$ everywhere else. Hence, unavailability is a simple linear function of the processing offsets, where the function is given by a matrix that depends on the network latencies, clock synchronization error, execution times, and specified inconsistency in a simple way.

III-F Evaluating Processing Offsets

The processing offsets $O_{i}$ and $O_{j}$ are physical time delays incurred on nodes $i$ and $j$ before they can begin handling events. Specifically, node $i$ can begin handling a user input (a write event) with tag $g_{i}$ at physical time $T_{i}=\mathcal{T}(g_{i})+O_{i}$ . In the absence of any further information about a program, we can use our $\Gamma$ matrix to calculate these offsets. However, the result is conservative because it does not use dependency information that may be present in a program (and is present in the Lingua Franca programs we give in the next section). A less conservative technique is explained below in Section V-B.

First, in the current implementation of LF, by default, logical time “chases” physical time, meaning that logical time never gets ahead of physical time. To model this, define a zero column vector $\bm{Z}$ where every element is zero. With this, we require at least that $\bm{O}\geq\bm{Z}.$ Note that, in general, this vector could be given negative numbers or even $-\infty$ , in which case the federate may be able to advance logical time ahead of physical time, but this is not currently supported in LF, so we use a zero vector. In addition, to ensure that node $i$ processes events in tag order, it is sufficient to ensure that node $i$ has received all network input events with tags less than or equal to $g_{i}$ before processing any event with tag $g_{i}$ . With this (conservative) policy, $O_{i}\geq\max_{j}(\mathcal{L}_{ij}-\bar{C}_{ij}).$ The smallest processing offsets that satisfy these two constraints satisfy

\bm{O}=\bm{Z}\oplus\Gamma\bm{O}.

(10)

This is a system of equations in the max-plus algebra. From Baccelli, et al. [3] (Theorem 3.17), if every cycle of the matrix $\Gamma$ has weight less than zero, then the unique solution of this equation is

\bm{O}=\Gamma^{*}\bm{Z},

(11)

where the Kleene star is $\Gamma^{*}=\bm{I}\oplus\Gamma\oplus\Gamma^{2}\oplus\cdots.$ Baccelli et al. (Theorem 3.20 [3]) show that this reduces to

\Gamma^{*}=\mathbf{I}\oplus\Gamma\oplus\cdots\oplus\Gamma^{N-1},

(12)

where $N$ is the number of processes.

The requirement that the cycle weights be less than zero is intuitive, but overly restrictive. It means that along any communication path from a node $i$ back to itself, the sum of the logical delays $D_{jk}$ must exceed the sum of the execution times, network latencies, and clock synchronization errors along the path. This implies that we have to tolerate a non-zero inconsistency somewhere on each cycle.

In practice, programs may have zero or positive cycle means. Theorem 3.17 of Baccelli, et al. [3] shows that if all cycle weights are non-positive, then there is a solution, but the solution may not be unique. If there are cycles with positive cycle weights, there is no finite solution for $\bm{O}$ in (10). In this case, the only solution to (10) sets all the processing offsets to $\infty$ . Every node must wait forever before handling any user input. This is, of course, the ultimate price in availability.

Equation (11) is conservative because, absent more information about the application logic, we must assume that any network input at node $i$ with tag $g_{i}$ can causally affect any network output with tag $g_{i}$ or larger. For particular applications, it is possible to use the detailed structure of the Lingua Franca program to derive processing offsets that are not conservative, as we do below in Section V-B.

IV Availability and Consistency in LF

In this section, we briefly introduce Lingua Franca and show how it expresses consistency and availability requirements. We then discuss how processing offsets can be determined.

IV-A Brief Introduction to Lingua Franca

Lingua Franca (or LF, for short) [26] is a polyglot coordination language that orchestrates concurrent and distributed programs written in any of several target languages (as of this writing, C, C++, Python, TypeScript, and Rust). In LF, applications are defined as concurrent compositions of components called reactors [25, 24]. LF borrows the best semantic features of established models of computation, such as actors [2], logical execution time (LET) [14], synchronous reactive languages [5], and discrete event systems [19] including DEVS [36] and SystemC [22]. LF programs are concurrent and deterministic [18] (except when fault handlers are invoked). Given any set of tagged input events, there is exactly one correct behavior.

Fig. 1 gives a simple example that we use to explain the structure of an LF program and how it specifies availability and consistency requirements. This program defines a simple pipeline consisting of a data source, a data processor, and a data sink. The data source could, for example, poll a sensor and filter its readings. The data processor could use the sensor data to calculate a command to send to an actuator. The data sink could drive the actuator.

The diagram at the bottom of the figure is automatically generated by the LF tools given the source code at the top.¹¹1The diagram synthesis feature was created by Alexander Schulz-Rosengarten of Kiel University using the graphical layout tools from the KIELER Lightweight Diagrams framework [32] (see https://rtsys.informatik.uni-kiel.de/kieler). In later examples, we will show the diagram only and not the source code because the diagram contains sufficient information. The chevrons in the figure represent reactions, which process events, and their dependencies on inputs and their ability to produce outputs is shown using dashed lines.

Line 1 in the source code defines the target language, which is the language in which reactions are written, and the language into which the entire LF program is translated. This example uses the C target, which means that the bodies of reactions are written in C.

Line 2 declares a reactor class named Sense, which has an output port (line 3), a timer (line 4), a state variable (line 5), and a reaction (line 6). The output port has name out and type int. The timer has name t, offset 0 (meaning it should start when the program starts), and period 10 ms. The state variable has a name, type, and initial value. Each of these properties of the reactor is represented in the diagram at the bottom.

A reaction, like that on line 6, is defined with a syntax of the form

reaction( $L_{1}$ ) $L_{2}$ -> $L_{3}$ {= code body =}

where $L_{1}$ is a list of triggers, which are inputs, timers, and actions (we will discuss actions later); $L_{2}$ is an optional list of observables, which are inputs and actions that do not trigger the reaction but may be read by the reaction; and $L_{3}$ is an optional list of effects, which are outputs, actions, and modes (which are not used in this paper).

Refer to caption — Figure 1: Structure of an LF program for a simple pipeline.

The particular reaction on line 6 is triggered by the timer every 10 ms. When it is triggered for the $n$ -th time, its logical time will be $t=s+n\times 10$ ms, where $s$ is the logical start time, typically set using the local physical clock when the program starts. The runtime system attempts to align logical time with physical time, so this reaction will be invoked roughly every 10 ms, but this cannot be done perfectly. By default, logical time “chases” physical time in a program execution, so that reactions with a logical time $t$ are invoked close to (but never before) physical time $T=t$ .

A reaction may optionally have a deadline, as shown in the Actuate reactor class on line 14. This gives a time value and a code body to execute instead of the reaction if the deadline is violated. The time value may be a parameter of the reactor class but here is shown as the constant 10 ms. A deadline with time value $d=10$ ms is violated for an event with tag $g$ if the reaction is invoked at a physical time $T$ where $T>\mathcal{T}(g)+d$ . Such a deadline explicitly specifies an availability requirement. The deadline violation handler (line 15) is a fault handler. The LF runtime system uses an EDF scheduler to attempt to avoid violating this deadline, as usual in real-time systems.

The top-level (main) reactor is defined on line 25. Within it, reactor instances are created as on line 26 using the new keyword. If the main keyword is replaced with federated, then a separate C program is generated for each reactor instantiated within the federated reactor. Otherwise, a single multi-threaded C program is generated for the entire program. For the federated case, each instance is called a federate, and tagged inputs will arrive from the network at the input ports and be handled in tag order.

The routing of messages is specified by connections, as shown on line 29, which connects the output of i1 to the input of i2. Such a connection may optionally alter the timestamp of the message using the after keyword. The connection on line 29 specifies that the timestamp of the input event at i2 should be 10 ms greater than the timestamp at the output of i1. Such a logical delay explicitly relaxes the consistency requirements because it explicitly states that i2 can use information that is 10 ms out of date relative to i1.

We can see immediately that use of logical delays improves availability for this example, as expected from the CAL theorem. Suppose that this program is federated and that the three instances are mapped to distinct processors on a network. Were it not for the logical delays, intuitively, the Actuate reactor i3 would be unable to react until the message from the Sense reactor i1 had flowed through the network to i2, i2 had completed its reaction, and the result from i2 had flowed over the network to i3. If these delays add up to more than 10 ms, the i3’s deadline will be missed. With the logical delays, however, as long as the delays add up to less than 30 ms, the deadline will not be missed. If the delays add up to less than 20 ms, then i3 can react to its input with timestamp $t$ as soon as physical time $T$ matches or exceeds $t$ . The specified tolerance for inconsistency improves availability.

This intuition can be made rigorous using the CAL theorem.

IV-B Evaluating Unavailability

For the program in Fig. 1, the $\Gamma$ matrix is given by

\Gamma=\left[\begin{matrix}0&-\infty&-\infty\\ \Gamma_{21}&0&-\infty\\ -\infty&\Gamma_{32}&0\end{matrix}\right]

(13)

where

•

$\Gamma_{21}=X_{21}+L_{21}+E_{21}-10\mbox{ms}$ ,
•

$\Gamma_{32}=X_{32}+L_{32}+E_{32}-10\mbox{ms}$ ,

and

•

$X_{21}$ is the execution time for the reaction in Sense,
•

$L_{21}$ is the network latency from Sense to Compute,
•

$E_{21}$ is the clock synchronization error from i1 to i2,
•

$X_{32}$ is the execution time for the reaction in Compute,
•

$L_{32}$ is the network latency from Compute to Actuate, and
•

$E_{32}$ is the clock synchronization error from i2 to i3.

The $-\infty$ entries in the matrix are a consequence of a lack of communication.

On the communication path from i1 to i2, there is a logical delay of 10 ms, which is an explicit declaration of an inconsistency $\bar{C}_{21}=10$ ms. The Compute reactor’s view of the data from the Sense reactor is 10 ms behind. We can now determine that this allowance of 10 ms of inconsistency improves availability compared to what we would get without it.

First, we can use the analysis of Section III-F to evaluate the processing offsets. For this example, $N=3$ , so (12) reduces to

\Gamma^{*}=\mathbf{I}\oplus\Gamma\oplus\Gamma^{2}.

It is straightforward to evaluate this to get

\Gamma^{*}=\left[\begin{matrix}0&-\infty&-\infty\\ \Gamma_{21}&0&-\infty\\ \Gamma_{21}+\Gamma_{32}&\Gamma_{32}&0\end{matrix}\right]

Intuitively, this matrix captures the fact that the Actuate reactor indirectly depends on the Sense reactor, something not directly represented in the $\Gamma$ matrix.

We can now evaluate (11) to get

\bm{O}=\Gamma^{*}\bm{Z}=\left[\begin{matrix}0\\ \max(\Gamma_{21},0)\\ \max(\Gamma_{21}+\Gamma_{32},\Gamma_{32},0)\end{matrix}\right]

(14)

Next, we evaluate (9) to get the unavailability at each node,

\bm{A}=(\bm{I}\oplus\Gamma)\bm{O}=\left[\begin{matrix}0\\ \max(\Gamma_{21},0)\\ \max(\Gamma_{21}+\Gamma_{32},\Gamma_{32},0)\end{matrix}\right]

(15)

In this simple case, the unavailability is equal to the processing offsets, which means that the processing offsets capture all the waiting that needs to be done to realize the semantics of the program.

These unavailability numbers are intuitive. First, note that the Sense can react to external stimulus immediately. It has no network inputs to worry about, so $\bm{A}_{0}=0$ . The Compute reactor, however, can react to an input stimulus with timestamp $t$ immediately when physical time $T=t$ only if $X_{21}+L_{21}+E_{21}\leq 10\mbox{ms}$ . Otherwise, in order to ensure that it processes events in timestamp order, it may need to wait until $T=t+X_{21}+L_{21}+E_{21}-10\mbox{ms}$ . Similarly, the Actuate reactor can respond immediately if $\Gamma_{32}\leq 0$ and $\Gamma_{21}+\Gamma_{32}\leq 0$ . These conditions occur if the 10 ms logical delay is larger than the apparent latencies in communication.

If we change line 29 to this subtly different version:

⬇

29 i1.out

\small\mathtt{\sim}

> i2.in;

then there is no upper bound on the inconsistency between these two instances. The subtle change is to replace the logical connection -> with a physical connection $\small\mathtt{\sim}$ >. In Lingua Franca, this is a directive to assign a new tag $g_{i}$ at the receiving end $i$ based on a local measurement of physical time $T_{i}$ when the message is received such that $\mathcal{T}(g_{i})=T_{i}$ . The original tag is discarded. Such connections, therefore, have no effect on availability, but they completely abandon consistency.

IV-C Processing Offsets in Lingua Franca

Lingua Franca offers two coordination strategies for federated execution, centralized and decentralized [4]. The centralized coordinator is an extension of the High-Level Architecture (HLA) [12], a distributed discrete-event simulation standard. The decentralized coordinator is an extension of PTIDES [38], a real-time technique also implemented in Google Spanner [10], a globally distributed database. This coordinator is also influenced by Lamport [16] and Chandy and Misra [9, 30].

For the purposes of this paper, we only need to know how these coordination mechanisms relate to processing offsets and availability. The centralized coordinator is the easiest to understand. It does whatever is necessary to ensure that events are processed in tag order. In particular, execution in a federate will be delayed when such a delay is needed to ensure tag order semantics. As a consequence, the processing offsets and unavailability bounds emerge from an execution of the program. As network latencies vary, the offsets and unavailability vary. The CAL theorem tells what to expect these numbers to be, given network latencies. The programmer, therefore, can use the CAL theorem to determine whether deadlines will be met, given specific latencies.

The processing offsets and unavailability bounds play a bigger role when using the decentralized coordinator. In particular, with this coordinator, the programmer is required to specify a safe-to-advance (STA) offset for each federate. The choice of STA at federate $i$ can be guided by processing offset $\bm{O}_{i}$ for federate $i$ , with the caveat that $\bm{O}_{i}$ is a property of a trace, whereas STA is a property of a program (a family of traces). The STA specified for a federate $i$ enforces $O_{i}$ during execution for all the traces produced by federate $i$ . For particular reactions, the programmer can also give an additional safe-to-assume-absent (STAA) offset. STAA gives an additional time beyond the STA to wait before assuming that the absence of a message on a particular input means that there will be no message on that port with the current tag or less. The availability bound for any shared state read by that reaction can similarly guide the choice of STAA. Such guidance is much better than guesswork.

Both coordinators will deterministically execute the LF program identically, yielding the same behavior as an unfederated execution, under certain assumptions. For the centralized coordinator, the assumptions are that the network latencies are sufficiently low that no deadlines are missed. For the decentralized coordinator, the assumptions are that the network latencies are sufficiently low that events are seen by each reactor in tag order. In both cases, violations of the assumptions are detectable and can be handled by fault-handling code provided by the programmer.

The key difference between the two coordinators, therefore, is in their fault handling. When network latencies get large (or the network gets partitioned), the centralized coordinator sacrifices availability, whereas the decentralized coordinator sacrifices consistency. Which of these is the right choice is application dependent.

Note also that for both coordinators, the CAL theorem can be used to derive the requirements on network latencies, and therefore provides a principled guide for choosing networking technology and can guide refactoring designs to move computations between embedded, edge, and cloud computing.

IV-D Fault Handling

Any assumptions about network latency may be violated in the field. In centralized coordination, such violations will manifest as deadline violations, whereas in decentralized coordination, they will manifest as consistency violations. In both cases, LF allows the programmer to specify exception handlers to be invoked when such violations occur.

Hence, for safety-critical CPS applications, the proposed framework promises some astonishingly attractive properties. First, a programmer can explicitly decide when and how much to give up consistency and when and how much to give up availability to accommodate execution times, network latency, and clock synchronization error. Second, the programmer’s specification will imply explicit constraints on the technology (networking, processing, and clock synchronization) that can be used to guide selection of parts to use and mapping of software components onto resources (embedded, edge, or cloud). Third, to allow for (inevitable) possible failures in the field, where specifications are not met due to unforeseen circumstances such as hardware failures, the programmer can explicitly give code to execute when the fault occurs.

We will show next how these principles can be applied through two complementary practical examples.

V Tradeoffs in Practical Systems

In this section, we consider two safety-critical real-time systems with complementary properties. The first demands that availability (timely responses) be prioritized over consistency in the presence of faults. This example also requires a measured relaxation of consistency to meet tight deadlines. The second example demands that consistency be prioritized over availability in the presence of faults. The second example also illustrates how cycles are handled by the CAL theorem.

V-A ADAS

Consider an Advanced Driver Assist Systems (ADAS), as shown in Fig. 2. Such a system uses a camera with a computer vision system that analyzes images for pedestrians and applies braking when a pedestrian is detected, as shown in the photo. The figure shows the diagram synthesized from an LF program that shows the structure of this system. In this structure, there are two federated reactors, a Vision subsystem and a Braking subsystem. The structure is a pipeline similar to that of Fig. 1, except with a twist. Inside the Braking subsystem is a second sensor, which senses when a driver presses on the brake pedal. Both the camera and the pedal can affect the same actuator, which is driven by the Brake reactor.

The Vision federate has a time-triggered periodically invoked reaction that captures and analyzes an image. It then sends the results over the network to the Braking federate. The Braking federate has a local interface to a sensor on the brake pedal, represented in the diagram by the triangle with a “P” (which represents a physical action in LF). When the brake pedal is pushed, an event is generated that is assigned a time stamp from the local clock and triggers invocation of the reaction labeled “2” within the BrakePedal reactor.

Let the Vision federate be process 1 and the Braking federate be process 2. Then the $\Gamma$ matrix is similar to (13):

\Gamma=\left[\begin{matrix}0&-\infty\\ \Gamma_{21}&0\end{matrix}\right]

(16)

where

•

$\Gamma_{21}=X_{21}+L_{21}+E_{21}-10\mbox{ms},$

and

•

$X_{21}$ is the execution time in Vision to prepare the data to send to Braking,
•

$L_{21}$ is the network latency from Vision to Braking, and
•

$E_{21}$ is the clock synchronization error.

The logical delay of 10 ms on the communication path from 1 to 2 is an explicit declaration of an inconsistency $\bar{C}_{21}=10$ ms. The Braking system’s view of the sensor data from the Vision system is 10 ms behind. Using the same methods, the processing offsets and unavailability are similar to (14) and (15):

\bm{O}=\left[\begin{matrix}0\\ \max(\Gamma_{21},0)\end{matrix}\right]\quad\bm{A}=\left[\begin{matrix}0\\ \max(\Gamma_{21},0)\end{matrix}\right]

The allowance of 10 ms of inconsistency improves availability compared to what we would get without it. In particular, if $X_{21}+L_{21}+E_{21}\leq 10\mbox{ms},$ then the processing offsets and unavailability are all zero.

In Fig. 2, notice that the first reaction of Brake has a deadline of 3 ms. This deadline is as an explicit requirement for availability, stating, effectively, that we require

X_{21}+L_{21}+E_{21}-10\mbox{ms}+X_{22}\leq 3\mbox{ms},

where $X_{22}$ is the execution time of reaction 2 in BrakePedal.²²2Note that LF implements an EDF scheduling policy, and that deadlines are inherited upstream, so reaction 2 of BrakePedal will have high priority. The requirement becomes

X_{21}+L_{21}+E_{21}<13\mbox{ms}-X_{22}.

(17)

This requirement almost certainly means that the vision processing cannot be done in the cloud. If it is, then the deadline is likely to be violated. In principle, this analysis can be automated, so that a system designer simply enters the requirements (by specifying deadlines, communication paths, and consistency requirements), and the system provides feedback on the realizability of the requirements.

If the system designer really wants to do the vision processing in the cloud, then these results can be used to negotiate a service-level agreement, for example, with the 5G network vendor and the cloud service provider. Alternatively, the 10ms tolerance for inconsistency could be increased, although this would require an evaluation of whether the ADAS system continues to be able to do its job safely.

Once the requirements and assumptions are specified, then the next key decision is what to do when those assumptions are violated. For this application, missing the deadline could be disastrous, so we should emphasize availability over consistency. To accomplish that in Lingua Franca, we just have to specify to use decentralized coordination. With this coordination mechanism, if messages fail to arrive on time from the network, each local runtime system assumes there are no messages and continues accordingly. This will ensure that the brake pedal event gets handled as long as the Braking federate’s host computer is still working.

V-B Four-Way Intersection

Consider autonomous or semi-autonomous vehicles that leverage communication with a roadside unit (RSU) to mediate access to a four way intersection. There are many projects working on such automation to improve traffic flow [31, 34]. A prototype is depicted in Fig. 3. This prototype uses a popular open-source vehicle simulator called Carla, which generates the animated image in the figure that gets updated as the program runs. The prototype is implemented using the Python target in LF, which enables easy integration of large legacy subsystems, such as Carla, that have Python APIs.

The LF program depicted in Fig. 3 consists of nine top-level components, four vehicle simulators, four vehicle controllers, and one roadside unit. The program uses a compact LF notation for banks of reactors and a multiplicity of communication channels. In this program, as a vehicle approaches the intersection, it communicates with the RSU, sending its kinematic state (position and velocity). The RSU handles competing requests for access to a single shared resource, the intersection, by granting time windows to particular vehicles during which they may use the intersection. This application represents a common pattern that occurs whenever distinct agents contend for access to a shared resource.

A key property of this application is that, very much unlike the ADAS example in Section V-A, consistency is far more important. All vehicles and the RSU must have a consistent view of the state of the intersection before any vehicle can enter the intersection. In other words, we prefer that vehicles stop (making the intersection unavailable) over having them enter the intersection without a consistent view on the state of the system, which could lead to a collision.

In LF, if we choose centralized coordination for the federated execution, the system will emphasize consistency over availability in the event of faults (violations of the assumptions and requirements). If messages from the network do not arrive on time, each federate stops progressing, which will prevent a vehicle from entering the intersection.

The contrast between the requirements of the intersection and ADAS examples demonstrates that tradeoffs between availability and consistency are application dependent. System designers should be able to make such tradeoffs in system requirements, and software needs to be designed so it responds to faults in coherence with the stated requirements. For the ADAS example, we need to sacrifice consistency, whereas for the intersection example, we need to sacrifice availability.

A second key difference between this intersection example and the ADAS example is that the program has communication cycles without logical delays. This changes how we do the analysis because we will no longer be able to use (11) to determine the processing offsets. Instead, we have to derive the processing offsets using more detailed information about the program structure. We now show how to do that.

The simpler LF program depicted in Fig. 4 has the essential structure of the intersection example reduced to the minimum that illustrates the issues. The $\Gamma$ matrix is

\Gamma=\left[\begin{matrix}0&-\infty&\Gamma_{13}&-\infty\\ -\infty&0&-\infty&\Gamma_{24}\\ \Gamma_{31}&\Gamma_{32}&0&-\infty\\ \Gamma_{41}&\Gamma_{42}&-\infty&0\end{matrix}\right]

The finite non-zero entries are defined as before,

\Gamma_{ij}=X_{ij}+L_{ij}+E_{ij}-\bar{C}_{ij},

where, in this case, $\bar{C}_{ij}=0$ because none of the connections has a logical delay. To find the processing offsets, we use information that is evident in the LF program but not in the matrix. Specifically, note that any inputs that arrive at the inputs of the Sim reactors will have the same tag as an output produced by one of the two Sim reactors. Moreover, the two Sim reactors’ outputs are driven by timers with the same offset and period (zero and 16 ms), so these outputs are logically simultaneous. We now make a key assumption:

Assumption 1

The period of the timers is greater than any unavailability.

With this assumption, each Sim reactor will have completed processing all events with timestamp $t$ before it needs to advance to logical time $t+p$ , where $p$ is the timer period. Recall that we are assuming that logical time chases physical time, so physical time has to advance to $t+p$ before the federate will even attempt to advance its logical time. At this point, it can safely advance its logical time immediately. Hence, the processing offset for both Sim reactors is zero if our assumption 1 is true.

The processing offset for the two Vehicle is easier to derive. It simply depends on the communication latencies, clock synchronization errors, and execution time bounds. The resulting processing offset vector is

\bm{O}=\left[\begin{matrix}0\\ 0\\ \max(\Gamma_{31},\Gamma_{32})\\ \max(\Gamma_{41},\Gamma_{42})\end{matrix}\right]

(18)

We can now use (9) to calculate the unavailability:

\bm{A}=(\bm{I}\oplus\Gamma)\bm{O}=\left[\begin{matrix}\Gamma_{13}+\max(\Gamma_{31},\Gamma_{32})\\ \Gamma_{24}+\max(\Gamma_{41},\Gamma_{42})\\ \max(\Gamma_{31},\Gamma_{32})\\ \max(\Gamma_{41},\Gamma_{42})\end{matrix}\right]

(19)

We can now see that if the clock period is less than that top two entries in this vector, then assumption 1 will be violated, so this becomes a requirement.

These results are intuitive and correspond with observation when we run the federated LF program. Assumption 1 asserts that the period of the clocks is large enough that each period begins fresh without an accumulated backlog of unprocessed events. The execution will begin each period by advancing the logical time of each Sim federate to the next period as soon as physical time matches that logical time. The zeros in the first two entries of (18) tell us this is done without delay. The logical time of the two Vehicle federates, however, cannot be advanced until enough physical time has elapsed to allow for propagation of events from both Sim federates. This delay is represented by last two entries in (18).

As shown in (19), the unavailability of the two Vehicle federates matches their processing offset. This is not surprising because they each have only one reaction and that reaction reacts to both network inputs. However, the unavailability at the Sim reactors is larger than their processing offset. This reflects the fact that reaction 2 in each of the Sim reactors has to wait for the upstream Vehicle reactors to execute and for their results to propagate over the network. In other words, strong consistency—which lets actuation be logically simultaneous with the acquisition of sensor inputs—comes at the cost of a penalty in availability. The actuation is delayed in physical time, and, more fundamentally, the period with which sensing and actuation can be done has a lower bound that depends on the network delays.

Under centralized coordination, the actual values of all the $\Gamma_{ij}$ latencies are determined automatically at runtime as apparent latencies. If the program fails to keep up for any reason (e.g. network failures), then the centralized coordinator will preserve consistency; unavailability will rise and deadlines (if any are specified) will be missed. Fault handlers provided by the programmer can adapt the system accordingly. Moreover, in this case, assumption 1 becomes invalid, so the derived unavailability bound becomes invalid. Unavailability will exceed our calculated bound for such a trace and, in the event of total network failure, will grow without bound.

Under decentralized coordination, the programmer chooses numbers to the $\Gamma_{ij}$ latencies based on assumptions about network behavior and derives processing offsets (18) and unavailability (19). These then guide the choices of STA and STAA numbers specified in the program. At runtime, each federate proceeds on the assumption that the network latencies will be respected. If these assumptions are violated, then a reactor may see events out of timestamp order, in which case a fault handler will be invoked. If the network fails altogether, however, no reactor will see events out of timestamp order, and no fault handler will be invoked. Instead, each vehicle will act based on inconsistent information. Hence, with decentralized coordination, availability is prioritized over consistency when a fault occurs, which is the wrong choice for this application.

VI Conclusions

The CAL theorem, which generalizes Brewer’s CAP theorem, quantifies the relationship between inconsistency, unavailability, and apparent latency in distributed systems, where apparent latency includes network latency, execution time overhead, and clock synchronization error. The relationship is a linear system of equations in a max-plus algebra. We have applied this theorem to distributed real-time systems, showing how consistency affects the ability to bound the time it takes to react to an external stimulus, such as a sensor input, and produce a response, an actuator output. These bounds (which we call unavailability) depend on apparent latency and can be reduced by explicitly relaxing consistency requirements. Moreover, because the CAL theorem defines the effect of network latency on the responsiveness of a system, it can serve to guide placement of software components in end devices, in edge computers, or in the cloud. The consequences of such choices can be derived rather than measured or intuited.

We have shown how the Lingua Franca coordination language enables arbitrary tradeoffs between consistency and availability as apparent latency varies. We have also shown how LF programs can define fault handlers, sections of code that are executed when specified consistency and availability requirements cannot be met because apparent latency has exceeded the assumed bounds. Because of its deterministic semantics, LF provides predictable and repeatable behaviors in the absence of faults. And when faults occur, LF provides mechanisms for the system to adapt.

References

[1] Daniel Abadi. Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2):37–42, February 2012.
[2] Gul A. Agha. Abstracting interaction patterns: A programming paradigm for open distributed systems. In E. Najm Stefani and J.-B., editors, Formal Methods for Open Object-based Distributed Systems, IFIP Transactions. Chapman and Hall, 1997.
[3] F. Baccelli, G. Cohen, G. J. Olster, and J. P. Quadrat. Synchronization and Linearity, An Algebra for Discrete Event Systems. Wiley, New York, 1992.
[4] Soroush Bateni, Marten Lohstroh, Hou Seng Wong, Rohan Tabish, Hokeun Kim, Shaokai Lin, Christian Menard, Cong Liu, and Edward A. Lee. Xronos: Predictable coordination for safety-critical distributed embedded systems. arXiv:2207.09555 [cs.DC], July 2022.
[5] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79(9):1270–1282, 1991.
[6] Eric Brewer. Towards robust distributed system. In Symposium on Principles of Distributed Computing (PODC), 2000. Keynote talk.
[7] Eric Brewer. CAP twelve years later: How the ”rules” have changed. IEEE Computer, 45(2):23–29, February 2012.
[8] Adam Cataldo, Edward A. Lee, Xiaojun Liu, Eleftherios Matsikoudis, and Haiyang Zheng. A constructive fixed-point theorem and the feedback semantics of timed systems. In Workshop on Discrete Event Systems (WODES), 2006.
[9] K. Mani Chandy and Jayadev Misra. Distributed simulation: A case study in design and verification of distributed programs. IEEE Trans. on Software Engineering, 5(5):440–452, 1979.
[10] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google’s globally-distributed database. In OSDI, 2012.
[11] Fabio Cremona, Marten Lohstroh, David Broman, Edward A. Lee, Michael Masin, and Stavros Tripakis. Hybrid co-simulation: it’s about time. Software and Systems Modeling, 18:1655–1679, November 2017.
[12] Judith S Dahmann, Richard M Fujimoto, and Richard M Weatherly. The department of defense high level architecture. In Proceedings of the 29th conference on Winter simulation, pages 142–149, 1997.
[13] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, page 33(2), June 2002.
[14] Christoph M Kirsch and Ana Sokolova. The logical execution time paradigm. In Advances in Real-Time Systems, pages 103–120. Springer, 2012.
[15] Martin Kleppmann. A critique of the CAP theorem, 2015. arXiv:1509.05393 [cs.DC].
[16] Leslie Lamport. Using time instead of timeout for fault-tolerant distributed systems. ACM Transactions on Programming Languages and Systems, 6(2):254–280, 1984.
[17] Leslie Lamport, Robert Shostak, and Marshall Pease. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.
[18] Edward A. Lee. Determinism. ACM Transactions on Embedded Computing Systems (TECS), 20(5):1–34, July 2021.
[19] Edward A. Lee, Jie Liu, Lukito Muliadi, and Haiyang Zheng. Discrete-event models. In Claudius Ptolemaeus, editor, System Design, Modeling, and Simulation using Ptolemy II. Ptolemy.org, 2014.
[20] Edward A. Lee and Alberto Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Circuits and Systems, 17(12):1217–1229, 1998.
[21] Edward Ashford Lee. The Coevolution: The Entwined Futures of Humans and Machines. MIT Press, Cambridge, MA, 2020.
[22] Stan Liao, Steve Tjiang, and Rajesh Gupta. An efficient implementation of reactivity for modeling hardware in the Scenic design environment. In Design Automation Conference. ACM, Inc., 1997.
[23] Lucia Lo Bello and Wilfried Steiner. A perspective on IEEE time-sensitive networking for industrial communication and automation systems. Proceedings of the IEEE, 107(6):1094–1120, 2019.
[24] Marten Lohstroh. Reactors: A Deterministic Model of Concurrent Computation for Reactive Systems. PhD thesis, EECS Department, University of California, Berkeley, Dec 2020.
[25] Marten Lohstroh, Íñigo Íncer Romeo, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, and Alberto Sangiovanni-Vincentelli. Reactors: A deterministic model for composable reactive systems. In 8th International Workshop on Model-Based Design of Cyber Physical Systems (CyPhy’19), volume LNCS 11971. Springer-Verlag, 2019. in press.
[26] Marten Lohstroh, Christian Menard, Soroush Bateni, and Edward A. Lee. Toward a lingua franca for deterministic concurrent systems. ACM Transactions on Embedded Computing Systems (TECS), 20(4):Article 36, May 2021.
[27] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.
[28] Oded Maler, Zohar Manna, and Amir Pnueli. From timed to hybrid systems. In Real-Time: Theory and Practice, REX Workshop, pages 447–484. Springer-Verlag, 1992.
[29] Friedemann Mattern. Virtual time and global states of distributed systems. In Michel Cosnard, Patrice Quinton, Michel Raynal, and Yves Robert, editors, Parallel and Distributed Algorithms, pages 215–226. North-Holland, 1988.
[30] Jayadev Misra. Distributed discrete event simulation. ACM Computing Surveys, 18(1):39–65, 1986.
[31] B.S.Y. Rao and P. Varaiya. Roadside intelligence for flow control in an intelligent vehicle and highway system. Transportation Research Part C: Emerging Technologies, 2(1):49–72, 1994.
[32] Christian Schneider, Miro Spönemann, and Reinhard von Hanxleden. Just model! – Putting automatic synthesis of node-link-diagrams into practice. In Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC ’13), pages 75–82, San Jose, CA, USA, September 2013.
[33] Reinhard Schwarz and Friedemann Mattern. Detecting causal relationships in distributed computations: in search of the holy grail. Distributed Computing, 7:149–174, 1994.
[34] Sanaz Shaker Sepasgozar and Samuel Pierre. Network traffic prediction model considering road traffic parameters using artificial intelligence methods in vanet. IEEE Access, 10:8227–8242, 2022.
[35] Haifeng Yu and Amin Vahdat. The costs and limits of availability for replicated services. ACM Transactions on Computer Systems, 2(24):70–113, February 2006.
[36] Bernard P Zeigler, Yoonkeon Moon, Doohwan Kim, and George Ball. The devs environment for high-performance modeling and simulation. IEEE Computational Science and Engineering, 4(3):61–71, 1997.
[37] Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A. Lee. AWStream: Adaptive wide-area streaming analytics. In Proceedings of SIGCOMM, Budapest, Hungary, August 20-25 2018. ACM.
[38] Yang Zhao, Edward A. Lee, and Jie Liu. A programming model for time-synchronized distributed real-time systems. In Real-Time and Embedded Technology and Applications Symposium (RTAS), pages 259 – 268. IEEE, 2007.

Consistency vs. Availability in Distributed Real-Time Systems* ††thanks: * This work was supported by the iCyPhy Research Center at UC Berkeley, supported by Denso, Siemens, and Toyota.