CoMesh: Fully-Decentralized Control for Sense-Trigger-Actuate Routines in Edge Meshes
Abstract.
While mesh networking for edge settings (e.g., smart buildings, farms, battlefields, etc.) has received much attention, the layer of control over such meshes remains largely centralized and cloud-based. This paper focuses on applications with sense-trigger-actuate (STA) workloads—these are similar to the abstraction of routines popular in smart homes, but applied to larger-scale edge IoT deployments. We present CoMesh, which tackles the challenge of building local, non-cloud, and decentralized solutions for control of sense-trigger-actuate applications. At its core, CoMesh uses an abstraction called k-groups to spread the load of STA actions in a fine-grained way. Coordination within the k-group uses selective fast and cheap mechanisms rather than expensive off-the-shelf solutions. k-group selection is proactively dynamic, and occurs by using a combination of zero-message-exchange mechanisms (to reduce load) and locality-sensitive hashing (to be aware of the physical layout of devices). We analyze and theoretically prove the safety of CoMesh’s mechanisms. Our evaluations using both simulation and Raspberry Pi lab deployments show that CoMesh is load-balanced, fast, and fault-tolerant.
1. Introduction
Commercial edge deployments with Internet of Things (IoT) devices are rapidly growing in size and density. There are over 10 B IoT devices today, expected to exceed 25 B by 2030 (J., 2021). For instance, smart buildings account for 1.26 B devices, and the smart building IoT sector is forecast to grow to 2.5 B devices by 2027, and comprise a $90+ B market (Memoori, 2022). Other emerging areas for edge IoT deployment include smart farms with robots and sensors (Vasisht et al., 2017; Ryu et al., 2015; Gan and Lee, 2018; Chandra and Collis, 2021; Sivakumar et al., 2021), battlefield deployments (Liu et al., 2022; Feng et al., 2020; Shahid et al., 2021), performance arenas (Fox et al., 2019; Turchet et al., 2018), and even indoor spaces such as warehouses (Fatima et al., 2022; Wang et al., 2022) and large marketplaces (Lampropoulos et al., 2019; Lin et al., 2016) with growing markets (Research, 2022; Aarti and Vineet, 2022; Future, 2022).
In such physically distributed settings, it is commonplace today to build mesh networks among the IoT and edge devices. Mesh networks can be robust, resilient, and allow easy “scale out” by adding and removing devices, without requiring a central operating unit (or hub) which may be overloaded and prolong latencies. Some devices may be “smart” devices capable of compute and storage (both small-scale), while other devices may be “simple” devices. Mesh networks are today being built using Wifi, Zigbee, Bluetooth, LoRa, etc. (K. et al., 2021; Guo et al., 2022; Li et al., 2022). Industrial mesh networks using IEEE 802.15.4, WirelessHart (Song et al., 2008), or IEC (ISO Central Secretary, 2016) are becoming common. There is rich literature at the networking layer, e.g., on how to build, configure, and route inside mesh networks (Akyildiz and Wang, 2005; Yang et al., 2005).
Sitting above the mesh network is the control plane—this needs to execute three kinds of actions: Sense-Trigger-Actuate (or STA). First, the control plane needs to sense: continuously collect measurements from multiple devices. Applications specify predicates, which are arbitrary Boolean clauses involving multiple device readings. Second, when a predicate is satisfied, the control plane needs to trigger a programmed series of actions that touch many devices. Third, because some of these actions may be long-running (e.g., mechanical movements, repetitive actions, or time-based actions such as water sprinklers), the control plane needs to initiate and monitor actuation of these commands (at multiple devices), while satisfying the challenging goal of maintaining consistency (Ahsan et al., 2021).
At a smaller scale, this STA abstraction exists in smart homes. The most popular programmatic abstraction for STA in smart homes, which is also applicable in edge meshes, is a routine. A routine is a sequence of commands that touches multiple devices, wherein each command executes an action on only one device. Multiple routines may be active at any given time, while many more remain dormant, waiting to be triggered. A routine may be triggered either based on sensor values, or at particular times, (or a combination thereof), or manually. Routines are offered today by Amazon Alexa (Amazon Alexa, 2014), Google Home (Google Home, 2016), Samsung SmartThings (Samsung Smart Things, 2012), and others. For instance, one routine at an office building may be triggered every week day at 9 pm, switching on security cameras, switching on specific external lights, and locking the outside doors. Another routine might be triggered by a combination of sensors and time, e.g., if the building temperature drops below 45 F between 11 pm and 6 am, take emergency actions to prevent bursting of pipes, and send notifications to facilities staff. During the day multiple routines may run continuously, e.g., some raising and lowering shades on the exterior of the building in order to regulate sunlight flowing into the building as a function of temperature, while other routines change interior lighting in different rooms and corridors as a function of occupancy, etc. Routines may be triggered either by humans (on demand) or automatically.
In a smart home, the small number of devices and the physically proximate nature of the home mean that a single central hub (e.g., an Alexa unit or a Google Home unit) can be used to execute all three S-T-A categories of actions in the control plane. However, in commercial IoT deployments the central hub approach becomes untenable, as these deployments span large areas, involve many tens or hundreds of devices or more, and have routines (active + dormant) numbering from several tens to a few hundred. In fact, a centralized hub may be infeasible to even set up in really large areas like smart farms and battlefields. Hence mesh networks are growing in popularity for such settings.
Where it is possible to set up, the central hub may be overwhelmed very quickly by S-T-A traffic. For the setting with routines that we described, with a (say) 1 second fidelity for monitoring devices that trigger routines, monitoring just 500 devices in a building involves the home hub exchanging messages every 2 ms; even using AWS IoT’s liberal (suggested) 30 s ping frequency (AWS, 2022) means the central hub exchanges messages every 40 ms.
Distributed support for STA needs to be local, without needing a connection to the cloud. Central hubs in use today are themselves unreliable (Moore et al., 2020), and almost always connect to the cloud which itself may suffer outages (Grothaus, 2019). No mechanisms exist today to coordinate multiple local hubs for the specific STA workload of routines running over large edge meshes.
This paper presents CoMesh (Control Mesh), the first system for monitoring and management of routines (like those available via Amazon Alexa, Google Home, Samsung SmartThings, etc.) over edge meshes. CoMesh is naturally fully-decentralized and consists of local distributed protocols running among available smart devices (other simple devices, lacking compute and memory, may also co-exist)—only a small fraction of devices need to be smart for CoMesh to work. CoMesh relies on neither a cloud nor a local central hub.
CoMesh has to tackle four major challenges: (1) fully-decentralized monitoring, triggering, and actuation of routines, (2) load-balancing work across both devices and time, (3) tolerating simultaneous failures of multiple devices, and (4) keeping load aware of the physical layout of devices. We propose a building block called a k-group, which is a dynamically selected subgroup of smart devices. This allows us to break the STA workload into fine granularities, by assigning a separate k-group for sensing each device (or a group thereof), and one k-group for triggering and actuating each routine. CoMesh’s k-groups use a combination of spatial load balancing and temporal load balancing: a) each k-group is selected via consistent hashing, and b) migrated periodically. The migration spreads the load of “heavy” subgroups (e.g., k-groups assigned to devices whose readings vary significantly, or k-groups assigned to frequently triggered routines) over the system. Within each k-group, we eschew internet-based coordination systems like Zookeeper (Hunt et al., 2010), which may be expensive, and instead carefully design fast and cheap versions of coordination techniques including agreement via quorums, leader election, and responding to failures of the subgroup leader or members. Specifically, to tolerate up to f simultaneous failures, it suffices for each subgroup to contain 2f + 1 members.
To make proactive k-group migrations cheap, CoMesh uses zero-message exchange protocols—“zero-message” means that any node can (in the fast path), without exchanging any messages, calculate the current subgroup for any other given device or routine. In large spaces it is important to keep communication local for latency and energy reasons—CoMesh adapts ideas from the space of locality-sensitive hashing (LSH) (Datar et al., 2004) to IoT and smart space settings, so that subgroup selections are in the locality of the monitored device(s), and yet retain the benefits of zero-message exchange mechanisms.
Underneath CoMesh we require: i) any mesh routing algorithm, and ii) a weakly consistent membership protocol for mesh networks (e.g., Medley (Yang et al., 2022)).
The contributions of this paper are:
• We present CoMesh, the first local and fully-decentralized system for routine monitoring and management over edge meshes intended for sense-trigger-actuate scenarios.
• CoMesh proposes a building block called k-groups, which helps load-balance sensing, triggering, and actuation work over devices. k-groups are selected using zero-message mechanisms and locality-sensitive hashing, and run fast and cheap protocols for coordination and election within each k-group.
• We analyze and theoretically prove safety, liveness, and progress of CoMesh’s mechanisms.
• We perform large-scale simulations as well as smaller-scale Raspberry Pi-4 deployment experiments.
2. System Model
We assume a crash-recovery model: IoT devices may fail (crash) at any time, and then recover at a later time. We do not target Byzantine (malicious) failures and leave them to future work. As is traditional in the fault-tolerance literature, we assume that only up to f simultaneous crash failures occur in the system. The value of f can be calculated from the deployment settings. For instance, buildings contain power domains, each connected to a power breaker. A common failure mode is an entire power domain going out, thus making all the devices inside that domain fail. Thus one rule of thumb is to set f to the number of smart devices in the largest power domain inside the building. Because a building contains many power domains, f remains much smaller than N, the total number of devices in the building.
We assume a subset of the devices are smart devices, equipped with (small) memory and computation capability—other simple (non-smart) IoT devices may also co-exist in the system. Smart devices may include any combination of IoT devices capable of arbitrary compute, Raspberry Pis, home hubs, handheld devices, etc. We assume non-smart devices can only receive and process commands, and respond to queries for status, but they cannot run any computation.
Since IoT devices are typically fixed early in the operation of a building, we assume that the locations of all devices are known by the smart devices (device motion can be treated as a failure event, followed by a recovery event at its new location). We assume smart devices’ clocks are synchronized (although CoMesh works with loose synchronization).
Smart devices communicate via a wireless network that is lossy. For generality, we assume the existence of a wireless ad-hoc routing protocol among smart devices (Dijkstra, 1959; Perkins et al., 2003; Johnson and Maltz, 1996). We assume any smart device can send a command to any IoT device. Finally, CoMesh is built atop a weakly-consistent membership service like Medley (Yang et al., 2022), which helps maintain a full membership list (of all IoT device IDs that are alive) at each smart device. Medley provides timely detection of failure and recovery events. Table 1 summarizes key terms, and Fig. 1 shows an example setting.
Term | Description |
---|---|
Command | A message sent to, and associated action executed by, one IoT device. |
Routine | A sequence of commands. Triggered either by time, by a set of triggering clauses, or manually. |
IoT Device | All IoT devices in system. |
Smart Device | An IoT device that has (small) memory and computation capability. |
Simple Device | An IoT Device that is not a Smart Device. Typically contains sensor(s) or actuator(s). |

3. CoMesh’s subgroups: The k-groups Mechanism
CoMesh’s protocols rely on its subgroup formation and maintenance. CoMesh uses these subgroups to sense-trigger-actuate routines and devices (described in following sections). We describe subgroup formation in this section.
CoMesh’s subgroups are called k-groups, signifying our goal of keeping a k-group’s size at around a (small) value k. k is a globally configured variable. To tolerate f failures system-wide, we set k = 2f + 1. We present k-group selection (Sec. 3.1), election (Sec. 3.2), quorums (Sec. 3.3), temporal migration (Sec. 3.4), failure response (Sec. 3.5), and locality (Sec. 3.6).
3.1. k-group Member Selection
CoMesh’s k-groups need to be selected in a way that is: (i) load-balanced, and (ii) uses a minimum number of messages.
Load-balanced Selection via Zero-message Exchange: A smart device with ID S discovers if it is present in the k-group for a target device with ID D by calculating a consistent hash: Hash(S, D). The k lowest hashed IDs across the entire system are then the approved k-group members. This design is inspired by the idea of cryptographic sortition (Micali et al., 1999; Apecechea, 2019). Notice that, given a fixed hash function (e.g., a cryptographic hash like SHA-2 or MD-5), any arbitrary smart device can calculate the entire membership of any k-group by calculating this hash function on each of the IDs in its (full) membership list. With correct membership lists, this takes zero messages.
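As a concrete illustration, this zero-message selection fits in a few lines. This is a hedged sketch: string IDs, SHA-256 as the fixed hash, and the `|`-separated encoding are our assumptions, not CoMesh’s exact format.

```python
import hashlib

def chash(smart_id: str, target_id: str) -> int:
    # Fixed, system-wide hash Hash(S, D); any fixed cryptographic hash works.
    return int.from_bytes(
        hashlib.sha256(f"{smart_id}|{target_id}".encode()).digest(), "big")

def k_group(target_id: str, smart_ids: list[str], k: int) -> list[str]:
    # Zero-message selection: any node with a full membership list locally
    # computes the k smart devices with the lowest Hash(S, D).
    return sorted(smart_ids, key=lambda s: chash(s, target_id))[:k]

nodes = [f"node-{i}" for i in range(20)]
group = k_group("thermostat-7", nodes, k=5)
```

Since every node evaluates the same deterministic function over the same membership list, all nodes agree on the k-group without exchanging any messages.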
3.2. Leader Election
The k-group’s goal is to fault-tolerantly replicate the state of its monitored entity. To make its operations efficient, each k-group elects a leader. The alive smart device with the lowest hashed value (Hash(S, D)) is considered to be the leader. While election could be done via a zero-message-exchange mechanism, in order to ensure that the leader is actually alive, CoMesh runs the well-known Bully election algorithm to quickly elect the leader (Garcia-Molina, 1982). The Bully algorithm is attractive as it is fast in the common case: 1 RTT for the lowest-hashed node (leader) to inform others in the k-group. If failures occur, the Bully algorithm takes a worst case of 5 RTTs.
3.3. Quorum Agreement
Fast agreement in a k-group is done via quorums. In a k-group of size k = 2f + 1, anyone in the group (typically the leader) sends a message to all members, and waits for acknowledgments from at least f + 1 members. Thus with up to f failures, at least one surviving k-group member knows about the latest decisions in the k-group, and can communicate them to future leaders.
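The quorum-intersection argument can be checked exhaustively for small k. A sketch, assuming the f + 1 quorum size implied by k = 2f + 1:

```python
from itertools import combinations

def quorum_reached(acks: int, f: int) -> bool:
    # With k = 2f + 1 members, f + 1 acknowledgments suffice: any set of
    # f + 1 ack'ers overlaps the survivors of any f crashes.
    return acks >= f + 1

# Exhaustive check for f = 2 (k = 5): every quorum of f + 1 members
# intersects the survivors of every possible set of f failures.
f, members = 2, range(5)
for quorum in combinations(members, f + 1):
    for failed in combinations(members, f):
        assert set(quorum) - set(failed)  # some quorum member survives
```

This overlap is exactly why a future leader can always recover the latest decisions from at least one non-faulty member.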
3.4. Epoch-based k-group Migration
While Section 3.1’s selection is spatially load-balanced, for temporal load balancing, CoMesh migrates each k-group’s membership periodically. This is done once every epoch, and epoch lengths are fixed system-wide. (Different k-groups can migrate at different times.)
First we extend the hash function for the selection to include the epoch number, making it Hash(epoch, S, D). The epoch number is incremented by 1 on each epoch change. Next, when an epoch expires, the following sequence executes: 1) the new k-group is formed by using Hash(++epoch, S, D); 2) an election is run in the new k-group (using Sec. 3.2); 3) any state maintained by the leader in the old k-group is migrated to the new k-group’s leader; 4) the new k-group’s leader replicates the state information at a quorum of its new k-group members (Sec. 3.3); 5) atomic changeover to the new k-group: the new k-group leader tells the old k-group leader that the old k-group is decommissioned, and the old leader acknowledges; and 6) the old leader tells the old k-group members that they are decommissioned.
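Folding the epoch number into the hash keeps selection zero-message while migrating the group each epoch. A sketch (the string encoding is again our assumption):

```python
import hashlib

def chash_epoch(epoch: int, smart_id: str, target_id: str) -> int:
    # Hash(epoch, S, D): bumping the epoch re-randomizes the whole ranking,
    # so the k-group migrates to a fresh pseudo-random member set.
    key = f"{epoch}|{smart_id}|{target_id}".encode()
    return int.from_bytes(hashlib.sha256(key).digest(), "big")

def k_group_at(epoch: int, target_id: str, smart_ids: list[str], k: int) -> list[str]:
    return sorted(smart_ids, key=lambda s: chash_epoch(epoch, s, target_id))[:k]

ids = [f"node-{i}" for i in range(30)]
# Over several epochs, the monitoring role spreads across many distinct nodes.
served = set().union(*(set(k_group_at(e, "routine-42", ids, k=5)) for e in range(10)))
```

Every node still computes the same group locally; only the epoch counter (kept loosely synchronized) changes the answer.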
When the new k-group leader acknowledges to the old k-group leader in step 5 that it has received all the state, the old k-group leader can delete its old k-group state. After step 6, an old k-group node can delete its old state for that k-group. During steps 1 through 6, no monitoring operations occur in the k-group: old k-group members and the old leader cease all k-group operations at exactly the epoch changeover boundary. In practice the epoch change is fast enough that this gap in monitoring (between steps 1 to 6) is small, as shown by our experimental results. After step 6, all normal k-group operations resume.
3.5. Failure Response: k-group Updates
If a smart node fails, the failure detector layer informs all other nodes of its failure. Because CoMesh’s operations remain correct as long as a k-group retains at least f + 1 alive members, we trigger a k-group re-selection only when f nodes have failed (rather than on each failure)—this keeps the overhead of failure response low.
The only exception is if the failed node was the leader of a k-group. This causes Section 3.2’s election algorithm to be re-run. The new leader needs to reconstruct the state from all the nodes in the k-group. Each piece of state needs to be confirmed by a quorum (f + 1 nodes). This state is correct because all past updates were present at at least f + 1 nodes, so the new leader receives each past update from at least one non-faulty node. If the new leader fails during reconstruction, election is re-run and the process repeats.
3.6. Grouping via Locality-Sensitive Hashing
Because the underlying routing infrastructure is an ad-hoc network, two issues arise. First, the sheer number of devices and routines may lead to a large number of k-groups. Second, a randomly selected k-group will consist of nodes that are spread across the ad-hoc network, thus incurring high message overhead on links.
Limiting the Number of k-groups: We use a clustering algorithm (Krishna and Narasimha Murty, 1999) to partition the devices into clusters. Each cluster is assigned to one k-group, with a random device from the cluster chosen as the “center” (i.e., representative) in order to calculate the locality-sensitive hashing function described below. For routines, we pick a random device each routine touches as its representative, and that device’s cluster is now the routine’s cluster (and thus its k-group).
Locality-Sensitive Hashing: Selecting k-groups randomly (Section 3.1) spreads the k-group members all over the network, slowing down intra-group operations (election, quorum, inter-leader). The challenge is to make k-group selection local while still retaining many of the spatial load-balancing benefits of hashing. To do so, we adapt ideas from locality-sensitive hashing (LSH) (Datar et al., 2004), an idea from high-dimensional data science. To the best of our knowledge, our paper is the first to apply LSH to edge/IoT networks.
Concretely, we modify the k-group member selection to be a two-stage process: select f + 1 nodes in the vicinity of the monitored device/cluster, and an additional f nodes randomly from across the network. This provides for speed in intra-group operations (via nearby members), while still ensuring that a power domain failure (see Section 2) does not compromise the operation of a group (since there will be at most f smart devices in a power domain).
The main idea of LSH (Datar et al., 2004) is to hash similar vectors (in space) into the same “buckets” with high probability. A randomly selected “center” (from the monitored device group) and the smart devices are hashed into buckets, and the majority of the k-group is selected from the smart devices that fall in the same buckets as the center (the “candidates”). If there are more candidates than needed, we choose the ones nearest to the center. Notice that selecting the majority randomly from the candidates usually does not work well, since buckets may also contain nodes that are not close to the center. If there are too few candidates, CoMesh first checks if the center’s neighbors (devices that are close to the center) have any candidates that we can “borrow”, before moving to random selection.
Overlapping of a k-group’s members in consecutive epochs helps with data-transfer operations during epoch change. To achieve this, we append extra entries to each smart device’s location vector, where the i-th appended entry is the average location, on the i-th dimension, of all devices this smart device monitored during the previous epoch. Then we use this extended vector as the input for LSH. Adding some randomness to the vector entries for load-balance purposes is also an option (our default setting). We pad a simple device’s location vector similarly so that all devices have vectors of the same dimension.
The resultant locality-sensitive hash function, adapted from (Datar et al., 2004), is h_{a,b}(v) = ⌊(a · v + b) / r⌋, which maps a d-dimensional vector v onto the set of integers; here a is a random d-dimensional vector whose entries are independently chosen from a 2-stable distribution (a Gaussian distribution in our case) (Zolotarev, 1986; Datar et al., 2004); b is a real number uniformly chosen from [0, r); and r is a parameter. To further reduce the probability that vectors with very different entries fall into the same buckets, m individual hash functions can be concatenated, such that g(v) = (h_1(v), …, h_m(v)). Additionally, we can use l such hashes (g’s) to increase the number of similar vectors in the same buckets. That is, we have g_1(v), …, g_l(v).
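A minimal sketch of this bucket computation follows; the parameter names m, l, r mirror the defaults in Table 3, while the seeding and Gaussian draws are illustrative assumptions.

```python
import random

def make_lsh(dim: int, m: int = 2, l: int = 2, r: float = 4.0, seed: int = 0):
    """Build l concatenated hashes g_j(v) = (h_1(v), ..., h_m(v)), where each
    h_{a,b}(v) = floor((a . v + b) / r), a is Gaussian (2-stable), b ~ U[0, r)."""
    rng = random.Random(seed)
    tables = [[([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
               for _ in range(m)] for _ in range(l)]

    def buckets(v):
        # One bucket id (an m-tuple of floored projections) per table; nodes
        # sharing a bucket with the "center" in any table become candidates.
        return [tuple(int((sum(ai * vi for ai, vi in zip(a, v)) + b) // r)
                      for (a, b) in table) for table in tables]

    return buckets

b = make_lsh(dim=3)
center_buckets = b([1.0, 2.0, 0.0])
```

A smart device becomes a k-group candidate when its vector shares a bucket id with the center’s vector in at least one of the l tables.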
We give some intuition for choosing the values of these LSH parameters. Large m values exclude nodes that are not similar and reduce the number of candidates, while large l and r may increase this number (Fig. 2(a) and Fig. 2(b)). The default setting in our paper is m = 2, l = 2, r = 4. In Fig. 2(c) and Fig. 2(d) we test our default setting for grid topologies with different sizes. We can see that while the average hop count to achieve quorum (quorum distance) increases linearly with the system size for the random selection policy, it stays small for the LSH-based selection policy. More experiments on random and clustered networks can be found in Sec. 6.
# nodes | 50 | 250 | 500 | 750 | 1000 |
---|---|---|---|---|---|
#candidates | 13.0223 | 34.1258 | 50.7501 | 52.8035 | 66.9135 |
Although the k-group overlap based on LSH drops for large system sizes (Fig. 2(d)), it is much larger than under the random policy. We hypothesize that with more nodes, and thus more candidates, the probability of the same device being selected in consecutive epochs grows smaller. This hypothesis is validated by Table 2. One may also adjust the LSH parameters to reduce the candidate count without affecting quorum distance; e.g., one such setting yields an average of 21 candidates and an average quorum distance of 2.4 (vs. 5.3 for the random policy).
4. Routine & Device Management
We now describe how sense-trigger-actuate management of routines and devices builds atop the k-groups of Section 3.
4.1. Device Management
In CoMesh each device is assigned to a unique k-group. This k-group is responsible for sensing actions on the device, i.e., tracking the up/down status of the device, and the latest state of the device (e.g., temperature readings for a thermostat). The leader of the device’s k-group (henceforth called the device leader) periodically pings the device. If the device does not respond within 2 RTTs, the device is presumed failed, and the device leader notifies all smart devices in the topology. If the detected state of the device (received in the ack) differs from its latest recorded state at the device leader, the device leader sends this new state to all the routines which contain the device in their triggering clauses (specifically, sending to the leaders of those routines’ respective k-groups).
4.2. Routine Management
Recall that a routine consists of a trigger clause (arbitrary boolean clause involving multiple devices and their states), and a set of commands to execute when the routine is triggered. The set of devices in the former trigger set (read from) and the latter command set (actuated on) may be different or may overlap.
In CoMesh each routine is assigned a unique k-group, which is responsible for sensing (monitoring), triggering, and actuation (execution) of the routine. The sensing is done in collaboration with device k-groups, as described earlier. A routine may be in one of the following states (maintained at its k-group): (i) NOT_TRIGGERED, (ii) ACQUIRING_LOCKS, (iii) EXECUTING, and (iv) RELEASING_LOCKS. We call the leader node of the routine’s k-group the routine leader. The routine leader connects with the device leaders of each device in the routine’s trigger set. Whenever these device leaders send updated states to the routine leader (Section 4.1), the latter checks if this state change satisfies the trigger clause for the routine.
When a routine is triggered, the routine leader changes its state from NOT_TRIGGERED to ACQUIRING_LOCKS, and replicates this state to all its k-group members. It waits for a quorum of (f + 1) responses, counting itself. When a routine starts, it first acquires (virtual) locks on all devices in the routine’s command set. Then the routine executes its commands. Finally, when the routine is finished, it releases all its locks. This pessimistic concurrency-control approach has been shown to be fast, avoid deadlocks and livelocks, and scale well (Ahsan et al., 2021).
The lock for a given device is maintained only in that device’s k-group (rather than the routine’s k-group)—this allows multiple routines to compete for locks. For a routine to acquire a device’s lock, the routine’s leader communicates with the device’s leader. (If the routine leader is unaware of the current device leader, it computes that device’s k-group members via Section 3.1’s zero-message-exchange protocol, contacts them to learn the leader, and then contacts the device leader.)
The device leader maintains a wait list of routines requesting a lock, and grants only one lock at a time (in FIFO order). When a lock is released by a routine leader (at the RELEASING_LOCKS stage), the device leader dequeues the next entry from the wait list, sends a request to all its device k-group members, and waits for a quorum of acks, before it sends back the lock to the requesting routine leader. When all locks have been acquired, the routine leader changes the routine’s state to EXECUTING, communicates this to all routine k-group members, waits for a quorum of acks, and then starts issuing the commands for the routine. A routine’s command is always routed from the routine leader to the appropriate device leader, which then forwards it to the device. This way the device leader (k-group) always knows the device’s latest status.
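The device leader’s lock discipline can be sketched as a FIFO queue. The class and method names below are illustrative, and the quorum replication to device k-group members is elided:

```python
from collections import deque

class DeviceLock:
    """Sketch of the device leader's per-device lock with a FIFO wait list."""

    def __init__(self):
        self.holder = None          # routine currently holding the lock
        self.wait_list = deque()    # routines waiting, in FIFO order

    def request(self, routine_id):
        if self.holder is None:
            self.holder = routine_id  # free: grant immediately
            return True
        self.wait_list.append(routine_id)
        return False                  # caller waits for a later grant

    def release(self, routine_id):
        assert self.holder == routine_id
        # Dequeue the next waiter in FIFO order (in CoMesh, only after
        # replicating the grant to a quorum of device k-group members).
        self.holder = self.wait_list.popleft() if self.wait_list else None
        return self.holder            # the routine granted next, if any
```

Because all requests for a device funnel through its device leader, competing routines are serialized in arrival order.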
The order in which locks are acquired can affect the speed of triggering. There are two design options—sequential and optimistic.
4.2.1. Sequential Lock-Acquiring (SLA) Strategy
The routine leader acquires locks sequentially. If the locks are acquired in increasing order of device ID, this prevents deadlocks (cycles cannot occur) among multiple routines with conflicting device sets. The SLA strategy can be slow when there are no or few conflicts.
4.2.2. Optimistic Lock-Acquiring (OLA) Strategy
CoMesh’s Optimistic Lock-Acquiring (OLA) attempts to acquire all locks for its (desired) devices simultaneously. If any lock request fails, the routine leader releases all locks it acquired, and retries all again. To prevent message implosion, retries occur after a backoff timeout. OLA is costly when routines’ command sets overlap.
We experimentally compare these two strategies later.
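The two strategies can be contrasted in a toy single-process model. A sketch under our own simplifications: `locks` maps a device ID to its holding routine, and blocking/backoff are elided (SLA fails fast here instead of waiting in the FIFO queue):

```python
def try_lock(locks: dict, routine: str, device: str) -> bool:
    if locks.get(device) in (None, routine):
        locks[device] = routine
        return True
    return False

def sla_acquire(locks: dict, routine: str, devices: list[str]) -> bool:
    # SLA: acquire in increasing device-ID order; a global acquisition order
    # makes wait-for cycles (deadlocks) impossible.
    for d in sorted(devices):
        if not try_lock(locks, routine, d):
            return False
    return True

def ola_acquire(locks: dict, routine: str, devices: list[str]) -> bool:
    # OLA: optimistically request all locks at once; on any failure, release
    # everything acquired and let the caller retry after a backoff timeout.
    got = [d for d in devices if try_lock(locks, routine, d)]
    if len(got) == len(devices):
        return True
    for d in got:
        if locks.get(d) == routine:
            locks[d] = None
    return False
```

With few conflicts OLA acquires everything in one parallel round, while SLA always pays one round per device; with heavy overlap OLA keeps releasing and retrying.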
5. Formal Analysis
We formally analyze CoMesh’s properties. For brevity, we intuitively summarize our findings first. Then we state the formal theorems. The proofs can be found in Appendix A.
• Inheritance: When a k-group changes, the state held by the old leader and the new leader are identical.
• Safety: No two routines that touch an overlapping set of devices are allowed to execute simultaneously.
• Liveness: No routines deadlock.
• Progress: Every routine makes progress.
• We also calculate the (probabilistic) availability of the system when more than f devices fail.
Definition 0.
Inheritance means that given a k-group, the old and new versions of the k-group—before and after epoch change, or after leader failure—maintain identical state.
We now state the key results.
Lemma 0.
[Inheritance] Inheritance is guaranteed after the state-transfer stage when at most one node fails. That is, the state held by an old k-group leader is the same as the state held by the new k-group leader after the state-transfer stage, under failure of at most one node.
Theorem 3.
[Safety] No two routines that conflict in devices (i.e., their touched device sets are not disjoint) can execute simultaneously, when there is at most one node failure. This is true for both the SLA and OLA locking strategies (Sections 4.2.1, 4.2.2).
Theorem 4.
[No Deadlocks] No two routines are stuck in a deadlock.
Theorem 5.
[Progress/Liveness] Consider a set of executing routines, each still waiting to acquire 1 or more locks, and the (union) set of devices they are waiting for. If no further routines arrive into this set, and SLA locking is used, then at least one routine from the set will make progress (i.e., get access to one more device that it desires).
Theorem 6.
[Availability] With more than f simultaneous failures, the system remains available, i.e., all k-groups contain at least f + 1 non-faulty nodes, with a probability that varies with the number of failures (Equations (1) and (2)).
6. Simulation Evaluation
To perform large-scale experiments with hundreds of nodes on CoMesh, as well as to measure its internal behavior, we wrote a custom simulator. Higher-fidelity simulators like NS3 have known scalability issues beyond a couple of hundred nodes. We measure both: i) user-facing metrics, e.g., latency to start routines, load balancing, etc., and ii) microbenchmarks of k-group operations.
Our simulator models network contention by scaling down network bandwidth based on density: specifically, the sending bandwidth from node u to node v is set to the default node bandwidth divided by the number of 1-hop neighbors of u. We set the default node bandwidth (Table 3) conservatively based on bandwidth numbers reported via user complaints on online forums about IoT devices (McNally, 2022; rpi, 2019). Latency on each hop is fixed to 1 time unit, as we assume that transmission radii are small. Ad-hoc routing is based on Dijkstra’s shortest paths.
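This contention model reduces to one line; a sketch, with units following Table 3’s 625 KBps default:

```python
def send_bandwidth_kbps(default_bw_kbps: float, num_1hop_neighbors: int) -> float:
    # A node's sending bandwidth is its default bandwidth divided by its
    # number of 1-hop neighbors: denser neighborhoods contend more.
    return default_bw_kbps / max(1, num_1hop_neighbors)
```

For example, a node with 5 one-hop neighbors sends at 625 / 5 = 125 KBps.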
Parameter | Notation | Default value |
---|---|---|
# all devices | N | 250 |
# failures for completeness | f | 2 |
k-group size | k | 5 |
# smart devices | – | 100 (40% of all devices) |
# seeds | – | 10 |
epoch length | – | 200 |
max routine length | – | 5 |
avg devices per routine | – | 5 |
# devices managed by a k-group | – | 25 |
bandwidth cap | – | 625 KBps (5 Mbps) |
k-group selection policy | – | LSH-random mix |
device cluster policy | – | locality |
leader election policy | – | LSH smallest hash |
default LSH parameters | – | m = 2, l = 2, r = 4 |
We use three different network topologies: 3D grid, random, and clustered. We ensure the graph is connected. CoMesh was configured with topology-aware LSHMix (Sec. 3.6). Table 3 shows default parameters.
6.1. Client Delay
When a routine’s trigger conditions become true, how long does it take for the routine to execute its first command, given there are no other routines active? This is the client delay for a routine: the time it takes for its k-group to detect (and replicate) the trigger, lock all of its touched devices, and start execution. In Fig. 4, we observe that CoMesh’s client delay scales with device count, regardless of topology or locking strategy (SLA = serial, OLA = parallel); the trend is gradually increasing in all plots.
As expected, the parallel locking (OLA) strategy is faster than serial locking (SLA): since client delay excludes conflicting routines, OLA can acquire all locks in parallel. Among the three topologies, grid has higher client delay than random and clustered (which are similar) because the grid topology’s devices are more spread out. That is, while all three topologies have the same overall density, the variance of density across different parts of the network is higher for the random and clustered topologies, causing “clusters” of nodes to form and speeding up communication.
6.2. Churn Effect on Client/Sync Delays

Fig. 5 shows that when 40% of the smart devices in a grid topology are churned (SLA locking), client delay increases by only 25%. The extra delay accounts for the leader election and the state transfer to the new leader after a k-group’s old leader has failed. We define synchronization delay as the time between a routine releasing its last conflicted lock and the next waiting (blocked) routine executing its first command. This experiment was performed with two instances of the same routine triggered back to back (i.e., all devices are conflicting devices). We observe that sync delay also rises only slowly with system size: increasing the number of nodes 15-fold, from 50 to 750, causes only a modest increase in sync delay (and client delay).
One might expect client delay to always exceed sync delay. Yet each excludes operations the other includes: sync delay excludes the triggering of the routine and the associated k-group quorum, while client delay excludes the lock-release latency and its associated quorums.
6.3. Load Distribution
Fig. 6 shows the CDF of load distributions, in a grid topology, for leader nodes, for member nodes, and for all (leader or member) nodes. Temporal load is the amount of time a smart device serves in a particular role over the entire experiment duration. Spatial load is the number of concurrent k-groups that a smart device is a member of.
The top two plots show CoMesh with LSH and the bottom two plots show CoMesh with random member selection. As expected, random member selection is the most load-balanced across nodes and across time. Although LSH clusters responsibilities closer to target devices, leading to load imbalance (top two plots), the median temporal loads (left two plots) for member and leader/member are surprisingly comparable between random and LSH, showing that LSH balances member load almost as well as random selection. LSH member load has a heavier tail due to clustering effects. The same clustering effects lead to repeated leader candidates and increase the median leader load under LSH (top-left plot).
6.4. k-Group Operations



We now measure the internal performance of k-groups, using the default parameters of Table 3. Fig. 7 shows that all k-group operations (election, quorum, and state transfer) scale well as the k-group size is increased from 3 (f = 1) to 11 (f = 5). Quorums are scalable because requests and replies are sent in parallel. Leader election delays are scalable in the common case because of the best case of the Bully algorithm. State-transfer delay is a function of the amount of state, but does not depend on k.
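The quorum arithmetic behind these operations is simple: with k = 2f + 1 members, a majority is f + 1 acknowledgements, gathered from requests sent out in parallel (one round in the common case). A sketch (names are ours):

```java
/** Quorum arithmetic for k-groups of size k = 2f+1: a leader needs only a
 *  majority (f+1) of acknowledgements, so one parallel round suffices in
 *  the common case. Illustrative sketch, not CoMesh's actual code. */
public final class Quorum {
    /** Group size needed to tolerate f failures. */
    static int groupSize(int f) { return 2 * f + 1; }

    /** Majority of a k-member group: floor(k/2) + 1 = f + 1. */
    static int quorumSize(int k) { return k / 2 + 1; }

    /** True once a majority of the k members has acknowledged. */
    static boolean reached(int acks, int k) { return acks >= quorumSize(k); }
}
```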
Fig. 8 confirms our hypothesis that locality-aware mechanisms for k-group entity selection, member selection, and leader election decrease a k-group's operational delays. The state-transfer operation’s delay decreases less than the quorum or leader-election delays since, apart from its quorum, it involves a unicast to the old leader, which is not affected by LSH.
Fig. 9 shows that k-group operations are not affected by churn of the smart devices in the topology. Only when the churn affects about 60% of smart devices do k-group operations get slower. This is because a k-group’s operation only requires a majority of its members. Upon the failure of members in a k-group, the leader ensures that failed members are replaced, so that future failures of up to f smart devices do not affect it either.
7. Deployment Evaluation
We implemented CoMesh in the Raspberry Pi (RPi) 4 environment; our implementation contains 12.5K lines of Java code. Our deployments use the RPi 4 Model B, with 2 GB of LPDDR4 RAM and a Broadcom BCM2711 quad-core CPU of 1.5 GHz Cortex-A72 cores. While Section 6 showed larger-scale simulation results, our goal with the RPis is to benchmark the behavior of CoMesh’s core mechanism, i.e., k-groups, on a small number of real devices.
We deployed 11 RPi 4 devices in our lab, forming different mesh topologies. To create the mesh topology, we attenuate each RPi’s strong signal by consistently wrapping the device in aluminum foil and reducing its transmission power to the lowest setting, 15 dBm (Ahsan et al., 2021). While CoMesh works with any ad-hoc routing protocol, for concreteness we use OLSR routing (OLSR.org, 2021) due to its standardization on Pi 4s.
We configure CoMesh to use, underneath it, the open-source implementation of the Medley (Yang et al., 2022) membership protocol. CoMesh is configured to use k-groups with k = 5, and monitors devices every second. To avoid cold-start biases, we start measurements 10 s after CoMesh is initialized. Each data point involves at least 5 trial runs with different seeds. Epochs change every 60 s. Different routines may be triggered simultaneously. The same routine may be triggered multiple times, but never within 10 s of its own previous invocation. We use SLA locking (Sec. 4.2.1).


We implement two Pi topologies shown in Fig. 10: (i) a 4 × 4 2D Grid-like topology and (ii) an L-shaped Line (modeling the outline of one floor of our office building). Each covers an area of 4 m × 4 m. The Line topology’s diameter is higher than the Grid’s.
Bandwidth Usage—Decentralized vs. Centralized: Fig. 11 shows decentralized CoMesh’s bandwidth per node, both for end-to-end (E2E) messages and for ad-hoc routed (hop-to-hop/H2H) messages. For comparison, we also instantiated CoMesh with a single k-group of one node, i.e., a single centralized device responsible for all routines’ monitoring and management, which is akin to using a central home hub (Sec. 1). We observe that: a) Decentralized CoMesh reduces bandwidth by an order of magnitude (over 10×) compared to the centralized approach. b) Both end-to-end bandwidth and hop bandwidth are scalable, i.e., remain flat, as the number of routines grows. c) Bandwidth of foreground operations (e.g., election, k-group migration, etc.) remains small compared to unavoidable background bandwidth (e.g., device monitoring).


Wait Times to Start a Routine: Fig. 12 shows two metrics for any given routine. First, the lock acquisition time (the average time to acquire a device lock after it becomes available) varies between 3.41 s and 13.02 s. As routine count increases, median latencies remain stable for Grid and rise slightly for Line. Second, the sync delay (the time between the routine’s last device becoming available and the routine starting) stays stable with increasing routine count, with medians between 10.23 s and 20.833 s.
k-Group Operation Latencies: Table 4 shows that CoMesh’s common operations are fast, taking a median of 1.08 s for quorum reply, 1.17 s for leader election, and 2.07 s for state transfer during epoch change.
k-group delay (s) | median | mean | P90
---|---|---|---
Quorum reply | 1.0805 | 1.4021 | 1.4962
Leader election | 1.1675 | 1.4913 | 1.9249
State transfer | 2.0750 | 2.8468 | 3.1390
8. Related Work
Consistent Hashing & Committees: Peer-to-peer (p2p) networked systems such as Chord (Stoica et al., 2001), Dynamo (DeCandia et al., 2007), and Cassandra (Lakshman and Malik, 2010) employ consistent hashing (Karger et al., 1997) to find the node storing a key-value pair. CoMesh’s k-group mechanism is inspired by the consistent subgroups used by the election protocol in (Gupta et al., 2000), but that mechanism is neither dynamic, nor intended for IoT settings, nor locality-aware.
Consensus: A variety of agreement protocols for wireless environments have been proposed, such as Wireless Paxos (Poirot et al., 2019) and WSN Byzantine consensus (Xu et al., 2021). Unlike these consensus algorithms, which require many message-exchange rounds, CoMesh’s k-groups are efficient and fast in the common case, needing just one communication round for a quorum.
IoT Application Management: IoT routine and application management systems have been explored widely, but with a centralized control plane, e.g., HomeOS (Dixon et al., 2012), Beam (Shen et al., 2016), DepSys (Munir and Stankovic, 2014), eHome (Retkowitz and Kulle, 2009), and SafeHome (Ahsan et al., 2021). Rivulet (Ardekani et al., 2017) is distributed and uses smart devices to exchange messages and back up state to shadow nodes. However, unlike CoMesh, Rivulet focuses on fault tolerance and omits routine management. Other works on IoT fault tolerance (Su et al., 2014; Javed et al., 2018; Norris et al., 2020) also do not manage routine execution. To the best of our knowledge, CoMesh is the first fully-decentralized IoT routine management system.
9. Summary
We have presented the CoMesh system, which manages routines in a smart space (building, home, or campus) in a fully-decentralized way, without relying on any central components. Analysis of CoMesh’s techniques shows that it achieves continuity (inheritance), safety, liveness, and progress. Our implementation and simulation of CoMesh under various topologies show that it scales well as the number of devices grows into the hundreds, and with larger values of f (the number of tolerated failures). CoMesh also balances load effectively, both over smart devices and over time. CoMesh opens the door to realizing self-managing versions of today’s smart space systems. Future directions include building central-distributed hybrid management systems, accounting for heterogeneity of smart devices, and energy considerations.
References
- rpi (2019) 2019. Slow upload on new Raspberry Pi 4 2GB. https://forums.raspberrypi.com/viewtopic.php?t=244832
- Aarti and Vineet (2022) G. Aarti and K. Vineet. 2022. IoT in Aerospace & Defense Market by Component (Hardware, Software, Services), by Deployment Mode (On-Premise, Cloud), by Connectivity Technology (Cellular, Wi-Fi, Satellite Communication, Radio Frequency), by Application (Fleet Management, Inventory Management, Equipment Maintenance, Security, Others): Global Opportunity Analysis and Industry Forecast, 2020-2030. https://www.alliedmarketresearch.com/internet-of-things-in-aerospace-and-defense-market.
- Ahsan et al. (2021) S. B. Ahsan, R. Yang, S. A. Noghabi, and I. Gupta. 2021. Home, SafeHome: smart home reliability with visibility and atomicity. In EuroSys.
- Akyildiz and Wang (2005) I. F. Akyildiz and X. Wang. 2005. A survey on wireless mesh networks. IEEE Commun. Mag. 43, 9 (2005).
- Amazon Alexa (2014) Amazon Alexa 2014. Amazon Alexa. https://developer.amazon.com/alexa.
- Apecechea (2019) G. I. Apecechea. 2019. Cryptographic Sortition in Blockchains: the importance of VRFs. https://medium.com/witnet/cryptographic-sortition-in-blockchains-the-importance-of-vrfs-ad5c20a4e018
- Ardekani et al. (2017) M. Ardekani, R. Singh, N. Agrawal, D. Terry, and R. Suminto. 2017. Rivulet: A Fault-Tolerant Platform for Smart-Home Applications. In Middleware.
- AWS (2022) AWS. 2022. AWS IoT Core additional metering details. https://aws.amazon.com/iot-core/pricing/additional-details/
- Chandra and Collis (2021) R. Chandra and S. Collis. 2021. Digital Agriculture for Small-Scale Producers: Challenges and Opportunities. CACM 64, 12 (2021).
- Datar et al. (2004) M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. 2004. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG.
- DeCandia et al. (2007) G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. 2007. Dynamo: Amazon’s Highly Available Key-Value Store. In SOSP.
- Dijkstra (1959) E. W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1, 1 (1959).
- Dixon et al. (2012) C. Dixon, R. Mahajan, S. Agarwal, A. J. Brush, B. Lee, S. Saroiu, and P. Bahl. 2012. An operating system for the home. In NSDI.
- Fatima et al. (2022) Z. Fatima, M. H. Tanveer, S. Zardari, L. F. Naz, H. Khadim, N. Ahmed, and M. Tahir. 2022. Production plant and warehouse automation with IoT and industry 5.0. Applied Sciences 12, 4 (2022).
- Feng et al. (2020) Y. Feng, M. Li, C. Zeng, and H. Liu. 2020. Robustness of internet of battlefield things (iobt): A directed network perspective. Entropy 22, 10 (2020).
- Fox et al. (2019) M. A. Fox, J. L. Breese, and G. Vaidyanathan. 2019. Live Music Performances and the Internet of Things.
- Future (2022) Market Research Future. 2022. Smart Stadium Market Estimated to Hit 24.3 Billion with a CAGR of 23% during 2021 - 2030. https://www.globenewswire.com/news-release/2022/06/01/2453900/0/en/Smart-Stadium-Market-Estimated-to-Hit-24-3-Billion-with-a-CAGR-of-23-during-2021-2030-Report-by-Market-Research-Future-MRFR.html.
- Gan and Lee (2018) H. Gan and W.S. Lee. 2018. Development of a Navigation System for a Smart Farm. IFAC-PapersOnLine 51, 17 (2018).
- Garcia-Molina (1982) H. Garcia-Molina. 1982. Elections in a Distributed Computing System. IEEE TOC C-31, 1 (1982).
- Google Home (2016) Google Home 2016. Google Home. https://store.google.com/us/product/google_home.
- Grothaus (2019) M. Grothaus. 2019. That major Google outage meant some Nest users couldn’t unlock doors or use the AC. https://www.fastcompany.com/90358396/that-major-google-outage-meant-some-nest-users-couldnt-unlock-doors-or-use-the-ac
- Guo et al. (2022) X. Guo, Lo. Shangguan, Y. He, N. Jing, J. Zhang, H. Jiang, and Y. Liu. 2022. Saiyan: Design and Implementation of a Low-power Demodulator for LoRa Backscatter Systems. In NSDI.
- Gupta et al. (2000) I. Gupta, R. Van Renesse, and K. P Birman. 2000. A probabilistically correct leader election protocol for large groups. In DISC.
- Hunt et al. (2010) P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. 2010. ZooKeeper: Wait-free Coordination for Internet-scale Systems.. In ATC.
- ISO Central Secretary (2016) ISO Central Secretary. 2016. Systems and software engineering – Lifecycle profiles for Very Small Entities (VSEs) – Part 1: Overview. Standard ISO/IEC TR 29110-1:2016. International Organization for Standardization, Geneva, CH. https://www.iso.org/standard/62711.html
- J. (2021) Bojan J. 2021. Internet of Things statistics for 2021 – Taking Things Apart. https://dataprot.net/statistics/iot-statistics/.
- Javed et al. (2018) A. Javed, K. Heljanko, A. Buda, and K. Främling. 2018. CEFIoT: A fault-tolerant IoT architecture for edge and cloud. In WF-IoT.
- Johnson and Maltz (1996) D. B. Johnson and D. A. Maltz. 1996. Dynamic source routing in ad hoc wireless networks. In Mobile computing.
- K. et al. (2021) Mohamad K., Anthony W., and Vamsi T. 2021. Simplifying Backscatter Deployment: Full-Duplex LoRa Backscatter. In NSDI.
- Karger et al. (1997) David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. 1997. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In STOC. 654–663.
- Krishna and Narasimha Murty (1999) K. Krishna and M. Narasimha Murty. 1999. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 3 (1999), 433–439. https://doi.org/10.1109/3477.764879
- Lakshman and Malik (2010) A. Lakshman and P. Malik. 2010. Cassandra: a decentralized structured storage system. IGOPS 44, 2 (2010).
- Lampropoulos et al. (2019) G. Lampropoulos, Siakas K., and T. Anastasiadis. 2019. Internet of things in the context of industry 4.0: An overview. IJEK (2019).
- Li et al. (2022) Chenning Li, Xiuzhen Guo, Longfei Shangguan, Zhichao Cao, and Kyle Jamieson. 2022. CurvingLoRa to Boost LoRa Network Throughput via Concurrent Transmission. In NSDI.
- Lin et al. (2016) K. Lin, W. Wang, Y. Bi, M. Qiu, and M. M. Hassan. 2016. Human localization based on inertial sensors and fingerprints in the Industrial Internet of Things. Elsevier COMNET 101 (2016).
- Liu et al. (2022) D. Liu, T. F. Abdelzaher, T. Wang, Y. Hu, J. Li, S. Liu, M. Caesar, D. Kalasapura, J. Bhattacharyya, N. Srour, J. Kim, G. Wang, G. Kimberly, and S. Yao. 2022. IoBT-OS: Optimizing the Sensing-to-Decision Loop for the Internet of Battlefield Things. In ICCCN.
- McNally (2022) C. McNally. 2022. Google Nest Wi-Fi Review 2022. https://www.reviews.org/internet-service/google-nest-wifi-review/
- Memoori (2022) Memoori. 2022. The Global Market for the Internet of Things in Smart Commercial Buildings. https://memoori.com/portfolio/the-internet-of-things-in-smart-commercial-buildings-2022-to-2027/.
- Micali et al. (1999) S. Micali, M. Rabin, and S. Vadhan. 1999. Verifiable random functions. In FOCS.
- Moore et al. (2020) S. J. Moore, C. D. Nugent, S. Zhang, and I. Cleland. 2020. IoT reliability: a review leading to 5 key research directions. CCF TPCI 2 (2020).
- Munir and Stankovic (2014) S. Munir and J. A. Stankovic. 2014. Depsys: Dependency aware integration of cyber-physical systems for smart homes. In ICCPS.
- Norris et al. (2020) M. Norris, B. Celik, P. Venkatesh, S. Zhao, P. McDaniel, A. Sivasubramaniam, and G. Tan. 2020. IoTRepair: Systematically Addressing Device Faults in Commodity IoT. In IoTDI.
- OLSR.org (2021) OLSR.org. 2021. Optimized Link State Routing Protocol. https://tinyurl.com/olsrd-wiki.
- Perkins et al. (2003) C. Perkins, E. Belding-Royer, and S. Das. 2003. RFC3561: Ad hoc on-demand distance vector (AODV) routing.
- Poirot et al. (2019) V. Poirot, B. Al Nahas, and O. Landsiedel. 2019. Paxos Made Wireless: Consensus in the Air.. In EWSN.
- Research (2022) Emergen Research. 2022. Internet of Things In Agriculture Market, By System (Automation and Control Systems, Sensing and Monitoring Devices, Livestock Monitoring Hardware, Fish Farming Hardware), By Application, and By Region Forecast to 2030.
- Retkowitz and Kulle (2009) D. Retkowitz and S. Kulle. 2009. Dependency management in smart homes. In DAIS.
- Ryu et al. (2015) M. Ryu, J. Yun, T. Miao, I.-Y. Ahn, S.-C. Choi, and J. Kim. 2015. Design and implementation of a connected farm for smart farming system. In Sensors. IEEE.
- Samsung Smart Things (2012) Samsung Smart Things 2012. Samsung SmartThings. https://www.smartthings.com/.
- Shahid et al. (2021) H. Shahid, M. A. Shah, A. Almogren, H. A. Khattak, I. U. Din, N. Kumar, and C. Maple. 2021. Machine Learning-Based Mist Computing Enabled Internet of Battlefield Things. ACM TOIT 21, 4 (2021).
- Shen et al. (2016) C. Shen, R. P. Singh, A. Phanishayee, A. Kansal, and R. Mahajan. 2016. Beam: Ending monolithic applications for connected devices. In ATC. 143–157.
- Sivakumar et al. (2021) Arun Narenthiran Sivakumar, Sahil Modi, Mateus Valverde Gasparino, Che Ellis, Andres Eduardo Baquero Velasquez, Girish Chowdhary, and Saurabh Gupta. 2021. Learned Visual Navigation for Under-Canopy Agricultural Robots. In Robotics: Science and Systems.
- Song et al. (2008) J. Song, S. Han, A. Mok, D. Chen, M. Lucas, M. Nixon, and W. Pratt. 2008. WirelessHART: Applying wireless technology in real-time industrial process control. In RTAS.
- Stoica et al. (2001) I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. 2001. Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 31, 4 (2001).
- Su et al. (2014) P. H Su, C.-S. Shih, J. Y.-J. Hsu, K.-J. Lin, and Y.-C. Wang. 2014. Decentralized fault tolerance mechanism for intelligent IoT/M2M middleware. In WF-IoT.
- Turchet et al. (2018) L. Turchet, C. Fischione, G. Essl, D. Keller, and M. Barthet. 2018. Internet of musical things: Vision and challenges. IEEE access 6 (2018).
- Vasisht et al. (2017) D. Vasisht, Z. Kapetanovic, J. Won, X. Jin, R. Chandra, S. Sinha, A. Kapoor, M. Sudarshan, and S. Stratman. 2017. FarmBeats: An IoT Platform for Data-Driven Agriculture. In NSDI.
- Wang et al. (2022) L. Wang, A. Hamad, and V. Sakthivel. 2022. IoT assisted machine learning model for warehouse management. Journal of Interconnection Networks 22, Supp02 (2022).
- Xu et al. (2021) Mi. Xu, C. Liu, Y. Zou, F. Zhao, J. Yu, and X. Cheng. 2021. wChain: A Fast Fault-Tolerant Blockchain Protocol for Multihop Wireless Networks. TWC (2021).
- Yang et al. (2022) R. Yang, J. Wang, J. Hu, S. Zhu, Y. Li, and I. Gupta. 2022. Medley: A Membership Service for IoT Networks. IEEE TNSM 19, 3 (2022).
- Yang et al. (2005) Y. Yang, J. Wang, and R. H. Kravets. 2005. Designing routing metrics for mesh networks.
- Zolotarev (1986) V. M. Zolotarev. 1986. One-dimensional stable distributions. Translations of Mathematical Monographs, Vol. 65. 277 pages.
Appendix A Appendix: Formal Analysis
We formally analyze CoMesh’s properties. For the reader who wishes to skip this section, we summarize our findings:
- Inheritance: When a k-group changes, the state held by the old leader and the new leader are identical.
- Safety: No two routines that touch an overlapping set of devices are allowed to execute simultaneously.
- Liveness: No routines deadlock.
- Progress: Every routine makes progress.

We also calculate the (probabilistic) availability of the system when more than f devices fail.
A.1. Inheritance
Definition 1.
Inheritance means that, given a k-group, the old and new versions of the k-group (before and after an epoch change, or after a leader failure) maintain identical state.
For a device k-group, the state that must be transferred to the new leader includes: the device’s current availability and the lock queue (if applicable). For a routine k-group, the new leader should receive its state, which includes: the routine stage, the IDs of devices that have already been locked (if applicable at the time), and the IDs of devices that have already been released (if applicable at the time).
Lemma 1.
Inheritance is guaranteed after the state transfer stage when at most one node fails. That is, the state held by an old k-group leader is the same as the state held by the new k-group leader after the state transfer stage, under failure of at most one node.
Proof.
If new requests arrive during the state transfer stage, their processing is delayed until the state transfer stage finishes. The state transfer stage happens in two cases: leader failure and/or epoch change.
Case 1: When there is no epoch change, upon the failure of a k-group leader (call it L_old), the new leader L_new sends a message to all k-group members requesting their local k-group state. All the members reply with their local state to L_new. L_new includes a state entry e in the recreated state if and only if at least one received state contains e. More specifically, for the device k-groups, which may include a lock queue, every routine request in the lock queue carries a sequence number denoting when it reached the device’s k-group leader. L_new includes a routine entry q in the recreated lock queue if and only if at least one received queue contains q. Since each routine entry in the lock queue was replicated by at least a quorum of k-group members before L_old’s failure, L_new is guaranteed to receive the exact same set of routine requests. L_new sorts the queue based on the gathered requests’ sequence numbers. Thus, the recreated lock queue is exactly the same as L_old’s latest lock queue, which also guarantees that L_new’s recreated state is identical to L_old’s latest state.
Case 2: Upon epoch change, the system enters the state transfer stage: the new k-group’s leader L_new requests the state of the old k-group’s leader L_old, to which L_old replies accordingly. There are four failure possibilities during the state transfer stage: a) the old leader fails, b) the new leader fails, c) a non-leader member of the old k-group fails, d) a non-leader member of the new k-group fails.
Case 2a: If the old leader L_old fails after the state transfer starts but before it completes, the new leader L_new sends a request for local states to all of the old k-group’s members and recreates L_old’s lock queue as in Case 1.
Case 2b: If the new leader L_new fails before it receives the state from the old leader L_old, another new leader is elected and sends a new state request to L_old, which replies again with the k-group state.
Case 2c: If a non-leader member of the old k-group fails, the state transfer is not affected and the new k-group leader receives the correct state.
Case 2d: If a non-leader member of the new k-group fails, another member is recruited to take its place. The new k-group leader distributes the state to the new member. ∎
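The lock-queue reconstruction used throughout this proof can be sketched as follows (record and method names are ours): the new leader keeps an entry iff at least one surviving member reports it, then orders the entries by the sequence numbers assigned by the old leader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the new leader's queue reconstruction after a leader failure:
 *  union the surviving members' replicated queues (keep an entry iff some
 *  member holds it), ordered by the old leader's sequence numbers. */
public final class QueueReconstruction {
    record Request(long seq, String routineId) {}

    static List<Request> rebuild(List<List<Request>> memberQueues) {
        // TreeMap keys on the sequence number, so iteration order is the
        // exact order in which requests reached the old leader.
        Map<Long, Request> union = new TreeMap<>();
        for (List<Request> queue : memberQueues)
            for (Request req : queue)
                union.putIfAbsent(req.seq(), req); // include iff some member reports it
        return new ArrayList<>(union.values());
    }
}
```

Because every entry was replicated at a quorum before the failure, any majority of surviving members collectively holds all entries, so the rebuilt queue equals the old leader's.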
Lemma 2.
Inheritance can be guaranteed after the state transfer stage when at most f nodes fail.
Proof.
If there are no epoch changes (Lemma 1, Case 1), or failures happen only among non-leader members during the epoch change (Lemma 1, Cases 2c, 2d), each routine entry is maintained at at least one alive member. Thus, the state can be recreated correctly as in Lemma 1, Case 1.
Upon epoch change, if the old leader L_old fails before the new leader L_new gets the state (Lemma 1, Case 2a), L_new can reconstruct the same state as L_old, since each state entry is maintained at at least one alive old k-group member. If the new leader keeps failing before it gets or reconstructs the state, the latest new leader can reconstruct it similarly. ∎
A.2. Safety
Theorem 1.
[Safety] No two routines that conflict in devices (i.e., their touched device sets are not disjoint) can execute simultaneously, under at most one node failure.
Proof.
Assume routines R1 and R2 share devices. We analyze SLA and OLA separately.
Sequential Lock-Acquiring Strategy (SLA): First, we analyze the sequential lock-acquiring strategy (Section 4.2.1). Let d be the shared device with the minimum identifier in the global lock order, with its k-group leader denoted L_d. If L_d does not fail, by definition at most one routine can reach quorum and therefore acquire the lock on the device.
No new requests are handled during a leader transition; such requests are put into the new leader’s wait queue. The request/quorum to add such requests to the lock queue is initiated only after the leader transition completes. Therefore, the rest of the proof handles only outstanding requests that have been approved.
If the leader L_d fails, several cases arise:
Case 1: If L_d fails before it has replicated the request from R1 to a quorum of k-group nodes, there is no further operation; R1’s k-group leader resends R1’s request after a timeout T of not receiving an acknowledgement.
Case 2: If L_d fails after it has replicated R1’s request to a quorum of k-group members, the new leader adds R1 to the lock queue during the reconstruction phase, when it queries (a quorum of) k-group members for their pieces of the state.
Case 3: If L_d fails after R1’s request reaches the top of the lock queue, two sub-cases occur: a) If the request has been approved by a quorum of nodes (i.e., it has been forwarded to the nodes, and a quorum has approved R1 to acquire the lock), the new leader marks R1 as the approved routine and notifies R1’s k-group leader. b) If the request has not yet reached quorum (to approve the routine for lock acquisition), then the new leader adds R1 to the lock queue and continues its normal processing.
Case 4: If L_d fails after it has sent a Locked message to R1’s k-group, the new leader: 1) reconstitutes the lock queue from the surviving nodes, and 2) (re)notifies R1’s k-group leader of its approval.
Optimistic Lock-Acquiring Strategy (OLA): For each device d that is touched by both R1 and R2, only one of the two routines’ lock requests is processed first by d’s k-group leader L_d and reaches quorum for the device pre-lock, as long as L_d does not fail. By definition, only this routine goes on to lock the device if all of its other desired devices are also available and none of those devices’ k-group leaders fail.
If a leader L_d does fail, several cases arise:
Case 1: If L_d fails before it gets a quorum of acknowledgements for the pre-lock replication, the new leader has no record of it; R1’s k-group leader resends the request after a timeout.
Case 2: If L_d fails after it has replicated R1’s pre-lock request, but before it hears back from R1’s k-group leader for further action (either locking for execution or releasing due to a failed pre-lock), the new leader learns that R1 has pre-locked d from its group members and resends the pre-lock information back to R1’s k-group leader.
Case 3: If L_d fails after it hears from R1’s k-group leader about lock acquisition or release, but before it gets quorum acknowledgement for the replication, R1’s k-group leader resends the request after a timeout, and the locking process continues.
Case 4: If L_d fails after it has locked d and notified R1’s k-group, the new leader: 1) reconstitutes the state from the surviving nodes, and 2) (re)notifies R1’s k-group leader of its approval. Therefore, the OLA strategy is safe as well. Thus, CoMesh is safe under at most one node failure. ∎
A.3. Liveness
Theorem 2.
[No Deadlocks] No two routines are stuck in a deadlock.
Proof.
SLA and OLA each violate a separate necessary condition for deadlock. Under SLA, routines wait only for devices ordered higher than those they have already locked; no routine ever waits for a lower-ordered device than one it holds, so there is no “wait-for” cycle among routines. Under OLA, routines do not wait for already-locked devices at all: if a device is already (pre-)locked, the new routine aborts and tries to acquire its touched devices’ locks at a later time. Without “hold and wait”, there is no deadlock.
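The SLA argument hinges on all routines requesting locks in a single global device order. A minimal sketch (assuming, purely for illustration, that the order is ascending device id; names are ours):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

/** Sketch of SLA's deadlock freedom: every routine requests its touched
 *  devices' locks in one global order, so a routine never waits on a device
 *  ordered lower than one it already holds, and no wait-for cycle can form. */
public final class SlaOrdering {
    /** The lock-request order SLA would use for a routine's touched-device set. */
    static List<Integer> lockOrder(Collection<Integer> touchedDevices) {
        List<Integer> order = new ArrayList<>(touchedDevices);
        Collections.sort(order); // one global order shared by all routines
        return order;
    }
}
```

Because every routine sorts the same way, any two routines sharing devices contend first on the shared device with the minimum identifier, which is exactly the device d used in the safety proof above.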
Finally, we prove that CoMesh always makes progress. We prove the induction step via the following theorem, which can be applied iteratively.
Theorem 3.
[Progress/Liveness] Consider a set R of executing routines that are each still waiting to acquire one or more locks, and the set D of devices they are waiting for. If no further routines arrive into R, and SLA is used for locking, then at least one routine in R will make progress (i.e., get access to one or more devices that it desires).
Proof.
Consider the highest-ordered device d in set D, and the routine R_h that currently holds the lock on d. Then R_h cannot be waiting for any further locks (otherwise d would not be the highest-ordered device in D), and thus R_h will eventually complete. At that point, the routine at the head of d’s queue will be granted access to d and thus will make progress. ∎
For OLA, an adversary may always be able to prevent two competing routines from making progress. However, in practice, we observed that routines make fast progress in OLA. Nevertheless, because of this, we use SLA as the default in the CoMesh implementation.
A.4. System Availability
While CoMesh can tolerate up to f failures, what happens if there are more than f failures? With F simultaneous failures, we calculate the availability, i.e., the probability that CoMesh still works correctly: every k-group contains at least k − f non-faulty nodes. Assume we have n smart devices and g k-groups, each with k members.
When $F \le f$, every k-group necessarily retains at least $k - f$ non-faulty members, so the availability is

$$A = 1. \qquad (1)$$

When $F > f$, the availability is governed by the probability $p$ that a single k-group contains at least $k - f$ non-faulty nodes:

$$p \;=\; \frac{\sum_{j=k-f}^{k} \binom{n-F}{j}\binom{F}{k-j}}{\binom{n}{k}}, \qquad (2)$$

and, treating the $g$ k-groups as independent, the system availability is $A \approx p^{g}$.

The equation for one k-group is explained as follows. Given $F$ failures, the numerator is the number of ways to select a k-group containing $j$ non-faulty nodes (left term) and $k - j$ faulty nodes (right term), while the denominator is the number of all possible selections for a k-group. $j$ varies from $k$ down to $k - f$ (or down to a higher bound if there are insufficient alive nodes).
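This availability expression can be evaluated numerically as follows (class and method names are ours); binomial terms with more non-faulty members than actually survive vanish, matching the caveat about insufficient alive nodes:

```java
import java.math.BigInteger;

/** Numeric sketch of the availability formula: with n smart devices, F of
 *  which have failed, a uniformly chosen k-group survives iff it contains
 *  at least k - f non-faulty nodes; g independent groups give p^g. */
public final class Availability {
    /** Binomial coefficient C(n, r); zero when r is out of range. */
    static BigInteger choose(int n, int r) {
        if (r < 0 || r > n) return BigInteger.ZERO;
        BigInteger c = BigInteger.ONE;
        for (int i = 0; i < r; i++)
            c = c.multiply(BigInteger.valueOf(n - i)).divide(BigInteger.valueOf(i + 1));
        return c;
    }

    /** P(one k-group has >= k - f non-faulty members) given F failed of n nodes. */
    static double oneGroup(int n, int k, int f, int F) {
        double num = 0;
        for (int j = k - f; j <= k; j++) // j = # non-faulty members in the group
            num += choose(n - F, j).multiply(choose(F, k - j)).doubleValue();
        return num / choose(n, k).doubleValue();
    }

    /** System availability, approximating the g groups as independent. */
    static double system(int n, int k, int f, int F, int g) {
        return Math.pow(oneGroup(n, k, f, F), g);
    }
}
```

With F ≤ f every possible k-group necessarily retains k − f non-faulty members, and the expression evaluates to 1, matching Equation (1).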
From Fig. 3(a), we observe that the availability of CoMesh drops when the number of simultaneous failures exceeds f. However, CoMesh still provides 50% availability even when simultaneous failures grow to 9. Fig. 3(b) shows that, to reach a given availability percentile, CoMesh’s tolerated number of failures drops very slowly as the number of groups increases, indicating scalability with the number of groups.