Delay-Optimal Distributed Edge Computing in Wireless Edge Networks
Abstract
By integrating edge computing with parallel computing, distributed edge computing (DEC) makes use of distributed devices in edge networks to perform computing in parallel, which can substantially reduce service delays. In this paper, we explore DEC that exploits distributed edge devices connected by a wireless network to perform a computation task offloaded from an end device. In particular, we study the fundamental problem of minimizing the delay of executing a distributed algorithm of the computation task. We first establish some structural properties of the optimal communication scheduling policy. Then, given these properties, we characterize the optimal computation allocation policy, which can be found by an efficient algorithm. Next, based on the optimal computation allocation, we characterize the optimal scheduling order of communications for some special cases, and develop an efficient algorithm with a finite approximation ratio to find it for the general case. Last, based on the optimal computation allocation and communication scheduling, we further show that the optimal selection of devices can be found efficiently for some special cases. Our results provide some useful insights for the optimal computation-communication co-design. We evaluate the performance of the theoretical findings using simulations.
I Introduction
Edge computing has recently emerged as a promising paradigm that performs a substantial amount of computing, storage, networking, and management functions on devices at or close to end users (referred to as "edge devices"). This trend is largely enabled by the proliferation of smart devices with powerful computing capabilities, which are often not fully utilized. Compared to cloud computing, which performs computing in remote data centers, edge computing can offload a large amount of data traffic from the core network to the edge network, which can greatly reduce the communication delay incurred in the network. One main driving force behind the popularity of edge computing is the emergence of many AI applications that require very low service delays, including augmented reality (AR) [1], virtual reality (VR) [2], and autonomous vehicles [3]. These applications are empowered by recent advances in machine learning (ML), which typically rely on computationally intensive processing of large amounts of data.
On the other hand, distributed computing is a traditional computing paradigm that uses distributed devices to perform computing cooperatively in parallel. One main advantage of this approach is that it can greatly reduce computation delay by parallelizing the algorithm of computation and distributing the computation workload to different devices, rather than performing all the computation on a single device. One well-known example of distributed computing is cloud computing in data centers, which utilizes large clusters of interconnected computer servers for parallel computing.
To exploit the potential of edge computing, it is promising to leverage distributed edge computing (DEC), which makes use of distributed edge devices to perform computing cooperatively in parallel. This approach is enabled by widely available edge devices with under-utilized computing capacities that can be connected by wireless networks in a distributed manner. To accelerate many emerging applications that require very low delays, it is beneficial to offload and distribute a large computation workload from a single end device to possibly multiple edge devices nearby. One main advantage of this approach is that performing the computation in parallel reduces the computation delay. If a computation is offloaded from one device and distributed to n devices with the same computing power, and if communication delays are not counted (which certainly should and will be considered), then the computation delay is reduced by a factor of n, which is very appealing. Another advantage of this approach is the ability to leverage edge devices with higher computing power to reduce the computation delay. This will happen more often when dedicated and powerful edge computer servers are deployed as infrastructure in the future.
In this paper, we study DEC that exploits wirelessly connected edge devices to perform distributed computing. Our goal is to minimize the delay of executing a distributed algorithm. To this end, we will study the optimal allocation of computation workloads of the distributed algorithm (referred to as “computation allocation”) to devices. We will also investigate the optimal scheduling of communications between the devices in the wireless network. We will further study the optimal selection of devices for executing the distributed algorithm. We will explore fundamental issues in the cross-layer design, analysis, and optimization of DEC in wireless networks.
The computation allocation and communication scheduling for DEC are significantly different from prior studies. These two problems have non-trivial coupling, as the optimal solution to each problem depends on the solution to the other. Furthermore, compared to existing works on distributed computing, the computation allocation here needs to take into account the features of wireless networks, including interference among wireless links and diverse data rates of wireless links. Moreover, the objective of communication scheduling here is to minimize the delay of executing a distributed algorithm, which is quite different from existing studies on wireless network scheduling.
The main contributions of this paper can be summarized as follows.
• We propose a framework of distributed computing with a parallel algorithm structure using edge devices connected by a wireless network. Based on this framework, we formulate the problem of allocating computation workloads to devices and scheduling communications between devices for minimizing the delay of executing the distributed algorithm.
• We first establish some structural properties of the optimal communication scheduling, which show that it is optimal to be non-preemptive, be non-idle, and schedule forward communications before backward communications. Then, given these properties, we characterize the optimal computation allocation by developing an efficient algorithm to find it. Next, based on the optimal computation allocation, we characterize the optimal scheduling order of communications for the cases with uniform communication delays or uniform computation rates. We also develop an efficient algorithm with a finite approximation ratio that finds an approximately optimal scheduling order for the general case with diverse communication and computation rates. Based on the optimal computation allocation and communication scheduling, we further show that the optimal selection of devices can be found by an efficient linear search for the cases with uniform communication delays or computation rates. Our results provide some useful insights for the optimal computation-communication co-design.
• We evaluate the performance of the optimal policies using simulations. The simulation results demonstrate that the optimal policies outperform non-optimal policies, and are more advantageous when communication delays and/or computation rates are more diverse.
The rest of this paper is organized as follows. Section II reviews related work. In Section III, we describe a framework of distributed edge computing in wireless networks. Section IV focuses on the optimal computation allocation and communication scheduling. Simulation results are discussed in Section V. Section VI concludes this paper and discusses future work.
II Related Work
Edge computing. Edge computing has attracted growing research interest in the past few years [4]. Many works have used edge devices for video applications that require low service delays, such as video rendering [5] and virtual reality [6, 7]. One important application studied in these works is real-time video inference [8, 9, 10]. Computation offloading from mobile devices to edge servers has also been studied [11, 12, 13, 14]. Another major research direction of edge computing is edge caching [15]. Learning the contents that users are interested in for caching has also been studied [16], as have cooperative networks of caches [17, 18]. However, existing works on edge computing have not considered offloading and distributing computation to more than one device with the goal of reducing the computation delay.
Distributed computing. There have been many studies on the design of distributed algorithms and computation allocation for reducing computation delays [19, 20, 21, 22, 23, 24]. Some of these works have studied the effects of the network on computing [25, 26, 27]. Other works have considered the throughput of networked computers for processing computations [28]. Recent studies have considered the cross-layer design of distributed computing and networking for reducing computation delay [29] or improving throughput [30, 31]. On the other hand, many works have studied communication scheduling in data center networks [32]. A large body of this work has focused on the scheduling of co-flows under distributed computing frameworks, in particular MapReduce [33, 34, 35]. However, most existing works on distributed computing have not considered offloading and distributing computation to more than one device with the goal of reducing the computation delay. Moreover, many studies have considered wired networks of devices, which do not take into account the features of wireless networks, such as interference among wireless links.
Wireless network scheduling. Wireless network scheduling has been studied extensively for more than a decade. Most works have focused on the throughput of wireless networks [36], including recent works on deadline-constrained throughput [37] and distributed scheduling [38]. Many other works have considered the total utility of data flows in the network [39], which depends on the throughput. Far fewer works have studied the delay performance of wireless network scheduling [40]. On the other hand, some works have studied the cross-layer design of scheduling, routing, and/or congestion control for various objectives, including throughput [41, 42], delay [43], and utility [44]. However, existing works on wireless network scheduling have not considered using wirelessly connected distributed devices to perform computation cooperatively with the goal of reducing the computation delay.
III System Model and Problem Formulation
In this section, we first describe a framework of distributed computing with a parallel algorithm structure using wirelessly connected edge devices. Based on this framework, we then formulate the problem of minimizing the delay of executing the distributed algorithm.
Distributed algorithm. Our goal is to execute an algorithm which consists of some computations. The algorithm can be executed by a single device in a centralized manner by performing the computations of the algorithm sequentially in some order. Alternatively, the algorithm can be executed in a distributed manner such that the computations are performed by distributed devices in parallel. By exploiting the computing power of distributed devices, this parallelization can greatly reduce the delay of executing the algorithm.
In particular, we consider a parallel algorithm that can be decomposed into multiple computations executed in parallel, as illustrated in Fig. 1 (a). This parallel structure can capture many applications where the algorithm of computation can be parallelized. For example, for graphics rendering [45], a graphic can be partitioned into multiple segments such that each segment can be processed independently. Moreover, even algorithms that cannot be fully parallelized often can be partly parallelized. For instance, for image classification using a DNN model, although the layers of the model have to be processed in order, the computation within many layers can be parallelized.
Computation. A computation takes some data as input, executes some instructions (e.g., arithmetic operations, comparisons, branching such as "if…then…") based on the input data, and produces some data as output. The workload of a computation (e.g., the number of arithmetic operations) generally varies across computations. Some computations cannot be executed in parallel and have to be executed in order. This is the case when the output of one computation is used as the input of another computation, so that the latter cannot start until the former is completed. The precedence relations between the computations of an algorithm can be represented by a directed acyclic graph, as illustrated in Fig. 1 (a).
We assume that the total computation workload of the algorithm is divisible, such that it can be divided into arbitrary workloads among the parallel computations that sum to the total workload. For ease of exposition and without loss of generality, we assume that the workloads of the initial computation 0 and the final (merging) computation are 0. The results of this paper can be used to find approximately optimal solutions when the total workload is not arbitrarily divisible (which we will study in future work).
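As a minimal formal statement of the divisibility assumption, written with illustrative symbols that are not fixed by the text above (W for the total workload, w_i for the workload of parallel computation i, and N for the number of parallel computations):
\[
w_i \ge 0 \ \ \text{for } i = 1,\dots,N, \qquad \sum_{i=1}^{N} w_i = W .
\]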


Communication. For two computations that have to be executed in order, the output of one computation has to be communicated to the other computation as its input. The message passing between two computations is referred to as a communication. The workload of a communication (which typically captures the amount of data to be transferred) generally varies across communications.
The communications in the algorithm consist of forward communications (from computation 0 to each parallel computation) and backward communications (from each parallel computation to the final computation). We assume that the workloads of forward and backward communications are independent of the computation workloads. (We will study the situation where the workload of a communication depends on that of the corresponding computation, e.g., proportionally, in future work; we conjecture that the results for that setting will be similar to those in this paper.) This is the case when the amounts of input and output data of a computation are independent of the computation workload. For example, consider an algorithm that computes the maximum of a function f over a finite domain X by evaluating f(x) for all x in X and finding the maximum (e.g., to solve a combinatorial optimization problem). The total computation workload of this algorithm is (approximately) |X|, the number of evaluations of f. Then each parallel computation i can be designed to compute the maximum of f over a subset X_i, where the subsets are disjoint and their union is X. Thus the workload of computation i is |X_i|. Therefore, for computation i, the input data is the specification of the function f and the subset X_i (which can be a set of consecutive binary sequences, specified by the smallest and largest binary sequences), and the output data is the maximum value of f over X_i, both of which are independent of the computation workload |X_i|.
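To make this example concrete, the following sketch (with hypothetical function and variable names, not taken from the paper) partitions the domain into contiguous subranges, computes the maximum of f over each subrange as an independent parallel computation, and merges the partial results; the input to each worker (a function and a subrange) and its output (a single value) do not depend on the size of the subrange.

def local_max(f, lo, hi):
    # One parallel computation: maximum of f over the subrange [lo, hi).
    # Input: a function and two integers; output: one number. Neither the
    # input nor the output size depends on the workload hi - lo.
    return max(f(x) for x in range(lo, hi))

def distributed_max(f, domain_size, num_parts):
    # Split the domain {0, ..., domain_size - 1} into contiguous subranges.
    step = -(-domain_size // num_parts)  # ceiling division
    ranges = [(lo, min(lo + step, domain_size))
              for lo in range(0, domain_size, step)]
    # In DEC, each local_max call would run on a different edge device;
    # here the calls are made sequentially for illustration only.
    partial = [local_max(f, lo, hi) for lo, hi in ranges]
    return max(partial)

# Example: maximum of a toy objective over 1,000,000 candidate solutions.
best = distributed_max(lambda x: -(x - 12345) ** 2, 10 ** 6, num_parts=4)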
Computing device. We consider a set of edge devices (referred to as “nodes”) that are available for executing the distributed algorithm. The computation rate of a device is the computation workload that the device can complete per unit time, which quantifies the computation capability (depending on e.g., CPU, memory) of the device. It generally varies for different devices. The computations of the algorithm are allocated to edge devices as in Fig. 1 (b).
Edge network. The edge devices are typically connected by wireless links. Due to interference among wireless links, only wireless links without mutual interference can transmit data concurrently. The communication rate of a wireless link is the communication workload that the link can complete per unit time, which quantifies the communication capability (depending on e.g., transmit power, channel state) of the link. It generally varies for different wireless links.
We consider a single-hop wireless network such that each node can transmit data to each other node directly. The network is subject to complete interference constraints such that only one node can transmit at a time (i.e., no more than one node can transmit concurrently). This is a reasonable setting when nodes are close to each other, which is usually the case in an edge network (e.g., WiFi). We assume that the computation rates of devices and the communication rates of wireless links are known (they can be estimated before executing the algorithm); we will study the situation where computation and communication rates are unknown and stochastic in future work, based on the results of this paper. As a result, the delays of executing computations and communications are known. We also assume that the communications between nodes are coordinated by a central controller (e.g., a WiFi AP), such that there is no contention or interference in the network. This can be achieved, e.g., using the point coordination function (PCF) of WiFi.
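Since rates are defined as workload completed per unit time, the corresponding delays follow directly from the workloads and rates. With illustrative symbols (not fixed by the text above), for a node allocated computation workload w with computation rate r, and a communication of workload c over a link with communication rate \mu,
\[
d^{\mathrm{comp}} = \frac{w}{r}, \qquad d^{\mathrm{comm}} = \frac{c}{\mu} .
\]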
Algorithm delay. The delay for executing the distributed algorithm is the total time it takes to complete all the computations and communications of the algorithm, subject to the precedence constraints among computations and communications.
Based on the framework described above, our goal is to solve the following problem.
Definition 1 (The problem of minimizing algorithm delay)
We aim to optimally allocate the computation workloads of the distributed algorithm to the computing nodes, and to schedule the communications between the nodes in the wireless network, in order to minimize the delay of executing the distributed algorithm.
IV Optimal Computation Allocation and Optimal Communication Scheduling for Minimizing Algorithm Delay
In this section, based on the framework and problem formulation in Section III, we study the optimal allocation of computation workloads and the optimal scheduling of communications that minimize the delay of executing the distributed algorithm. We assume that all the available nodes are used for executing the algorithm. In Section IV-D, we will relax this assumption and study the optimal selection of nodes that minimizes the algorithm delay, based on the optimal computation allocation and communication scheduling. Note that after the optimal computation allocation and communication scheduling are determined, the optimal scheduling of computations on the device nodes can be easily determined: each computation starts once its corresponding forward communication finishes.
We note that there is non-trivial interdependence between computation allocation and communication scheduling: the optimal design for one problem depends on the design for the other problem. In the following, we will first show some general structural properties that are satisfied by the optimal communication scheduling. Then, given any communication scheduling policy with these properties, we will characterize the optimal computation allocation. Next, based on the optimal computation allocation policies, we will characterize the optimal communication scheduling.
IV-A Structural Properties of Optimal Communication Scheduling
In general, a communication scheduling policy can be preemptive such that the network can interrupt the execution of a communication at any time and start to execute another communication [46]. However, we can show that it suffices to focus on non-preemptive policies.
Lemma 1
Non-preemptive communication scheduling policies are optimal.
Due to space limitations, only the proofs of some results in this paper are provided (in the appendix). Lemma 1 shows that it is not beneficial to preempt an ongoing communication in order to execute another communication. Intuitively, preemptive scheduling is typically better than non-preemptive scheduling when tasks become available at different times and the objective is to minimize the total delay of tasks [46]. In contrast, for the problem here, the communications are always available (subject to the constraint that a backward communication is scheduled after its corresponding forward communication), and the objective is to minimize the algorithm delay, which is determined by the latest completion time rather than by the sum of individual delays.
Then we show that it is optimal to schedule all the forward communications before all the backward communications.
Lemma 2
It is optimal to schedule all the forward communications before all the backward communications.
Lemma 2 provides the insight that it is always beneficial to schedule any forward communication before any backward communication, rather than the reverse order, as doing so allows more time to execute the computations associated with these communications.

Next we show that it is optimal for the wireless network to keep busy between forward communications and between backward communications.
Lemma 3
It is optimal for the communication scheduling policies to be non-idle between forward communications and between backward communications, respectively.
The non-idle optimal policy in Lemma 3 means that the wireless network has no idle period between any two forward communications or between any two backward communications. However, there can be an idle period between the last forward communication and the first backward communication (i.e., between the end of the forward communications and the start of the backward communications in Fig. 2). Lemma 3 provides the insight that the wireless network should keep performing communications without idle periods, so as to complete the communications as soon as possible, unless it is necessary to wait for some period during which the nodes can perform computations.
IV-B Optimal Computation Allocation
In this subsection, we study the optimal allocation of computation workloads to nodes given any communication scheduling policy that satisfies the structural properties discussed in Section IV-A.
The optimal computation allocation can be found by an efficient algorithm that consists of up to three phases, as described in Algorithm 1. In particular, in Phase 1, we first allocate computation workloads as much as possible to nodes such that all these workloads can be completed before the last forward communication ends (i.e., before the end of the forward communication phase in Fig. 2). If the total workload of the algorithm can be fully allocated in this way, we have found the optimal computation allocation. Otherwise, in Phase 2, we allocate computation workloads as much as possible to nodes such that all these workloads can be completed after the first backward communication starts (i.e., after the start of the backward communication phase in Fig. 2). If the remaining workload of the algorithm can be fully allocated in this way, we have found the optimal computation allocation. Otherwise, in Phase 3, we allocate the remaining workload of the algorithm to nodes such that it can be completed after the last forward communication ends and before the first backward communication starts. The computational complexity of Algorithm 1 is linear in the number of nodes, as each phase of the algorithm involves at most one iteration per node. We establish the optimality of Algorithm 1 as follows.
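The following sketch illustrates the three phases described above under the system model of Section III. The data layout (per-node forward and backward communication delays, computation rates, and given scheduling orders) and all function and variable names are assumptions made for illustration, not the paper's exact pseudocode.

def allocate_computation(total_workload, fwd_delays, bwd_delays, rates,
                         fwd_order, bwd_order):
    # fwd_order / bwd_order: given scheduling orders of the forward / backward
    # communications, as lists of node indices (non-preemptive, non-idle).
    n = len(rates)
    alloc = [0.0] * n
    remaining = float(total_workload)

    # End time of each node's forward communication.
    t = 0.0
    fwd_end = [0.0] * n
    for i in fwd_order:
        t += fwd_delays[i]
        fwd_end[i] = t
    last_fwd_end = t  # time at which the last forward communication ends

    # Earliest start of each node's backward communication, assuming the
    # backward communications begin right after the last forward one.
    t = last_fwd_end
    bwd_start = [0.0] * n
    for i in bwd_order:
        bwd_start[i] = t
        t += bwd_delays[i]
    total_comm_delay = t  # lower bound on the algorithm delay

    # Phase 1: fill each node's idle time before the last forward comm. ends.
    for i in range(n):
        w = min(rates[i] * (last_fwd_end - fwd_end[i]), remaining)
        alloc[i] += w
        remaining -= w

    # Phase 2: fill each node's idle time after the first backward comm. starts.
    if remaining > 0:
        for i in range(n):
            w = min(rates[i] * (bwd_start[i] - last_fwd_end), remaining)
            alloc[i] += w
            remaining -= w

    # Phase 3: split any leftover workload in proportion to computation rates;
    # this stretches the gap between forward and backward communications and
    # increases the algorithm delay beyond the total communication delay.
    extra_time = remaining / sum(rates) if remaining > 0 else 0.0
    for i in range(n):
        alloc[i] += rates[i] * extra_time

    return alloc, total_comm_delay + extra_time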
Proposition 1

Given any communication scheduling policy satisfying the structural properties in Section IV-A, Algorithm 1 finds an optimal allocation of the computation workloads to the nodes.
Proposition 1 provides some interesting insights regarding the optimal computation allocation characterized by Algorithm 1. Intuitively, the optimal policy should reduce the idle computing periods of nodes as much as possible, so as to minimize the algorithm delay. To this end, it allocates workloads to the idle period of each node after its own forward communication ends and before the last forward communication (among all nodes) ends, and to the idle period after the first backward communication (among all nodes) starts and before its own backward communication starts, until there is no such idle period. The workloads allocated to these idle periods do not increase the algorithm delay, which remains equal to the total delay of all forward and backward communications. If some workload of the algorithm remains unallocated after the above allocation, it is allocated to all nodes in proportion to their computation rates, so that it incurs an equal computation delay at every node. This delay increases the algorithm delay beyond the delay incurred by communications. As a result of Proposition 1 and Algorithm 1, we can see that when the total workload of the algorithm is sufficiently large (above some threshold), each node keeps performing its computation between its forward and backward communications; otherwise, some node is forced to be idle between its forward and backward communications.
IV-C Optimal Scheduling Order of Communications
In this subsection, based on the structural properties of the optimal communication scheduling in Section IV-A and the optimal computation allocation in Section IV-B, we study the optimal scheduling order of communications. Due to the symmetry between forward communications and backward communications, we focus on the scheduling order of backward communications, as the results for forward communications follow similarly. In particular, we consider the optimal scheduling order that minimizes the algorithm delay, given the total computation workload of the algorithm. Based on the optimal computation allocation found by Algorithm 1, we can transform this problem into an equivalent "dual" form: how should the backward communications be scheduled so that the total computation workload that can be completed after the first backward communication starts is maximized?
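Under the model above, this dual objective can be written explicitly with illustrative symbols (not fixed by the text): if the backward communications are scheduled in an order \pi with per-communication delays d_{\pi(1)},\dots,d_{\pi(N)}, and the node whose backward communication is at position j has computation rate r_{\pi(j)}, then that node can keep computing until its own backward communication starts, so the workload completable after the first backward communication starts is
\[
\max_{\pi} \; \sum_{j=1}^{N} r_{\pi(j)} \sum_{k=1}^{j-1} d_{\pi(k)} .
\]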

We first consider the case where nodes have uniform computation rates but can have diverse communication delays. The optimal scheduling order is given as follows.
Proposition 2
For the case of uniform computation rates, it is optimal to schedule communications in the descending order of communication delays.
Proposition 2 means that the optimal policy schedules the longest communication (i.e., the one with the largest delay) first, the second longest next, and so on, as illustrated in Fig. 3(a). This result provides the following insight: it is better to schedule a longer communication earlier than a shorter one, since doing so allows more time for the other node(s) to perform computing. As a result, the computing capacities of nodes are most efficiently utilized and thus the algorithm delay is minimized.
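Using the dual objective written above (with the same illustrative symbols), this can be seen directly: with a common computation rate r,
\[
\sum_{j=1}^{N} r \sum_{k=1}^{j-1} d_{\pi(k)} \;=\; r \sum_{k=1}^{N} (N-k)\, d_{\pi(k)} ,
\]
which is maximized by placing the largest delays at the earliest positions, i.e., by scheduling communications in descending order of their delays.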
Then we consider the case where nodes have uniform communication delays but can have diverse computation rates. The optimal scheduling order is given below.
Proposition 3
For the case of uniform communication delays, it is optimal to schedule communications in the ascending order of the corresponding nodes’ computation rates.
Proposition 3 means that the optimal policy schedules the communication of the slowest node first, that of the second slowest node next, and so on, as illustrated in Fig. 3(b). The insight from this result is as follows: it is better to utilize a faster node for a longer period than a slower node, so that the computing capacities of nodes are most efficiently utilized and thus the algorithm delay is minimized.
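Similarly, with a common communication delay d (same illustrative symbols), the dual objective becomes
\[
\sum_{j=1}^{N} r_{\pi(j)} \,(j-1)\, d \;=\; d \sum_{j=1}^{N} (j-1)\, r_{\pi(j)} ,
\]
which is maximized by placing the largest computation rates at the latest positions, i.e., by scheduling the slowest node's communication first and the fastest node's communication last.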
Next we consider the general case where nodes can have diverse computation rates and also diverse communication delays. It is plausible that the optimal policies for the previous two cases could perform well in this case. However, we can construct counterexamples (given in the appendix) which show that those policies can result in a solution that is arbitrarily worse than the optimal solution.
To determine the optimal scheduling order, we could use an exhaustive search that calculates the total computation workload (that can be completed after the first backward communication starts) for all possible scheduling orders and then finds the optimal one. However, the computational complexity of the exhaustive search is factorial in the number of communications, which is too high. Therefore, we aim to find a computationally efficient approximation algorithm that provides a performance guarantee in terms of the approximation ratio, defined as the ratio between the total computation workload achieved by the algorithm and that under the optimal scheduling order. It can be shown that the largest-delay-first and fastest-node-last policies (which are the optimal policies for the previous two cases, respectively) cannot provide a finite approximation ratio, since they ignore computation rates or communication delays, respectively. Thus motivated, the design of the approximation algorithm here should take both factors into account. In particular, we design a greedy algorithm that calculates the total computation workload for all possible scheduling orders of the first few communications among all the communications, where the number of such communications is a tunable parameter, and then finds the optimal order of these first communications. The scheduling order of the remaining communications can be set arbitrarily. The algorithm is described in detail in Algorithm 2. We can show that this algorithm has a finite approximation ratio as follows.
Proposition 4
Algorithm 2 finds a communication scheduling order that achieves a finite approximation ratio (which depends on the parameter of the algorithm) relative to the optimal scheduling order.
Proposition 4 shows that there exists a tradeoff between the computational complexity of Algorithm 2 and its approximation ratio, which can be controlled by the number of communications over which the greedy search is performed. The complexity of the algorithm is determined by the number of candidate orders of these first communications, which grows combinatorially with this parameter. Therefore, a larger parameter means higher complexity, which is worse, but also a better approximation ratio.
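A minimal sketch of this greedy search is given below, evaluating candidate orders with the dual objective written earlier in this subsection (the workload completable after the first backward communication starts). The parameter name prefix_len and the choice to append the remaining communications in index order are illustrative assumptions, not the paper's exact pseudocode.

from itertools import permutations

def completable_workload(order, bwd_delays, rates):
    # Workload completable after the first backward communication starts when
    # the backward communications follow `order`: each node can compute until
    # its own backward communication begins.
    t, total = 0.0, 0.0
    for i in order:
        total += rates[i] * t
        t += bwd_delays[i]
    return total

def greedy_schedule(bwd_delays, rates, prefix_len):
    # Try every order of the first prefix_len communications (the greedy search
    # of Algorithm 2); the remaining communications are appended in index order.
    nodes = list(range(len(rates)))
    best_order, best_val = None, -1.0
    for prefix in permutations(nodes, prefix_len):
        rest = [i for i in nodes if i not in prefix]
        order = list(prefix) + rest
        val = completable_workload(order, bwd_delays, rates)
        if val > best_val:
            best_order, best_val = order, val
    return best_order, best_val

# Example: 4 nodes with diverse delays and rates, searching over prefixes of length 2.
order, workload = greedy_schedule([3.0, 1.0, 2.0, 0.5], [1.0, 4.0, 2.0, 8.0], 2)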

IV-D Optimal Selection of Nodes
In the previous subsections, it was assumed that every node is used to execute the distributed algorithm, so that the forward and backward communications of each node are scheduled regardless of the computation workload allocated to it (even when no workload is allocated). However, it is important to note that it may not be optimal to use as many nodes as possible to execute the algorithm. This is because using an additional node incurs communication delays that can increase the algorithm delay, and this increase can outweigh the decrease in computation delay obtained by utilizing the additional node for computing, as illustrated in Fig. 4. Thus motivated, in the following we investigate how to select nodes for executing the algorithm so as to minimize the algorithm delay.



We first consider the cases where nodes have uniform computation rates or uniform communication delays. For these two cases, we can show that the optimal set of nodes consists of those with the smallest communication delays or the highest computation rates, respectively.
Lemma 4
For the case of uniform computation rates (or uniform communication delays), if a node is in the optimal set of nodes, then each node with a smaller communication delay (or a higher computation rate, respectively) is also in the optimal set.
Lemma 4 is intuitive, as a node with a smaller communication delay or a higher computation rate should be preferred over one with a larger communication delay or a lower computation rate, respectively. Based on this result, we can use an efficient linear search to find the optimal set of nodes. In particular, it calculates the minimum delay for each possible number of "best" nodes (i.e., the nodes with the smallest communication delays or the highest computation rates, respectively), and then finds the optimal number of "best" nodes. The computational complexity of this linear search is linear in the number of available nodes.
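A sketch of this linear search is below. The sort key and the routine delay_of are placeholders: the sort key ranks nodes from "best" to "worst" (communication delay ascending for the uniform-rate case, or computation rate descending for the uniform-delay case), and delay_of(subset) is assumed to return the minimum algorithm delay for that subset under the optimal allocation and scheduling.

def best_node_subset(nodes, delay_of, sort_key):
    # Rank nodes from best to worst, then search linearly over the number of
    # best nodes to use, keeping the subset with the smallest delay.
    ranked = sorted(nodes, key=sort_key)
    best_subset, best_delay = [], float("inf")
    for k in range(1, len(ranked) + 1):
        d = delay_of(ranked[:k])
        if d < best_delay:
            best_subset, best_delay = ranked[:k], d
    return best_subset, best_delay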
Then we consider the general case where nodes can have diverse computation rates and also diverse communication delays. To determine the optimal set of nodes, we may use an exhaustive search that calculates the (approximate) minimum delay for all possible sets of nodes (since the optimal scheduling order of communications is difficult to find), and then finds the optimal set among them. However, the computational complexity of the exhaustive search is exponential in the number of nodes, which is too high. Therefore, we can use a greedy algorithm instead: we start with the empty set, and in each iteration we add to this set the unselected node that reduces the algorithm delay the most, until no such node exists. The greedy algorithm requires at most a quadratic number of candidate-set evaluations in the number of nodes. We will analyze the performance of this algorithm in terms of its approximation ratio in our future work.
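A sketch of this greedy selection is below. Here delay_of is again a placeholder for a routine that evaluates the algorithm delay of a candidate set using the optimal computation allocation (Algorithm 1) and the (approximately) optimal communication order (Algorithm 2); it is assumed rather than specified here, and node identifiers are assumed to be integers.

def greedy_node_selection(all_nodes, delay_of):
    # Start from the empty set; in each iteration add the unselected node that
    # reduces the (approximate) algorithm delay the most; stop when no
    # remaining node yields a further reduction.
    selected, best_delay = [], float("inf")
    while True:
        candidates = [v for v in all_nodes if v not in selected]
        if not candidates:
            break
        delay, node = min((delay_of(selected + [v]), v) for v in candidates)
        if delay >= best_delay:
            break
        selected.append(node)
        best_delay = delay
    return selected, best_delay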
V Simulation Results
In this section, we evaluate the performance of the optimal computation allocation, the optimal communication order, and the optimal selection of nodes using simulation results.
V-A Optimal Computation Allocation
To illustrate the efficiency of the optimal computation allocation, we compare the algorithm delay under the optimal computation allocation (OCA) found by Algorithm 1 and under an equal computation allocation (ECA) that allocates an equal computation workload to each node. Default values are used for the number of nodes, the communication delays, and the computation rates, unless otherwise varied.
Fig. 7 illustrates the algorithm delay under OCA and under ECA as the total computation workload of the algorithm varies. We can see that the delay under OCA is always no greater than that under ECA, which demonstrates the better performance of OCA. We can also see that the delay is non-decreasing in the total workload, because a larger workload takes more time to complete. We note that the performance gap is 0 when the total workload is small. This is because in this case all the workloads can be completed before the last forward communication ends or after the first backward communication starts, so that the delay is equal to the total time of the forward and backward communications, which is the same under both allocations. We further observe that the performance gain of OCA over ECA for diverse communication delays and computation rates (DMDC) is larger than that for uniform communication delays and diverse computation rates (UMDC), or for diverse communication delays and uniform computation rates (DMUC). This is because when communication delays or computation rates are diverse rather than uniform, OCA differs more from ECA, so that OCA is more beneficial.
V-B Optimal Communication Order
To illustrate the efficiency of the optimal communication scheduling order, we compare the algorithm delay under the optimal order given by Proposition 2 or 3, or the approximately optimal order found by Algorithm 2, and under an arbitrary order that schedules communications in the ascending order of the nodes' indices.
Fig. 7 illustrates the algorithm delay under the (approximately) optimal communication order (OCO) and under the arbitrary communication order (ACO) as the total computation workload of the algorithm varies. As expected, OCO always outperforms ACO, and the delay is non-decreasing in the total workload. As in the previous comparison, the performance gap is 0 when the total workload is small, because in this case the delay is equal to the total delay of the communications, which is the same under both scheduling orders. We can also observe that the performance gain of OCO over ACO for DMDC is larger than that for UMDC or DMUC. As before, the reason is that when communication delays or computation rates are diverse rather than uniform, OCO differs more from ACO and thus is more beneficial.
V-C Optimal Selection of Nodes
To illustrate the impact of the selection of nodes, we compare the algorithm delay as the number of nodes selected for executing the algorithm varies. Fig. 7 illustrates the algorithm delay under the optimal computation allocation (OCA), under the (approximately) optimal communication order (OCO), and under both (OCA-OCO), as the number of selected nodes varies. We can see that the delay first decreases and then increases with the number of selected nodes. This is because the delay reduction due to the computation workload completed by an additional node initially outweighs the delay increase due to the communications of that node, after which the latter effect dominates. We can also observe that the delay under OCA-OCO is lower than under OCA or OCO alone, which demonstrates that both OCA and OCO are beneficial.
VI Conclusion and Future Work
In this paper, we have explored DEC by studying the minimization of the delay of executing a distributed algorithm using distributed edge devices connected by a wireless network. In particular, we have characterized the optimal communication scheduling, the optimal computation allocation, and the optimal selection of nodes for minimizing the algorithm delay. The optimal policies have been developed by addressing the non-trivial coupling between these three issues, while taking into account the features of wireless networks. The results have provided useful insights into the optimal policies.
For future work, one immediate direction is to investigate the case where communication and computation delays are unknown and stochastic. Another important setting is when the computation workload of the algorithm is not arbitrarily divisible. We will also study the setting where the workloads of communications depend on the corresponding workloads of computations. We will further explore distributed algorithms with other structures, including a serial structure and a combination of serial and parallel structures.
APPENDIX
Counterexamples of communication scheduling orders
Next we present counterexamples where the communication scheduling orders given by the largest delay first (LDF) policy and the fastest computing node last (FCL) policy are arbitrarily worse than the optimal orders, for the case of diverse computation rates and diverse communication delays.
First consider a setting where the delays of the communications are ordered such that the LDF policy results in a particular scheduling order. Then the maximum computation workload that can be completed under this scheduling order is given by
(1)
However, the optimal scheduling order can take a different form, such that the maximum workload is given by
(2)
if some node's computation rate is sufficiently large. Furthermore, as that computation rate grows without bound, the ratio of the workload achieved by LDF to the optimal workload tends to zero, which means that the solution given by the LDF policy can be arbitrarily worse than the optimal solution. This is because LDF ignores the computation rates in determining the scheduling order, so that the maximum workload it achieves is independent of that computation rate, which can be very large. Consider another example where the nodes' computation rates are ordered such that the FCL policy results in a particular scheduling order. Then the maximum computation workload that can be completed is also given by (1). However, the optimal scheduling order can again be different, such that the maximum workload is given by (2), if some communication delay is sufficiently large. Then, as that communication delay grows without bound, the ratio of the workload achieved by FCL to the optimal workload tends to zero, which means that the solution given by FCL can be arbitrarily worse than the optimal solution. This is because FCL ignores the communication delays in determining the scheduling order, so that the maximum workload it achieves is independent of that communication delay, which can be very large.
Proof of Lemma 1

The main idea of the proof is an exchange argument. WLOG, suppose forward communication M1 is interrupted by forward communication M2 into two parts, such that the 2nd part of M1 starts after M2 (or part of M2) is completed, as illustrated in Fig. 8 (a). If M1 is interrupted multiple times, the proof follows by applying the exchange argument multiple times. Now we exchange the scheduling order of the 1st part of M1 and M2 (or the interrupting part of M2), as illustrated in Fig. 8 (b). We can see that the order exchange does not affect the schedules of computations C1 and C2, nor the schedule of any communication or computation on nodes other than nodes 1 and 2. As a result, the delay of the algorithm remains the same. This completes the proof.
Proof of Lemma 2

The main idea of the proof is an exchange argument. According to Lemma 1, it suffices to focus on non-preemptive scheduling. WLOG, suppose forward communication FM1 is scheduled after backward communication BM2, as illustrated in Fig. 9 (a). Now we exchange the scheduling order of FM1 and BM2, as illustrated in Fig. 9 (b). We can see that the order exchange does not affect the schedules of computations C1 and C2, nor the schedule of any communication or computation on nodes other than nodes 1 and 2. As a result, the delay of the algorithm remains the same. This completes the proof.
Proof of Lemma 3

The main idea of the proof is a shifting argument. According to Lemma 1 and Lemma 2, it suffices to focus on non-preemptive scheduling policies that schedule all forward communications before all backward communications. WLOG, suppose there is an idle period of the wireless network between forward communication M1 and forward communication M2, as illustrated in Fig. 10 (a). If there are multiple idle periods, the proof follows by applying the shifting argument multiple times. Now we shift M2 to be right after M1, as illustrated in Fig. 10 (b). We can see that the shifting does not affect the schedules of computations C1 and C2, nor the schedule of any communication or computation on nodes other than nodes 1 and 2. As a result, the delay of the algorithm remains the same. If M1 and M2 are backward communications, we can shift M1 to be right before M2, and the same argument applies. This completes the proof.
Proof of Proposition 1

The main idea of the proof is a shifting argument. We first note that a lower bound on the algorithm delay is the total delay of all forward and backward communications. Therefore, if Algorithm 1 terminates in Phase 1 or Phase 2, then it finds a feasible schedule of all the computation workloads of the algorithm such that the algorithm delay is equal to this lower bound. Suppose Algorithm 1 terminates in Phase 3 but the remaining unallocated computation workload is not allocated to the nodes in proportion to their computation rates. Then there must be some node that is idle for some period after all forward communications and before all backward communications, as illustrated in Fig. 11 (a). In this case, we can always shift some workload from some other node to this node without increasing the algorithm delay, until there is no such idle period, as illustrated in Fig. 11 (b). This completes the proof.
Proof of Proposition 4
Let be the scheduling order found by Algorithm 2 such that the communication of node is scheduled at the th in the order. Let be the optimal scheduling order. For convenience, define
Also define and . Let be an ascending order of elements in . For any and , we construct a new order of the elements in such that , (the elements , can be arbitrary, so there can be multiple such ). Intuitively, we move the elements of with order indices given by to be the first elements in , while keeping the relative order of these elements the same. For example, if and with , then . Define
Then we have
(3)
where the inequality holds because
is a subset of
To see this, for the previous example, if , we have
Then we have
for any such order, where the 1st inequality is due to Algorithm 2, and the 2nd inequality is due to (3). Using the above we have
where the 2nd equality holds because we sum over all possible such orders. Thus the claimed approximation ratio follows. This completes the proof.
References
- [1] Google Glass. [Online]. Available: https://x.company/glass/
- [2] Oculus. [Online]. Available: https://www.oculus.com
- [3] Autonomous Flight. [Online]. Available: https://autonomousflight.com
- [4] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, 2016.
- [5] C. Wu, Y. Zhang, L. Zhang, B. Yang, X. Chen, W. Zhu, and L. Qiu, “Butterfly: Mobile collaborative rendering over gpu workload migration,” in IEEE International Conference on Computer Communications (INFOCOM), 2017.
- [6] W. Zhang, J. Chen, Y. Zhang, and D. Raychaudhuri, “Towards efficient edge cloud augmentation for virtual reality mmogs,” in Proceedings of the Second ACM/IEEE Symposium on Edge Computing. ACM, 2017, p. 8.
- [7] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive control for resource-constrained distributed machine learning,” in IEEE International Conference on Computer Communications (INFOCOM), 2018.
- [8] Q. Liu, S. Huang, J. Opadere, and T. Han, “An edge network orchestrator for mobile augmented reality,” in IEEE International Conference on Computer Communications (INFOCOM), 2018.
- [9] X. Ran, H. Chen, X. Zhu, Z. Liu, and J. Chen, “Deepdecision: A mobile deep learning framework for edge video analytics,” in IEEE International Conference on Computer Communications (INFOCOM), 2018.
- [10] L. Cheng and J. Wang, “Vitrack: Efficient tracking on the edge for commodity video surveillance systems,” in IEEE International Conference on Computer Communications (INFOCOM), 2018.
- [11] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient multi-user computation offloading for mobile-edge cloud computing,” IEEE/ACM Transactions on Networking, vol. 24, no. 5, pp. 2795–2808, 2015.
- [12] L. Tong, Y. Li, and W. Gao, “A hierarchical edge cloud architecture for mobile computing,” in IEEE International Conference on Computer Communications (INFOCOM), April 2016.
- [13] H. Tan, Z. Han, X.-Y. Li, and F. C. Lau, “Online job dispatching and scheduling in edge-clouds,” in IEEE International Conference on Computer Communications (INFOCOM), 2017.
- [14] B. Gao, Z. Zhou, F. Liu, and F. Xu, “Winning at the starting line: Joint network selection and service placement for mobile edge computing,” in IEEE International Conference on Computer Communications (INFOCOM), 2019.
- [15] P. Yang, N. Zhang, Y. Bi, L. Yu, and X. S. Shen, “Catalyzing cloud-fog interoperation in 5G wireless networks: An SDN approach,” IEEE Network, vol. 31, no. 5, pp. 14–20, 2017.
- [16] X. Wang, M. Chen, T. Taleb, A. Ksentini, and V. C. M. Leung, “Cache in the air: exploiting content caching and delivery techniques for 5G systems,” IEEE Communications Magazine, vol. 52, no. 2, pp. 131–139, February 2014.
- [17] W. C. Ao and K. Psounis, “Distributed caching and small cell cooperation for fast content delivery,” in Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. MobiHoc ’15. New York, NY, USA: ACM, 2015, pp. 127–136. [Online]. Available: http://doi.acm.org/10.1145/2746285.2746300
- [18] A. Liu and V. K. N. Lau, “Exploiting base station caching in MIMO cellular networks: Opportunistic cooperation for video streaming,” IEEE Transactions on Signal Processing, vol. 63, no. 1, pp. 57–69, Jan 2015.
- [19] K. W. Ross and D. D. Yao, “Optimal load balancing and scheduling in a distributed computer system,” Journal of the ACM (JACM), vol. 38, no. 3, pp. 676–689, 1991.
- [20] H. Chang, M. Kodialam, R. R. Kompella, T. Lakshman, M. Lee, and S. Mukherjee, “Scheduling in mapreduce-like systems for fast completion time,” in IEEE International Conference on Computer Communications (INFOCOM), 2011.
- [21] Y. Zheng, N. B. Shroff, and P. Sinha, “A new analytical technique for designing provably efficient mapreduce schedulers,” in IEEE International Conference on Computer Communications (INFOCOM), 2013.
- [22] Y. Zhu, Y. Jiang, W. Wu, L. Ding, A. Teredesai, D. Li, and W. Lee, “Minimizing makespan and total completion time in mapreduce-like systems,” in IEEE International Conference on Computer Communications (INFOCOM), 2014.
- [23] Y. Zheng, N. B. Shroff, R. Srikant, and P. Sinha, “Exploiting large system dynamics for designing simple data center schedulers,” in IEEE International Conference on Computer Communications (INFOCOM), 2015.
- [24] S. Im, M. Naghshnejad, and M. Singhal, “Scheduling jobs with non-uniform demands on multiple servers without interruption,” in IEEE International Conference on Computer Communications (INFOCOM), 2016.
- [25] Y.-C. Cheng and T. G. Robertazzi, “Distributed computation for a tree network with communication delays,” IEEE Transactions on Aerospace and Electronic Systems, vol. 26, no. 3, pp. 511–516, 1990.
- [26] J. Tan, X. Meng, and L. Zhang, “Coupling task progress for mapreduce resource-aware scheduling,” in IEEE International Conference on Computer Communications (INFOCOM), 2013.
- [27] J. Jiang, S. Ma, B. Li, and B. Li, “Symbiosis: network-aware task scheduling in data-parallel frameworks,” in IEEE International Conference on Computer Communications (INFOCOM), 2016.
- [28] Q. Xie, A. Yekkehkhany, and Y. Lu, “Scheduling with multi-level data locality: Throughput and heavy-traffic optimality,” in IEEE International Conference on Computer Communications (INFOCOM), 2016.
- [29] F. Chen, M. Kodialam, and T. Lakshman, “Joint scheduling of processing and shuffle phases in mapreduce systems,” in IEEE International Conference on Computer Communications (INFOCOM), 2012.
- [30] J. Liu, C. H. Xia, N. B. Shroff, and X. Zhang, “On distributed computation rate optimization for deploying cloud computing programming frameworks,” ACM SIGMETRICS Performance Evaluation Review, vol. 40, no. 4, pp. 63–72, 2013.
- [31] J. Ghaderi, S. Shakkottai, and R. Srikant, “Scheduling storms and streams in the cloud,” in ACM SIGMETRICS Performance Evaluation Review, vol. 43, no. 1. ACM, 2015, pp. 439–440.
- [32] M. Shafiee and J. Ghaderi, “A simple congestion-aware algorithm for load balancing in datacenter networks,” IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3670–3682, 2017.
- [33] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica, “Managing data transfers in computer clusters with orchestra,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 98–109, 2011.
- [34] Y. Zhao, K. Chen, W. Bai, M. Yu, C. Tian, Y. Geng, Y. Zhang, D. Li, and S. Wang, “Rapier: Integrating routing and scheduling for coflow-aware data center networks,” in IEEE International Conference on Computer Communications (INFOCOM), 2015.
- [35] Y. Li, S. H.-C. Jiang, H. Tan, C. Zhang, G. Chen, J. Zhou, and F. Lau, “Efficient online coflow routing and scheduling,” in IEEE International Conference on Computer Communications (INFOCOM), 2016.
- [36] A. Eryilmaz, R. Srikant, and J. R. Perkins, “Stable scheduling policies for fading wireless channels,” IEEE/ACM Transactions on Networking, vol. 13, no. 2, pp. 411–424, 2005.
- [37] I.-H. Hou, V. Borkar, and P. Kumar, A theory of QoS for wireless. IEEE, 2009.
- [38] L. Jiang and J. Walrand, “A distributed csma algorithm for throughput and utility maximization in wireless networks,” IEEE/ACM Transactions on Networking (ToN), vol. 18, no. 3, pp. 960–972, 2010.
- [39] X. Liu, E. K. Chong, and N. B. Shroff, “A framework for opportunistic scheduling in wireless networks,” Computer networks, vol. 41, no. 4, pp. 451–474, 2003.
- [40] G. R. Gupta and N. B. Shroff, “Delay analysis and optimality of scheduling policies for multihop wireless networks,” IEEE/ACM Transactions on Networking (TON), vol. 19, no. 1, pp. 129–141, 2011.
- [41] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” IEEE Transactions on Automatic Control, vol. 37, no. 12, pp. 1936–1948, 1992.
- [42] X. Lin and S. B. Rasool, “Distributed and provably efficient algorithms for joint channel-assignment, scheduling, and routing in multichannel ad hoc wireless networks,” IEEE/ACM Transactions on Networking (TON), vol. 17, no. 6, pp. 1874–1887, 2009.
- [43] D. Xue and E. Ekici, “Delay-guaranteed cross-layer scheduling in multihop wireless networks,” IEEE/ACM Transactions on Networking, vol. 21, no. 6, pp. 1696–1707, 2013.
- [44] Y. Yi and M. Chiang, “Stochastic network utility maximisation—a tribute to kelly’s paper published in this journal a decade ago,” European Transactions on Telecommunications, vol. 19, no. 4, pp. 421–442, 2008.
- [45] A. Munshi, D. Ginsburg, and D. Shreiner, OpenGL ES 2.0 programming guide. Pearson Education, 2008.
- [46] M. Pinedo, Scheduling. Springer, 2012, vol. 29.