
Peer Offloading with Delayed Feedback in Fog Networks

Miao Yang, Hongbin Zhu, Hua Qian, Senior Member, IEEE, Yevgeni Koucheryavy, Senior Member, IEEE, Konstantin Samouylov, and Haifeng Wang This paper was presented in part at the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020). M. Yang and H. Qian are with Shanghai Advanced Research Institute, Chinese Academy of Sciences (CAS), also with School of Information Science and Technology, ShanghaiTech University, and also with University of Chinese Academy of Sciences. H. Zhu and H. Wang are with School of Information Science and Technology, ShanghaiTech University, also with Shanghai Institute of Microsystem and Information Technology, CAS. Yevgeni Koucheryavy is with Tampere University, Finland, and also with National Research University Higher School of Economics, Moscow, Russia. Konstantin Samouylov is with Peoples Friendship University of Russia (RUDN University), Moscow, Russian Federation. Corresponding author: Hongbin Zhu ([email protected]). This work was supported in part by the National Natural Science Foundation of China (Grant No. 61971286) and the Science and Technology Commission Foundation of Shanghai (Grant No. 19DZ1204300).
Abstract

Compared with cloud computing, fog computing performs computation and services at the edge of the network, thus relieving the computation burden of the data center and reducing the task latency of end devices. Computation latency is a crucial performance metric in fog computing, especially for real-time applications. In this paper, we study a peer computation offloading problem for a fog network with unknown dynamics. In this scenario, each fog node (FN) can offload its computation tasks to neighboring FNs in a time-slotted manner. The offloading latency, however, cannot be fed back to the task dispatcher instantaneously due to the uncertainty of the processing time at peer FNs. Besides, peer competition occurs when different FNs offload tasks to the same FN at the same time. To tackle the above difficulties, we model the computation offloading problem as a sequential FN selection problem with delayed information feedback. Using the adversarial multi-armed bandit framework, we construct an online learning policy to deal with delayed information feedback. Different contention resolution approaches are considered to resolve peer competition. Performance analysis shows that the regret of the proposed algorithm, i.e., the performance loss compared with the optimal FN selection, achieves a sub-linear order, suggesting an asymptotically optimal FN selection policy. Besides, we prove that the proposed strategy can result in a Nash equilibrium (NE) with all FNs playing the same policy. Simulation results validate the effectiveness of the proposed policy.

Index Terms:
Fog computing, adversarial multi-armed bandit, reinforcement learning, task offloading, delayed feedback.

I Introduction

Pervasive mobile devices, together with rapidly increasing mobile applications such as augmented reality, virtual reality, online gaming, and super-resolution video streaming, are revolutionizing our way of life [1]. These massive devices and heterogeneous services require unprecedentedly high transmission speed and low latency [2]. The limited battery capacities and computation resources of mobile devices, on the other hand, place stringent restrictions on local computation.

Fog computing has recently emerged to assist in the task processing of mobile devices [3, 4, 5]. The key idea of fog computing is to utilize the computation resources of fog nodes (FNs), which are usually located at the edge of the network. These FNs can be servers or micro base stations with extra computation resources. The computation tasks of mobile devices can be offloaded to FNs, which relieves the computation workload of mobile devices and reduces their power consumption. On the other hand, the offloaded computation tasks and the computation capability of each FN vary, which may result in unbearable latency and degrade the users’ experience. For example, weak FNs with poor computation capacity or heavy workload are ill-suited to computation-intensive applications.

An intelligent peer computation offloading strategy is desired to fully utilize neighboring FNs while maintaining a high quality of experience. By means of peer computation offloading, weak FNs can offload their emerging computation tasks to FNs in the vicinity. With an efficient peer offloading strategy, fog computing can assist terminals in satisfying stringent latency requirements and support a variety of Internet of Things (IoT) devices whose computation resources are limited.

I-A Prior Work

In the recent literature, quite a few computation offloading problems over fog networks have been studied. The authors in [6] investigated the computation offloading problem by adaptively offloading computation tasks from the cloud to fog servers with predicted mobility. The task offloading problem was formulated as a multi-period generalized assignment problem in [7], which aimed at minimizing the total delay by offloading tasks to suitable fog nodes in each period. The work in [8] studied a workload assignment problem between fog and cloud that considered the trade-off between power consumption and transmission delay. The above studies are based on centralized scheduling with a high overhead of information exchange at edge devices. Such offloading algorithms are not suitable for real-time and massive data services. In addition, these studies do not consider offloading among FNs, which can be more challenging with unbalanced workloads among FNs.

To tackle the peer offloading problem, the authors in [9] and [10] formulated the offloading problem as a convex optimization problem to minimize the energy consumption under latency constraints. In [11], a latency minimization problem was studied to allocate the computation resources of the edge servers. Meanwhile, the authors in [12] proposed a task allocation approach that minimized the overall task completion time using a multidimensional auction. In the above works, it is generally assumed that the information of the fog network is completely known, and thus the computation offloading is modeled as a deterministic scheduling problem. However, such assumptions do not hold in practice since global information cannot be obtained in real time and the network status may change rapidly.

For the task offloading problem with unknown dynamics, the authors in [13] proposed an online peer offloading algorithm for time-varying channel conditions and stochastic arrival tasks. The work in [14] proposed an online secretary algorithm to minimize the offloading latency with unknown arrival rates of neighboring FNs. The latency, the energy consumption, and the switching cost were jointly considered as a stochastic programming problem in [15]. Taking into account the requirements of ultra-reliable low latency, the authors in [16] proposed a mechanism to minimize the users’ power consumption while considering the allocated resources. Besides, in [17], the workloads of neighboring FNs were formulated as a Markov model and a latency-minimizing algorithm was proposed to track these dynamics. These works assume that the workloads of FNs satisfy some stationary distributions. Furthermore, there exist competition and collision effects among the offloading choices of different FNs. Unexpected tasks brought by dynamic network structures and conditions could deteriorate the offloading performance.

In general, all works mentioned above assume instantaneous feedback of offloading decisions. The task assigner is assumed to receive the expected latency as soon as the tasks are dispatched. This assumption is not always valid in the real world, since the FN that provides computation capability may not be able to give accurate latency estimates when the tasks are received. The exact execution clock cycles of a task cannot be obtained until it is finished. The execution time of a task is coupled with the hardware and software state, such as the cache condition, resource contention, scheduling policy, and interrupt management. In fact, even measuring the worst-case execution time (WCET) of a task is a notoriously hard problem in the field of real-time operating systems (RTOS) [18].

I-B Contribution and Organization

In this paper, for the FN that wants to offload tasks, we assume that: (1) the global FN state information is unknown; (2) the resources of service FNs may be unpredictably competed for by other FNs; (3) the offloading feedback is obtained only after the task has been processed and finished by the service FN. These assumptions naturally raise three challenging problems. Firstly, the performance achieved from each FN is unknown beforehand and has to be learned due to dynamic FN conditions. Secondly, the achieved performance also depends on the set of competing tasks, which is unpredictable since the decisions of other FNs are made locally without coordination. Thirdly, the feedback of the offloaded task is delayed, which may deteriorate the performance of existing strategies. Most existing computation offloading algorithms do not fully consider the uncertainties mentioned above and are not applicable under these assumptions.

In [19], we studied the computation peer offloading problem in a fog-enabled network. Our goal is to minimize the peer offloading latency in an uncertain environment without instantaneous latency feedback. An exponential-based algorithm was proposed to handle this problem, and simulation results revealed the efficiency of the proposed strategy.

In this paper, we extend our previous work in [19] with theoretical proofs for the single-agent setting. Furthermore, we study the computation peer offloading problem in the multi-agent setting. The main contributions of this paper are summarized as follows.

  • We develop an online learning framework for fog networks, where each FN can offload its computation tasks to peer FNs in the vicinity. We formulate the latency minimization problem as a sequential decision-making problem. Both single-agent and multi-agent settings are considered in this work.

  • For the single-agent setting, an online learning algorithm is advocated based on the adversarial multi-armed bandit (MAB) framework, which accommodates arbitrary variations of FN conditions. The proposed method can be harnessed to learn the statistical information of the network through delayed information feedback. Besides, the proposed method allows the task-dispatching FN to dynamically identify the most suitable service FNs without prior knowledge of arrival tasks and channel conditions. A theoretical proof is provided, showing that the regret of the proposed algorithm achieves a sub-linear order even with delayed feedback.

  • For the multi-agent setting, we consider different conflict resolutions when multiple tasks are dispatched to the same FN. Assuming that all FNs adopt the same strategy, the multi-agent setting can be regarded as a game between an FN and the rest of the FNs. The proposed algorithm for the multi-agent setting achieves the same order of optimality as in the single-agent setting. Besides, the proposed algorithm tailored for the offloading game can converge to an ε-NE in a specific two-user setting.

The rest of the paper is organized as follows. The system model and optimization problem are introduced in section II. An online learning algorithm for computation offloading with delayed feedback is proposed in section III. The theoretical analysis of multi-agent games with the proposed algorithm is given in section IV. Numerical results are presented in section V, and section VI concludes our work.

II System Model and Problem Formulation

II-A Network Model

Consider a fog network consisting of an end-user layer and a fog layer as shown in Fig. 1. In our framework, the end-user layer includes smartphones, laptops and wireless sensors that do not have enough computation capability and are power limited. With the help of fog computing, these end-users can offload their task data to the fog layers for remote distributed computing purposes. We consider a fog network with NN FNs, defined by 𝒩={1,2,,N}\mathcal{N}=\{1,2,...,N\}. Each FN can process the task from end-user devices or other peer FNs, which is typical in practical fog networking [20].

For conciseness, we classify FNs into the following two categories: (1) Task FN: an FN is a task FN if it offloads workloads to other FNs and does not receive any workload from other FNs; (2) Service FN: an FN is a service FN if it receives workloads from other FNs and does not offload workload to others. Let V and K denote the numbers of task FNs and service FNs in the fog network, respectively. Notice that under this categorization, an FN does not offload workloads to other FNs while simultaneously receiving workloads from them. In other words, each FN plays only one role, either task FN or service FN.

The operational timeline is discretized into time slots (e.g., 10 ms) for making peer offloading decisions. Once FN i receives a task of size b_t^i at time slot t, it can process the task locally or offload it to a neighboring FN. The service rate, which denotes the expected number of consecutive task completions per time slot, helps characterize the computation capability of an FN. Intuitively, FNs with low service rates or whose central processing units (CPUs) are running at full load prefer to offload computation tasks to their neighboring FNs, while FNs with high service rates or abundant idle computation resources prefer to process tasks locally.

Figure 1: Peer offloading scenario.

Next, we discuss the latency of peer computing and local computing, which is our foremost concern. The latency of peer computing contains three parts: (1) latency of task dispatching; (2) latency of task processing; (3) latency of result receiving. The latency of result receiving is negligible due to the relatively small size of computation results [21].

II-B Communication Model

Most works on computation offloading assume that some prior information is available, such as wireless channel state information (CSI) [22, 23]. However, it is not easy to access such information, especially when the number of FNs is large. In this paper, CSI is assumed to be time-varying and unknown before task transmission. Task FN i can offload tasks to service FN j over a wireless channel. The transmission rate is given by

rij=Blog2(1+gijhPiσ2),\displaystyle r^{ij}=B\log_{2}(1+\frac{g^{ij}hP^{i}}{\sigma^{2}}), (1)

where gijg^{ij} is the channel gain between FN ii and jj, and hh is the average fading gain of the FN ii. Notation PiP^{i} is the transmission power of FN ii and σ2\sigma^{2} denotes the noise power spectral density. The bandwidth per node is the same in our model and is given by BB. We denote r¯ij\overline{r}^{ij} as the average transmission rate between FN ii and FN jj during one task transmission. Thus the task dispatching latency can be described as

Dti=btir¯ij.\displaystyle D_{t}^{i}=\frac{b_{t}^{i}}{\overline{r}^{ij}}. (2)

Channel deterioration may lead to failures in the transmission of computation tasks. The task FN can confirm the transmission status by an acknowledgement (ACK) or negative-acknowledgement (NACK) message [24]. Once a NACK message is received, the task FN re-transmits the task using the Automatic Repeat-reQuest (ARQ) or Hybrid Automatic Repeat-reQuest (HARQ) scheme [25].
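To make the communication model concrete, the following Python sketch computes the transmission rate in (1) and the dispatching latency in (2). It is a minimal illustration only; the function names and all numeric values (bandwidth, channel gain, fading gain, transmit power, noise power, task size) are assumed placeholders rather than parameters from the paper's simulations.

```python
import math

def transmission_rate(bandwidth_hz, channel_gain, fading_gain, tx_power_w, noise_power_w):
    """Shannon-type rate of Eq. (1): r = B * log2(1 + g*h*P / sigma^2)."""
    snr = channel_gain * fading_gain * tx_power_w / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)

def dispatch_latency(task_size_bits, avg_rate_bps):
    """Task dispatching latency of Eq. (2): D = b / r_bar."""
    return task_size_bits / avg_rate_bps

# Example with assumed placeholder values.
r = transmission_rate(bandwidth_hz=10e6, channel_gain=1e-3, fading_gain=0.8,
                      tx_power_w=0.5, noise_power_w=1e-9)
d = dispatch_latency(task_size_bits=2e6, avg_rate_bps=r)
print(f"rate = {r / 1e6:.2f} Mbps, dispatch latency = {d * 1e3:.2f} ms")
```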

II-C Computation Model

The majority of existing offloading studies consider an ideal setting in which the task FN obtains the computation latency as soon as tasks arrive at the service FN. However, current operating systems do not support such instantaneous feedback. The initial machine state of the service FN varies before the task executes, and some interrupts may occur during computation. In this paper, the latency feedback is obtained only when the service FN finishes executing the offloaded task.

Next, we define the computation model at the service FNs. According to the Kleinrock approximation [26], the arrival process of tasks can be regarded as a Poisson process. To model the perturbation of the computation process at service FNs, the service time of each FN is assumed to be negative-exponentially distributed (i.e., task completions are generated by a Poisson process). Besides, the statistics of arrival rates vary among FNs since the density of end-users is usually not uniform. The arrival rate λ of an FN is unknown to other FNs.

When a task arrives at its destination, it is cached in the queue if the service FN is busy. Since the computing service rate μ_t^j of FN j at time slot t may be perturbed by system interrupts, we assume that μ_t^j is unknown to all FNs. Let μ̄_t^j denote the average service rate during computation. The computation queue can be modeled as an M/M/1 queue, where M denotes the Markov property. An M/M/1 queueing system indicates that tasks arrive according to a Poisson process and the service times are independent and exponentially distributed, which is consistent with the model described above. Let Q_t^j denote the task queue length of FN j when the task arrives at time slot t.
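As a rough illustration of the queueing model described above, the sketch below simulates a single service FN as an M/M/1-style queue with Poisson task arrivals and exponentially distributed service times. It is a simplified illustration; the arrival rate, service rate, and horizon are assumed values, and the queue length is sampled only at event times.

```python
import random

def simulate_mm1_queue_length(arrival_rate, service_rate, horizon, seed=0):
    """Event-driven simulation of one service FN's queue over `horizon` time units."""
    rng = random.Random(seed)
    t, queue, samples = 0.0, 0, []
    next_arrival = rng.expovariate(arrival_rate)
    next_departure = float("inf")
    while t < horizon:
        if next_arrival <= next_departure:       # a task arrives
            t = next_arrival
            queue += 1
            if queue == 1:                        # server was idle, start serving it
                next_departure = t + rng.expovariate(service_rate)
            next_arrival = t + rng.expovariate(arrival_rate)
        else:                                     # a task finishes service
            t = next_departure
            queue -= 1
            next_departure = (t + rng.expovariate(service_rate)
                              if queue > 0 else float("inf"))
        samples.append(queue)                     # queue length sampled at event times
    return samples

lengths = simulate_mm1_queue_length(arrival_rate=5.0, service_rate=6.0, horizon=1000.0)
print("event-averaged queue length (rough indicator):", sum(lengths) / len(lengths))
```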

Without loss of generality, we assume that the computation workload brought by a task at a service FN is proportional to the received task size, i.e., k b_t^i with k ∈ R⁺. Thus, the computation latency at FN j is

Ctj=Qtj+kbtiμ¯tj.\displaystyle C_{t}^{j}=\frac{Q_{t}^{j}+kb_{t}^{i}}{\bar{\mu}_{t}^{j}}. (3)

Equation (3) gives an estimate of the computation latency.

II-D Problem Formulation

Combining the communication latency and computation latency, the total peer offloading latency is

\displaystyle O_{t}^{j}=C_{t}^{j}+D_{t}^{i}=\frac{Q_{t}^{j}+kb_{t}^{i}}{\bar{\mu}_{t}^{j}}+\frac{b_{t}^{i}}{\overline{r}^{ij}}. (4)

In practice, the offloading latency cannot be infinite; we bound it within [0, T_max], where T_max is the maximal tolerable latency. Thus, the loss l_t^j of each choice is described as

ltj={OtjTmax,ifOtj<Tmax,1,otherwise,\displaystyle l_{t}^{j}=\begin{cases}\frac{O_{t}^{j}}{T_{\max}},&\text{if}~{}O_{t}^{j}<T_{\text{max}},\\ 1,&\text{otherwise},\end{cases} (5)

where ltj[0,1]l_{t}^{j}\in[0,1]. We want to minimize the average latency of each FN in the long term and it can be formulated as

\begin{split}&\text{minimize}~~~\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{K}l_{t}^{i}~\mathbbm{1}\{I_{t}=i\}\\ &\text{subject to}~~~~~I_{t}\in\mathcal{I},\quad t=1,2,...,T,\end{split} (6)

where I_t denotes the index of the FN selected for offloading at time slot t, ℐ denotes the set of all offloading choices, and 𝟙{·} is the indicator function. In our model, the channel conditions and task queues of the service FNs are unknown to the task FN. Moreover, the task FN cannot observe the latency of a decision until the corresponding feedback arrives. Finding the optimal solution of (6) is therefore challenging with such non-real-time information.
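The following minimal sketch pieces together the computation latency in (3), the total offloading latency in (4), and the normalized loss in (5) for a single offloading decision; the input values are assumed placeholders in normalized units.

```python
def computation_latency(queue_len, task_size, k, avg_service_rate):
    """Eq. (3): C = (Q + k*b) / mu_bar."""
    return (queue_len + k * task_size) / avg_service_rate

def offloading_latency(queue_len, task_size, k, avg_service_rate, avg_tx_rate):
    """Eq. (4): total latency = computation latency + dispatching latency."""
    return (computation_latency(queue_len, task_size, k, avg_service_rate)
            + task_size / avg_tx_rate)

def normalized_loss(latency, t_max):
    """Eq. (5): latency normalized by T_max and clipped to [0, 1]."""
    return min(latency / t_max, 1.0)

# Example with assumed placeholder values.
lat = offloading_latency(queue_len=3, task_size=1.0, k=1.0,
                         avg_service_rate=6.0, avg_tx_rate=2.0)
print("latency:", lat, "loss:", normalized_loss(lat, t_max=5.0))
```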

Online learning is a well-established paradigm for making a sequence of decisions given the knowledge of previous prediction feedback. Besides, online learning can reduce the computation complexity and has been investigated in many studies. By integrating an online learning scheme, the challenging optimization problem in (6) can be converted into a low-complexity sequential decision problem at each time slot.

A vanilla online learning algorithm operates in a fixed order: making a choice, receiving feedback, and updating the learning model. Such typical online learning algorithms may not work under the delayed feedback setting, where the order of the received feedback may not be in line with the order of the dispatches. The lag of the feedback and the disorder of the observations bring significant challenges to vanilla online learning algorithms.

III Single-agent setting

In this section, we first introduce the adversarial multi-armed bandit (MAB) framework, and then propose an offloading algorithm based on the exponentially weighted strategy.

III-A Adversarial multi-armed bandit

The adversarial MAB framework is efficient in dynamic environments. The offloading problem in section II can be modeled as an adversarial MAB consisting of a set of players (task FNs), each of which chooses an action (a service FN) from its action set (the available service FNs). When the task FN makes a choice at time slot s, the offloading loss is observed by the task FN after d_s time slots. The offloading action is taken at slot s and the loss is obtained at slot t, so we have s + d_s = t.

To measure the utility of a task offloading scheme Φ, the concept of expected regret is introduced. The expected regret compares the cumulative loss of Φ with that of the best strategy in hindsight. The desired Φ minimizes its expected cumulative regret R_Φ(T), which is defined as

RΦ(T)=𝔼[t=1Tls|ta]minAt=1TltA,\displaystyle R_{\Phi}(T)=\mathbb{E}\Big{[}\sum_{t=1}^{T}l_{s|t}^{a}\Big{]}-\min_{A}\sum_{t=1}^{T}l_{t}^{A}, (7)

where 𝔼(·) represents the expectation operation, a ∈ {1,...,K} denotes the action, and A denotes the best policy, which gives the least loss. Notation l_{s|t}^a represents the loss of action a taken at slot s and observed at slot t, and l_t^A denotes the loss of the optimal policy A.

In the adversarial MAB setting, the task FN chooses service FN j using a mixed strategy according to a probability distribution p_t = [p_t^1, p_t^2, ..., p_t^K] [27]. Due to the variation of the computation lag, the latency results may be observed out of order. We then introduce an algorithm to handle the delayed feedback and make the task offloading decisions.

III-B Proposed delayed exponential based algorithm

Let ℒ_t denote the set of feedback arriving at time slot t and |ℒ_t| denote its size. Recall that the feedback arriving at slot t includes losses incurred at earlier slots s. Once ℒ_t is revealed, the FN computes the estimated losses at the current slot. For each l_{s|t}^j ∈ ℒ_t, a biased estimator l̃_{s|t}^j that reduces the estimation variance of the loss vector is

l~s|tj=ls|tjpsnj+βls|tj𝟙{Is=j},\displaystyle\tilde{l}_{s|t}^{j}=\frac{l_{s|t}^{j}}{p_{s_{n}}^{j}+\beta l_{s|t}^{j}}\mathbbm{1}_{\{I_{s}=j\}}, (8)

where β\beta is the biased factor of the loss for implicit exploration[28].

During a time slot t, the task FN may collect more than one feedback message or none at all. Once feedback is received, it is buffered and then used to update p_{t+1} at the beginning of slot t+1. If the offloading feedback is returned within [0, d_max], we regard the offloading action as successful. On the other hand, if d_s ≥ d_max, the task FN regards the action as a failure because of the intolerable latency. In this case, the task FN updates p_t with loss l_{s|t} = 1, and the computation task is re-offloaded after d_max according to the selection probability. For l_{s|t}^i ∈ ℒ_t, the task FN i updates the weights and probabilities of all service FNs with

l~tj\displaystyle\tilde{l}_{t}^{j} =n=1|t|l~s|tj,\displaystyle=\sum_{n=1}^{|\mathcal{L}_{t}|}\tilde{l}_{s|t}^{j}, (9)
wtj\displaystyle w_{t}^{j} =wt1jexp(ηtltj~),\displaystyle=w_{t-1}^{j}\exp(-\eta_{t}\tilde{l_{t}^{j}}), (10)
ptj\displaystyle p_{t}^{j} =wtjkKwtk,\displaystyle=\frac{w_{t}^{j}}{\sum_{k\in K}w_{t}^{k}}, (11)

where η\eta denotes the step size. If t=\mathcal{L}_{t}=\emptyset, the task FN directly reuses the previous distribution, i.e., 𝒑t+1=𝒑t\bm{p}_{t+1}=\bm{p}_{t}, and the offloading strategy does not change at all. When the feedback is received without lag, the updating scheme of the proposed algorithm is similar to the vanilla EXP3-IX algorithm [28].

Algorithm 1 Delayed Exponential Based (DEB) Algorithm.
1:Initialize:
2:      Set 𝒑0=(1K,,1K)\bm{p}_{0}=(\frac{1}{K},...,\frac{1}{K}) for all FN j𝒦j\in\mathcal{K}.
3:for t=1,2,,Tt=1,2,...,T do
4:     Offload task to service FN according to 𝒑t1\bm{p}_{t-1}.
5:     Feedback collected in set t\mathcal{L}_{t}.
6:     for n=1,2,,|t|n=1,2,...,|\mathcal{L}_{t}| do
7:         Estimate ls|tjl_{s|t}^{j} via (8) if ls|tjtl_{s|t}^{j}\in\mathcal{L}_{t}.
8:         Collect the loss of each FN by (9).
9:     end for
10:     if ds>=dmaxd_{s}>=d_{\max} then
11:         ls=1l_{s}=1.
12:     end if
13:     Update wtjw_{t}^{j} and ptjp_{t}^{j} by (10) and (11).
14:end for

Algorithm 1 summarizes the proposed DEB algorithm for processing the task offloading problem with delayed feedback. Initially, the probability of each FN is set as 1/K1/K. The task FN makes the offloading choice according to the probability distribution 𝒑t1\bm{p}_{t-1}. If fresh feedback arrives, the task FN calculates the estimated loss and updates the weight of each service FN. If not, the task FN makes the decision according to the previous distribution.
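To illustrate how the estimator in (8), the aggregation in (9), and the weight and probability updates in (10)-(11) interact with delayed feedback, a compact Python sketch of the DEB update step is given below. It is a simplified single-task-FN illustration under assumed parameters (η, β, the number of service FNs, and the example loss), not the full implementation used in the simulations.

```python
import math
import random

class DEB:
    """Sketch of the Delayed Exponential Based update of Algorithm 1."""
    def __init__(self, num_fns, eta, beta, seed=0):
        self.K = num_fns
        self.eta, self.beta = eta, beta
        self.weights = [1.0] * num_fns           # w_0^j = 1 for every service FN
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]  # Eq. (11)

    def choose_fn(self):
        """Sample a service FN index according to the mixed strategy p_t."""
        return self.rng.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, feedback):
        """`feedback` lists (chosen_fn, loss, p_at_dispatch) tuples whose
        delayed losses arrived during the current slot."""
        est = [0.0] * self.K
        for fn, loss, p_dispatch in feedback:
            # Biased (implicit-exploration) estimator of Eq. (8), summed as in Eq. (9).
            est[fn] += loss / (p_dispatch[fn] + self.beta * loss)
        for j in range(self.K):
            self.weights[j] *= math.exp(-self.eta * est[j])   # Eq. (10)

# Usage with assumed parameters: dispatch, then apply feedback arriving two slots later.
agent = DEB(num_fns=6, eta=0.05, beta=0.025)
p_t = agent.probabilities()
chosen = agent.choose_fn()
agent.update([(chosen, 0.4, p_t)])   # a loss of 0.4 observed after some delay
```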

III-C Performance analysis

In this part, we analyze the regret of the proposed delayed exponential based algorithm.

Theorem 1.

Let T denote the time horizon of the learning procedure and D = ∑_{t=1}^{T} d_t denote the total delay. Define |ℳ| to be the regret caused by all the samples that are not received before round T. The regret |ℳ| is bounded because of the intolerable feedback delay threshold d_max. With probability at least 1−δ, the regret of the proposed algorithm satisfies

¯\displaystyle\bar{\mathcal{R}} lnKη+ln(K/δ)2β+(ηe2+β)(KT+Kln(K/δ)2β)\displaystyle\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+(\frac{\eta e}{2}+\beta)(KT+\frac{K\ln(K/\delta)}{2\beta})
+Dη+||.\displaystyle+D\eta+|\mathcal{M}|.

In particular, setting η=2β=lnK+ln(K/δ)D+(e+1)KT/2\eta=2\beta=\sqrt{\frac{\ln K+\ln(K/\delta)}{D+(e+1)KT/2}}, we have

¯\displaystyle\bar{\mathcal{R}}\leq 2(D+(e+1)2KT)(2lnKln(δ))\displaystyle 2\sqrt{\Big{(}D+\frac{(e+1)}{2}KT\Big{)}\Big{(}2\ln K-\ln({\delta})\Big{)}}
+(e+1)2Kln(Kδ)+||.\displaystyle+\frac{(e+1)}{2}K\ln(\frac{K}{\delta})+|\mathcal{M}|. (12)
Proof of Theorem 1.

Given in Appendix A. ∎

According to Theorem 1, the regret of the proposed algorithm can be upper bounded by a sub-linear order by choosing proper update parameters. This result indicates that the per-slot regret growth decreases as time goes by, so the proposed algorithm eventually converges to the optimal service FN selection.
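As a numerical illustration, the short sketch below evaluates the step size η = 2β prescribed by Theorem 1 for assumed values of K, T, the total delay D, and the confidence level δ.

```python
import math

def deb_step_size(num_fns, horizon, total_delay, delta):
    """Step size from Theorem 1:
    eta = 2*beta = sqrt((ln K + ln(K/delta)) / (D + (e+1)*K*T/2))."""
    numerator = math.log(num_fns) + math.log(num_fns / delta)
    denominator = total_delay + (math.e + 1.0) * num_fns * horizon / 2.0
    eta = math.sqrt(numerator / denominator)
    return eta, eta / 2.0   # (eta, beta)

# Assumed example values: K = 6 service FNs, T = 30000 slots,
# an average delay of 2 slots (so D = 2T), and confidence delta = 0.05.
eta, beta = deb_step_size(num_fns=6, horizon=30000, total_delay=60000, delta=0.05)
print(f"eta = {eta:.4f}, beta = {beta:.4f}")
```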

IV Multi-agent setting

In most cases, the fog network consists of multiple task FNs. In this section, we extend our work to the multi-agent setting. We consider a game where all players play according to the proposed algorithm with delayed feedback. Without delays, an algorithm with sublinear regret, played against itself, converges to a Nash equilibrium (NE) in the ergodic average sense [29]. In this section, we further investigate our proposed algorithm in the multi-agent setting and prove that, even with delayed feedback, the actions of the task FNs adopting the proposed algorithm achieve an NE.

IV-A Problem Formulation

In the multi-agent setting, there are V active task FNs and K independent service FNs. In our paper, a time slot is divided into sub-slots, and the V task FNs are allocated to access the K service FNs in each sub-slot. In most cases, this assumption holds since not all task FNs are active to offload tasks at the same time. At each time slot, each active task FN chooses one service FN to offload its computation tasks based on its local observations. In addition, task FNs do not share information with each other due to information security considerations. In our framework, different task FNs access service FNs in a time-division manner. Collisions occur when multiple task FNs choose the same service FN. Two collision models are considered in our paper. One is the ‘winner-takes-all’ model: when a collision occurs, the computation task of one task FN is processed and the rest are not processed. In the other model, the service FN accepts all arriving tasks and arranges them in a random order in the queue without queue jumping. In the multi-agent setting, the total regret under a policy Φ by time T extends (7) and is given by

RΦ(T)=𝔼[v=1Vt=1Tls|ta(v)]minAv=1Vt=1TltA(v),\displaystyle R_{\Phi}(T)=\mathbb{E}\Big{[}\sum_{v=1}^{V}\sum_{t=1}^{T}l_{s|t}^{a}(v)\Big{]}-\min_{A}\sum_{v=1}^{V}\sum_{t=1}^{T}l_{t}^{A}(v), (13)

where l_{s|t}^a(v) denotes the loss of task FN v. The regret derivation of the proposed algorithm in the multi-agent setting is the same as in the single-agent case, so here we focus on the NE property of the proposed strategy.

An NE is a set of strategies, one for each of the players of a game, such that each player’s choice is a best response to the choices of the other players [30]. Besides, an NE has the self-stability property: the users at the equilibrium achieve a mutually satisfactory solution and no user has an incentive to deviate. In the multi-agent decentralized offloading scenario, an NE means no task FN has an incentive to change its mixed strategy if all other FNs keep their strategies unchanged. This property is crucial to the decentralized computation offloading problem, since the task FNs may act in their own interests. For each task FN, achieving an NE means its current strategy is its optimal response, thus yielding a minimal-latency solution.

In the following discussion, we investigate the NE property of the proposed algorithm in a two-player zero-sum case.

IV-B Equilibrium Formulation

Let U be the cost matrix, such that when the row player plays i and the column player plays j, the former pays a cost of U(i,j) and the latter gains a reward of U(i,j) (i.e., incurs a cost of −U(i,j)), where 0 ≤ U(i,j) ≤ 1 for any i, j. Let p_t and q_t represent the mixed strategies of the row player and the column player; we then use the convention that

U(𝒑t,j)i=1Kpt(i)U(i,j),\displaystyle U(\bm{p}_{t},j)\triangleq\sum_{i=1}^{K}p_{t}^{(i)}U(i,j), (14)

and

U(𝒑t,𝒒t)i=1Kj=1Kpt(i)qt(j)U(i,j).\displaystyle U(\bm{p}_{t},\bm{q}_{t})\triangleq\sum_{i=1}^{K}\sum_{j=1}^{K}p_{t}^{(i)}q_{t}^{(j)}U(i,j). (15)

The NE is a critical concept in game theory to predict the outcome of a game. An NE is a strategy profile (p_t^*, q_t^*) such that no player has a motivation to change its strategy, since any unilateral change would degrade its benefit. To obtain the desired strategy, we define the set of all approximate NE:

Definition 1.

The set of all ϵ\epsilon-NE points is

\displaystyle\mathcal{N}_{\epsilon}=\Bigl\{(\bm{p}^{*},\bm{q}^{*})~|~U(\bm{p}^{*},\bm{q}^{*})\leq\min_{\bm{p}}U(\bm{p},\bm{q}^{*})+\epsilon,
\displaystyle U(\bm{p}^{*},\bm{q}^{*})\geq\max_{\bm{q}}U(\bm{p}^{*},\bm{q})-\epsilon\Bigr\}, (16)

and the set of NE points is 𝒩0\mathcal{N}_{0}.

The entity that converges to the set of NE in our case is the ergodic average of (𝒑t,𝒒t)(\bm{p}_{t},\bm{q}_{t}).

Definition 2.

The ergodic average of a sequence of distributions 𝐩𝐭\bm{p_{t}} is defined as:

𝒑¯tτ=1t𝒑τt.\displaystyle\bm{\bar{p}}_{t}\triangleq\frac{\sum_{\tau=1}^{t}\bm{p}_{\tau}}{t}. (17)

We say that (𝐩¯T,𝐪¯T)(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T}) converges in L1L1 to the set of NE if

\displaystyle\lim_{T\rightarrow\infty}E\Bigl\{\min_{(\bm{p}_{T}^{*},\bm{q}_{T}^{*})\in\mathcal{N}_{0}}||(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-(\bm{p}_{T}^{*},\bm{q}_{T}^{*})||_{1}\Bigr\}=0, (18)

which also implies that ϵ>0\forall\epsilon>0,

\displaystyle\lim_{T\rightarrow\infty}P\Bigl\{\min_{(\bm{p}_{T}^{*},\bm{q}_{T}^{*})\in\mathcal{N}_{0}}||(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-(\bm{p}_{T}^{*},\bm{q}_{T}^{*})||_{1}\geq\epsilon\Bigr\}=0, (19)

where ||·||_1 represents the L_1 norm.

The main theoretical analysis, given in Theorem 2, generalizes this statement to the case with delayed feedback and shows that the proposed algorithm converges to an NE despite the feedback lag. Note that the convergence of the ergodic mean to the set of NE is in the L_1 norm sense, which is much stronger than the convergence of the expected ergodic mean.
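As an illustration of Definitions 1 and 2, the sketch below computes the ergodic average of a sequence of mixed strategies and checks whether a strategy pair is an ε-NE of a given zero-sum cost matrix; since the expected cost is linear in each player's strategy, checking pure-strategy deviations suffices. The cost matrix and strategy trajectories are assumed toy values.

```python
import numpy as np

def ergodic_average(strategies):
    """Eq. (17): time average of a sequence of mixed strategies (rows of a 2-D array)."""
    return np.asarray(strategies).mean(axis=0)

def is_epsilon_ne(U, p, q, eps):
    """Definition 1 for a zero-sum game with cost matrix U: the row player cannot
    reduce the cost by more than eps, and the column player cannot raise it by
    more than eps, through any unilateral deviation."""
    value = p @ U @ q
    best_row = (U @ q).min()   # best pure response of the row (cost-minimizing) player
    best_col = (p @ U).max()   # best pure response of the column (cost-maximizing) player
    return value <= best_row + eps and value >= best_col - eps

# Toy example: matching-pennies-like cost matrix with assumed strategy trajectories.
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p_bar = ergodic_average([[0.6, 0.4], [0.5, 0.5], [0.45, 0.55]])
q_bar = ergodic_average([[0.4, 0.6], [0.5, 0.5], [0.55, 0.45]])
print(is_epsilon_ne(U, p_bar, q_bar, eps=0.1))
```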

Theorem 2.

Let two players play a zero-sum game with a cost matrix UU, where 0U(i,j)10\leq U(i,j)\leq 1 for each i,ji,j by the proposed algorithm. Denote the delay sequences of the row player and the column player as {dtr}\{d_{t}^{r}\} and {dtc}\{d_{t}^{c}\}. Let the mixed strategies of the row and column players at round tt be ptp_{t} and qtq_{t}, respectively. Then as TT\rightarrow\infty, the game played by the proposed algorithm versus itself converges to NE.

Proof of Theorem 2.

Given in Appendix B. ∎

By regarding the adversary’s action as the merged actions of the other players, the two-player setting can be generalized to the multi-player setting [31].

V Numerical Results

In this section, we evaluate the experimental performance of the proposed DEB algorithm. In all simulations, the task size of each FN is equal and normalized to 1. The duration of each time slot is also normalized to 1.

V-A Single-agent Setting

(a) Regret performance. (b) Latency performance. (c) Statistical selection.
Figure 2: Performance comparison of the QPM-D, SDB, DUCB, BLOT, DEB and non-delayed EXP3 algorithms for the single-agent setting in a stationary environment.

In the first part, we examine the performance of the DEB algorithm in the single-agent setting. Consider a fog network consisting of 7 FNs. The number of task FNs is 1, and the rest are service FNs. The task FN is deployed at the center of a circle with a 2 km radius, and the service FNs are randomly deployed within this range. In particular, the path-loss channel is generated according to the 3GPP propagation environment [24]. The total number of time slots T is set to 30000. Besides, to model the dynamic computation perturbation, the service rates μ of the FNs follow Poisson distributions whose means are generated from the Gaussian distribution 𝒩(6,1). To normalize the offloading latency, we set T_max to 5. The threshold d_max that prevents the feedback delay from becoming too large is set to 3.

In fog computing, to the best of our knowledge, there does not exist any literature considering delayed information feedback. To evaluate the performance of the proposed DEB algorithm, we compare it with the following schemes that can be implemented in the delayed feedback setting. (i) Queued Partial Monitoring with Delayed Feedback (QPM-D) algorithm [32]: the feedback of each offloading decision is assigned to the queue corresponding to the selected service FN, and the feedback information is employed when that service FN is chosen; (ii) Stochastic Delayed Bandits (SDB) algorithm [33]: an algorithm that combines QPM-D for the main update scheme with a heuristic scheme for controlling the sampling distribution to improve performance in the presence of delay; (iii) Delayed Upper Confidence Bound (DUCB) algorithm [34]: an algorithm that exploits the delayed feedback to adjust the reward update in the classical UCB framework; (iv) Bandit Learning-based Offloading of Tasks (BLOT) algorithm [15]: an offloading algorithm that selects the service FNs by means of the currently available information; (v) Non-delayed offloading: each task FN receives the feedback instantly and uses the Exponential-weight algorithm for Exploration and Exploitation (EXP3) presented in [27]. Here, QPM-D and SDB are meta schemes for handling delayed feedback, and we adopt the Thompson sampling algorithm to implement them. Note that the BLOT algorithm is not designed for the latency minimization problem, so a slight modification is made to make BLOT work in this setting.

In the first experiment, we consider a stationary setting in which the arrival rates of the FNs are constant over time. The arrival rate of FN i is set to λ_i = 4.5 + 0.5×i, suggesting that FN 1 achieves the best delay performance statistically. In Fig. 2, we plot the regret performance, latency performance, and service FN selections of the DEB, QPM-D, SDB, DUCB, BLOT, and non-delayed EXP3 algorithms. In addition, the latency of the optimal choice (the task FN always selects service FN 1) is plotted to show the performance gap of the above algorithms. Here, 1000 Monte Carlo simulations are performed.

Fig. 2 (a) depicts the regret performance of the offloading decisions made by the above algorithms. The y-axis represents the ratio of the aggregate regret, revealing the regret order of the DEB, QPM-D, SDB, DUCB, BLOT, and non-delayed EXP3 algorithms. The convergence of the proposed algorithm reflects the learning process of the environment, such as the channel conditions and the computation capacities of neighboring FNs. From Fig. 2 (a), we observe that the DEB algorithm achieves better regret performance than the QPM-D, DUCB, BLOT, and SDB algorithms. The performance discrepancy among the DEB, QPM-D, and SDB algorithms is caused by how the delayed feedback is utilized. The proposed DEB algorithm harnesses the delayed feedback as soon as it arrives, whereas the QPM-D and SDB algorithms first store the delayed feedback of a service FN in its queue and update only when that service FN is selected. With delayed feedback, such updating schemes degrade the regret performance. As for the performance gap between the DUCB, BLOT, and DEB algorithms, the DUCB and BLOT algorithms focus on recent feedback but lose sight of the entire history, which results in a larger performance loss. Without knowledge of other FNs’ conditions and peer competition, the FN adopting the DEB algorithm can always adapt to the environment and make a proper choice.

Fig. 2 (b) shows the average offloading latency in the single-agent setting. The offloading latency of the QPM-D and SDB algorithms decreases fast at the beginning but only reaches sub-optimal latency performance later. Moreover, the converged latencies of the DUCB and BLOT algorithms are not as good as that of the DEB algorithm. The reason is that the DUCB and BLOT algorithms adopt a discounted UCB style to handle the delayed feedback, so the exploration procedure pays more attention to recent feedback; as a result, the two algorithms cannot sufficiently identify the best service FN. The proposed algorithm can fully explore the service FNs and then pick the best FN, thus revealing better average latency performance. This figure also complements the explanation of the results in Fig. 2 (a).

Figure 3: Performance comparison of the QPM-D, SDB, DUCB, BLOT and DEB algorithms in a dynamic environment.

In Fig. 2 (c), the statistical selections of the DEB, QPM-D, SDB, DUCB, BLOT, and EXP3 algorithms are depicted under the same simulation setup. According to this setup, FN 1 has the smallest expected loss and FN 6 has the largest expected loss. Fig. 2 (c) reveals that the task FN selects the best FN most often with the DEB algorithm. The task FN rarely offloads tasks to the FN with the largest loss, and most of those selections occur during the exploration period. Combining the three sub-figures of Fig. 2, we conclude that the DEB algorithm adapts better to the changing environment than the other algorithms, thus yielding the smallest regret.

In the second experiment, the performance of the proposed algorithm in a dynamically changing network environment is verified. In the first T/2 time slots, the arrival rate of FN i is set as λ_i = 4.5 + 0.5×i, which is the same as in the first experiment. In the last T/2 time slots, the arrival rate of FN 1 changes to 7; FN 2 then starts to outperform FN 1 and eventually becomes the leader. This setting is more practical, as the arrival rates of FNs may change due to the dynamics of the network structure. This experiment examines the robustness of the proposed algorithm.

In Fig. 3, we explore the average latency performance of the DEB, QPM-D, DUCB, BLOT, and SDB algorithms in the dynamic setting. Here, 1000 Monte Carlo simulations are performed. From Fig. 3, we observe that the performance tendencies of all algorithms change at T/2 as the network statistics vary, and the deteriorated performance recovers gradually over the remaining time. Compared with the QPM-D, SDB, DUCB, and BLOT algorithms, the DEB algorithm recovers faster when the task FN experiences a variation of network statistics, suggesting better robustness. This is because the DEB algorithm selects the service FNs according to a probability distribution, and the implicit exploration factor β helps it adapt to the dynamic environment.

Figure 4: Latency performance of the DEB algorithm in the single-agent setting with different numbers of service FNs.

To evaluate the scalability of the DEB algorithm, we examine its latency performance with different network sizes in Fig. 4. Here, the number of service FNs is set as K = {20, 100, 500, 1000}. FN 1 is set as the best service FN with arrival rate 5, while the arrival rates of the remaining service FNs are uniformly generated from [6, 10]. From Fig. 4, we observe that the latency performance deteriorates as the number of service FNs increases, since additional service FNs add to the cost of exploration, which results in worse average latency. However, as time goes by, the average latency decreases, and the task FN with the DEB algorithm eventually picks the best service FN.

V-B Multi-agent Setting

In this part, we investigate the performance of the proposed algorithm in the multi-agent setting. We set V = 4 and K = 10. The arrival rate of each FN i is set as λ_i = 4 + 0.4×i. In the multi-agent scenario, the DUCB and BLOT algorithms cause severe collisions in their exploration stages, so the two algorithms are not applicable to the multi-agent setting. We therefore only compare the performance of the DEB, QPM-D, and SDB algorithms here.

Figure 5: Performance comparison of the QPM-D, SDB and DEB algorithms in the multi-agent setting.
Figure 6: Service FN selections of the DEB algorithm in the multi-agent setting.

Fig. 5 explores the multi-agent performance of the DEB, SDB, and QPM-D algorithms. Here, we adopt a weak-collision setting, where a service FN processes multiple arrivals in a random order. In this setting, the task FNs suffer only a slight collision loss when multiple tasks arrive at a service FN. Fig. 5 plots the average latency of different users with the above algorithms. Compared with the SDB and QPM-D algorithms, the DEB algorithm performs worse at the beginning but becomes better as time goes by. The DEB algorithm chooses service FNs according to a probability distribution, implying sufficient exploration of all service FNs; such a selection rule is more suitable for a dynamic Markovian environment. In addition, Fig. 6 shows the service FN selections of the DEB algorithm for different agents. From Fig. 6, we observe that the service FNs with low arrival rates are selected nearly equally often by the four agents. In this setting, the loss caused by collisions is slight; therefore, the agents prefer to share the best FNs when the offloading strategy becomes stable.

Figure 7: Service FN selections in different time periods under the ‘winner-takes-all’ setting.

In the next experiment, we investigate the performance of the DEB algorithm in a ‘winner-takes-all’ setting, where only one task FN obtains the real loss and the rest suffer the maximal loss 1. To reveal the NE property of the proposed algorithm, we consider only 2 agents here. The number of service FNs is set to 6, and the arrival rates of the service FNs are set as [4, 4, 5, 6, 7, 8]. It can be observed that the NE points in this setting are: (i) task FN 1 chooses service FN 1 while task FN 2 chooses service FN 2; (ii) task FN 1 chooses service FN 2 while task FN 2 chooses service FN 1. Fig. 7 shows the task FNs’ selections during the periods [0, T/3], (T/3, 2T/3], (2T/3, T], and [0, T], respectively. In the early stage of offloading, the two task FNs prefer to share service FN 1 and FN 2, which results in many collisions when their choices coincide. After suffering the performance loss from collisions, the two agents become smarter and gradually avoid selecting the same FN. Eventually, the two task FNs reach the NE point where task FN 1 chooses service FN 2 and task FN 2 chooses service FN 1. This simulation result indicates that, even with delayed feedback, the proposed algorithm converges to an NE in the ergodic average sense.

VI Conclusions

Considering the unknown states of peer FNs and dynamic environmental changes, in this paper we studied the minimal-latency peer offloading problem in fog networks. We modeled the task offloading problem as an adversarial multi-armed bandit problem with delayed feedback. To solve this problem, we proposed an efficient DEB algorithm that extends the general exponential-based algorithm to the case with delayed feedback. The DEB algorithm is proven to achieve a sub-linear regret in both the single-agent and multi-agent settings. Moreover, the DEB algorithm leads to an NE even with delayed feedback in the multi-agent setting. Theoretical and numerical results validated the efficiency of the DEB algorithm.

Appendix A

Proof of Theorem 1.

Since the initial weight of each action is set as 11, for all sequences of actions drawn by the proposed algorithm, we have

WTW0\displaystyle\frac{W_{T}}{W_{0}} maxjexp(ηs=1Tl~sj)K\displaystyle\geq\frac{\max\limits_{j}\exp(-\eta\sum_{s=1}^{T}\tilde{l}_{s}^{j})}{K}
exp(ηminjs=1Tl~sj)K.\displaystyle\geq\frac{\exp(-\eta\min\limits_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j})}{K}.

Then taking logarithms,

lnWTW0ηminjs=1Tl~sjlnK.\displaystyle\ln\frac{W_{T}}{W_{0}}\geq-\eta\min_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j}-\ln K. (20)

On the other hand,

WtWt1\displaystyle\frac{W_{t}}{W_{t-1}} =iwt1iWt1s:s+ds=texp(ηl~si)\displaystyle=\sum_{i}\frac{w_{t-1}^{i}}{W_{t-1}}\prod_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i})
=iptas:s+ds=texp(ηl~si)\displaystyle=\sum_{i}p_{t}^{a}\prod_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i})
iptas:s+ds=texp(ηl~si)\displaystyle\leq\sum_{i}p_{t}^{a}\sum_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i}) (21)
iptas:s+ds=t(1ηl~si+η22(l~si)2)\displaystyle\leq\sum_{i}p_{t}^{a}\sum_{s:s+d_{s}=t}\Big{(}1-\eta\tilde{l}_{s}^{i}+\frac{\eta^{2}}{2}(\tilde{l}_{s}^{i})^{2}\Big{)} (22)
=s:s+ds=t(1ηiptil~si+η22ipti(l~si)2)\displaystyle=\sum_{s:s+d_{s}=t}\Big{(}1-\eta\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}+\frac{\eta^{2}}{2}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}
=1+|{s:s+ds=t}|1ηs:s+ds=tiptil~si\displaystyle=1+\Big{|}\{s:s+d_{s}=t\}\Big{|}-1-\eta\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22s:s+ds=tipti(l~si)2\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}
exp(|{s:s+ds=t}|1ηs:s+ds=tiptil~si\displaystyle\leq\exp\Big{(}\Big{|}\{s:s+d_{s}=t\}\Big{|}-1-\eta\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22s:s+ds=tipti(l~si)2).\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}. (23)

From exp(−ηl̃_s^i) ∈ (0,1], ∀i, we obtain (21); the inequality (22) follows from e^x ≤ 1 + x + x²/2 for x ≤ 0; and inequality (23) is derived from e^x ≥ 1 + x, ∀x.

By a telescoping sum, we have

WTW0\displaystyle\frac{W_{T}}{W_{0}} exp(ηts:s+ds=tiptil~si\displaystyle\leq\exp\Big{(}-\eta\sum_{t}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22ts:s+ds=tipti(l~si)2).\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}. (24)

Note that the sums of the form ∑_t ∑_{s:s+d_s=t} include each value of s only once, and some feedback may not be obtained before round T. Since the feedback is delayed by at most d_max, the regret caused by missing samples is also bounded. Let |ℳ| denote the regret caused by missing samples. Then we have,

lnWTW0\displaystyle\ln\frac{W_{T}}{W_{0}} ηtiptil~si\displaystyle\leq-\eta\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22tipti(l~si)2+||.\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}+|\mathcal{M}|. (25)

Combining (20) and (25), we can get

ηminjs=1Tl~sjlnK\displaystyle-\eta\min_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j}-\ln K ηtiptil~si\displaystyle\leq-\eta\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22tipti(l~si)2+||.\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}+|\mathcal{M}|.

Rearranging the above equation, we have

tipt+dtil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t+d_{t}}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j} 12ηtipt+dti(l~ti)2\displaystyle\leq\frac{1}{2}\eta\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
+lnKη+||.\displaystyle+\frac{\ln K}{\eta}+|\mathcal{M}|. (26)

First, we need to characterize the relationship between p_{t+d_t}^i and p_t^i.

Lemma 1.

Let N_t = |{s : s+d_s ∈ [t, t+d_t)}| denote the number of feedback messages that arrive between playing an action and observing its feedback. For a subset C of the integers, corresponding to time steps, let

qi(C)=exp(ηsCl~si)jexp(ηsCl~sj).\displaystyle q^{i}(C)=\frac{\exp(-\eta\sum_{s\in C}\tilde{l}_{s}^{i})}{\sum_{j}\exp(-\eta\sum_{s\in C}\tilde{l}_{s}^{j})}. (27)

Then the resulting probabilities fulfill for every time slot tt and action ii,

pt+dtiptiηm=1Ntqi(Ct1{zn:n<m})l~zii,\displaystyle p_{t+d_{t}}^{i}-p_{t}^{i}\leq-\eta\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z^{\prime}_{n}:n<m\})\tilde{l}_{z^{\prime}_{i}}^{i}, (28)

where znz^{\prime}_{n} is an enumeration of {s:s+ds[t,t+dt)}\{s:s+d_{s}\in[t,t+d_{t})\}.

Proof of Lemma 1.

Refer to Lemma 10 of [35]. ∎

By using Lemma 1 to bound the left hand side (LHS) of (26), we have

tiptil~ti\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i} minjtl~tj12ηtipt+dti(l~ti)2\displaystyle-\min_{j}\sum_{t}\tilde{l}_{t}^{j}\leq\frac{1}{2}\eta\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
+ηtil~tim=1Ntqi(Ct1{zn:n<m})l~zni\displaystyle+\eta\sum_{t}\sum_{i}\tilde{l}_{t}^{i}\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z^{\prime}_{n}:n<m\})\tilde{l}_{z^{\prime}_{n}}^{i}
+lnKη+||.\displaystyle+\frac{\ln K}{\eta}+|\mathcal{M}|. (29)

Next, we bound the second term on the right-hand side (RHS) of (29). Since t is not part of the enumeration z_j′, the two expectations can be taken independently: 𝔼[l̃_{z_j′}^i] ≤ 1 and 𝔼[l̃_t^i] ≤ 1. Then, we have

E[til~tim=1Ntqi(Ct1{zn:n<m})l~zni]tNt.\displaystyle E\Big{[}\sum_{t}\sum_{i}\tilde{l}_{t}^{i}\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z_{n}^{\prime}:n<m\})\tilde{l}_{z_{n}^{\prime}}^{i}\Big{]}\leq\sum_{t}N_{t}.

Counting in how many intervals each loss is observed is the same as counting how many losses are observed in each interval. Thus, summing over t or s is equivalent, i.e.,

tNt\displaystyle\sum_{t}N_{t} t|{s:s+ds[t,t+dt)}|\displaystyle\leq\sum_{t}|\{s:s+d_{s}\in[t,t+d_{t})\}|
=s|{t:s+ds[t,t+dt)}|.\displaystyle=\sum_{s}|\{t:s+d_{s}\in[t,t+d_{t})\}|.

Both s and t are bounded by T. Then we have

s|{t:s+ds[t,t+dt)}|\displaystyle\sum_{s}|\{t:s+d_{s}\in[t,t+d_{t})\}|
s|{t>s:s+ds[t,t+dt)}||\displaystyle\leq\sum_{s}|\{t>s:s+d_{s}\in[t,t+d_{t})\}||
+|{t<s:s+ds[t,t+dt)}|.\displaystyle~{}~{}~{}+|\{t<s:s+d_{s}\in[t,t+d_{t})\}|. (30)

and the first term of (30) can be bounded as

|{t>s:s+ds[t,t+dt)}|\displaystyle|\{t>s:s+d_{s}\in[t,t+d_{t})\}|
|{t>s:ts+ds}\{t>s:t+dt<s+ds}|\displaystyle\leq|\{t>s:t\leq s+d_{s}\}\backslash{\{t>s:t+d_{t}<s+d_{s}\}}|
ds|{t>s:t+dt<s+ds}|,\displaystyle\leq d_{s}-|\{t>s:t+d_{t}<s+d_{s}\}|, (31)

The second term can be bounded in a similar way,

|{t<s:s+ds[t,t+dt)}|\displaystyle|\{t<s:s+d_{s}\in[t,t+d_{t})\}|
|{t<s:s+ds<t+dt}|.\displaystyle\leq|\{t<s:s+d_{s}<t+d_{t}\}|. (32)

Then, by the prior equivalence of summing over t or s, the last term in (31) cancels with (32) when summed, which bounds the second term of (29) by ηD.

The next lemma helps us bound the first term on the RHS of (29).

Lemma 2.

The probabilities defined in (27) satisfy

qi(Φi)(1+12N1)qi(Φi1),i,\displaystyle q^{i}(\Phi_{i})\leq\Big{(}1+\frac{1}{2N-1}\Big{)}q^{i}(\Phi_{i-1}),\forall i, (33)
Proof of Lemma 2.

Refer to Lemma 11 of [35]. ∎

Using Lemma 2, we have

tipt+dti(l~ti)2\displaystyle\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
=tiptipt+dtipti(l~ti)2\displaystyle=\sum_{t}\sum_{i}p_{t}^{i}\frac{p_{t+d_{t}}^{i}}{p_{t}^{i}}(\tilde{l}_{t}^{i})^{2}
=tipti(l~ti)2i=1Ntqi(Ct1{zj:ji})qi(Ct1{zj:j<i})\displaystyle=\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}\prod_{i=1}^{N_{t}}\frac{q^{i}(C_{t-1}\cup\{z_{j}^{\prime}:j\leq i\})}{q^{i}(C_{t-1}\cup\{z_{j}^{\prime}:j<i\})}
tipti(l~ti)2(1+12N1)Nt\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}(1+\frac{1}{2N-1})^{N_{t}}
tipti(l~ti)2(1+12N1)2N1\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}(1+\frac{1}{2N-1})^{2N-1}
tipti(l~ai)2e.\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{a}^{i})^{2}e. (34)

Finally, inequality (29) can be further bounded as

tiptil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j} lnKη+ηD\displaystyle\leq\frac{\ln K}{\eta}+\eta D
+η2tipti(l~ti)2e+||.\displaystyle+\frac{\eta}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}e+|\mathcal{M}|. (35)

Note that the loss estimator in (8) is biased. Next, we use the following lemma to bound the deviation of the estimated loss from the true loss.

Lemma 3.

With probability at least 1δ1-\delta,

t=1T(l~tilti)ln(K/δ)2β,\displaystyle\sum_{t=1}^{T}(\tilde{l}_{t}^{i}-l_{t}^{i})\leq\frac{\ln(K/\delta)}{2\beta}, (36)

simultaneously holds for all i[K]i\in[K].

Proof of Lemma 3.

Refer to Lemma 1 of [28]. ∎

Applying Lemma 3, with probability at least 1−δ, we have

tiptil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j}
lnKη+η2tipti(l~ai)2e+ηD+||.\displaystyle\leq\frac{\ln K}{\eta}+\frac{\eta}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{a}^{i})^{2}e+\eta D+|\mathcal{M}|. (37)

From the definition of the estimated loss in (8), the first term on the LHS of (37) can be rewritten as

iptil~ti\displaystyle\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i} =i=1ptiltipti+βlti𝟙{It=i}\displaystyle=\sum_{i=1}p_{t}^{i}\frac{l_{t}^{i}}{p_{t}^{i}+\beta l_{t}^{i}}\mathbbm{1}_{\{I_{t}=i\}}
=i𝟙{It=i}ltiβi(lti)2pti+βlti𝟙{It=i}\displaystyle=\sum_{i}\mathbbm{1}_{\{I_{t}=i\}}l_{t}^{i}-\beta\sum_{i}\frac{(l_{t}^{i})^{2}}{p_{t}^{i}+\beta l_{t}^{i}}\mathbbm{1}_{\{I_{t}=i\}}
=i𝟙{It=i}ltiβiltil~ti.\displaystyle=\sum_{i}\mathbbm{1}_{\{I_{t}=i\}}l_{t}^{i}-\beta\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}. (38)

Similarly,

ipti(l~ti)2=iptiltipti+βltil~tiiltil~ti.\displaystyle\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}=\sum_{i}p_{t}^{i}\frac{l_{t}^{i}}{p_{t}^{i}+\beta l_{t}^{i}}\tilde{l}_{t}^{i}\leq\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}. (39)

Combining (38), (39) with (37), we have

\sum_{t}\Big(l_{t,I_{t}}-\beta\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}\Big)-\min_{j}\sum_{t}\Big(l_{t}^{j}+\frac{\ln(K/\delta)}{2\beta}\Big)
\leq\frac{\ln K}{\eta}+\frac{\eta e}{2}\sum_{t}\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}+\eta D+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\sum_{t}\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\sum_{t}\sum_{i}\tilde{l}_{t}^{i}+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\Big(KT+\frac{K\ln(K/\delta)}{2\beta}\Big)+|\mathcal{M}|. \qquad (40)

By properly selecting $\eta=2\beta=\sqrt{\frac{\ln K+\ln(K/\delta)}{D+(e+1)KT/2}}$, we have

\bar{\mathcal{R}}\leq 2\sqrt{\Big(D+\frac{(e+1)}{2}KT\Big)\Big(2\ln K-\ln\delta\Big)}+\frac{(e+1)}{2}K\ln\Big(\frac{K}{\delta}\Big)+|\mathcal{M}|, \qquad (41)

which finishes the proof of Theorem 1. ∎
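For intuition about the scaling of the bound in (41), the short sketch below evaluates the prescribed step sizes $\eta=2\beta$ and the resulting bound for a few horizons $T$. The values of $K$, $\delta$, and $d_{\max}$ are made up, $D$ is crudely taken as $Td_{\max}$, and the $|\mathcal{M}|$ term is omitted; the point is only to illustrate the sub-linear growth stated in Theorem 1.

# Illustrative evaluation of the step-size choice eta = 2*beta and the regret
# bound (41); K, delta and d_max are made-up values, D is crudely taken as
# T*d_max, and the |M| term is omitted.
import math

K, delta, d_max = 8, 0.05, 5
for T in (10**3, 10**4, 10**5):
    D = d_max * T
    eta = math.sqrt((math.log(K) + math.log(K / delta)) / (D + (math.e + 1) * K * T / 2))
    beta = eta / 2
    bound = (2 * math.sqrt((D + (math.e + 1) / 2 * K * T) * (2 * math.log(K) - math.log(delta)))
             + (math.e + 1) / 2 * K * math.log(K / delta))
    print(f"T={T:>6}: eta={eta:.4f}, bound={bound:8.1f}, bound/T={bound / T:.3f}")

As $T$ grows, the bound increases only as $O(\sqrt{T})$, so the per-slot regret bound/T shrinks toward zero.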

Appendix B

Proof of Theorem 2.

Let $\epsilon>0$, and define the ergodic average of the value of the game by

\bar{U}_{T}=\frac{\sum_{t=1}^{T}U(\bm{p}_{t},\bm{q}_{t})}{T}. \qquad (42)

By running the proposed algorithm with the cost sequence $l_t^i=U(i,\bm{q}_t)$, we know from (40) that, for any column strategy (in particular $\bm{q}_t$) and any row strategy $\bm{p}$, the row player guarantees

E\Biggl\{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q}_{t})-U(\bm{p},\bm{q}_{t})\Big)\Biggr\}\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\Big(KT+\frac{K\ln(K/\delta)}{2\beta}\Big)+|\mathcal{M}|. \qquad (43)

Let $G$ denote the constant term in (43), i.e.,

G=\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\Big(\frac{\eta e}{2}+\beta\Big)\frac{K\ln(K/\delta)}{2\beta}+|\mathcal{M}|.

Then there exists $T_1>0$ such that

E\Biggl\{\bar{U}_{T}-U(\bm{p},\bm{\bar{q}}_{T})\Biggr\} =E\Biggl\{\frac{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q}_{t})-U(\bm{p},\bm{q}_{t})\Big)}{T}\Biggr\}
\leq\frac{G+(\frac{\eta e}{2}+\beta)KT+\eta D}{T}
\leq\frac{\epsilon}{2},\quad\forall~T>T_{1}, \qquad (44)

where the second inequality holds since $D$ is bounded by $d_{\max}$ and $G$ becomes negligible as $T\rightarrow\infty$.

Then, for the column player with the cost sequence $l_t^j=1-U(\bm{p}_t,j)$, for every strategy $\bm{p}_t$ and strategy $\bm{q}$, we have

E\Biggl\{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q})-U(\bm{p}_{t},\bm{q}_{t})\Big)\Biggr\}\leq G+\Big(\frac{\eta e}{2}+\beta\Big)KT+\eta D. \qquad (45)

Hence there exists $T_2>0$ such that

E\Biggl\{U(\bm{\bar{p}}_{T},\bm{q})-\bar{U}_{T}\Biggr\} =E\Biggl\{\frac{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q})-U(\bm{p}_{t},\bm{q}_{t})\Big)}{T}\Biggr\}
\leq\frac{G+(\frac{\eta e}{2}+\beta)KT+\eta D}{T}
\leq\frac{\epsilon}{2},\quad\forall~T>T_{2}. \qquad (46)

Now, define $\bm{p}_T^b$ as the best response to $\bm{\bar{q}}_T$, i.e.,

\bm{p}_{T}^{b}=\arg\min_{\bm{p}^{\prime}}~U(\bm{p}^{\prime},\bm{\bar{q}}_{T}),

and $\bm{q}_T^b$ as the best response to $\bm{\bar{p}}_T$, i.e.,

\bm{q}_{T}^{b}=\arg\max_{\bm{q}^{\prime}}~U(\bm{\bar{p}}_{T},\bm{q}^{\prime}).

Choosing $\bm{p}=\bm{p}_T^b$ and $\bm{q}=\bm{\bar{q}}_T$ in (44) and (46) and adding them together, we have

E\Biggl\{\Bigg|U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\min_{\bm{p}^{\prime}}~U(\bm{p}^{\prime},\bm{\bar{q}}_{T})\Bigg|\Biggr\} =E\Bigl\{\bar{U}_{T}-U(\bm{p}_{T}^{b},\bm{\bar{q}}_{T})\Bigr\}+E\Biggl\{U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\bar{U}_{T}\Biggr\}
\leq\epsilon,\quad\forall~T>\max\{T_{1},T_{2}\}, \qquad (47)

where the first equality holds since $U(\bm{\bar{p}}_T,\bm{\bar{q}}_T)\geq U(\bm{p}_T^b,\bm{\bar{q}}_T)$. Choosing $\bm{p}=\bm{\bar{p}}_T$ and $\bm{q}=\bm{q}_T^b$ in (44) and (46) and adding them together, we have

E\Biggl\{\Bigg|U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\max_{\bm{q}^{\prime}}~U(\bm{\bar{p}}_{T},\bm{q}^{\prime})\Bigg|\Biggr\} =E\Bigl\{\bar{U}_{T}-U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})\Bigr\}+E\Biggl\{U(\bm{\bar{p}}_{T},\bm{q}_{T}^{b})-\bar{U}_{T}\Biggr\}
\leq\epsilon,\quad\forall~T>\max\{T_{1},T_{2}\}, \qquad (48)

where the first equality holds as $U(\bm{\bar{p}}_T,\bm{\bar{q}}_T)\leq U(\bm{\bar{p}}_T,\bm{q}_T^b)$. Equations (47) and (48) show that $(\bm{\bar{p}}_T,\bm{\bar{q}}_T)$ converges to $\mathcal{N}_{\epsilon}$ in the $L_1$ sense. ∎
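The convergence claimed in (47) and (48) can also be illustrated empirically. The following self-contained sketch uses a toy zero-sum matrix game with made-up payoffs and the classical full-information exponential-weights update (rather than the paper's delayed-feedback bandit algorithm); it lets both players run a no-regret strategy and checks that the time-averaged strategies approach an approximate equilibrium, mirroring the argument above.

# Toy illustration of Theorem 2: two no-regret (exponential-weights) players
# on a small zero-sum game; the time-averaged strategies approach an
# approximate Nash equilibrium. Payoffs and step size are illustrative.
import math

U = [[0.8, 0.2, 0.5],        # U[i][j]: cost of row action i vs column action j
     [0.3, 0.9, 0.4],
     [0.6, 0.5, 0.1]]
K = len(U)
T, eta = 20000, 0.05

wp, wq = [1.0] * K, [1.0] * K
sp, sq = [0.0] * K, [0.0] * K            # running sums for the averaged strategies
for _ in range(T):
    p = [w / sum(wp) for w in wp]
    q = [w / sum(wq) for w in wq]
    sp = [a + b for a, b in zip(sp, p)]
    sq = [a + b for a, b in zip(sq, q)]
    # expected cost of each pure action against the opponent's mixed strategy
    row_cost = [sum(U[i][j] * q[j] for j in range(K)) for i in range(K)]
    col_gain = [sum(U[i][j] * p[i] for i in range(K)) for j in range(K)]
    wp = [wp[i] * math.exp(-eta * row_cost[i]) for i in range(K)]   # row minimizes
    wq = [wq[j] * math.exp(+eta * col_gain[j]) for j in range(K)]   # column maximizes

p_bar = [s / T for s in sp]
q_bar = [s / T for s in sq]
value = sum(U[i][j] * p_bar[i] * q_bar[j] for i in range(K) for j in range(K))
best_row = min(sum(U[i][j] * q_bar[j] for j in range(K)) for i in range(K))
best_col = max(sum(U[i][j] * p_bar[i] for i in range(K)) for j in range(K))
print(f"value={value:.3f}, row deviation gain={value - best_row:.4f}, "
      f"column deviation gain={best_col - value:.4f}")   # both should be small

Both deviation gains shrink as $T$ grows, which is exactly the $\epsilon$-Nash-equilibrium condition used in (47) and (48).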

References

  • [1] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
  • [2] M. S. Elbamby, C. Perfecto, C. Liu, J. Park, S. Samarakoon, X. Chen, and M. Bennis, “Wireless edge computing with latency and reliability guarantees,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1717–1737, 2019.
  • [3] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet of things,” in Workshop on Mobile Cloud Computing (MCC), Aug. 2012, pp. 13–16.
  • [4] S. Yi, C. Li, and Q. Li, “A survey of fog computing: concepts, applications and issues,” in Workshop on Mobile Big Data (Mobidata), Aug. 2015, pp. 13–16.
  • [5] F. Bonomi, R. Milito, P. Natarajan, and J. Zhu, “Fog computing: A platform for internet of things and analytics,” in Big Data and Internet of Things: A Roadmap for Smart Environments. Springer, 2014, pp. 169–186.
  • [6] K. Zhang, Y. Mao, S. Leng, Y. He, and Y. Zhang, “Mobile-edge computing for vehicular networks: A promising network paradigm with predictive off-loading,” IEEE Vehicular Technology Magazine, vol. 12, no. 2, pp. 36–44, 2017.
  • [7] J. Wang, K. Liu, B. Li, T. Liu, R. Li, and Z. Han, “Delay-sensitive multi-period computation offloading with reliability guarantees in fog networks,” IEEE Transactions on Mobile Computing, pp. 1–1, 2019.
  • [8] R. Deng, R. Lu, C. Lai, T. H. Luan, and H. Liang, “Optimal workload allocation in fog-cloud computing toward balanced delay and power consumption,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 1171–1181, 2016.
  • [9] C. You, K. Huang, H. Chae, and B. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1397–1411, 2017.
  • [10] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Transactions on Communications, vol. 65, no. 8, pp. 3571–3584, 2017.
  • [11] I. Ketykó, L. Kecskés, C. Nemes, and L. Farkas, “Multi-user computation offloading as multiple knapsack problem for 5G mobile edge computing,” in European Conference on Networks and Communications (EuCNC), June 2016, pp. 225–229.
  • [12] M. Khaledi, M. Khaledi, and S. K. Kasera, “Profitable task allocation in mobile cloud computing,” in ACM Symposium on QoS and Security for Wireless and Mobile Networks (Q2SWinet), Nov. 2016, pp. 9–17.
  • [13] H. Zhu, K. Kang, X. Luo, and H. Qian, “Online learning for computation peer offloading with semi-bandit feedback,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 4524–4528.
  • [14] G. Lee, W. Saad, and M. Bennis, “An online secretary framework for fog network formation with minimal latency,” in IEEE International Conference on Communications (ICC), May 2017, pp. 1–6.
  • [15] Z. Zhu, T. Liu, Y. Yang, and X. Luo, “BLOT: Bandit learning-based offloading of tasks in fog-enabled networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 12, pp. 2636–2649, 2019.
  • [16] C. Liu, M. Bennis, M. Debbah, and H. V. Poor, “Dynamic task offloading and resource allocation for ultra-reliable low-latency edge computing,” IEEE Transactions on Communications, vol. 67, no. 6, pp. 4132–4150, 2019.
  • [17] M. Yang, H. Zhu, H. Wang, Y. Koucheryavy, K. Samouylov, and H. Qian, “An online learning approach to computation offloading in dynamic fog networks,” IEEE Internet of Things Journal, pp. 1–1, 2020.
  • [18] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper, “The Mälardalen WCET benchmarks: Past, present and future,” in International Workshop on Worst-Case Execution Time Analysis (WCET). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, July 2010.
  • [19] M. Yang, H. Zhu, H. Wang, Y. Koucheryavy, K. Samouylov, and H. Qian, “Peer to peer offloading with delayed feedback: An adversary bandit approach,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 5035–5039.
  • [20] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016.
  • [21] L. Chen, S. Zhou, and J. Xu, “Computation peer offloading for energy-constrained mobile edge computing in small-cell networks,” IEEE/ACM Transactions on Networking, vol. 26, no. 4, pp. 1619–1632, Aug. 2018.
  • [22] D. Huang, P. Wang, and D. Niyato, “A dynamic offloading algorithm for mobile computing,” IEEE Transactions on Wireless Communications, vol. 11, no. 6, pp. 1991–1995, 2012.
  • [23] O. Muñoz, A. Pascual-Iserte, and J. Vidal, “Optimization of radio and computational resources for energy efficiency in latency-constrained application offloading,” IEEE Transactions on Vehicular Technology, vol. 64, no. 10, pp. 4738–4755, 2015.
  • [24] “Evolved universal terrestrial radio access (E-UTRA); physical layer procedures (release 16),” 3GPP TS 36.213, 2020.
  • [25] D. J. Costello, Error Control Coding: Fundamentals and Applications. Prentice Hall, 1983.
  • [26] D. P. Bertsekas, R. G. Gallager, and P. Humblet, Data networks.   Prentice-Hall International New Jersey, 1992, vol. 2.
  • [27] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “The non-stochastic multi-armed bandit problem,” SIAM Journal on Computing, vol. 32, pp. 48–77, 2002.
  • [28] G. Neu, “Explore no more: Improved high-probability regret bounds for non-stochastic bandits,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2015, pp. 3168–3176.
  • [29] Y. Cai and C. Daskalakis, “On minmax theorems for multiplayer games,” in Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Jan. 2011, pp. 217–234.
  • [30] J. Nash, “Non-cooperative games,” Annals of Mathematics, pp. 286–295, 1951.
  • [31] T. Roughgarden, “Algorithmic game theory,” Communications of the ACM, vol. 53, no. 7, pp. 78–86, 2010.
  • [32] P. Joulani, A. Gyorgy, and C. Szepesvári, “Online learning under delayed feedback,” in International Conference on Machine Learning (ICML), June 2013, pp. 1453–1461.
  • [33] T. Mandel, Y.-E. Liu, E. Brunskill, and Z. Popović, “The queue method: Handling delay, heuristics, prior data, and evaluation in bandits,” in AAAI Conference on Artificial Intelligence, 2015.
  • [34] C. Vernade, O. Cappe, and V. Perchet, “Stochastic bandit models for delayed conversions,” in Conference on Uncertainty in Artificial Intelligence (UAI), Aug. 2017.
  • [35] T. S. Thune, N. Cesa-Bianchi, and Y. Seldin, “Nonstochastic multiarmed bandits with unrestricted delays,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2019, pp. 6541–6550.