
Peer Offloading with Delayed Feedback in Fog Networks

Miao Yang, Hongbin Zhu, Hua Qian, Senior Member, IEEE, Yevgeni Koucheryavy, Senior Member, IEEE, Konstantin Samouylov, and Haifeng Wang This paper was presented in part at the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020). M. Yang and H. Qian are with Shanghai Advanced Research Institute, Chinese Academy of Sciences (CAS), also with School of Information Science and Technology, ShanghaiTech University, and also with University of Chinese Academy of Sciences. H. Zhu and H. Wang are with School of Information Science and Technology, ShanghaiTech University, also with Shanghai Institute of Microsystem and Information Technology, CAS. Yevgeni Koucheryavy is with Tampere University, Finland, and also with National Research University Higher School of Economics, Moscow, Russia. Konstantin Samouylov is with Peoples Friendship University of Russia (RUDN University), Moscow, Russian Federation. Corresponding author: Hongbin Zhu ([email protected]). This work was supported in part by the National Natural Science Foundation of China (Grant No. 61971286) and the Science and Technology Commission Foundation of Shanghai (Grant No. 19DZ1204300).
Abstract

Compared with cloud computing, fog computing performs computation and services at the edge of the network, thus relieving the computation burden of the data center and reducing the task latency of end devices. Computation latency is a crucial performance metric in fog computing, especially for real-time applications. In this paper, we study a peer computation offloading problem for a fog network with unknown dynamics. In this scenario, each fog node (FN) can offload its computation tasks to neighboring FNs in a time-slotted manner. The offloading latency, however, cannot be fed back to the task dispatcher instantaneously due to the uncertainty of the processing time at peer FNs. Besides, peer competition occurs when different FNs offload tasks to the same FN at the same time. To tackle the above difficulties, we model the computation offloading problem as a sequential FN selection problem with delayed information feedback. Using the adversarial multi-armed bandit framework, we construct an online learning policy to deal with delayed information feedback. Different contention resolution approaches are considered to resolve peer competition. Performance analysis shows that the regret of the proposed algorithm, i.e., the performance loss compared with the optimal FN selection, achieves a sub-linear order, suggesting an asymptotically optimal FN selection policy. Besides, we prove that the proposed strategy can result in a Nash equilibrium (NE) with all FNs playing the same policy. Simulation results validate the effectiveness of the proposed policy.

Index Terms:
Fog computing, adversarial multi-armed bandit, reinforcement learning, task offloading, delayed feedback.

I Introduction

Pervasive mobile devices, together with rapidly increasing mobile applications such as augmented reality, virtual reality, online gaming, and super-resolution video streaming, are revolutionizing our way of life [1]. These massive devices and heterogeneous services require unprecedentedly high transmission speed and low latency [2]. The limited battery capacities and computation resources of mobile devices, on the other hand, place stringent restrictions on local computation.

Fog computing has recently emerged to assist in the task processing of mobile devices [3, 4, 5]. The key idea of fog computing is to utilize the computation resources of fog nodes (FNs), which are usually located at the edge of the network. These FNs can be servers or micro base stations with extra computation resources. The computation tasks of mobile devices can be offloaded to FNs, which relieves the computation workload of mobile devices and reduces their power consumption. On the other hand, the offloaded computation tasks and the computation capability of each FN vary, which may result in unbearable latency and degrade the users’ experience. For example, weak FNs with poor computation capacity or heavy workload are ill-suited to computation-intensive applications.

An intelligent peer computation offloading strategy is desired to fully utilize neighboring FNs while maintaining a high quality of experience. By means of peer computation offloading, weak FNs can offload their emerging computation tasks to FNs in the vicinity. With an efficient peer offloading strategy, fog computing can assist terminals in satisfying stringent latency requirements and support a variety of Internet of Things (IoT) devices whose computation resources are limited.

I-A Prior Work

In the recent literature, quite a few computation offloading problems over fog networks have been studied. The authors in [6] investigated the computation offloading problem by adaptively offloading computation tasks from the cloud to fog servers with predicted mobility. The task offloading problem was formulated as a multi-period generalized assignment problem in [7], which aimed at minimizing the total delay by offloading tasks to suitable fog nodes in each period. The work in [8] studied a workload assignment problem between fog and cloud that considered the trade-off between power consumption and transmission delay. The above studies are based on centralized scheduling with a high overhead of information exchange at edge devices. Such offloading algorithms are not suitable for real-time and massive data services. In addition, these studies do not consider offloading among FNs, which can be more challenging with unbalanced workloads among FNs.

To tackle the peer offloading problem, the authors in [9] and [10] formulated the offloading problem as a convex optimization problem to minimize the energy consumption under latency constraints. In [11], a latency minimization problem was studied to allocate the computation resources of the edge servers. Meanwhile, the authors in [12] proposed a task allocation approach that minimized the overall task completion time using a multidimensional auction. In the above works, it is generally assumed that the information of the fog network is completely known, and thus the computation offloading is modeled as a deterministic scheduling problem. However, such assumptions do not hold in practice since global information cannot be obtained in real time and the network status may change rapidly.

For the task offloading problem with unknown dynamics, the authors in [13] proposed an online peer offloading algorithm for time-varying channel conditions and stochastic arrival tasks. The work in [14] proposed an online secretary algorithm to minimize the offloading latency with unknown arrival rates of neighboring FNs. The latency, the energy consumption, and the switching cost were jointly considered as a stochastic programming problem in [15]. Taking into account the requirements of ultra-reliable low latency, the authors in [16] proposed a mechanism to minimize the users’ power consumption while considering the allocated resources. Besides, in [17], the workloads of neighboring FNs were formulated as a Markov model and a latency-minimizing algorithm was proposed to track these dynamics. These works assume that the workloads of FNs satisfy some stationary distributions. Furthermore, there exist competition and collision effects among the offloading choices of different FNs. Unexpected tasks brought by dynamic network structures and conditions could deteriorate the offloading performance.

In general, all works mentioned above assume instantaneous feedback of offloading decisions. The task assigner is assumed to receive the expected latency as soon as the tasks are dispatched. This assumption is not always valid in the real world, since the FN that provides computation capability may not be able to give accurate latency estimates when the tasks are received. The exact execution clock cycles of a task cannot be obtained until it is finished. The execution time of a task is coupled with the hardware and software state, such as the cache condition, resource contention, scheduling policy, and interrupt management. In fact, even measuring the worst-case execution time (WCET) of a task is a notoriously hard problem in the field of real-time operating systems (RTOS) [18].

I-B Contribution and Organization

In this paper, for the FN that wants to offload tasks, we assume that: (1) the global FN state information is unknown; (2) the resources of service FNs may be unpredictably competed for by other FNs; (3) the offloading feedback is obtained only after the task has been processed and finished by the service FN. These assumptions naturally raise three challenging problems. Firstly, the performance achieved from each FN is unknown beforehand and has to be learned due to dynamic FN conditions. Secondly, the achieved performance also depends on the set of competing tasks, which is unpredictable since the decisions of other FNs are made locally without coordination. Thirdly, the feedback of the offloaded task is delayed, which may deteriorate the performance of existing strategies. Most existing computation offloading algorithms do not fully consider the uncertainties mentioned above and are not applicable under these assumptions.

In [19], we studied the computation peer offloading problem in a fog-enabled network. Our goal is to minimize the peer offloading latency in an uncertain environment without instantaneous latency feedback. An exponential-based algorithm was proposed to handle this problem, and simulation results revealed the efficiency of the proposed strategy.

In this paper, we extend our previous work in [19] with theoretical proofs for the single-agent setting. Furthermore, we study the computation peer offloading problem in the multi-agent setting. The main contributions of this paper are summarized as follows.

  • We develop an online learning framework for fog networks, where each FN can offload its computation tasks to peer FNs in the vicinity. We formulate the latency minimization problem as a sequential decision-making problem. Both single-agent and multi-agent settings are considered in this work.

  • For the single-agent setting, an online learning algorithm is advocated based on the adversarial multi-armed bandit (MAB) framework, which accommodates arbitrary variations of FN conditions. The proposed method can be harnessed to learn the statistical information of the network through delayed information feedback. Besides, the proposed method allows the task-dispatching FN to dynamically identify the most suitable service FNs without prior knowledge of arrival tasks and channel conditions. A theoretical proof is provided, showing that the regret of the proposed algorithm achieves a sub-linear order even with delayed feedback.

  • For the multi-agent setting, we consider different conflict resolutions when multiple tasks are dispatched to the same FN. Assuming that all FNs adopt the same strategy, the multi-agent setting can be regarded as a game between an FN and the rest of the FNs. The proposed algorithm for the multi-agent setting achieves the same order of optimality as in the single-agent setting. Besides, the proposed algorithm tailored for the offloading game can converge to an ε-NE in a specific two-user setting.

The rest of the paper is organized as follows. The system model and optimization problem are introduced in section II. An online learning algorithm for computation offloading with delayed feedback is proposed in section III. The theoretical analysis of multi-agent games with the proposed algorithm is given in section IV. Numerical results are presented in section V, and section VI concludes our work.

II System Model and Problem Formulation

II-A Network Model

Consider a fog network consisting of an end-user layer and a fog layer as shown in Fig. 1. In our framework, the end-user layer includes smartphones, laptops and wireless sensors that do not have enough computation capability and are power limited. With the help of fog computing, these end-users can offload their task data to the fog layers for remote distributed computing purposes. We consider a fog network with NN FNs, defined by 𝒩={1,2,,N}\mathcal{N}=\{1,2,...,N\}. Each FN can process the task from end-user devices or other peer FNs, which is typical in practical fog networking [20].

For conciseness, we classify FNs into the following two categories: (1) Task FN: an FN is a task FN if it offloads workloads to other FNs and does not receive any workload from other FNs; (2) Service FN: an FN is a service FN if it receives workloads from other FNs and does not offload workload to others. Let V and K denote the numbers of task FNs and service FNs in the fog network, respectively. Notice that under this categorization, an FN does not offload workloads to other FNs while simultaneously receiving workloads from them. In other words, each FN plays only one role, either task FN or service FN.

The operational timeline is discretized into time slots (e.g., 10 ms) for making peer offloading decisions. Once FN i receives a task of size b_t^i at time slot t, it can process the task locally or offload it to a neighboring FN. The service rate, which denotes the expected number of consecutive task completions per time slot, helps characterize the computation capability of an FN. Intuitively, FNs with low service rates or whose central processing units (CPUs) are running at full load prefer to offload computation tasks to their neighboring FNs, while FNs with high service rates or abundant idle computation resources prefer to process tasks locally.

Figure 1: Peer offloading scenario.

Next, we discuss the latency of peer computing and local computing, which is our foremost concern. The latency of peer computing contains three parts: (1) latency of task dispatching; (2) latency of task processing; (3) latency of result receiving. The latency of result receiving is negligible due to the relatively small size of computation results [21].

II-B Communication Model

Most works on computation offloading assume that some prior information is available, such as wireless channel state information (CSI) [22, 23]. However, it is not easy to access such information, especially when the number of FNs is large. In this paper, CSI is assumed to be time-varying and unknown before task transmission. Task FN i can offload tasks to service FN j over a wireless channel. The transmission rate is given by

rij=Blog2(1+gijhPiσ2),\displaystyle r^{ij}=B\log_{2}(1+\frac{g^{ij}hP^{i}}{\sigma^{2}}), (1)

where gijg^{ij} is the channel gain between FN ii and jj, and hh is the average fading gain of the FN ii. Notation PiP^{i} is the transmission power of FN ii and σ2\sigma^{2} denotes the noise power spectral density. The bandwidth per node is the same in our model and is given by BB. We denote r¯ij\overline{r}^{ij} as the average transmission rate between FN ii and FN jj during one task transmission. Thus the task dispatching latency can be described as

Dti=btir¯ij.\displaystyle D_{t}^{i}=\frac{b_{t}^{i}}{\overline{r}^{ij}}. (2)

Channel deterioration may lead to failures in the transmission of computation tasks. The task FN can confirm the transmission status by an acknowledgement (ACK) or negative-acknowledgement (NACK) message [24]. Once a NACK message is received, the task FN re-transmits the task using the Automatic Repeat-reQuest (ARQ) or Hybrid Automatic Repeat-reQuest (HARQ) scheme [25].
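To make the communication model concrete, the following Python sketch computes the transmission rate in (1) and the dispatching latency in (2). It is a minimal illustration only; the function names and all numeric values (bandwidth, channel gain, fading gain, transmit power, noise power, task size) are assumed placeholders rather than parameters from the paper's simulations.

```python
import math

def transmission_rate(bandwidth_hz, channel_gain, fading_gain, tx_power_w, noise_power_w):
    """Shannon-type rate of Eq. (1): r = B * log2(1 + g*h*P / sigma^2)."""
    snr = channel_gain * fading_gain * tx_power_w / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)

def dispatch_latency(task_size_bits, avg_rate_bps):
    """Task dispatching latency of Eq. (2): D = b / r_bar."""
    return task_size_bits / avg_rate_bps

# Example with assumed placeholder values.
r = transmission_rate(bandwidth_hz=10e6, channel_gain=1e-3, fading_gain=0.8,
                      tx_power_w=0.5, noise_power_w=1e-9)
d = dispatch_latency(task_size_bits=2e6, avg_rate_bps=r)
print(f"rate = {r / 1e6:.2f} Mbps, dispatch latency = {d * 1e3:.2f} ms")
```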

II-C Computation Model

The majority of existing offloading studies consider an ideal setting in which the task FN obtains the computation latency as soon as tasks arrive at the service FN. However, current operating systems do not support such instantaneous feedback. The initial machine state of the service FN varies before the task executes, and some interrupts may occur during computation. In this paper, the latency feedback is obtained only when the service FN finishes executing the offloaded task.

Next, we define the computation model at the service FNs. According to the Kleinrock approximation [26], the arrival process of tasks can be regarded as a Poisson process. To model the perturbation of the computation process at service FNs, the service time of each FN is assumed to be negative-exponentially distributed (i.e., task completions are generated by a Poisson process). Besides, the statistics of arrival rates vary among FNs since the density of end-users is usually not uniform. The arrival rate λ of an FN is unknown to other FNs.

When a task arrives at its destination, it is cached in the queue if the service FN is busy. Since the computing service rate μ_t^j of FN j at time slot t may be perturbed by system interrupts, we assume that μ_t^j is unknown to all FNs. Let μ̄_t^j denote the average service rate during computation. The computation queue can be modeled as an M/M/1 queue, where M denotes the Markov property. An M/M/1 queueing system indicates that tasks arrive according to a Poisson process and the service times are independent and exponentially distributed, which is consistent with the model described above. Let Q_t^j denote the task queue length of FN j when the task arrives at time slot t.
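As a rough illustration of the queueing model described above, the sketch below simulates a single service FN as an M/M/1-style queue with Poisson task arrivals and exponentially distributed service times. It is a simplified illustration; the arrival rate, service rate, and horizon are assumed values, and the queue length is sampled only at event times.

```python
import random

def simulate_mm1_queue_length(arrival_rate, service_rate, horizon, seed=0):
    """Event-driven simulation of one service FN's queue over `horizon` time units."""
    rng = random.Random(seed)
    t, queue, samples = 0.0, 0, []
    next_arrival = rng.expovariate(arrival_rate)
    next_departure = float("inf")
    while t < horizon:
        if next_arrival <= next_departure:       # a task arrives
            t = next_arrival
            queue += 1
            if queue == 1:                        # server was idle, start serving it
                next_departure = t + rng.expovariate(service_rate)
            next_arrival = t + rng.expovariate(arrival_rate)
        else:                                     # a task finishes service
            t = next_departure
            queue -= 1
            next_departure = (t + rng.expovariate(service_rate)
                              if queue > 0 else float("inf"))
        samples.append(queue)                     # queue length sampled at event times
    return samples

lengths = simulate_mm1_queue_length(arrival_rate=5.0, service_rate=6.0, horizon=1000.0)
print("event-averaged queue length (rough indicator):", sum(lengths) / len(lengths))
```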

Without loss of generality, we assume that the computation workload brought by a task at a service FN is proportional to the received task size, i.e., k b_t^i with k ∈ R⁺. Thus, the computation latency at FN j is

Ctj=Qtj+kbtiμ¯tj.\displaystyle C_{t}^{j}=\frac{Q_{t}^{j}+kb_{t}^{i}}{\bar{\mu}_{t}^{j}}. (3)

Equation (3) gives an estimate of the computation latency.

II-D Problem Formulation

Combining the communication latency and computation latency, the total peer offloading latency is

\displaystyle O_{t}^{j}=C_{t}^{j}+D_{t}^{i}=\frac{Q_{t}^{j}+kb_{t}^{i}}{\bar{\mu}_{t}^{j}}+\frac{b_{t}^{i}}{\overline{r}^{ij}}. (4)

In practice, the offloading latency cannot be infinite; we bound it within [0, T_max], where T_max is the maximal tolerable latency. Thus, the loss l_t^j of each choice is described as

ltj={OtjTmax,ifOtj<Tmax,1,otherwise,\displaystyle l_{t}^{j}=\begin{cases}\frac{O_{t}^{j}}{T_{\max}},&\text{if}~{}O_{t}^{j}<T_{\text{max}},\\ 1,&\text{otherwise},\end{cases} (5)

where ltj[0,1]l_{t}^{j}\in[0,1]. We want to minimize the average latency of each FN in the long term and it can be formulated as

\begin{split}&\text{minimize}~~~\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\sum_{i=1}^{K}l_{t}^{i}~\mathbbm{1}\{I_{t}=i\}\\ &\text{subject to}~~~~~I_{t}\in\mathcal{I},\quad t=1,2,...,T,\end{split} (6)

where I_t denotes the index of the FN selected for offloading at time slot t, ℐ denotes the set of all offloading choices, and 𝟙{·} is the indicator function. In our model, the channel conditions and task queues of the service FNs are unknown to the task FN. Moreover, the task FN cannot observe the latency of a decision until the corresponding feedback arrives. Finding the optimal solution of (6) is therefore challenging with such non-real-time information.
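The following minimal sketch pieces together the computation latency in (3), the total offloading latency in (4), and the normalized loss in (5) for a single offloading decision; the input values are assumed placeholders in normalized units.

```python
def computation_latency(queue_len, task_size, k, avg_service_rate):
    """Eq. (3): C = (Q + k*b) / mu_bar."""
    return (queue_len + k * task_size) / avg_service_rate

def offloading_latency(queue_len, task_size, k, avg_service_rate, avg_tx_rate):
    """Eq. (4): total latency = computation latency + dispatching latency."""
    return (computation_latency(queue_len, task_size, k, avg_service_rate)
            + task_size / avg_tx_rate)

def normalized_loss(latency, t_max):
    """Eq. (5): latency normalized by T_max and clipped to [0, 1]."""
    return min(latency / t_max, 1.0)

# Example with assumed placeholder values.
lat = offloading_latency(queue_len=3, task_size=1.0, k=1.0,
                         avg_service_rate=6.0, avg_tx_rate=2.0)
print("latency:", lat, "loss:", normalized_loss(lat, t_max=5.0))
```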

Online learning is a well-established paradigm for making a sequence of decisions given the knowledge of previous prediction feedback. Besides, online learning can reduce the computation complexity and has been investigated in many studies. By integrating an online learning scheme, the challenging optimization problem in (6) can be converted into a low-complexity sequential decision problem at each time slot.

A vanilla online learning algorithm operates in a fixed order: making a choice, receiving feedback, and updating the learning model. Such typical online learning algorithms may not work under the delayed feedback setting, where the order of the received feedback may not be in line with the order of the dispatches. The lag of the feedback and the disorder of the observations bring significant challenges to vanilla online learning algorithms.

III Single-agent setting

In this section, we first introduce the adversarial multi-armed bandit (MAB) framework, and then propose an offloading algorithm based on the exponentially weighted strategy.

III-A Adversarial multi-armed bandit

The adversarial MAB framework is efficient in dynamic environments. The offloading problem in section II can be modeled as an adversarial MAB consisting of a set of players (task FNs), each of which chooses an action (a service FN) from its action set (the available service FNs). When the task FN makes a choice at time slot s, the offloading loss is observed by the task FN after d_s time slots. The offloading action is taken at slot s and the loss is obtained at slot t, so we have s + d_s = t.

To measure the utility of a task offloading scheme Φ, the concept of expected regret is introduced. The expected regret compares the cumulative loss of Φ with that of the best strategy in hindsight. The desired Φ minimizes its expected cumulative regret R_Φ(T), which is defined as

RΦ(T)=𝔼[t=1Tls|ta]minAt=1TltA,\displaystyle R_{\Phi}(T)=\mathbb{E}\Big{[}\sum_{t=1}^{T}l_{s|t}^{a}\Big{]}-\min_{A}\sum_{t=1}^{T}l_{t}^{A}, (7)

where 𝔼(·) represents the expectation operation, a ∈ {1,...,K} denotes the action, and A denotes the best policy, which gives the least loss. Notation l_{s|t}^a represents the loss of action a taken at slot s and observed at slot t, and l_t^A denotes the loss of the optimal policy A.

In the adversarial MAB setting, the task FN chooses service FN j using a mixed strategy according to a probability distribution p_t = [p_t^1, p_t^2, ..., p_t^K] [27]. Due to the variation of the computation lag, the latency results may be observed out of order. We then introduce an algorithm to handle the delayed feedback and make the task offloading decisions.

III-B Proposed delayed exponential based algorithm

Let ℒ_t denote the set of feedback arriving at time slot t and |ℒ_t| denote its size. Recall that the feedback arriving at slot t includes losses incurred at earlier slots s. Once ℒ_t is revealed, the FN computes the estimated losses at the current slot. For each l_{s|t}^j ∈ ℒ_t, a biased estimator l̃_{s|t}^j that reduces the estimation variance of the loss vector is

l~s|tj=ls|tjpsnj+βls|tj𝟙{Is=j},\displaystyle\tilde{l}_{s|t}^{j}=\frac{l_{s|t}^{j}}{p_{s_{n}}^{j}+\beta l_{s|t}^{j}}\mathbbm{1}_{\{I_{s}=j\}}, (8)

where β\beta is the biased factor of the loss for implicit exploration[28].

During a time slot t, the task FN may collect more than one feedback message or none at all. Once feedback is received, it is buffered and then used to update p_{t+1} at the beginning of slot t+1. If the offloading feedback is returned within [0, d_max], we regard the offloading action as successful. On the other hand, if d_s ≥ d_max, the task FN regards the action as a failure because of the intolerable latency. In this case, the task FN updates p_t with loss l_{s|t} = 1, and the computation task is re-offloaded after d_max according to the selection probability. For l_{s|t}^i ∈ ℒ_t, the task FN i updates the weights and probabilities of all service FNs with

l~tj\displaystyle\tilde{l}_{t}^{j} =n=1|t|l~s|tj,\displaystyle=\sum_{n=1}^{|\mathcal{L}_{t}|}\tilde{l}_{s|t}^{j}, (9)
wtj\displaystyle w_{t}^{j} =wt1jexp(ηtltj~),\displaystyle=w_{t-1}^{j}\exp(-\eta_{t}\tilde{l_{t}^{j}}), (10)
ptj\displaystyle p_{t}^{j} =wtjkKwtk,\displaystyle=\frac{w_{t}^{j}}{\sum_{k\in K}w_{t}^{k}}, (11)

where η\eta denotes the step size. If t=\mathcal{L}_{t}=\emptyset, the task FN directly reuses the previous distribution, i.e., 𝒑t+1=𝒑t\bm{p}_{t+1}=\bm{p}_{t}, and the offloading strategy does not change at all. When the feedback is received without lag, the updating scheme of the proposed algorithm is similar to the vanilla EXP3-IX algorithm [28].

Algorithm 1 Delayed Exponential Based (DEB) Algorithm.
1:Initialize:
2:      Set 𝒑0=(1K,,1K)\bm{p}_{0}=(\frac{1}{K},...,\frac{1}{K}) for all FN j𝒦j\in\mathcal{K}.
3:for t=1,2,,Tt=1,2,...,T do
4:     Offload task to service FN according to 𝒑t1\bm{p}_{t-1}.
5:     Feedback collected in set t\mathcal{L}_{t}.
6:     for n=1,2,,|t|n=1,2,...,|\mathcal{L}_{t}| do
7:         Estimate ls|tjl_{s|t}^{j} via (8) if ls|tjtl_{s|t}^{j}\in\mathcal{L}_{t}.
8:         Collect the loss of each FN by (9).
9:     end for
10:     if ds>=dmaxd_{s}>=d_{\max} then
11:         ls=1l_{s}=1.
12:     end if
13:     Update wtjw_{t}^{j} and ptjp_{t}^{j} by (10) and (11).
14:end for

Algorithm 1 summarizes the proposed DEB algorithm for processing the task offloading problem with delayed feedback. Initially, the probability of each FN is set as 1/K1/K. The task FN makes the offloading choice according to the probability distribution 𝒑t1\bm{p}_{t-1}. If fresh feedback arrives, the task FN calculates the estimated loss and updates the weight of each service FN. If not, the task FN makes the decision according to the previous distribution.
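To illustrate how the estimator in (8), the aggregation in (9), and the weight and probability updates in (10)-(11) interact with delayed feedback, a compact Python sketch of the DEB update step is given below. It is a simplified single-task-FN illustration under assumed parameters (η, β, the number of service FNs, and the example loss), not the full implementation used in the simulations.

```python
import math
import random

class DEB:
    """Sketch of the Delayed Exponential Based update of Algorithm 1."""
    def __init__(self, num_fns, eta, beta, seed=0):
        self.K = num_fns
        self.eta, self.beta = eta, beta
        self.weights = [1.0] * num_fns           # w_0^j = 1 for every service FN
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [w / total for w in self.weights]  # Eq. (11)

    def choose_fn(self):
        """Sample a service FN index according to the mixed strategy p_t."""
        return self.rng.choices(range(self.K), weights=self.probabilities())[0]

    def update(self, feedback):
        """`feedback` lists (chosen_fn, loss, p_at_dispatch) tuples whose
        delayed losses arrived during the current slot."""
        est = [0.0] * self.K
        for fn, loss, p_dispatch in feedback:
            # Biased (implicit-exploration) estimator of Eq. (8), summed as in Eq. (9).
            est[fn] += loss / (p_dispatch[fn] + self.beta * loss)
        for j in range(self.K):
            self.weights[j] *= math.exp(-self.eta * est[j])   # Eq. (10)

# Usage with assumed parameters: dispatch, then apply feedback arriving two slots later.
agent = DEB(num_fns=6, eta=0.05, beta=0.025)
p_t = agent.probabilities()
chosen = agent.choose_fn()
agent.update([(chosen, 0.4, p_t)])   # a loss of 0.4 observed after some delay
```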

III-C Performance analysis

In this part, we analyze the regret of the proposed delayed exponential based algorithm.

Theorem 1.

Let T denote the time horizon of the learning procedure and D = ∑_{t=1}^{T} d_t denote the total delay. Define |ℳ| to be the regret caused by all the samples that are not received before round T. The regret |ℳ| is bounded because of the intolerable feedback delay threshold d_max. With probability at least 1−δ, the regret of the proposed algorithm satisfies

¯\displaystyle\bar{\mathcal{R}} lnKη+ln(K/δ)2β+(ηe2+β)(KT+Kln(K/δ)2β)\displaystyle\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+(\frac{\eta e}{2}+\beta)(KT+\frac{K\ln(K/\delta)}{2\beta})
+Dη+||.\displaystyle+D\eta+|\mathcal{M}|.

In particular, setting η=2β=lnK+ln(K/δ)D+(e+1)KT/2\eta=2\beta=\sqrt{\frac{\ln K+\ln(K/\delta)}{D+(e+1)KT/2}}, we have

¯\displaystyle\bar{\mathcal{R}}\leq 2(D+(e+1)2KT)(2lnKln(δ))\displaystyle 2\sqrt{\Big{(}D+\frac{(e+1)}{2}KT\Big{)}\Big{(}2\ln K-\ln({\delta})\Big{)}}
+(e+1)2Kln(Kδ)+||.\displaystyle+\frac{(e+1)}{2}K\ln(\frac{K}{\delta})+|\mathcal{M}|. (12)
Proof of Theorem 1.

Given in Appendix A. ∎

According to Theorem 1, the regret of the proposed algorithm can be upper bounded by a sub-linear order by choosing proper update parameters. This result indicates that the per-slot regret growth decreases as time goes by, so the proposed algorithm eventually converges to the optimal service FN selection.
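As a numerical illustration, the short sketch below evaluates the step size η = 2β prescribed by Theorem 1 for assumed values of K, T, the total delay D, and the confidence level δ.

```python
import math

def deb_step_size(num_fns, horizon, total_delay, delta):
    """Step size from Theorem 1:
    eta = 2*beta = sqrt((ln K + ln(K/delta)) / (D + (e+1)*K*T/2))."""
    numerator = math.log(num_fns) + math.log(num_fns / delta)
    denominator = total_delay + (math.e + 1.0) * num_fns * horizon / 2.0
    eta = math.sqrt(numerator / denominator)
    return eta, eta / 2.0   # (eta, beta)

# Assumed example values: K = 6 service FNs, T = 30000 slots,
# an average delay of 2 slots (so D = 2T), and confidence delta = 0.05.
eta, beta = deb_step_size(num_fns=6, horizon=30000, total_delay=60000, delta=0.05)
print(f"eta = {eta:.4f}, beta = {beta:.4f}")
```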

IV Multi-agent setting

In most cases, the fog network consists of multiple task FNs. In this section, we extend our work to the multi-agent setting. We consider a game where all players play according to the proposed algorithm with delayed feedback. Without delays, an algorithm with sublinear regret, played against itself, converges to a Nash equilibrium (NE) in the ergodic average sense [29]. In this section, we further investigate our proposed algorithm in the multi-agent setting and prove that, even with delayed feedback, the actions of the task FNs adopting the proposed algorithm achieve an NE.

IV-A Problem Formulation

In the multi-agent setting, there are V active task FNs and K independent service FNs. In our paper, a time slot is divided into sub-slots, and the V task FNs are allocated to access the K service FNs in each sub-slot. In most cases, this assumption holds since not all task FNs are active to offload tasks at the same time. At each time slot, each active task FN chooses one service FN to offload its computation tasks based on its local observations. In addition, task FNs do not share information with each other due to information security considerations. In our framework, different task FNs access service FNs in a time-division manner. Collisions occur when multiple task FNs choose the same service FN. Two collision models are considered in our paper. One is the ‘winner-takes-all’ model: when a collision occurs, the computation task of one task FN is processed and the rest are not processed. In the other model, the service FN accepts all arriving tasks and arranges them in a random order in the queue without queue jumping. In the multi-agent setting, the total regret under a policy Φ by time T extends (7) and is given by

RΦ(T)=𝔼[v=1Vt=1Tls|ta(v)]minAv=1Vt=1TltA(v),\displaystyle R_{\Phi}(T)=\mathbb{E}\Big{[}\sum_{v=1}^{V}\sum_{t=1}^{T}l_{s|t}^{a}(v)\Big{]}-\min_{A}\sum_{v=1}^{V}\sum_{t=1}^{T}l_{t}^{A}(v), (13)

where l_{s|t}^a(v) denotes the loss of task FN v. The regret derivation of the proposed algorithm in the multi-agent setting is the same as in the single-agent case, so here we focus on the NE property of the proposed strategy.

An NE is a set of strategies, one for each of the players of a game, such that each player’s choice is a best response to the choices of the other players [30]. Besides, an NE has the self-stability property: the users at the equilibrium achieve a mutually satisfactory solution and no user has an incentive to deviate. In the multi-agent decentralized offloading scenario, an NE means no task FN has an incentive to change its mixed strategy if all other FNs keep their strategies unchanged. This property is crucial to the decentralized computation offloading problem, since the task FNs may act in their own interests. For each task FN, achieving an NE means its current strategy is its optimal response, thus yielding a minimal-latency solution.

In the following discussion, we investigate the NE property of the proposed algorithm in a two-player zero-sum case.

IV-B Equilibrium Formulation

Let U be the cost matrix, such that when the row player plays i and the column player plays j, the former pays a cost of U(i,j) and the latter gains a reward of U(i,j) (i.e., incurs a cost of −U(i,j)), where 0 ≤ U(i,j) ≤ 1 for any i, j. Let p_t and q_t represent the mixed strategies of the row player and the column player; we then use the convention that

U(𝒑t,j)i=1Kpt(i)U(i,j),\displaystyle U(\bm{p}_{t},j)\triangleq\sum_{i=1}^{K}p_{t}^{(i)}U(i,j), (14)

and

U(𝒑t,𝒒t)i=1Kj=1Kpt(i)qt(j)U(i,j).\displaystyle U(\bm{p}_{t},\bm{q}_{t})\triangleq\sum_{i=1}^{K}\sum_{j=1}^{K}p_{t}^{(i)}q_{t}^{(j)}U(i,j). (15)

The NE is a critical concept in game theory to predict the outcome of a game. An NE is a strategy profile (p_t^*, q_t^*) such that no player has a motivation to change its strategy, since any unilateral change would degrade its benefit. To obtain the desired strategy, we define the set of all approximate NE:

Definition 1.

The set of all ϵ\epsilon-NE points is

\displaystyle\mathcal{N}_{\epsilon}=\Bigl\{(\bm{p}^{*},\bm{q}^{*})~|~U(\bm{p}^{*},\bm{q}^{*})\leq\min_{\bm{p}}U(\bm{p},\bm{q}^{*})+\epsilon,
\displaystyle U(\bm{p}^{*},\bm{q}^{*})\geq\max_{\bm{q}}U(\bm{p}^{*},\bm{q})-\epsilon\Bigr\}, (16)

and the set of NE points is 𝒩0\mathcal{N}_{0}.

The entity that converges to the set of NE in our case is the ergodic average of (𝒑t,𝒒t)(\bm{p}_{t},\bm{q}_{t}).

Definition 2.

The ergodic average of a sequence of distributions 𝐩𝐭\bm{p_{t}} is defined as:

𝒑¯tτ=1t𝒑τt.\displaystyle\bm{\bar{p}}_{t}\triangleq\frac{\sum_{\tau=1}^{t}\bm{p}_{\tau}}{t}. (17)

We say that (𝐩¯T,𝐪¯T)(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T}) converges in L1L1 to the set of NE if

\displaystyle\lim_{T\rightarrow\infty}E\Bigl\{\min_{(\bm{p}_{T}^{*},\bm{q}_{T}^{*})\in\mathcal{N}_{0}}||(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-(\bm{p}_{T}^{*},\bm{q}_{T}^{*})||_{1}\Bigr\}=0, (18)

which also implies that ϵ>0\forall\epsilon>0,

\displaystyle\lim_{T\rightarrow\infty}P\Bigl\{\min_{(\bm{p}_{T}^{*},\bm{q}_{T}^{*})\in\mathcal{N}_{0}}||(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-(\bm{p}_{T}^{*},\bm{q}_{T}^{*})||_{1}\geq\epsilon\Bigr\}=0, (19)

where ||·||_1 represents the L_1 norm.

The main theoretical analysis, given in Theorem 2, generalizes this statement to the case with delayed feedback and shows that the proposed algorithm converges to an NE despite the feedback lag. Note that the convergence of the ergodic mean to the set of NE is in the L_1 norm sense, which is much stronger than the convergence of the expected ergodic mean.
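As an illustration of Definitions 1 and 2, the sketch below computes the ergodic average of a sequence of mixed strategies and checks whether a strategy pair is an ε-NE of a given zero-sum cost matrix; since the expected cost is linear in each player's strategy, checking pure-strategy deviations suffices. The cost matrix and strategy trajectories are assumed toy values.

```python
import numpy as np

def ergodic_average(strategies):
    """Eq. (17): time average of a sequence of mixed strategies (rows of a 2-D array)."""
    return np.asarray(strategies).mean(axis=0)

def is_epsilon_ne(U, p, q, eps):
    """Definition 1 for a zero-sum game with cost matrix U: the row player cannot
    reduce the cost by more than eps, and the column player cannot raise it by
    more than eps, through any unilateral deviation."""
    value = p @ U @ q
    best_row = (U @ q).min()   # best pure response of the row (cost-minimizing) player
    best_col = (p @ U).max()   # best pure response of the column (cost-maximizing) player
    return value <= best_row + eps and value >= best_col - eps

# Toy example: matching-pennies-like cost matrix with assumed strategy trajectories.
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p_bar = ergodic_average([[0.6, 0.4], [0.5, 0.5], [0.45, 0.55]])
q_bar = ergodic_average([[0.4, 0.6], [0.5, 0.5], [0.55, 0.45]])
print(is_epsilon_ne(U, p_bar, q_bar, eps=0.1))
```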

Theorem 2.

Let two players play a zero-sum game with a cost matrix UU, where 0U(i,j)10\leq U(i,j)\leq 1 for each i,ji,j by the proposed algorithm. Denote the delay sequences of the row player and the column player as {dtr}\{d_{t}^{r}\} and {dtc}\{d_{t}^{c}\}. Let the mixed strategies of the row and column players at round tt be ptp_{t} and qtq_{t}, respectively. Then as TT\rightarrow\infty, the game played by the proposed algorithm versus itself converges to NE.

Proof of Theorem 2.

Given in Appendix B. ∎

By regarding the adversary’s action as the merged actions of the other players, the two-player setting can be generalized to the multi-player setting [31].

V Numerical Results

In this section, we evaluate the experimental performance of the proposed DEB algorithm. In all simulations, the task size of each FN is equal and normalized to 1. The duration of each time slot is also normalized to 1.

V-A Single-agent Setting

(a) Regret performance. (b) Latency performance. (c) Statistical selection.
Figure 2: Performance comparison of the QPM-D, SDB, DUCB, BLOT, DEB and non-delayed EXP3 algorithms for the single-agent setting in a stationary environment.

In the first part, we examine the performance of the DEB algorithm in the single-agent setting. Consider a fog network consisting of 7 FNs. The number of task FNs is 1, and the rest are service FNs. The task FN is deployed at the center of a circle with a 2 km radius, and the service FNs are randomly deployed within this range. In particular, the path-loss channel is generated according to the 3GPP propagation environment [24]. The total number of time slots T is set to 30000. Besides, to model the dynamic computation perturbation, the service rates μ of the FNs follow Poisson distributions whose means are generated from the Gaussian distribution 𝒩(6,1). To normalize the offloading latency, we set T_max to 5. The threshold d_max that prevents the feedback delay from becoming too large is set to 3.

In fog computing, to the best of our knowledge, there does not exist any literature considering delayed information feedback. To evaluate the performance of the proposed DEB algorithm, we compare it with the following schemes that can be implemented in the delayed feedback setting. (i) Queued Partial Monitoring with Delayed Feedback (QPM-D) algorithm [32]: the feedback of each offloading decision is assigned to the queue corresponding to the selected service FN, and the feedback information is employed when that service FN is chosen; (ii) Stochastic Delayed Bandits (SDB) algorithm [33]: an algorithm that combines QPM-D for the main update scheme with a heuristic scheme for controlling the sampling distribution to improve performance in the presence of delay; (iii) Delayed Upper Confidence Bound (DUCB) algorithm [34]: an algorithm that exploits the delayed feedback to adjust the reward update in the classical UCB framework; (iv) Bandit Learning-based Offloading of Tasks (BLOT) algorithm [15]: an offloading algorithm that selects the service FNs by means of the currently available information; (v) Non-delayed offloading: each task FN receives the feedback instantly and uses the Exponential-weight algorithm for Exploration and Exploitation (EXP3) presented in [27]. Here, QPM-D and SDB are meta schemes for handling delayed feedback, and we adopt the Thompson sampling algorithm to implement them. Note that the BLOT algorithm is not designed for the latency minimization problem, so a slight modification is made to make BLOT work in this setting.

In the first experiment, we consider a stationary setting in which the arrival rates of the FNs are constant over time. The arrival rate of FN i is set to λ_i = 4.5 + 0.5×i, suggesting that FN 1 achieves the best delay performance statistically. In Fig. 2, we plot the regret performance, latency performance, and service FN selections of the DEB, QPM-D, SDB, DUCB, BLOT, and non-delayed EXP3 algorithms. In addition, the latency of the optimal choice (the task FN always selects service FN 1) is plotted to show the performance gap of the above algorithms. Here, 1000 Monte Carlo simulations are performed.

Fig. 2 (a) depicts the regret performance of the offloading decisions made by the above algorithms. The y-axis represents the ratio of the aggregate regret, revealing the regret order of the DEB, QPM-D, SDB, DUCB, BLOT, and non-delayed EXP3 algorithms. The convergence of the proposed algorithm reflects the learning process of the environment, such as the channel conditions and the computation capacities of neighboring FNs. From Fig. 2 (a), we observe that the DEB algorithm achieves better regret performance than the QPM-D, DUCB, BLOT, and SDB algorithms. The performance discrepancy among the DEB, QPM-D, and SDB algorithms is caused by how the delayed feedback is utilized. The proposed DEB algorithm harnesses the delayed feedback as soon as it arrives, whereas the QPM-D and SDB algorithms first store the delayed feedback of a service FN in its queue and update only when that service FN is selected. With delayed feedback, such updating schemes degrade the regret performance. As for the performance gap between the DUCB, BLOT, and DEB algorithms, the DUCB and BLOT algorithms focus on recent feedback but lose sight of the entire history, which results in a larger performance loss. Without knowledge of other FNs’ conditions and peer competition, the FN adopting the DEB algorithm can always adapt to the environment and make a proper choice.

Fig. 2 (b) shows the average offloading latency in the single-agent setting. The offloading latency of the QPM-D and SDB algorithms decreases fast at the beginning but only reaches sub-optimal latency performance later. Moreover, the converged latencies of the DUCB and BLOT algorithms are not as good as that of the DEB algorithm. The reason is that the DUCB and BLOT algorithms adopt a discounted UCB style to handle the delayed feedback, so the exploration procedure pays more attention to recent feedback; as a result, the two algorithms cannot sufficiently identify the best service FN. The proposed algorithm can fully explore the service FNs and then pick the best FN, thus revealing better average latency performance. This figure also complements the explanation of the results in Fig. 2 (a).

Figure 3: Performance comparison of the QPM-D, SDB, DUCB, BLOT and DEB algorithms in a dynamic environment.

In Fig. 2 (c), the statistical selections of the DEB, QPM-D, SDB, DUCB, BLOT, and EXP3 algorithms are depicted under the same simulation setup. According to this setup, FN 1 has the smallest expected loss and FN 6 has the largest expected loss. Fig. 2 (c) reveals that the task FN selects the best FN most often with the DEB algorithm. The task FN rarely offloads tasks to the FN with the largest loss, and most of those selections occur during the exploration period. Combining the three sub-figures of Fig. 2, we conclude that the DEB algorithm adapts better to the changing environment than the other algorithms, thus yielding the smallest regret.

In the second experiment, the performance of the proposed algorithm in a dynamically changing network environment is verified. In the first T/2 time slots, the arrival rate of FN i is set as λ_i = 4.5 + 0.5×i, which is the same as in the first experiment. In the last T/2 time slots, the arrival rate of FN 1 changes to 7; FN 2 then starts to outperform FN 1 and eventually becomes the leader. This setting is more practical, as the arrival rates of FNs may change due to the dynamics of the network structure. This experiment examines the robustness of the proposed algorithm.

In Fig. 3, we explore the average latency performance of the DEB, QPM-D, DUCB, BLOT, and SDB algorithms in the dynamic setting. Here, 1000 Monte Carlo simulations are performed. From Fig. 3, we observe that the performance tendencies of all algorithms change at T/2 as the network statistics vary, and the deteriorated performance recovers gradually over the remaining time. Compared with the QPM-D, SDB, DUCB, and BLOT algorithms, the DEB algorithm recovers faster when the task FN experiences a variation of network statistics, suggesting better robustness. This is because the DEB algorithm selects the service FNs according to a probability distribution, and the implicit exploration factor β helps it adapt to the dynamic environment.

Figure 4: Latency performance of the DEB algorithm in the single-agent setting with different numbers of service FNs.

To evaluate the scalability of the DEB algorithm, we examine its latency performance with different network sizes in Fig. 4. Here, the number of service FNs is set as K = {20, 100, 500, 1000}. FN 1 is set as the best service FN with arrival rate 5, while the arrival rates of the remaining service FNs are uniformly generated from [6, 10]. From Fig. 4, we observe that the latency performance deteriorates as the number of service FNs increases, since additional service FNs add to the cost of exploration, which results in worse average latency. However, as time goes by, the average latency decreases, and the task FN with the DEB algorithm eventually picks the best service FN.

V-B Multi-agent Setting

In this part, we investigate the performance of the proposed algorithm in the multi-agent setting. We set V = 4 and K = 10. The arrival rate of each FN i is set as λ_i = 4 + 0.4×i. In the multi-agent scenario, the DUCB and BLOT algorithms cause severe collisions in their exploration stages, so the two algorithms are not applicable to the multi-agent setting. We therefore only compare the performance of the DEB, QPM-D, and SDB algorithms here.

Figure 5: Performance comparison of the QPM-D, SDB and DEB algorithms in the multi-agent setting.
Figure 6: Service FN selections of the DEB algorithm in the multi-agent setting.

Fig. 5 explores the multi-agent performance of the DEB, SDB, and QPM-D algorithms. Here, we adopt a weak-collision setting, where a service FN processes multiple arrivals in a random order. In this setting, the task FNs suffer only a slight collision loss when multiple tasks arrive at a service FN. Fig. 5 plots the average latency of different users with the above algorithms. Compared with the SDB and QPM-D algorithms, the DEB algorithm performs worse at the beginning but becomes better as time goes by. The DEB algorithm chooses service FNs according to a probability distribution, implying sufficient exploration of all service FNs; such a selection rule is more suitable for a dynamic Markovian environment. In addition, Fig. 6 shows the service FN selections of the DEB algorithm for different agents. From Fig. 6, we observe that the service FNs with low arrival rates are selected nearly equally often by the four agents. In this setting, the loss caused by collisions is slight; therefore, the agents prefer to share the best FNs when the offloading strategy becomes stable.

Figure 7: Service FN selections in different time periods under the ‘winner-takes-all’ setting.

In the next experiment, we investigate the performance of the DEB algorithm in a ‘winner-takes-all’ setting, where only one task FN obtains the real loss and the rest suffer the maximal loss 1. To reveal the NE property of the proposed algorithm, we consider only 2 agents here. The number of service FNs is set to 6, and the arrival rates of the service FNs are set as [4, 4, 5, 6, 7, 8]. It can be observed that the NE points in this setting are: (i) task FN 1 chooses service FN 1 while task FN 2 chooses service FN 2; (ii) task FN 1 chooses service FN 2 while task FN 2 chooses service FN 1. Fig. 7 shows the task FNs’ selections during the periods [0, T/3], (T/3, 2T/3], (2T/3, T], and [0, T], respectively. In the early stage of offloading, the two task FNs prefer to share service FN 1 and FN 2, which results in many collisions when their choices coincide. After suffering the performance loss from collisions, the two agents become smarter and gradually avoid selecting the same FN. Eventually, the two task FNs reach the NE point where task FN 1 chooses service FN 2 and task FN 2 chooses service FN 1. This simulation result indicates that, even with delayed feedback, the proposed algorithm converges to an NE in the ergodic average sense.

VI Conclusions

Considering the unknown states of peer FNs and dynamic environmental changes, in this paper we studied the minimal-latency peer offloading problem in fog networks. We modeled the task offloading problem as an adversarial multi-armed bandit problem with delayed feedback. To solve this problem, we proposed an efficient DEB algorithm that extends the general exponential-based algorithm to the case with delayed feedback. The DEB algorithm is proven to achieve a sub-linear regret in both the single-agent and multi-agent settings. Moreover, the DEB algorithm leads to an NE even with delayed feedback in the multi-agent setting. Theoretical and numerical results validated the efficiency of the DEB algorithm.

Appendix A

Proof of Theorem 1.

Since the initial weight of each action is set as 11, for all sequences of actions drawn by the proposed algorithm, we have

WTW0\displaystyle\frac{W_{T}}{W_{0}} maxjexp(ηs=1Tl~sj)K\displaystyle\geq\frac{\max\limits_{j}\exp(-\eta\sum_{s=1}^{T}\tilde{l}_{s}^{j})}{K}
exp(ηminjs=1Tl~sj)K.\displaystyle\geq\frac{\exp(-\eta\min\limits_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j})}{K}.

Then taking logarithms,

lnWTW0ηminjs=1Tl~sjlnK.\displaystyle\ln\frac{W_{T}}{W_{0}}\geq-\eta\min_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j}-\ln K. (20)

On the other hand,

WtWt1\displaystyle\frac{W_{t}}{W_{t-1}} =iwt1iWt1s:s+ds=texp(ηl~si)\displaystyle=\sum_{i}\frac{w_{t-1}^{i}}{W_{t-1}}\prod_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i})
=iptas:s+ds=texp(ηl~si)\displaystyle=\sum_{i}p_{t}^{a}\prod_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i})
iptas:s+ds=texp(ηl~si)\displaystyle\leq\sum_{i}p_{t}^{a}\sum_{s:s+d_{s}=t}\exp(-\eta\tilde{l}_{s}^{i}) (21)
iptas:s+ds=t(1ηl~si+η22(l~si)2)\displaystyle\leq\sum_{i}p_{t}^{a}\sum_{s:s+d_{s}=t}\Big{(}1-\eta\tilde{l}_{s}^{i}+\frac{\eta^{2}}{2}(\tilde{l}_{s}^{i})^{2}\Big{)} (22)
=s:s+ds=t(1ηiptil~si+η22ipti(l~si)2)\displaystyle=\sum_{s:s+d_{s}=t}\Big{(}1-\eta\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}+\frac{\eta^{2}}{2}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}
=1+|{s:s+ds=t}|1ηs:s+ds=tiptil~si\displaystyle=1+\Big{|}\{s:s+d_{s}=t\}\Big{|}-1-\eta\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22s:s+ds=tipti(l~si)2\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}
exp(|{s:s+ds=t}|1ηs:s+ds=tiptil~si\displaystyle\leq\exp\Big{(}\Big{|}\{s:s+d_{s}=t\}\Big{|}-1-\eta\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22s:s+ds=tipti(l~si)2).\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}. (23)

From exp(−ηl̃_s^i) ∈ (0,1], ∀i, we obtain (21); the inequality (22) follows from e^x ≤ 1 + x + x²/2 for x ≤ 0; and inequality (23) is derived from e^x ≥ 1 + x, ∀x.

By a telescoping sum, we have

WTW0\displaystyle\frac{W_{T}}{W_{0}} exp(ηts:s+ds=tiptil~si\displaystyle\leq\exp\Big{(}-\eta\sum_{t}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22ts:s+ds=tipti(l~si)2).\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{s:s+d_{s}=t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}\Big{)}. (24)

Note that the sums of the form ∑_t ∑_{s:s+d_s=t} include each value of s only once, and some feedback may not be obtained before round T. Since the feedback is delayed by at most d_max, the regret caused by missing samples is also bounded. Let |ℳ| denote the regret caused by missing samples. Then we have,

lnWTW0\displaystyle\ln\frac{W_{T}}{W_{0}} ηtiptil~si\displaystyle\leq-\eta\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22tipti(l~si)2+||.\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}+|\mathcal{M}|. (25)

Combining (20) and (25), we can get

ηminjs=1Tl~sjlnK\displaystyle-\eta\min_{j}\sum_{s=1}^{T}\tilde{l}_{s}^{j}-\ln K ηtiptil~si\displaystyle\leq-\eta\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{s}^{i}
+η22tipti(l~si)2+||.\displaystyle~{}~{}~{}+\frac{\eta^{2}}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{s}^{i})^{2}+|\mathcal{M}|.

Rearranging the above equation, we have

tipt+dtil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t+d_{t}}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j} 12ηtipt+dti(l~ti)2\displaystyle\leq\frac{1}{2}\eta\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
+lnKη+||.\displaystyle+\frac{\ln K}{\eta}+|\mathcal{M}|. (26)

First, we need to characterize the relationship between p_{t+d_t}^i and p_t^i.

Lemma 1.

Let N_t = |{s : s+d_s ∈ [t, t+d_t)}| denote the number of feedback messages that arrive between playing an action and observing its feedback. For a subset C of the integers, corresponding to time steps, let

qi(C)=exp(ηsCl~si)jexp(ηsCl~sj).\displaystyle q^{i}(C)=\frac{\exp(-\eta\sum_{s\in C}\tilde{l}_{s}^{i})}{\sum_{j}\exp(-\eta\sum_{s\in C}\tilde{l}_{s}^{j})}. (27)

Then the resulting probabilities fulfill for every time slot tt and action ii,

pt+dtiptiηm=1Ntqi(Ct1{zn:n<m})l~zii,\displaystyle p_{t+d_{t}}^{i}-p_{t}^{i}\leq-\eta\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z^{\prime}_{n}:n<m\})\tilde{l}_{z^{\prime}_{i}}^{i}, (28)

where znz^{\prime}_{n} is an enumeration of {s:s+ds[t,t+dt)}\{s:s+d_{s}\in[t,t+d_{t})\}.

Proof of Lemma 1.

Refer to Lemma 10 of [35]. ∎

By using Lemma 1 to bound the left hand side (LHS) of (26), we have

tiptil~ti\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i} minjtl~tj12ηtipt+dti(l~ti)2\displaystyle-\min_{j}\sum_{t}\tilde{l}_{t}^{j}\leq\frac{1}{2}\eta\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
+ηtil~tim=1Ntqi(Ct1{zn:n<m})l~zni\displaystyle+\eta\sum_{t}\sum_{i}\tilde{l}_{t}^{i}\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z^{\prime}_{n}:n<m\})\tilde{l}_{z^{\prime}_{n}}^{i}
+lnKη+||.\displaystyle+\frac{\ln K}{\eta}+|\mathcal{M}|. (29)

Next, we bound the second term on the right-hand side (RHS) of (29). Since t is not part of the enumeration z_j′, the two expectations can be taken independently: 𝔼[l̃_{z_j′}^i] ≤ 1 and 𝔼[l̃_t^i] ≤ 1. Then, we have

E[til~tim=1Ntqi(Ct1{zn:n<m})l~zni]tNt.\displaystyle E\Big{[}\sum_{t}\sum_{i}\tilde{l}_{t}^{i}\sum_{m=1}^{N_{t}}q^{i}(C_{t-1}\cup\{z_{n}^{\prime}:n<m\})\tilde{l}_{z_{n}^{\prime}}^{i}\Big{]}\leq\sum_{t}N_{t}.

Counting in how many intervals each loss is observed is the same as counting how many losses are observed in each interval. Thus, summing over t or s is equivalent, i.e.,

tNt\displaystyle\sum_{t}N_{t} t|{s:s+ds[t,t+dt)}|\displaystyle\leq\sum_{t}|\{s:s+d_{s}\in[t,t+d_{t})\}|
=s|{t:s+ds[t,t+dt)}|.\displaystyle=\sum_{s}|\{t:s+d_{s}\in[t,t+d_{t})\}|.

Both s and t are bounded by T. Then we have

s|{t:s+ds[t,t+dt)}|\displaystyle\sum_{s}|\{t:s+d_{s}\in[t,t+d_{t})\}|
s|{t>s:s+ds[t,t+dt)}||\displaystyle\leq\sum_{s}|\{t>s:s+d_{s}\in[t,t+d_{t})\}||
+|{t<s:s+ds[t,t+dt)}|.\displaystyle~{}~{}~{}+|\{t<s:s+d_{s}\in[t,t+d_{t})\}|. (30)

and the first term of (30) can be bounded as

|{t>s:s+ds[t,t+dt)}|\displaystyle|\{t>s:s+d_{s}\in[t,t+d_{t})\}|
|{t>s:ts+ds}\{t>s:t+dt<s+ds}|\displaystyle\leq|\{t>s:t\leq s+d_{s}\}\backslash{\{t>s:t+d_{t}<s+d_{s}\}}|
ds|{t>s:t+dt<s+ds}|,\displaystyle\leq d_{s}-|\{t>s:t+d_{t}<s+d_{s}\}|, (31)

The second term can be bounded in a similar way,

|{t<s:s+ds[t,t+dt)}|\displaystyle|\{t<s:s+d_{s}\in[t,t+d_{t})\}|
|{t<s:s+ds<t+dt}|.\displaystyle\leq|\{t<s:s+d_{s}<t+d_{t}\}|. (32)

Then, by the prior equivalence of summing over t or s, the last term in (31) cancels with (32) when summed, which bounds the second term of (29) by ηD.

The next lemma helps us bound the first term on the RHS of (29).

Lemma 2.

The probabilities defined in (27) satisfy

qi(Φi)(1+12N1)qi(Φi1),i,\displaystyle q^{i}(\Phi_{i})\leq\Big{(}1+\frac{1}{2N-1}\Big{)}q^{i}(\Phi_{i-1}),\forall i, (33)
Proof of Lemma 2.

Refer to Lemma 11 of [35]. ∎

Using Lemma 2, we have

tipt+dti(l~ti)2\displaystyle\sum_{t}\sum_{i}p_{t+d_{t}}^{i}(\tilde{l}_{t}^{i})^{2}
=tiptipt+dtipti(l~ti)2\displaystyle=\sum_{t}\sum_{i}p_{t}^{i}\frac{p_{t+d_{t}}^{i}}{p_{t}^{i}}(\tilde{l}_{t}^{i})^{2}
=tipti(l~ti)2i=1Ntqi(Ct1{zj:ji})qi(Ct1{zj:j<i})\displaystyle=\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}\prod_{i=1}^{N_{t}}\frac{q^{i}(C_{t-1}\cup\{z_{j}^{\prime}:j\leq i\})}{q^{i}(C_{t-1}\cup\{z_{j}^{\prime}:j<i\})}
tipti(l~ti)2(1+12N1)Nt\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}(1+\frac{1}{2N-1})^{N_{t}}
tipti(l~ti)2(1+12N1)2N1\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}(1+\frac{1}{2N-1})^{2N-1}
tipti(l~ai)2e.\displaystyle\leq\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{a}^{i})^{2}e. (34)

Finally, inequality (29) can be further bounded as

tiptil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j} lnKη+ηD\displaystyle\leq\frac{\ln K}{\eta}+\eta D
+η2tipti(l~ti)2e+||.\displaystyle+\frac{\eta}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}e+|\mathcal{M}|. (35)

Note that the loss estimator in (8) is biased. Next, we use the following lemma to bound the deviation of the estimated loss from the true loss.

Lemma 3.

With probability at least 1δ1-\delta,

t=1T(l~tilti)ln(K/δ)2β,\displaystyle\sum_{t=1}^{T}(\tilde{l}_{t}^{i}-l_{t}^{i})\leq\frac{\ln(K/\delta)}{2\beta}, (36)

simultaneously holds for all i[K]i\in[K].

Proof of Lemma 3.

Refer to Lemma 1 of [28]. ∎

Applying Lemma 3, with probability at least 1−δ, we have

tiptil~timinjtl~tj\displaystyle\sum_{t}\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i}-\min_{j}\sum_{t}\tilde{l}_{t}^{j}
lnKη+η2tipti(l~ai)2e+ηD+||.\displaystyle\leq\frac{\ln K}{\eta}+\frac{\eta}{2}\sum_{t}\sum_{i}p_{t}^{i}(\tilde{l}_{a}^{i})^{2}e+\eta D+|\mathcal{M}|. (37)

From the definition of the estimated loss in (8), the first term on the LHS of (37) can be rewritten as

iptil~ti\displaystyle\sum_{i}p_{t}^{i}\tilde{l}_{t}^{i} =i=1ptiltipti+βlti𝟙{It=i}\displaystyle=\sum_{i=1}p_{t}^{i}\frac{l_{t}^{i}}{p_{t}^{i}+\beta l_{t}^{i}}\mathbbm{1}_{\{I_{t}=i\}}
=i𝟙{It=i}ltiβi(lti)2pti+βlti𝟙{It=i}\displaystyle=\sum_{i}\mathbbm{1}_{\{I_{t}=i\}}l_{t}^{i}-\beta\sum_{i}\frac{(l_{t}^{i})^{2}}{p_{t}^{i}+\beta l_{t}^{i}}\mathbbm{1}_{\{I_{t}=i\}}
=i𝟙{It=i}ltiβiltil~ti.\displaystyle=\sum_{i}\mathbbm{1}_{\{I_{t}=i\}}l_{t}^{i}-\beta\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}. (38)

Similarly,

ipti(l~ti)2=iptiltipti+βltil~tiiltil~ti.\displaystyle\sum_{i}p_{t}^{i}(\tilde{l}_{t}^{i})^{2}=\sum_{i}p_{t}^{i}\frac{l_{t}^{i}}{p_{t}^{i}+\beta l_{t}^{i}}\tilde{l}_{t}^{i}\leq\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}. (39)

Combining (38), (39) with (37), we have

\sum_{t}\Big(l_{t,I_{t}}-\beta\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}\Big)-\min_{j}\sum_{t}\Big(l_{t}^{j}+\frac{\ln(K/\delta)}{2\beta}\Big)
\leq\frac{\ln K}{\eta}+\frac{\eta e}{2}\sum_{t}\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}+\eta D+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\sum_{t}\sum_{i}l_{t}^{i}\tilde{l}_{t}^{i}+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\sum_{t}\sum_{i}\tilde{l}_{t}^{i}+|\mathcal{M}|
\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\Big(KT+\frac{K\ln(K/\delta)}{2\beta}\Big)+|\mathcal{M}|. \qquad (40)

By properly selecting $\eta=2\beta=\sqrt{\frac{\ln K+\ln(K/\delta)}{D+(e+1)KT/2}}$, we have

\bar{\mathcal{R}}\leq 2\sqrt{\Big(D+\frac{(e+1)}{2}KT\Big)\Big(2\ln K-\ln\delta\Big)}+\frac{(e+1)}{2}K\ln\Big(\frac{K}{\delta}\Big)+|\mathcal{M}|, \qquad (41)

which finishes the proof of Theorem 1. ∎
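For intuition about the scaling of the bound in (41), the short sketch below evaluates the prescribed step sizes $\eta=2\beta$ and the resulting bound for a few horizons $T$. The values of $K$, $\delta$, and $d_{\max}$ are made up, $D$ is crudely taken as $Td_{\max}$, and the $|\mathcal{M}|$ term is omitted; the point is only to illustrate the sub-linear growth stated in Theorem 1.

# Illustrative evaluation of the step-size choice eta = 2*beta and the regret
# bound (41); K, delta and d_max are made-up values, D is crudely taken as
# T*d_max, and the |M| term is omitted.
import math

K, delta, d_max = 8, 0.05, 5
for T in (10**3, 10**4, 10**5):
    D = d_max * T
    eta = math.sqrt((math.log(K) + math.log(K / delta)) / (D + (math.e + 1) * K * T / 2))
    beta = eta / 2
    bound = (2 * math.sqrt((D + (math.e + 1) / 2 * K * T) * (2 * math.log(K) - math.log(delta)))
             + (math.e + 1) / 2 * K * math.log(K / delta))
    print(f"T={T:>6}: eta={eta:.4f}, bound={bound:8.1f}, bound/T={bound / T:.3f}")

As $T$ grows, the bound increases only as $O(\sqrt{T})$, so the per-slot regret bound/T shrinks toward zero.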

Appendix B

Proof of Theorem 2.

Let $\epsilon>0$, and define the ergodic average of the value of the game by

\bar{U}_{T}=\frac{\sum_{t=1}^{T}U(\bm{p}_{t},\bm{q}_{t})}{T}. \qquad (42)

By running the proposed algorithm with the cost sequence $l_t^i=U(i,\bm{q}_t)$, we know from (40) that, for any column strategy (in particular $\bm{q}_t$) and any row strategy $\bm{p}$, the row player guarantees

E\Biggl\{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q}_{t})-U(\bm{p},\bm{q}_{t})\Big)\Biggr\}\leq\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\eta D+\Big(\frac{\eta e}{2}+\beta\Big)\Big(KT+\frac{K\ln(K/\delta)}{2\beta}\Big)+|\mathcal{M}|. \qquad (43)

Let $G$ denote the constant term in (43), i.e.,

G=\frac{\ln K}{\eta}+\frac{\ln(K/\delta)}{2\beta}+\Big(\frac{\eta e}{2}+\beta\Big)\frac{K\ln(K/\delta)}{2\beta}+|\mathcal{M}|.

Then there exists $T_1>0$ such that

E\Biggl\{\bar{U}_{T}-U(\bm{p},\bm{\bar{q}}_{T})\Biggr\} =E\Biggl\{\frac{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q}_{t})-U(\bm{p},\bm{q}_{t})\Big)}{T}\Biggr\}
\leq\frac{G+(\frac{\eta e}{2}+\beta)KT+\eta D}{T}
\leq\frac{\epsilon}{2},\quad\forall~T>T_{1}, \qquad (44)

where the second inequality holds since $D$ is bounded by $d_{\max}$ and $G$ becomes negligible as $T\rightarrow\infty$.

Then, for the column player with the cost sequence $l_t^j=1-U(\bm{p}_t,j)$, for every strategy $\bm{p}_t$ and strategy $\bm{q}$, we have

E\Biggl\{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q})-U(\bm{p}_{t},\bm{q}_{t})\Big)\Biggr\}\leq G+\Big(\frac{\eta e}{2}+\beta\Big)KT+\eta D. \qquad (45)

Hence there exists $T_2>0$ such that

E\Biggl\{U(\bm{\bar{p}}_{T},\bm{q})-\bar{U}_{T}\Biggr\} =E\Biggl\{\frac{\sum_{t=1}^{T}\Big(U(\bm{p}_{t},\bm{q})-U(\bm{p}_{t},\bm{q}_{t})\Big)}{T}\Biggr\}
\leq\frac{G+(\frac{\eta e}{2}+\beta)KT+\eta D}{T}
\leq\frac{\epsilon}{2},\quad\forall~T>T_{2}. \qquad (46)

Now, define $\bm{p}_T^b$ as the best response to $\bm{\bar{q}}_T$, i.e.,

\bm{p}_{T}^{b}=\arg\min_{\bm{p}^{\prime}}~U(\bm{p}^{\prime},\bm{\bar{q}}_{T}),

and $\bm{q}_T^b$ as the best response to $\bm{\bar{p}}_T$, i.e.,

\bm{q}_{T}^{b}=\arg\max_{\bm{q}^{\prime}}~U(\bm{\bar{p}}_{T},\bm{q}^{\prime}).

Choosing $\bm{p}=\bm{p}_T^b$ and $\bm{q}=\bm{\bar{q}}_T$ in (44) and (46) and adding them together, we have

E\Biggl\{\Bigg|U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\min_{\bm{p}^{\prime}}~U(\bm{p}^{\prime},\bm{\bar{q}}_{T})\Bigg|\Biggr\} =E\Bigl\{\bar{U}_{T}-U(\bm{p}_{T}^{b},\bm{\bar{q}}_{T})\Bigr\}+E\Biggl\{U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\bar{U}_{T}\Biggr\}
\leq\epsilon,\quad\forall~T>\max\{T_{1},T_{2}\}, \qquad (47)

where the first equality holds since $U(\bm{\bar{p}}_T,\bm{\bar{q}}_T)\geq U(\bm{p}_T^b,\bm{\bar{q}}_T)$. Choosing $\bm{p}=\bm{\bar{p}}_T$ and $\bm{q}=\bm{q}_T^b$ in (44) and (46) and adding them together, we have

E\Biggl\{\Bigg|U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})-\max_{\bm{q}^{\prime}}~U(\bm{\bar{p}}_{T},\bm{q}^{\prime})\Bigg|\Biggr\} =E\Bigl\{\bar{U}_{T}-U(\bm{\bar{p}}_{T},\bm{\bar{q}}_{T})\Bigr\}+E\Biggl\{U(\bm{\bar{p}}_{T},\bm{q}_{T}^{b})-\bar{U}_{T}\Biggr\}
\leq\epsilon,\quad\forall~T>\max\{T_{1},T_{2}\}, \qquad (48)

where the first equality holds as $U(\bm{\bar{p}}_T,\bm{\bar{q}}_T)\leq U(\bm{\bar{p}}_T,\bm{q}_T^b)$. Equations (47) and (48) show that $(\bm{\bar{p}}_T,\bm{\bar{q}}_T)$ converges to $\mathcal{N}_{\epsilon}$ in the $L_1$ sense. ∎
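The convergence claimed in (47) and (48) can also be illustrated empirically. The following self-contained sketch uses a toy zero-sum matrix game with made-up payoffs and the classical full-information exponential-weights update (rather than the paper's delayed-feedback bandit algorithm); it lets both players run a no-regret strategy and checks that the time-averaged strategies approach an approximate equilibrium, mirroring the argument above.

# Toy illustration of Theorem 2: two no-regret (exponential-weights) players
# on a small zero-sum game; the time-averaged strategies approach an
# approximate Nash equilibrium. Payoffs and step size are illustrative.
import math

U = [[0.8, 0.2, 0.5],        # U[i][j]: cost of row action i vs column action j
     [0.3, 0.9, 0.4],
     [0.6, 0.5, 0.1]]
K = len(U)
T, eta = 20000, 0.05

wp, wq = [1.0] * K, [1.0] * K
sp, sq = [0.0] * K, [0.0] * K            # running sums for the averaged strategies
for _ in range(T):
    p = [w / sum(wp) for w in wp]
    q = [w / sum(wq) for w in wq]
    sp = [a + b for a, b in zip(sp, p)]
    sq = [a + b for a, b in zip(sq, q)]
    # expected cost of each pure action against the opponent's mixed strategy
    row_cost = [sum(U[i][j] * q[j] for j in range(K)) for i in range(K)]
    col_gain = [sum(U[i][j] * p[i] for i in range(K)) for j in range(K)]
    wp = [wp[i] * math.exp(-eta * row_cost[i]) for i in range(K)]   # row minimizes
    wq = [wq[j] * math.exp(+eta * col_gain[j]) for j in range(K)]   # column maximizes

p_bar = [s / T for s in sp]
q_bar = [s / T for s in sq]
value = sum(U[i][j] * p_bar[i] * q_bar[j] for i in range(K) for j in range(K))
best_row = min(sum(U[i][j] * q_bar[j] for j in range(K)) for i in range(K))
best_col = max(sum(U[i][j] * p_bar[i] for i in range(K)) for j in range(K))
print(f"value={value:.3f}, row deviation gain={value - best_row:.4f}, "
      f"column deviation gain={best_col - value:.4f}")   # both should be small

Both deviation gains shrink as $T$ grows, which is exactly the $\epsilon$-Nash-equilibrium condition used in (47) and (48).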

References

  • [1] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
  • [2] M. S. Elbamby, C. Perfecto, C. Liu, J. Park, S. Samarakoon, X. Chen, and M. Bennis, “Wireless edge computing with latency and reliability guarantees,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1717–1737, 2019.
  • [3] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet of things,” in Workshop on Mobile Cloud Computing (MCC), Aug. 2012, pp. 13–16.
  • [4] S. Yi, C. Li, and Q. Li, “A survey of fog computing: concepts, applications and issues,” in Workshop on Mobile Big Data (Mobidata), Aug. 2015, pp. 13–16.
  • [5] F. Bonomi, R. Milito, P. Natarajan, and J. Zhu, “Fog computing: A platform for internet of things and analytics,” in Big Data and Internet of Things: A Roadmap for Smart Environments. Springer, 2014, pp. 169–186.
  • [6] K. Zhang, Y. Mao, S. Leng, Y. He, and Y. Zhang, “Mobile-edge computing for vehicular networks: A promising network paradigm with predictive off-loading,” IEEE Vehicular Technology Magazine, vol. 12, no. 2, pp. 36–44, 2017.
  • [7] J. Wang, K. Liu, B. Li, T. Liu, R. Li, and Z. Han, “Delay-sensitive multi-period computation offloading with reliability guarantees in fog networks,” IEEE Transactions on Mobile Computing, pp. 1–1, 2019.
  • [8] R. Deng, R. Lu, C. Lai, T. H. Luan, and H. Liang, “Optimal workload allocation in fog-cloud computing toward balanced delay and power consumption,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 1171–1181, 2016.
  • [9] C. You, K. Huang, H. Chae, and B. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1397–1411, 2017.
  • [10] T. Q. Dinh, J. Tang, Q. D. La, and T. Q. S. Quek, “Offloading in mobile edge computing: Task allocation and computational frequency scaling,” IEEE Transactions on Communications, vol. 65, no. 8, pp. 3571–3584, 2017.
  • [11] I. Ketykó, L. Kecskés, C. Nemes, and L. Farkas, “Multi-user computation offloading as multiple knapsack problem for 5G mobile edge computing,” in European Conference on Networks and Communications (EuCNC), June 2016, pp. 225–229.
  • [12] M. Khaledi, M. Khaledi, and S. K. Kasera, “Profitable task allocation in mobile cloud computing,” in ACM Symposium on QoS and Security for Wireless and Mobile Networks (Q2SWinet), Nov. 2016, pp. 9–17.
  • [13] H. Zhu, K. Kang, X. Luo, and H. Qian, “Online learning for computation peer offloading with semi-bandit feedback,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 4524–4528.
  • [14] G. Lee, W. Saad, and M. Bennis, “An online secretary framework for fog network formation with minimal latency,” in IEEE International Conference on Communications (ICC), May 2017, pp. 1–6.
  • [15] Z. Zhu, T. Liu, Y. Yang, and X. Luo, “BLOT: Bandit learning-based offloading of tasks in fog-enabled networks,” IEEE Transactions on Parallel and Distributed Systems, vol. 30, no. 12, pp. 2636–2649, 2019.
  • [16] C. Liu, M. Bennis, M. Debbah, and H. V. Poor, “Dynamic task offloading and resource allocation for ultra-reliable low-latency edge computing,” IEEE Transactions on Communications, vol. 67, no. 6, pp. 4132–4150, 2019.
  • [17] M. Yang, H. Zhu, H. Wang, Y. Koucheryavy, K. Samouylov, and H. Qian, “An online learning approach to computation offloading in dynamic fog networks,” IEEE Internet of Things Journal, pp. 1–1, 2020.
  • [18] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper, “The Mälardalen WCET benchmarks: Past, present and future,” in International Workshop on Worst-Case Execution Time Analysis (WCET). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, July 2010.
  • [19] M. Yang, H. Zhu, H. Wang, Y. Koucheryavy, K. Samouylov, and H. Qian, “Peer to peer offloading with delayed feedback: An adversary bandit approach,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 5035–5039.
  • [20] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016.
  • [21] L. Chen, S. Zhou, and J. Xu, “Computation peer offloading for energy-constrained mobile edge computing in small-cell networks,” IEEE/ACM Transactions on Networking, vol. 26, no. 4, pp. 1619–1632, Aug. 2018.
  • [22] D. Huang, P. Wang, and D. Niyato, “A dynamic offloading algorithm for mobile computing,” IEEE Transactions on Wireless Communications, vol. 11, no. 6, pp. 1991–1995, 2012.
  • [23] O. Muñoz, A. Pascual-Iserte, and J. Vidal, “Optimization of radio and computational resources for energy efficiency in latency-constrained application offloading,” IEEE Transactions on Vehicular Technology, vol. 64, no. 10, pp. 4738–4755, 2015.
  • [24] “Evolved universal terrestrial radio access (E-UTRA); physical layer procedures (release 16),” 3GPP TS 36.213, 2020.
  • [25] D. J. Costello, Error Control Coding: Fundamentals and Applications. Prentice Hall, 1983.
  • [26] D. P. Bertsekas, R. G. Gallager, and P. Humblet, Data networks.   Prentice-Hall International New Jersey, 1992, vol. 2.
  • [27] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “The non-stochastic multi-armed bandit problem,” SIAM Journal on Computing, vol. 32, pp. 48–77, 2002.
  • [28] G. Neu, “Explore no more: Improved high-probability regret bounds for non-stochastic bandits,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2015, pp. 3168–3176.
  • [29] Y. Cai and C. Daskalakis, “On minmax theorems for multiplayer games,” in Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Jan. 2011, pp. 217–234.
  • [30] J. Nash, “Non-cooperative games,” Annals of Mathematics, pp. 286–295, 1951.
  • [31] T. Roughgarden, “Algorithmic game theory,” Communications of the ACM, vol. 53, no. 7, pp. 78–86, 2010.
  • [32] P. Joulani, A. Gyorgy, and C. Szepesvári, “Online learning under delayed feedback,” in International Conference on Machine Learning (ICML), June 2013, pp. 1453–1461.
  • [33] T. Mandel, Y.-E. Liu, E. Brunskill, and Z. Popović, “The queue method: Handling delay, heuristics, prior data, and evaluation in bandits,” in AAAI Conference on Artificial Intelligence, 2015.
  • [34] C. Vernade, O. Cappe, and V. Perchet, “Stochastic bandit models for delayed conversions,” in Conference on Uncertainty in Artificial Intelligence (UAI), Aug. 2017.
  • [35] T. S. Thune, N. Cesa-Bianchi, and Y. Seldin, “Nonstochastic multiarmed bandits with unrestricted delays,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2019, pp. 6541–6550.