
Collaborative Multi-Agent Multi-Armed Bandit Learning for Small-Cell Caching

Xianzhe Xu, Meixia Tao, and Cong Shen
X. Xu and M. Tao are with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (Emails: [email protected]; [email protected]). C. Shen is with the Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, USA (Email: [email protected]). Part of this work was presented in IEEE Globecom 2018 [1]. This work is supported in part by the NSF of China under grant 61941106 and by the National Key R&D Project of China under grant 2019YFB1802702.
Abstract

This paper investigates learning-based caching in small-cell networks (SCNs) when user preference is unknown. The goal is to optimize the cache placement in each small base station (SBS) so as to minimize the long-term transmission delay of the system. We model this sequential multi-agent decision making problem from a multi-agent multi-armed bandit (MAMAB) perspective. Rather than estimating user preference first and then optimizing the cache strategy, we propose several MAMAB-based algorithms that learn the cache strategy directly online, in both stationary and non-stationary environments. In the stationary environment, we first propose two high-complexity agent-based collaborative MAMAB algorithms with performance guarantees. We then propose a low-complexity distributed MAMAB algorithm that ignores SBS coordination. To achieve a better balance between SBS coordination gain and computational complexity, we develop an edge-based collaborative MAMAB algorithm with a coordination-graph edge-based reward assignment method. In the non-stationary environment, we modify the MAMAB-based algorithms developed for the stationary environment by introducing a practical initialization method and designing new perturbed terms to adapt to the dynamic environment. Simulation results are provided to validate the effectiveness of our proposed algorithms. The effects of different parameters on caching performance are also discussed.

Index Terms:
Wireless caching, multi-armed bandit, multi-agent, coordination graph, small-cell networks, reinforcement learning.

I Introduction

In recent decades, the proliferation of mobile devices has led to a steep growth of mobile data traffic, which imposes heavy pressure on capacity-limited backhaul links in cellular networks. Since only a small number of files accounts for the majority of the data traffic, caching popular files at small base stations (SBSs) has been widely adopted to reduce traffic congestion and alleviate backhaul pressure [2, 3]. Many previous works have studied the cache placement problem with various cache strategies and performance metrics. In [4, 5, 6], the authors focus on deterministic caching, in which SBSs cache files entirely without partitioning. The works [7, 8, 9] consider probabilistic caching, in which SBSs cache files according to certain well-designed probabilities. The authors in [10, 11, 12] consider coded caching, in which files are partitioned into multiple segments and the coded segments are cached at SBSs. However, these works on wireless caching rely on the strong assumption that file popularity or user preference is perfectly known in advance, which is generally unrealistic. Therefore, caching design without knowledge of file popularity or user preference is a necessary but challenging problem. The aim of this work is to address this issue by investigating the caching design when user preference is unknown.

Several works have studied the wireless caching problem without knowledge of file popularity or user preference. Some of them tackle this problem by first estimating file popularity (or the number of requests) and then optimizing the cache strategy based on the estimate [13, 14, 15, 16, 17, 18, 19, 20]. The authors in [15, 16] estimate the file popularity by using multi-armed bandit (MAB) methods, and the SBS then caches the most popular files in the single-SBS scenario. In [17], the authors utilize MAB to estimate the local popularity and then optimize coded caching in the multi-SBS scenario. The work [18] also utilizes MAB to estimate the local popularity of each SBS and then optimizes the cache strategies to minimize the total cost of multiple SBSs with centralized and distributed algorithms. In [20], the authors first estimate the local and global popularity, and then utilize Q-learning to choose cache actions with the local popularity, global popularity and cache strategy as the state.

Note that all these works [15, 16, 17, 13, 18, 14, 19, 20] focus only on file popularity (local or global) or on the requests collected from multiple users in a cell, since they either assume that different users have the same preference or adopt cache hit as the performance metric. However, different users may have different preferences in practice. Moreover, adjacent SBSs need to design their cache strategies collaboratively, as they can communicate with a common set of users. If we consider file popularity aggregated over a set of users rather than user preference, SBSs cannot know whether requests come from users covered by only one SBS or from users covered by multiple SBSs, and hence they cannot design the cache strategy collaboratively. Thus, some recent works [21, 22, 23, 24] focus on caching strategy design based on user preference rather than file popularity when user preference is unknown. The work [21] utilizes contextual MAB to learn the context-specific content popularity online and takes the dependence of user preference on user context into account. With the estimated context-specific content popularity, SBSs cache files to maximize the accumulated cache hits. In [22], the authors propose an online content popularity prediction algorithm by leveraging content features and user preference, and then propose two learning-based edge caching architectures to design the caching policy. Although the authors in [21, 22] consider the multi-SBS scenario, the correlations among SBSs are not considered. In [23], the authors propose a new Bayesian learning method to predict personal preference and estimate the individual content request probability; based on this estimate, a reinforcement learning algorithm is proposed to optimize deterministic caching by maximizing the system throughput. The work [24] first utilizes the expectation maximization method to estimate user preference and activity level based on the probabilistic latent semantic analysis model. A low-complexity algorithm is then developed to optimize the D2D cache strategies by maximizing the offloading probability based on the estimated user preference and activity level. However, a training phase is required in [23, 24], which cannot adapt well to a time-varying environment compared with online methods. Since the context information of users is private and the historical number of requests of a single user is often very limited, online estimation of user preference is very difficult in practice. Therefore, some works make caching decisions directly, rather than estimating user preference first and then optimizing the cache, to minimize the long-term cost or maximize the long-term reward by using reinforcement learning [25, 26] and deep reinforcement learning [27]. However, these works [25, 26, 27] only focus on single-cell performance and do not involve coordination among multiple SBSs. To the best of our knowledge, online collaborative caching without prior knowledge of user preference at the user level has not been well studied in the literature.

In this paper, we aim to optimize collaborative caching among multiple SBSs by minimizing the accumulated transmission delay over a finite time horizon in cache-enabled wireless networks when user preference is unknown. The cache strategies need to be optimized collaboratively since each user can be covered by multiple SBSs and experiences different delays when served by different SBSs. Thus, SBSs need to learn and decide the cache strategies collaboratively at each time slot over a finite time horizon. Since earlier caching decisions shape the knowledge of caching rewards on which future actions depend, this is fundamentally a sequential multi-agent decision making problem. Rather than estimating user preference first and then optimizing the cache strategy, we learn the cache strategies directly and online at the SBSs by utilizing multi-agent MAB (MAMAB). This is because estimating the preference of each individual user is difficult due to the limited number of requests from a single user. Besides, contextual information of users, such as age and gender, may not be usable for estimating user preference due to privacy concerns. On the other hand, the caching reward in the MAMAB framework considered in our work is defined as the transmission delay reduction compared to the case without caching. It captures the aggregate effect of all user requests and hence can be learned more accurately and quickly. Our prior conference paper [1] considered the distributed and edge-based collaborative MAMAB algorithms only in the stationary environment, without theoretical performance guarantees. The main contributions and results of this journal version are summarized as follows.

  • We formulate the collaborative caching optimization problem to minimize the accumulated transmission delay over a finite time horizon in cache-enabled wireless networks without the knowledge of user preference. By defining the reward of caching, we first model this sequential decision making problem in a MAMAB perspective, which is a classical reinforcement learning framework. We then solve this MAMAB problem in both stationary and non-stationary environment.

  • In the stationary environment, we first propose two high-complexity agent-based collaborative MAMAB algorithms with the performance guarantee that the regret is bounded by $\mathcal{O}(\log(T_{\text{total}}))$, where $T_{\text{total}}$ is the total number of time slots. To reduce the computational complexity, we also propose a distributed MAMAB algorithm, which ignores SBS coordination. By taking both SBS coordination and computational complexity into account, we propose an edge-based collaborative MAMAB algorithm based on a coordination-graph edge-based reward assignment method. We also modify the above MAMAB algorithms by designing new perturbed terms to achieve better performance.

  • In the non-stationary environment, we modify the MAMAB algorithms proposed in the stationary environment by proposing a practical initialization method and designing new perturbed terms in order to adapt to the dynamic environment.

  • Simulation results are provided to demonstrate the effectiveness of our proposed algorithms in both the stationary environment (synthesized data) and the non-stationary environment (MovieLens data set) by comparing with several benchmark algorithms. The effects of communication distance, cache size and user mobility on caching performance are also discussed.

The rest of the paper is organized as follows. The system model is introduced in Section II. We formulate the problem in Section III. In Section IV and Section V, we solve this problem in the stationary and non-stationary environments, respectively. We then present the simulation results in Section VI. Finally, we conclude the paper in Section VII.

Notations: This paper uses bold-face lower-case $\mathbf{h}$ for vectors and bold-face upper-case $\mathbf{H}$ for matrices. $\mathbf{0}_{m\times 1}$ denotes the $m\times 1$ zero vector. $\mathbb{I}\{X\}$ denotes the indicator operator, with $\mathbb{I}\{X\}=1$ if the event $X$ is true and $\mathbb{I}\{X\}=0$ otherwise.

II System Model

II-A Network Model

We consider a SCN where cache-enabled SBSs and users are located in a finite region, as illustrated in Fig. 1. The sets of users and SBSs are denoted as $\mathcal{U}=\{1,2,\ldots,U\}$ and $\mathcal{M}=\{1,2,\ldots,M\}$, respectively. We assume that SBSs have a limited communication distance $l_{c}$, which means that SBS $m$ can communicate with user $u$ if the distance $l_{m,u}$ between them does not exceed $l_{c}$. Thus, we define the neighbor SBSs of user $u$ as the set of SBSs that can communicate with user $u$, denoted as $\mathcal{M}_{u}=\{m\in\mathcal{M}\mid l_{m,u}\leq l_{c}\}$. We sort the distances $l_{m,u}$ for $m\in\mathcal{M}_{u}$ in increasing order such that $j_{u}$ denotes the index of the $j$-th nearest SBS to user $u$. Similarly, the set of neighbor users of SBS $m$ is denoted as $\mathcal{U}_{m}=\{u\in\mathcal{U}\mid l_{m,u}\leq l_{c}\}$.

Figure 1: System model and its corresponding coordination graph.

II-B Cache Model

There is a file library $\mathcal{F}$ of size $|\mathcal{F}|$, and all files are assumed to have the same normalized size of 1. (For the unequal-size case, we can divide each file into chunks of equal size, so the analysis and algorithms in this paper still apply.) Each SBS is equipped with a local cache of size $S<|\mathcal{F}|$. Define a cache matrix $\mathbf{A}^{t}\in\{0,1\}^{M\times|\mathcal{F}|}$ to denote the cache strategy of the SBSs at time slot $t$. The cache matrix must satisfy the cache size constraint $\sum_{f=1}^{|\mathcal{F}|}a_{m,f}^{t}\leq S$ for all $m\in\mathcal{M}$ at each time slot.

II-C Service Model

We adopt the total transmission delay as the performance metric. When SBS $m$ serves its neighbor user $u$ by transmitting a file of unit size, the user experiences a transmission delay $d_{m,u}$. We model the transmission delay in a noise-limited network, so interference can be neglected. Considering large-scale fading only, the signal-to-noise ratio (SNR) between user $u$ and SBS $m$ is given by $\text{SNR}_{m,u}=\frac{Pl_{m,u}^{-\alpha}}{\sigma_{N}^{2}}$, where $P$ is the transmission power of the SBSs, $\alpha>2$ is the path loss exponent and $\sigma_{N}^{2}$ is the Gaussian white noise power. The transmission delay between user $u$ and its neighbor SBS $m\in\mathcal{M}_{u}$ for transmitting a file of unit size is $d_{m,u}=\frac{1}{W\log_{2}(1+\text{SNR}_{m,u})}$, where $W$ is the bandwidth. Note that users experience a larger transmission delay if they are served by farther SBSs. For non-neighbor SBSs $m\notin\mathcal{M}_{u}$, the transmission delay is $d_{m,u}=\infty$ since no communication link can be established between them.
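For concreteness, the following minimal Python sketch computes the per-unit-size delay $d_{m,u}$ from the SBS-user distance according to the SNR model above. The default parameter values are illustrative placeholders; the values actually used in simulations are listed in Section VI.

import math

def transmission_delay(distance, P=1.0, alpha=4.0, noise_power=1.0, W=10e6):
    """Per-unit-size delay d_{m,u} = 1 / (W * log2(1 + SNR_{m,u})),
    with SNR_{m,u} = P * l_{m,u}^{-alpha} / sigma_N^2."""
    snr = P * distance ** (-alpha) / noise_power
    return 1.0 / (W * math.log2(1.0 + snr))

# Example: a user 20 m away experiences a smaller delay than one 40 m away.
print(transmission_delay(20.0), transmission_delay(40.0))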

Each user requests files according to its own preference. The requests of user $u$ at time slot $t$ are denoted as a set $\mathcal{Q}_{u}^{t}$ with $|\mathcal{Q}_{u}^{t}|\in\mathbb{N}$. This means that a user $u$ can request any number of files in a time slot, since a time slot spans several hours or even a whole day in practice. Therefore, SBSs can satisfy multiple requests in a time slot. A user $u$ requesting file $f$ at a given time instance is served by the nearest neighbor SBS that caches file $f$. If none of its neighbor SBSs caches file $f$, the core network satisfies the request, which induces a much larger transmission delay $d_{0}$ compared with the delay when requests are served by SBSs locally, i.e., $d_{0}\gg d_{m,u}$ for $m\in\mathcal{M}_{u}$. Note that when $l_{c}$ is large enough, all users can be connected to all SBSs in $\mathcal{M}$; this fully connected network is a special case of our problem. All parameters are summarized in Table I.

TABLE I: Notation Table
Notation Definition
$\mathcal{M}$ Set of SBSs
$\mathcal{U}$ Set of users
$\mathcal{F}$ Set of files
$\mathcal{M}_{u}$ Set of neighbor SBSs of user $u$
$\mathcal{U}_{m}$ Set of neighbor users of SBS $m$
$\mathbf{A}$ Cache matrix
$\mathcal{Q}_{u}$ Set of requests of user $u$
$S$ Cache size
$j_{u}$ Index of the $j$-th nearest SBS of user $u$
$l_{c}$ Communication distance
$d_{m,u}$ Transmission delay between SBS $m$ and user $u$
$d_{0}$ Transmission delay of the core network

III Problem Formulation

Our goal is to design the cache strategies $\mathbf{A}^{t}$ online to minimize the accumulated transmission delay over a time horizon $T_{\text{total}}$ when user preference is unknown. In this section, we model this cache optimization problem as a sequential multi-agent decision making problem and reformulate it from a MAMAB perspective.

A user $u$ that requests file $f$ at time slot $t$ can be served either by one of its neighbor SBSs or by the core network. If at least one neighbor SBS caches the requested file, the user is served by the nearest SBS that caches it; in this case, the transmission delay depends only on that nearest caching SBS, regardless of the farther SBSs. If none of the neighbor SBSs caches the requested file, i.e., $a_{m,f}^{t}=0$ for all $m\in\mathcal{M}_{u}$, the user is served by the core network and experiences a much larger delay $d_{0}$. Since we focus on the transmission delay from the user perspective, the total transmission delay of the network at time slot $t$ is the sum of the individual delays of all users, which is given by:

D_{t}=\sum_{u=1}^{U}\sum_{f=1}^{|\mathcal{F}|}\mathbb{I}\{f\in\mathcal{Q}_{u}^{t}\}\bigg[\sum_{j=1}^{|\mathcal{M}_{u}|}a_{j_{u},f}^{t}d_{j_{u},u}\prod_{n=1}^{j-1}(1-a_{n_{u},f}^{t})+\prod_{m\in\mathcal{M}_{u}}(1-a_{m,f}^{t})d_{0}\bigg]. (1)
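As an illustration, a direct implementation of (1) is sketched below. The data structures (request sets, sorted neighbor lists, cache matrix and delay table) are hypothetical names introduced only for this example.

def total_delay(requests, neighbors, A, d, d0):
    """Total transmission delay D_t in (1).

    requests[u]  : set of files requested by user u at slot t (Q_u^t)
    neighbors[u] : neighbor SBSs of user u, sorted by increasing distance
    A[m][f]      : 1 if SBS m caches file f, else 0
    d[(m, u)]    : unit-size delay between SBS m and user u
    d0           : delay when a request is served by the core network
    """
    D = 0.0
    for u, files in requests.items():
        for f in files:
            # served by the nearest neighbor SBS that caches f, if any
            for m in neighbors[u]:
                if A[m][f] == 1:
                    D += d[(m, u)]
                    break
            else:
                D += d0  # no neighbor SBS caches f: served by the core network
    return D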

We aim to minimize the accumulated transmission delay of the network over the time horizon $T_{\text{total}}$ by learning and optimizing the cache strategy at each time slot $t$ when the requests $\mathcal{Q}_{u}^{t}$ of all users are not known in advance. The optimization problem is formulated as:

P1: \min_{\{\mathbf{A}^{t}\}} \sum_{t=1}^{T_{\text{total}}}D_{t} (2a)
s.t. \sum_{f=1}^{|\mathcal{F}|}a_{m,f}^{t}\leq S, \forall m\in\mathcal{M} and t=1,2,\ldots,T_{\text{total}}, (2b)
\mathbf{A}^{t}\in\{0,1\}^{M\times|\mathcal{F}|}, t=1,2,\ldots,T_{\text{total}}. (2c)

Note that the optimization problem P1 is a classical caching optimization problem and has been solved when the user requests $\mathcal{Q}_{u}$ and the transmission delays $d_{m,u}$ between SBSs and users are perfectly known in advance, which is very difficult to satisfy in practice. Most traditional learning-based caching works estimate the file popularity first and then optimize the cache strategy, since they adopt the cache hit rate as the performance metric and assume that different users have the same preference. For optimization problem P1, however, estimating the requests of a single user online is much more difficult than estimating file popularity, since the number of requests of a single user is much smaller than the number of requests from multiple users in a cell. Besides, users' contextual information, such as age and gender, is private, so we cannot use it to group users and learn preferences from a group of users as the work [21] does. (To the best of our knowledge, there is no work that estimates user preference $p_{u,f}$ online; references [23, 24] obtain accurate estimates of user preference in a training phase, which cannot adapt well to a time-varying environment.) Therefore, the conventional predict-then-optimize approach cannot be applied to this problem. Besides, the optimization problem P1 is NP-complete [5] even when the user requests and the transmission delays between SBSs and users are perfectly known. The greedy algorithm proposed in [5] to solve P1 has a high computational complexity, since it exhaustively searches all feasible file-SBS pairs and chooses the best one at each step, repeating this step until the caches of all SBSs are full. Therefore, instead of estimating the user requests first and then optimizing the cache strategy, we would like to learn the cache strategy directly with a low-complexity algorithm that makes sequential caching decisions. Reinforcement learning is an effective tool for solving sequential decision making problems. Rather than applying other reinforcement learning algorithms, such as Q-learning, we model this multi-agent sequential decision making problem from a MAMAB perspective, since MAB represents a simple class of online learning models with fewer assumptions about the environment. Furthermore, we shall show later in Lemma 2 that the performance of the proposed MAMAB algorithms can be theoretically guaranteed.

Before reformulating P1 as a MAMAB problem, we define the reward of caching as the transmission delay reduction compared to the case without caching. Note that SBSs can observe the transmission delay when they serve their neighbor users with cached files. Besides, all SBSs know the transmission delay $d_{0}$ when users are served by the core network. Therefore, SBSs can observe the transmission delay and calculate the reward of caching files whenever they serve their neighbor users. The reward of SBS $m$ for caching file $f$ at time slot $t$ is given by:

r_{m,f}^{t}=a_{m,f}^{t}\sum_{u\in\mathcal{U}_{m}}\mathbb{I}\{f\in\mathcal{Q}_{u}^{t}\}\left(d_{0}-d_{m,u}\right)\sum_{j=1}^{|\mathcal{M}_{u}|}\mathbb{I}\{m=j_{u}\}\prod_{n=1}^{j-1}(1-a_{n_{u},f}^{t}). (3)
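The sketch below evaluates (3) for one SBS-file pair, using the same hypothetical data structures as in the previous example; it counts the delay reduction only for users whose nearest caching neighbor SBS is $m$.

def caching_reward(m, f, requests, neighbors, A, d, d0):
    """Reward r_{m,f}^t in (3): delay reduction SBS m earns by caching file f,
    counted only for requests that m actually serves."""
    if A[m][f] == 0:
        return 0.0
    reward = 0.0
    for u, files in requests.items():
        if f not in files or m not in neighbors[u]:
            continue
        # m serves u only if no nearer neighbor SBS of u caches f
        nearer = neighbors[u][:neighbors[u].index(m)]
        if all(A[n][f] == 0 for n in nearer):
            reward += d0 - d[(m, u)]
    return reward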

By defining the reward, we can reformulate P1 without loss of optimality as:

P2: \max_{\{\mathbf{A}^{t}\}} r_{\text{total}}=\sum_{t=1}^{T_{\text{total}}}\sum_{m=1}^{M}\sum_{f=1}^{|\mathcal{F}|}r_{m,f}^{t} (4)
s.t. (2b), (2c).

Optimization problem P2 is a MAMAB problem, where each SBS is seen as an agent and each file is seen as an arm. The SBSs decide the cache strategy based on their cumulative knowledge to maximize the accumulated reward, with unknown distributions of the rewards for caching different files. We can solve P2 by estimating the reward $r_{m,f}^{t}$ and then designing the cache strategy at each time slot. In fact, P2 can be divided into $T_{\text{total}}$ independent sub-optimization problems, one for each time slot, each called a single-period optimization (SPO) problem. For each SPO problem, the cache strategy is optimized with the estimated reward $r_{m,f}^{t}$ based on the historical reward observations. The objective of P2 is to maximize the accumulated reward over the time horizon based on the cumulative knowledge. Therefore, there exists a tradeoff between exploration (i.e., caching all files a sufficient number of times to estimate the rewards more accurately) and exploitation (i.e., caching files to maximize the estimated reward).

For this MAMAB problem, the main difficulty compared to the single-agent MAB problem is that there exist collisions among different agents (SBSs): for a user requesting a file, only the nearest neighbor SBS that caches the requested file can serve the user and obtain the reward, while farther neighbor SBSs obtain no reward even if they cache the requested file. Therefore, how to make the caching decisions coordinately is challenging. In this paper, we utilize a coordination graph $G=(V,E)$ [28, 29] to represent the dependencies among SBSs, as shown in Fig. 1. In the coordination graph, each SBS $m\in V$ represents a vertex. An edge $(m,n)\in E$ indicates that SBSs $m$ and $n$ cover common users, i.e., $\mathcal{U}_{m}\cap\mathcal{U}_{n}\neq\emptyset$, which means that the cache strategies of SBSs $m$ and $n$ need to be designed coordinately. In particular, each SBS $m$ has a self edge $(m,m)$. We define the neighbors of vertex $m$ in the coordination graph as $\Gamma(m)\triangleq\{n\in\mathcal{M}\mid\mathcal{U}_{m}\cap\mathcal{U}_{n}\neq\emptyset,m\neq n\}$. As shown in (3), the reward $r_{m,f}^{t}$ of SBS $m$ for caching file $f$ depends on the joint cache action of SBS $m$ and $\Gamma(m)$, which is denoted as $\mathbf{b}_{m,f}^{t}=[a_{m,f}^{t},a_{m_{1},f}^{t},\ldots,a_{m_{\Gamma(m)},f}^{t}]$. Therefore, $r_{m,f}^{t}$ can be rewritten as $r_{m,f}^{t}(\mathbf{b}_{m,f})$. Since the user requests are unknown before the cache strategy is designed, we cannot analyze the reward $r_{m,f}^{t}(\mathbf{b}_{m,f})$ directly to solve the optimization problem. Instead, we utilize MAMAB to estimate the reward $r_{m,f}^{t}(\mathbf{b}_{m,f})$ and then decide the cache strategy directly.
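Building the coordination graph only requires the coverage sets $\mathcal{U}_{m}$. The following sketch (with hypothetical names) adds an edge between two SBSs whenever their coverage sets intersect and keeps a self edge for every SBS.

from itertools import combinations

def coordination_graph(coverage):
    """Build the coordination graph G = (V, E) from SBS coverage sets.

    coverage[m] : set of users covered by SBS m (the set U_m)
    Returns the edge set E (including self edges) and the neighborhoods Gamma(m).
    """
    V = list(coverage.keys())
    E = {(m, m) for m in V}                      # every SBS has a self edge
    gamma = {m: set() for m in V}
    for m, n in combinations(V, 2):
        if coverage[m] & coverage[n]:            # common users -> coordinate
            E.add((m, n))
            gamma[m].add(n)
            gamma[n].add(m)
    return E, gamma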

IV Cache Strategy in Stationary Environment

In this section, we aim to solve P2 in the stationary environment. We first introduce the stationary environment and give the specific form of P2. Then we solve this problem with MAMAB-based methods in both distributed and collaborative manners. Finally, we modify the MAMAB algorithms by designing new perturbed terms to achieve a better performance.

IV-A Stationary Environment

The stationary environment is described as follows. The file library $\mathcal{F}$ is a static set with fixed size $F$, and user preference is time-invariant. Denote by $p_{u,f}$ the probability that user $u$ requests file $f$, which satisfies $0\leq p_{u,f}\leq 1$ and $\sum_{f=1}^{F}p_{u,f}=1$. Each user is assumed to request one file at each time slot independently according to its own preference. Moreover, the locations of users are fixed. Thus, the transmission delays $d_{m,u}$ are time-invariant and can be observed when SBSs serve users, or computed if the communication distance is known.

In this stationary environment, the expected reward of SBS $m$ for caching file $f$ is given by:

r_{m,f}(\mathbf{b}_{m,f})=a_{m,f}\sum_{u\in\mathcal{U}_{m}}p_{u,f}\left(d_{0}-d_{m,u}\right)\sum_{j=1}^{|\mathcal{M}_{u}|}\mathbb{I}\{m=j_{u}\}\prod_{n=1}^{j-1}(1-a_{n_{u},f}). (5)

Since user preference is time-invariant, the solutions of different SPO problems of P2 are identical. Hence, we only need to focus on one SPO problem, which is given by:

P3: \max_{\mathbf{A}} R_{\text{total}}(\mathbf{A})=\sum_{m=1}^{M}\sum_{f=1}^{F}r_{m,f}(\mathbf{b}_{m,f}) (6a)
s.t. \sum_{f=1}^{F}a_{m,f}\leq S, m=1,2,\ldots,M, (6b)
a_{m,f}\in\{0,1\}, m=1,2,\ldots,M. (6c)

Then we aim to solve the SPO problem P3 by using MAMAB-based algorithms in both distributed and collaborative manners with different reward decomposition methods.

IV-B Agent-based Collaborative MAMAB

In the following, we propose two agent-based collaborative MAMAB algorithms with different reward decomposition methods from the SBS perspective and user perspective, respectively.

IV-B1 SBS perspective

From (6a), it is observed that the total reward function $R_{\text{total}}(\mathbf{A})$ is decomposed into a linear combination of local agent-dependent reward functions $r_{m,f}(\mathbf{b}_{m,f})$. The reward that SBS $m$ can obtain depends on the joint action $\mathbf{b}_{m,f}\in\{0,1\}^{1\times(\Gamma(m)+1)}$. In the initial phase $t=0$, SBSs cache files randomly so that every joint action $\mathbf{b}_{m,f}$ appears at least once. Note that the phase $t=0$ can actually span multiple time slots. The number of times that $\mathbf{b}_{m,f}$ has appeared up to time $t$ is denoted as $T_{m,f}^{t}(\mathbf{b}_{m,f})$. Besides, if $\mathbf{b}_{m,f}$ appears at time slot $t$, we update the average reward $\overline{R}_{m,f}^{t}(\mathbf{b}_{m,f})$ of SBS $m$ with joint action $\mathbf{b}_{m,f}$ as:

\overline{R}_{m,f}^{t}(\mathbf{b}_{m,f})=\frac{\left[T_{m,f}^{t}(\mathbf{b}_{m,f})-1\right]\overline{R}_{m,f}^{t-1}(\mathbf{b}_{m,f})+r_{m,f}^{t}(\mathbf{b}_{m,f})}{T_{m,f}^{t}(\mathbf{b}_{m,f})}, (7)

where $r_{m,f}^{t}(\mathbf{b}_{m,f})$ is the observed reward of SBS $m$ with joint action $\mathbf{b}_{m,f}$ at time slot $t$. Each SBS $m$ needs to share its cache strategy with its neighbors $\Gamma(m)$ in the coordination graph in order to update $T_{m,f}^{t}(\mathbf{b}_{m,f})$ and $\overline{R}_{m,f}^{t}(\mathbf{b}_{m,f})$. Note that the average reward $\overline{R}_{m,f}^{t}(\mathbf{b}_{m,f})$ is inaccurate when $T_{m,f}^{t}(\mathbf{b}_{m,f})$ is small, so using the average reward alone to decide the cache strategy is not appropriate. Therefore, we define the estimated reward by adding a perturbed term to the average reward, similar to [30], which achieves a good tradeoff between exploration and exploitation. The estimated reward $\widehat{R}_{m,f}^{t}(\mathbf{b}_{m,f})$ is updated as:

\widehat{R}_{m,f}^{t}(\mathbf{b}_{m,f})=\overline{R}_{m,f}^{t-1}(\mathbf{b}_{m,f})+B_{\text{a-coll},m}(\mathbf{b}_{m,f})\sqrt{\frac{3\log(t)}{2T^{t-1}_{m,f}(\mathbf{b}_{m,f})}}, (8)

where $B_{\text{a-coll},m}(\mathbf{b}_{m,f})$ is the upper bound on the reward of SBS $m$ with joint action $\mathbf{b}_{m,f}$:

B_{\text{a-coll},m}(\mathbf{b}_{m,f})=a_{m,f}\sum_{u\in\mathcal{U}_{m}}(d_{0}-d_{m,u})\sum_{j=1}^{|\mathcal{M}_{u}|}\mathbb{I}\{m=j_{u}\}\prod_{n=1}^{j-1}(1-a_{n_{u},f}). (9)

Note that this perturbed term corresponds to the upper confidence bound (UCB) in MAB, which is utilized in the combinatorial MAB problem [30] with a single SBS. This UCB term can also be utilized in our MAMAB problem, since the reward of each SBS for a fixed joint action has a stationary distribution in the stationary environment; the theoretical proof is given later in Lemma 2.
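A minimal sketch of the estimator behind (7)-(8) is given below for a single joint-action entry. The upper bound B is assumed to be precomputed from (9), and the class name is illustrative.

import math

class UcbRewardEstimator:
    """Running average plus UCB perturbation, as in (7)-(8), for one (m, f, b) entry."""

    def __init__(self, B):
        self.B = B          # reward upper bound for this joint action, from (9)
        self.count = 0      # T_{m,f}^t(b)
        self.mean = 0.0     # average reward \bar{R}_{m,f}^t(b)

    def observe(self, reward):
        """Update the running average (7) after the joint action appears."""
        self.count += 1
        self.mean += (reward - self.mean) / self.count

    def estimate(self, t):
        """Estimated reward (8) used when optimizing the cache at slot t."""
        if self.count == 0:
            return float("inf")   # force initial exploration
        return self.mean + self.B * math.sqrt(3 * math.log(t) / (2 * self.count))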

With the estimated reward, we can optimize the cache strategy by maximizing the total estimated reward $\sum_{m=1}^{M}\sum_{f=1}^{F}\widehat{R}_{m,f}^{t}(\mathbf{b}_{m,f})$ at each time slot $t$ subject to the cache size constraint. Note that this problem is NP-complete and the commonly used greedy algorithm has a high computational complexity [5]. Therefore, we propose a low-complexity coordinate ascent algorithm [31] to obtain the solutions. Coordinate ascent is an anytime algorithm, which makes it appropriate for real-time multi-agent systems where decisions must be made under time constraints. Given a random initial cache strategy, each SBS in turn optimizes its own cache, with the other $M-1$ SBSs held fixed, and shares its optimized cache strategy with its neighbor SBSs in the coordination graph. This step is repeated until no improvement can be made or the time budget runs out. The details of the coordinate ascent algorithm are given in Algorithm 1.

Algorithm 1 Coordinate Ascent Algorithm
1:  Initialize the cache of SBSs randomly
2:  repeat
3:     for $m=1,2,\ldots,M$ do
4:        Calculate the estimated reward of caching file $f$ at SBS $m$ for all files while the caching actions of the other $M-1$ SBSs stay the same
5:        SBS mm optimizes its own caching action by maximizing the total estimated reward
6:     end for
7:  until No SBS changes its cache strategy or decision time runs out
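A compact Python sketch of Algorithm 1 is shown below. The callback per_file_reward is a placeholder for the decomposed estimated rewards (e.g., (8), (12) or (20)) evaluated for caching file $f$ at SBS $m$ with the other SBSs' strategies fixed; because the estimated reward is additive over files, each coordinate step simply keeps the $S$ best files.

import random

def coordinate_ascent(M, F, S, per_file_reward, max_rounds=50):
    """Sketch of Algorithm 1 (coordinate ascent).

    per_file_reward(m, f, A) : estimated reward contribution of caching file f
                               at SBS m while the other SBSs' strategies in A
                               stay fixed (a placeholder callback)
    """
    A = {m: set(random.sample(range(F), S)) for m in range(M)}   # random initial cache
    for _ in range(max_rounds):
        changed = False
        for m in range(M):
            # rank all files by their estimated reward at SBS m and keep the best S
            ranked = sorted(range(F), key=lambda f: per_file_reward(m, f, A), reverse=True)
            best = set(ranked[:S])
            if best != A[m]:
                A[m] = best      # share the updated strategy with neighbors Gamma(m)
                changed = True
        if not changed:
            break
    return A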

For the performance of Algorithm 1, we have the following lemma.

Lemma 1.

Algorithm 1 achieves at least $\frac{1}{2}$ optimality for P3, i.e., $R_{\text{total}}(\mathbf{A}^{L})\geq\frac{1}{2}R_{\text{total}}(\mathbf{A}^{*})$, where $\mathbf{A}^{L}$ is the cache strategy obtained by Algorithm 1 and $\mathbf{A}^{*}$ is the optimal cache strategy.

Proof.

See Appendix A. ∎

Since SBS $m$ with joint action $\mathbf{b}_{m,f}$ can obtain a reward for file $f$ only if $a_{m,f}=1$, the size of the effective action space is $\sum_{m=1}^{M}2^{\Gamma(m)}F$, which grows exponentially with $\Gamma(m)$. Therefore, this agent-based collaborative MAMAB from the SBS perspective has a high computational complexity and a slow learning speed.

IV-B2 User perspective

Note that from the SBS perspective, the action space of $\mathbf{b}_{m,f}$ is fixed. However, the reward distributions of SBS $m$ for some different $\mathbf{b}_{m,f}$ are independent and identical, and these actions can be combined to reduce the action space. For example, if there is only one user $u$ covered by two SBSs, the reward distributions of SBS $1_{u}$ with joint actions $(a_{1_{u},f}=1,a_{2_{u},f}=1)$ and $(a_{1_{u},f}=1,a_{2_{u},f}=0)$ are identical, since $u$ will always be served by SBS $1_{u}$ whenever $a_{1_{u},f}=1$. Therefore, we consider the reward of SBSs from the user perspective by decomposing the total reward function into a linear combination of local agent-dependent reward functions $r_{m,f}(\mathbf{c}_{m,\mathcal{V}_{m},f})$ as:

R_{\text{total}}(\mathbf{A})=\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\mathcal{V}_{m}}r_{m,f}(\mathbf{c}_{m,\mathcal{V}_{m},f}), (10)

where $\mathbf{c}_{m,\mathcal{V}_{m},f}=[a_{m,f}=1,a_{n_{1},f}=0,\ldots,a_{n_{|\mathcal{V}_{m}|},f}=0]$ is the joint action in which SBS $m$ caches file $f$ while the SBSs in $\mathcal{V}_{m}$ do not, and $\mathcal{V}_{m}\in\{\mathcal{V}^{u}_{m}\mid u\in\mathcal{U}_{m}\}$, where $\mathcal{V}^{u}_{m}\triangleq\{n\in\mathcal{M}_{u}\mid d_{n,u}<d_{m,u}\}$ is the set of SBSs that are closer to user $u$ than SBS $m$. The reason for considering only joint actions of the form $\mathbf{c}_{m,\mathcal{V}_{m},f}$ is that SBS $m$ can serve user $u$ if and only if it is the nearest SBS that caches the requested file of user $u$. In the initial phase $t=0$, SBSs cache files randomly so that each joint action $\mathbf{c}_{m,\mathcal{V}_{m},f}$ appears at least once, similar to the agent-based collaborative MAMAB from the SBS perspective. The number of times that $\mathbf{c}_{m,\mathcal{V}_{m},f}$ has appeared up to time $t$ is denoted as $T^{t}_{m,\mathcal{V}_{m},f}(\mathbf{c}_{m,\mathcal{V}_{m},f})$. Besides, if the joint action $\mathbf{c}_{m,\mathcal{V}_{m},f}$ appears at time slot $t$, we update the average reward $\overline{R}_{m,\mathcal{V}_{m},f}^{t}(\mathbf{c}_{m,\mathcal{V}_{m},f})$ as:

\overline{R}_{m,\mathcal{V}_{m},f}^{t}(\mathbf{c}_{m,\mathcal{V}_{m},f})=\frac{\left[T^{t}_{m,\mathcal{V}_{m},f}(\mathbf{c}_{m,\mathcal{V}_{m},f})-1\right]\overline{R}_{m,\mathcal{V}_{m},f}^{t-1}(\mathbf{c}_{m,\mathcal{V}_{m},f})+\sum_{u\in\mathcal{U}_{m}}\mathbb{I}\{\mathcal{V}^{u}_{m}=\mathcal{V}_{m}\}r^{u,t}_{m,f}}{T^{t}_{m,\mathcal{V}_{m},f}(\mathbf{c}_{m,\mathcal{V}_{m},f})}, (11)

where $r_{m,f}^{u,t}$ is the reward that SBS $m$ obtains by serving user $u$ with file $f$. Note that each SBS $m$ needs to share its cache strategy with its neighbors $\Gamma(m)$ in the coordination graph in order to update $T^{t}_{m,\mathcal{V}_{m},f}(\mathbf{c}_{m,\mathcal{V}_{m},f})$ and the average reward $\overline{R}_{m,\mathcal{V}_{m},f}^{t}(\mathbf{c}_{m,\mathcal{V}_{m},f})$. By adding a perturbed term similar to (8), the estimated reward with joint action $\mathbf{c}_{m,\mathcal{V}_{m},f}$ at time slot $t$ is given by:

\widehat{R}_{m,\mathcal{V}_{m},f}^{t}(\mathbf{c}_{m,\mathcal{V}_{m},f})=\overline{R}_{m,\mathcal{V}_{m},f}^{t-1}(\mathbf{c}_{m,\mathcal{V}_{m},f})+B_{\text{a-coll},m,\mathcal{V}_{m}}(\mathbf{c}_{m,\mathcal{V}_{m},f})\sqrt{\frac{3\log(t)}{2T^{t-1}_{m,\mathcal{V}_{m},f}(\mathbf{c}_{m,\mathcal{V}_{m},f})}}, (12)

where $B_{\text{a-coll},m,\mathcal{V}_{m}}(\mathbf{c}_{m,\mathcal{V}_{m},f})$ is the upper bound on the reward of SBS $m$ with joint action $\mathbf{c}_{m,\mathcal{V}_{m},f}$:

B_{\text{a-coll},m,\mathcal{V}_{m}}(\mathbf{c}_{m,\mathcal{V}_{m},f})=a_{m,f}\sum_{u\in\mathcal{U}_{m}}(d_{0}-d_{m,u})\mathbb{I}\{\mathcal{V}^{u}_{m}=\mathcal{V}_{m}\}\prod_{n\in\mathcal{V}_{m}}(1-a_{n,f}). (13)

The SBSs optimize the cache strategies to maximize the total estimated reward $\sum_{f=1}^{F}\sum_{m=1}^{M}\sum_{\mathcal{V}_{m}}\widehat{R}_{m,\mathcal{V}_{m},f}^{t}(\mathbf{c}_{m,\mathcal{V}_{m},f})$ at each time slot $t$ by utilizing Algorithm 1. Note that the effective action space of $\mathbf{c}_{m,\mathcal{V}_{m},f}$ depends on the number of users and is $\sum_{m=1}^{M}2^{\Gamma(m)}F$ in the worst case. Therefore, the commonly used agent-based collaborative MAMAB from the SBS perspective can be seen as the worst case of this new agent-based collaborative MAMAB algorithm from the user perspective. By exploiting the structure of the caching problem, the user-perspective algorithm greatly reduces the action space compared with the SBS-perspective one when the number of users is small, by combining some joint actions. However, as the number of users grows, the action space can still be very large, since it grows exponentially with $\Gamma(m)$, which causes high computational complexity and a slower learning speed.

Lemma 2.

The $(\alpha,\beta)$-approximation regret of these two agent-based collaborative MAMAB algorithms using an $(\alpha,\beta)$-approximation is on the order of $\mathcal{O}(\log(T_{\text{total}}))$.

Proof.

See Appendix B. ∎

This regret bound guarantees that the algorithm is asymptotically optimal, since $\lim_{T_{\text{total}}\rightarrow\infty}\frac{\log(T_{\text{total}})}{T_{\text{total}}}=0$. Therefore, we conclude that these two agent-based collaborative MAMAB algorithms converge to the optimal offline cache strategy in the stationary environment.

IV-C Distributed MAMAB

To avoid the curse of dimensionality, we also solve P3 in a distributed manner. In this case, each SBS is regarded as an independent learner and learns its own cache strategy without information exchange with other SBSs. Therefore, the total reward function is decomposed into a linear combination of agent-independent reward functions $r_{m,f}(a_{m,f})$ as:

R_{\text{total}}(\mathbf{A})=\sum_{m=1}^{M}\sum_{f=1}^{F}r_{m,f}(a_{m,f}), (14)

which means that each SBS completely ignores the rewards and actions of the other SBSs. Note that (14) is not equivalent to (6a). It is simply a function decomposition, called the independent learner approach [32, 29], which is commonly used in reinforcement learning. In the independent learner approach, the agents ignore the actions and rewards of the other agents and learn their strategies independently. In the initial phase $t=0$, each SBS independently caches every file once. The number of times that $a_{m,f}$ has appeared up to time $t$ is denoted as $T^{t}_{m,f}(a_{m,f})$. Besides, if $a_{m,f}$ appears at time slot $t$, the average reward $\overline{R}_{m,f}^{t}(a_{m,f})$ of SBS $m$ with caching action $a_{m,f}$ is updated as:

\overline{R}_{m,f}^{t}(a_{m,f})=\frac{\left[T^{t}_{m,f}(a_{m,f})-1\right]\overline{R}^{t-1}_{m,f}(a_{m,f})+r^{t}_{m,f}(a_{m,f})}{T^{t}_{m,f}(a_{m,f})}, (15)

where $r^{t}_{m,f}(a_{m,f})$ is the observed reward of SBS $m$ with action $a_{m,f}$ at time slot $t$. By adding a perturbed term similar to (8), the estimated reward with $a_{m,f}$ at time slot $t$ is given by:

\widehat{R}_{m,f}^{t}(a_{m,f})=\overline{R}_{m,f}^{t-1}(a_{m,f})+B_{d,m}(a_{m,f})\sqrt{\frac{3\log(t)}{2T^{t-1}_{m,f}(a_{m,f})}}, (16)

where $B_{d,m}(a_{m,f})$ is the upper bound on the reward of SBS $m$ with action $a_{m,f}$ and is given by:

B_{d,m}(a_{m,f})=a_{m,f}\sum_{u\in\mathcal{U}_{m}}(d_{0}-d_{m,u}). (17)

Each SBS optimizes its cache strategy independently by maximizing its own estimated reward $\sum_{f=1}^{F}\widehat{R}^{t}_{m,f}(a_{m,f})$ at each time slot $t$. In this case, the effective action space has size $MF$ and grows linearly with the number of SBSs, which yields low computational complexity and a fast learning speed. However, the performance is poor since there is no coordination among the SBSs.
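The per-SBS decision of the distributed MAMAB then reduces to a top-$S$ selection over the estimated rewards in (16), as in the short sketch below (illustrative names only).

def distributed_cache_decision(estimates_m, S):
    """Distributed MAMAB decision at one SBS m: cache the S files with the
    largest estimated rewards from (16), with no coordination.

    estimates_m : dict f -> estimated reward of caching f at SBS m
    """
    ranked = sorted(estimates_m, key=estimates_m.get, reverse=True)
    return set(ranked[:S])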

IV-D Edge-based Collaborative MAMAB

Based on the analysis of the above three MAMAB approaches, it is clear that a good solution should take both SBS coordination and computational complexity into account. Therefore, we propose a coordination graph edge-based reward assignment method that assigns the rewards of SBSs to the edges of the coordination graph $(V,E)$. In this case, we decompose the total reward function into a linear combination of edge-independent reward functions as:

R_{\text{total}}(\mathbf{A})=\sum_{f=1}^{F}\sum_{(m,n)\in E}r_{m,n,f}(a_{m,f},a_{n,f}), (18)

where $r_{m,n,f}(a_{m,f},a_{n,f})$ is the reward of edge $(m,n)$ with joint action $(a_{m,f},a_{n,f})$. Note that (18) and (6a) are not equivalent. It is simply a reward assignment method that divides the rewards of agents among the edges of the coordination graph, which is used in reinforcement learning to reduce complexity [29]. The main difficulty is how to design the specific coordination graph edge-based reward assignment method based on the structure of the problem.

The coordination graph edge-based reward assignment method works as follows. Suppose a user $u$ requesting file $f$ at time slot $t$ is served by its neighbor SBS $m$, and SBS $m$ obtains a reward $r_{m,f}^{u,t}$. Note that SBS $m$ knows the locations of the other $M-1$ SBSs and of its serving user $u$. If SBS $m$ is the nearest SBS of $u$, the reward $r_{m,f}^{u,t}$ obtained by serving $u$ is assigned entirely to its self edge $(m,m)$, because $u$ will always be served by $m$ in this case, regardless of whether other SBSs cache file $f$ or not. When SBS $m$ is not the nearest SBS of $u$, it serves $u$ if and only if the nearer neighbor SBSs of $u$ do not cache the requested file $f$. In this case, the reward $r_{m,f}^{u,t}$ is divided equally among the edges between the serving SBS $m$ and all the nearer neighbor SBSs $n\in\mathcal{V}^{u}_{m}$ with joint action $(a_{m,f}=1,a_{n,f}=0)$, since the reward $r_{m,f}^{u,t}$ depends on the cache actions of SBS $m$ and of all $n\in\mathcal{V}^{u}_{m}$. These split rewards on edges $(m,n)$ are stored at the serving SBS $m$, since only SBS $m$ caches the requested file $f$ and obtains the reward, while the nearer SBSs $n\in\mathcal{V}^{u}_{m}$ do not.
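The sketch below implements this assignment rule for a single observed reward, again with hypothetical data structures; the rank of the serving SBS in the user's sorted neighbor list determines whether the self edge or the edges to nearer SBSs receive the reward.

def assign_reward_to_edges(m, u, f, reward, neighbors, edge_reward):
    """Edge-based assignment of the reward SBS m earns by serving user u with file f.

    neighbors[u] : neighbor SBSs of u sorted by increasing distance
    edge_reward  : dict (m, n, f) -> accumulated reward on edge (m, n) for file f
    """
    rank = neighbors[u].index(m)          # m is the (rank+1)-th nearest SBS of u
    if rank == 0:
        # nearest SBS: the whole reward goes to the self edge (m, m)
        edge_reward[(m, m, f)] = edge_reward.get((m, m, f), 0.0) + reward
    else:
        # split equally over the edges to all nearer neighbor SBSs of u
        share = reward / rank
        for n in neighbors[u][:rank]:
            edge_reward[(m, n, f)] = edge_reward.get((m, n, f), 0.0) + share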

SBSs cache files randomly so that each joint action $(a_{m,f},a_{n,f})$ for $(m,n)\in E$ appears at least once in the initial phase $t=0$. (Note that the initial phase of the edge-based collaborative MAMAB algorithm is much shorter than that of the agent-based ones due to its smaller action space.) At each time slot $t$, SBSs update the number of times that joint action $(a_{m,f},a_{n,f})$ has appeared up to time $t$, denoted as $T^{t}_{m,n,f}(a_{m,f},a_{n,f})$. If $(a_{m,f},a_{n,f})$ appears at time slot $t$, the average reward $\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ of edge $(m,n)\in E$ with action pair $(a_{m,f},a_{n,f})$ is updated as:

\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=\frac{\left[T^{t}_{m,n,f}(a_{m,f},a_{n,f})-1\right]\overline{R}_{m,n,f}^{t-1}(a_{m,f},a_{n,f})+r_{m,n,f}^{t}(a_{m,f},a_{n,f})}{T^{t}_{m,n,f}(a_{m,f},a_{n,f})}, (19)

where $r_{m,n,f}^{t}(a_{m,f},a_{n,f})$ is the reward assigned to edge $(m,n)$ with joint action $(a_{m,f},a_{n,f})$ at time slot $t$. Note that each SBS $m$ needs to share its cache strategy with its neighbors $\Gamma(m)$ in the coordination graph in order to update $T^{t}_{m,n,f}(a_{m,f},a_{n,f})$ and the average reward $\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$. Moreover, owing to the coordination graph edge-based reward assignment method, each SBS $m$ only needs to store the reward of edge $(m,n)$ with joint action $(a_{m,f}=1,a_{n,f}=0)$ for $n\in\Gamma(m)$ and of its self edge $(m,m)$ with action $(a_{m,f}=1,a_{m,f}=1)$. By adding a perturbed term similarly, the estimated reward $\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ of edge $(m,n)$ with joint action $(a_{m,f},a_{n,f})$ at time slot $t$ is given by:

\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=\overline{R}_{m,n,f}^{t-1}(a_{m,f},a_{n,f})+B_{\text{coll},m,n}(a_{m,f},a_{n,f})\sqrt{\frac{3\log(t)}{2T^{t-1}_{m,n,f}(a_{m,f},a_{n,f})}}, (20)

where $B_{\text{coll},m,n}(a_{m,f},a_{n,f})$ is the upper bound of the reward on edge $(m,n)\in E$ with joint action $(a_{m,f},a_{n,f})$, given by:

B_{\text{coll},m,n}(a_{m,f},a_{n,f})=
\begin{cases}
a_{m,f}\sum_{u\in\mathcal{U}_{m}}\mathbb{I}\{m=1_{u}\}(d_{0}-d_{m,u}), & \text{if}~m=n,\\
a_{m,f}(1-a_{n,f})\sum_{u\in\mathcal{U}_{m}}\mathbb{I}\{n\in\mathcal{V}_{m}^{u}\}\sum_{j=2}^{|\mathcal{M}_{u}|}\mathbb{I}\{m=j_{u}\}\frac{d_{0}-d_{m,u}}{j-1}+a_{n,f}(1-a_{m,f})\sum_{u\in\mathcal{U}_{n}}\mathbb{I}\{m\in\mathcal{V}_{n}^{u}\}\sum_{j=2}^{|\mathcal{M}_{u}|}\mathbb{I}\{n=j_{u}\}\frac{d_{0}-d_{n,u}}{j-1}, & \text{if}~m\neq n.
\end{cases} (21)
Algorithm 2 Edge-based Collaborative MAMAB Caching
1:  Cache files at SBSs so that every joint action $(a_{m,f},a_{n,f})$ appears once for $(m,n)\in E$, i.e., $T^{0}_{m,n,f}(a_{m,f},a_{n,f})=1$, and then update the average reward $\overline{R}^{0}_{m,n,f}(a_{m,f},a_{n,f})$
2:  for $t=1,2,\ldots,T_{\text{total}}$ do
3:     Each SBS decides its cache strategy $\{\widetilde{a}_{m,f}\}$ with the distributed MAMAB algorithm
4:     Update the estimated reward according to (20) for all joint actions $(a_{m,f},a_{n,f})$
5:     Decide the cache strategy by maximizing the total estimated reward $\sum_{f=1}^{F}\sum_{(m,n)\in E}\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ according to Algorithm 1 with the initial cache strategy $\{\widetilde{a}_{m,f}^{t}\}$
6:     Initialize the reward $r^{t}_{m,n,f}(a_{m,f},a_{n,f})=0$ for all edges $(m,n)\in E$ and joint actions
7:     Count $T^{t}_{m,n,f}(a_{m,f},a_{n,f})$ for all joint actions
8:     for $u=1,2,\ldots,U$ do
9:        Observe the reward $r_{m,f}^{u,t}$ of the SBS $m$ that serves user $u$ with file $f$
10:        if SBS $m$ is the nearest SBS of user $u$ then
11:           Update $r_{m,m,f}^{t}(a_{m,f},a_{m,f})=r_{m,m,f}^{t}(a_{m,f},a_{m,f})+r_{m,f}^{u,t}$
12:        else
13:           Update $r_{m,n,f}^{t}(a_{m,f},a_{n,f})=r_{m,n,f}^{t}(a_{m,f},a_{n,f})+\frac{r_{m,f}^{u,t}}{j-1}$ for the edges $(m,n)$ between the $j-1$ nearer SBSs and the $j$-th nearest SBS $m$ of user $u$
14:        end if
15:     end for
16:     Update the average reward according to (19) if joint action $(a_{m,f},a_{n,f})$ appears, and set $\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=\overline{R}_{m,n,f}^{t-1}(a_{m,f},a_{n,f})$ otherwise
17:  end for

The SBSs optimize the cache strategies to maximize the total estimated reward $\sum_{f=1}^{F}\sum_{(m,n)\in E}\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ at each time slot $t$ by utilizing Algorithm 1. Specifically, we choose the cache strategy obtained from the distributed MAMAB algorithm as the initial cache strategy in Algorithm 1 to obtain better performance and a faster convergence speed. Note that the edge-based collaborative MAMAB algorithm not only accounts for SBS coordination, but also greatly reduces the computational complexity, with an effective action space of size $\sum_{m=1}^{M}\Gamma(m)+M$, which grows linearly with the number of edges. The details of this algorithm are given in Algorithm 2.

IV-E Tradeoff between Exploration and Exploitation

As discussed above, we add a perturbed term to the average reward to achieve a tradeoff between exploration and exploitation. However, there also exist simple methods that achieve a good tradeoff in MAMAB problems. One such simple and widely used method is the $\epsilon$-greedy algorithm, which caches files to maximize the total average reward with probability $1-\epsilon$ and caches files randomly with probability $\epsilon$. The $\epsilon$-greedy algorithm can likewise be applied in the different manners above with the corresponding reward decomposition methods.
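For reference, a minimal $\epsilon$-greedy decision at one SBS might look as follows (a sketch with illustrative names; in the collaborative variants the average rewards would be the decomposed ones and the maximization would use Algorithm 1):

import random

def epsilon_greedy_cache(avg_reward_m, F, S, eps=0.1):
    """epsilon-greedy decision at one SBS: exploit the average rewards with
    probability 1 - eps, otherwise cache S files uniformly at random."""
    if random.random() < eps:
        return set(random.sample(range(F), S))
    ranked = sorted(range(F), key=lambda f: avg_reward_m.get(f, 0.0), reverse=True)
    return set(ranked[:S])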

We also propose a new perturbed term that exploits the particular structure of the problem, similar to [15]: when the reward upper bound satisfies $B>0$, the perturbed term is $\sqrt{\frac{3\log(B^{2}t)}{2T^{t-1}}}$ rather than $B\sqrt{\frac{3\log(t)}{2T^{t-1}}}$; otherwise, the perturbed term is $0$. The MAMAB algorithms with the new perturbed terms are called MAMAB-v2 algorithms.

V Cache Strategy in Non-Stationary Environment

In this section, we aim to solve P2 in the non-stationary environment. We first give a brief introduction of the non-stationary environment. Then we modify our proposed MAMAB algorithms to make them adapt to the non-stationary environment. Finally, we discuss the tradeoff between exploration and exploitation.

V-A Non-stationary Environment

The non-stationary environment is described as follows. The file library $\mathcal{F}$ is a dynamic set, with new files entering and old files leaving over time. Besides, user preference changes irregularly over time and the number of requests of each user in a time slot is random. Since users can keep silent and request no file in some time slots, the set of active users that request files also changes over time. Moreover, we take user mobility into account, so user locations can change across time slots. Thus, the transmission delays $d_{m,u}$ also vary over time and cannot be known in advance. This non-stationary environment is more realistic, since real-world scenarios are actually non-stationary, but it also brings challenges because it is time-varying and more stochastic. Therefore, we need to modify the algorithms proposed for the stationary environment to adapt to this non-stationary environment. (Note that we only modify the distributed and edge-based collaborative MAMAB algorithms in this work, since the action space of the agent-based collaborative MAMAB is too large to be practical in the non-stationary environment.)

V-B Modified MAMAB

Note that the MAMAB algorithms proposed for the stationary environment rely on some prior information about the environment, such as a static file set. However, SBSs have no such knowledge in the non-stationary environment. Therefore, we define an active file set $\mathcal{F}_{\text{active}}^{t}$ at each time slot $t$, since the file library $\mathcal{F}$ is dynamic over time. $\mathcal{F}_{\text{active}}^{t}$ is the set of files that have been requested from time slot $1$ to $t$ and is updated at the end of each time slot $t$ based on the received requests. Besides, we assign a large enough initial value to all joint actions of files newly added to $\mathcal{F}_{\text{active}}^{t}$, rather than caching files randomly, to make each new joint action appear at least once. This new initialization method greatly reduces the explorations in the initial phase and improves the performance compared with the initialization used in the stationary environment. Moreover, we need to design new UCB terms accordingly, since SBSs have no information about the reward upper bound $B$.

V-B1 Modified Distributed MAMAB

From the distributed MAMAB algorithm in the stationary environment, it is observed that the average reward $\overline{R}_{m,f}^{t}(a_{m,f})=0$ if $a_{m,f}=0$. Therefore, each SBS $m$ only initializes the estimated reward $\widehat{R}_{m,f}^{t}(a_{m,f})=H$ for $a_{m,f}=1$ and all active files $f\in\mathcal{F}_{\text{active}}^{0}$, where $H$ is a large enough number, so that each active file is cached at the SBSs at least once. At each time slot $t$, each SBS $m$ updates the estimated reward of every file that has been cached at least once as:

\widehat{R}_{m,f}^{t}(a_{m,f})=\overline{R}_{m,f}^{t-1}(a_{m,f})+\sqrt{\frac{3\log\left((B^{t}_{d,m})^{2}t\right)}{2T^{t-1}_{m,f}(a_{m,f})}}, (22)

where $B^{t}_{d,m}=\max_{f\in\mathcal{F}_{\text{active}}^{t}}\overline{R}_{m,f}^{t-1}(a_{m,f})$. Compared with the distributed MAMAB in the stationary environment, we use the maximal average reward to represent the upper bound of the reward, since the true upper bound cannot be obtained and changes over time in the non-stationary environment. Each SBS then decides its cache strategy by maximizing its own estimated reward. After the caching action has been taken, each SBS $m$ updates the number of times $T^{t}_{m,f}(a_{m,f})$ that $a_{m,f}$ has appeared and the average reward according to (15) based on the observed reward, as in the distributed MAMAB for the stationary environment. Finally, the SBSs update the active file set $\mathcal{F}_{\text{active}}^{t}$ at the end of time slot $t$ based on their received requests and initialize the estimated reward $\widehat{R}_{m,f}^{t}(a_{m,f})=H$ for $a_{m,f}=1$ for newly added files.
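A sketch of the per-SBS update in (22) is given below. The data structures and the value of H are illustrative, and a small guard is added to keep the logarithm well defined when the empirical bound $B^{t}_{d,m}$ is very small, which is not part of the original expression.

import math

def modified_estimate(avg_reward_m, counts_m, active_files, t, H=1e6):
    """Estimated rewards (22) for the modified distributed MAMAB at one SBS m.

    avg_reward_m[f], counts_m[f] : running average and count for caching f at SBS m
    active_files                 : current active file set F_active^t
    H                            : large initial value for files never cached before
    """
    B = max((avg_reward_m.get(f, 0.0) for f in active_files), default=0.0)
    est = {}
    for f in active_files:
        n = counts_m.get(f, 0)
        if n == 0:
            est[f] = H    # force each newly active file to be cached at least once
        else:
            # guard keeps log(.) well defined when B is near zero (an assumption)
            est[f] = avg_reward_m[f] + math.sqrt(3 * math.log(max(B * B * t, 2)) / (2 * n))
    return est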

V-B2 Modified Edge-based Collaborative MAMAB

From Algorithm 2, it is observed that the average reward $\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=0$ if the action pair satisfies $(a_{m,f},a_{n,f})=(1,1)$ with $m\neq n$ or $(a_{m,f},a_{n,f})=(0,0)$. Therefore, we only need to focus on an edge $(m,n)\in E$ when exactly one of its endpoints caches the file and the other does not, or on a self edge, i.e., $(a_{m,f},a_{n,f})=(1,0)$, $(a_{m,f},a_{n,f})=(0,1)$ and $(a_{m,f},a_{m,f})=(1,1)$, which are called effective joint actions. We initialize the estimated reward $\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ of all effective joint actions with a large enough number $H$. At time slot $t$, the SBSs update the estimated reward of all effective joint actions $(a_{m,f},a_{n,f})$ that have appeared at least once as:

\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=\overline{R}_{m,n,f}^{t-1}(a_{m,f},a_{n,f})+\sqrt{\frac{3\log\left((B^{t}_{\text{coll},m,n})^{2}t\right)}{2T^{t-1}_{m,n,f}(a_{m,f},a_{n,f})}}, (23)

where $B^{t}_{\text{coll},m,n}=\max_{f\in\mathcal{F}_{\text{active}}^{t}}\overline{R}_{m,n,f}^{t-1}(a_{m,f},a_{n,f})$ serves as the upper bound of the reward on edge $(m,n)\in E$ with joint action $(a_{m,f},a_{n,f})$ over all active files $f$ at time slot $t$. The SBSs then decide their cache strategies collaboratively to maximize the total estimated reward according to Algorithm 1. The SBSs update $T^{t}_{m,n,f}(a_{m,f},a_{n,f})$ and the average reward $\overline{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})$ according to (19) if the joint action $(a_{m,f},a_{n,f})$ appears, based on the observed rewards that follow the coordination graph edge-based reward assignment. Finally, the SBSs update the active file set $\mathcal{F}_{\text{active}}^{t}$ at the end of time slot $t$ based on their received requests and initialize the estimated reward $\widehat{R}_{m,n,f}^{t}(a_{m,f},a_{n,f})=H$ for all effective joint actions of newly added files.

V-C Tradeoff between Exploration and Exploitation

As discussed above, adding a perturbed term to the average reward to form the estimated reward is an effective way to balance exploration and exploitation in the stationary environment. However, our proposed perturbed terms do not have a theoretical performance guarantee in the non-stationary environment. Therefore, we can also use the ϵ-greedy algorithm to balance exploration and exploitation, which is simple but effective in many scenarios. In the ϵ-greedy algorithm, SBSs cache files to maximize the total average reward with probability 1-ϵ and cache files randomly with probability ϵ. Note that the ϵ-greedy algorithm can likewise be implemented in distributed and collaborative manners, which differ only in how the average reward is updated.
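As an illustration, a minimal sketch of the distributed ϵ-greedy rule is given below (the collaborative variant would instead maximize the total average reward via Algorithm 1); the function and variable names are ours:

import random

def epsilon_greedy_cache(avg_reward, files, S, eps=0.05):
    """ϵ-greedy cache selection at one SBS (distributed manner, sketch)."""
    files = list(files)
    if random.random() < eps:
        # exploration: cache S files uniformly at random
        return set(random.sample(files, min(S, len(files))))
    # exploitation: cache the S files with the largest average reward
    ranked = sorted(files, key=lambda f: avg_reward.get(f, 0.0), reverse=True)
    return set(ranked[:S])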

VI Simulation Results

In this section, we demonstrate the performance of our proposed algorithms in both the stationary and the non-stationary environment. Simulations are performed in a square area of 100\times 100 m^2. Both SBSs and users are uniformly distributed in this plane. Unless otherwise stated, the system parameters are set as follows: bandwidth W=10 MHz, transmission power of SBSs P=1 W, Gaussian white noise power \sigma_{N}^{2}=1 W, path loss exponent \alpha=4 and core network transmission delay d_{0}=3\underset{m\in\mathcal{M},u\in\mathcal{U}}{\text{max}}~\frac{1}{W\log_{2}(1+\text{SNR}_{m,u})}.
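Under these settings, the core network delay can be computed from the per-link SNRs as in the short sketch below (the SNR dictionary and function name are hypothetical; only the formula for d_0 comes from the text):

import math

def core_network_delay(snr, W=10e6):
    """d_0 = 3 * max_{m,u} 1 / (W log2(1 + SNR_{m,u})), i.e., three times the worst link delay (sketch).

    snr maps (SBS m, user u) pairs to linear SNR values; W is the bandwidth in Hz.
    """
    worst_rate = min(W * math.log2(1.0 + s) for s in snr.values())
    return 3.0 / worst_rate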

VI-A Stationary Environment

In the stationary environment, we assume that users request files according to independent Zipf distributions. Specifically, the probability that user u requests file f is p_{u,f}=\frac{1/f_{u}^{\delta_{u}}}{\sum_{j=1}^{F}1/j^{\delta_{u}}}, where \delta_{u}\in\{0.5,0.7,0.9,1.1,1.3\} is the Zipf parameter of user u and f_{u} is the preference rank of file f at user u, i.e., file f is the f_{u}-th most preferred file of user u. Unless otherwise stated, the simulation parameters are set as follows: the number of files F=100, cache size S=10 and time horizon T_{\text{total}}=25000. We use the average transmission delay as the performance metric, defined as the ratio of the total transmission delay to the number of time slots.
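A request generator consistent with this model is sketched below (the ranking construction and the seeding are illustrative assumptions):

import numpy as np

def zipf_request(rng, rank_of_file, delta):
    """Draw one request of a user whose preference follows a Zipf distribution (sketch).

    rank_of_file[f] = f_u is the preference rank of file f for this user (1 = most preferred).
    """
    files = list(rank_of_file)
    weights = np.array([1.0 / (rank_of_file[f] ** delta) for f in files])
    p = weights / weights.sum()  # p_{u,f} = (1 / f_u^delta) / sum_j 1 / j^delta
    return rng.choice(files, p=p)

# Example: a user with delta = 0.9 whose preference ranking is a random permutation of F = 100 files
rng = np.random.default_rng(0)
ranking = {int(f): r + 1 for r, f in enumerate(rng.permutation(100))}
requested_file = zipf_request(rng, ranking, delta=0.9)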

Figure 2: The time evolution of regret when users have the same preference with M=6, U=50, l_{c}=50 m, F=100, S=10 and δ=0.9. (a) Regret comparison of MAMAB algorithms. (b) Regret comparison of MAMAB-v2 algorithms.

To evaluate the performance of the algorithms proposed in this paper, we compare them with the following benchmark algorithms:

\bullet Oracle Coordinate Ascent Algorithm: Cache files according to Algorithm 1 with 300 random initial strategies and choose the best one when user preference is known.

\bullet Oracle Greedy Algorithm [5]: SBSs cache files greedily when user preference is known.

\bullet CUCB Algorithm [17]: Each SBS estimates its local file popularity using the CUCB algorithm, and the cache strategy is then optimized in a coordinated manner with the estimated local popularity.

\bullet Least Frequently Used (LFU) Algorithm [33]: In the cache of each SBS, replace the file with the fewest requests by the requested file that is not available.

\bullet Least Recently Used (LRU) Algorithm [33]: In the cache of each SBS, replace the least recently used file by the requested file that is not available (a short sketch of both replacement policies follows this list).
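The two replacement policies are simple enough to state as a short sketch (our own implementation, assuming per-SBS bookkeeping; the cache is a set for LFU and a recency-ordered list for LRU):

def lfu_update(cache, freq, requested, S):
    """LFU: on a miss, evict the cached file with the fewest accumulated requests (sketch)."""
    freq[requested] = freq.get(requested, 0) + 1
    if requested not in cache:
        if len(cache) >= S:
            cache.remove(min(cache, key=lambda f: freq.get(f, 0)))
        cache.add(requested)

def lru_update(cache, requested, S):
    """LRU: on a miss, evict the least recently used file (sketch).

    cache is a list ordered from least to most recently used.
    """
    if requested in cache:
        cache.remove(requested)
    elif len(cache) >= S:
        cache.pop(0)
    cache.append(requested)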

In this subsection, we first show the regret of the MAMAB and MAMAB-v2 algorithms in both distributed and collaborative manners. Then, we evaluate the performance of our proposed algorithms by comparing them with the benchmark algorithms. Finally, we demonstrate the performance of our proposed algorithms with respect to different communication distances and cache sizes.

For presentation convenience, the agent-based collaborative MAMAB from the SBS perspective and from the user perspective are denoted as “Agent-based-1 C-MAMAB” and “Agent-based-2 C-MAMAB”, respectively. The edge-based collaborative MAMAB is denoted as “Edge-based C-MAMAB”. All plots are obtained by averaging over 30 independent realizations.

Time evolutions of the regret of the MAMAB and MAMAB-v2 algorithms in different manners are plotted in Fig. 2(a) and Fig. 2(b), respectively. The regret is defined as the accumulated reward difference between the oracle coordinate ascent algorithm and the corresponding MAMAB algorithm. It is observed that Agent-based-1 C-MAMAB performs worst in Fig. 2(a) due to its large action space, which slows down convergence. The Edge-based C-MAMAB outperforms the other three MAMAB algorithms since it accounts for both SBS coordination and the dimension of the action space, although it has no performance guarantee. For both the MAMAB and MAMAB-v2 algorithms, the Edge-based versions perform best and Agent-based-2 outperforms Agent-based-1, which indicates that a large action space degrades performance greatly. Comparing the MAMAB and MAMAB-v2 algorithms, we find that MAMAB-v2 notably outperforms MAMAB thanks to the newly designed perturbed term, which achieves a better tradeoff between exploration and exploitation (MAMAB-v2 algorithms also outperform ϵ-greedy; these results are not shown in this paper due to the page limit). Therefore, we only focus on the MAMAB-v2 algorithms in the following parts.

Figure 3: The time evolution of the average transmission delay with M=6, U=50, l_{c}=50 m, F=100 and S=10. (a) MAMAB-v2 algorithms with the same user preference. (b) MAMAB-v2 algorithms with different user preferences.

Fig. 3(a) and Fig. 3(b) compare the performance of the MAMAB-v2 algorithms with some benchmark algorithms. It is shown that the low-complexity oracle coordinate ascent algorithm outperforms the oracle greedy algorithm, and both can be regarded as performance upper bounds since they have full knowledge of user preference. Besides, the MAMAB-v2 algorithms outperform LRU and LFU since they concentrate on the reward that SBSs can obtain rather than on user requests. Moreover, MAMAB-v2 outperforms the CUCB algorithm since it distinguishes user preference from file popularity. MAMAB-v2 also learns faster than the predict-then-optimize CUCB method when users have the same preference. Among the MAMAB-v2 algorithms, the Edge-based C-MAMAB-v2 performs best and can almost achieve the performance of the oracle algorithms. The Agent-based-1 C-MAMAB-v2 outperforms the distributed MAMAB-v2 when users have the same preference since it considers SBS coordination. However, when users have different preferences, Agent-based-1 C-MAMAB-v2 performs almost the same as the distributed one since the correlations among SBSs are weak.

Figure 4: Average delay after 25000 time slots of MAMAB-v2 algorithms vs. l_{c} when users have different preferences with M=6, U=50, F=100 and S=10.
Figure 5: Average delay after 25000 time slots of MAMAB-v2 algorithms vs. S when users have different preferences with M=6, U=50, F=100 and l_{c}=50 m.

Fig. 4 illustrates the average transmission delay of different MAMAB-v2 algorithms after 25000 time slots versus the communication distance l_{c}. As l_{c} increases, the number of users that can be covered by multiple SBSs increases and the correlations among SBSs become stronger. Fig. 4 shows that, when l_{c} is small, both the distributed and the collaborative MAMAB-v2 algorithms can almost achieve the performance of the oracle greedy algorithm. This is because many users then have only one neighbor SBS, so almost no correlations exist among SBSs. As the communication distance l_{c} increases, the gap between the oracle greedy algorithm and the distributed MAMAB-v2 widens. This is because the distributed MAMAB-v2 totally ignores the correlations among SBSs, which become stronger as l_{c} grows. The gap between the oracle greedy algorithm and Agent-based-1 also increases with l_{c} since the performance degrades when the action space is too large. In contrast, the Agent-based-2 and Edge-based C-MAMAB-v2 perform close to the oracle greedy algorithm for all l_{c} since they take both SBS coordination and the action space into account.

Fig. 5 depicts the average transmission delay of different algorithms after 25000 time slots versus the cache size S. Note that the average transmission delay decreases as S increases since more user requests can be satisfied locally by SBSs. However, the rate of delay reduction becomes smaller as the cache size increases, which is consistent with the Zipf distribution in which a small number of popular files produce the majority of the data traffic. The Edge-based and Agent-based-2 C-MAMAB-v2 can almost achieve the performance of the oracle greedy algorithm. Comparing the Agent-based-1 C-MAMAB-v2 with the distributed MAMAB-v2, it is observed that Agent-based-1 C-MAMAB-v2 performs better when S is large since it makes better use of the cache by considering SBS coordination.

VI-B Non-stationary Environment

Figure 6: Time evolution of algorithms for M=5 and l_{c}=50 m.
Figure 7: Time evolution of algorithms for M=5, l_{c}=40 m and S=400.
Figure 8: Effects of user mobility on modified MAMAB algorithms for M=5, l_{c}=50 m and S=400.
Figure 9: Average delay of modified MAMAB algorithms after 1039 days vs. communication distance for M=5 and S=400.

In this work, we model the non-stationary environment by using the 1M data set from MovieLens [34]. This data set contains 1000209 ratings of |\mathcal{F}|=3952 movies made by U=6040 users from the year 2000 to 2003. Since a user rates a movie only after watching it, we treat each rating as a file request in this work (a minimal sketch of this conversion is given after the benchmark list below). The time horizon is divided into T_{\text{total}}=1039 time slots (days) and SBSs update their caches once every day. Unless otherwise stated, we assume that the locations of users are fixed over time slots but the set of active users changes over time (in practice users tend to request files at fixed locations, such as at home or in the office; we therefore only consider a dynamic set of active users and ignore user mobility), and the simulation parameters are set as follows: the number of SBSs M=5 and cache size S=400. We use the average transmission delay as the performance metric, defined as the ratio of the total transmission delay to the number of requests. To evaluate the performance of our proposed algorithms, we compare them with the following benchmark algorithms:

\bullet Oracle Coordinate Ascent Algorithm: Optimize the cache using Algorithm 1 at each time slot, assuming that the user requests and the transmission delays between SBSs and users are perfectly known in advance (the coordinate ascent algorithm is chosen instead of the oracle greedy algorithm used in the stationary environment because of the high computational complexity of the greedy algorithm with a large number of users and files).

\bullet Distributed/Edge-based Collaborative ϵ-Greedy Algorithm (D/C-ϵ-Greedy): SBSs cache files according to the ϵ-greedy algorithm based on the corresponding average reward rather than the estimated reward that contains the perturbed term.

\bullet Ridge Regression Algorithm: Each SBS estimates the number of requests from its neighbor users via ridge regression utilizing all request information, rather than only the information of cached files as in [19], and then caches the most popular files at each time slot.

\bullet m-CACao [21]: SBSs cache files according to the context-aware proactive caching algorithm.
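As mentioned above, each rating is mapped to one file request in a daily time slot; a minimal conversion sketch is given below (it assumes the standard MovieLens-1M 'UserID::MovieID::Rating::Timestamp' line format; the function name and the UTC day binning are our own choices):

from collections import defaultdict
from datetime import datetime, timezone

def ratings_to_requests(path="ratings.dat"):
    """Convert MovieLens-1M ratings into per-day (user, file) request lists (sketch)."""
    per_day = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            user, movie, _rating, ts = line.strip().split("::")
            day = datetime.fromtimestamp(int(ts), tz=timezone.utc).date()
            per_day[day].append((int(user), int(movie)))
    # Map calendar days to consecutive time slots t = 1, ..., T_total
    days = sorted(per_day)
    return {t + 1: per_day[d] for t, d in enumerate(days)}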

In this subsection, we first discuss the tradeoff between exploration and exploitation of the modified MAMAB algorithms in the non-stationary environment. Then we validate the effectiveness of our proposed modified MAMAB algorithms. Finally, we discuss the effects of user mobility and of the communication distance l_{c} on the modified MAMAB algorithms. For presentation convenience, our proposed modified MAMAB algorithms in the distributed and edge-based collaborative manners are denoted as “M-D-MAMAB” and “M-C-MAMAB”, respectively.

Fig. 6 illustrates the performance of our proposed modified MAMAB algorithms and the ϵ-greedy algorithms with different ϵ values. It is observed that the edge-based collaborative algorithms outperform the corresponding distributed algorithms. Comparing the three distributed algorithms and the three collaborative algorithms, we find that when the cache size is large (S=400), our proposed modified MAMAB algorithms with perturbed terms achieve a better tradeoff than the ϵ-greedy algorithms. This is because a large cache size allows SBSs to play all actions sufficiently many times, and hence our modified MAMAB algorithms quickly learn the reward of different actions accurately. However, when the cache size is small (S=100), the ϵ-greedy algorithm with ϵ=0 performs better than both the ϵ-greedy algorithm with ϵ=0.05 and the modified MAMAB algorithms in the distributed and edge-based collaborative manners. This means that when the cache size is small, more exploitation is preferable, since SBSs learn the reward more slowly with such a small cache size.

Fig. 7 compares the performance of our proposed modified MAMAB algorithms with some reference algorithms. It is observed that the oracle coordinate ascent algorithm performs best since it knows all information about user requests and the transmission delays between users and SBSs in advance. M-C-MAMAB outperforms M-D-MAMAB since it considers SBS cooperation. Besides, our proposed modified MAMAB algorithms outperform the ridge regression algorithm and m-CACao since they focus on the reward rather than on the number of requests from a set of users. Therefore, we can conclude that MAMAB is more appropriate for this sequential decision making problem than the ridge regression approach that first uses all request information to predict local popularity and then caches the most popular files. Moreover, M-C-MAMAB tackles the cooperative caching problem among SBSs better than the existing algorithms.

In Fig. 8, the effects of user mobility on the caching performance of the modified MAMAB algorithms are evaluated. It is observed that the modified MAMAB algorithms perform well regardless of whether users are static or mobile. However, the gap between M-D-MAMAB and M-C-MAMAB becomes smaller when user mobility is taken into account. This is because M-D-MAMAB ignores the coordination among SBSs and has a smaller action space, and hence it learns the reward accurately more quickly. Therefore, we can conclude that M-D-MAMAB is more robust than M-C-MAMAB when users move over time.

In Fig. 9, the performance of the modified MAMAB algorithms after 1039 days versus l_{c} is evaluated. As l_{c} increases, the average transmission delay of the modified MAMAB algorithms decreases since more users can be covered and served locally by SBSs. Comparing the modified MAMAB in the distributed and collaborative manners, they perform the same when l_{c} is small (l_{c}=10 m). When l_{c} becomes larger (l_{c}=20,30 m), M-D-MAMAB even outperforms the collaborative one. This is because some joint actions of M-C-MAMAB can only serve a small number of users and have a small reward, yet they still require sufficient exploration; SBSs thus choose these joint actions with small reward many times and lose the chance for more exploitation. As l_{c} keeps growing (l_{c}=40,50 m), however, the correlations among SBSs become stronger and the number of users covered by multiple SBSs becomes larger, so M-C-MAMAB outperforms M-D-MAMAB since it considers the coordination among SBSs and achieves a good tradeoff between exploration and exploitation.

VII Conclusion

In this work, we investigated the collaborative caching optimization problem to minimize the accumulated transmission delay without knowledge of user preference. We modeled this sequential multi-agent decision making problem from a MAMAB perspective and solved it by learning the cache strategy directly online in both stationary and non-stationary environments. In the stationary environment, we proposed multiple efficient MAMAB algorithms in both distributed and collaborative manners. For the agent-based collaborative MAMAB, we provided a strong performance guarantee by proving that its regret is bounded by \mathcal{O}(\log(T_{\text{total}})), at the cost of high computational complexity. The distributed MAMAB has low computational complexity but ignores the correlations among SBSs. To achieve a better balance between SBS coordination and computational complexity, we also proposed a coordination graph edge-based reward assignment method in the edge-based collaborative MAMAB. In the non-stationary environment, we modified the MAMAB-based algorithms proposed for the stationary environment by introducing a practical initialization method and designing new perturbed terms to better adapt to the dynamic environment. Simulation results demonstrated the effectiveness of our proposed algorithms. The effects of the communication distance and the cache size were also discussed.

Appendix A Proof of Lemma 1

\textbf{A}_{\mathcal{M}\backslash\{m\}}^{L}=[\textbf{a}^{L}_{1},\textbf{a}^{L}_{2},\ldots,\textbf{a}^{L}_{m-1},\mathbf{0}_{F\times 1},\textbf{a}^{L}_{m+1},\ldots,\textbf{a}^{L}_{M}] is the cache strategy in which SBS m caches nothing and the other M-1 SBSs cache according to \textbf{A}^{L}. \textbf{A}^{*}=[\textbf{a}^{*}_{1},\textbf{a}^{*}_{2},\ldots,\textbf{a}^{*}_{M}]^{T} is the optimal cache strategy. We use \cup to combine different cache strategies, e.g., \textbf{a}_{m}^{*}\cup\textbf{A}^{L}=[\textbf{a}^{L}_{1},\textbf{a}^{L}_{2},\ldots,\textbf{a}^{L}_{m-1},\textbf{a}^{L}_{m}\cup\textbf{a}_{m}^{*},\textbf{a}^{L}_{m+1},\ldots,\textbf{a}^{L}_{M}], which means that SBS m caches the files in both \textbf{a}_{m}^{*} and \textbf{a}_{m}^{L} while the other M-1 SBSs cache files according to \textbf{A}^{L}, and \textbf{A}^{*}\cup\textbf{A}^{L}=[\textbf{a}^{L}_{1}\cup\textbf{a}_{1}^{*},\ldots,\textbf{a}^{L}_{M}\cup\textbf{a}_{M}^{*}], which means that each SBS m caches the files in both \textbf{a}_{m}^{*} and \textbf{a}_{m}^{L}. Then, we can obtain:

\displaystyle R_{\text{total}}(\textbf{A}^{*})-R_{\text{total}}(\textbf{A}^{L})\overset{(a)}{\leq}R_{\text{total}}(\textbf{A}^{*}\cup\textbf{A}^{L})-R_{\text{total}}(\textbf{A}^{L})
\displaystyle\overset{(b)}{\leq}\sum_{m=1}^{M}\left[R_{\text{total}}(\textbf{a}_{m}^{*}\cup\textbf{A}^{L})-R_{\text{total}}(\textbf{A}^{L})\right]
\displaystyle\overset{(c)}{\leq}\sum_{m=1}^{M}\left[R_{\text{total}}(\textbf{a}_{m}^{*}\cup\textbf{A}_{\mathcal{M}\backslash\{m\}}^{L})-R_{\text{total}}(\textbf{A}_{\mathcal{M}\backslash\{m\}}^{L})\right]
\displaystyle\overset{(d)}{\leq}\sum_{m=1}^{M}\left[R_{\text{total}}(\textbf{a}_{m}^{L}\cup\textbf{A}_{\mathcal{M}\backslash\{m\}}^{L})-R_{\text{total}}(\textbf{A}_{\mathcal{M}\backslash\{m\}}^{L})\right]
\displaystyle\overset{(e)}{\leq}\sum_{m=1}^{M}\left[R_{\text{total}}(\textbf{a}_{m}^{L}\cup\textbf{A}_{\mathcal{M}\backslash\{m,m+1,\ldots,M\}}^{L})-R_{\text{total}}(\textbf{A}_{\mathcal{M}\backslash\{m,m+1,\ldots,M\}}^{L})\right]
\displaystyle=R_{\text{total}}(\textbf{A}^{L}), (24)

where step (a) follows from the fact that caching more files increases the reward. Step (b) follows from the fact that for any strategies \textbf{a}^{\prime}_{m}, \textbf{a}^{\prime}_{n} and \textbf{A}, we have R_{\text{total}}(\textbf{a}^{\prime}_{m}\cup\textbf{a}_{n}^{\prime}\cup\textbf{A})-R_{\text{total}}(\textbf{A})\leq[R_{\text{total}}(\textbf{a}_{m}^{\prime}\cup\textbf{A})-R_{\text{total}}(\textbf{A})]+[R_{\text{total}}(\textbf{a}^{\prime}_{n}\cup\textbf{A})-R_{\text{total}}(\textbf{A})], since SBSs m and n may cover common users. Steps (c) and (e) follow from R_{\text{total}}(\textbf{a}^{\prime}_{m}\cup\textbf{A})-R_{\text{total}}(\textbf{A})\leq R_{\text{total}}(\textbf{a}^{\prime}_{m}\cup\textbf{A}_{\mathcal{M}\backslash\{m\}})-R_{\text{total}}(\textbf{A}_{\mathcal{M}\backslash\{m\}}), which holds because the strategies \textbf{a}^{\prime}_{m} and \textbf{a}_{m} may contain common files. Step (d) follows from the fact that \textbf{a}_{m}^{L} is the optimal cache strategy of SBS m when the other M-1 SBSs cache according to \textbf{A}^{L}_{\mathcal{M}\backslash\{m\}}, as obtained from Algorithm 1. Therefore, we have R_{\text{total}}(\textbf{A}^{L})\geq\frac{1}{2}R_{\text{total}}(\textbf{A}^{*}), and the proof is completed.

Appendix B Proof of Lemma 2

The proofs for the two agent-based collaborative MAMAB algorithms are similar, so we only focus on the agent-based collaborative MAMAB from the SBS perspective. The reward of SBS m with joint action \textbf{b}_{m,f} is a random variable R_{m,f}(\textbf{b}_{m,f})\in\left[0,B_{\text{a-coll},m}(\textbf{b}_{m,f})\right] with expectation \mu_{m,f}(\textbf{b}_{m,f}). The optimal reward is denoted as r_{\bm{\mu}}(\textbf{A}^{*})=\text{opt}_{\bm{\mu}} with respect to the expectation vector \bm{\mu}. For the variable R_{m,f}(\textbf{b}_{m,f}), its empirical mean after the joint action \textbf{b}_{m,f} has appeared s times is \overline{R}_{m,f}^{s}(\textbf{b}_{m,f})=(\sum_{i=1}^{s}r_{m,f}^{i}(\textbf{b}_{m,f}))/s. In the t-th time slot, let F_{t} be the event that the oracle fails to produce an \alpha-approximate answer with respect to the estimated reward \widehat{R}_{m,f}^{t}(\textbf{b}_{m,f}). We have P[F_{t}]=\mathbb{E}[\mathbb{I}\{F_{t}\}]\leq 1-\beta (the coordinate ascent oracle is proven to be an (\alpha,\beta)-approximation with \alpha=\frac{1}{2} and \beta=1 in Appendix A). The set of bad cache actions is denoted as \textbf{A}_{B}=\{\textbf{A}|r_{\bm{\mu}}(\textbf{A})<\alpha\cdot\text{opt}_{\bm{\mu}}\}. Then we define \Delta_{\text{min}}=\alpha\cdot\text{opt}_{\bm{\mu}}-\underset{\textbf{A}\in\textbf{A}_{B}}{\text{max}}\{r_{\bm{\mu}}(\textbf{A})\} and \Delta_{\text{max}}=\alpha\cdot\text{opt}_{\bm{\mu}}-\underset{\textbf{A}\in\textbf{A}_{B}}{\text{min}}\{r_{\bm{\mu}}(\textbf{A})\}.

We maintain a counter N_{m,f}(\textbf{b}_{m,f}) for each joint action \textbf{b}_{m,f}, and we have N_{m,f}^{0}(\textbf{b}_{m,f})=1 since each joint action \textbf{b}_{m,f} appears once in the initial phase. At each time slot t, if a bad action \textbf{A}^{t}\in\textbf{A}_{B} is chosen, we increase the minimum counter N_{m,f}(\textbf{b}_{m,f}) among the chosen joint actions by one at this bad time slot, i.e., N^{t}_{m,f}(\textbf{b}_{m,f})=N^{t-1}_{m,f}(\textbf{b}_{m,f})+1. Therefore, we have N^{t}_{m,f}(\textbf{b}_{m,f})\leq T^{t}_{m,f}(\textbf{b}_{m,f}).

Define Q_{t}=\underset{\textbf{b}_{m,f}}{\text{max}}B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})\frac{6\log t}{(f^{-1}(\Delta_{\text{min}}))^{2}}, where f(\cdot) is the bounded smoothness function [30]. For a bad time slot t with action \textbf{A}^{t}\in\textbf{A}_{B}, we have

\displaystyle\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}N^{T_{\text{total}}}_{m,f}(\textbf{b}_{m,f})-\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}Q_{T_{\text{total}}}
\displaystyle=\sum_{t=1}^{T_{\text{total}}}\mathbb{I}\{\textbf{A}^{t}\in\textbf{A}_{B}\}+\sum_{m=1}^{M}2^{\Gamma(m)}F-\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}Q_{T_{\text{total}}}
\displaystyle\leq\sum_{t=1}^{T_{\text{total}}}\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}\mathbb{I}\{\textbf{A}^{t}\in\textbf{A}_{B},N^{t}_{m,f}(\textbf{b}_{m,f})>N_{m,f}^{t-1}(\textbf{b}_{m,f}),N_{m,f}^{t-1}(\textbf{b}_{m,f})>Q_{T_{\text{total}}}\}+M2^{M-1}F
\displaystyle\leq\sum_{t=1}^{T_{\text{total}}}\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}\mathbb{I}\{\textbf{A}^{t}\in\textbf{A}_{B},N^{t}_{m,f}(\textbf{b}_{m,f})>N_{m,f}^{t-1}(\textbf{b}_{m,f}),N_{m,f}^{t-1}(\textbf{b}_{m,f})>Q_{t}\}+M2^{M-1}F
\displaystyle=\sum_{t=1}^{T_{\text{total}}}\mathbb{I}\{\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},N_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}+M2^{M-1}F
\displaystyle\leq\sum_{t=1}^{T_{\text{total}}}\left(\mathbb{I}\{F_{t}\}+\mathbb{I}\{\neg F_{t},\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},N_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}\right)+M2^{M-1}F
\displaystyle\leq\sum_{t=1}^{T_{\text{total}}}\left(\mathbb{I}\{F_{t}\}+\mathbb{I}\{\neg F_{t},\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},T_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}\right)+M2^{M-1}F. (25)

Note that we have

\displaystyle P\bigg{[}|\overline{R}_{m,f}^{T_{m,f}^{t-1}(\textbf{b}_{m,f})}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\geq B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{{3\log t}/{2T_{m,f}^{t-1}(\textbf{b}_{m,f})}}\bigg{]}
\displaystyle=\sum_{s=1}^{t}P\left[T_{m,f}^{t-1}(\textbf{b}_{m,f})=s\right]P\left[|\overline{R}_{m,f}^{s}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\geq B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{3\log t/2s}\right]
\displaystyle\leq\sum_{s=1}^{t}P\big{[}|\overline{R}_{m,f}^{s}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\geq B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{{3\log t}/{2s}}\big{]}
\displaystyle\leq 2t^{-2}, (26)

where the last inequality comes from the Chernoff-Hoeffding Inequality.
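For completeness, the form of the Chernoff-Hoeffding bound that yields (26) can be spelled out as follows: for s i.i.d. samples bounded in [0,B_{\text{a-coll},m}(\textbf{b}_{m,f})] with mean \mu_{m,f}(\textbf{b}_{m,f}),

P\left[|\overline{R}_{m,f}^{s}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\geq B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{3\log t/2s}\right]\leq 2\exp\left(-\frac{2s}{B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})}\cdot B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})\frac{3\log t}{2s}\right)=2t^{-3},

and summing this bound over s=1,\ldots,t gives the stated 2t^{-2}.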

Define the random variable \Lambda_{m,f}^{t}(\textbf{b}_{m,f})=B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{{3\log t}/{2T_{m,f}^{t-1}(\textbf{b}_{m,f})}} and the event

E_{t}=\{\forall~\textbf{b}_{m,f},|\overline{R}_{m,f}^{T_{m,f}^{t-1}(\textbf{b}_{m,f})}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\leq\Lambda_{m,f}^{t}(\textbf{b}_{m,f})\}. We have P(\neg E_{t})\leq M2^{M}Ft^{-2}. From (8), we have \widehat{R}_{m,f}^{t}(\textbf{b}_{m,f})-\overline{R}_{m,f}^{T_{m,f}^{t-1}(\textbf{b}_{m,f})}(\textbf{b}_{m,f})=\Lambda_{m,f}^{t}(\textbf{b}_{m,f}). Thus, |\overline{R}^{T_{m,f}^{t-1}(\textbf{b}_{m,f})}(\textbf{b}_{m,f})-\mu_{m,f}(\textbf{b}_{m,f})|\leq\Lambda^{t}(\textbf{b}_{m,f}) implies that \widehat{R}_{m,f}^{t}(\textbf{b}_{m,f})\geq\mu_{m,f}(\textbf{b}_{m,f}). In other words, we have E_{t}\Longrightarrow\widehat{\textbf{r}}^{t}\geq\bm{\mu}, where \widehat{\textbf{r}}^{t} is the vector of \widehat{R}_{m,f}^{t}(\textbf{b}_{m,f}) for all \textbf{b}_{m,f}.

Let \Lambda=\underset{\textbf{b}_{m,f}}{\text{max}}B_{\text{a-coll},m}(\textbf{b}_{m,f})\sqrt{\frac{3\log t}{2Q_{t}}} and define \Lambda^{t}=\text{max}\{\Lambda_{m,f}^{t}(\textbf{b}^{t}_{m,f})|\forall~\textbf{b}^{t}_{m,f}\}. Then we have

\displaystyle E_{t}\Longrightarrow|\widehat{R}_{m,f}^{t}(\textbf{b}^{t}_{m,f})-\mu_{m,f}(\textbf{b}^{t}_{m,f})|\leq 2\Lambda^{t},\forall~\textbf{b}^{t}_{m,f}, (27)
\displaystyle\{\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},T_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}\Longrightarrow\Lambda>\Lambda^{t}. (28)

Therefore, if \{E_{t},\neg F_{t},\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},T_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\} holds at time slot t, we have

\displaystyle r_{\bm{\mu}}(\textbf{A}^{t})+f(2\Lambda)>r_{\bm{\mu}}(\textbf{A}^{t})+f(2\Lambda^{t})\geq r_{\widehat{\textbf{r}}^{t}}(\textbf{A}^{t})\geq\alpha\cdot\text{opt}_{\widehat{\textbf{r}}^{t}}\geq\alpha\cdot r_{\widehat{\textbf{r}}^{t}}(\textbf{A}^{*})\geq\alpha\cdot r_{\bm{\mu}}(\textbf{A}^{*})=\alpha\cdot\text{opt}_{\bm{\mu}}. (29)

Since f(2\Lambda)=\Delta_{\text{min}}, the inequality r_{\bm{\mu}}(\textbf{A}^{t})+f(2\Lambda)>\alpha\cdot\text{opt}_{\bm{\mu}} contradicts the definition of \Delta_{\text{min}}. Therefore, we have P[\{E_{t},\neg F_{t},\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},T_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}]=0, and we can conclude that P[\{\neg F_{t},\textbf{A}^{t}\in\textbf{A}_{B},\forall~\textbf{b}^{t}_{m,f},T_{m,f}^{t-1}(\textbf{b}^{t}_{m,f})>Q_{t}\}]\leq P[\neg E_{t}]\leq M2^{M}Ft^{-2}. Then we have

\displaystyle\mathbb{E}[\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}N^{T_{\text{total}}}_{m,f}(\textbf{b}_{m,f})]\leq\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}Q_{T_{\text{total}}}+(1-\beta)T_{\text{total}}+\sum_{t=1}^{T_{\text{total}}}M2^{M}Ft^{-2}+M2^{M-1}F
\displaystyle\leq\underset{\textbf{b}_{m,f}}{\text{max}}B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})\frac{6M2^{M-1}F\log T_{\text{total}}}{(f^{-1}(\Delta_{\text{min}}))^{2}}+M2^{M-1}F(\frac{\pi^{2}}{3}+1)+(1-\beta)T_{\text{total}}. (30)

Therefore, the regret is given by

\displaystyle Z_{\text{regret}}=\sum_{t=1}^{T_{\text{total}}}\mathbb{E}[R_{\text{total}}(A^{*})-R_{\text{total}}(A^{t})]
\displaystyle\leq T_{\text{total}}\alpha\beta\text{opt}_{\bm{\mu}}-\bigg{[}T_{\text{total}}\alpha\text{opt}_{\bm{\mu}}-\mathbb{E}[\sum_{m=1}^{M}\sum_{f=1}^{F}\sum_{\textbf{b}_{m,f}}N^{T_{\text{total}}}_{m,f}(\textbf{b}_{m,f})-M2^{M-1}F]\cdot\Delta_{\text{max}}\bigg{]}
\displaystyle\leq\bigg{[}\underset{\textbf{b}_{m,f}}{\text{max}}B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})\frac{6M2^{M-1}F\log T_{\text{total}}}{(f^{-1}(\Delta_{\text{min}}))^{2}}+\frac{M2^{M}F\pi^{2}}{6}+(1-\beta)T_{\text{total}}\bigg{]}\Delta_{\text{max}}-(1-\beta)T_{\text{total}}\alpha\text{opt}_{\bm{\mu}}
\displaystyle\leq\left[\underset{\textbf{b}_{m,f}}{\text{max}}B_{\text{a-coll},m}^{2}(\textbf{b}_{m,f})\frac{6M2^{M-1}F\log T_{\text{total}}}{(f^{-1}(\Delta_{\text{min}}))^{2}}+\frac{M2^{M}F\pi^{2}}{6}\right]\Delta_{\text{max}}. (31)

Therefore, the proof is completed. For the agent-based collaborative MAMAB from the user perspective, the proof is similar and hence is omitted here.

References

  • [1] X. Xu and M. Tao, “Collaborative multi-agent reinforcement learning of caching optimization in small-cell networks,” in IEEE Proc. Global Commun. Conf., Dec. 2018.
  • [2] X. Wang, M. Chen, T. Taleb, A. Ksentini, and V. C. Leung, “Cache in the air: exploiting content caching and delivery techniques for 5g systems,” IEEE Commun. Mag., vol. 52, no. 2, pp. 131–139, Feb. 2014.
  • [3] E. Bastug, M. Bennis, and M. Debbah, “Living on the edge: The role of proactive caching in 5g wireless networks,” IEEE Commun. Mag., vol. 52, no. 8, pp. 82–89, Aug. 2014.
  • [4] X. Peng, J.-C. Shen, J. Zhang, and K. B. Letaief, “Backhaul-aware caching placement for wireless networks,” in IEEE Proc. Global Commun. Conf., Dec. 2015, pp. 1–6.
  • [5] K. Shanmugam, N. Golrezaei, A. G. Dimakis, A. F. Molisch, and G. Caire, “Femtocaching: Wireless content delivery through distributed caching helpers,” IEEE Trans. Inf. Theory, vol. 59, no. 12, pp. 8402–8413, Dec. 2013.
  • [6] E. Bastug, M. Bennis, M. Kountouris, and M. Debbah, “Cache-enabled small cell networks: Modeling and tradeoffs,” EURASIP J. on Wireless Commun. and Netw., vol. 2015, no. 1, pp. 1–11, Feb. 2015.
  • [7] B. Blaszczyszyn and A. Giovanidis, “Optimal geographic caching in cellular networks,” in IEEE Proc. Int. Conf. Commun., Jun. 2015, pp. 3358–3363.
  • [8] S. H. Chae and W. Choi, “Caching placement in stochastic wireless caching helper networks: Channel selection diversity via caching,” IEEE Trans. Wireless Commun., vol. 15, no. 10, pp. 6626–6637, Oct. 2016.
  • [9] Y. Chen, M. Ding, J. Li, Z. Lin, G. Mao, and L. Hanzo, “Probabilistic small-cell caching: Performance analysis and optimization,” IEEE Trans. Veh. Technol., vol. 66, no. 5, pp. 4341–4354, May. 2017.
  • [10] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2856–2867, May. 2014.
  • [11] X. Xu and M. Tao, “Modeling, analysis, and optimization of coded caching in small-cell networks,” IEEE Trans. Commun., vol. 65, no. 8, pp. 3415–3428, Aug. 2017.
  • [12] A. Tang, S. Roy, and X. Wang, “Coded caching for wireless backhaul networks with unequal link rates,” IEEE Trans. Commun., vol. 66, no. 1, pp. 1–13, Jan. 2018.
  • [13] M. Leconte, G. Paschos, L. Gkatzikis, M. Draief, S. Vassilaras, and S. Chouvardas, “Placing dynamic content in caches with small population,” in IEEE Proc. Int. Conf. Comput. Commun., Apr. 2016, pp. 1–9.
  • [14] S. Li, J. Xu, M. van der Schaar, and W. Li, “Trend-aware video caching through online learning,” IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2503–2516, Dec. 2016.
  • [15] P. Blasco and D. Gündüz, “Learning-based optimization of cache content in a small cell base station,” in IEEE Proc. Int. Conf. Commun., Jun. 2014, pp. 1897–1903.
  • [16] P. Blasco and D. Gunduz, “Multi-armed bandit optimization of cache content in wireless infostation networks,” in IEEE Proc. Int. Symp. Inf. Theory, Jun. 2014, pp. 51–55.
  • [17] A. Sengupta, S. Amuru, R. Tandon, R. M. Buehrer, and T. C. Clancy, “Learning distributed caching strategies in small cell networks,” in IEEE Proc. Int. Symp. Wireless Commun. Syst., Aug. 2014, pp. 917–921.
  • [18] J. Song, M. Sheng, T. Q. Quek, C. Xu, and X. Wang, “Learning-based content caching and sharing for wireless networks,” IEEE Trans. Commun., vol. 65, no. 10, pp. 4309–4324, Oct. 2017.
  • [19] P. Yang, N. Zhang, S. Zhang, L. Yu, J. Zhang, and X. S. Shen, “Content popularity prediction towards location-aware mobile edge caching,” IEEE Trans. Multimedia, pp. 1–1, Sep. 2018.
  • [20] A. Sadeghi, F. Sheikholeslami, and G. B. Giannakis, “Optimal and scalable caching for 5g using reinforcement learning of space-time popularities,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 180–190, Feb. 2018.
  • [21] S. Müller, O. Atan, M. van der Schaar, and A. Klein, “Context-aware proactive content caching with service differentiation in wireless networks,” IEEE Tran. Wireless Commun., vol. 16, no. 2, pp. 1024–1036, Feb. 2017.
  • [22] Y. Jiang, M. Ma, M. Bennis, F. Zheng, and X. You, “User preference learning based edge caching for fog-ran,” arXiv preprint arXiv:1801.06449, 2018.
  • [23] P. Cheng, C. Ma, M. Ding, Y. Hu, Z. Lin, Y. Li, and B. Vucetic, “Localized small cell caching: A machine learning approach based on rating data,” IEEE Trans. Commun., pp. 1–1, 2018.
  • [24] B. Chen and C. Yang, “Caching policy for cache-enabled D2D communications by learning user preference,” IEEE Trans. Commun., pp. 1–1, 2018.
  • [25] S. O. Somuyiwa, D. Gündüz, and A. Gyorgy, “Reinforcement learning for proactive caching of contents with different demand probabilities,” in Proc. Int. Symp. Wireless Commun. Syst., Aug. 2018, pp. 1–6.
  • [26] K. Guo, C. Yang, and T. Liu, “Caching in base station with recommendation via Q-learning,” in IEEE Proc. Wireless Commun. Netw. Conf., Mar. 2017, pp. 1–6.
  • [27] C. Zhong, M. C. Gursoy, and S. Velipasalar, “A deep reinforcement learning-based framework for content caching,” in Proc. Annual Conf. Inf. Sci. Syst. (CISS), Mar. 2018, pp. 1–6.
  • [28] C. Guestrin, D. Koller, and R. Parr, “Multiagent planning with factored MDPs,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2002, pp. 1523–1530.
  • [29] J. R. Kok and N. Vlassis, “Collaborative multiagent reinforcement learning by payoff propagation,” J. Mach. Learn. Res., vol. 7, pp. 1789–1828, Sep. 2006.
  • [30] W. Chen, Y. Wang, and Y. Yuan, “Combinatorial multi-armed bandit: General framework and applications,” in International Conference on Machine Learning, 2013, pp. 151–159.
  • [31] N. Vlassis, R. Elhorst, and J. R. Kok, “Anytime algorithms for multiagent decision making using coordination graphs,” in Proc. Int. Conf. Syst., Man Cybern. (ICSMC), vol. 1, Oct. 2004, pp. 953–957.
  • [32] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” AAAI/IAAI, vol. 1998, no. 746-752, p. 2, 1998.
  • [33] D. Lee, J. Choi, J.-H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “LRFU: a spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE Trans. Comput., vol. 50, no. 12, pp. 1352–1361, Dec 2001.
  • [34] F. M. Harper and J. A. Konstan, “The movielens datasets: History and context,” ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 1–19, Dec. 2015.