
Linear Combinatorial Semi-Bandit with Causally Related Rewards

Behzad Nourani-Koliji1,∗, Saeed Ghoorchian2, Setareh Maghsudi1
∗Equal contribution
1University of Tübingen
2Technical University of Berlin
[email protected] [email protected] [email protected]

Abstract

In a sequential decision-making problem, having a structural dependency amongst the reward distributions associated with the arms makes it challenging to identify a subset of alternatives that guarantees the optimal collective outcome. Thus, besides the individual arms' rewards, learning the causal relations is essential to improve the decision-making strategy. To solve the two-fold learning problem described above, we develop the 'combinatorial semi-bandit framework with causally related rewards', where we model the causal relations by a directed graph in a stationary structural equation model. The nodal observation in the graph signal comprises the corresponding base arm's instantaneous reward and an additional term resulting from the causal influences of other base arms' rewards. The objective is to maximize the long-term average payoff, which is a linear function of the base arms' rewards and depends strongly on the network topology. To achieve this objective, we propose a policy that determines the causal relations by learning the network's topology and simultaneously exploits this knowledge to optimize the decision-making process. We establish a sublinear regret bound for the proposed algorithm. Numerical experiments on synthetic and real-world datasets demonstrate the superior performance of our proposed method compared to several benchmarks.

1 Introduction

In the seminal form of the Multi-Armed Bandit (MAB) problem, an agent selects an arm from a given set of arms at sequential rounds of decision-making. Upon selecting an arm, the agent receives a reward, drawn from the unknown reward distribution of that arm. The agent aims at maximizing the average reward over the gambling horizon Robbins (1952). The MAB problem portrays the exploration-exploitation dilemma, where the agent decides between accumulating immediate reward and obtaining information that might result in larger reward only in the future Maghsudi and Hossain (2016). To measure the performance of a strategy, one uses the notion of regret: the difference between the accumulated reward of the optimal policy in hindsight and that of the applied decision-making policy.

In a combinatorial semi-bandit setting Chen et al. (2013), at each round, the agent selects a subset of base arms. This subset is referred to as a super arm. She then observes the individual reward of each base arm that belongs to the selected super arm. Consequently, she accumulates the collective reward associated with the chosen super arm. The combinatorial MAB problem is challenging since the number of super arms is combinatorial in the number of base arms. Thus, conventional MAB algorithms such as Auer et al. (2002) are not appropriate for combinatorial problems as they result in suboptimal regret bounds. The aforementioned problem becomes significantly more difficult when there are causal dependencies amongst the reward distributions.

In some cases, it is possible to model the causal structure that affects the rewards Lattimore et al. (2016); exploiting the knowledge of this structure then helps to deal with the aforementioned challenges. In our paper, we develop a novel combinatorial semi-bandit framework with causally related rewards, where we rely on Structural Equation Models (SEMs) Kaplan (2008) to model the causal relations. At each time of play, we view the instantaneous rewards of the chosen base arms as a controlled stimulus to the causal system. Consequently, in our causal system, the solution to the decision-making problem is the choice of exogenous input that maximizes the collected reward. We propose a decision-making policy to solve the aforementioned problem and prove that it achieves a sublinear regret bound in time. Our framework can be used to model various real-world problems, such as network data analysis of biological networks or financial markets. We apply our framework to analyze the development of Covid-19 in Italy and show that our proposed policy is able to detect the regions that contribute the most to the spread of Covid-19 in the country.

Compared to previous works, our proposed framework does not require any prior knowledge of the structural dependencies. For example, in Tang et al. (2017), the authors exploit prior knowledge of statistical structures to learn the best combinatorial strategy; at each decision-making round, the agent receives the reward of the selected super arm and some side rewards from the selected base arms' neighbors. In Huyuk and Tekin (2019), the authors propose a Combinatorial Thompson Sampling (CTS) algorithm for combinatorial semi-bandit problems with probabilistically triggered arms; the algorithm has access to an oracle that determines the best decision at each round of play based on the already collected data. Similarly, the authors in Chen et al. (2016) study a setting where selected super arms can probabilistically trigger other unchosen base arms, and propose an Upper Confidence Bound (UCB)-based algorithm that uses an oracle to improve the decision-making process. In Yu et al. (2020), the authors formulate a combinatorial bandit problem where the agent has access to an influence diagram that represents the probabilistic dependencies in the system, and propose a Thompson sampling algorithm and its approximations to solve the formulated problem. Further, some works study the underlying structure of the problem. For example, in Toni and Frossard (2018), the authors attempt to learn the structure of a combinatorial bandit problem, but they do not assume any causal relations between rewards. Moreover, in Sen et al. (2017), the MAB framework is employed to identify the best soft intervention on a causal system, while it is assumed that the causal graph is only partially unknown.

The rest of the paper is organized as follows. In Section 2, we formulate the structured combinatorial semi-bandit problem with causally related rewards. In Section 3, we introduce our proposed algorithm, namely SEM-UCB. Section 4 includes the theoretical analysis of the regret performance of SEM-UCB. Section 5 is dedicated to numerical evaluation. Section 6 concludes the paper.

2 Problem Formulation

Let $[N]=\{1,2,\dots,N\}$ denote the set of base arms. $\mathbf{b}_t=[\mathbf{b}_t[1],\mathbf{b}_t[2],\dots,\mathbf{b}_t[N]]\in[0,1]^N$ represents the vector of instantaneous rewards of the base arms at time $t$. The instantaneous rewards of each base arm $i\in[N]$ are independent and identically distributed (i.i.d.) random variables drawn from an unknown probability distribution with mean $\boldsymbol{\beta}[i]$. We collect the mean rewards of all the base arms in the mean reward vector $\boldsymbol{\beta}=[\boldsymbol{\beta}[1],\boldsymbol{\beta}[2],\dots,\boldsymbol{\beta}[N]]$.

We consider a causally structured combinatorial semi-bandit problem where an agent sequentially selects a subset of base arms over time, referred to as a super arm. More precisely, at each time $t$, the agent selects a decision vector $\mathbf{x}_t=[\mathbf{x}_t[1],\mathbf{x}_t[2],\dots,\mathbf{x}_t[N]]\in\{0,1\}^N$. If the agent selects the base arm $i$ at time $t$, we have $\mathbf{x}_t[i]=1$; otherwise, $\mathbf{x}_t[i]=0$. The agent observes the value of $\mathbf{b}_t[i]$ at time $t$ only if $\mathbf{x}_t[i]=1$. The agent is allowed to select at most $s$ base arms at each time of play. Hence, we define the set of all feasible super arms as

$\mathcal{X}=\left\{\mathbf{x}\mid\mathbf{x}\in\{0,1\}^N \wedge \left\|\mathbf{x}\right\|_0\leq s\right\},$ (1)

where $\left\|\cdot\right\|_0$ denotes the number of non-zero elements of a vector. In our problem, the parameter $s$ is pre-determined and given to the agent.

We take advantage of a directed graph structure to model the causal relationships in the system. We consider an unknown stationary sparse Directed Acyclic Graph (DAG) $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{A})$, where $\mathcal{V}$ denotes the set of $N$ vertices, i.e., $|\mathcal{V}|=N$, $\mathcal{E}$ is the edge set, and $\mathbf{A}$ denotes the weighted adjacency matrix. By $p\leq N-1$, we denote the length of the longest path in the graph $\mathcal{G}$. We assume that the reward generating processes in the bandit setting follow an error-free Structural Equation Model (SEM) (Giannakis et al. (2018), Bazerque et al. (2013a)). The exogenous input vector and the endogenous output vector of the SEM at each time $t$ are denoted by $\mathbf{z}_t=[\mathbf{z}_t[1],\mathbf{z}_t[2],\dots,\mathbf{z}_t[N]]$ and $\mathbf{y}_t=[\mathbf{y}_t[1],\mathbf{y}_t[2],\dots,\mathbf{y}_t[N]]$, respectively. At each time $t$, the exogenous input $\mathbf{z}_t$ represents the semi-bandit feedback in the decision-making problem. Formally,

$\mathbf{z}_t=\operatorname{diag}(\mathbf{b}_t)\,\mathbf{x}_t,$ (2)

where $\operatorname{diag}(\cdot)$ represents the diagonalization of its input vector. Consequently, we define the elements of the endogenous output vector $\mathbf{y}_t$ as

$\mathbf{y}_t[i]=\sum_{j\neq i}\mathbf{A}[i,j]\,\mathbf{y}_t[j]+\mathbf{F}[i,i]\,\mathbf{z}_t[i],\quad\forall i=1,\dots,N,$ (3)

where $\mathbf{F}$ is a diagonal matrix that captures the effects of the exogenous input vector $\mathbf{z}_t$. The SEM in Equation 3 implies that the output measurement $\mathbf{y}_t[i]$ depends on the single-hop neighbor measurements in addition to the exogenous input signal $\mathbf{z}_t[i]$. In our formulation, at each time $t$, the endogenous output $\mathbf{y}_t[i]$ represents the overall reward of the corresponding base arm $i\in[N]$. Therefore, at each time $t$, the overall reward of each base arm comprises two parts: one part directly results from its instantaneous reward, while the other part reflects the effect of causal influences of other base arms' overall rewards.

In Equation 3, the overall rewards are causally related. Thus, the adjacency matrix $\mathbf{A}$ represents the causal relationships between the overall rewards; accordingly, the element $\mathbf{A}[i,j]$ denotes the causal impact of the overall reward of base arm $j$ on the overall reward of base arm $i$, and we have $\mathbf{A}[i,i]=0$, $\forall i=1,2,\dots,N$. We assume that the agent is not aware of the causal relationships between the overall rewards; hence, the adjacency matrix $\mathbf{A}$ is unknown a priori. In the following, we work with the matrix form of Equation 3, defined at time $t$ as

$\mathbf{y}_t=\mathbf{A}\mathbf{y}_t+\mathbf{F}\mathbf{z}_t.$ (4)

In Figure 1, we illustrate an exemplary network consisting of $N$ vertices and the underlying causal relations. Based on our problem formulation, the agent is able to observe both the exogenous input signal vector $\mathbf{z}_t$ and the endogenous output signal vector $\mathbf{y}_t$. As the figure shows, a causal relation does not necessarily exist between every pair of nodes.

Figure 1: An exemplary illustration of a graph consisting of $N$ vertices and their causal relations. The black directed edges represent the causal relationships amongst the vertices.

By inserting (2) in (4) and solving for $\mathbf{y}_t$, we obtain

$\mathbf{y}_t=(\mathbf{I}-\mathbf{A})^{-1}\mathbf{F}\operatorname{diag}(\mathbf{b}_t)\,\mathbf{x}_t.$ (5)

Finally, we define the payoff received by the agent upon choosing the decision vector $\mathbf{x}_t$ as

$r(\mathbf{x}_t)=\mathbf{1}^{\top}\mathbf{y}_t=\mathbf{1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\mathbf{F}\operatorname{diag}(\mathbf{b}_t)\,\mathbf{x}_t,$ (6)

where $\mathbf{1}$ is the $N$-dimensional vector of ones. Since the graph $\mathcal{G}$ is a DAG, with a proper indexing of the vertices the adjacency matrix $\mathbf{A}$ is strictly upper triangular, which guarantees that the matrix $(\mathbf{I}-\mathbf{A})$ is invertible. In our problem, since the agent directly observes the exogenous input, we assume that the effects of $\mathbf{F}$ on the exogenous inputs are already integrated into the instantaneous rewards. Therefore, to simplify the notation and without loss of generality, we assume that $\mathbf{F}=\mathbf{I}$ in the following.
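To make the reward mechanism concrete, the following minimal NumPy sketch (an illustration of Equations (2), (5), and (6), not code from the paper) generates the overall rewards and the payoff for a toy graph; it assumes $\mathbf{F}=\mathbf{I}$ as above, and the matrix, rewards, and decision vector below are hypothetical examples.

```python
import numpy as np

def sem_payoff(A, b, x):
    """Overall rewards and payoff of decision vector x under the SEM.

    A : (N, N) strictly upper-triangular adjacency matrix (DAG, F = I).
    b : (N,) instantaneous base-arm rewards at the current round.
    x : (N,) binary decision vector with at most s ones.
    """
    N = A.shape[0]
    z = b * x                              # exogenous input, Eq. (2)
    y = np.linalg.solve(np.eye(N) - A, z)  # endogenous output, Eq. (5)
    return y, y.sum()                      # overall rewards and payoff, Eq. (6)

# Toy example with N = 3: the overall reward of arm 1 boosts that of arm 0.
A = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
b = np.array([0.2, 0.9, 0.4])
x = np.array([1, 1, 0])                    # select base arms 0 and 1 (s = 2)
y, r = sem_payoff(A, b, x)                 # y = [0.65, 0.9, 0.0], r = 1.55
```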

Given a decision vector $\mathbf{x}_t\in\mathcal{X}$, the expected payoff at time $t$ is calculated as

$\mu(\mathbf{x}_t)=\mathbb{E}\left[r(\mathbf{X})\,|\,\mathbf{X}=\mathbf{x}_t\right],$ (7)

where the expectation is taken with respect to the randomness in the reward generating processes.

Ideally, the agent's goal is to maximize her total mean payoff over a time horizon $T$. Alternatively, the agent aims at minimizing the expected regret, defined as the difference between the expected accumulated payoff of an oracle that follows the optimal policy and that of the agent that follows the applied policy. Formally, the expected regret is defined as

$\mathcal{R}_T(\mathcal{X})=T\mu(\mathbf{x}^{\ast})-\sum_{t=1}^{T}\mu(\mathbf{x}_t),$ (8)

where $\mathbf{x}^{\ast}=\operatorname{argmax}_{\mathbf{x}\in\mathcal{X}}\mu(\mathbf{x})$ is the optimal decision vector, and $\mathbf{x}_t$ denotes the decision vector selected at time $t$ under the applied policy.

Remark 1.

The definition of payoff in (6) implies that we are dealing with a linear combinatorial semi-bandit problem with causally related rewards. In general, due to the randomness in the selection of the decision vector $\mathbf{x}_t$, the consecutive overall reward vectors $\mathbf{y}_t$ become non-identically distributed. In the following section, we propose an algorithm that is able to deal with such variables. This is an improvement over previous methods, such as Chen et al. (2016) and Huyuk and Tekin (2019), that cannot cope with our problem formulation, as they are specifically designed to work with i.i.d. random variables.

3 Decision-Making Strategy

In this section, we present our decision-making strategy to solve the problem described in Section 2. Our proposed policy consists of two learning components: (i) online graph learning and (ii) Upper Confidence Bound (UCB)-based reward learning. In the following, we describe each component separately and propose our algorithm, namely SEM-UCB.

3.1 Online Graph Learning

The payoff defined in (6) implies that the knowledge of $\mathbf{A}$ is necessary to select decision vectors that result in higher accumulated payoffs. Therefore, the agent aims at learning the matrix $\mathbf{A}$ to improve her decision-making process. To this end, we propose an online graph learning framework that uses the collected feedback, i.e., the collected exogenous input and endogenous output vectors, to estimate the ground truth matrix $\mathbf{A}$. In the following, we formalize the online graph learning framework.

At each time $t$, we collect the feedback up to the current time in $\mathbf{Z}_t=[\mathbf{z}_1\ \ldots\ \mathbf{z}_t]$ and $\mathbf{Y}_t=[\mathbf{y}_1\ \ldots\ \mathbf{y}_t]$. Therefore,

$\mathbf{Y}_t=\mathbf{A}\mathbf{Y}_t+\mathbf{Z}_t.$ (9)

We assume that the correct indexing of the vertices is known prior to estimating the ground truth adjacency matrix. We use the collected feedback $\mathbf{Y}_t$ and $\mathbf{Z}_t$ as the input to a parametric graph learning algorithm (Giannakis et al. (2018), Dong et al. (2019)). More precisely, we use the following optimization problem to estimate the adjacency matrix at time $t$:

$\hat{\mathbf{A}}_t=\underset{\mathbf{A}}{\operatorname{argmin}}\ \left\|\mathbf{Y}_t-\mathbf{A}\mathbf{Y}_t-\mathbf{Z}_t\right\|_2^2+\mathrm{g}(\mathbf{A})$ (10)
s.t. $\mathbf{A}[i,j]\geq 0,\quad\forall i,j\in[N],$
$\mathbf{A}[i,j]=0,\quad\forall i\geq j,$

where $\left\|\cdot\right\|_2$ represents the $L^2$-norm of matrices and $\mathrm{g}(\mathbf{A})$ is a regularization function that imposes sparsity over $\mathbf{A}$. In our numerical experiments, we work with different regularization functions to demonstrate the effectiveness of our proposed algorithm in different scenarios. As an example, we impose the sparsity property on the estimated matrix $\hat{\mathbf{A}}_t$ in (10) by defining $\mathrm{g}(\mathbf{A})=\lambda\left\|\mathbf{A}\right\|_1$, where $\left\|\cdot\right\|_1$ is the $L^1$-norm of matrices and $\lambda$ is the regularization parameter. Our choices of regularization function guarantee that the optimization problem (10) is convex.
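As a rough illustration of this graph-learning step, the sketch below solves (10) with the $L^1$ regularizer using CVXPY; the solver choice, function name, and default $\lambda$ are our own assumptions rather than the authors' implementation.

```python
import cvxpy as cp

def estimate_adjacency(Y, Z, lam=0.1):
    """Sketch of problem (10) with g(A) = lam * ||A||_1.

    Y, Z : (N, t) arrays stacking the endogenous outputs and exogenous
           inputs collected up to time t; lam is the sparsity weight.
    """
    N = Y.shape[0]
    A = cp.Variable((N, N), nonneg=True)   # enforces A[i, j] >= 0
    objective = cp.Minimize(cp.sum_squares(Y - A @ Y - Z) + lam * cp.sum(cp.abs(A)))
    # Strict upper-triangularity: A[i, j] = 0 for all i >= j.
    constraints = [A[i, j] == 0 for i in range(N) for j in range(N) if i >= j]
    cp.Problem(objective, constraints).solve()
    return A.value
```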

3.2 SEM-UCB Algorithm

We propose our decision-making policy in Algorithm 1. The key idea behind our algorithm is that it works with reward observations for each base arm, rather than payoff observations for each super arm. As the same base arm can be observed while selecting different super arms, we can use the information obtained from the selection of one super arm to improve our payoff estimates of other relevant super arms. This, combined with the fact that our algorithm simultaneously learns the causal relations, significantly improves the performance of our proposed algorithm and speeds up the learning process.

For each base arm $i$, we define the empirical average of instantaneous rewards at time $t$ as

$\hat{\boldsymbol{\beta}}_t[i]=\frac{\sum_{\tau=1}^{t}\mathbf{b}_{\tau}[i]\,\mathds{1}\{\mathbf{x}_{\tau}[i]=1\}}{\mathbf{m}_t[i]},$ (11)

where $\mathbf{m}_t[i]$ denotes the number of times that base arm $i$ is observed up to time $t$. Formally,

$\mathbf{m}_t[i]=\sum_{\tau=1}^{t}\mathds{1}\{\mathbf{x}_{\tau}[i]=1\}.$ (12)

The initialization phase of the SEM-UCB algorithm follows a specific strategy to create rich data that helps to learn the ground truth adjacency matrix. At each time $t$ during the first $N$ times of play, SEM-UCB picks column $t$ of an upper-triangular initialization matrix $\mathbf{M}\in\{0,1\}^{N\times N}$, where $\mathbf{M}$ is created as follows. All diagonal elements of $\mathbf{M}$ are equal to 1. As for column $i$: if $i\leq s$, we set all elements above the diagonal to 1; if $s+1\leq i\leq N$, we select $s-1$ elements above the diagonal uniformly at random and set them to 1. The remaining elements are set to 0.
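The construction of $\mathbf{M}$ described above can be sketched as follows (our own illustration; the helper name is hypothetical):

```python
import numpy as np

def build_init_matrix(N, s, rng=None):
    """Upper-triangular initialization matrix M; column t is the decision
    vector used at round t <= N, so every column has at most s ones."""
    rng = np.random.default_rng() if rng is None else rng
    M = np.zeros((N, N), dtype=int)
    np.fill_diagonal(M, 1)                 # every base arm is pulled at least once
    for i in range(N):                     # 0-based column index
        if i + 1 <= s:
            M[:i, i] = 1                   # columns 1..s: all entries above the diagonal
        else:
            rows = rng.choice(i, size=s - 1, replace=False)
            M[rows, i] = 1                 # s-1 above-diagonal entries, chosen at random
    return M
```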

After the initialization period, our proposed algorithm takes two steps at each time $t$ to learn the causal relationships and the expected instantaneous rewards of the base arms. First, it uses the collected feedback $\mathbf{Y}_t$ and $\mathbf{Z}_t$ and solves the optimization problem (10) to obtain the estimated adjacency matrix. It then uses the reward observations to calculate the UCB index $\mathbf{E}_t[i]$ for each base arm $i$, defined as

$\mathbf{E}_t[i]=\hat{\boldsymbol{\beta}}_t[i]+\sqrt{\frac{(s+1)\ln t}{\mathbf{m}_t[i]}}.$ (13)

Afterward, the algorithm selects a decision vector $\mathbf{x}_t$ using the current estimate of the adjacency matrix and the developed UCB indices of the base arms. Let $\mathbf{E}_t=[\mathbf{E}_t[1],\mathbf{E}_t[2],\dots,\mathbf{E}_t[N]]$. At time $t$, SEM-UCB selects $\mathbf{x}_t$ as

$\mathbf{x}_t=\underset{\mathbf{x}\in\mathcal{X}}{\operatorname{argmax}}\ \mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{E}_{t-1})\,\mathbf{x}$ (14)
s.t. $\left\|\mathbf{x}\right\|_0\leq s.$
Algorithm 1 SEM-UCB: Structural Equation Model-Upper Confidence Bound

Input: Parameter $s$, initialization matrix $\mathbf{M}$.

1:  for $t=1,\dots,N$ do
2:     Select column $t$ of the initialization matrix $\mathbf{M}$ as the decision vector $\mathbf{x}_t$.
3:     Observe $\mathbf{z}_t$ and $\mathbf{y}_t$.
4:  end for
5:  for $t=N+1,\dots,T$ do
6:     Solve (10) to obtain $\hat{\mathbf{A}}_{t-1}$.
7:     Calculate $\mathbf{E}_{t-1}[i]$ using (13), $\forall i\in[N]$.
8:     Select the decision vector $\mathbf{x}_t$ that solves (14).
9:     Observe $\mathbf{z}_t$ and $\mathbf{y}_t$.
10:  end for
Remark 2.

The initialization phase of our algorithm guarantees that all the base arms are pulled at least once and that the matrix $\mathbf{M}$ is full rank. Consequently, the adjacency matrix $\mathbf{A}$ is uniquely identifiable from the collected feedback Bazerque et al. (2013a).

Remark 3.

Let $\mathbf{c}^{\top}=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{E}_{t-1})$. Since all the elements of $\mathbf{E}_{t-1}$ and $\hat{\mathbf{A}}_{t-1}$ are non-negative, we have $\mathbf{c}[i]>0$, $\forall i\in[N]$. Thus, the optimization problem (14) reduces to finding the $s$ largest elements of $\mathbf{c}$. Therefore, (14) can be solved efficiently, with the running time determined by the sorting algorithm used to order the elements of $\mathbf{c}$.
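Following Remark 3, one selection step of SEM-UCB can be sketched as below, assuming all observation counts are positive (which the initialization phase guarantees); the function name and interface are our own.

```python
import numpy as np

def select_super_arm(A_hat, beta_hat, m, t, s):
    """Compute the UCB indices, Eq. (13), and pick the decision vector of (14).

    A_hat    : current adjacency estimate (strictly upper triangular).
    beta_hat : empirical mean rewards, Eq. (11); m : counts, Eq. (12).
    t        : current round; s : maximum number of base arms to select.
    """
    N = len(beta_hat)
    E = beta_hat + np.sqrt((s + 1) * np.log(t) / m)        # UCB indices
    # c^T = 1^T (I - A_hat)^{-1} diag(E); all entries of c are positive.
    c = np.linalg.solve((np.eye(N) - A_hat).T, np.ones(N)) * E
    x = np.zeros(N, dtype=int)
    x[np.argsort(c)[-s:]] = 1              # the s largest entries of c solve (14)
    return x
```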

The computational complexity of the SEM-UCB algorithm depends on the solver used to learn the graph. For example, with the OSQP solver Stellato et al. (2020), we achieve a computational complexity of order $\mathcal{O}(N^4)$.

4 Theoretical Analysis

In this section, we prove an upper bound on the expected regret of the SEM-UCB algorithm. We use the following definitions in our regret analysis. For any decision vector $\mathbf{x}\in\mathcal{X}$, let $\Delta(\mathbf{x})=\mu(\mathbf{x}^{\ast})-\mu(\mathbf{x})$. We define $\Delta_{\max}=\max_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\Delta(\mathbf{x})$ and $\Delta_{\min}=\min_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\Delta(\mathbf{x})$. Moreover, let $\mathbf{w}_t^{\top}=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}_{t+1})$. We define $w_{\max}=\max_t\max_i\mathbf{w}_t[i]$.

The following theorem states an upper bound on the expected regret of SEM-UCB.

Theorem 1.

The expected regret of the SEM-UCB algorithm is upper bounded as

$\mathcal{R}_T(\mathcal{X})\leq\Big[\frac{4w_{\max}^2 s^2(s+1)N\ln T}{\Delta_{\min}^2}+N+\frac{\pi^2}{3}s^p N\Big]\Delta_{\max}.$ (15)
Proof.

See Section A of the supplementary material. ∎

5 Experimental Analysis

In this section, we present experimental results to provide more insight into the usefulness of learning the causal relations for improving the decision-making process. We evaluate the performance of our algorithm on synthetic and real-world datasets by comparing it to standard benchmark algorithms.

Benchmarks. 

We compare SEM-UCB with state-of-the-art combinatorial semi-bandit algorithms that do not learn the causal structure of the problem. Specifically, we compare our algorithm with the following policies: (i) CUCB Chen et al. (2016) calculates a UCB index for each base arm at each time $t$ and feeds them to an approximation oracle that outputs a super arm. (ii) DFL-CSR Tang et al. (2017) develops a UCB index for each base arm and selects a super arm at each time $t$ based on prior knowledge of a graph structure that shows the correlations among base arms. (iii) CTS Huyuk and Tekin (2019) employs Thompson sampling and uses an oracle to select a super arm at each time $t$. (iv) FTRL Zimmert et al. (2019) selects a super arm at each time $t$ based on the method of Follow-the-Regularized-Leader. To be comparable, we apply these benchmarks to the vector of overall rewards $\mathbf{y}_t$ at each time $t$. If a benchmark requires $\mathbf{y}_t$ to be in $[0,1]$, we feed the normalized version of $\mathbf{y}_t$ to the corresponding algorithm. Finally, in our experiments, we choose $s=6$, meaning that the algorithms can choose 6 base arms at each time of play.

5.1 Synthetic Dataset

Our simulation setting is as follows. We first create a graph consisting of $N=20$ nodes. The elements of the adjacency matrix $\mathbf{A}$ are drawn from a uniform distribution over $[0.4,0.7]$. The edge density of the ground truth adjacency matrix is $0.15$. At each time $t$, the vector of instantaneous rewards $\mathbf{b}_t$ is drawn from a multivariate normal distribution with support in $[0,1]^{20}$ and a spherical covariance matrix. As described in Section 2, we generate the vector of overall rewards according to the SEM in (3). We use $\mathrm{g}(\mathbf{A})=\lambda\left\|\mathbf{A}\right\|_1$ as the regularization function in (10) when estimating the adjacency matrix $\mathbf{A}$. The regularization parameter $\lambda$ is tuned by grid search over $[0.0001,1000]$. We evaluate the estimated adjacency matrix at each time $t$ by the mean squared error, defined as $\text{MSE}=\frac{1}{N^2}\left\|\mathbf{A}-\hat{\mathbf{A}}\right\|_{\text{F}}^2$, where $\left\|\cdot\right\|_{\text{F}}$ denotes the Frobenius norm.
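For concreteness, a sketch of this data-generating step and the MSE metric is given below; placing each edge via an independent Bernoulli draw is our assumption about how the stated edge density of 0.15 is realized.

```python
import numpy as np

def make_synthetic_dag(N=20, density=0.15, low=0.4, high=0.7, rng=None):
    """Sparse strictly upper-triangular DAG with uniform edge weights."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.zeros((N, N))
    rows, cols = np.triu_indices(N, k=1)       # admissible (strictly upper) slots
    mask = rng.random(rows.size) < density     # Bernoulli edge placement
    A[rows[mask], cols[mask]] = rng.uniform(low, high, mask.sum())
    return A

def mse(A, A_hat):
    """(1/N^2) * ||A - A_hat||_F^2, used to track the estimation quality."""
    return np.sum((A - A_hat) ** 2) / A.size
```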

Comparison with the benchmarks. 

We run the algorithms on the aforementioned synthetic data with $T=4000$. In Figure 2, we depict the trend of the time-averaged expected regret for each policy. As we see, SEM-UCB surpasses all other policies. This is because SEM-UCB learns the network's topology and hence has better knowledge of the causal relationships in the graph structure, unlike the other policies, which do not estimate the graph structure. Moreover, the time-averaged expected regret of SEM-UCB tends to zero, which matches our theoretical results in Section 4. Note that the benchmark policies exhibit a suboptimal regret performance as they have to deal with the non-identically distributed random variables $\mathbf{y}_t$.


Figure 2: Time-averaged expected regret of different policies.

5.2 Covid-19 Dataset

We evaluate our proposed algorithm on the Covid-19 outbreak dataset of daily new infected cases during the pandemic in different regions within Italy (https://github.com/pcm-dpc/COVID-19). The dataset fits in our framework as the daily new cases in each region result from the causal spread of Covid-19 among the regions in a country Mastakouri and Schölkopf (2020) and from region-specific characteristics Guaitoli and Pancrazi (2021). As the regions differ in their regional characteristics, such as socio-economic and geographical characteristics, each region has a specific exposure risk of Covid-19 infection. To be consistent with our terminology in Section 2, at each time (day) $t$, we use the overall reward $\mathbf{y}_t[i]$ to refer to the overall daily new cases in region $i$ and the instantaneous reward $\mathbf{b}_t[i]$ to refer to the region-specific daily new cases in region $i$. Naturally, the overall daily new cases include the region-specific daily new cases of Covid-19 infection.

Governments around the world strive to track the spread of Covid-19 and find the regions that are contributing the most to the total number of daily new cases in the country Bridgwater and Bóta (2021). By the end of this experiment, we address this critical problem and highlight that our algorithm is capable of finding the optimal candidate regions for political interventions in order to contain the spread of a contagious disease such as Covid-19.

Data preparation. 

We focus on the recorded daily new cases from 10 August to 15 October, 2020, for $N=21$ regions within Italy. The Covid-19 dataset only provides us with the overall daily new cases of each region. Hence, in order to apply our algorithm, we need to infer the distribution of region-specific daily new cases for each region. In the following, we describe this process and further pre-processing of the Covid-19 dataset.

According to Bull (2021), during the time period from 18 May to 3 June, 2020, all places for work and leisure activities were open and travelling within regions was permitted, while travelling between regions was forbidden. Consequently, during this period, there are no causal effects on the overall daily new cases of each region from other regions. In addition, according to Google mobility data Nouvellet et al. (2021), during the 4 weeks prior to 18 May, mobility was increasing within the regions while the travel ban between the regions was still imposed. Hence, we use this expanded period to estimate the underlying distributions of the region-specific daily new cases using a kernel density estimation. Finally, since the daily recorded data fluctuates noticeably, a 7-day moving average was applied to the signals.

We create the region-specific daily new cases for each region by sampling from the estimated distributions. Below, we present the results of applying our algorithm on the pre-processed Covid-19 dataset. Since the data only contains the reported overall daily new cases for a limited time period, care should be exercised in interpreting the results. However, by providing more relevant data, our proposed framework helps towards more accurate detection of the regions that contribute the most to the development of Covid-19.
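A sketch of the pre-processing steps described above, i.e., the 7-day smoothing and the kernel density estimate from which region-specific daily cases are sampled (SciPy's gaussian_kde is our choice of estimator; the function names are hypothetical):

```python
import numpy as np
from scipy.stats import gaussian_kde

def smooth_7day(raw_cases):
    """7-day moving average of one region's daily new-case series."""
    return np.convolve(np.asarray(raw_cases, dtype=float), np.ones(7) / 7, mode="valid")

def region_specific_sampler(period_cases):
    """Fit a KDE to one region's smoothed daily cases from the travel-ban
    period and return a sampler of region-specific daily new cases."""
    kde = gaussian_kde(np.asarray(period_cases, dtype=float))
    return lambda n=1: np.clip(kde.resample(n).ravel(), 0.0, None)  # no negative counts
```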

Learning the structural dependencies. 

Our algorithm learns the ground truth adjacency matrix $\mathbf{A}$ using (10). As for the choice of regularization function in (10), we employ Directed Total Variation (DTV), which is a novel application of the Graph Directed Variation (GDV) function Sardellitti et al. (2017). The DTV regularization function is defined as

$\mathrm{g}(\mathbf{A})=\lambda\sum_{i,j=1,\dots,N}\mathbf{A}[i,j]\sum_{k=1,\dots,t}\left[\mathbf{Y}[i,k]-\mathbf{Y}[j,k]\right]^{+},$ (16)
$\left[y\right]^{+}=\max\{y,0\}.$ (17)

The regularization function addresses the smoothness of the entire observations $\mathbf{Y}$ over the underlying directed graph. To be more realistic, since the causal spread of the disease might create cycles, we additionally include cyclic graphs in the search space of the optimization problem (10).
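Note that, for fixed observations $\mathbf{Y}$, the inner sum in (16) is a constant matrix, so the DTV penalty is linear in $\mathbf{A}$ and problem (10) remains convex. A NumPy sketch of evaluating (16)-(17), with names of our own choosing:

```python
import numpy as np

def dtv_penalty(A, Y, lam):
    """Directed Total Variation regularizer of Eqs. (16)-(17).

    A : (N, N) nonnegative adjacency matrix.
    Y : (N, t) matrix of overall rewards observed so far.
    """
    diff = Y[:, None, :] - Y[None, :, :]   # diff[i, j, k] = Y[i, k] - Y[j, k]
    D = np.maximum(diff, 0.0).sum(axis=2)  # D[i, j] = sum_k [Y[i,k] - Y[j,k]]^+
    return lam * np.sum(A * D)
```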


Figure 3: Original overall daily new cases and the corresponding predicted values for different days in the validation set.

We perform cross-validation to tune the regularization parameter $\lambda$. As mentioned before, we work on a limited time period with $T=66$ days. Thus, we split the data into train and validation sets in a 10:1 ratio. More specifically, we split the data into 6 subsets of 11 consecutive days. In each subset, one day is chosen uniformly at random to be included in the validation set, while the remaining 10 days are added to the train set. We calculate the prediction error at each time $t$ by

$\textit{Error}(t)=\frac{1}{NK(t)}\sum_{i\in\mathcal{K}(t)}\left\|\mathbf{y}_i-\hat{\mathbf{y}}_i\right\|_1,$ (18)

where $\mathcal{K}(t)$ is the validation set at time $t$ with cardinality $K(t)=|\mathcal{K}(t)|$, and $\mathbf{y}_i$ and $\hat{\mathbf{y}}_i$ are the validation data and the corresponding predicted value using the estimated graph for day $i$, respectively. Figure 3 compares the ground truth overall daily new cases and the predicted overall daily new cases using the estimated graph on 4 different days of the Covid-19 outbreak in our validation data. Due to space limitations, we use abbreviations for region names. Table 1 in Section B.1 of the supplementary material lists the abbreviations together with the original names of the regions. We observe that our proposed framework is capable of estimating the data for each region efficiently, which helps the agent improve its decision-making process in a real-world scenario.
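A small sketch of the error metric (18), under the assumption that the held-out and predicted overall daily cases are given as lists of per-day vectors (the names are our own):

```python
import numpy as np

def prediction_error(Y_val, Y_hat, N):
    """Eq. (18): mean L1 deviation over the K(t) days in the validation set."""
    K = len(Y_val)
    return sum(np.abs(y - y_hat).sum() for y, y_hat in zip(Y_val, Y_hat)) / (N * K)
```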

Learning regions with the highest contribution. 

In Figure 4, we show the decision-making process of the agent over time by following the SEM-UCB policy. Dark rectangles represent the 6 selected regions on each day (time). Based on our framework, we regard the regions selected by our algorithm as those with the biggest contributions to the development of Covid-19 during the time interval considered in our experiment. More specifically, we find the regions of Lombardia, Emilia-Romagna, Lazio, Veneto, Piemonte, and Liguria as the ones that contribute the most to the spread of Covid-19 during that period in Italy.

We emphasize that, due to the causal effects among the regions, the contribution of each region to the spread of Covid-19 differs from its overall daily cases of infection. Thus, the set of regions with the highest contribution does not necessarily coincide with the set of regions with the highest total number of daily cases. This is a key aspect of our problem formulation that is addressed by SEM-UCB in Figure 4. We elaborate more on this fact in Section B.3 of the supplementary material.


Figure 4: Selected regions on each day.

6 Conclusion

In this paper, we developed a combinatorial semi-bandit framework with causally related rewards, where we modelled the causal relations by a directed graph in a structural equation model. We developed a decision-making policy, namely SEM-UCB, that learns the structural dependencies to improve the decision-making process. We proved that SEM-UCB achieves a sublinear regret bound in time. Our framework is applicable in a number of contexts, such as network data analysis of biological networks or financial markets. We applied our method to analyze the development of Covid-19. The experiments showed that SEM-UCB outperforms several state-of-the-art combinatorial semi-bandit algorithms. A future research direction is to extend the current framework to deal with piece-wise stationary environments, where the causal graph and/or the expected instantaneous rewards of the base arms undergo abrupt changes over time.

Acknowledgements

This work was partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC number 2064/1 – Project number 390727645, and by Grant 01IS20051 from the German Federal Ministry of Education and Research (BMBF). We are grateful to Sergio Barbarossa and Sofien Dhouib for fruitful discussions and comments.

References

  • Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Azuma [1967] Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357 – 367, 1967.
  • Bazerque et al. [2013a] Juan Andrés Bazerque, Brian Baingana, and Georgios B Giannakis. Identifiability of sparse structural equation models for directed and cyclic networks. In 2013 IEEE Global Conference on Signal and Information Processing, pages 839–842. IEEE, 2013.
  • Bazerque et al. [2013b] Juan Andrés Bazerque, Brian Baingana, and Georgios B. Giannakis. Identifiability of sparse structural equation models for directed and cyclic networks. In 2013 IEEE Global Conference on Signal and Information Processing, pages 839–842, 2013.
  • Bridgwater and Bóta [2021] Alexander Bridgwater and András Bóta. Identifying regions most likely to contribute to an epidemic outbreak in a human mobility network. In 2021 Swedish Artificial Intelligence Society Workshop (SAIS), pages 1–4. IEEE, 2021.
  • Bull [2021] Martin Bull. The italian government response to covid-19 and the making of a prime minister. Contemporary Italian Politics, pages 1–17, 2021.
  • Chen et al. [2013] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159. PMLR, 2013.
  • Chen et al. [2016] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746–1778, 2016.
  • Dong et al. [2019] Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 36(3):44–63, 2019.
  • Duncan [2004] Andrew Duncan. Powers of the adjacency matrix and the walk matrix. 2004.
  • Gai et al. [2012] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.
  • Giannakis et al. [2018] Georgios B Giannakis, Yanning Shen, and Georgios Vasileios Karanikolas. Topology identification and learning over graphs: Accounting for nonlinearities and dynamics. Proceedings of the IEEE, 106(5):787–807, 2018.
  • Guaitoli and Pancrazi [2021] Gabriele Guaitoli and Roberto Pancrazi. Covid-19: Regional policies and local infection risk: Evidence from italy with a modelling study. The Lancet Regional Health-Europe, 8:100169, 2021.
  • Huyuk and Tekin [2019] Alihan Huyuk and Cem Tekin. Analysis of thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1322–1330. PMLR, 2019.
  • Kaplan [2008] David Kaplan. Structural equation modeling: Foundations and extensions, volume 10. Sage Publications, 2008.
  • Lattimore et al. [2016] Finnian Lattimore, Tor Lattimore, and Mark D Reid. Causal bandits: learning good interventions via causal inference. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 1189–1197, 2016.
  • Maghsudi and Hossain [2016] Setareh Maghsudi and Ekram Hossain. Multi-armed bandits with application to 5g small cells. IEEE Wireless Communications, 23(3):64–73, 2016.
  • Mastakouri and Schölkopf [2020] Atalanti Mastakouri and Bernhard Schölkopf. Causal analysis of covid-19 spread in germany. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3153–3163. Curran Associates, Inc., 2020.
  • Nouvellet et al. [2021] Pierre Nouvellet, Sangeeta Bhatia, Anne Cori, Kylie EC Ainslie, Marc Baguelin, Samir Bhatt, Adhiratha Boonyasiri, Nicholas F Brazeau, Lorenzo Cattarino, Laura V Cooper, et al. Reduction in mobility and covid-19 transmission. Nature communications, 12(1):1–9, 2021.
  • Robbins [1952] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • Sardellitti et al. [2017] Stefania Sardellitti, Sergio Barbarossa, and Paolo Di Lorenzo. On the graph fourier transform for directed graphs. IEEE Journal of Selected Topics in Signal Processing, 11(6):796–811, 2017.
  • Sen et al. [2017] Rajat Sen, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Identifying best interventions through online importance sampling. In International Conference on Machine Learning, pages 3057–3066. PMLR, 2017.
  • Stellato et al. [2020] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: an operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637–672, 2020.
  • Tang et al. [2017] Shaojie Tang, Yaqin Zhou, Kai Han, Zhao Zhang, Jing Yuan, and Weili Wu. Networked stochastic multi-armed bandits with combinatorial strategies. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 786–793. IEEE, 2017.
  • Toni and Frossard [2018] Laura Toni and Pascal Frossard. Spectral mab for unknown graph processes. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 116–120. IEEE, 2018.
  • Yu et al. [2020] Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, and Ole J Mengshoel. Graphical models meet bandits: A variational thompson sampling approach. In International Conference on Machine Learning, pages 10902–10912. PMLR, 2020.
  • Zimmert et al. [2019] Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pages 7683–7692. PMLR, 2019.

Appendix A PROOF OF THEOREM 1

A.1 NOTATIONS

Before proceeding to the proof, we introduce some important notation together with the corresponding definitions.

We define the index set of a decision vector $\mathbf{x}\in\mathcal{X}$ by $\mathcal{I}(\mathbf{x})=\{i~|~\mathbf{x}[i]\neq 0,\ i\in[N]\}$. The confidence bound of base arm $i$ at time $t$ is defined as $\mathbf{C}_t[i]=\sqrt{\frac{(s+1)\ln t}{\mathbf{m}_t[i]}}$. At each time $t$, we collect the empirical averages of instantaneous rewards $\hat{\boldsymbol{\beta}}_t[i]$ and the calculated confidence bounds $\mathbf{C}_t[i]$ of all base arms $i\in[N]$ in the vectors $\hat{\boldsymbol{\beta}}_t$ and $\mathbf{C}_t$, respectively, so that $\mathbf{E}_t=\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t$. For ease of presentation, in the sequel we use the equivalence $\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{E}_{t-1})\mathbf{x}_t=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{x}_t)\mathbf{E}_{t-1}$. At each time $t$, we define the selection index of a decision vector $\mathbf{x}\in\mathcal{X}$ as $I_t(\mathbf{x})=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{x})\mathbf{E}_{t-1}$. To simplify the notation, we sometimes drop the time index $t$ in $\mathbf{m}_t[i]$ and write $\mathbf{m}[i]$ for the number of times that base arm $i$ has been observed up to the current time instance.

For any $\mathbf{x}\in\mathcal{X}$, we use the counter $\mathcal{T}_{\mathbf{x}}(t)$ to represent the total number of times the decision vector $\mathbf{x}$ is selected up to time $t$. Finally, for each base arm $i\in[N]$, we define a counter $\mathscr{T}_i(t)$ which is updated as follows. At each time $t$ after the initialization phase at which a suboptimal decision vector $\mathbf{x}_t$ is selected, there is at least one base arm $i\in[N]$ such that $i=\operatorname{argmin}_{i\in\mathcal{I}(\mathbf{x}_t)}\mathbf{m}_t[i]$. If this base arm is unique, we increment $\mathscr{T}_i(t)$ by 1; if there is more than one such base arm, we break the tie arbitrarily and increment the corresponding counter of one of them.

A.2 AUXILIARY RESULTS

We use the following lemma in the proof of Theorem 1.

Lemma 1.

(Azuma [1967]) Let $z_1,z_2,\dots,z_m$ be random variables with $z_i\in[0,1]$, $\forall i$, and suppose that $\mathbb{E}[z_t|z_1,\dots,z_{t-1}]=\alpha$ for all $t=1,\dots,m$. Then, for all $D\geq 0$,

$\mathbb{P}\Big[\Big|\sum_{i=1}^{m}z_i-m\alpha\Big|\geq D\Big]\leq e^{-\frac{2D^2}{m}}.$ (19)

A.3 PROOF

We start by rewriting the expected regret as

$\mathcal{R}_T(\mathcal{X})=T\mu(\mathbf{x}^{\ast})-\sum_{t=1}^{T}\mu(\mathbf{x}_t)=\sum_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\Delta(\mathbf{x})\,\mathbb{E}[\mathcal{T}_{\mathbf{x}}(T)].$ (20)

Based on the definition of the counters $\mathscr{T}_i(t)$ for the base arms $i\in[N]$, at each time $t$ at which a suboptimal decision vector is selected, exactly one such counter is incremented by 1. Thus, we have Gai et al. [2012]

$\mathbb{E}\left[\sum_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\mathcal{T}_{\mathbf{x}}(t)\right]=\mathbb{E}\left[\sum_{i=1}^{N}\mathscr{T}_i(t)\right],$ (21)

which implies that

$\sum_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\mathbb{E}\left[\mathcal{T}_{\mathbf{x}}(t)\right]=\sum_{i=1}^{N}\mathbb{E}\left[\mathscr{T}_i(t)\right].$ (22)

Therefore, we observe that

$\mathcal{R}_T(\mathcal{X})=\sum_{\mathbf{x}:\mu(\mathbf{x})<\mu(\mathbf{x}^{\ast})}\Delta(\mathbf{x})\,\mathbb{E}[\mathcal{T}_{\mathbf{x}}(T)]\stackrel{(\ast)}{\leq}\Delta_{\max}\sum_{i=1}^{N}\mathbb{E}[\mathscr{T}_i(T)],$ (23)

where $(\ast)$ follows from the definition of $\Delta_{\max}$.

Let $\mathbbm{I}_i(t)$ denote the indicator function which is equal to 1 if $\mathscr{T}_i(t)$ is increased by 1 at time $t$, and 0 otherwise. Therefore,

$\mathscr{T}_i(T)=\sum_{t=N+1}^{T}\mathds{1}\{\mathbbm{I}_i(t)=1\}.$ (24)

If $\mathbbm{I}_i(t)=1$, a suboptimal decision vector $\mathbf{x}_t$ was selected at time $t$; in this case, $\mathbf{m}_t[i]=\min\{\mathbf{m}_t[j]\,|\,j\in\mathcal{I}(\mathbf{x}_t)\}$. Let $l=\left\lceil\frac{4(s+1)\ln T}{(\Delta_{\min}/(s w_{\max}))^2}\right\rceil$. Then,

$\mathscr{T}_i(T)=\sum_{t=N+1}^{T}\mathds{1}\{\mathbbm{I}_i(t)=1\}$
$\leq l+\sum_{t=N+1}^{T}\mathds{1}\{\mathbbm{I}_i(t)=1~\&~\mathscr{T}_i(t-1)\geq l\}$
$\leq l+\sum_{t=N+1}^{T}\mathds{1}\{I_t(\mathbf{x}^{\ast})\leq I_t(\mathbf{x}_t)~\&~\mathscr{T}_i(t-1)\geq l\}$
$=l+\sum_{t=N+1}^{T}\mathds{1}\{\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{x}^{\ast})\mathbf{E}_{t-1}\leq\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t-1})^{-1}\operatorname{diag}(\mathbf{x}_t)\mathbf{E}_{t-1}~\&~\mathscr{T}_i(t-1)\geq l\}$
$=l+\sum_{t=N}^{T}\mathds{1}\{\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}^{\ast})\mathbf{E}_t\leq\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}_{t+1})\mathbf{E}_t~\&~\mathscr{T}_i(t)\geq l\}.$ (25)

Based on the definition of $\mathscr{T}_i(t)$, we have $\mathscr{T}_i(t)\leq\mathbf{m}_t[i]$, $\forall i\in[N]$. Therefore, when $\mathscr{T}_i(t)\geq l$, the following holds Gai et al. [2012].

$l\leq\mathscr{T}_i(t)\leq\mathbf{m}_t[j],\quad\forall j\in\mathcal{I}(\mathbf{x}_{t+1}).$ (26)

Let $\mathbf{v}_{t+1}^{\top}=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}^{\ast})$ and $\mathbf{u}_{t+1}^{\top}=\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}_{t+1})$. We order the elements of the sets $\mathcal{I}(\mathbf{x}^{\ast})$ and $\mathcal{I}(\mathbf{x}_{t+1})$ arbitrarily; the following results are independent of this ordering. Let $v_k$, $k=1,\dots,|\mathcal{I}(\mathbf{x}^{\ast})|\leq s$, denote the $k$th element of $\mathcal{I}(\mathbf{x}^{\ast})$ and $u_k$, $k=1,\dots,|\mathcal{I}(\mathbf{x}_{t+1})|\leq s$, the $k$th element of $\mathcal{I}(\mathbf{x}_{t+1})$. Hence, we have

$\mathscr{T}_i(T)\leq l+\sum_{t=N}^{T}\mathds{1}\Big\{\min_{0<\mathbf{m}[v_1],\dots,\mathbf{m}[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}]\leq t}\sum_{j=1}^{|\mathcal{I}(\mathbf{x}^{\ast})|}\mathbf{v}_{t+1}[v_j](\hat{\boldsymbol{\beta}}_t[v_j]+\mathbf{C}_t[v_j])\leq\max_{l\leq\mathbf{m}[u_1],\dots,\mathbf{m}[u_{|\mathcal{I}(\mathbf{x}_{t+1})|}]\leq t}\sum_{j=1}^{|\mathcal{I}(\mathbf{x}_{t+1})|}\mathbf{u}_{t+1}[u_j](\hat{\boldsymbol{\beta}}_t[u_j]+\mathbf{C}_t[u_j])\Big\}$
$\leq l+\sum_{t=1}^{\infty}\sum_{m_{v_1}=1}^{t}\dots\sum_{m_{v_{|\mathcal{I}(\mathbf{x}^{\ast})|}}=1}^{t}\sum_{m_{u_1}=l}^{t}\dots\sum_{m_{u_{|\mathcal{I}(\mathbf{x}_{t+1})|}}=l}^{t}\mathds{1}\Big\{\sum_{j=1}^{|\mathcal{I}(\mathbf{x}^{\ast})|}\mathbf{v}_{t+1}[v_j](\hat{\boldsymbol{\beta}}_t[v_j]+\mathbf{C}_t[v_j])\leq\sum_{j=1}^{|\mathcal{I}(\mathbf{x}_{t+1})|}\mathbf{u}_{t+1}[u_j](\hat{\boldsymbol{\beta}}_t[u_j]+\mathbf{C}_t[u_j])\Big\}.$ (27)

We define the event $\mathcal{P}$ as

$\sum_{j=1}^{|\mathcal{I}(\mathbf{x}^{\ast})|}\mathbf{v}_{t+1}[v_j](\hat{\boldsymbol{\beta}}_t[v_j]+\mathbf{C}_t[v_j])\leq\sum_{j=1}^{|\mathcal{I}(\mathbf{x}_{t+1})|}\mathbf{u}_{t+1}[u_j](\hat{\boldsymbol{\beta}}_t[u_j]+\mathbf{C}_t[u_j]).$ (28)

If the event $\mathcal{P}$ in (28) is true, then at least one of the following events must be true.

$\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t)\leq\mathbf{1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta},$ (29)
$\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}_{t+1})(\hat{\boldsymbol{\beta}}_t-\mathbf{C}_t)\geq\mathbf{1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\operatorname{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta},$ (30)
$\mathbf{1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}<\mathbf{1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\operatorname{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}+2\,\mathbf{1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}\operatorname{diag}(\mathbf{x}_{t+1})\mathbf{C}_t.$ (31)

First, we consider (29). Based on our problem formulation and proposed solution, the matrices $\mathbf{A}$ and $\hat{\mathbf{A}}_t$ are nilpotent with index $N$; thus, $\mathbf{A}^N=\mathbf{0}_{N\times N}$ and $\hat{\mathbf{A}}_t^N=\mathbf{0}_{N\times N}$. Hence, we can write the Taylor series of $(\mathbf{I}-\mathbf{A})^{-1}$ and $(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}$ as

$(\mathbf{I}-\mathbf{A})^{-1}=\mathbf{I}+\mathbf{A}+\mathbf{A}^2+\dots+\mathbf{A}^{N-1},$ (32)

and

$(\mathbf{I}-\hat{\mathbf{A}}_t)^{-1}=\mathbf{I}+\hat{\mathbf{A}}_t+\hat{\mathbf{A}}_t^2+\dots+\hat{\mathbf{A}}_t^{N-1},$ (33)

respectively. Substituting (32) and (33) in (29) results in

$\mathbf{1}^{\top}(\mathbf{I}+\hat{\mathbf{A}}_t+\hat{\mathbf{A}}_t^2+\dots+\hat{\mathbf{A}}_t^{N-1})\operatorname{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t)\leq\mathbf{1}^{\top}(\mathbf{I}+\mathbf{A}+\mathbf{A}^2+\dots+\mathbf{A}^{N-1})\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}.$ (34)

For $j=1,\dots,N$, we find an upper bound for

$\mathbb{P}\left[\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t)\leq\mathbf{1}^{\top}\mathbf{A}^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}\right].$ (35)

We consider the following event $\mathcal{E}$:

$\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t)+\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}\leq\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}+\mathbf{1}^{\top}\mathbf{A}^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}.$ (36)

If $\mathcal{E}$ is true, then at least one of the following must hold:

$\underbrace{\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_t+\mathbf{C}_t)\leq\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}}_{\mathcal{I}},$ (37)
$\underbrace{\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}\leq\mathbf{1}^{\top}\mathbf{A}^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}}_{\mathcal{II}}.$ (38)

Therefore, we have

$\mathbb{P}\left[\mathcal{E}\right]\leq\mathbb{P}\left[\mathcal{I}\right]+\mathbb{P}\left[\mathcal{II}\right].$ (39)

Let $\mathbf{y}_t^{\top}=\mathbf{1}^{\top}\hat{\mathbf{A}}_t^{j-1}\operatorname{diag}(\mathbf{x}^{\ast})$. If event $\mathcal{I}$ is true, then at least one of the following must hold:

$\mathbf{y}_t^{\top}[v_1](\hat{\boldsymbol{\beta}}_t[v_1]+\mathbf{C}_t[v_1])\leq\mathbf{y}_t^{\top}[v_1]\boldsymbol{\beta}[v_1],$ (40)
$\mathbf{y}_t^{\top}[v_2](\hat{\boldsymbol{\beta}}_t[v_2]+\mathbf{C}_t[v_2])\leq\mathbf{y}_t^{\top}[v_2]\boldsymbol{\beta}[v_2],$ (41)
$\vdots$
$\mathbf{y}_t^{\top}[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}](\hat{\boldsymbol{\beta}}_t[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}]+\mathbf{C}_t[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}])\leq\mathbf{y}_t^{\top}[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}]\boldsymbol{\beta}[v_{|\mathcal{I}(\mathbf{x}^{\ast})|}].$ (42)

For $k=1,\dots,|\mathcal{I}(\mathbf{x}^{\ast})|$, we have

$\mathbb{P}\Big[\mathbf{y}_t^{\top}[v_k](\hat{\boldsymbol{\beta}}_t[v_k]+\mathbf{C}_t[v_k])\leq\mathbf{y}_t^{\top}[v_k]\boldsymbol{\beta}[v_k]\Big]\stackrel{(a)}{=}\mathbb{P}\Big[\mathbf{m}_t[v_k](\hat{\boldsymbol{\beta}}_t[v_k]+\mathbf{C}_t[v_k])\leq\mathbf{m}_t[v_k]\boldsymbol{\beta}[v_k]\Big]\stackrel{(b)}{\leq}e^{-(2/\mathbf{m}_t[v_k])\mathbf{m}_t[v_k]^2\mathbf{C}_t[v_k]^2}\stackrel{(c)}{=}e^{-2(s+1)\ln t}=t^{-2(s+1)},$ (43)

where $(a)$ holds since $\mathbf{y}_t^{\top}[v_k]\geq 0$, $\forall k$, $(b)$ follows from Lemma 1, and $(c)$ results from the definition of $\mathbf{C}_t$. Hence, for event $\mathcal{I}$, we conclude that

$\mathbb{P}\left[\mathcal{I}\right]\leq|\mathcal{I}(\mathbf{x}^{\ast})|\,t^{-2(s+1)}\leq s\,t^{-2(s+1)}.$ (44)

Now, we consider Event $\mathcal{II}$. Based on Theorem 1 in Bazerque et al. [2013b], we can identify the adjacency matrix $\mathbf{A}$ uniquely from the $N$ samples gathered during the initialization period of our proposed algorithm. This means that, with probability $1$, $\hat{\mathbf{A}}_{t}=\mathbf{A}$ holds for all $t>\theta$, where $\theta=N<\infty$. Therefore, for $t>N$, Event $\mathcal{II}$ holds with probability $1$.

Combining the aforementioned results with (39), we obtain the following upper bound for (35):

\mathbb{P}\Big[{\bf 1}^{\top}\hat{\mathbf{A}}_{t}^{j-1}\text{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_{t}+\mathbf{C}_{t})\leq{\bf 1}^{\top}\mathbf{A}^{j-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}\Big]\leq st^{-2(s+1)}, (45)

for each $j=1,\dots,N$. Since $\hat{\mathbf{A}}_{t}=\mathbf{A}$ for all $t>N$ and the length of the longest path in the graph is $p$, we can rewrite (32) and (33), respectively, as [Duncan, 2004]

(\mathbf{I}-\mathbf{A})^{-1}=\mathbf{I}+\mathbf{A}+\mathbf{A}^{2}+\dots+\mathbf{A}^{p} (46)

and

(\mathbf{I}-\hat{\mathbf{A}}_{t})^{-1}=\mathbf{I}+\hat{\mathbf{A}}_{t}+\hat{\mathbf{A}}_{t}^{2}+\dots+\hat{\mathbf{A}}_{t}^{p}. (47)
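Both identities rely on the fact that the adjacency matrix of a directed acyclic graph is nilpotent, so the Neumann series terminates after the $p$-th power. As a sanity check, the following minimal sketch verifies the identity numerically for a small, hypothetical weighted DAG; the matrix entries and the path length are illustrative assumptions, not values from our experiments.

import numpy as np

# Hypothetical weighted adjacency matrix of a 4-node DAG.
# It is strictly upper triangular, hence nilpotent (A^4 = 0).
A = np.array([[0.0, 0.3, 0.0, 0.2],
              [0.0, 0.0, 0.5, 0.1],
              [0.0, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0]])
p = 3  # length of the longest path in this DAG

# Truncated Neumann series: I + A + A^2 + ... + A^p.
neumann = sum(np.linalg.matrix_power(A, k) for k in range(p + 1))

# Exact inverse; the two agree because A^(p+1) = 0.
exact = np.linalg.inv(np.eye(4) - A)
assert np.allclose(neumann, exact)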

Therefore, by using (46) and (47) in place of (32) and (33), and based on (A.3), the following holds for (A.3):

\mathbb{P}\Big[{\bf 1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t})^{-1}\text{diag}(\mathbf{x}^{\ast})(\hat{\boldsymbol{\beta}}_{t}+\mathbf{C}_{t})\leq{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}\Big]\leq s^{p}t^{-2p(s+1)}. (48)

For (A.3), we obtain a similar result:

\mathbb{P}\Big[{\bf 1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t})^{-1}\text{diag}(\mathbf{x}_{t+1})(\hat{\boldsymbol{\beta}}_{t}-\mathbf{C}_{t})\geq{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}\Big]\leq s^{p}t^{-2p(s+1)}. (49)

Finally, we consider (31). We have

{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-2\,{\bf 1}^{\top}(\mathbf{I}-\hat{\mathbf{A}}_{t})^{-1}\text{diag}(\mathbf{x}_{t+1})\mathbf{C}_{t}
\stackrel{(a)}{=}{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-2\sum_{j\in\mathcal{I}(\mathbf{x}_{t+1})}\mathbf{w}_{t}^{\top}[j]\,\mathbf{C}_{t}[j]
\stackrel{(b)}{=}{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-2\sum_{j\in\mathcal{I}(\mathbf{x}_{t+1})}\mathbf{w}_{t}^{\top}[j]\sqrt{\frac{(s+1)\ln t}{\mathbf{m}_{t}[j]}}
\stackrel{(c)}{\geq}{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-2sw_{\max}\sqrt{\frac{(s+1)\ln T}{l}}
\stackrel{(d)}{\geq}{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-\Delta_{\min}
\stackrel{(e)}{\geq}{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}^{\ast})\boldsymbol{\beta}-{\bf 1}^{\top}(\mathbf{I}-\mathbf{A})^{-1}\text{diag}(\mathbf{x}_{t+1})\boldsymbol{\beta}-\Delta(\mathbf{x}_{t+1})=0, (50)

where in $(a)$ and $(c)$ we use the definitions of $\mathbf{w}_{t}^{\top}$ and $w_{\max}$, respectively. Moreover, in $(b)$ and $(d)$, we substitute the values of $\mathbf{C}_{t}[j]$ and $l$, respectively, and $(e)$ follows from the definition of $\Delta_{\min}$. Hence, we conclude that (31) never happens.
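For completeness, step $(d)$ can be verified directly. Assuming, consistently with the leading term of (51) below, that $l=\big\lceil 4(s+1)s^{2}w_{\max}^{2}\ln T/\Delta_{\min}^{2}\big\rceil$, we get

2sw_{\max}\sqrt{\frac{(s+1)\ln T}{l}}\leq 2sw_{\max}\sqrt{\frac{(s+1)\ln T\,\Delta_{\min}^{2}}{4(s+1)s^{2}w_{\max}^{2}\ln T}}=\Delta_{\min}.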

By using (A.3), (A.3), and (A.3), we obtain the following.

\mathbb{E}[\mathscr{T}_{i}(T)]\leq\left\lceil\frac{4(s+1)\ln T}{(\Delta_{\min}/(sw_{\max}))^{2}}\right\rceil+\sum_{t=1}^{\infty}\Bigg[\sum_{m_{v_{1}}=1}^{t}\dots\sum_{m_{v_{s}}=1}^{t}\sum_{m_{u_{1}}=l}^{t}\dots\sum_{m_{u_{s}}=l}^{t}2s^{p}t^{-2p(s+1)}\Bigg]
\leq\frac{4w_{\max}^{2}s^{2}(s+1)\ln T}{\Delta_{\min}^{2}}+1+s^{p}\sum_{t=1}^{\infty}2t^{-2}
\leq\frac{4w_{\max}^{2}s^{2}(s+1)\ln T}{\Delta_{\min}^{2}}+1+\frac{\pi^{2}}{3}s^{p}. (51)
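In the second inequality, $\lceil x\rceil\leq x+1$ accounts for the additive $1$, and since each of the $2s$ inner sums contains at most $t$ terms, the bracket is at most $2s^{p}t^{2s}\,t^{-2p(s+1)}\leq 2s^{p}t^{-2}$ whenever $p\geq 1$. The last inequality is the Basel series:

\sum_{t=1}^{\infty}2t^{-2}=2\cdot\frac{\pi^{2}}{6}=\frac{\pi^{2}}{3}.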

Therefore, the expected regret is upper bounded as

\mathcal{R}_{T}(\mathcal{X})\leq\Delta_{\max}\sum_{i=1}^{N}\mathbb{E}[\mathscr{T}_{i}(T)]
\leq\sum_{i=1}^{N}\Big[\frac{4w_{\max}^{2}s^{2}(s+1)\ln T}{\Delta_{\min}^{2}}+1+\frac{\pi^{2}}{3}s^{p}\Big]\Delta_{\max}
\leq\Big[\frac{4w_{\max}^{2}s^{2}(s+1)N\ln T}{\Delta_{\min}^{2}}+N+\frac{\pi^{2}}{3}s^{p}N\Big]\Delta_{\max}. (52)

\blacksquare
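To make the scaling of the bound (52) concrete, the following sketch evaluates its right-hand side; it is purely illustrative. We take $N=21$ and $s=6$ as in the Covid-19 experiment below, while the remaining parameter values ($p$, $w_{\max}$, $\Delta_{\min}$, $\Delta_{\max}$) are assumptions chosen for demonstration only.

import math

def regret_upper_bound(N, s, p, w_max, delta_min, delta_max, T):
    """Right-hand side of the regret bound (52); all inputs are illustrative."""
    log_term = 4 * w_max**2 * s**2 * (s + 1) * N * math.log(T) / delta_min**2
    return (log_term + N + (math.pi**2 / 3) * s**p * N) * delta_max

# The bound grows as O(N ln T) in the horizon T.
print(regret_upper_bound(N=21, s=6, p=3, w_max=1.0,
                         delta_min=0.1, delta_max=1.0, T=10_000))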

Appendix B Additional Information and Experiments Regarding the Covid-19 Dataset

B.1 Abbreviations of Regions in Italy

Table 1 lists the abbreviations together with the original names of the 21 regions in Italy that we study in our numerical experiments.

Table 1: List of regions in Italy and the corresponding abbreviations.
Abbreviation Region Name
ABR Abruzzo
BAS Basilicata
CAL Calabria
CAM Campania
EMR Emilia-Romagna
FVG Friuli Venezia Giulia
LAZ Lazio
LIG Liguria
LOM Lombardia
MAR Marche
MOL Molise
PAB Provincia Autonoma di Bolzano
PAT Provincia Autonoma di Trento
PIE Piemonte
PUG Puglia
SAR Sardegna / Sardigna
SIC Siciliana
TOS Toscana
UMB Umbria
VDA Valle d’Aosta / Vallée d’Aoste
VEN Veneto

B.2 Overall Daily New Cases of Covid-19 Infection

Italy has been severely affected by the Covid-19 pandemic. In April 2020, the country had the highest death toll in Europe. From the beginning of the pandemic, with the goal of containing the outbreak, the Italian government has put in place an increasing number of restrictions. Figure 5 depicts the overall daily new cases of Covid-19 in the 21 regions of Italy for the time interval considered in our numerical experiments. Due to space limitations, we use abbreviations for the region names. Table 1 lists the abbreviations together with the original names of the regions.

B.3 Additional Experiments

As governments try to contain the spread of Covid-19, they usually adopt restrictive measures, such as quarantine, in the regions showing the highest number of overall daily new infections. In doing so, they ignore the causal spread of the virus; that is, they focus only on the regions' overall daily new cases and disregard their causal effects on other regions. We therefore refer to this method of finding the best political interventions as the naive approach. Our goal is to show the superiority of our proposed algorithm over this naive approach. Similar to the experiments in our paper, we run SEM-UCB to find the 6 regions that contribute the most to the total number of daily new cases in Italy. Figure 6 compares the performance of our algorithm with that of the naive approach. The diagram shows, for each day, the ratio of the contribution of the selected regions to the total number of daily new infections in the country. As expected, after the initialization phase, SEM-UCB learns the underlying graph that influences the data. Consequently, it outperforms the naive approach because it takes the causal effects into account. We note that, due to such causal effects, a region with a lower number of overall daily cases may contribute more than regions with higher numbers of overall daily cases. This diagram provides evidence that our framework can be highly effective in real-world applications such as the analysis of the spread of Covid-19.
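For clarity, the comparison metric plotted in Figure 6 can be computed as in the following sketch. This is our own illustrative reading of the metric rather than released experiment code; the array shapes, function names, and the selection rule of the naive approach are assumptions. For SEM-UCB, the per-region contributions would additionally account for the causal effects through the learned graph.

import numpy as np

def contribution_ratio(contributions: np.ndarray, selected: np.ndarray) -> np.ndarray:
    """Per-day ratio of the selected regions' contributions to the country total.

    contributions: shape (T, N), per-day contribution of each of the N regions.
    selected:      shape (T, s), indices of the s regions chosen on each day.
    """
    totals = contributions.sum(axis=1)                       # country-wide daily totals
    picked = np.take_along_axis(contributions, selected, 1)  # chosen regions' share
    return picked.sum(axis=1) / totals

# The naive approach selects, each day, the s regions with the highest raw
# daily counts, ignoring the causal spread of the virus between regions.
def naive_selection(daily_cases: np.ndarray, s: int) -> np.ndarray:
    return np.argsort(daily_cases, axis=1)[:, -s:]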


Figure 5: Overall daily new cases of Covid-19 for different regions in Italy during the study period.


Figure 6: The ratio of the contribution of the regions selected by SEM-UCB and by the naive approach to the total number of daily new infections in the country, for each day.