
When Are Linear Stochastic Bandits Attackable?

Huazheng Wang
Princeton University
[email protected]
Most work was done while with University of Virginia.
   Haifeng Xu
University of Virginia
[email protected]
   Hongning Wang
University of Virginia
[email protected]
Abstract

We study adversarial attacks on linear stochastic bandits: by manipulating the rewards, an adversary aims to control the behavior of the bandit algorithm. Perhaps surprisingly, we first show that some attack goals can never be achieved. This is in sharp contrast to context-free stochastic bandits, and is intrinsically due to the correlation among arms in linear stochastic bandits. Motivated by this finding, this paper studies the attackability of a $k$-armed linear bandit environment. We first provide a complete necessary and sufficient characterization of attackability based on the geometry of the arms' context vectors. We then propose a two-stage attack method against LinUCB and Robust Phase Elimination. The method first asserts whether the given environment is attackable; and if yes, it poisons the rewards to force the algorithm to pull a target arm a linear number of times using only a sublinear cost. Numerical experiments further validate the effectiveness and cost-efficiency of the proposed attack method.

1 Introduction

In a contextual bandit problem, a learner takes sequential actions to interact with an environment to maximize its received cumulative reward. As a natural and important variant, linear stochastic bandits (Auer, 2002; Li et al., 2010; Abbasi-yadkori et al., 2011) assume the expected reward of an arm $a$ is a linear function of its feature vector $x_a$ and an unknown bandit parameter $\boldsymbol{\theta}^*$. A linear bandit algorithm thus adaptively improves its estimation of $\boldsymbol{\theta}^*$ based on the reward feedback on its pulled arms. Thanks to their sound theoretical guarantees and promising empirical performance, linear stochastic bandits have become a reference solution to many real-world problems, such as content recommendation and online advertisement (Li et al., 2010; Chapelle and Li, 2011; Durand et al., 2018).

Since bandit algorithms adapt their behavior according to the received feedback, such algorithms are susceptible to adversarial attacks, especially poisoning attacks. Under such an attack, a malicious adversary observes the pulled arm and its reward feedback, and then modifies the reward to misguide the bandit algorithm to pull a target arm, which is of the adversary's interest. Due to the wide applicability of bandit algorithms in practice as mentioned before, understanding the robustness of such algorithms under poisoning attacks becomes increasingly important (Jun et al., 2018; Liu and Shroff, 2019; Garcelon et al., 2020).

Most existing studies on adversarial attacks in bandits focused on the context-free stochastic multi-armed bandit (MAB) setting. Jun et al. (2018) and Liu and Shroff (2019) showed that an adversary can force any MAB algorithm to pull a target arm a linear number of times using only a logarithmic cost. Garcelon et al. (2020) showed the vulnerability of $k$-armed linear contextual bandits under poisoning attacks. Linear stochastic bandits are related to context-free stochastic bandits and linear contextual bandits. Interestingly, however, there has been no known result about attacks on linear stochastic bandits until now. This paper provides a formal explanation for this gap: the analysis of attacks on linear stochastic bandits turns out to be significantly more challenging due to the correlation among arms; in fact, some learning environments are provably unattackable.

Specifically, we fill the aforementioned gap by studying poisoning attacks on $k$-armed linear stochastic bandits, where an adversary modifies the rewards using a sublinear attack cost to misguide the bandit algorithm into pulling a target arm $\tilde{x}$ a linear number of times. We first show that a linear stochastic bandit environment is not always efficiently attackable (throughout this paper, "efficient attack" means fooling the bandit algorithm to pull the target arm a linear number of times with a sublinear attack cost; we use attackable and efficiently attackable interchangeably, as the adversary normally only has a limited budget and needs to design a cost-efficient strategy). Its attackability is governed by the feasibility of finding a parameter vector $\tilde{\boldsymbol{\theta}}$ under which the rewards of all non-target arms are smaller than the reward of the target arm $\tilde{x}$, while the reward of $\tilde{x}$ remains the same as in the original environment specified by $\boldsymbol{\theta}^*$. Intuitively, to promote the target arm $\tilde{x}$, an adversary needs to lower the rewards of non-target arms in the null space of $\tilde{x}$ via $\tilde{\boldsymbol{\theta}}$, which might not always be feasible. We prove the feasibility of the resulting convex quadratic program is both sufficient and necessary for attacking a linear stochastic bandit environment.

Inspired by our attackability analysis, we propose a two-stage attack framework against linear stochastic bandit algorithms and demonstrate its application against LinUCB (Li et al., 2010) and Robust Phase Elimination (Bogunovic et al., 2021): the former is one of the most widely used linear contextual bandit algorithms, and the latter is a robust version designed for settings with adversarial corruptions. In the first stage, our method collects a carefully calibrated amount of rewards on the target arm to assess whether the given environment is attackable. The decision is based on an "empirical" version of our feasibility characterization. If attackable, i.e., there exists a feasible solution $\tilde{\boldsymbol{\theta}}$, in the second stage the method depresses the rewards the bandit algorithm receives on non-target arms based on $\tilde{\boldsymbol{\theta}}$, to fool the bandit algorithm into recognizing the target arm as optimal. We prove that in an attackable environment, both algorithms can be successfully manipulated with only a sublinear cost.

Our main contributions can be summarized as follows:

  • We characterize the sufficient and necessary conditions for when a stochastic linear bandit environment is attackable as the feasibility of a convex quadratic program. En route to proving the sufficiency, we also provide an oracle attack method that can attack any no-regret learning algorithm given knowledge of the ground-truth bandit parameter $\boldsymbol{\theta}^*$. If the environment is unattackable, i.e., the program is infeasible, our necessity proof implies that even the vanilla LinUCB algorithm cannot be efficiently attacked. A direct corollary of our characterization is that context-free stochastic MAB is always attackable, echoing the findings in (Jun et al., 2018; Liu and Shroff, 2019).

  • We propose a two-stage attack method that works without knowledge of the ground-truth bandit parameter. In the first stage, the algorithm detects the attackability of the environment and estimates the model parameter. In the second stage, it solves for a working solution $\tilde{\boldsymbol{\theta}}$ and attacks accordingly. Our theoretical analysis shows this attack method is effective against LinUCB (Li et al., 2010) and Robust Phase Elimination (Bogunovic et al., 2021), i.e., pulling the target arm $T-o(T)$ times using $o(T)$ cost when the environment is attackable.

2 Preliminaries

Linear stochastic bandit. We study poisoning attacks on the fundamental $k$-armed linear stochastic bandit problem (Auer, 2002), where a bandit algorithm sequentially interacts with an environment for $T$ rounds. In each round $t$, the algorithm pulls an arm $a_t\in[k]=\{1,\cdots,k\}$ from a set $\mathcal{A}=\{x_i\}_{i=1}^{k}$ of $k$ arms, and receives reward $r_{a_t}$ from the environment. Each arm $a$ is associated with a $d$-dimensional context vector $x_a\in\mathbb{R}^d$; and the observed reward follows a linear mapping $r_{a_t}=x_{a_t}^{\mathsf{T}}\boldsymbol{\theta}^*+\eta_t$, where $\boldsymbol{\theta}^*\in\mathbb{R}^d$ is a common unknown bandit parameter vector and $\eta_t$ is an $R$-sub-Gaussian noise term. We assume context vectors and parameters are all bounded; for convenience and without loss of generality, we assume $\|x_i\|_2\leq 1$ and $\|\boldsymbol{\theta}^*\|_2\leq 1$. The performance of a bandit algorithm is evaluated by its pseudo-regret, defined as $R_T(\boldsymbol{\theta}^*)=\sum_{t=1}^{T}(x_{a^*}^{\mathsf{T}}\boldsymbol{\theta}^*-x_{a_t}^{\mathsf{T}}\boldsymbol{\theta}^*)$, where $a^*$ is the best arm according to $\boldsymbol{\theta}^*$, i.e., $a^*=\arg\max_{a\in[k]} x_a^{\mathsf{T}}\boldsymbol{\theta}^*$.
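To make the setup concrete, the following is a minimal Python sketch of such an environment. The class name, interface, and Gaussian noise are our own illustration (the paper only requires $R$-sub-Gaussian noise); it is not code from the paper.

```python
import numpy as np

class LinearBanditEnv:
    """k-armed linear stochastic bandit: r_a = x_a^T theta* + eta (Gaussian noise here)."""

    def __init__(self, arms, theta_star, noise_std=0.1, seed=0):
        self.arms = np.asarray(arms, dtype=float)          # shape (k, d), each ||x_a||_2 <= 1
        self.theta_star = np.asarray(theta_star, float)    # shape (d,),  ||theta*||_2 <= 1
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Return a noisy reward for arm index a."""
        return self.arms[a] @ self.theta_star + self.rng.normal(0.0, self.noise_std)

    def best_arm(self):
        """Index of the arm maximizing x_a^T theta* (used for pseudo-regret)."""
        return int(np.argmax(self.arms @ self.theta_star))
```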

LinUCB.

LinUCB (Li et al., 2010; Abbasi-yadkori et al., 2011) is a classical algorithm for linear stochastic bandits. It estimates the bandit model parameter $\hat{\boldsymbol{\theta}}$ using ridge regression, i.e., $\hat{\boldsymbol{\theta}}_t=\mathbf{A}_t^{-1}\sum_{i=1}^{t}x_{a_i}r_i$, where $\mathbf{A}_t=\sum_{i=1}^{t}x_{a_i}x_{a_i}^{\mathsf{T}}+\lambda\mathbf{I}$ and $\lambda$ is the L2-regularization coefficient. We use $\lVert x\rVert_{\mathbf{A}}=\sqrt{x^{\mathsf{T}}\mathbf{A}x}$ to denote the matrix norm of vector $x$. The confidence bound on the reward estimate of arm $x$ is defined as $\text{CB}_t(x)=\alpha_t\lVert x\rVert_{\mathbf{A}_t^{-1}}$, where $\alpha_t$ is a high-probability bound on $\lVert\boldsymbol{\theta}^*-\hat{\boldsymbol{\theta}}_t\rVert_{\mathbf{A}_t}$. In each round $t$, LinUCB pulls the arm with the highest upper confidence bound, i.e., $a_t=\operatorname{arg\,max}_{a\in[k]}[x_a^{\mathsf{T}}\hat{\boldsymbol{\theta}}_t+\text{CB}_t(x_a)]$, to balance the exploration-exploitation trade-off. LinUCB achieves a sublinear regret upper bound of $R_T=\tilde{O}(\sqrt{T})$, where $\tilde{O}$ hides logarithmic terms.
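The sketch below illustrates this selection rule in Python. It is a simplification: $\alpha_t$ is treated as a fixed constant rather than the exact high-probability bound of Abbasi-yadkori et al. (2011), and the class interface is ours.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB with ridge regression (illustrative sketch, constant alpha)."""

    def __init__(self, arms, lam=1.0, alpha=1.0):
        self.arms = np.asarray(arms, dtype=float)
        d = self.arms.shape[1]
        self.A = lam * np.eye(d)          # A_t = sum_i x_{a_i} x_{a_i}^T + lam * I
        self.b = np.zeros(d)              # sum_i x_{a_i} r_i
        self.alpha = alpha

    def select(self):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b        # ridge-regression estimate theta_hat_t
        cb = self.alpha * np.sqrt(np.sum((self.arms @ A_inv) * self.arms, axis=1))
        return int(np.argmax(self.arms @ theta_hat + cb))   # highest UCB score

    def update(self, a, reward):
        x = self.arms[a]
        self.A += np.outer(x, x)
        self.b += reward * x
```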

Threat model.

The goal of an adversary is to fool a linear stochastic bandit algorithm into pulling the target arm $\tilde{x}\in\mathcal{A}$ for $T-o(T)$ times. Like most recent works in this space (Jun et al., 2018; Liu and Shroff, 2019; Garcelon et al., 2020; Zhang et al., 2020), we consider the widely studied poisoning attack on the rewards: after arm $a_t$ is pulled by the bandit algorithm, the adversary modifies the realized reward $r_{a_t}$ from the environment by $\Delta r_t$, i.e., $\tilde{r}_{a_t}=r_{a_t}+\Delta r_t$, and feeds the manipulated reward $\tilde{r}_{a_t}$ to the algorithm. Naturally, the adversary aims to achieve its attack goal with minimum attack cost, defined as $C(T)=\sum_{t=1}^{T}|\Delta r_t|$. By convention, an attack strategy is said to be efficient if it uses only a sublinear total cost, i.e., $C(T)=o(T)$.

We conclude the preliminaries with an important remark about a key difference between attacking linear stochastic bandits, studied in this paper, and attacking the $k$-armed linear contextual bandit setting recently studied in (Garcelon et al., 2020). In linear contextual bandits, all arms share a context vector at each round but each arm has its own (to-be-estimated) bandit parameter. Therefore, the reward manipulation at round $t$ only affects the parameter estimation of the pulled arm $a_t$, but not that of any other arm. This "isolates" the attack's impact across arms. In contrast, in linear stochastic bandits, all arms share the same bandit parameter but have different context vectors; manipulating the reward of any arm thus alters the shared parameter estimate, which in turn affects the reward estimates of all arms due to the correlation among their context vectors. Such a coupled effect of adversarial manipulation, propagating from the pulled arm $a_t$ to all other arms, is unique to linear stochastic bandits and makes the analysis of attacks much more challenging. This is also the fundamental reason why some linear stochastic bandit environments may not be attackable (see our illustration in Example 1).

3 The Attackability of A Linear Stochastic Bandit Environment

We study the attackability of a linear stochastic bandit environment. At first glance, one might wonder whether attackability is a property of a bandit algorithm rather than a property of the environment, since if an algorithm can be attacked, we should have "blamed" the algorithm for not being robust. A key finding of this work is that attackability is also a property of the learning environment; in other words, not all environments are attackable.

Definition 1 (Attackability of a $k$-Armed Linear Stochastic Bandit Environment).

A $k$-armed linear stochastic bandit environment $\langle\mathcal{A}=\{x_i\}_{i=1}^{k},\boldsymbol{\theta}^*\rangle$ is attackable with respect to the target arm $\tilde{x}\in\mathcal{A}$ if for any no-regret learning algorithm, there exists an attack method that uses $o(T)$ attack cost and fools the algorithm to pull arm $\tilde{x}$ at least $T-o(T)$ times with high probability (typically, probability $1-\delta$ for an arbitrarily small $\delta$; see the theorems later for rigorous statements) for any $T$ large enough, i.e., $T$ larger than a constant $T_0$.

We make a few remarks about the above definition of attackability. First, this definition concerns only the bandit environment $\langle\mathcal{A},\boldsymbol{\theta}^*\rangle$ and the target arm $\tilde{x}$, and is independent of any specific bandit algorithm. Second, if an attack method can only fool a bandit algorithm to pull the target arm $\tilde{x}$ under a few different time horizons $T$, it does not count as a successful attack; it has to succeed for infinitely many time horizons. Third, by reversing the order of quantifiers, we obtain the assertion that a bandit environment is not attackable w.r.t. the target arm $\tilde{x}$ if there exists some no-regret learning algorithm such that no attack method can use $o(T)$ attack cost to fool the algorithm into pulling arm $\tilde{x}$ at least $T-o(T)$ times with high probability for any $T$ large enough.

The following simple yet insightful example illustrates that there are indeed linear stochastic bandit environments in which some no-regret learning algorithm cannot be attacked.

Example 1 (An unattackable environment).

Figure 1 shows a three-arm environment with $\mathcal{A}=\{x_1=(0,1),x_2=(1,2),x_3=(-1,2)\}$. Suppose the target arm is $\tilde{x}=x_1$ and the ground-truth bandit parameter is $\boldsymbol{\theta}^*=(1,1)$ (one can re-scale all vectors to norms smaller than 1, e.g., divide each dimension by 10, without changing the conclusion that the environment is unattackable). The expected true rewards of the arms are $r_1^*=1$, $r_2^*=3$, $r_3^*=1$, and $x_2$ is the best arm in this environment.

Based on Definition 1, we need to identify a no-regret learning algorithm that cannot be attacked in this environment, and we argue that LinUCB is such an algorithm. Suppose, for the sake of contradiction, that there exists an efficient attack which fools LinUCB into pulling $x_1$ $T-o(T)$ times. LinUCB then must estimate $\boldsymbol{\theta}^*$ in $x_1$'s direction almost exactly as $T$ becomes large, since the $\Omega(T)$ amount of true reward feedback in the $x_1$ direction will ultimately dominate the $o(T)$ adversarial manipulations. This suggests that the estimated parameter $\hat{\boldsymbol{\theta}}_t$ will be close to $(\alpha,1)$ for some $\alpha$. Since $(\alpha,1)^{\mathsf{T}}(x_2+x_3)=4$, either $x_2$ or $x_3$ will have its estimated reward larger than $2$ (i.e., strictly larger than $x_1$'s estimated reward) for any $\alpha$. This shows that for large $T$, $x_1$ cannot be the arm with the highest UCB during the execution of LinUCB, which causes a contradiction. Therefore, this environment cannot be efficiently attacked with $o(T)$ cost. The argument above is an intuitive explanation of why this environment with target arm $\tilde{x}$ is not attackable; the formal proof is an instantiation of our Theorem 1.
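The arithmetic behind this argument can be checked directly; the snippet below is only a numerical illustration of the inequality used above.

```python
import numpy as np

x1, x2, x3 = np.array([0., 1.]), np.array([1., 2.]), np.array([-1., 2.])
theta_star = np.array([1., 1.])
print([float(x @ theta_star) for x in (x1, x2, x3)])   # true expected rewards: [1.0, 3.0, 1.0]

# Any estimate of the form (alpha, 1) agrees with theta* along x1's direction.  Since
# (alpha, 1)^T (x2 + x3) = 4 for every alpha, at least one of x2, x3 always has
# estimated reward > 2 > 1, i.e., strictly above x1's estimated reward.
for alpha in (-5.0, -2.0, 0.0, 2.0, 5.0):
    theta_hat = np.array([alpha, 1.0])
    assert max(x2 @ theta_hat, x3 @ theta_hat) > x1 @ theta_hat
```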

Figure 1: Illustration of attackability of a linear stochastic bandit environment.

Note that when $\mathcal{A}=\{x_1,x_2\}$, the environment becomes attackable: as shown in Figure 1, a feasible attack strategy is to perturb rewards according to $\tilde{\boldsymbol{\theta}}=(-2,1)$. The key idea is that in the null space of $x_1$, $\tilde{\boldsymbol{\theta}}_{\perp}$ reduces the reward of $x_2$ to make $x_1$ the best arm without changing the actual reward of $x_1$ from the environment. The presence of arm $x_3$ prevents the existence of such a $\tilde{\boldsymbol{\theta}}_{\perp}$ (and therefore $\tilde{\boldsymbol{\theta}}$) and makes the environment unattackable.

The above example motivates us to study when a linear stochastic bandit environment is attackable. After all, only when facing an unattackable environment can we ensure the existence of a no-regret algorithm that is robust to any $o(T)$ poisoning attack.

Next we show that there indeed exists a complete characterization of when a linear stochastic bandit environment is attackable. As Example 1 shows, the attackability of a bandit environment depends on the arm set $\mathcal{A}=\{x_i\}_{i=1}^{k}$, the target arm $\tilde{x}$, and the underlying bandit parameter $\boldsymbol{\theta}^*$. For clarity of presentation, in this section we shall always assume that the adversary knows exactly the ground-truth bandit parameter $\boldsymbol{\theta}^*$ and thus the true expected reward of each arm. This is also called the oracle attack in previous works (Jun et al., 2018; Rakhsha et al., 2020). In the next section, we will show that this assumption is not necessary: when the bandit environment is indeed attackable, we can design provably successful attacks even if the adversary does not know the underlying bandit parameter $\boldsymbol{\theta}^*$.

We need the following convenient notation to state our results. Let

$$\boldsymbol{\theta}^*_{\parallel}=\frac{\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*}{\lVert\tilde{x}\rVert^2_2}\tilde{x} \qquad (1)$$

denote the projection of the ground-truth bandit parameter $\boldsymbol{\theta}^*$ onto the target direction $\tilde{x}$. Since attackability depends on the target arm $\tilde{x}$ as well, we shall include the target arm $\tilde{x}$ as part of the bandit environment. The following theorem provides a clean characterization of the attackability of a linear stochastic bandit environment.

Theorem 1 (Characterization of Attackability).

A bandit environment $\langle\mathcal{A}=\{x_i\}_{i=1}^{k},\boldsymbol{\theta}^*,\tilde{x}\rangle$ is attackable if and only if the optimal objective $\epsilon^*$ of the following convex quadratic program (CQP) satisfies $\epsilon^*>0$:

$$\begin{array}{ll}\text{maximize} & \epsilon\\ \text{subject to} & \tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}\geq\epsilon+x_a^{\mathsf{T}}(\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}),\quad\text{for }x_a\neq\tilde{x},\\ & \tilde{x}^{\mathsf{T}}\tilde{\boldsymbol{\theta}}_{\perp}=0,\\ & \|\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}\|_2\leq 1,\end{array} \qquad (2)$$

where $\epsilon\in\mathbb{R}$ and $\tilde{\boldsymbol{\theta}}_{\perp}\in\mathbb{R}^d$ are the variables.
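Since CQP (2) is a small convex program, it can be solved with any off-the-shelf convex solver. The sketch below illustrates one way to do so; the use of cvxpy and the function name are our choices for illustration, not part of the paper. On the environment of Example 1 it returns a non-positive index, consistent with the discussion above.

```python
import numpy as np
import cvxpy as cp

def attackability_index(arms, theta_star, target):
    """Solve CQP (2); return (epsilon*, theta_perp) for the environment <arms, theta*, x_target>."""
    arms = np.asarray(arms, dtype=float)
    x_t = arms[target]
    theta_par = (x_t @ theta_star) / (x_t @ x_t) * x_t     # projection onto the target direction, Eq. (1)

    d = arms.shape[1]
    eps = cp.Variable()
    theta_perp = cp.Variable(d)
    constraints = [x_t @ theta_perp == 0,                  # theta_perp orthogonal to the target arm
                   cp.norm(theta_par + theta_perp, 2) <= 1]
    for a, x_a in enumerate(arms):
        if a != target:
            constraints.append(x_t @ theta_par >= eps + x_a @ (theta_par + theta_perp))
    cp.Problem(cp.Maximize(eps), constraints).solve()
    return eps.value, theta_perp.value
```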

Since the conceptual message of Theorem 1 significantly differs from previous studies on adversarial attacks in bandit algorithms, we would like to elaborate on its implications.

First of all, we point out, for the first time, that some learning environments are intrinsically robust. Even the vanilla LinUCB algorithm, as we will show in the proof of Theorem 1, cannot be efficiently attacked when the condition of CQP (2) is not satisfied, i.e., $\epsilon^*\leq 0$. Notably, although almost all previous works have focused on the vulnerability of bandit algorithms, e.g., by designing attacks against UCB and $\epsilon$-Greedy (Jun et al., 2018) or LinUCB (Garcelon et al., 2020), it just so happens that they were already studied under an attackable environment (see our Corollary 1). To the best of our knowledge, the problem of the intrinsic robustness of a linear bandit environment has not been studied before, and our result can be viewed as a complement to these previous works. Second, as we will show next, our proof techniques are also significantly different from existing ones, since what is central to our proof is to demonstrate that when the condition of CQP (2) fails, there exists a robust algorithm that cannot be efficiently attacked by any adversary. This can be viewed as analyzing the robustness of certain bandit algorithms when $\epsilon^*\leq 0$ in CQP (2).

Since CQP (2) and its solutions will show up very often in our later analysis, we provide the following definition for reference convenience.

Definition 2 (Attackability Index and Certificate).

The optimal objective $\epsilon^*$ of CQP (2) is called the attackability index, and the optimal solution $\tilde{\boldsymbol{\theta}}_{\perp}$ to CQP (2) is called the attackability certificate. (We sometimes omit "attackability" when it is clear from the context, and simply say index and certificate.)

We should note that both the index $\epsilon^*$ and the certificate $\tilde{\boldsymbol{\theta}}_{\perp}$ are intrinsic to the bandit environment $\langle\mathcal{A}=\{x_i\}_{i=1}^{k},\boldsymbol{\theta}^*,\tilde{x}\rangle$, and are irrelevant to whichever bandit algorithm is used. As we will see in the next section when designing attack algorithms without knowledge of $\boldsymbol{\theta}^*$, the index $\epsilon^*$ determines how difficult it is to attack the environment.

Algorithm 1 Oracle Null Space Attack
1:  Inputs: $T$, $\boldsymbol{\theta}^*$
2:  Initialize:
3:  if the optimal objective $\epsilon^*$ of CQP (2) is $>0$ then // Attackability test
4:     Find the optimal solution $\tilde{\boldsymbol{\theta}}_{\perp}$
5:     Set $\tilde{\boldsymbol{\theta}}=\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}$
6:  else
7:     return Not attackable
8:  end if
9:  for $t=1$ to $T$ do
10:     Bandit algorithm pulls arm $a_t$
11:     Attacker observes the corresponding reward $r_t=x_{a_t}^{\mathsf{T}}\boldsymbol{\theta}^*+\eta_t$ from the environment
12:     if $x_{a_t}\neq\tilde{x}$ then
13:        Set $\tilde{r}_t=x_{a_t}^{\mathsf{T}}\tilde{\boldsymbol{\theta}}+\tilde{\eta}_t$ // Attack
14:     else
15:        Set $\tilde{r}_t=r_t$
16:     end if
17:     Bandit algorithm observes modified reward $\tilde{r}_t$
18:  end for
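For concreteness, the following Python sketch mirrors Algorithm 1, reusing the LinearBanditEnv, LinUCB, and attackability_index sketches given earlier; the function signature and bookkeeping are ours, not the paper's.

```python
import numpy as np

def oracle_null_space_attack(env, bandit, arms, theta_star, target, T, noise_std=0.1, seed=1):
    """Sketch of Algorithm 1: feed rewards drawn from theta_tilde on non-target arms."""
    arms = np.asarray(arms, dtype=float)
    eps_star, theta_perp = attackability_index(arms, theta_star, target)
    if eps_star is None or eps_star <= 0:
        return None                                    # attackability test fails (line 7)
    x_t = arms[target]
    theta_par = (x_t @ theta_star) / (x_t @ x_t) * x_t
    theta_tilde = theta_par + theta_perp               # attack parameter (line 5)

    rng, cost, target_pulls = np.random.default_rng(seed), 0.0, 0
    for _ in range(T):
        a = bandit.select()                            # bandit pulls an arm (line 10)
        r = env.pull(a)                                # true reward observed by the attacker
        if a != target:                                # attack only non-target arms (line 13)
            r_tilde = arms[a] @ theta_tilde + rng.normal(0.0, noise_std)
            cost += abs(r_tilde - r)
        else:                                          # target arm left untouched (line 15)
            r_tilde, target_pulls = r, target_pulls + 1
        bandit.update(a, r_tilde)                      # bandit sees the modified reward
    return cost, target_pulls
```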
Proof of Theorem 1.

Proof of sufficiency. This direction is relatively straightforward. We show that there exists an efficient attack strategy when the condition of CQP (2) holds, i.e., $\epsilon^*>0$.

Suppose the attackability index $\epsilon^*>0$ and let $\tilde{\boldsymbol{\theta}}_{\perp}$ be a certificate. In Algorithm 1, we design the oracle null space attack based on knowledge of the bandit parameter $\boldsymbol{\theta}^*$. Let $\tilde{\boldsymbol{\theta}}=\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}$, where $\boldsymbol{\theta}^*_{\parallel}$ is defined in Eq. (1). The adversary perturbs the reward of any non-target arm $x_a\neq\tilde{x}$ to $\tilde{r}_{a,t}=x_a^{\mathsf{T}}\tilde{\boldsymbol{\theta}}+\tilde{\eta}_t$, whereas the reward of the target arm $\tilde{x}$ is not touched. In other words, the adversary misleads the algorithm to believe that $\tilde{\boldsymbol{\theta}}$ is the ground-truth parameter. We should note that both $\tilde{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}^*$ generate the same expected reward on $\tilde{x}$, i.e., $\tilde{x}^{\mathsf{T}}\tilde{\boldsymbol{\theta}}=\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}=\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*$. To make the attack appear less "suspicious", a sub-Gaussian noise term $\tilde{\eta}_t$ is added to the perturbed reward to make it stochastic. The key idea is that the attacker does not need to perturb the reward of the target arm, because the original reward equals the perturbed reward in expectation. Instead, the attacker only perturbs rewards in the null space of $\tilde{x}$ according to $\tilde{\boldsymbol{\theta}}_{\perp}$, which guarantees the cost-efficiency of the attack.

Since the perturbed rewards observed by the bandit algorithm are generated by $\tilde{\boldsymbol{\theta}}$, the target arm $\tilde{x}$ is the optimal arm in this manipulated environment because the attackability index $\epsilon^*$ is strictly positive. By definition, any no-regret bandit algorithm will pull suboptimal arms, i.e., the non-target arms, only $o(T)$ times and pull the target arm $T-o(T)$ times with high probability. Thus the attack is successful. Moreover, the cost of the oracle attack is $o(T)$, because the attacker only perturbs rewards on the non-target arms, which are pulled $o(T)$ times, and the cost of each attack is bounded by a constant (due to the bounded norms of $x_a$ and $\boldsymbol{\theta}^*$). Importantly, we note that this argument only relies on the definition of "no regret" and does not depend on what the algorithm is. This is crucial for proving the sufficiency of attackability.

Proof of necessity. This is the more difficult direction. Due to the space limit, we only provide a proof sketch here and leave the technical details to Appendix B. We shall prove that if $\epsilon^*\leq 0$, the bandit environment is not attackable. To do so, we need to identify at least one no-regret bandit algorithm such that no attack strategy can successfully attack it. We argue that even the vanilla LinUCB is already robust to any attack strategy with $o(T)$ cost when $\epsilon^*\leq 0$. Recall that LinUCB maintains an estimate $\hat{\boldsymbol{\theta}}_t$ at round $t$ using the attacked rewards $\{\tilde{r}_{1:t}\}$. We consider LinUCB with a choice of L2-regularization parameter $\lambda$ that guarantees $\|\hat{\boldsymbol{\theta}}_t\|_2<1$, in order to satisfy the last constraint of CQP (2). Consider the decomposition $\hat{\boldsymbol{\theta}}_t=\hat{\boldsymbol{\theta}}_{t,\parallel}+\hat{\boldsymbol{\theta}}_{t,\perp}$, where $\tilde{x}\perp\hat{\boldsymbol{\theta}}_{t,\perp}$ and $\tilde{x}\parallel\hat{\boldsymbol{\theta}}_{t,\parallel}$.

Suppose, for the sake of contradiction, that LinUCB is attackable when $\epsilon^*\leq 0$. According to Definition 1, the target arm $\tilde{x}$ will be pulled $T-o(T)$ times with high probability for infinitely many different time horizons $T$. Fix any large $T$; we know that $\tilde{x}$ must have the largest UCB score whenever it is pulled at some round $t\in[T]$, or formally, for any $x_a\neq\tilde{x}$ we must have:

$$\tilde{x}^{\mathsf{T}}\hat{\boldsymbol{\theta}}_{t,\parallel}+\text{CB}_t(\tilde{x})\geq x_a^{\mathsf{T}}\hat{\boldsymbol{\theta}}_{t,\parallel}+x_a^{\mathsf{T}}\hat{\boldsymbol{\theta}}_{t,\perp}+\text{CB}_t(x_a). \qquad (3)$$

By attackability, we know that the above inequality holds for infinitely many $t$. The main idea of our proof is that as $t\to\infty$, we have $\text{CB}_t(\tilde{x})\to 0$ while $\text{CB}_t(x_a)>0$. Moreover, the estimate $\hat{\boldsymbol{\theta}}_{t,\parallel}$ converges to $\boldsymbol{\theta}^*_{\parallel}$, since $\tilde{x}$ is pulled $t-o(t)$ times. The key challenge is to show that $\text{CB}_t(x_a)-\text{CB}_t(\tilde{x})$, which appears in Inequality (3), is strictly greater than 0 for all large $t$. To do so, we prove a $\Theta\big(\sqrt{\log(t/\delta)/o(t)}\big)$ lower bound for $\text{CB}_t(x_a)$ (Lemma 3 in Appendix B) and an $O\big(\sqrt{\log(t/\delta)/t}\big)$ upper bound for $\text{CB}_t(\tilde{x})$ (Lemma 2). The main technical barrier we overcome is the lower bound proof for the confidence bound term, which employs non-standard arguments, since most (if not all) analyses of bandit algorithms only need upper bounds on the confidence bound terms. For this reason, we believe this technical proof is of independent interest, particularly for the analysis of robustness properties of linear bandit algorithms.

By letting $t\to\infty$, we obtain the following condition:

$$\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}>x_a^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}+x_a^{\mathsf{T}}\hat{\boldsymbol{\theta}}_{t,\perp},\quad\forall x_a\neq\tilde{x}. \qquad (4)$$

This implies that for any sufficiently large $t$, there exists a $\hat{\boldsymbol{\theta}}_{t,\perp}$ that makes the optimal objective $\epsilon^*$ of CQP (2) positive. But this contradicts the starting assumption that $\epsilon^*\leq 0$; hence, the bandit environment is not attackable. ∎

We now provide an intuitive explanation of Theorem 1. CQP (2) seeks a $\tilde{\boldsymbol{\theta}}_{\perp}$ such that: 1) it is orthogonal to $\tilde{x}$ (hence its subscript); and 2) it maximizes the gap $\epsilon$ between $\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}$ and the largest $x_a^{\mathsf{T}}(\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp})$ among all $x_a\neq\tilde{x}$. Theorem 1 states that the bandit environment is attackable if and only if such a gap is strictly larger than 0, i.e., when the geometry of the context vectors allows the adversary to lower non-target arms' rewards in the null space of $\tilde{x}$. The constraint $\|\boldsymbol{\theta}^*_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}\|_2\leq 1$ ensures the attacked rewards are on the same scale as the unattacked rewards.

Recent works have shown that any no-regret algorithm for the context-free $k$-armed setting (where the arm set $\mathcal{A}$ is orthonormal) can always be attacked (Liu and Shroff, 2019). This finding turns out to be a corollary of Theorem 1.

Corollary 1.

For the standard stochastic multi-armed bandit setting where the arm set $\mathcal{A}$ is orthonormal, the environment $\langle\mathcal{A}=\{x_a\},\boldsymbol{\theta}^*,\tilde{x}\rangle$ is attackable for any target arm $\tilde{x}$.

Proof.

Since the arms are orthogonal to each other, it is easy to see that $\tilde{\boldsymbol{\theta}}_{\perp}=-C\big[\sum_{x_a:x_a\neq\tilde{x}}x_a\big]$ achieves objective value $C-\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}^*_{\parallel}$ in CQP (2). Choosing $C$ to be a large enough positive constant such that this objective value is positive gives us a feasible $\tilde{\boldsymbol{\theta}}_{\perp}$ for CQP (2), which yields the corollary. ∎

The intuition behind this corollary is that arms in context-free stochastic multi-armed bandits are independent, so an adversary can arbitrarily lower the rewards of non-target arms to make the target arm optimal. This is also the attack strategy in (Jun et al., 2018; Liu and Shroff, 2019). Garcelon et al. (2020) showed that a similar idea works for $k$-armed linear contextual bandits, where each arm is associated with its own unknown bandit parameter and the reward estimates are independent across arms.
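As a quick sanity check of Corollary 1, one can feed an orthonormal (context-free MAB) arm set to the attackability_index sketch above; in the small example below the returned index is positive, as the corollary predicts. The specific numbers are only an illustration.

```python
import numpy as np

# Orthonormal arms = context-free MAB; target arm 0 is suboptimal under theta*.
arms = np.eye(3)
theta_star = np.array([0.5, 0.9, 0.3])
eps_star, theta_perp = attackability_index(arms, theta_star, target=0)
print(eps_star)          # positive: the MAB environment is attackable w.r.t. arm 0
assert eps_star > 0
```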

We should point out an important distinction between poisoning attacks on $k$-armed bandits and another line of research on stochastic bandits under adversarial corruption, initiated by Lykouris et al. (2018). For the poisoning attacks considered in this paper, the adversary manipulates the realized reward after the algorithm selects an action, whereas in (Lykouris et al., 2018) the adversary manipulates the entire reward vector before the algorithm selects any action. Obviously, the latter threat model is strictly weaker and has led to various bandit algorithms that achieve sublinear regret so long as the total manipulation is sublinear in $T$ (Lykouris et al., 2018; Zimmert and Seldin, 2021).

4 Effective Attacks without Knowledge of True Model Parameters

In the previous section, we characterized the attackability of a linear stochastic bandit environment assuming knowledge of the ground-truth bandit parameter $\boldsymbol{\theta}^*$. We now show that such prior knowledge is not needed when designing practical attacks. Towards this end, we propose provably effective attacks against two representative bandit algorithms: the most well-known LinUCB, and a state-of-the-art robust linear stochastic bandit algorithm, Robust Phase Elimination (Bogunovic et al., 2021). We remark that the optimal attacks against these algorithms depend on the characteristics of the algorithms themselves and are generally different, due to their different levels of robustness. This also echoes the important message mentioned in the introduction, i.e., the attackability analysis often goes hand-in-hand with the understanding of the robustness of different algorithms, as reflected in various parts of our analysis. However, we point out that it is an intriguing open question whether there is a single attack strategy that can manipulate any no-regret algorithm in an attackable environment.

Two-stage Null Space Attack.

Our proposed attack method is presented in Algorithm 2. The method spends the first $T_1$ rounds as the first stage, attacking the rewards of all arms to imitate a bandit environment $\boldsymbol{\theta}_0$ in which $\tilde{x}$ is the best arm, so that arm $\tilde{x}$ will be pulled most often by the bandit algorithm. This stage lets the adversary observe the true rewards of $\tilde{x}$ from the environment, which helps it estimate the parameter $\boldsymbol{\theta}^*_{\parallel}$. At round $T_1$, the method uses the estimate of $\boldsymbol{\theta}^*_{\parallel}$, denoted as $\tilde{\boldsymbol{\theta}}_{\parallel}$, to perform the "attackability test" by solving CQP (2) with $\tilde{\boldsymbol{\theta}}_{\parallel}$ to obtain an estimated index $\tilde{\epsilon}^*$ and certificate $\tilde{\boldsymbol{\theta}}_{\perp}$. If $\tilde{\epsilon}^*>0$, the method asserts the environment is attackable and starts the second stage of the attack. From $T_1$ to $T$, the method perturbs the rewards of non-target arms to $\tilde{r}_t=x_{a_t}^{\mathsf{T}}(\tilde{\boldsymbol{\theta}}_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp})+\tilde{\eta}_t$ (just like the oracle attack, but using the estimated $\tilde{\boldsymbol{\theta}}_{\parallel}$). When the bandit algorithm pulls the target arm $\tilde{x}$ for the first time in the second stage, the method compensates the reward of $\tilde{x}$ as shown in line 20, where $n(\tilde{x})$ is the number of times the target arm was pulled before $T_1$. The goal is to correct the rewards on $\tilde{x}$ collected in the first stage to follow $\tilde{\boldsymbol{\theta}}$ instead of $\boldsymbol{\theta}_0$. Note that a larger $T_1$ brings in more observations on $\tilde{x}$ to improve the estimate of $\boldsymbol{\theta}^*_{\parallel}$, but it also means a higher attack cost. The optimal choice of $T_1$ depends on the "robustness" of the bandit algorithm being attacked. Consequently, it also leads to different attack costs for different algorithms. For example, as we will show next, the attack on Robust Phase Elimination is more costly than the attack on LinUCB.

Algorithm 2 Two-stage Null Space Attack
1:  Inputs: $T$, $T_1$
2:  $\boldsymbol{\theta}_0=\arg\max_{\|\boldsymbol{\theta}\|_2\leq 1}\big[\tilde{x}^{\mathsf{T}}\boldsymbol{\theta}-\max_{x_a\neq\tilde{x}}x_a^{\mathsf{T}}\boldsymbol{\theta}\big]$, let $\epsilon^*_0$ be its optimal objective
3:  if $\epsilon^*_0\leq 0$ then // Initial attackability test
4:     return Not attackable
5:  end if
6:  for $t=1$ to $T_1$ do
7:     Set $\tilde{r}_t=x_{a_t}^{\mathsf{T}}\boldsymbol{\theta}_0+\tilde{\eta}_t$ // Attack as if $\tilde{x}$ is the best
8:     Bandit algorithm observes modified reward $\tilde{r}_t$
9:  end for
10:  Estimate $\tilde{\boldsymbol{\theta}}_{\parallel}=\frac{\sum_{i=1}^{n(\tilde{x})}r_i(\tilde{x})}{n(\tilde{x})\|\tilde{x}\|_2^2}\tilde{x}$
11:  Solve CQP (2) using $\tilde{\boldsymbol{\theta}}_{\parallel}$ to obtain the estimated attackability index $\tilde{\epsilon}^*$ and certificate $\tilde{\boldsymbol{\theta}}_{\perp}$
12:  if $\tilde{\epsilon}^*\leq 0$ then // Attackability test
13:     return Not attackable
14:  else // Attack stage
15:     Set $\tilde{\boldsymbol{\theta}}=\tilde{\boldsymbol{\theta}}_{\parallel}+\tilde{\boldsymbol{\theta}}_{\perp}$
16:     for $t=T_1+1$ to $T$ do
17:        if $x_{a_t}\neq\tilde{x}$ then
18:           Set $\tilde{r}_t=x_{a_t}^{\mathsf{T}}\tilde{\boldsymbol{\theta}}+\tilde{\eta}_t$
19:        else if $x_{a_t}=\tilde{x}$ for the first time then
20:           Set $\tilde{r}_t=n(\tilde{x})\times\tilde{x}^{\mathsf{T}}(\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta}_0)+\tilde{x}^{\mathsf{T}}\tilde{\boldsymbol{\theta}}+\tilde{\eta}_t$
21:        else
22:           Set $\tilde{r}_t=r_t$
23:        end if
24:        Bandit algorithm observes modified reward $\tilde{r}_t$
25:     end for
26:  end if
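The sketch below mirrors Algorithm 2 in Python, reusing the attackability_index function from Section 3. The solver-based computation of $\boldsymbol{\theta}_0$ (line 2) via cvxpy and all names are our own illustration; error handling (e.g., the target arm never being pulled in the first stage) is omitted.

```python
import numpy as np
import cvxpy as cp

def two_stage_null_space_attack(env, bandit, arms, target, T, T1, noise_std=0.1, seed=2):
    """Sketch of Algorithm 2; env/bandit follow the interfaces sketched in Section 2."""
    arms = np.asarray(arms, dtype=float)
    x_t = arms[target]
    rng = np.random.default_rng(seed)

    # Line 2: theta_0 maximizing the target arm's gap over all other arms (a convex problem).
    theta = cp.Variable(arms.shape[1])
    others = np.delete(arms, target, axis=0)
    gap = cp.Problem(cp.Maximize(x_t @ theta - cp.max(others @ theta)), [cp.norm(theta, 2) <= 1])
    gap.solve()
    if gap.value is None or gap.value <= 0:
        return None                                    # initial attackability test fails (line 4)
    theta0 = theta.value

    # Stage 1 (lines 6-9): imitate theta_0 and record the target arm's true rewards.
    target_rewards, cost = [], 0.0
    for _ in range(T1):
        a = bandit.select()
        r = env.pull(a)
        if a == target:
            target_rewards.append(r)
        r_tilde = arms[a] @ theta0 + rng.normal(0.0, noise_std)
        cost += abs(r_tilde - r)
        bandit.update(a, r_tilde)

    # Lines 10-11: estimate theta_parallel and run the attackability test.
    n_t = len(target_rewards)
    theta_par_hat = (np.mean(target_rewards) / (x_t @ x_t)) * x_t
    eps_hat, theta_perp = attackability_index(arms, theta_par_hat, target)
    if eps_hat is None or eps_hat <= 0:
        return None                                    # asserted not attackable (line 13)
    theta_tilde = theta_par_hat + theta_perp

    # Stage 2 (lines 16-25): attack non-target arms; compensate the first target pull (line 20).
    compensated = False
    for _ in range(T1, T):
        a = bandit.select()
        r = env.pull(a)
        if a != target:
            r_tilde = arms[a] @ theta_tilde + rng.normal(0.0, noise_std)
        elif not compensated:
            r_tilde = n_t * (x_t @ (theta_tilde - theta0)) + x_t @ theta_tilde + rng.normal(0.0, noise_std)
            compensated = True
        else:
            r_tilde = r
        cost += abs(r_tilde - r)
        bandit.update(a, r_tilde)
    return cost
```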

Note that our attackability test might make both false positive and false negative assertions due to the estimation error in $\tilde{\boldsymbol{\theta}}_{\parallel}$. But as $T$ becomes larger, the estimate gets more accurate and the assertion is correct with high probability.

Remark 1.

We acknowledge that the rewards from the two stages follow different reward distributions and could be detected, e.g., using some homogeneity test (Li et al., 2021). Thus a bandit player could realize the attack if equipped with some change detector. However, attacking such a cautious bandit algorithm is beyond the scope of this paper. Moreover, it is very difficult (if not impossible) to attack with a stationary reward distribution or undetectable perturbations (e.g., using a fixed target parameter $\tilde{\boldsymbol{\theta}}$). We can easily find cases where the adversary's parameter $\tilde{\boldsymbol{\theta}}$ is limited to extremely few choices, and it is almost impossible to directly start the attack with a valid $\tilde{\boldsymbol{\theta}}$ without knowing $\boldsymbol{\theta}^*$. For example, if we change the third arm in Example 1 to $x_3=(-1+\epsilon,0)$ with a small $\epsilon$, the valid parameters lie only in a small range around $\tilde{\boldsymbol{\theta}}=(-1-\epsilon,1)$. Therefore, in order to attack with a stationary reward distribution, the adversary needs to start from somewhere very close to $\tilde{\boldsymbol{\theta}}=(-1-\epsilon,1)$, which we believe is extremely difficult without knowing $\boldsymbol{\theta}^*$. Overall, designing an attack strategy against a bandit algorithm with a reward change detector, or showing the inability to attack such cautious algorithms, is an important direction for our future work.

Attack against LinUCB.

We now show how the LinUCB algorithm can be attacked by Algorithm 2.

Theorem 2.

For a large enough $T_1$, the attack strategy in Algorithm 2 will correctly assert attackability with probability at least $1-\delta$. Moreover, when the environment is attackable, with probability at least $1-3\delta$ the attack strategy will fool LinUCB into pulling non-target arms at most

$$O\Big(d\big(\sqrt{\log(T/\delta)}+\sqrt{T_1}\log(T_1/\delta)+\sqrt{T\log(1/\delta)}/\sqrt{T_1}\big)\sqrt{T\log(T/\delta)}/\epsilon^*\Big)$$

rounds. And with probability at least $1-4\delta$, the adversary's cost is at most

$$O\Big(T_1+d\big(\sqrt{\log(T/\delta)}+\sqrt{T_1}\log(T_1/\delta)+\sqrt{T\log(1/\delta)}/\sqrt{T_1}\big)\sqrt{T\log(T/\delta)}/\epsilon^*\Big).$$

Specifically, setting $T_1=T^{1/2}$ gives the minimum attack cost, of order $\tilde{O}(T^{3/4})$, and the non-target arms are pulled at most $\tilde{O}(T^{3/4})$ rounds.

Proof Sketch.

To prove that the assertion is correct with high probability, the key step is to show that the estimate $\tilde{\boldsymbol{\theta}}_{\parallel}$ is close to the true parameter $\boldsymbol{\theta}^*_{\parallel}$. We first note that in the first stage, the bandit algorithm will pull the target arm $\tilde{x}$ for $T_1-O(\sqrt{T_1})$ times, since $\tilde{x}$ is the best arm according to $\boldsymbol{\theta}_0$. By Hoeffding's inequality, the estimation error satisfies $\|\tilde{\boldsymbol{\theta}}_{\parallel}-\boldsymbol{\theta}^*_{\parallel}\|_2\leq\sqrt{\frac{2\log(2/\delta)}{T_1-O(\sqrt{T_1})}}$. Therefore, with a large enough $T_1$, the error in $\tilde{x}$'s reward estimate is smaller than $\epsilon^*$. Thus, solving CQP (2) with $\tilde{\boldsymbol{\theta}}_{\parallel}$, we correctly assert attackability with a positive estimated index $\tilde{\epsilon}^*$ whenever the environment is attackable with index $\epsilon^*$.

To prove the success and the cost of the attack, we need to analyze the behavior of LinUCB under the reward discrepancy between the two stages. Our proof crucially hinges on the following robustness result of LinUCB.

Lemma 1.

Consider LinUCB with ridge regression under a poisoning attack. Let $S'_t=\sum_{\tau\in\{1\dots t\},\,x_{a_\tau}\neq\tilde{x}}|\Delta_\tau|$ be the total corruption on non-target arms up to time $t$, and assume every corruption on the target arm is bounded by $\gamma$. For any $t=1\dots T$, with probability at least $1-\delta$ we have

$$\|\tilde{\boldsymbol{\theta}}-\hat{\boldsymbol{\theta}}_t\|_{\mathbf{A}_t}\leq\alpha_t+S'_t/\sqrt{\lambda}+\gamma\sqrt{t} \qquad (5)$$

where $\alpha_t=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}$.

Based on this lemma, we can derive the regret $R_T(\tilde{\boldsymbol{\theta}})$ of LinUCB with $\tilde{\boldsymbol{\theta}}$ as the true parameter. The total corruption on non-target arms is $O\big(d\sqrt{T_1}\log(T_1/\delta)\big)$, given that the rewards are generated by $\boldsymbol{\theta}_0$ (the rewards of the target arm in the first stage are compensated in line 20). Because the target arm's rewards are not attacked in the second stage and follow $\boldsymbol{\theta}^*$, we have $\gamma=\tilde{O}(1/\sqrt{T_1})$. Since the non-target arms are pulled at most $R_T(\tilde{\boldsymbol{\theta}})/\epsilon^*$ rounds, substituting the total corruption back yields the claimed bound.

The attack cost has two sources: attacks in the first stage, for $T_1$ rounds, and attacks on the non-target arms in the second stage. The second term has the same order as the number of rounds in which the non-target arms are pulled by LinUCB. Each attack cost can be decomposed into 1) the change of mean reward $|x_a^{\mathsf{T}}(\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta}^*)|$, and 2) the sub-Gaussian noise $|\tilde{\eta}_t|$, the sum of which increases linearly with high probability. Setting $T_1=T^{1/2}$ gives a total cost of $\tilde{O}(T^{3/4})$. ∎

Remark 2.

Lemma 1 shows that LinUCB still enjoys sublinear regret for any corruption amount $S=o(\sqrt{T})$. This tolerance of $o(\sqrt{T})$ corruption turns out to be the same as that of the recently proposed robust linear contextual bandit algorithm based on phase elimination in (Bogunovic et al., 2021) (which we examine next). However, the regret term $S\sqrt{T}$ of LinUCB has a worse dependence on $S$ within the $S=o(\sqrt{T})$ regime compared to the $O(S^2)$ regret dependence of the robust algorithm in (Bogunovic et al., 2021).

Attack against Robust Phase Elimination.

We now show that Robust Phase Elimination (RobustPhE) (Bogunovic et al., 2021) can also be attacked by Algorithm 2. Compared to attacking LinUCB, the robustness of RobustPhE makes the first-stage attack more challenging, as the attack cost is more sensitive to the length of this stage.

Corollary 2.

For any large enough $T_1$, the attack strategy in Algorithm 2 will correctly assert attackability with high probability. Moreover, when the environment is attackable, with probability at least $1-2\delta$ the attack strategy will fool RobustPhE into pulling non-target arms at most

$$O\Big(\big(d\sqrt{T}\log(T/\delta)+\sqrt{d}\,T\log(T)\log(1/\delta)/\sqrt{T_1}+T_1^2\big)/\epsilon^*\Big)$$

rounds. And with probability at least $1-3\delta$, the adversary's cost is at most

$$O\Big(T_1+\big(d\sqrt{T}\log(T/\delta)+\sqrt{d}\,T\log(T)\log(1/\delta)/\sqrt{T_1}+T_1^2\big)/\epsilon^*\Big).$$

Specifically, setting $T_1=T^{2/5}$ gives the minimum attack cost, of order $\tilde{O}(T^{4/5})$, and the non-target arms are pulled at most $\tilde{O}(T^{4/5})$ rounds.

RobustPhE has an additional regret term $O(S^2)$ for total corruption $S$ (assuming $S$ is unknown to the bandit algorithm). If we view the second-stage attack model $\tilde{\boldsymbol{\theta}}$ as the underlying environment bandit model, the rewards generated by $\boldsymbol{\theta}_0$ in the first stage should be considered corrupted rewards. The $T_1$ amount of rewards from the first stage means $T_1$ amount of corruption, which leads to the additional $T_1^2$ term compared with the results in Theorem 2. Hence, the adversary can only run fewer iterations in the first stage but must spend more attack cost there. On the other hand, this also simplifies the design of the attack, in that lines 19-20 in Algorithm 2 are not necessary: the corruption in the first stage can be handled by the robustness of RobustPhE. The unattacked rewards in the second stage are viewed as misspecification from $\tilde{\boldsymbol{\theta}}$ with error $\gamma$, which leads to the $\tilde{O}(\gamma T)$ term (the second term) in the bound. Our success in attacking RobustPhE does not violate the robustness claim in the original paper (Bogunovic et al., 2021): RobustPhE tolerates $o(\sqrt{T})$ corruption, while our attack uses $\tilde{O}(T^{4/5})$ cost.

5 Experiments

Figure 2: Total cost and target arm pulls under different attack methods: (a) total cost of attack; (b) target arm pulls. We report the average cost and variance over 10 runs.

We use simulation-based experiments to validate the effectiveness and cost-efficiency of our proposed attack methods. In our simulations, we generate an arm pool $\mathcal{A}$ of size $k$, in which each arm $a$ is associated with a context vector $x_a\in\mathbb{R}^d$. Each dimension of $x_a$ is drawn from a zero-mean Gaussian distribution whose variance is sampled from a uniform distribution $U(0,1)$. Each $x_a$ is then normalized so that $\|x_a\|_2=1$. The bandit model parameter $\boldsymbol{\theta}^*$ is sampled from $N(0,1)$ and normalized so that $\|\boldsymbol{\theta}^*\|_2=1$. We set $d=10$, the standard deviation $\sigma$ of the Gaussian noise $\eta_t$ and $\tilde{\eta}_t$ to 0.1, and the arm pool size $k=30$ in our simulations. We run the experiment for $T=10{,}000$ iterations. To create an attackable environment, we re-sample the environment $\langle\mathcal{A},\boldsymbol{\theta}^*,\tilde{x}\rangle$ until it is attackable, following Theorem 1.
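A sketch of this environment-generation procedure is given below; it reuses the attackability_index function from Section 3 to implement the re-sampling step. The function name, the fixed target arm index, and the seed handling are our own assumptions for illustration.

```python
import numpy as np

def sample_attackable_environment(k=30, d=10, target=0, seed=0):
    """Resample <arms, theta*> until the attackability index of CQP (2) w.r.t. the target is positive."""
    rng = np.random.default_rng(seed)
    while True:
        variances = rng.uniform(0.0, 1.0, size=d)                  # per-dimension variances from U(0,1)
        arms = rng.normal(0.0, np.sqrt(variances), size=(k, d))    # zero-mean Gaussian features
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)        # normalize so ||x_a||_2 = 1
        theta_star = rng.normal(0.0, 1.0, size=d)
        theta_star /= np.linalg.norm(theta_star)                   # normalize so ||theta*||_2 = 1
        eps_star, _ = attackability_index(arms, theta_star, target)
        if eps_star is not None and eps_star > 0:
            return arms, theta_star
```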

We compare the two proposed attack methods, the oracle null space attack and the two-stage null space attack, against LinUCB (Li et al., 2010) and Robust Phase Elimination (RobustPhE) (Bogunovic et al., 2021). We report average results over 10 runs, where in each run we randomly sample an attackable environment. Both the oracle attack and the two-stage attack effectively fool the two bandit algorithms into pulling the target arm a linear number of times, as shown in Figure 2(b), and the total cost of the attack is shown in Figure 2(a). We observe that both attack methods are cost-efficient with a sublinear total cost, while the two-stage attack requires a higher attack cost when attacking the same bandit algorithm. Specifically, we notice that the adversary spends an almost linear cost in the first stage. This is because in the first stage the adversary attacks according to the parameter $\boldsymbol{\theta}_0$, which leads to an almost constant cost at every round; this investment helps the adversary estimate the bandit model parameter in order to construct the target parameter $\tilde{\boldsymbol{\theta}}$. In the second stage, the cost becomes much smaller since the adversary only attacks the non-target arms. We also notice that for the same attack method, attacking RobustPhE requires a higher cost and results in fewer target arm pulls compared with attacking LinUCB, due to the robustness of the algorithm.

6 Related Work

Adversarial attacks on bandit algorithms were first studied in the stochastic multi-armed bandit setting (Jun et al., 2018; Liu and Shroff, 2019) and recently in linear contextual bandits (Garcelon et al., 2020). These works share a similar attack idea: lowering the rewards of non-target arms while not modifying the reward of the target arm. However, as our attackability analysis revealed, this idea can fail in a linear stochastic bandit environment, where one cannot lower the rewards of non-target arms without modifying the reward of the target arm, due to their correlation. This insight is a key reason for the existence of unattackable environments. Ma et al. (2018) also considered the attackability of linear bandits, but under the setting of offline data poisoning attacks, where the adversary has the power to modify the rewards in the history. There are also several recent works on reward poisoning attacks against reinforcement learning (Yu and Sra, 2019; Zhang et al., 2020; Rakhsha et al., 2021, 2020), but with a quite different focus from ours. Besides reward poisoning attacks, recent works have also studied other threat models such as action poisoning attacks (Liu and Lai, 2020, 2021).

A parallel line of work focuses on improving the robustness of bandit algorithms. Lykouris et al. (2018) proposed a robust MAB algorithm, and Gupta et al. (2019) further improved the solution with an additive regret dependency on the attack cost. Zimmert and Seldin (2021) and Masoudian and Seldin (2021) proposed best-of-both-worlds solutions for both stochastic and adversarial bandits, which also address stochastic bandits with adversarial corruption. Ito (2021) further proposed an optimally robust algorithm under adversarial corruption. These works assume a weaker, oblivious adversary who determines the manipulation before the bandit algorithm pulls an arm. Hajiesmaili et al. (2020) studied robust adversarial bandit algorithms. Guan et al. (2020) proposed a median-based robust bandit algorithm for probabilistic unbounded attacks. Bogunovic et al. (2021) proposed the robust phase elimination algorithm for linear stochastic bandits under a stronger adversary (same as ours), which tolerates $o(\sqrt{T})$ corruption when the total corruption is unknown to the algorithm. We showed that our two-stage null space attack can effectively attack this algorithm with $\tilde{O}(T^{4/5})$ cost. Recently, Zhao et al. (2021) proposed an OFUL-style robust algorithm that can handle infinite action sets, but only tolerates $o(T^{1/4})$ corruption.

7 Conclusion

In this paper, we studied the problem of poisoning attacks on $k$-armed linear stochastic bandits. Different from context-free stochastic bandits and $k$-armed linear contextual bandits, where the environment is always attackable, we showed that some linear stochastic bandit environments are not attackable due to the correlation among arms. We characterized the attackability condition as the feasibility of a CQP based on the geometry of the arms' context vectors. Our key insight is that given the ground-truth parameter $\boldsymbol{\theta}^*$, the adversary should perform an oracle attack that lowers the rewards of non-target arms in the null space of the target arm's context vector $\tilde{x}$. Based on this insight, we proposed a two-stage null space attack that works without knowledge of $\boldsymbol{\theta}^*$ and applied it against LinUCB and Robust Phase Elimination. We showed that the proposed attack methods are effective and cost-efficient, both theoretically and empirically.

As future work, it is interesting to study the lower bound of attack cost in linear stochastic bandits and to design a cost-optimal attack method with a matching upper bound. One idea is to design a multi-stage attack method following the doubling trick, which is briefly discussed in Appendix C.3. Although its analysis could be much more challenging than our two-stage attack, it may lead to a lower attack cost as well as handle an unknown time horizon $T$. Another intriguing direction is to study algorithm-oblivious choices of the length of the first stage $T_1$, or more generally, algorithm-oblivious attack strategies that can achieve sublinear cost for an arbitrary no-regret algorithm without knowledge of $\boldsymbol{\theta}^*$.

Acknowledgements

This work was supported in part by National Science Foundation Grant IIS-2128019, IIS-1618948, IIS-1553568, Bloomberg Data Science Ph.D. Fellowship, and a Google Faculty Research Award.

References

  • Abbasi-yadkori et al., (2011) Abbasi-yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320.
  • Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422.
  • Bogunovic et al., (2021) Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. (2021). Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR.
  • Chapelle and Li, (2011) Chapelle, O. and Li, L. (2011). An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257.
  • Durand et al., (2018) Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. (2018). Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82.
  • Garcelon et al., (2020) Garcelon, E., Roziere, B., Meunier, L., Tarbouriech, J., Teytaud, O., Lazaric, A., and Pirotta, M. (2020). Adversarial attacks on linear contextual bandits. Advances in Neural Information Processing Systems, 33.
  • Guan et al., (2020) Guan, Z., Ji, K., Bucci Jr, D. J., Hu, T. Y., Palombo, J., Liston, M., and Liang, Y. (2020). Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4036–4043.
  • Gupta et al., (2019) Gupta, A., Koren, T., and Talwar, K. (2019). Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR.
  • Hajiesmaili et al., (2020) Hajiesmaili, M., Talebi, M. S., Lui, J., Wong, W. S., et al. (2020). Adversarial bandits with corruptions: Regret lower bound and no-regret algorithm. Advances in Neural Information Processing Systems, 33.
  • Ito, (2021) Ito, S. (2021). On optimal robustness to adversarial corruption in online decision problems. Advances in Neural Information Processing Systems, 34:7409–7420.
  • Jun et al., (2018) Jun, K.-S., Li, L., Ma, Y., and Zhu, J. (2018). Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems, pages 3640–3649.
  • Lattimore et al., (2020) Lattimore, T., Szepesvari, C., and Weisz, G. (2020). Learning with good feature representations in bandits and in rl with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR.
  • Li et al., (2021) Li, C., Wu, Q., and Wang, H. (2021). Unifying clustered and non-stationary bandits. In International Conference on Artificial Intelligence and Statistics, pages 1063–1071. PMLR.
  • Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM.
  • Liu and Shroff, (2019) Liu, F. and Shroff, N. (2019). Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042–4050.
  • Liu and Lai, (2020) Liu, G. and Lai, L. (2020). Action-manipulation attacks on stochastic bandits. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3112–3116. IEEE.
  • Liu and Lai, (2021) Liu, G. and Lai, L. (2021). Provably efficient black-box action poisoning attacks against reinforcement learning. Advances in Neural Information Processing Systems, 34.
  • Lykouris et al., (2018) Lykouris, T., Mirrokni, V., and Paes Leme, R. (2018). Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122. ACM.
  • Ma et al., (2018) Ma, Y., Jun, K.-S., Li, L., and Zhu, X. (2018). Data poisoning attacks in contextual bandits. In International Conference on Decision and Game Theory for Security, pages 186–204. Springer.
  • Masoudian and Seldin, (2021) Masoudian, S. and Seldin, Y. (2021). Improved analysis of the tsallis-inf algorithm in stochastically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 3330–3350. PMLR.
  • Rakhsha et al., (2020) Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and Singla, A. (2020). Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning. In International Conference on Machine Learning, pages 7974–7984. PMLR.
  • Rakhsha et al., (2021) Rakhsha, A., Zhang, X., Zhu, X., and Singla, A. (2021). Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments. arXiv preprint arXiv:2102.08492.
  • Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Yu and Sra, (2019) Yu, T. and Sra, S. (2019). Efficient policy learning for non-stationary mdps under adversarial manipulation. arXiv preprint arXiv:1907.09350.
  • Zhang et al., (2020) Zhang, X., Ma, Y., Singla, A., and Zhu, X. (2020). Adaptive reward-poisoning attacks against reinforcement learning. arXiv preprint arXiv:2003.12613.
  • Zhao et al., (2021) Zhao, H., Zhou, D., and Gu, Q. (2021). Linear contextual bandits with adversarial corruptions. arXiv preprint arXiv:2110.12615.
  • Zimmert and Seldin, (2021) Zimmert, J. and Seldin, Y. (2021). Tsallis-inf: An optimal algorithm for stochastic and adversarial bandits. J. Mach. Learn. Res., 22:28–1.

Appendix A Notations

For clarity, we collect the notations used in the paper below.

x~\tilde{x}: Context vector of the target arm
xax_{a}: Context vector of arm aa
𝜽\mathrm{\boldsymbol{\theta}}^{*}: Unknown bandit model parameter of the environment
𝜽\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}: Projection of 𝜽\mathrm{\boldsymbol{\theta}}^{*} onto x~\tilde{x}
rtr_{t}: Unattacked reward feedback at time tt
ηt\eta_{t}: Sub-Gaussian noise in the reward, i.e., rt=xt𝖳𝜽+ηtr_{t}=x_{t}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}+\eta_{t}
r~t\tilde{r}_{t}: Attacked reward
𝜽^t\hat{\mathrm{\boldsymbol{\theta}}}_{t}: Parameter estimated by the victim bandit algorithm with attacked rewards {r~1:t}\{\tilde{r}_{1:t}\}
𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}: Parameter parallel to x~\tilde{x}, estimated by the adversary with unattacked rewards
𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}: Parameter of the adversary’s attack strategy
𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\perp}: Attackability certificate, the parameter orthogonal to x~\tilde{x} solved from CQP (2)
ϵ\epsilon^{*}: Attackability index, the optimal objective of CQP (2)
ϵ~\tilde{\epsilon}^{*}: Estimated index, the optimal objective of CQP (2) with 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} replacing 𝜽\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}

Appendix B Details on Attackability of Linear Stochastic Bandits

B.1 Necessity Proof of Theorem 1

To prove its necessity, we will rely on the following results.

Lemma 2.

Suppose arm xx is pulled nn times till round tt by LinUCB. Its confidence bound CBt(x)\text{CB}_{t}(x) satisfies

CBt(x)αtn.\text{CB}_{t}(x)\leq\frac{\alpha_{t}}{\sqrt{n}}. (6)

with probability at least 1δ1-\delta, where αt=dlog(1+t/λδ)+λ\alpha_{t}=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}. Furthermore, we have

CBt(x)O(log(t/δ)n)\text{CB}_{t}(x)\leq O\left(\sqrt{\frac{\log(t/\delta)}{n}}\right) (7)

with probability at least 1δ1-\delta.

Proof.

In Abbasi-yadkori et al., (2011), the exploration bonus term is computed as CBt(x)=αtx𝐀t1\text{CB}_{t}(x)=\alpha_{t}\lVert x\rVert_{\mathbf{A}_{t}^{-1}}. Denote 𝐀t=n×xx𝖳+λ𝐈\mathbf{A}^{\prime}_{t}=n\times xx^{\mathsf{T}}+\lambda\mathbf{I}. Since 𝐀t=i=1txaixai𝖳+λ𝐈\mathbf{A}_{t}=\sum_{i=1}^{t}x_{a_{i}}x_{a_{i}}^{\mathsf{T}}+\lambda\mathbf{I}, we have 𝐀t𝐀t\mathbf{A}_{t}\succeq\mathbf{A}^{\prime}_{t}. We can thus bound x𝐀t1\lVert x\rVert_{\mathbf{A}_{t}^{-1}} by

x𝐀t1x𝐀t11n\lVert x\rVert_{\mathbf{A}_{t}^{-1}}\leq\lVert x\rVert_{{\mathbf{A}^{\prime}_{t}}^{-1}}\leq\frac{1}{\sqrt{n}} (8)

According to Theorem 2 in  Abbasi-yadkori et al., (2011),

αt=dlog(1+t/λδ)+λ=Θ(log(t/δ)).\alpha_{t}=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}=\Theta(\sqrt{\log(t/\delta)}). (9)

Combining αt\alpha_{t} and (8) finishes the proof. ∎
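As a quick numerical sanity check of the bound in Eq (6), the following minimal Python sketch verifies that ||x||_{A_t^{-1}} stays below 1/\sqrt{n} when arm x has been pulled n times; the dimension, pull counts, and arm vectors below are made-up values for illustration only.

import numpy as np

# Illustrative check of Eq (6); all quantities below are hypothetical, not from the paper.
rng = np.random.default_rng(0)
d, lam, n = 5, 1.0, 200
x = rng.normal(size=d)
x /= np.linalg.norm(x)                               # unit-norm arm vector
others = rng.normal(size=(50, d))                    # 50 pulls of arbitrary other arms
A = n * np.outer(x, x) + others.T @ others + lam * np.eye(d)
weighted_norm = np.sqrt(x @ np.linalg.solve(A, x))   # ||x||_{A_t^{-1}}
print(weighted_norm, 1.0 / np.sqrt(n))               # the first value should not exceed the second

Multiplying this weighted norm by \alpha_t recovers the confidence bound CB_t(x) used above.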

Claim 1.

Target arm x~\tilde{x} is pulled n=To(T)n=T-o(T) times till round TT. According to Lemma 2, we have

CBt(x~)O(log(T/δ)To(T))\text{CB}_{t}(\tilde{x})\leq O\left(\sqrt{\frac{\log(T/\delta)}{T-o(T)}}\right) (10)
Lemma 3.

Suppose arm xx is pulled tmt-m times till round tt by LinUCB, and the other arms are pulled mm times in total. The confidence bound CBt(xa)\text{CB}_{t}(x_{a}) of any arm xax_{a} that is not parallel to xx satisfies

CBt(xa)αt(1m+λbtm+λ)\text{CB}_{t}(x_{a})\geq\alpha_{t}\left(\frac{1}{\sqrt{m+\lambda}}-\frac{b}{\sqrt{t-m+\lambda}}\right) (11)

with probability at least 1δ1-\delta, where αt=dlog(1+t/λδ)+λS\alpha_{t}=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}S and constant b=xa𝖳xx𝖳xb=\frac{x_{a}^{\mathsf{T}}x}{x^{\mathsf{T}}x}. Furthermore, we have

CBt(xa)Θ(log(t/δ)(1m1tm))\text{CB}_{t}(x_{a})\geq\Theta\left(\sqrt{\log(t/\delta)}\left(\frac{1}{\sqrt{m}}-\frac{1}{\sqrt{t-m}}\right)\right) (12)

with probability at least 1δ1-\delta.

Proof.

Since CBt(x)=αtx𝐀t1\text{CB}_{t}(x)=\alpha_{t}\lVert x\rVert_{\mathbf{A}_{t}^{-1}}, we need to show a lower bound of xa𝐀t1\lVert x_{a}\rVert_{\mathbf{A}_{t}^{-1}}. Since xaxx_{a}\nparallel x, we decompose xa=xa+xax_{a}=x_{a}^{\parallel}+x_{a}^{\perp}, where xaxx_{a}^{\parallel}\parallel x. By the reverse triangle inequality we have

xa𝐀t1\displaystyle\lVert x_{a}\rVert_{\mathbf{A}_{t}^{-1}} xa𝐀t1xa𝐀t1\displaystyle\geq\lVert x_{a}^{\perp}\rVert_{\mathbf{A}_{t}^{-1}}-\lVert x_{a}^{\parallel}\rVert_{\mathbf{A}_{t}^{-1}} (13)

First we analyze the term xa𝐀t1\lVert x_{a}^{\perp}\rVert_{\mathbf{A}_{t}^{-1}}. Decompose 𝐀t=(tm)×xx𝖳+i,xaixxaixai𝖳+λ𝐈\mathbf{A}_{t}=(t-m)\times xx^{\mathsf{T}}+\sum_{i,x_{a_{i}}\neq x}x_{a_{i}}x_{a_{i}}^{\mathsf{T}}+\lambda\mathbf{I} and let 𝐀t=i,xaixxaixai𝖳+λ𝐈\mathbf{A}^{\prime}_{t}=\sum_{i,x_{a_{i}}\neq x}x_{a_{i}}x_{a_{i}}^{\mathsf{T}}+\lambda\mathbf{I}. Since xaxx_{a}^{\perp}\perp x, we have

xa𝖳𝐀txa=xa𝖳𝐀txa.{x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}_{t}x_{a}^{\perp}={x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}^{\prime}_{t}x_{a}^{\perp}.

There are at most mm terms in the summation of 𝐀t=i,xaixxaixai𝖳+λI\mathbf{A}^{\prime}_{t}=\sum_{i,x_{a_{i}}\neq x}x_{a_{i}}x_{a_{i}}^{\mathsf{T}}+\lambda I, thus

xa𝖳𝐀txaxa𝖳(mxa22×xaxa𝖳+λ𝐈)xam+λ{x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}^{\prime}_{t}x_{a}^{\perp}\leq{x_{a}^{\perp}}^{\mathsf{T}}\left(\frac{m}{\|{x_{a}^{\perp}}\|_{2}^{2}}\times{x_{a}^{\perp}}{x_{a}^{\perp}}^{\mathsf{T}}+\lambda\mathbf{I}\right)x_{a}^{\perp}\leq m+\lambda

Then we have

xa𝐀t1=xa𝖳𝐀t1xa=xa𝖳𝐀t1xa=1xa𝖳𝐀txa1m+λ\lVert x_{a}^{\perp}\rVert_{\mathbf{A}_{t}^{-1}}=\sqrt{{x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}^{-1}_{t}x_{a}^{\perp}}=\sqrt{{x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}^{\prime-1}_{t}x_{a}^{\perp}}=\frac{1}{\sqrt{{x_{a}^{\perp}}^{\mathsf{T}}\mathbf{A}^{\prime}_{t}x_{a}^{\perp}}}\geq\frac{1}{\sqrt{m+\lambda}} (14)

Similar to Eq (8), we have

xa𝐀t1xa2x21tm+λ\lVert x_{a}^{\parallel}\rVert_{\mathbf{A}_{t}^{-1}}\leq\frac{\lVert x_{a}^{\parallel}\rVert_{2}}{\lVert x\rVert_{2}}\frac{1}{\sqrt{t-m+\lambda}} (15)

Let constant b=xa2x2=xa𝖳xx𝖳xb=\frac{\lVert x_{a}^{\parallel}\rVert_{2}}{\lVert x\rVert_{2}}=\frac{x_{a}^{\mathsf{T}}x}{x^{\mathsf{T}}x}. Substitute the terms and we have

CBt(xa)=αtxa𝐀t1αt(xa𝐀t1xa𝐀t1)αt(1m+λbtm+λ).\displaystyle\text{CB}_{t}(x_{a})=\alpha_{t}\lVert x_{a}\rVert_{\mathbf{A}_{t}^{-1}}\geq\alpha_{t}\left(\lVert x_{a}^{\perp}\rVert_{\mathbf{A}_{t}^{-1}}-\lVert x_{a}^{\parallel}\rVert_{\mathbf{A}_{t}^{-1}}\right)\geq\alpha_{t}\left(\frac{1}{\sqrt{m+\lambda}}-\frac{b}{\sqrt{t-m+\lambda}}\right). (16) ∎

Claim 2.

Non-target arms are pulled m=o(T)m=o(T) times till round TT. According to Lemma 3, any arm xax~x_{a}\nparallel\tilde{x} satisfies

CBt(xa)Θ(log(T/δ)(1o(T)1To(T)))\text{CB}_{t}(x_{a})\geq\Theta\left(\sqrt{\log(T/\delta)}\left(\frac{1}{\sqrt{o(T)}}-\frac{1}{\sqrt{T-o(T)}}\right)\right) (17)

with probability at least 1δ1-\delta.

Lemma 4.

Suppose the non-target arms {xax~}\{x_{a}\neq\tilde{x}\} are pulled o(T)o(T) times, the arm x~\tilde{x} is pulled To(T)T-o(T) times, and the total manipulation is CTC_{T}. With probability at least 1δ1-\delta, reward estimation error satisfies

|x𝖳𝜽^T,x𝖳𝜽|CTTo(T)+αtTo(T),x𝒜.\left|x^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}-x^{\mathsf{T}}{\mathrm{\boldsymbol{\theta}}}^{*}_{\parallel}\right|\leq\frac{C_{T}}{T-o(T)}+\frac{\alpha_{t}}{\sqrt{T-o(T)}},\qquad\forall x\in\mathcal{A}. (18)
Proof.
𝜽^T,𝜽2\displaystyle\|\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\|_{2} =x~𝖳(𝜽^T𝜽)x~22x~2\displaystyle=\left\|\frac{\tilde{x}^{\mathsf{T}}(\hat{\mathrm{\boldsymbol{\theta}}}_{T}-\mathrm{\boldsymbol{\theta}}^{*})}{\lVert\tilde{x}\rVert^{2}_{2}}\tilde{x}\right\|_{2}
=1x~22x~𝖳𝐀t1(t=1Txtr~t(xt)𝐀t𝜽)x~2\displaystyle=\frac{1}{{\lVert\tilde{x}\rVert^{2}_{2}}}\left\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1}\left(\sum_{t=1}^{T}x_{t}\tilde{r}_{t}(x_{t})-\mathbf{A}_{t}\mathrm{\boldsymbol{\theta}}^{*}\right)\tilde{x}\right\|_{2}
=1x~22x~𝖳𝐀t1(t=1Txt(r~t(xt)xt𝖳𝜽)λ𝜽)x~2\displaystyle=\frac{1}{{\lVert\tilde{x}\rVert^{2}_{2}}}\left\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1}\left(\sum_{t=1}^{T}x_{t}(\tilde{r}_{t}(x_{t})-x_{t}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*})-\lambda\mathrm{\boldsymbol{\theta}}^{*}\right)\tilde{x}\right\|_{2}
1x~22x~𝖳𝐀t1(t=1TxtΔt+t=1Txtηtλ𝜽)x~2\displaystyle\leq\frac{1}{{\lVert\tilde{x}\rVert^{2}_{2}}}\left\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1}\left(\sum_{t=1}^{T}x_{t}\Delta_{t}+\sum_{t=1}^{T}x_{t}\eta_{t}-\lambda\mathrm{\boldsymbol{\theta}}^{*}\right)\tilde{x}\right\|_{2}
1x~22x~𝖳𝐀t1(t=1TxtΔt)x~2+1x~22x~𝖳𝐀t1/2𝐀t1/2(t=1Txtηtλ𝜽)x~2\displaystyle\leq\frac{1}{{\lVert\tilde{x}\rVert^{2}_{2}}}\left\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1}\left(\sum_{t=1}^{T}x_{t}\Delta_{t}\right)\tilde{x}\right\|_{2}+\frac{1}{{\lVert\tilde{x}\rVert^{2}_{2}}}\left\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1/2}\mathbf{A}_{t}^{-1/2}\left(\sum_{t=1}^{T}x_{t}\eta_{t}-\lambda\mathrm{\boldsymbol{\theta}}^{*}\right)\tilde{x}\right\|_{2}
CTTo(T)+αtTo(T)\displaystyle\leq\frac{C_{T}}{T-o(T)}+\frac{\alpha_{t}}{\sqrt{T-o(T)}}

Note that x~21\lVert\tilde{x}\rVert_{2}\leq 1. In the last step, the first term is because there are To(T)T-o(T) number of x~x~𝖳\tilde{x}\tilde{x}^{\mathsf{T}} in AtA_{t} and x~𝖳𝐀t121To(T)\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1}\|_{2}\leq\frac{1}{T-o(T)}, and t=1TxtΔt2\|\sum_{t=1}^{T}x_{t}\Delta_{t}\|_{2} is bounded by total manipulation CTC_{T}. Similarly, in the second term we have x~𝖳𝐀t1/221To(T)\|\tilde{x}^{\mathsf{T}}\mathbf{A}_{t}^{-1/2}\|_{2}\leq\frac{1}{\sqrt{T-o(T)}}, and 𝐀t1/2(t=1Txtηtλ𝜽)2𝐀t1/2(t=1Txtηt)2+𝐀t1/2λ𝜽2=t=1Txtηt𝐀t1+λ𝜽𝐀t1αt\|\mathbf{A}_{t}^{-1/2}\left(\sum_{t=1}^{T}x_{t}\eta_{t}-\lambda\mathrm{\boldsymbol{\theta}}^{*}\right)\|_{2}\leq\|\mathbf{A}_{t}^{-1/2}\left(\sum_{t=1}^{T}x_{t}\eta_{t}\right)\|_{2}+\|\mathbf{A}_{t}^{-1/2}\lambda\mathrm{\boldsymbol{\theta}}^{*}\|_{2}=\|\sum_{t=1}^{T}x_{t}\eta_{t}\|_{\mathbf{A}_{t}^{-1}}+\|\lambda\mathrm{\boldsymbol{\theta}}^{*}\|_{\mathbf{A}_{t}^{-1}}\leq\alpha_{t} is the self-normalized bound for vector-valued martingales following Theorem 1 in  Abbasi-yadkori et al., (2011). ∎

Now we are ready to prove that a positive index ϵ\epsilon^{*} in CQP (2) is a necessary condition for an attackable environment.

Proof of Necessity of Theorem 1.

We prove that if ϵ0\epsilon^{*}\leq 0, the bandit environment is not attackable. To this end, we show that there exists some no-regret bandit algorithm (LinUCB in particular) against which no attack strategy can succeed. In particular, we will show that LinUCB (with a specific choice of its L2-regularization parameter λ\lambda) is robust to any attack strategy with o(T)o(T) cost when ϵ0\epsilon^{*}\leq 0. We prove this by contradiction: assume LinUCB is attackable with o(T)o(T) cost when ϵ0\epsilon^{*}\leq 0. According to Definition 1, the target arm x~\tilde{x} will be pulled To(T)T-o(T) times for infinitely many different time horizons TT, and the following inequalities hold whenever arm x~\tilde{x} is pulled by LinUCB:

x~𝖳𝜽^T,+CBT(x~)>xa𝖳𝜽^T,+xa𝖳𝜽^T,+CBT(xa),xax~\displaystyle\tilde{x}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}+\text{CB}_{T}(\tilde{x})>x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}+x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}+\text{CB}_{T}(x_{a}),\forall x_{a}\not=\tilde{x} (19)

where 𝜽^t\hat{\mathrm{\boldsymbol{\theta}}}_{t} is LinUCB’s estimated parameter at round tt based on the attacked rewards. We decompose 𝜽^T=𝜽^T,+𝜽^T,\hat{\mathrm{\boldsymbol{\theta}}}_{T}=\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}+\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}, where x~𝜽^T,\tilde{x}\perp\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp} and x~𝜽^T,\tilde{x}\parallel\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}. We will now show that the above inequalities lead to

x~𝖳𝜽>xa𝖳𝜽+xa𝖳𝜽^T,,xax~\displaystyle\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}>x_{a}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}+x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp},\forall x_{a}\not=\tilde{x}

when TT\to\infty.

By Lemma 4 we have

xa𝖳𝜽^T,\displaystyle x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel} xa𝖳𝜽CTTo(T)αTTo(T)\displaystyle\geq x_{a}^{\mathsf{T}}{\mathrm{\boldsymbol{\theta}}}^{*}_{\parallel}-\frac{C_{T}}{T-o(T)}-\frac{\alpha_{T}}{\sqrt{T-o(T)}}
x~𝖳𝜽^T,\displaystyle\tilde{x}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel} x~𝖳𝜽+CTTo(T)+αTTo(T)\displaystyle\leq\tilde{x}^{\mathsf{T}}{\mathrm{\boldsymbol{\theta}}}^{*}_{\parallel}+\frac{C_{T}}{T-o(T)}+\frac{\alpha_{T}}{\sqrt{T-o(T)}}

Substituting them back, we have with probability at least 12δ1-2\delta,

x~𝖳𝜽>xa𝖳𝜽+xa𝖳𝜽^T,+CBT(xa)CBT(x~)2CTTo(T)2αTTo(T),xax~\displaystyle\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}>x_{a}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}+x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}+\text{CB}_{T}(x_{a})-\text{CB}_{T}(\tilde{x})-\frac{2C_{T}}{T-o(T)}-\frac{2\alpha_{T}}{\sqrt{T-o(T)}},\forall x_{a}\not=\tilde{x} (20)

Let us first consider the case xax~x_{a}\nparallel\tilde{x}. Substituting Claim 1 and Claim 2 back, we have with probability at least 14δ1-4\delta

x~𝖳𝜽>\displaystyle\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}> xa𝖳𝜽+xa𝖳𝜽^T,+Θ(log(T/δ)(1o(T)1To(T)))\displaystyle x_{a}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}+x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}+\Theta\left(\sqrt{\log(T/\delta)}\left(\frac{1}{\sqrt{o(T)}}-\frac{1}{\sqrt{T-o(T)}}\right)\right)
O(log(T/δ)To(T))2CTTo(T)2αTTo(T),xax~\displaystyle-O\left(\sqrt{\frac{\log(T/\delta)}{T-o(T)}}\right)-\frac{2C_{T}}{T-o(T)}-\frac{2\alpha_{T}}{\sqrt{T-o(T)}},\forall x_{a}\nparallel\tilde{x}

Taking TT\to\infty and noticing that the last three terms diminish to 0 faster than the third term, there must exist a T0T_{0} such that for any T>T0T>T_{0},

x~𝖳𝜽>xa𝖳𝜽+xa𝖳𝜽^T,,xax~\displaystyle\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}>x_{a}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}+x_{a}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp},\forall x_{a}\not=\tilde{x} (21)

holds when xax~x_{a}\nparallel\tilde{x}.

Now we consider the special case where some xax~x_{a}\parallel\tilde{x} and show that the above inequality is still strict. Let xa=cx~x_{a}=c\tilde{x}. If |c|>1|c|>1, we have CBT(xa)CBT(x~)=(|c|1)CBT(x~)>0\text{CB}_{T}(x_{a})-\text{CB}_{T}(\tilde{x})=(|c|-1)\text{CB}_{T}(\tilde{x})>0. If |c|<1|c|<1, since x~\tilde{x} is pulled a linear number of times for any large tt with sublinear cost, we must have x~𝖳θ^t,>0\tilde{x}^{\mathsf{T}}\hat{\theta}_{t,\parallel}>0; otherwise the cost would be linear. We then directly have x~𝖳θ^t,=xa𝖳θ^t,+(1c)x~𝖳θ^t,>xa𝖳θ^t,\tilde{x}^{\mathsf{T}}\hat{\theta}_{t,\parallel}=x_{a}^{\mathsf{T}}\hat{\theta}_{t,\parallel}+(1-c)\tilde{x}^{\mathsf{T}}\hat{\theta}_{t,\parallel}>x_{a}^{\mathsf{T}}\hat{\theta}_{t,\parallel}. This leads to x~𝖳θ>xa𝖳θ\tilde{x}^{\mathsf{T}}\theta^{*}_{\parallel}>x_{a}^{\mathsf{T}}\theta^{*}_{\parallel} (inequality (21)) since xa𝜽^T,x_{a}\perp\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}.

Combining the two cases, we know there must exist a 𝜽^T,\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp} that satisfies inequality (21) (the first constraint of CQP (2)) and x~𝜽^T,\tilde{x}\perp\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp} (the second constraint of CQP (2)), and that makes the objective of CQP (2) larger than 0. To satisfy the last constraint, we consider LinUCB with a choice of the L2-regularization parameter λ\lambda that guarantees 𝜽^t2<1\|\hat{\mathrm{\boldsymbol{\theta}}}_{t}\|_{2}<1 given the data for large enough TT and any t<Tt<T. Note that ridge regression is equivalent to minimizing the squared loss under some constraint 𝜽^t2c\|\hat{\mathrm{\boldsymbol{\theta}}}_{t}\|_{2}\leq c, and there always exists a one-to-one correspondence between λ\lambda and cc (one simple way to show the correspondence is via the KKT conditions). Therefore, we are guaranteed to find a λ\lambda that gives us c=1ζc=1-\zeta where ζ\zeta is an arbitrarily small constant. Then we know that 𝜽^T,\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp} satisfies 𝜽^T,+𝜽^T,2=𝜽^T2c<1\|\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}+\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}\|_{2}=\|\hat{\mathrm{\boldsymbol{\theta}}}_{T}\|_{2}\leq c<1. We then prove the last constraint 𝜽~2=𝜽+𝜽^T,21\|\tilde{\mathrm{\boldsymbol{\theta}}}\|_{2}=\|\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}+\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}\|_{2}\leq 1 by the fact that 𝜽^T,+𝜽^T,2<1\|\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}+\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp}\|_{2}<1 and 𝜽𝜽^T,2\|\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}-\hat{\mathrm{\boldsymbol{\theta}}}_{T,\parallel}\|_{2} is arbitrarily small for large enough TT according to Lemma 4.

We have now shown that there exists a 𝜽^T,\hat{\mathrm{\boldsymbol{\theta}}}_{T,\perp} that satisfies all the constraints in CQP (2) with a positive objective, which means the optimal objective ϵ\epsilon^{*} must also be positive. This, however, contradicts our assumption ϵ0\epsilon^{*}\leq 0, implying that this instance of LinUCB is not attackable by any attack strategy. ∎

Appendix C Details on Effective Attacks Without Knowledge of Model Parameters

We now prove the theorems on applying the Two-stage Null Space Attack (Algorithm 2) against LinUCB and Robust Phase Elimination.

C.1 Proof of Theorem 2

Proof of Theorem 2.

We first prove that for a large enough TT, Algorithm 2 will correctly assert the attackability with probability at least 1δ1-\delta. We rely on the following lemma to show 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} estimated in step 11 of Algorithm 2 is close to the true parameter 𝜽\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}.

Lemma 5 (Estimation error of 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}).

Algorithm 2 estimates 𝛉\mathrm{\boldsymbol{\theta}}^{*}_{\parallel} by

𝜽~=i=1n(x~)ri(x~)n(x~)x~22x~.\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}=\frac{\sum_{i=1}^{n(\tilde{x})}r_{i}(\tilde{x})}{n(\tilde{x})\|\tilde{x}\|_{2}^{2}}\tilde{x}. (22)

With probability at least 1δ1-\delta, the estimation error is bounded by

𝜽~𝜽22R2log(1/δ)n\|\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\|_{2}\leq\sqrt{\frac{2R^{2}\log(1/\delta)}{n}} (23)

where the rewards have RR-sub-Gaussian noise.

Proof.

𝜽\mathrm{\boldsymbol{\theta}}^{*}_{\parallel} is the projected vector of 𝜽\mathrm{\boldsymbol{\theta}}^{*} onto x~\tilde{x}, which is

𝜽=x~𝖳𝜽x~22x~\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}=\frac{\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}}{\|\tilde{x}\|_{2}^{2}}\tilde{x}

as defined in Eq (1). Though we need to estimate the vector 𝜽~d\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}\in\mathbb{R}^{d}, we actually only need to estimate the scale of it by l^=i=1n(x~)ri(x~)n(x~)x~22\hat{l}=\frac{\sum_{i=1}^{n(\tilde{x})}r_{i}(\tilde{x})}{n(\tilde{x})\|\tilde{x}\|_{2}^{2}}, since the direction is known to be x~\tilde{x}. Based on Hoeffding’s inequality, the estimation error of averaged rewards on x~\tilde{x} is bounded by

P(|i=1n(x~)ri(x~)n(x~)r(x~)|2R2log(1/δ)n(x~))δP\left(\left\lvert\frac{\sum_{i=1}^{n(\tilde{x})}r_{i}(\tilde{x})}{n(\tilde{x})}-r^{*}(\tilde{x})\right\rvert\geq\sqrt{\frac{2R^{2}\log(1/\delta)}{n(\tilde{x})}}\right)\leq\delta (24)

where r(x~)=x~𝖳𝜽r^{*}(\tilde{x})=\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}. Since x~22=1\|\tilde{x}\|_{2}^{2}=1, this finishes the proof. ∎
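To make the estimator in Eq (22) and the bound in Eq (23) concrete, here is a minimal simulation sketch; the target arm, bandit parameter, noise level, and pull count are hypothetical values chosen only for illustration.

import numpy as np

# Illustrative check of the estimator in Eq (22) and the bound in Eq (23).
rng = np.random.default_rng(1)
x_tilde = np.array([0.6, 0.8])                # unit-norm target arm (hypothetical)
theta_star = np.array([0.3, 0.4])             # hypothetical bandit parameter
R, n, delta = 0.1, 400, 0.05                  # noise scale, pulls of the target arm, failure prob.
rewards = x_tilde @ theta_star + R * rng.normal(size=n)
theta_par_hat = rewards.mean() / (x_tilde @ x_tilde) * x_tilde        # estimator of Eq (22)
theta_par_star = (x_tilde @ theta_star) / (x_tilde @ x_tilde) * x_tilde
err = np.linalg.norm(theta_par_hat - theta_par_star)
bound = np.sqrt(2 * R**2 * np.log(1 / delta) / n)                     # bound of Eq (23)
print(err, bound)                             # err <= bound should hold with high probability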

In the first stage, the bandit algorithm will pull the target arm x~\tilde{x} for T1O(T1)T_{1}-O(\sqrt{T_{1}}) times, since x~\tilde{x} is the best arm according to 𝜽0\mathrm{\boldsymbol{\theta}}_{0}. According to Lemma 5, with probability at least 1δ1-\delta the estimation error is bounded as

𝜽~𝜽22R2log(1/δ)T1O(T1).\|\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\|_{2}\leq\sqrt{\frac{2R^{2}\log(1/\delta)}{T_{1}-O(\sqrt{T_{1}})}}.

As a result, with a large enough T1T_{1}, the error of x~\tilde{x}’s reward estimation satisfies

|x~𝖳𝜽~x~𝖳𝜽|x~2𝜽~𝜽22R2log(1/δ)T1O(T1)ϵ.\lvert\tilde{x}^{\mathsf{T}}\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}-\tilde{x}^{\mathsf{T}}\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\rvert\leq\|\tilde{x}\|_{2}\|\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\|_{2}\leq\sqrt{\frac{2R^{2}\log(1/\delta)}{T_{1}-O(\sqrt{T_{1}})}}\leq\epsilon^{*}.

Thus, by solving CQP (2) with 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} replacing 𝜽\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}, we can correctly assert attackability with an estimated positive index ϵ~\tilde{\epsilon}^{*} whenever the environment is attackable with index ϵ\epsilon^{*}.
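Rearranging the estimation-error requirement above gives a rough sense of how large T_1 must be for a correct assertion; this is a back-of-the-envelope calculation (ignoring the lower-order O(\sqrt{T_1}) term), not a formal statement:

\sqrt{\frac{2R^{2}\log(1/\delta)}{T_{1}-O(\sqrt{T_{1}})}}\leq\epsilon^{*}\quad\Longleftrightarrow\quad T_{1}-O(\sqrt{T_{1}})\geq\frac{2R^{2}\log(1/\delta)}{(\epsilon^{*})^{2}},\qquad\text{i.e.,}\quad T_{1}\gtrsim\frac{2R^{2}\log(1/\delta)}{(\epsilon^{*})^{2}}.

Hence a smaller attackability index \epsilon^{*} or a larger noise level R requires a longer first stage, which is consistent with the empirical observations in Appendix D.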

Remark 3.

From the analysis above, we notice that the adversary requires a sufficiently large T1T_{1} to collect enough rewards on the target arm in order to make a correct attackability assertion. When T1T_{1} is not large enough, the attackability test may produce false positive or false negative conclusions. We empirically test the error rate and report the results in the Additional Experiments section (Appendix D).

We now prove the success and total cost of the proposed attack. The analysis relies on the “robustness” property of LinUCB stated in Lemma 1, which is restated and proved below.

Lemma 6 (Robustness of ridge regression).

Consider LinUCB with ridge regression under poisoning attack. Let St=τ{1t},xaτx~|Δτ|S^{\prime}_{t}=\sum_{\tau\in\{1\dots t\},x_{a_{\tau}}\neq\tilde{x}}|\Delta_{\tau}| be the total corruption on non-target arms until time tt and assume every corruption on target arm is bounded by γ\gamma. Then for any t=1Tt=1\dots T, with probability at least 1δ1-\delta we have

𝜽~𝜽^tAtαt+St/λ+γt\|\tilde{\mathrm{\boldsymbol{\theta}}}-\hat{\mathrm{\boldsymbol{\theta}}}_{t}\|_{A_{t}}\leq\alpha_{t}+S^{\prime}_{t}/\sqrt{\lambda}+\gamma\sqrt{t} (25)

where αt=dlog(1+t/λδ)+λ\alpha_{t}=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}.

Proof.

Based on the closed form solution of ridge regression, we have

𝜽^t=𝜽~λ𝐀t1𝜽~+𝐀t1τ=1txaτ[ητ+Δτ]\displaystyle\hat{\mathrm{\boldsymbol{\theta}}}_{t}=\tilde{\mathrm{\boldsymbol{\theta}}}-\lambda\mathbf{A}_{t}^{-1}\tilde{\mathrm{\boldsymbol{\theta}}}+\mathbf{A}_{t}^{-1}\sum_{\tau=1}^{t}x_{a_{\tau}}[\eta_{\tau}+\Delta_{\tau}]

Therefore, using ideas from Abbasi-yadkori et al., (2011), we can have

𝜽^t𝜽~𝐀t\displaystyle\|\hat{\mathrm{\boldsymbol{\theta}}}_{t}-\tilde{\mathrm{\boldsymbol{\theta}}}\|_{\mathbf{A}_{t}} \displaystyle\leq λ1/2𝜽2+τ=1txaτητ𝐀t1+τ=1txaτΔτ𝐀t1\displaystyle\lambda^{1/2}\|\mathrm{\boldsymbol{\theta}}^{*}\|_{2}+\|\sum_{\tau=1}^{t}x_{a_{\tau}}\eta_{\tau}\|_{\mathbf{A}_{t}^{-1}}+\|\sum_{\tau=1}^{t}x_{a_{\tau}}\Delta_{\tau}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+τ=1txaτΔτ𝐀t1\displaystyle\alpha_{t}+\|\sum_{\tau=1}^{t}x_{a_{\tau}}\Delta_{\tau}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+τ{1t},xaτx~xaτΔτ𝐀t1+τ{1t},xaτ=x~xaτΔτ𝐀t1\displaystyle\alpha_{t}+\|\sum_{\tau\in\{1\dots t\},x_{a_{\tau}}\neq\tilde{x}}x_{a_{\tau}}\Delta_{\tau}\|_{\mathbf{A}_{t}^{-1}}+\|\sum_{\tau\in\{1\dots t\},x_{a_{\tau}}=\tilde{x}}x_{a_{\tau}}\Delta_{\tau}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+τ{1t},xaτx~xaτΔτ𝐀t1+γn(x~)x~𝐀t1\displaystyle\alpha_{t}+\|\sum_{\tau\in\{1\dots t\},x_{a_{\tau}}\neq\tilde{x}}x_{a_{\tau}}\Delta_{\tau}\|_{\mathbf{A}_{t}^{-1}}+\|\gamma n(\tilde{x})\tilde{x}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+τ{1t},xaτx~xaτΔτ2/λ+γn(x~)x~𝐀t1\displaystyle\alpha_{t}+\|\sum_{\tau\in\{1\dots t\},x_{a_{\tau}}\neq\tilde{x}}x_{a_{\tau}}\Delta_{\tau}\|_{2}/\sqrt{\lambda}+\|\gamma n(\tilde{x})\tilde{x}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+St/λ+γn(x~)x~𝐀t1\displaystyle\alpha_{t}+S^{\prime}_{t}/\sqrt{\lambda}+\|\gamma n(\tilde{x})\tilde{x}\|_{\mathbf{A}_{t}^{-1}}
\displaystyle\leq αt+St/λ+γn(x~)n(x~)\displaystyle\alpha_{t}+S^{\prime}_{t}/\sqrt{\lambda}+\frac{\gamma n(\tilde{x})}{\sqrt{n(\tilde{x})}}
\displaystyle\leq αt+St/λ+γt\displaystyle\alpha_{t}+S^{\prime}_{t}/\sqrt{\lambda}+\gamma\sqrt{t}

with probability at least 1δ1-\delta, where n(x~)n(\tilde{x}) is the number of times the target arm has been pulled. The second step is based on the definition of αt\alpha_{t} and introduces the high-probability bound; the fourth step is because |Δτ|<γ|\Delta_{\tau}|<\gamma whenever xaτ=x~x_{a_{\tau}}=\tilde{x}; the fifth step is because 𝐀tλ𝐈\mathbf{A}_{t}\succeq\lambda\mathbf{I}; and the second-to-last inequality follows Eq (8). Finally, noticing that n(x~)tn(\tilde{x})\leq t finishes the proof. ∎

Let us first analyze the attack in the first stage. Denote RT(𝜽)R_{T}(\mathrm{\boldsymbol{\theta}}) as the regret of LinUCB until round TT, where 𝜽\mathrm{\boldsymbol{\theta}} is the ground-truth parameter. We know from Abbasi-yadkori et al., (2011) that if the rewards are all generated by 𝜽\mathrm{\boldsymbol{\theta}} then with probability at least 1δ1-\delta we have

RT(𝜽)=αTdTlog(1+T/λδ)=O(dTlog(T/δ))R_{T}(\mathrm{\boldsymbol{\theta}})=\alpha_{T}\sqrt{dT\log\left(\frac{1+T/\lambda}{\delta}\right)}=O(d\sqrt{T}\log(T/\delta)) (26)

where αt=dlog(1+t/λδ)+λ\alpha_{t}=\sqrt{d\log\left(\frac{1+t/\lambda}{\delta}\right)}+\sqrt{\lambda}. Then the attack in the first T1T_{1} rounds based on 𝜽0\mathrm{\boldsymbol{\theta}}_{0} should make the bandit algorithm pull x~\tilde{x} at least T1RT1(𝜽0)/ϵ0T_{1}-R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\epsilon_{0}^{*} times. According to Lemma 5, with probability at least 1δ1-\delta parameter estimation error is bounded by

𝜽~,T1𝜽22log(1/δ)/T1RT1(𝜽0)/ϵ022log(1/δ)/T1\|\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel,T_{1}}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\|_{2}\leq\sqrt{2\log(1/\delta)}/\sqrt{T_{1}-R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\epsilon_{0}^{*}}\leq 2\sqrt{2\log(1/\delta)}/\sqrt{T_{1}} (27)

for large enough T1T_{1}. This means we have

γ=x~𝖳(𝜽~,T1𝜽)22log(1/δ)/T1\gamma=\|\tilde{x}^{\mathsf{T}}\big{(}\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel,T_{1}}-\mathrm{\boldsymbol{\theta}}^{*}_{\parallel}\big{)}\|\leq 2\sqrt{2\log(1/\delta)}/\sqrt{T_{1}} (28)

Now we prove the attack is successful with high probability. Consider the regret of LinUCB against 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}} as the ground-truth parameter. The estimation error in 𝜽^t𝜽~\hat{\mathrm{\boldsymbol{\theta}}}_{t}-\tilde{\mathrm{\boldsymbol{\theta}}} has three sources: the sub-Gaussian noise, the rewards on non-target arms in the first stage generated by 𝜽0\mathrm{\boldsymbol{\theta}}_{0} (the rewards on the target arm are corrected to 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}} in steps 19-20 of Algorithm 2), and the unattacked rewards on the target arm in the second stage generated by 𝜽\mathrm{\boldsymbol{\theta}}^{*}. According to Lemma 1, with probability at least 13δ1-3\delta, we have

𝜽^t𝜽~𝐀tαt+RT1(𝜽0)/λ+22tlog(1/δ)/T1,t>T1.\|\hat{\mathrm{\boldsymbol{\theta}}}_{t}-\tilde{\mathrm{\boldsymbol{\theta}}}\|_{\mathbf{A}_{t}}\leq\alpha_{t}+R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\sqrt{\lambda}+2\sqrt{2t\log(1/\delta)}/\sqrt{T_{1}},t>T_{1}.

To show the number of rounds pulling non-target arms, we first look at the regret against 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}, i.e., RT(𝜽~)R_{T}(\tilde{\mathrm{\boldsymbol{\theta}}}).

RT(𝜽~)\displaystyle R_{T}(\tilde{\mathrm{\boldsymbol{\theta}}}) t=1T(x~𝖳𝜽~xat𝖳𝜽~)\displaystyle\leq\sum_{t=1}^{T}\left(\tilde{x}^{\mathsf{T}}\tilde{\mathrm{\boldsymbol{\theta}}}-x_{a_{t}}^{\mathsf{T}}\tilde{\mathrm{\boldsymbol{\theta}}}\right)
t=1T(x~𝖳𝜽^t+CBt(x~)xat𝖳𝜽~)\displaystyle\leq\sum_{t=1}^{T}\left(\tilde{x}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{t}+\text{CB}_{t}(\tilde{x})-x_{a_{t}}^{\mathsf{T}}\tilde{\mathrm{\boldsymbol{\theta}}}\right)
t=1T(xat𝖳𝜽^t+CBt(xat)xat𝖳𝜽~)\displaystyle\leq\sum_{t=1}^{T}\left(x_{a_{t}}^{\mathsf{T}}\hat{\mathrm{\boldsymbol{\theta}}}_{t}+\text{CB}_{t}(x_{a_{t}})-x_{a_{t}}^{\mathsf{T}}\tilde{\mathrm{\boldsymbol{\theta}}}\right)
t=1T2CBt(xat)\displaystyle\leq\sum_{t=1}^{T}2\text{CB}_{t}(x_{a_{t}})
2Tt=1TCBt2(xat)\displaystyle\leq 2\sqrt{T\sum_{t=1}^{T}\text{CB}^{2}_{t}(x_{a_{t}})}
2𝜽^T𝜽~𝐀TTt=1Txat𝐀t12\displaystyle\leq 2\|\hat{\mathrm{\boldsymbol{\theta}}}_{T}-\tilde{\mathrm{\boldsymbol{\theta}}}\|_{\mathbf{A}_{T}}\sqrt{T\sum_{t=1}^{T}\lVert x_{a_{t}}\rVert^{2}_{\mathbf{A}_{t}^{-1}}}
2(αT+RT1(𝜽0)/λ+22Tlog(1/δ)/T1)dTlog(1+T/λδ)\displaystyle\leq 2\big{(}\alpha_{T}+R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\sqrt{\lambda}+2\sqrt{2T\log(1/\delta)}/\sqrt{T_{1}}\big{)}\sqrt{dT\log\left(\frac{1+T/\lambda}{\delta}\right)}

holds with probability at least 13δ1-3\delta. And LinUCB will pull non-target arms at most RT(𝜽~)/ϵR_{T}(\tilde{\mathrm{\boldsymbol{\theta}}})/\epsilon^{*} times, which can be bounded by

RT(𝜽~)/ϵ\displaystyle R_{T}(\tilde{\mathrm{\boldsymbol{\theta}}})/\epsilon^{*} 2(αT+RT1(𝜽0)/λ+22Tlog(1/δ)/T1)dTlog(1+T/λδ)/ϵ\displaystyle\leq 2\left(\alpha_{T}+R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\sqrt{\lambda}+2\sqrt{2T\log(1/\delta)}/\sqrt{T_{1}}\right)\sqrt{dT\log\left(\frac{1+T/\lambda}{\delta}\right)}/\epsilon^{*}
2(dTlog(1+T/λδ)+λ+RT1(𝜽0)/λ+22Tlog(1/δ)/T1)dTlog(1+T/λδ)/ϵ\displaystyle\leq 2\left(\sqrt{dT\log\left(\frac{1+T/\lambda}{\delta}\right)}+\sqrt{\lambda}+R_{T_{1}}(\mathrm{\boldsymbol{\theta}}_{0})/\sqrt{\lambda}+2\sqrt{2T\log(1/\delta)}/\sqrt{T_{1}}\right)\sqrt{dT\log\left(\frac{1+T/\lambda}{\delta}\right)}/\epsilon^{*}

and is in the order of

O(d(log(T/δ)+T1log(T1/δ)+Tlog(1/δ)/T1)Tlog(T/δ)/ϵ)O\left(d\left(\sqrt{\log(T/\delta)}+\sqrt{T_{1}}\log{(T_{1}/\delta)}+\sqrt{T\log(1/\delta)}/\sqrt{T_{1}}\right)\sqrt{T\log(T/\delta)}/\epsilon^{*}\right) (29)

The T1log(T1/δ)\sqrt{T_{1}}\log{(T_{1}/\delta)} term is due to the “corrupted” rewards of non-target arms observed in the first stage. Setting T1=T1/2T_{1}=T^{1/2} minimizes the number of rounds pulling non-target arms, which is then O~(T3/4)\tilde{O}(T^{3/4}) according to Eq (29).
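To see why T_1 = T^{1/2} is the minimizing choice, note that (dropping d, \epsilon^{*} and logarithmic factors) the two dominant T_1-dependent terms in Eq (29) are balanced at

\sqrt{T_{1}}\cdot\sqrt{T}=\frac{T}{\sqrt{T_{1}}}\quad\Longrightarrow\quad T_{1}=\sqrt{T},

at which point both terms are of order T^{3/4}.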

Now we bound the total cost C(T)C(T). Note that in order to make the attack “stealthy”, we inject sub-Gaussian noise η~t\tilde{\eta}_{t} into the perturbed reward to make it stochastic. We separate the total cost into the cost of changing the mean reward and the cost of the injected sub-Gaussian noise:

C(T)=t={1..T},r~trt|Δrt|i=1N|xai𝖳(𝜽~𝜽)|+i=1N|η~i|C(T)=\sum_{t=\{1..T\},\tilde{r}_{t}\neq r_{t}}|\Delta r_{t}|\leq\sum_{i=1}^{N}|x_{a_{i}}^{\mathsf{T}}(\tilde{\mathrm{\boldsymbol{\theta}}}-\mathrm{\boldsymbol{\theta}}^{*})|+\sum_{i=1}^{N}|\tilde{\eta}_{i}| (30)

where i{t={1..T}:r~trt}i\in\{t=\{1..T\}:\tilde{r}_{t}\neq r_{t}\}. Let N=|{i}|N=|\{i\}| be the total number of attacked rounds. Since the number of attacks on non-target arms is bounded by Eq (29) and the target arm is attacked at most T1T_{1} times, we have with probability 1δ1-\delta,

N=T1+O(d(log(T/δ)+T1log(T1/δ)+Tlog(1/δ)/T1)Tlog(T/δ)/ϵ)=O~(T3/4)N=T_{1}+O\left(d\left(\sqrt{\log(T/\delta)}+\sqrt{T_{1}}\log{(T_{1}/\delta)}+\sqrt{T\log(1/\delta)}/\sqrt{T_{1}}\right)\sqrt{T\log(T/\delta)}/\epsilon^{*}\right)=\tilde{O}(T^{3/4}) (31)

when setting T1=T1/2T_{1}=T^{1/2}.

Notice that since η~t\tilde{\eta}_{t} is RR-sub-Gaussian, its absolute value |η~t||\tilde{\eta}_{t}| is also RR-sub-Gaussian, and 𝔼[|η~t|]<L\mathbb{E}[|\tilde{\eta}_{t}|]<L for some constant LL following Proposition 2.5.2 in Vershynin, (2018). According to the general Hoeffding’s inequality, with probability at least 1δ1-\delta,

i=1N|η~i|NL+NLlog(2/δ)=O(N+Nlog(1/δ))\sum_{i=1}^{N}|\tilde{\eta}_{i}|\leq NL+\sqrt{NL\log(2/\delta)}=O(N+\sqrt{N\log(1/\delta)}) (32)

Thus the second term of Eq (30) is of order O(N)=O~(T3/4)O(N)=\tilde{O}(T^{3/4}). The first term of Eq (30) is bounded by 2N+2T12N+2T_{1} because each attack changes the mean reward by at most 2, except for the compensation step in line 20, and the reward compensation on the target arm is bounded by 2T12T_{1} because the target arm is pulled at most T1T_{1} times in the first stage. Overall, with probability at least 14δ1-4\delta,

C(T)i=1N|xai𝖳(𝜽~𝜽)|+i=1N|η~i|2N+2T1+NL+NLlog(2/δ)=O~(N)=O~(T3/4)C(T)\leq\sum_{i=1}^{N}|x_{a_{i}}^{\mathsf{T}}(\tilde{\mathrm{\boldsymbol{\theta}}}-\mathrm{\boldsymbol{\theta}}^{*})|+\sum_{i=1}^{N}|\tilde{\eta}_{i}|\leq 2N+2T_{1}+NL+\sqrt{NL\log(2/\delta)}=\tilde{O}(N)=\tilde{O}(T^{3/4}) (33)

when setting T1=T1/2T_{1}=T^{1/2}. ∎

C.2 Proof Sketch of Corollary 2

The proof is similar to that of Theorem 2, and thus we only explain the differences here.

Instead of using Lemma 1 to analyze the impact of perturbed rewards generated by 𝜽0\mathrm{\boldsymbol{\theta}}_{0} (against 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}) collected in the first stage, we know RobustPhE has an additional regret term O(S2)O(S^{2}) for corruption SS (assuming SS is unknown to the bandit algorithm). Since the bandit algorithm observes T1T_{1} rewards in the first stage, S2T1S\leq 2T_{1} and the additional regret due to rewards from first stage is O~(T12)\tilde{O}(T_{1}^{2}). For the unattacked rewards on target arm in the second stage generated by 𝜽\mathrm{\boldsymbol{\theta}}^{*}, we view them as rewards generated by 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}} with misspecification error γ\gamma, i.e., |x~𝖳(𝜽𝜽~)|γ|\tilde{x}^{\mathsf{T}}(\mathrm{\boldsymbol{\theta}}^{*}-\tilde{\mathrm{\boldsymbol{\theta}}})|\leq\gamma. Proposition 5.1 in Lattimore et al., (2020) showed that the phase elimination algorithm with misspecification γ\gamma has additional regret in O(γdTlog(T))O(\gamma\sqrt{d}T\log(T)). With γ22log(1/δ)/T1\gamma\leq 2\sqrt{2\log(1/\delta)}/\sqrt{T_{1}} by Eq (28), the total regret is

RT(𝜽~)=O(dTlog(T/δ)+dTlog(T)log(1/δ)/T1+T12)R_{T}(\tilde{\mathrm{\boldsymbol{\theta}}})=O\Big{(}d\sqrt{T}\log(T/\delta)+\sqrt{d}T\log(T)\log(1/\delta)/\sqrt{T_{1}}+T_{1}^{2}\Big{)}

with probability at least 12δ1-2\delta. Therefore, we have with probability at least 12δ1-2\delta, the attack strategy will fool RobustPhE to pull non-target arms at most O((dTlog(T/δ)+dTlog(T)log(1/δ)/T1+T12)/ϵ)O\Big{(}(d\sqrt{T}\log(T/\delta)+\sqrt{d}T\log(T)\log(1/\delta)/\sqrt{T_{1}}+T_{1}^{2})/\epsilon^{*}\Big{)} rounds.

Similar to Eq (31), we bound the total rounds of attacking RobustPhE by

N=T1+O((dTlog(T/δ)+dTlog(T)log(1/δ)/T1+T12)/ϵ)N=T_{1}+O\Big{(}\big{(}d\sqrt{T}\log(T/\delta)+\sqrt{d}T\log(T)\log(1/\delta)/\sqrt{T_{1}}+T_{1}^{2}\big{)}/\epsilon^{*}\Big{)} (34)

From Eq (33), we know the total cost is of the same order as the number of attacked rounds. So with probability at least 13δ1-3\delta the total cost is

O(T1+(dTlog(T/δ)+dTlog(T)log(1/δ)/T1+T12)/ϵ).O\Big{(}T_{1}+(d\sqrt{T}\log(T/\delta)+\sqrt{d}T\log(T)\log(1/\delta)/\sqrt{T_{1}}+T_{1}^{2})/\epsilon^{*}\Big{)}.

Setting T1=T2/5T_{1}=T^{2/5} gives us the minimum attack cost O~(T4/5)\tilde{O}(T^{4/5}), and the non-target arms are pulled at most O~(T4/5)\tilde{O}(T^{4/5}) rounds.
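Similarly, T_1 = T^{2/5} can be seen as balancing the two dominant T_1-dependent terms in the bound above (dropping d, \epsilon^{*} and logarithmic factors):

\frac{T}{\sqrt{T_{1}}}=T_{1}^{2}\quad\Longrightarrow\quad T_{1}^{5/2}=T\quad\Longrightarrow\quad T_{1}=T^{2/5},

at which point both terms are of order T^{4/5}.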

Remark 4.

Note that we bound the total corruption by T1T_{1}, which means the adversary does not need to compensate the rewards on the target arm as in line 20 of Algorithm 2. The robustness of RobustPhE allows the adversary to carry over the rewards from the first stage, whereas attacking LinUCB requires the compensation step.

C.3 Attack under unknown TT

Our two-stage null space attack algorithm requires that TT is known, for the convenience of analysis. Here we discuss a promising idea of leveraging the doubling trick to extend the attack to the case of unknown TT, which leads to a multi-stage attack that repeatedly adjusts the target parameter θ~\tilde{\theta} at each stage. More concretely, to attack LinUCB without knowing TT, the adversary can start with a pre-specified horizon T0T_{0} and run the two-stage attack. Once the actual horizon reaches T0T_{0}, the adversary expands the horizon to T02T_{0}^{2} (i.e., the “doubling” trick), views the previous T0T_{0} rounds as the new first stage, re-calculates the target parameter θ~\tilde{\theta} using rewards from those T0T_{0} rounds, and runs the new attack strategy until T02T_{0}^{2}. Using this exponential horizon sequence {Ti=T02i,i}\{T_{i}=T_{0}^{2^{i}},i\in\mathbb{N}\}, we obtain a multi-stage attack on LinUCB with unknown time horizon. Similarly, to attack RobustPhE, we adjust the horizon sequence to {Ti=T0(5/2)i,i}\{T_{i}=T_{0}^{(5/2)^{i}},i\in\mathbb{N}\}, which leads to a similar multi-stage attack on RobustPhE. We believe this direction is feasible based on our current analysis; a more thorough and complete treatment is left as future work.
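As a minimal sketch of the horizon sequences described above (the starting horizon T_0 and the number of stages below are illustrative choices, not prescriptions from our analysis):

# Stage end-points T_i = T_0 ** (exponent ** i); T_0 and the stage counts are illustrative.
def horizon_sequence(T0, exponent, num_stages):
    return [int(round(T0 ** (exponent ** i))) for i in range(num_stages)]

print(horizon_sequence(100, 2, 4))    # LinUCB-style sequence: 100, 10^4, 10^8, 10^16
print(horizon_sequence(100, 2.5, 3))  # RobustPhE-style sequence: 100, 10^5, ~3.2e12

At each new stage, all rounds observed so far play the role of the first stage when re-estimating θ~\tilde{\theta}, exactly as described above.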

Appendix D Additional Experiments

Figure 3: False negative rate of attackability test

In Figure 3, we study the false negative rate of the attackability test in Algorithm 2, i.e., how often the adversary mistakenly asserts that an attackable environment is not attackable. As explained in Remark 3, such a wrong assertion is caused by using the estimated 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} instead of the ground-truth bandit parameter. In this experiment, we consider a challenging attackable three-arm environment with 𝒜={x1=(0,1),x2=(0.11,1.1),x3=(2,0)}\mathcal{A}=\{x_{1}=(0,1),x_{2}=(0.11,1.1),x_{3}=(-2,0)\}, x~=x1\tilde{x}=x_{1} and 𝜽=(0,0.5)\mathrm{\boldsymbol{\theta}}^{*}=(0,0.5). By solving CQP (2), we have attackability index ϵ=0.005\epsilon^{*}=0.005 and certificate 𝜽~=(0.5,0)\tilde{\mathrm{\boldsymbol{\theta}}}_{\perp}=(-0.5,0) (we introduce arm x3x_{3} to guarantee that the first dimension of 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\perp} cannot be smaller than 0.5-0.5; comparing x~\tilde{x} and x2x_{2}, we can see the optimal solution is ϵ=0.005\epsilon^{*}=0.005). We test the two-stage null space attack against LinUCB with T=10,000T=10,000, where the adversary tests attackability after the first T1=T1/2=100T_{1}=T^{1/2}=100 rounds. We vary T1T_{1} from 11 to 100100 to see how many rounds are sufficient for the attackability test, and report results averaged over 100 runs. We also vary the standard deviation σ\sigma of the Gaussian noise from 0.1 to 0.3. In Figure 3, we can see that the false negative rate is almost zero when T1>50T_{1}>50, suggesting T1=100T_{1}=100 is sufficient. When σ=0.1\sigma=0.1, the adversary only needs around 10 rounds to make a correct assertion. We also notice that the false negative rate becomes higher under a larger noise scale. As suggested in Lemma 5, the error in the 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} estimate is larger when the noise scale is larger or when the number of target arm rewards n(x~)n(\tilde{x}) is smaller, which highly depends on T1T_{1}. A larger error means CQP (2) with 𝜽~\tilde{\mathrm{\boldsymbol{\theta}}}_{\parallel} is more likely to be infeasible, which gives a false negative assertion. However, T1=100T_{1}=100 is still enough for the attackability test when ϵ=0.005\epsilon^{*}=0.005.