Understanding the Role of Feedback in Online Learning with Switching Costs
Abstract
In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.
1 Introduction
Online learning over a finite set of actions is a classical problem in machine learning research. It can be formulated as a $T$-round repeated game between a learner and an adversary: at each round, the learner chooses one of $K$ actions and suffers the loss of this chosen action, where the loss is determined by the adversary. At the end of each round, the learner receives some feedback and uses it to update her policy for the next round. The goal of the learner is to minimize the regret, defined as the difference between her cumulative loss and that of the best fixed action in hindsight.
In terms of the type of feedback, two important settings have been extensively studied in the literature: bandit and full information. At each round, if the learner observes only the loss of the chosen action, then it is called bandit feedback, and the game is called adversarial multi-armed bandits (MAB) or non-stochastic bandits with adversarial losses (Auer et al., 2002b). On the other hand, if the losses of all actions are revealed to the learner, then it is called full-information feedback, and the game becomes prediction with expert advice (Cesa-Bianchi & Lugosi, 2006).
The regret in these two settings has been well understood. Specifically, the minimax regret is $\Theta(\sqrt{KT})$ under bandit feedback (Auer et al., 2002b; Audibert & Bubeck, 2009) and $\Theta(\sqrt{T \log K})$ under full information (Cesa-Bianchi & Lugosi, 2006, Theorems 2.2 and 3.7) (Hazan, 2016) (Orabona, 2019, Section 6.8), where $K$ is the number of actions. (Throughout, we use standard big-$O$ notations, e.g., $O(\cdot)$, $\Omega(\cdot)$, and $\Theta(\cdot)$; those with a tilde, e.g., $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$, hide poly-logarithmic factors.) These results imply that learning under bandit feedback is slightly harder than under full information, in the sense that the dependency on $K$ is worse ($\sqrt{K}$ vs. $\sqrt{\log K}$). However, the scaling with respect to $T$ remains the same (i.e., $\sqrt{T}$).
In the above standard settings, the learner is allowed to arbitrarily switch actions at two consecutive rounds. However, in many real-world decision-making problems, switching actions may incur a cost (e.g., due to system reconfiguration and resource reallocation) (Zhang et al., 2005; Kaplan, 2011). Motivated by this practical consideration, a new setting called online learning with switching costs has also been extensively studied (Arora et al., 2012; Cesa-Bianchi et al., 2013). In this setting, the learner needs to pay an additional unit loss whenever she switches actions.
Interestingly, it has been shown that in this new setting, learning under bandit feedback is significantly harder than under full information. Under full-information feedback, even with switching costs, the minimax regret remains $\widetilde{\Theta}(\sqrt{T})$, which can be achieved by several algorithms such as Shrinking Dartboard (SD) (Geulen et al., 2010) and Follow-the-Perturbed-Leader (FTPL) (Devroye et al., 2013). On the other hand, Dekel et al. (2013) shows a (worse) lower bound of $\widetilde{\Omega}(T^{2/3})$ for the bandit setting, which can be matched (up to poly-logarithmic factors) by the batched EXP3 algorithm (Arora et al., 2012). These results reveal that introducing switching costs makes bandit problems strictly harder than expert problems due to the worse dependency on $T$ (i.e., $T^{2/3}$ vs. $\sqrt{T}$).
Our Contributions. While these two special cases have been well studied, it remains largely unknown how feedback impacts regret in general. To close this important gap, we aim to fundamentally understand the role of feedback (in terms of both amount and type) in online learning with switching costs. Our main contributions are as follows.
(i) We first consider the setting of bandit learning with extra observations, where in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations in an arbitrary form (Section 3). We present a tight characterization of the minimax regret, which exhibits an interesting phase-transition phenomenon (see Fig. 1(a)). Specifically, when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases.
(ii) To understand this phenomenon and design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations (Section 4). We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$ (see Fig. 1(b)). Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback.
(iii) Our findings highlight the crucial impact of feedback type (bandit vs. others) in the second setting (see Table 1). In particular, while both bandit and other types of feedback can achieve optimal regret when the budget is relatively limited, pure bandit feedback is no longer sufficient to guarantee optimal regret when the budget is relatively large. However, in the standard setting without switching costs, all three types of feedback we consider can achieve optimal regret in the full range of $B$. This reveals that the impact of feedback type is (partly) due to switching costs.


Table 1: Minimax regret under different types of feedback, with switching costs (w/ SC) and without switching costs (w/o SC), under a total observation budget $B$.

| Feedback Type | Minimax Regret w/ SC | Minimax Regret w/o SC |
| --- | --- | --- |
| Full-information | $\widetilde{\Theta}(T/\sqrt{B})$ | $\widetilde{\Theta}(T/\sqrt{B})$ |
| Flexible | $\widetilde{\Theta}(T/\sqrt{B})$ | $\widetilde{\Theta}(T/\sqrt{B})$ |
| Bandit | $\widetilde{\Theta}(T/\sqrt{B})$ if $B = O(T^{2/3})$; $\widetilde{\Theta}(T^{2/3})$ if $B = \omega(T^{2/3})$ | $\widetilde{\Theta}(T/\sqrt{B})$ |
2 Problem Setup
In this section, we introduce basic notations and present the problem setup. For any positive integer $N$, let $[N] := \{1, 2, \dots, N\}$, and let $l_{1:T}$ denote the loss sequence $(l_1, l_2, \dots, l_T)$. We use $\mathbb{1}\{\mathcal{E}\}$ to denote the indicator function of event $\mathcal{E}$: $\mathbb{1}\{\mathcal{E}\} = 1$ if event $\mathcal{E}$ happens, and $\mathbb{1}\{\mathcal{E}\} = 0$ otherwise.
The learning problem can be viewed as a repeated game between a learner and an adversary. Assume that there are $K$ actions the learner can choose from, indexed by $[K]$. Let $T$ be the length of the time horizon, which is fixed at the beginning of the game and is known to the learner. At each round $t \in [T]$, the adversary assigns a loss $l_t(i) \in [0, 1]$ to each action $i \in [K]$; the learner samples an action $X_t$ from a probability distribution $p_t$ (also determined by the learner) over the action set $[K]$. After taking action $X_t$, the learner suffers the loss of the chosen action, i.e., $l_t(X_t)$. By the end of each round, the learner observes the losses of some actions (specific types of such feedback will be discussed later) and updates the probability distribution $p_{t+1}$ that will be used at the next round. Each time the learner takes an action different from that at the previous round, one unit of switching cost is incurred. The regret of a learning algorithm $\pi$ over a loss sequence $l_{1:T}$, denoted by $R_T^{\pi}(l_{1:T})$, is defined as the difference between the cumulative loss (including the switching costs incurred) under algorithm $\pi$ and that of the optimal (best fixed) action in hindsight:
$$R_T^{\pi}(l_{1:T}) := \sum_{t=1}^{T} l_t(X_t) + \sum_{t=1}^{T} \mathbb{1}\{X_t \neq X_{t-1}\} - \min_{x \in [K]} \sum_{t=1}^{T} l_t(x). \tag{1}$$
For a randomized algorithm, we consider the expected regret (or simply regret), denoted by $\mathbb{E}[R_T^{\pi}(l_{1:T})]$, where the expectation is taken over the randomness of the algorithm. Without loss of generality, let $X_0 = X_1$, i.e., the first action does not incur any switching cost. The adversary is assumed to be oblivious, in the sense that the whole loss sequence is determined by the adversary before the game begins. In this paper, for any given algorithm $\pi$, we are interested in the worst-case (expected) regret over all possible loss sequences (i.e., instance-independent), denoted by $R_T^{\pi}$:
$$R_T^{\pi} := \sup_{l_{1:T}} \mathbb{E}\left[R_T^{\pi}(l_{1:T})\right]. \tag{2}$$
Let $\Pi$ be the set of all feasible learning algorithms following the specified learning protocol. We define the minimax (or optimal) regret, denoted by $R_T^*(\Pi)$, as the minimum worst-case regret over all feasible learning algorithms in $\Pi$:
$$R_T^*(\Pi) := \inf_{\pi \in \Pi} R_T^{\pi} = \inf_{\pi \in \Pi}\, \sup_{l_{1:T}} \mathbb{E}\left[R_T^{\pi}(l_{1:T})\right]. \tag{3}$$
For notational ease, we may drop $\Pi$ in $R_T^*(\Pi)$ and simply write $R_T^*$ whenever there is no ambiguity.
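To make the protocol concrete, the following minimal Python sketch (for illustration only; it is not part of the formal setup, and the names are illustrative) evaluates the regret in Eq. (1) for a given loss sequence and action sequence:

```python
import numpy as np

def regret_with_switching_costs(losses: np.ndarray, actions: np.ndarray) -> float:
    """Regret as in Eq. (1): cumulative loss of the played actions,
    plus one unit of cost per action switch, minus the cumulative
    loss of the best fixed action in hindsight.

    losses:  array of shape (T, K) with entries in [0, 1]
    actions: array of shape (T,) with entries in {0, ..., K-1}
    """
    T = losses.shape[0]
    cumulative_loss = losses[np.arange(T), actions].sum()
    switching_cost = np.count_nonzero(actions[1:] != actions[:-1])
    best_fixed_loss = losses.sum(axis=0).min()
    return cumulative_loss + switching_cost - best_fixed_loss
```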
To understand the role of feedback in online learning with switching costs, we will consider two different settings with an observation budget: (i) in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations (Section 3); (ii) the learner can freely make $B$ total observations (Section 4). Due to space limitations, we provide motivating examples for these settings with an observation budget in Appendix A.
3 Bandit Learning with Switching Costs under Extra Observation Budget
Observing the gap in the optimal regret bounds under bandit and full-information feedback ($\widetilde{\Theta}(T^{2/3})$ vs. $\widetilde{\Theta}(\sqrt{T})$), it is natural to ask: How much can one improve upon the regret if the learner is allowed to make some extra observations in addition to the typical bandit feedback?
Motivated by this question, we consider the setting of bandit learning with switching costs under an extra observation budget. We consider the learning protocol specified in Section 2, and in addition to the typical bandit feedback, the learner is allowed to freely use at most $B_{\mathrm{ex}}$ extra observations of the losses of other action(s) throughout the game, where $B_{\mathrm{ex}}$ is an integer in $[0, (K-1)T]$. At the two endpoints of $B_{\mathrm{ex}} = 0$ and $B_{\mathrm{ex}} = (K-1)T$, this new setting recovers the bandit and full-information cases, respectively. In this section, by slightly abusing the notation, we also use $\Pi$ to denote the set of all learning algorithms using typical bandit feedback plus at most $B_{\mathrm{ex}}$ extra observations, and we are interested in the minimax regret $R_T^*$ for this class.
3.1 Minimax Regret
We first present our main result of the minimax regret in this setting, which is formally stated in Theorem 1.
Theorem 1.
In the setting of bandit learning with switching costs under an extra observation budget $B_{\mathrm{ex}}$, the minimax regret is given by
$$R_T^* = \begin{cases} \widetilde{\Theta}\bigl(T^{2/3}\bigr), & \text{if } B_{\mathrm{ex}} = O\bigl(T^{2/3}\bigr), \\ \widetilde{\Theta}\bigl(T/\sqrt{B_{\mathrm{ex}}}\bigr), & \text{if } B_{\mathrm{ex}} = \omega\bigl(T^{2/3}\bigr). \end{cases}$$
Remark 1.
Interestingly, this minimax regret exhibits a phase-transition phenomenon (see also Fig. 1(a)): when the amount of extra observations is relatively small (i.e., $B_{\mathrm{ex}} = O(T^{2/3})$), they are insufficient for improving the regret, which remains $\widetilde{\Theta}(T^{2/3})$; however, when the amount is large enough (i.e., $B_{\mathrm{ex}} = \omega(T^{2/3})$), the regret becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$ and decreases smoothly as the budget increases.
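To illustrate the phase transition numerically, the following short snippet (for illustration only; constants and the poly-logarithmic factors hidden by $\widetilde{\Theta}(\cdot)$ are ignored, and the $T^{2/3}$ threshold is treated as a sharp cutoff) evaluates the order of the bound in Theorem 1:

```python
def minimax_regret_order(T: float, B_ex: float) -> float:
    """Order of the minimax regret in Theorem 1, ignoring constants and
    poly-logarithmic factors: T^{2/3} while B_ex = O(T^{2/3}), and
    T / sqrt(B_ex) once the extra budget grows beyond that threshold."""
    if B_ex <= T ** (2 / 3):
        return T ** (2 / 3)
    return T / B_ex ** 0.5

T = 1e6
for B_ex in [0, 1e3, 1e4, 1e5, 1e6]:
    print(f"B_ex={B_ex:>9.0f}  regret order ~ {minimax_regret_order(T, B_ex):,.0f}")
# The bound stays at T^{2/3} = 1e4 until B_ex grows beyond ~T^{2/3},
# after which it improves as T / sqrt(B_ex).
```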
3.2 Lower Bound
To establish Theorem 1, we will first show a fundamental lower bound, which is formally stated in Proposition 1.
Proposition 1.
For any learning algorithm that can use a total of $B_{\mathrm{ex}}$ extra observations in addition to the typical bandit feedback, there exists a loss sequence (which may depend on $T$ and $B_{\mathrm{ex}}$) such that
$$\mathbb{E}\left[R_T\right] = \widetilde{\Omega}\Bigl(\min\Bigl\{T^{2/3},\, T/\sqrt{B_{\mathrm{ex}}}\Bigr\}\Bigr).$$
We provide a detailed proof of the above lower bound in Appendix B. Here, we present a proof sketch that focuses on the key steps of the lower bound analysis, with the necessary explanations. The proof sketch reveals useful insights that not only help explain the interesting phase-transition phenomenon but also shed light on the design of algorithms that can achieve this lower bound.
Proof Sketch of Proposition 1.
We first give an overview of the construction of hard loss sequences in our setting and the main ideas behind the construction.
Generally speaking, the difficulty of bandit problems lies in the exploitation-exploration tradeoff. On the one hand, the learner wants to pull empirically good actions in order to enjoy a low instantaneous loss (i.e., exploitation); on the other hand, she may also want to pull other actions and gain useful information to distinguish the optimal (best fixed) action and suboptimal actions (i.e., exploration).
In the presence of switching costs, Dekel et al. (2013) propose hard instances (i.e., loss sequences) based on a multi-scale random walk such that useful information toward distinguishability (between the optimal action and suboptimal actions) can only be obtained when the learner switches actions, which, however, incurs switching costs. Using carefully constructed instances, they show that switching costs increase the intrinsic difficulty of bandit learning and result in a regret lower bound of $\widetilde{\Omega}(T^{2/3})$.
However, the hard instances in Dekel et al. (2013) work for pure bandit feedback only. That is, if the learner can obtain full-information feedback at any of the rounds, she would immediately identify the optimal action and suffer no regret in the rest of the game. The reason is that the optimal action has the (unique) lowest loss at all rounds.
To make it still hard to learn even when the learner has some extra feedback, we will borrow an idea from Shi et al. (2022) to modify the original hard instance in Dekel et al. (2013): at each round, an additional layer of action-dependent noise is added to the loss of each action. As a result, the optimal action no longer has the lowest loss at all rounds and therefore cannot be trivially identified even when the learner can make extra observations.
In the rest of the proof sketch, we present three key steps of the proof and provide high-level explanations.
Step 1: Establishing the relationship between two regrets. As in Dekel et al. (2013), each loss value in the initial loss sequence we construct, denoted by $\widetilde{l}_{1:T}$, may not be bounded in $[0, 1]$; through truncation, we construct the actual loss sequence $l_{1:T}$ by simply projecting each initial loss value onto $[0, 1]$. For notational ease, we use $\widetilde{R}_T$ and $R_T$ to denote the regret over the loss sequences $\widetilde{l}_{1:T}$ and $l_{1:T}$, respectively. Recall that the goal is to obtain a lower bound on $\mathbb{E}[R_T]$, which, however, is hard to analyze directly due to the truncation. Instead, we show that it suffices to obtain a lower bound on $\mathbb{E}[\widetilde{R}_T]$ (i.e., the regret under the untruncated loss sequence), due to the following relationship:
(4)
where $\epsilon$ is the gap between the instantaneous losses of the optimal action and a suboptimal action. The value of $\epsilon$ will be determined later.
Step 2: Obtaining a lower bound on $\mathbb{E}[\widetilde{R}_T]$. Let $\mathbb{E}[S]$ be the expected total number of action switches. Through careful information-theoretic analysis, we obtain the following (informal) lower bound on $\mathbb{E}[\widetilde{R}_T]$ in terms of the number of switches $\mathbb{E}[S]$ and the extra observation budget $B_{\mathrm{ex}}$:
(5)
where the multiplicative factor in the information term is positive and contains only constants and poly-logarithmic terms of $T$.
We now explain each term in Eq. (5). The first term reflects that, without any useful information toward distinguishability, the learner may be stuck with a suboptimal action throughout the game, thus suffering regret $\epsilon T$. The second term roughly represents the amount of useful information for gaining distinguishability and thus reducing the regret: better distinguishability leads to a larger (negative) correction and thus a lower regret bound. The third term is simply the switching costs incurred.
Step 3: Choosing a proper value of $\epsilon$. Note that the lower bound in Eq. (5) is a quadratic function of $\sqrt{\mathbb{E}[S] + B_{\mathrm{ex}}}$. By plugging in its minimizer (with the corresponding optimal number of switches denoted by $\mathbb{E}[S]^{\star}$), we can further obtain the following lower bound:
(6)
It now remains to choose a proper value of $\epsilon$ based on $B_{\mathrm{ex}}$. By considering two different cases ($B_{\mathrm{ex}} = O(T^{2/3})$ and $B_{\mathrm{ex}} = \omega(T^{2/3})$) and choosing $\epsilon$ accordingly, we show that one of $\mathbb{E}[S]^{\star}$ and $B_{\mathrm{ex}}$ dominates the other. Then, we can obtain the desired lower bound by combining these two cases. This completes the proof sketch. ∎
Remark 2.
While we use the same instance construction method as in Shi et al. (2022), the problem they study is very different from ours. In particular, their learning protocol and their definition of switching costs are different, and they do not consider an observation budget as we do. We present a detailed discussion of the key differences in Section 5.
3.3 Insights from Lower Bound Analysis
Next, we give some useful observations and important insights that can be obtained from the above proof sketch, in particular, from Eq. (5), which provides a unified view of the lower bound in online learning with bandit feedback and flexible extra observations within a budget.
As a warm-up, we begin with the standard bandit case (i.e., $B_{\mathrm{ex}} = 0$), which has been extensively studied (Dekel et al., 2013). Recall that under the current instance construction, bandit feedback provides useful information only when the learner switches actions. From Eq. (5), one can observe that there is a tradeoff between exploration and switching costs: on the one hand, in order to better explore and enjoy a lower regret, the learner has to switch frequently (i.e., a larger $\mathbb{E}[S]$) so as to gain more information; on the other hand, however, since the learner has to pay one unit of switching cost for each switch (contributing to the third term), she should not switch too often. To strike a balance between the two, the best the learner can do is to switch $\Theta(T^{2/3})$ times; otherwise, the regret can only be worse, because $\mathbb{E}[S]^{\star}$ is the minimizer of the lower bound in Eq. (5). Finally, choosing $\epsilon$ to be of order $T^{-1/3}$ in Eq. (6) yields the $\widetilde{\Omega}(T^{2/3})$ bound for the bandit case.
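Schematically, suppressing all constants and logarithmic factors and replacing Eq. (5) by a stylized version for intuition only, the tradeoff in the pure-bandit case reads
$$\mathbb{E}\bigl[\widetilde{R}_T\bigr] \;\gtrsim\; \underbrace{\epsilon T}_{\text{stuck with a suboptimal action}} \;-\; \underbrace{\epsilon^2 T \sqrt{\mathbb{E}[S]}}_{\text{information from switches}} \;+\; \underbrace{\mathbb{E}[S]}_{\text{switching costs}}.$$
Viewed as a quadratic in $\sqrt{\mathbb{E}[S]}$, the right-hand side is minimized at $\mathbb{E}[S]^{\star} \asymp \epsilon^4 T^2$, which gives $\mathbb{E}[\widetilde{R}_T] \gtrsim \epsilon T - \tfrac{1}{4}\epsilon^4 T^2$; choosing $\epsilon \asymp T^{-1/3}$ balances the two remaining terms and yields the $\widetilde{\Omega}(T^{2/3})$ bound, with $\mathbb{E}[S]^{\star} \asymp T^{2/3}$ switches.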
Remark 3.
The above discussion indicates that with switching costs, the worst-case hard instance restrains the learner from obtaining distinguishability from more than $\Theta(T^{2/3})$ rounds (i.e., the rounds associated with action switches), rather than from all $T$ rounds as in the standard bandit learning setting (without switching costs). This is also the key reason why the minimax regret is worse in bandit learning with switching costs.
Next, we consider the first case: $B_{\mathrm{ex}} = O(T^{2/3})$. In this case, one might hope to obtain a smaller regret (compared to the bandit case) with the help of additional feedback. However, we show that, unfortunately, the gain from those additional observations is negligible for improving the regret order-wise, and hence, the previous $\widetilde{\Omega}(T^{2/3})$ bound remains. To see this, let $\epsilon$ take the same value as in the bandit case (i.e., of order $T^{-1/3}$) in Eq. (6); although the term associated with $B_{\mathrm{ex}}$ now becomes positive instead of zero (as in the bandit case), it is still dominated by the leading $\epsilon T$ term, which results in the same $\widetilde{\Omega}(T^{2/3})$ bound as in the bandit case.
We now turn to the second case: $B_{\mathrm{ex}} = \omega(T^{2/3})$. In contrast to the previous case, due to the relatively large budget, the distinguishability provided by those extra observations (which do not contribute to switching costs) is no longer negligible. This leads to a smaller regret. In particular, by choosing $\epsilon$ of order $1/\sqrt{B_{\mathrm{ex}}}$, the term associated with $B_{\mathrm{ex}}$ dominates the one associated with $\mathbb{E}[S]^{\star}$, and we obtain the desired $\widetilde{\Omega}(T/\sqrt{B_{\mathrm{ex}}})$ lower bound. In other words, one can reduce the regret through free exploration enabled by such extra observations without incurring switching costs.
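Schematically, under the same stylized version of Eq. (5) as above, now with the extra observations also contributing information (and dropping lower-order terms), setting $\epsilon \asymp 1/\sqrt{B_{\mathrm{ex}}}$ gives
$$\mathbb{E}\bigl[\widetilde{R}_T\bigr] \;\gtrsim\; \epsilon T\,\bigl(1 - c\,\epsilon\sqrt{B_{\mathrm{ex}}}\,\bigr) \;=\; \Omega\!\bigl(T/\sqrt{B_{\mathrm{ex}}}\bigr),$$
since $\epsilon\sqrt{B_{\mathrm{ex}}}$ is then a constant that can be made smaller than $1/2$ by the choice of $\epsilon$.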
3.4 Fundamental Questions about Algorithm Design
The above insights we gain from the lower bound analysis can also shed light on the algorithm design. In fact, these motivate us to ask several fundamental questions, not only about how to achieve optimal regret but also about the role of feedback in online learning with switching costs, in terms of both the amount and type of feedback.
On the one hand, it is straightforward to achieve a matching upper bound when $B_{\mathrm{ex}} = O(T^{2/3})$. Specifically, one can simply ignore all the extra observations and use bandit feedback only, e.g., via batched EXP3 (Arora et al., 2012), which enjoys a $\widetilde{O}(T^{2/3})$ regret. Although the bounds match, only $\Theta(T^{2/3})$ of the bandit observations from the $T$ rounds contribute to distinguishability, due to the tradeoff introduced by switching costs (see Remark 3). Given this observation, it is natural to ask: (Q1) Can one still achieve the same regret of $\widetilde{O}(T^{2/3})$ while using bandit feedback from only $\Theta(T^{2/3})$ rounds? Moreover, how would the regret scale with the amount of available feedback if the (bandit) feedback is even more limited (e.g., available at fewer than $\Theta(T^{2/3})$ rounds)?
On the other hand, it remains largely unknown how to match the $\widetilde{\Omega}(T/\sqrt{B_{\mathrm{ex}}})$ bound when $B_{\mathrm{ex}} = \omega(T^{2/3})$. Note that in the derivation of the lower bound, we optimistically assume that all extra observations contribute useful information toward distinguishability (see the second term in Eq. (5)). To achieve this, however, one needs to answer an important question: (Q2) How can one design a learning algorithm that properly uses these extra observations so as to indeed gain sufficient useful information toward distinguishability and match the lower bound? Moreover, since the extra observations now dominate (order-wise) the information obtained from switches, can one still match the lower bound of $\widetilde{\Omega}(T/\sqrt{B_{\mathrm{ex}}})$ using extra observations only (i.e., without using any bandit feedback)?
To address these fundamental questions, it turns out to be more instructive to consider a general setting where the learner has a budget of $B$ total observations (see Section 4) rather than $B_{\mathrm{ex}}$ extra observations. We will show that the results obtained for this general setting naturally answer the aforementioned questions. In particular, we show that there exist learning algorithms that can match the lower bound (up to poly-logarithmic factors), hence establishing the minimax regret stated in Theorem 1.
4 Online Learning with Switching Costs under Total Observation Budget
In this section, we consider a more general setting of online learning with switching costs under a total observation budget. Specifically, at each round, the learner can freely choose to observe the losses of up to $K$ actions (which may not necessarily include the action played), as long as the total number of observations over the $T$ rounds does not exceed the budget $B$, which is an integer in $[1, KT]$. Without loss of generality, we assume $B \ge K$. We aim to understand the role of feedback in this general setting by studying the following fundamental question: (Q3) How does the minimax regret scale with the amount of available feedback in general? What is the impact of different types of feedback (bandit, full-information, etc.)?
To proceed, we need some additional notations for this section. Let $O_t \subseteq [K]$ be the observation set, i.e., the set of actions whose losses the learner chooses to observe at round $t$, and let $M$ be the total number of observations, i.e., $M := \sum_{t=1}^{T} |O_t|$. Naturally, we have $M \le B$. For example, bandit feedback is a special case with $O_t = \{X_t\}$ and $M = T$, and full-information feedback is another special case with $O_t = [K]$ and $M = KT$. By slightly abusing the notation in this section, we also use $\Pi$ to denote the set of all learning algorithms that satisfy the learning protocol specified in Section 2 and do not exceed the total observation budget $B$.
4.1 Minimax Regret
We first present the main result of this section and fully characterize the minimax regret for this general setting.
Theorem 2.
In the setting of online learning with switching costs under a total observation budget $B$, the minimax regret is given by $R_T^* = \widetilde{\Theta}\bigl(T/\sqrt{B}\bigr)$.
Remark 4.
To establish this result, we need to obtain both a lower bound and a matching upper bound. For the lower bound, it turns out that it suffices to use an existing lower bound, which was originally derived for standard online learning without switching costs. We restate this lower bound in Lemma 1.
Lemma 1.
(Seldin et al., 2014, Theorem 2) In the setting of online learning (without switching costs) under a total observation budget $B$, the minimax regret is lower bounded by $\Omega\bigl(T/\sqrt{B}\bigr)$.
Naturally, this serves as a valid lower bound for the setting with switching costs we consider. In fact, we will show that this lower bound is tight (up to poly-logarithmic factors), which in turn offers the following important message.
Remark 5.
If the learner can freely make observations over the $T$ rounds within the budget, introducing switching costs does not increase the intrinsic difficulty of the online learning problem in terms of the minimax regret.
Now, it only remains to show that there exist algorithms that can achieve a matching upper bound (up to poly-logarithmic factors), which will be the main focus of the next subsection.
4.2 Learning Algorithms and Upper Bounds
In this subsection, we show that there indeed exist algorithms that can achieve the lower bound in Lemma 1, which further implies the tight bound in Theorem 2. Instead of focusing on one particular algorithm, we first propose a generic algorithmic framework, which not only enables us to design various optimal learning algorithms in a unified way but also facilitates a fundamental understanding of the problem by distilling its key components.
Our generic framework builds upon the classic Online Mirror Descent (OMD) framework with negative entropy regularizer (also called the Hedge algorithm) (Littlestone & Warmuth, 1989) and incorporates the following three key components to tackle both switching costs and observation budget in a synergistic manner.
Batching Technique. The batching technique was originally proposed for addressing adaptive adversaries (Arora et al., 2012), but it naturally provides low-switching guarantees. We divide the $T$ rounds into batches and judiciously distribute the available observations across batches. That is, instead of consuming observations at every round as in standard online learning (which could even be infeasible when the observation budget $B$ is relatively small), we use observations only at a single round randomly sampled from each batch. One key step in obtaining the desired regret guarantee is to feed the (unbiased estimate of the) batch-average loss to the learning algorithm at the end of each batch. While this technique is borrowed from Shi et al. (2022), the problem setup we consider is very different (see Section 5).
Shrinking Dartboard (SD). SD is a calibrated technique for controlling the number of action switches in online learning under a lazy version of Hedge. That is, with a carefully crafted probability distribution, the action tends to remain unchanged across two consecutive rounds (Geulen et al., 2010) while preserving the same marginal distribution as in Hedge. In our algorithmic framework, we generalize this idea to the batching case with general feedback: the same action can be played across two consecutive batches (instead of across rounds), and it is no longer required to use only full-information feedback as in Geulen et al. (2010).
Feedback Type. Recall that the learner is allowed to freely request feedback within the total budget. Hence, our last component lies in the feedback type. That is, the learner has the flexibility to choose the observation set (not limited to bandit or full-information feedback only). In order to achieve a matching upper bound, however, the choice of the observation set (i.e., the type of feedback) is crucial in some cases. We will elaborate on this in Section 4.3.
Putting these three components together, we arrive at our unified algorithmic framework, which is presented in Algorithm 1. Given the problem inputs $T$, $K$, and $B$, we need to determine the following inputs of the algorithm: the number of batches $N$, the batch size $\tau$, the learning rate $\eta$, and the SD indicator $\lambda_{\mathrm{SD}} \in \{0, 1\}$ (Line 1), along with the initialization of some variables (Line 2). Throughout the game, we maintain a positive weight $w_n(i)$ for each action $i \in [K]$ in each batch $n \in [N]$. Both the weights and the action for each batch may be updated only between two consecutive batches. Hence, in each batch $n$, we keep playing the chosen action $Y_n$ until the end of the batch (Line 4); we sample a round $t_n$ uniformly at random from the current batch (Line 5) and choose an observation set $O_n \subseteq [K]$ in a certain way (to be specified later) such that the loss of each action in $O_n$ will be observed at round $t_n$ (Line 6). We then construct an unbiased estimate (Line 7), denoted by $\hat{c}_n$, of the batch-average loss (which depends on the choice of $O_n$ and will be specified later) and then update the weight and sampling probability of each action accordingly: $w_{n+1}(i) = w_n(i)\, e^{-\eta\, \hat{c}_n(i)}$ and $q_{n+1}(i) = w_{n+1}(i) / \sum_{j \in [K]} w_{n+1}(j)$ (Line 8). Finally, we determine action $Y_{n+1}$ for the next batch (Line 9). Specifically, if the SD indicator $\lambda_{\mathrm{SD}} = 0$, the probability of keeping the previous action by default is always zero, and hence, action $Y_{n+1}$ is sampled using fresh randomness with probability proportional to the action weights, as normally done in Hedge: sample $Y_{n+1}$ following distribution $q_{n+1}$. If the SD indicator $\lambda_{\mathrm{SD}} = 1$, with probability $w_{n+1}(Y_n)/w_n(Y_n)$, we keep the current action for the next batch (i.e., $Y_{n+1} = Y_n$); otherwise, we sample a new action $Y_{n+1}$ following distribution $q_{n+1}$.
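As a concrete illustration of the framework (a simplified sketch rather than the exact pseudocode of Algorithm 1; all function and variable names, such as `observe_fn` and `estimate_fn`, are illustrative), the three components can be combined as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def batched_hedge_sd(losses, num_batches, eta, use_sd, observe_fn, estimate_fn):
    """Skeleton of the generic framework described above (a sketch only).

    losses:      (T, K) array of losses in [0, 1], revealed only through observe_fn
    num_batches: number of batches N; the batch size is tau = T // N
    eta:         learning rate of the exponential-weight (Hedge) update
    use_sd:      if True, apply the shrinking-dartboard rule to limit switches
    observe_fn:  (round t, played action) -> list of actions to observe
    estimate_fn: (observed {action: loss}, sampling distribution q_n) ->
                 unbiased estimate of the batch-average loss vector
    """
    T, K = losses.shape
    tau = T // num_batches
    weights = np.ones(K)                       # w_1(i) = 1 for every action
    action = rng.integers(K)                   # action Y_1 for the first batch
    total_loss, switches = 0.0, 0

    for n in range(num_batches):
        probs = weights / weights.sum()        # q_n, used to sample Y_n
        start = n * tau
        total_loss += losses[start:start + tau, action].sum()  # play Y_n all batch
        t_n = start + rng.integers(tau)        # uniformly sampled round in batch n
        observed = {i: losses[t_n, i] for i in observe_fn(t_n, action)}
        c_hat = estimate_fn(observed, probs)   # unbiased batch-average loss estimate
        new_weights = weights * np.exp(-eta * c_hat)

        if use_sd and rng.random() < new_weights[action] / weights[action]:
            next_action = action               # lazily keep the same action (SD)
        else:
            next_action = rng.choice(K, p=new_weights / new_weights.sum())
        switches += int(next_action != action)
        action, weights = next_action, new_weights

    return total_loss + switches               # cumulative loss plus switching costs
```

For instance, the full-information instantiation below corresponds to `observe_fn` returning all $K$ actions and `estimate_fn` returning the observed loss vector unchanged; the flexible and bandit instantiations swap in the estimators sketched later in this section.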
With Algorithm 1 in hand, we are ready to introduce several specific instantiations and study their regret guarantees. In particular, for each instantiation we will specify the choice of the following parameters: the number of batches $N$, the batch size $\tau$, the learning rate $\eta$, the SD indicator $\lambda_{\mathrm{SD}}$, and the observation set $O_n$. In the following, we first demonstrate one simple instantiation that uses full-information feedback only. Then, we show how to generalize this instantiation using more flexible feedback (i.e., not limited to full information only) while achieving the same performance guarantee.
Instantiation via Full-information Feedback. In this instantiation of Algorithm 1, we receive full-information feedback at a randomly selected round in each batch (i.e., $O_n = [K]$ and $|O_n| = K$), and SD is turned on (i.e., $\lambda_{\mathrm{SD}} = 1$). At a high level, this can be viewed as a batched generalization of the original SD algorithm (Geulen et al., 2010) with $N = B/K$ batches (since we spend $K$ observations in each batch), and hence, the corresponding batch size is $\tau = KT/B$. For ease of exposition, we assume that $B/K$ and $KT/B$ are integers. Specifically, we have $N = B/K$, $\tau = KT/B$, a suitably tuned learning rate $\eta$, $\lambda_{\mathrm{SD}} = 1$, $t_n$ sampled uniformly at random from batch $n$, and $O_n = [K]$ with $\hat{c}_n(i) = l_{t_n}(i)$ for all $i \in [K]$. We refer to this instantiation as the full-information instantiation and present its regret upper bound in Proposition 2. The proof is provided in Appendix C.
Proposition 2.
The worst-case regret under the full-information instantiation described above is upper bounded by $\widetilde{O}\bigl(T/\sqrt{B}\bigr)$.
Remark 6.
This result immediately implies an upper bound on the minimax regret, $R_T^* = \widetilde{O}(T/\sqrt{B})$, which, along with the lower bound in Lemma 1, further implies the tight bound in Theorem 2. Note that the upper bound carries an additional poly-logarithmic factor compared to the lower bound. This shares the same pattern as in the setting without switching costs (see Seldin et al. (2014, Theorem 1)), where the achieved upper bound also has an additional logarithmic factor.
Remark 7.
For the setting considered in Section 3, the above result also implies an upper bound on the minimax regret, $R_T^* = \widetilde{O}(T/\sqrt{B_{\mathrm{ex}}})$ when $B_{\mathrm{ex}} = \omega(T^{2/3})$, obtained by simply ignoring all bandit feedback (i.e., treating the $B_{\mathrm{ex}}$ extra observations as the total budget). On the other hand, as discussed in Section 3.4, when $B_{\mathrm{ex}} = O(T^{2/3})$, one can simply ignore the extra observations and use pure bandit feedback only (e.g., batched EXP3 (Arora et al., 2012)) to achieve a $\widetilde{O}(T^{2/3})$ regret. Combining these results, along with the lower bound in Proposition 1, implies the tight bound in Theorem 1. Moreover, this also answers question (Q2) raised in Section 3.
The result of our first instantiation shows that the optimal regret can indeed be achieved (up to a poly-logarithmic factor) when full-information feedback is employed. However, we can also show that the use of full-information feedback is not essential. In fact, it suffices to have an observation set chosen uniformly at random from all subsets of $[K]$ with the same cardinality, which leads to a more flexible instantiation of Algorithm 1, presented below.
Instantiation via Flexible Feedback. In this instantiation, instead of having $O_n = [K]$ as under full-information feedback, we allow $|O_n| = m \le K$. The key to this flexibility is a careful construction of an unbiased estimate of the batch-average loss (i.e., $\hat{c}_n$). Specifically, let $m$ be any integer that satisfies $1 \le m \le K$ if $B \le T$ and $B/T \le m \le K$ if $B > T$. (To fully use the budget, $m$ cannot be too small when $B > T$.) Then, we have $N = B/m$, $\tau = mT/B$, a suitably tuned learning rate $\eta$, $\lambda_{\mathrm{SD}} = 1$, $O_n$ chosen uniformly at random from all size-$m$ subsets of $[K]$, and $\hat{c}_n(i) = \frac{K}{m}\, l_{t_n}(i)\, \mathbb{1}\{i \in O_n\}$ for all $i \in [K]$. We refer to this instantiation as the flexible-feedback instantiation and present its regret upper bound in Proposition 3. The proof is provided in Appendix D.
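As an illustration of such an estimator (a sketch that plugs into the skeleton above; the exact construction used in Algorithm 1 may differ in details), one natural choice is:

```python
import numpy as np

K, m = 10, 3
rng_obs = np.random.default_rng(1)

def observe_uniform_subset(t, played_action):
    # Flexible feedback: m actions chosen uniformly at random, independently
    # of the action actually played in this batch.
    return rng_obs.choice(K, size=m, replace=False).tolist()

def uniform_subset_estimate(observed, probs):
    # Each action is observed with probability m/K, so scaling by K/m gives an
    # unbiased estimate of the loss vector at the sampled round (and hence of
    # the batch-average loss, since the round is drawn uniformly from the batch).
    c_hat = np.zeros(len(probs))
    for i, loss in observed.items():
        c_hat[i] = (len(probs) / m) * loss
    return c_hat
```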
Proposition 3.
The worst-case regret under the flexible-feedback instantiation described above is upper bounded by $\widetilde{O}\bigl(T/\sqrt{B}\bigr)$.
An astute reader may have already noticed that in the above flexible instantiation, although the number of observations per batch can be as small as one (i.e., $m = 1$), this is not the same as standard bandit feedback. This is because here, $O_n$ needs to be chosen uniformly at random rather than simply being the action played in that batch (i.e., $O_n = \{Y_n\}$), as in the standard bandit setting (with a batch size of one). Motivated by this subtle difference, we devote the next subsection to studying the impact of feedback type.
4.3 Impact of Feedback Type
In this subsection, we study the impact of feedback type by presenting another instantiation of Algorithm 1 via pure bandit feedback only. In this case, we naturally have $O_n = \{Y_n\}$, i.e., the learner observes only the loss of the action she actually plays.
Instantiation via Bandit Feedback. This instantiation is a generalized version of batched EXP3 (Arora et al., 2012) with a flexible batch size. Specifically, we have $N = B$, $\tau = T/B$, a suitably tuned learning rate $\eta$, $\lambda_{\mathrm{SD}} = 0$, $O_n = \{Y_n\}$, and the standard importance-weighted estimate $\hat{c}_n(i) = \frac{l_{t_n}(i)}{q_n(i)}\, \mathbb{1}\{i = Y_n\}$ for all $i \in [K]$. We refer to this instantiation as the bandit instantiation. When $B = O(T^{2/3})$, we obtain a regret upper bound for this instantiation and state it in Proposition 4. The proof is provided in Appendix E.
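For comparison with the flexible estimator above, a bandit-feedback counterpart for the same sketch (again illustrative, used with `use_sd=False` in line with the description above) is:

```python
import numpy as np

def observe_played_action(t, played_action):
    # Bandit feedback: only the loss of the action actually played is observed.
    return [played_action]

def bandit_estimate(observed, probs):
    # Standard importance weighting: the played action i is observed with
    # probability probs[i], so dividing by probs[i] yields an unbiased estimate.
    c_hat = np.zeros(len(probs))
    for i, loss in observed.items():
        c_hat[i] = loss / probs[i]
    return c_hat
```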
Proposition 4.
When $B = O(T^{2/3})$, the worst-case regret under the bandit instantiation described above is upper bounded by $\widetilde{O}\bigl(T/\sqrt{B}\bigr)$.
Remark 8.
This result is encouraging, in the sense that when $B = O(T^{2/3})$, even pure bandit feedback can achieve the optimal minimax regret of $\widetilde{\Theta}(T/\sqrt{B})$. This result also answers question (Q1) raised in Section 3. First, it captures the regret scaling with respect to the amount of bandit feedback (i.e., still $\widetilde{\Theta}(T/\sqrt{B})$) when $B$ is relatively small. Second, it implies that to achieve a regret of $\widetilde{O}(T^{2/3})$, it suffices to use bandit feedback from only $\Theta(T^{2/3})$ rounds rather than from all $T$ rounds as in the classic algorithms (Arora et al., 2012). The same minimax regret of $\widetilde{\Theta}(T^{2/3})$ at these two endpoints ($B = \Theta(T^{2/3})$ and $B = T$) further implies that if only bandit feedback is allowed, the minimax regret is also $\widetilde{\Theta}(T^{2/3})$ when $T^{2/3} \le B \le T$ (i.e., in between the two endpoints). In this case, bandit feedback is no longer sufficient to achieve the optimal minimax regret of $\widetilde{\Theta}(T/\sqrt{B})$, although full-information and flexible feedback can still achieve this optimal minimax regret (see Propositions 2 and 3). Clearly, this shows the crucial impact of different types of feedback (when the total budget is large), which answers the second part of question (Q3). On the other hand, a straightforward result (Proposition 5 in Appendix F), along with Propositions 2 and 3 and Lemma 1, shows that in the standard setting without switching costs, all three types of feedback can achieve optimal regret in the full range of $B$. This reveals that the impact of feedback type is partly due to switching costs. We also summarize these results in Table 1.
5 Related Work
In this section, we present detailed discussions on several lines of research that are most relevant to ours. We omit the discussion on bandit and expert problems with switching costs as we have discussed this line of work in Section 1.
Online Learning with Total Observation Budget. In this line of research, the focus is on regret minimization when feedback is not always available and hence is “limited” within a total budget. For example, in the so-called “label efficient (bandit) game” (Cesa-Bianchi et al., 2004; Audibert & Bubeck, 2010), the learner can ask for full-information/bandit feedback at no more than $m$ round(s). It is shown that the tight optimal regrets are $\Theta(T\sqrt{\log K / m})$ and $\Theta(T\sqrt{K/m})$ under full-information and bandit feedback, respectively. Seldin et al. (2014) also consider a total observation budget in online learning, where the learner can freely request feedback, as long as the total number of observed losses does not exceed the given total budget $B$. They establish a tight characterization of the minimax regret in their setting (i.e., $\widetilde{\Theta}(T/\sqrt{B})$). However, they consider neither switching costs nor the case where the total observation budget is smaller than $T$ in their algorithm design. Interestingly, we show that introducing switching costs does not increase the intrinsic difficulty of online learning, in the sense that the minimax regret remains $\widetilde{\Theta}(T/\sqrt{B})$, but the feedback type becomes crucial.
Bandits with Additional Observations. Yun et al. (2018) consider the bandit setting with additional observations, where the learner can freely make up to $M$ additional observations at each round in addition to the bandit feedback. Hence, this can be viewed as a special case of online learning with a total observation budget (Seldin et al., 2014): a total of $(M+1)T$ observations are used in a particular way (i.e., bandit plus extra observations). They present a tight characterization of the scaling of the minimax regret with respect to $T$, $K$, and $M$. Similar to Seldin et al. (2014), however, switching costs are not considered.
Online Learning with Switching Costs and Feedback Graphs. Arora et al. (2019) consider online learning with switching costs and feedback graphs, where given a feedback graph $G$, the learner observes the losses associated with the neighboring action(s) of the chosen action (including the chosen action itself). However, the feedback graph is given, and hence the additional feedback is not of the learner’s choice. Arora et al. (2019) show that in this setting, the minimax regret is $\widetilde{\Theta}\bigl(\gamma(G)^{1/3} T^{2/3}\bigr)$, where $\gamma(G)$ is the domination number of the feedback graph $G$. Hence, the dependency on $T$ remains the same as in the standard bandit setting without additional observations (i.e., $T^{2/3}$). On the contrary, in the setting we consider, the learner can freely decide which actions’ losses to observe, which leads to different (and more interesting) regret bounds.
Online Learning with Limited Switches. Altschuler & Talwar (2018) consider online learning with limited switches. In contrast to the settings with switching costs, here the learner does not pay additional losses for switching actions; instead, the total number of switches allowed is capped at a switching budget $S$. Compared to our setting, a key difference is that switching is a constraint rather than a penalty added to the loss/cost function. They show that in the bandit setting, the minimax regret is $\widetilde{\Theta}(T/\sqrt{S})$, i.e., the regret improves as the switching budget $S$ increases; in the expert setting, however, there is a phase-transition phenomenon: while the minimax regret is $\widetilde{\Theta}(T/S)$ when $S = O(\sqrt{T})$, it remains $\widetilde{\Theta}(\sqrt{T})$ when $S = \omega(\sqrt{T})$.
Online Learning against Adaptive Adversaries. Online learning with switching costs can also be viewed as a special case of learning against adaptive adversaries, where the losses at round $t$ may be adapted to the actions taken at both rounds $t$ and $t-1$ (in contrast to the oblivious adversaries we consider). Such adversaries have a bounded memory (of size one), in the sense that they can adapt only to the most recent action, instead of the entire history of earlier rounds (Cesa-Bianchi et al., 2013). Adopting the multi-scale random walk argument in Dekel et al. (2013), it has been shown that against adaptive adversaries with a memory of size one, the minimax policy regret is $\widetilde{\Theta}(T^{2/3})$ under both bandit feedback (Cesa-Bianchi et al., 2013) and full-information feedback (Feng & Loh, 2018). This is fundamentally different from the special case with switching costs, where the minimax regret differs under bandit feedback and full-information feedback ($\widetilde{\Theta}(T^{2/3})$ vs. $\widetilde{\Theta}(\sqrt{T})$).
Stochastic Bandits and the Best of Both Worlds.
Note that the above discussions have been focused on the adversarial setting. There is another body of work focused on the stochastic setting (see, e.g., Auer et al. (2002a); Auer (2003); Simchi-Levi & Xu (2019)), where the loss/reward follows some fixed distribution rather than being generated arbitrarily by an adversary. Hence, it is very different from the adversarial setting we consider. An interesting line of work has been focused on designing algorithms that can perform well in both adversarial and stochastic settings, thus achieving the best of both worlds (see, e.g., Bubeck & Slivkins (2012); Zimmert et al. (2019)).
Other Related Work.
In Shi et al. (2022), a novel bandit setting with switching costs and additional feedback has been considered. Specifically, the learner maintains an “action buffer” for each round, which is a subset of actions with a fixed cardinality, and the learner can only take an action from this buffer set. Their switching cost can be roughly viewed as how much change is made to this buffer set throughout the game: replacing an action in the buffer set incurs a constant cost. While the learner can observe the losses of all the actions in this buffer set for free, the learner can also choose to receive full-information feedback (i.e., observe the losses of all actions rather than just the actions in the buffer set) by paying another (larger) constant cost. Although we draw inspiration from their work for deriving the lower bound and designing algorithms, both their problem setup and regret definition are very different from ours, and more importantly, they do not consider an observation budget.
6 Conclusion
Our work is motivated by a well-known gap in the minimax regret under bandit feedback and full-information feedback in online learning with switching costs. We attempted to fundamentally understand the role of feedback by studying two cases of observation budget: (i) bandit feedback plus an extra observation budget and (ii) a total observation budget. Our findings reveal that both the amount and type of feedback play crucial roles when there are switching costs.
Acknowledgments
We thank the anonymous paper reviewers for their insightful feedback. This work is supported in part by the NSF grants under CNS-2112694 and CNS-2153220.
References
- Altschuler & Talwar (2018) Altschuler, J. M. and Talwar, K. Online learning over a finite action set with limited switching. ArXiv, abs/1803.01548, 2018.
- Amir et al. (2022) Amir, I., Azov, G., Koren, T., and Livni, R. Better best of both worlds bounds for bandits with switching costs. In Advances in Neural Information Processing Systems, volume 35, pp. 15800–15810, 2022.
- Arora et al. (2012) Arora, R., Dekel, O., and Tewari, A. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1747–1754, 2012.
- Arora et al. (2019) Arora, R., Marinov, T. V., and Mohri, M. Bandits with feedback graphs and switching costs. Advances in Neural Information Processing Systems, 32, 2019.
- Audibert & Bubeck (2009) Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In Annual Conference Computational Learning Theory, 2009.
- Audibert & Bubeck (2010) Audibert, J.-Y. and Bubeck, S. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res., 11:2785–2836, 2010.
- Auer (2003) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, 2003.
- Auer et al. (2002a) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002a.
- Auer et al. (2002b) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32:48–77, 2002b.
- Bubeck & Slivkins (2012) Bubeck, S. and Slivkins, A. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pp. 42–1. JMLR Workshop and Conference Proceedings, 2012.
- Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
- Cesa-Bianchi et al. (2004) Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. Minimizing regret with label efficient prediction. IEEE Transactions on Information Theory, 51:2152–2162, 2004.
- Cesa-Bianchi et al. (2013) Cesa-Bianchi, N., Dekel, O., and Shamir, O. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, volume 26, 2013.
- Dekel et al. (2013) Dekel, O., Ding, J., Koren, T., and Peres, Y. Bandits with switching costs: $T^{2/3}$ regret. arXiv preprint arXiv:1310.2997, 2013.
- Devroye et al. (2013) Devroye, L., Lugosi, G., and Neu, G. Prediction by random-walk perturbation. In Annual Conference Computational Learning Theory, 2013.
- Feng & Loh (2018) Feng, Z. and Loh, P.-L. Online learning with graph-structured feedback against adaptive adversaries. 2018 IEEE International Symposium on Information Theory (ISIT), pp. 931–935, 2018.
- Geulen et al. (2010) Geulen, S., Vöcking, B., and Winkler, M. Regret minimization for online buffering problems using the weighted majority algorithm. Electron. Colloquium Comput. Complex., TR10, 2010.
- Hazan (2016) Hazan, E. Introduction to online convex optimization. Found. Trends Optim., 2:157–325, 2016.
- Kaplan (2011) Kaplan, S. Power plants: characteristics and costs. DIANE Publishing, 2011.
- Littlestone & Warmuth (1989) Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. 30th Annual Symposium on Foundations of Computer Science, pp. 256–261, 1989.
- Neu (2015) Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS, 2015.
- Orabona (2019) Orabona, F. A modern introduction to online learning. ArXiv, abs/1912.13213, 2019.
- Rouyer et al. (2021) Rouyer, C., Seldin, Y., and Cesa-Bianchi, N. An algorithm for stochastic and adversarial bandits with switching costs. In International Conference on Machine Learning, 2021.
- Seldin et al. (2014) Seldin, Y., Bartlett, P. L., Crammer, K., and Abbasi-Yadkori, Y. Prediction with limited advice and multiarmed bandits with paid observations. In International Conference on Machine Learning, 2014.
- Shi et al. (2022) Shi, M., Lin, X., and Jiao, L. Power-of-2-arms for bandit learning with switching costs. In Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pp. 131–140, 2022.
- Simchi-Levi & Xu (2019) Simchi-Levi, D. and Xu, Y. Phase transitions in bandits with switching constraints. ERN: Other Econometrics: Mathematical Methods & Programming (Topic), 2019.
- Yao (1977) Yao, A. C.-C. Probabilistic computations: Toward a unified measure of complexity. 18th Annual Symposium on Foundations of Computer Science (sfcs 1977), pp. 222–227, 1977.
- Yun et al. (2018) Yun, D., Proutière, A., Ahn, S., Shin, J., and Yi, Y. Multi-armed bandit with additional observations. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2:1 – 22, 2018.
- Zhang et al. (2005) Zhang, Y., Murata, M., Takagi, H., and Ji, Y. Traffic-based reconfiguration for logical topologies in large-scale wdm optical networks. Journal of Lightwave Technology, 23:2854–2867, 2005.
- Zimmert et al. (2019) Zimmert, J., Luo, H., and Wei, C.-Y. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, pp. 7683–7692. PMLR, 2019.
Appendix A Motivating Examples for Online Learning with Switching Costs and Observation Budget
Consider a retail company that uses online learning to improve its website user interface (UI) design in order to attract more users. In this case, actions correspond to different UI designs. First, switching costs should be taken into account as frequently changing the website interface may become annoying to users. To evaluate other actions (different UI designs), the company can run an A/B test and display different interfaces to separate and relatively small groups of users so that the feedback of other actions is also available (in addition to the one displayed to the main and large population). However, each different website needs additional resources to be deployed and maintained, and hence, one may want to impose a total observation budget.
Another example would be Machine Learning as a Service (MLaaS). Consider a company that uses large ML models for jobs like prediction, chatbots, etc. They may train several different models and dynamically choose the best one via online learning. Changing the deployed ML model is not free: the new model needs to be loaded (which could be costly, especially nowadays when the number of parameters is quite large), and other components in the pipeline may also need to be adjusted accordingly. As a result, it is natural that redeploying or updating model components would incur a cost. While the performance of the deployed model is easily accessible, the company can also run these jobs using other models that are not deployed in the production system, to receive additional feedback. However, running these jobs consumes additional resources (e.g., computing and energy), which is not free either. Therefore, one may want to impose a budget on the number of additional observations (i.e., evaluations).
Appendix B Proof of Proposition 1
Proof of Proposition 1.
Our proof is built on Yao’s minimax principle (Yao, 1977). That is, we will establish a lower bound on the expected regret for any deterministic algorithm over a class of randomly generated loss sequences, which further implies the same lower bound for any randomized algorithm over some loss sequence in this class.
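For reference, the form of Yao’s principle used here can be stated as follows (in the notation of Section 2): for any distribution $\mathcal{D}$ over loss sequences,
$$\inf_{\pi \in \Pi}\, \sup_{l_{1:T}} \mathbb{E}\bigl[R_T^{\pi}(l_{1:T})\bigr] \;\ge\; \inf_{\pi \in \Pi_{\mathrm{det}}}\, \mathbb{E}_{l_{1:T} \sim \mathcal{D}}\bigl[R_T^{\pi}(l_{1:T})\bigr],$$
where $\Pi_{\mathrm{det}} \subseteq \Pi$ denotes the subset of deterministic algorithms. Hence, a lower bound on the right-hand side for a suitably chosen $\mathcal{D}$ implies the desired worst-case lower bound for randomized algorithms.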
To begin with, we would like to give some high-level ideas about the proof. Note that while the loss sequence generation method we adopt is the same as Algorithm 1 in Shi et al. (2022), we need a different analysis to establish the lower bound due to the different setting we consider. Specifically, in the original loss sequence construction based on the multi-scale random walk (Dekel et al., 2013), the optimal action has the lowest loss at all rounds. With bandit feedback, useful information toward distinguishability (between the optimal action and suboptimal actions) is gained only when the learner switches between actions. With full-information feedback, however, the learner can immediately identify the optimal action even at a single round. Therefore, to construct a hard instance (i.e., loss sequence) for the setting where the learner is equipped with additional observations beyond the bandit feedback, Shi et al. (2022) introduced an action-dependent noise in addition to the original loss (which is called the hidden loss). Now, the learner's information comes from two parts. On the one hand, the learner still gains distinguishability from switches (which is related to the hidden losses). On the other hand, conditioning on the hidden losses, the extra observations also provide additional information. Combining the two parts together, we obtain a lower bound related to both the number of switches and the number of extra observations. For convenience, we restate this loss sequence generation method in Algorithm 2. Specifically, we first generate the sequence $\{W_t\}_t$ according to the random walk design (Line 4). Next, we determine the loss before truncation, i.e., $\widetilde{l}_t(i)$ (Line 5): we first add an action-dependent noise $\xi_t(i)$ (an i.i.d. zero-mean Gaussian random variable with standard deviation $\sigma$) to $W_t$ for each action $i$, and then, for the optimal action only, we further subtract the gap $\epsilon$ (which is determined at the very beginning as an input to the algorithm) from the value obtained after adding $\xi_t(i)$. Finally, we truncate each $\widetilde{l}_t(i)$ onto the range $[0, 1]$ and obtain $l_t(i)$ (Line 6).
Next, we give some additional notations needed for this proof. Let $\mathcal{H}$ denote the observed loss sequence throughout the game. For any $k \in \{0, 1, \dots, K\}$, let $\mathbb{P}_k$ denote the conditional probability measure (over $\mathcal{H}$) given that the special (i.e., optimal) action is $k$, i.e., $\mathbb{P}_k(\cdot) := \mathbb{P}(\cdot \mid \text{the special action is } k)$. As a special case, when $k = 0$, $\mathbb{P}_0$ denotes the conditional probability measure under which all the actions are identical and there is no special action. Let $\mathbb{E}_k$ denote the conditional expectation under measure $\mathbb{P}_k$ (with $\mathbb{E}_0$ defined analogously). For two probability distributions $\mathbb{P}$ and $\mathbb{Q}$ on the same space, let $\mathrm{KL}(\mathbb{P} \,\|\, \mathbb{Q})$ denote the Kullback-Leibler (KL) divergence (i.e., relative entropy) between $\mathbb{P}$ and $\mathbb{Q}$, and let $d_{\mathrm{TV}}(\mathbb{P}, \mathbb{Q})$ denote the total variation distance between $\mathbb{P}$ and $\mathbb{Q}$.
Let $Z_t(k)$ be the indicator of whether it is switched from or to action $k$ between rounds $t-1$ and $t$, let $S_k := \sum_{t=2}^{T} Z_t(k)$ be the total number of action switches from or to action $k$, let $S$ be the total number of action switches, i.e., $S := \sum_{t=2}^{T} \mathbb{1}\{X_t \neq X_{t-1}\}$, and let $n_t$ be the number of extra observations made at round $t$ in addition to the bandit feedback. Then, we naturally have $\sum_{k \in [K]} S_k \le 2S$ and $\sum_{t=1}^{T} n_t \le B_{\mathrm{ex}}$, since the learning algorithm makes no more than $B_{\mathrm{ex}}$ extra observations in total. Let $R_T$ be the regret of the deterministic learning algorithm interacting with the loss sequence $l_{1:T}$, and let $\widetilde{R}_T$ be the (hypothetical) regret of the same action sequence with respect to the loss sequence $\widetilde{l}_{1:T}$.
In the following proof, we need Lemmas A.1 and A.4 of Shi et al. (2022) and Lemma 2 of Dekel et al. (2013). We restate these three results as Lemmas 2, 3, and 4, respectively.
Lemma 2.
(Shi et al., 2022, Lemma A.1 (restated)) The KL divergence between $\mathbb{P}_0$ and $\mathbb{P}_k$ can be upper bounded as follows:
Lemma 2 is obtained by considering five disjoint cases (which correspond to the five terms on the right-hand side), distinguished by the values of the switch indicators and the number of extra observations at each round. This lemma reveals the relationship between the KL divergence, the number of switches, and the number of extra observations, and it will be used for deriving Eq. (7).
Lemma 3.
Although the multi-scale random walk serves as a powerful and convenient tool to help us obtain the desired lower bound, one issue is that the losses could lie outside the range $[0, 1]$, which does not satisfy our problem setup. That is, based on the random walk, we can derive a lower bound on the regret with respect to the possibly unbounded loss sequence $\widetilde{l}_{1:T}$. However, our goal is to obtain a lower bound with respect to the bounded losses. To get around this issue, Lemma 3 presents a useful result: if $\epsilon$ and the noise level $\sigma$ satisfy certain conditions, then the difference between $\mathbb{E}[R_T]$ and $\mathbb{E}[\widetilde{R}_T]$ will not be too large, which is sufficient to give us the desired result.
Lemma 4.
Remark 10.
Lemma 4 relies on the careful design of the random walk. We refer interested readers to (Dekel et al., 2013, Section 3) for technical details. In the following proof, we will first bound the KL divergence in part by the number of switches between each round $t$ and its “parent” round $\rho(t)$ defined by the multi-scale random walk. This quantity is different from the switching costs we consider, as it roughly counts switches between rounds $\rho(t)$ and $t$ rather than between two consecutive rounds. To handle this difference, Lemma 4 builds a connection between the two and will be used for deriving Eq. (7).
With the above three restated lemmas, we are now ready to derive an upper bound on the KL divergence between $\mathbb{P}_0$ and $\mathbb{P}_k$. In particular, for any $k \in [K]$, we have
(7)
where (a) is from Lemma 2, (b) is obtained by enlarging the last four terms, (c) is obtained by applying the monotonicity property of probability to the first term and merging the last four disjoint events, and (d) is from Lemma 4 and the fact that the total number of extra observations is at most $B_{\mathrm{ex}}$. Note that Eq. (7) indicates that the KL divergence (which can be viewed as the information obtained by the learner) is related to both the number of switches and the amount of extra feedback. With the upper bound on the KL divergence in Eq. (7), we can also bound the total variation distance. Specifically, we have
(8)
where (a) is from Pinsker's inequality, (b) is from Eq. (7), (c) is from Jensen's inequality, and (d) is from the relationship between $S$ and the per-action switch counts $S_k$.
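For reference, the two standard facts invoked in steps (a) and (c) are Pinsker's inequality,
$$d_{\mathrm{TV}}(\mathbb{P}, \mathbb{Q}) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{Q})},$$
and Jensen's inequality applied to the concave function $\sqrt{\cdot}$, i.e., $\mathbb{E}[\sqrt{X}\,] \le \sqrt{\mathbb{E}[X]}$ for any nonnegative random variable $X$.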
With all the above results, we are ready to derive a lower bound on $\mathbb{E}[\widetilde{R}_T]$ after showing two intermediate steps. Let $N_k(t)$ be the number of times action $k$ is played up to round $t$ (which is a random variable). We first assume that the deterministic learning algorithm makes at most $S_0$ switches on any loss sequence, which will be used for deriving Eq. (9) below; we will later relax this assumption. Under this assumption, we have
(9)
where (a) is from the definition of $N_k(t)$, (b) is from rewriting the expectations, (c) is from the assumption of no more than $S_0$ switches, and (d) is from the definition of the total variation distance. Also, we have
(10)
where (a) is from the definition of $S$, (b) is from rewriting the expectations, and (c) is from the definition of the total variation distance.
We now lower bound the expected value of $\widetilde{R}_T$ as follows:
(11)
where (a) is from the regret definition (i.e., Eq. (1)), (b) is from Eqs. (9) and (10), (c) is from Eq. (8), (d) is from an elementary inequality ($\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$ for $x, y \ge 0$), and (e) is obtained by minimizing the resulting quadratic function of $\sqrt{\mathbb{E}_0[S]}$.
Now, we turn to lower bounding $\mathbb{E}[R_T]$. By Lemma 3, if $\epsilon$ and $\sigma$ satisfy the conditions stated there, then the difference between $\mathbb{E}[R_T]$ and $\mathbb{E}[\widetilde{R}_T]$ is negligible for our purposes. We first assume that these conditions hold and later verify that the selected $\epsilon$ indeed satisfies them. Then, we have
(12)
where the last step is from Eq. (11) and a proper choice of the remaining parameters.
We now consider two cases for $B_{\mathrm{ex}}$: $B_{\mathrm{ex}} \le c_0 T^{2/3}$ and $B_{\mathrm{ex}} > c_0 T^{2/3}$, for some constant $c_0 > 0$.
In the first case of $B_{\mathrm{ex}} \le c_0 T^{2/3}$, we have
(13)
where (a) is from $B_{\mathrm{ex}} \le c_0 T^{2/3}$, (b) is obtained by choosing $\epsilon = c_1 T^{-1/3}$ (where the constant $c_1$ is chosen to satisfy the conditions required by Lemma 3), and (c) holds for a sufficiently large $T$. Since $\epsilon = c_1 T^{-1/3}$, the assumed conditions on $\epsilon$ indeed hold when $T$ is sufficiently large.
In the second case of $B_{\mathrm{ex}} > c_0 T^{2/3}$, we have
(14)
where (a) is due to $B_{\mathrm{ex}} > c_0 T^{2/3}$, (b) is obtained by choosing $\epsilon = c_2/\sqrt{B_{\mathrm{ex}}}$ (where the constant $c_2$ is chosen to satisfy the conditions required by Lemma 3), and (c) again holds for a sufficiently large $T$ (applied to the second term). Since $\epsilon = c_2/\sqrt{B_{\mathrm{ex}}}$ and $B_{\mathrm{ex}} > c_0 T^{2/3}$, the assumed conditions on $\epsilon$ again hold when $T$ is sufficiently large.
Now, we relax the assumption that the deterministic learning algorithm makes no more than $S_0$ switches. Similar to the proof of Theorem 2 in Dekel et al. (2013), we consider the following: if the learning algorithm makes more than $S_0$ switches, we halt it at the point when exactly $S_0$ switches have been made and repeat its action after the last switch throughout the rest of the game. We use $R_T'$ to denote the regret of this halted algorithm over the same loss sequence as in $R_T$.
We consider two cases for the number of switches made by the original learning algorithm: $S \le S_0$ and $S > S_0$.
In the first case of $S \le S_0$, halting does not happen, and trivially, we have $R_T' = R_T$.
In the second case of $S > S_0$, the original learning algorithm makes more than $S_0$ switches. We use $t_0$ to denote the round index at which the $S_0$-th switch happens; the halted algorithm agrees with the original one up to round $t_0$. As a result, the halted algorithm keeps playing action $X_{t_0}$ from round $t_0$ to the end of the game. We now rewrite $R_T$ and $R_T'$ as follows:
By taking the difference between $R_T$ and $R_T'$ and then taking the expectation with respect to the randomness of the loss generation, we have
To see this, we first observe that at each round, the loss gap between actions is either or 0 in expectation (because the Gaussian noise we add has zero mean). Therefore, the first term can be bounded by (i.e., the gap at each round multiplied by the time horizon). Since the original learning algorithm makes more than switches, we have , which implies .
Combining the above two cases yields . Since we have already obtained a lower bound for (i.e., Eqs. (13) and (14)), the same lower bound also holds for (within a constant factor of 2).
Finally, we complete the proof by applying Yao’s principle. ∎
We also give two remarks about technical details in the proof of Proposition 1.
Remark 11.
To conclude that , Dekel et al. (2013) show that , which is stronger than what is needed, since a bound in expectation suffices. This stronger result relies on the fact that the loss gap between actions is either or 0 at each round. However, this may no longer hold after introducing additional action-dependent noise as in Shi et al. (2022). Despite this difference, one can still show , which is what we use to prove Proposition 1.
Remark 12.
Some readers may ask: if the deterministic algorithm switches more than times, do the switching costs alone not already imply the desired lower bound on the regret? Why is it necessary to show a reduction from switch-limited algorithms to arbitrary algorithms? The reason is that the lower bound of is obtained after choosing to be of order , and this choice is justified by the preceding analysis, which in turn rests on the assumption that the algorithm makes no more than switches. Therefore, the reduction from switch-limited algorithms to arbitrary algorithms is indeed necessary.
Appendix C Proof of Proposition 2
Before proving Proposition 2, we first present two lemmas on the properties of SD. They are straightforward extensions of Lemmas 1 and 2 in Geulen et al. (2010) to the batched setting; the key difference is that we reason over batches instead of rounds. While the proofs follow the same line of analysis, we provide them below for completeness.
Lemma 5.
For the instantiations and of Algorithm 1, over any loss sequence, we have
Proof of Lemma 5.
We first note that in these two instantiations of Algorithm 1, the feedback depends on the randomly chosen only, and is thus independent of what actions are taken by the learner throughout the game. As a result, the whole feedback sequence can be fixed even before the learner’s actions are determined.
In the following, we will show by induction that, conditioned on any random feedback sequence , it holds (almost surely) that .
The base case of is trivial due to the algorithm design (specifically, the uniform initialization of the action weights). We now proceed to the induction step, i.e., we show that for every , if it holds that , then it also holds that , as follows:
where (a) is due to Line 9 in Algorithm 1 (specifically, there are two disjoint cases for action to be played in batch : (i) action was played in batch and is kept unchanged, which happens with probability ; (ii) action is selected using fresh randomness, i.e., the previous action is not kept (which happens with probability ), regardless of which action was played in batch ), and (b) is from the inductive hypothesis that .
Finally, taking the expectation on both sides of the above completes the proof. ∎
Lemma 5 implies that SD does not change the marginal distribution of action selection. Hence, one still enjoys the same regret guarantee as that of standard OMD (i.e., without SD).
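To make this concrete, the following minimal Python sketch simulates one SD-style resampling step and numerically checks that the marginal of the newly selected action matches the target distribution, mirroring the induction step in the proof of Lemma 5. The keep probability w_new[a_prev] / w_prev[a_prev] and the multiplicatively shrunken weights are assumptions borrowed from the original (non-batched) Shrinking Dartboard rule of Geulen et al. (2010); the batched instantiations in Algorithm 1 apply the same principle at the batch level.

import numpy as np


def sd_step(a_prev, w_prev, w_new, rng):
    """One Shrinking-Dartboard-style resampling step (illustrative sketch).

    With probability w_new[a_prev] / w_prev[a_prev] the previous action is
    kept; otherwise a fresh action is drawn from p_new = w_new / sum(w_new).
    Weights are assumed to be coordinate-wise non-increasing (as with
    multiplicative-weights / OMD updates on non-negative losses), so the
    keep probability is a valid probability.
    """
    if rng.random() < w_new[a_prev] / w_prev[a_prev]:
        return a_prev
    return rng.choice(len(w_new), p=w_new / w_new.sum())


rng = np.random.default_rng(0)
K = 4
w_prev = np.array([0.4, 0.3, 0.2, 0.1])
w_new = w_prev * np.array([0.9, 0.6, 0.8, 0.5])  # shrunken weights
p_prev = w_prev / w_prev.sum()
p_new = w_new / w_new.sum()

# Induction step of Lemma 5, checked numerically: if the previous action is
# distributed according to p_prev, then after one SD step the new action is
# distributed according to p_new.
trials = 200_000
counts = np.zeros(K)
for _ in range(trials):
    a_prev = rng.choice(K, p=p_prev)  # inductive hypothesis
    counts[sd_step(a_prev, w_prev, w_new, rng)] += 1

print("empirical marginal:", np.round(counts / trials, 3))
print("target p_new:      ", np.round(p_new, 3))

The empirical marginal matches p_new (up to Monte Carlo error), while the lazy resampling keeps the previous action whenever possible, which is exactly the source of the switch savings quantified in Lemma 6.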
Lemma 6.
For the instantiations and of Algorithm 1, over any loss sequence, the expected number of switches satisfies the following:
Proof of Lemma 6.
Proof of Proposition 2.
Our goal here is to establish an upper bound on the expected regret of algorithm , which consists of two parts (cf. Eqs. (2) and (1)): (i) the standard regret in terms of loss and (ii) the switching cost. We bound each part separately and then combine the two bounds to obtain the result stated in the proposition.
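Written out with illustrative notation (the formal definitions are the ones in Eqs. (1) and (2)), this two-part decomposition of the expected regret reads
\[
\mathbb{E}[R_T] \;=\; \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(a_t)\Big] - \min_{a} \sum_{t=1}^{T} \ell_t(a)}_{\text{standard regret in terms of loss}} \;+\; \underbrace{\mathbb{E}\Big[\sum_{t=2}^{T} \mathbb{1}\{a_t \neq a_{t-1}\}\Big]}_{\text{switching cost}},
\]
where a_t denotes the action played at round t and each switch incurs a unit cost.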
To start with, let us establish an upper bound on the standard regret in terms of loss. To this end, we will build upon the classic analysis of OMD (cf. (Orabona, 2019, Section 6.6)). That is, for any (random) sequence , learning rate , and vector such that its -th coordinate is 1 and all the others are 0, it holds almost surely that
where and , i.e., Line 8 in Algorithm 1. Taking the expectation on both sides yields that
(15)
where (a) follows from the algorithm design of , i.e., and is a randomly selected time slot within batch , and (b) comes from the boundedness of losses, the fact that , and noting that .
We can further rewrite the left-hand side (LHS) of the above inequality as follows:
(16)
where (a) holds since for any , we have . This is true since under , the choice of for constructing is a round index chosen uniformly at random from the current batch .
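The unbiasedness behind step (a) can be summarized generically as follows (the symbols \mathcal{T}_m and \tau below are illustrative stand-ins for the set of rounds in batch m and the batch size): since the round index used to construct the estimate is drawn uniformly at random from the current batch,
\[
\mathbb{E}_{t_m \sim \mathrm{Unif}(\mathcal{T}_m)}\big[\ell_{t_m}(a)\big] \;=\; \frac{1}{\tau} \sum_{t \in \mathcal{T}_m} \ell_t(a) \qquad \text{for every action } a,
\]
so the per-batch estimate equals the batch-average loss in expectation.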
After obtaining the above upper bound on the standard regret (i.e., without switching costs), we now turn to bound the switching costs under . To this end, we directly leverage Lemma 6 along with the loss estimate to obtain that
(18)
Appendix D Proof of Proposition 3
Proof of Proposition 3.
The organization of this proof is the same as that of the proof of Proposition 2; the only difference lies in the loss estimate used in this instantiation.
We first consider the case when , i.e., can be any integer from . Recall that is the total observation budget and is the number of observations made in each batch. Similarly, for any (random) sequence , learning rate , and vector such that its -th coordinate is 1 and all the others are 0, it holds almost surely that
Taking the expectation on both sides yields that
(19)
where (a) follows from the algorithm design of , i.e., the loss estimate and is a randomly selected time slot within batch , and (b) comes from the boundedness of losses, the fact that , and noting that .
We can further rewrite the LHS of the above inequality as follows:
(20)
where (a) holds since for any , we have . This is true since under , the choices of both and for constructing are uniformly random and independent of each other.
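Although the exact loss estimate is defined in the description of this instantiation, the argument behind step (a) is of the following generic type (all symbols below are illustrative): if the round index t_m is uniform over the batch \mathcal{T}_m of size \tau and the observed action J_m is uniform over the K actions, independently of t_m, then
\[
\mathbb{E}\big[K \, \mathbb{1}\{J_m = a\} \, \ell_{t_m}(a)\big] \;=\; K \cdot \frac{1}{K} \cdot \frac{1}{\tau} \sum_{t \in \mathcal{T}_m} \ell_t(a) \;=\; \frac{1}{\tau} \sum_{t \in \mathcal{T}_m} \ell_t(a),
\]
so such an estimate is again unbiased for the batch-average loss.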
After obtaining the above upper bound on the standard regret (i.e., without switching cost), we now turn to bound the switching costs under . To this end, we directly leverage Lemma 6 along with the loss estimate to obtain that
(22)
Finally, combining Eqs. (21) and (22), we can bound the total regret as follows:
where (a) is from (recall that now we have and ), and (b) is obtained by choosing . Hence, we have completed the proof of Proposition 3.
The proof for the case of is exactly the same, except for the (implicit) fact that we need batch size to be well-defined, i.e., . That is why in this case is less flexible: now needs to be sufficiently large to fully exploit the total budget. ∎
Appendix E Proof of Proposition 4
Proof of Proposition 4.
The organization of this proof is the same as that of the proof of Proposition 2; the main difference lies in the commonly used importance-weighted estimator for bandit feedback. In addition, we note that it now suffices to directly bound the switching costs by the number of batches.
We again start from the same fundamental result of the OMD analysis. Specifically, for any (random) sequence , learning rate , and vector such that its -th coordinate is 1 and all the others are 0, it holds almost surely that
Taking the expectation on both sides, we have
(23)
where (a) follows from the algorithm design of , i.e., the loss estimate for all , (b) follows from the assumption that all losses are bounded by one and the fact that , and (c) is obtained by choosing .
We can further rewrite the LHS of the above inequality as follows:
(24)
where (a) holds since for any , we have .
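For concreteness, the standard importance-weighted estimator and the identity behind step (a) take the following form (the notation is illustrative: x_m is the sampling distribution in batch m, a_m \sim x_m the action played, and t_m the round whose loss is observed):
\[
\hat{\ell}_m(a) \;=\; \frac{\ell_{t_m}(a)\, \mathbb{1}\{a_m = a\}}{x_m(a)}, \qquad \mathbb{E}\big[\hat{\ell}_m(a) \,\big|\, x_m\big] \;=\; x_m(a) \cdot \frac{\ell_{t_m}(a)}{x_m(a)} \;=\; \ell_{t_m}(a).
\]
Moreover, \mathbb{E}\big[x_m(a)\,\hat{\ell}_m(a)^2 \,\big|\, x_m\big] = \ell_{t_m}(a)^2 \le 1, and summing this over the K actions is the step that typically produces the dependence on the number of actions in bounds such as Eq. (23).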
When , we have . Combining this with Eq. (25), we conclude that both the standard regret and the switching costs are of order , which gives the desired result. ∎
Appendix F An Auxiliary Technical Result
In this section, we show that in the standard setting without switching costs, using only bandit feedback (e.g., algorithm ) can also achieve optimal regret (i.e., matching the lower bound of , up to poly-logarithmic factors) in the full range of . We state this result in Proposition 5.
Proposition 5.
In the standard setting without switching costs, for any , the worst-case regret under algorithm is upper bounded by .
Proof of Proposition 5.
The proof follows the same line of analysis as that in the proof of Proposition 4, except that we only require (instead of ) and do not consider switching costs. Therefore, Eq. (25) implies the following upper bound on the regret:
Note that can be any fixed action, including the best fixed action. This completes the proof. ∎