An Information-Theoretic Analysis of Nonstationary Bandit Learning
Current Revision: December 2023
Abstract
In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the environment state. We view the optimal action sequence as a stochastic process and take an information-theoretic approach to analyze attainable performance. We bound per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects the problem’s information structure through its information ratio.
1 Introduction
We study the problem of learning in interactive decision-making. Across a sequence of time periods, a decision-maker selects actions, observes outcomes, and associates these with rewards. They hope to earn high rewards, but this may require investing in gathering information.
Most of the literature studies stationary environments, where the likelihood of outcomes under an action is fixed across time. (An alternative style of result lets the environment change, but aims only to compete with the best fixed action in hindsight.) Efficient algorithms limit the costs required to converge on optimal behavior. We study the design and analysis of algorithms in nonstationary environments, where converging on optimal behavior is impossible.
In our model, the latent state of the environment in each time period is encoded in a parameter vector. These parameters are unobservable, but evolve according to a known stochastic process. The decision-maker hopes to earn high rewards by adapting their action selection as the environment evolves. This requires continual learning from interaction and striking a judicious balance between exploration and exploitation. Uncertainty about the environment’s state cannot be fully resolved before the state changes, and this necessarily manifests in suboptimal decisions. Strong performance is impossible under adversarial forms of nonstationarity but is possible in more benign environments. Why are A/B tests and recommender systems widespread and effective even though nonstationarity is a ubiquitous concern? Quantifying the impact different forms of nonstationarity have on decision quality is, unfortunately, quite subtle.
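To make this model concrete, the following is a minimal simulation sketch, entirely our own illustration rather than a construct from the paper: it assumes a two-armed Gaussian bandit whose latent mean rewards drift as independent AR(1) processes, and the class name and parameter values are hypothetical.

```python
import numpy as np

# Illustrative sketch (not from the paper): a two-armed Gaussian bandit whose
# latent mean rewards drift as independent AR(1) processes. The decision-maker
# never observes theta directly -- only noisy rewards of the chosen actions.
class NonstationaryBandit:
    def __init__(self, n_arms=2, ar_coef=0.99, drift_std=0.1, noise_std=1.0, seed=0):
        self.rng = np.random.default_rng(seed)
        self.ar_coef = ar_coef        # persistence of the latent state
        self.drift_std = drift_std    # scale of per-period latent innovations
        self.noise_std = noise_std    # observation (reward) noise
        self.theta = self.rng.normal(size=n_arms)  # latent environment state

    def step(self, action):
        """Return a noisy reward for `action`, then evolve the latent state."""
        reward = self.theta[action] + self.noise_std * self.rng.normal()
        self.theta = (self.ar_coef * self.theta
                      + self.drift_std * self.rng.normal(size=self.theta.shape))
        return reward

    def optimal_action(self):
        # The latent optimal action process referred to in the text: the arm
        # maximizing expected reward under the current environment state.
        return int(np.argmax(self.theta))
```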
Our contributions.
We provide a novel information-theoretic analysis that bounds the inherent degradation of decision quality in dynamic environments. The work can be seen as a very broad generalization of Russo and Van Roy (2016, 2022) from stationary to dynamic environments. To understand our results, it is important to first recognize that the latent state evolution in these environments gives rise to a latent optimal action process. This process identifies an optimal action at each time step, defined as the one that maximizes expected reward under the current environment state. Our main theorem bounds per-period regret in terms of the entropy rate of this optimal action process. This rate reflects the extent to which changes in the environment lead to unpredictable and erratic shifts in the optimal action. Complementing this upper bound, we lower bound regret by the entropy rate of the optimal action process in a family of nonstationary bandit problems.
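Schematically, and suppressing the technical conditions of the formal statement, the bound has the following shape; here $\Gamma$ denotes an information ratio and $\overline{H}$ the entropy rate of the optimal action process $(A^*_t)_{t \in \mathbb{N}}$. The display is a sketch of the statement’s form, not a quotation of the theorem:
\[
\limsup_{T \to \infty} \frac{\mathbb{E}\!\left[\mathrm{Regret}(T)\right]}{T} \;\le\; \sqrt{\Gamma \cdot \overline{H}},
\qquad \text{where} \quad
\overline{H} \;=\; \lim_{T \to \infty} \frac{H(A^*_1, \ldots, A^*_T)}{T}.
\]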
Through examples, we illustrate that the entropy rate of the optimal action process may be much lower than the entropy rate of the latent environment states. A subsection below provides an illustrative example of nonstationarity, drawing inspiration from A/B testing scenarios. This distinction is further explored in two news recommendation examples developed later in the paper.
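The gap can be seen numerically. The sketch below, again our own illustration, continues the hypothetical NonstationaryBandit example above with slow drift: the latent state changes every period, yet the identity of the optimal arm flips rarely, so the optimal action process is nearly predictable. For a binary process that switches with empirical frequency $p$, the binary entropy $H_b(p)$ serves as a rough proxy for its entropy rate.

```python
# Illustrative estimate (our own, not from the paper): even though the latent
# state theta_t changes every period, the optimal arm rarely changes identity.
# A process that flips with probability p per period has entropy rate at most
# H_b(p) bits/period, which is small when p is small.
env = NonstationaryBandit(ar_coef=0.999, drift_std=0.02, seed=1)
T = 100_000
opt = np.empty(T, dtype=int)
for t in range(T):
    opt[t] = env.optimal_action()
    env.step(action=0)  # the latent state evolves regardless of the action

p = np.mean(opt[1:] != opt[:-1])  # empirical switching frequency
h_b = -p * np.log2(p) - (1 - p) * np.log2(1 - p) if 0 < p < 1 else 0.0
print(f"optimal action switches in {p:.2%} of periods; "
      f"entropy-rate proxy H_b(p) = {h_b:.3f} bits/period")
```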
We provide several generalizations of our main result. One addresses problems where the environment evolves so rapidly that aiming to track the optimal action sequence is futile, and the assurances of our main theorem become vacuous. We show that, regardless of how erratic the environment is, it is possible to earn rewards competitive with any weaker benchmark whose entropy rate is low; we call such a benchmark a satisficing action sequence.
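Schematically, and again as a sketch under our reading rather than a quotation of the formal result: if $\tilde{A} = (\tilde{A}_t)_{t \in \mathbb{N}}$ is any satisficing action sequence with entropy rate $\overline{H}(\tilde{A})$, per-period regret measured against $\tilde{A}$ admits a bound of the same form,
\[
\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\!\left[ R_{t,\tilde{A}_t} - R_{t,A_t} \right] \;\le\; \sqrt{\Gamma \cdot \overline{H}(\tilde{A})},
\]
where $R_{t,a}$ denotes the period-$t$ reward of action $a$ and $A_t$ the action chosen by the algorithm. The benchmark trades a lower target (it need not be optimal in every period) for a lower entropy rate, which keeps the bound meaningful even in rapidly evolving environments.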
A later section on rate-distortion interprets and generalizes the entropy rate through the lens of a communication problem.