Stochastic differential equations for limiting description of UCB rule for Gaussian multi-armed bandits
Abstract
We consider the upper confidence bound (UCB) strategy for Gaussian multi-armed bandits with known control horizon sizes and build its limiting description with a system of stochastic differential equations and ordinary differential equations. Rewards for the arms are assumed to have unknown expected values and known variances. A set of Monte-Carlo simulations was performed for the case of close distributions of rewards, when the mean rewards differ by a magnitude of order $N^{-1/2}$ (with $N$ the number of steps), since this case yields the highest normalized regret, in order to verify the validity of the obtained description. The minimal size of the control horizon at which the normalized regret is not noticeably larger than the maximum possible value was estimated.
Keywords:
Gaussian multi-armed bandit, UCB, stochastic differential equations, limiting description

1 Introduction
We consider a multi-armed bandit (MAB) problem. A MAB can be viewed as a slot machine with $J$ arms. Each of the arms can be selected for play, which yields some random income (reward). It is assumed that the number of times the gambler can play is known beforehand. This number $N$ is called the control horizon size. Formally, a MAB is a controlled stochastic process $\xi_n$, $n = 1, \dots, N$. The distribution of $\xi_n$ depends only on the chosen arm. The gambler's task is to maximize the total expected cumulative reward during the time of control.
MABs and corresponding algorithms are thoroughly analyzed in [13]. This is a reinforcement learning problem that exemplifies the exploration–exploitation trade-off, so it is also studied in machine learning [1, 3]. This problem is also known as the problem of adaptive control in a random environment [15] and the problem of expedient behavior [16]. MABs have also been used to model problems such as managing research projects in a large organization, like a science foundation or a pharmaceutical company [9, 2].
Further we consider a Gaussian MAB, whose rewards are normally distributed with probability density functions
\[
f_l(x) = \frac{1}{\sqrt{2\pi}\,\sigma_l} \exp\!\left(-\frac{(x - m_l)^2}{2\sigma_l^2}\right), \quad l = 1, \dots, J.
\]
We assume that the mean rewards $m_1, \dots, m_J$ are unknown. In [11, 6] it is shown that incorrect values of the variances only slightly affect overall rewards, so the variances can be estimated during the initial steps of control. For this reason the variances of the rewards are assumed to be known. Therefore, the considered MAB can be described by the vector parameter $\theta = (m_1, \dots, m_J)$.
A Gaussian MAB can be useful in a batch processing setting [14, 4, 10]. In that scenario the gambler can only change the choice of the arm after it has been used a given number of times (a batch). According to the central limit theorem, the sum of rewards for a batch has an approximately normal distribution (if the batch size is relatively large) for a wide range of distributions of single-step rewards (provided the variances of rewards are finite). Batch processing can also be performed in parallel.
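To illustrate this point, the following sketch (hypothetical helper names; a non-Gaussian single-step reward chosen purely for illustration) shows that the sum of rewards within a sufficiently large batch behaves approximately like a single normally distributed reward:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_reward(single_step_sampler, batch_size):
    """Total reward of one batch: the sum of batch_size single-step rewards.
    By the central limit theorem this sum is approximately normal when the
    batch size is large and the single-step variance is finite."""
    return single_step_sampler(batch_size).sum()

# Single-step rewards are exponential (clearly non-Gaussian) with mean 2 and
# standard deviation 2; the batch sums are nevertheless close to
# N(batch_size * 2, batch_size * 4).
sampler = lambda size: rng.exponential(scale=2.0, size=size)
sums = np.array([batch_reward(sampler, 500) for _ in range(10_000)])
print(sums.mean(), sums.std())  # roughly 1000 and 44.7
```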
When some control strategy is used, the expected cumulative reward will be lower than the maximal possible one. This happens because some number of steps is spent on exploring the distributions of the arms' rewards. The maximal cumulative reward would be obtained if the most lucrative arm were known and used at every step of the control. The expected difference between the maximal possible and the obtained rewards is called the regret and can be expressed as
\[
L_N(\sigma, \theta) = N \max(m_1, \dots, m_J) - \mathbf{E}_{\sigma, \theta}\!\left(\sum_{n=1}^{N} \xi_n\right),
\]
where $\mathbf{E}_{\sigma, \theta}$ denotes the expected value calculated with respect to the measure generated by the strategy $\sigma$ and the parameter $\theta$. Therefore, the goal of control is to minimize the regret.
The regret depends on the horizon size. To compare strategies for different horizon sizes it is reasonable to consider the scaled (normalized) regret
\[
\hat{L}_N(\sigma, \theta) = \frac{L_N(\sigma, \theta)}{\sqrt{DN}}
\]
for $D = \max(\sigma_1^2, \dots, \sigma_J^2)$.
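For reference, a minimal sketch of how the regret and its scaled version could be computed for a simulated run is given below; the normalization by $\sqrt{DN}$ and all function names are assumptions of this sketch:

```python
import numpy as np

def regret(total_reward, means, horizon):
    """Shortfall of the obtained total reward relative to always playing
    the best arm for the whole horizon."""
    return horizon * max(means) - total_reward

def normalized_regret(total_reward, means, horizon, variance=1.0):
    """Scaled regret, normalized here by sqrt(D * N); with unit variance
    this is simply L_N / sqrt(N)."""
    return regret(total_reward, means, horizon) / np.sqrt(variance * horizon)
```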
The single-step income on step $n$ can be expressed as
\[
\xi_l(n) = m_l + \sigma_l \zeta_n,
\]
where $\zeta_n$ is a standard normal random variable, $l = 1, \dots, J$, and $n = 1, \dots, N$.
Then the cumulative reward for using the $l$-th arm up to step $n$ can be expressed as
\[
X_l(n) = m_l n_l + \eta_l(n_l),
\]
where $\eta_l(n_l)$ is a normal random variable with zero mean and variance $\sigma_l^2 n_l$, which we obtain when we add up all of the rewards for the arm; $n_l$ is the number of times the $l$-th arm was used, $n_1 + \dots + n_J = n$; $l = 1, \dots, J$.
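As a quick illustration (with illustrative variable names), sampling the cumulative reward step by step is equivalent in distribution to drawing a single normal random variable with mean $m_l n_l$ and variance $\sigma_l^2 n_l$:

```python
import numpy as np

rng = np.random.default_rng(1)
m_l, sigma_l, n_l = 0.3, 1.0, 1000

# Sum of n_l single-step Gaussian rewards ...
x_stepwise = (m_l + sigma_l * rng.standard_normal(n_l)).sum()

# ... has the same distribution as one Gaussian draw with mean m_l * n_l
# and variance sigma_l**2 * n_l.
x_direct = m_l * n_l + sigma_l * np.sqrt(n_l) * rng.standard_normal()
```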
The UCB strategy (as described by Lai [12]) with the parameter $a > 0$ is defined as the selection of the arm maximizing the value
\[
U_l(n) = \frac{X_l(n)}{n_l} + \sigma_l \sqrt{\frac{a \ln n}{n_l}}.
\]
The strategy prescribes using each of the arms once at the beginning of control to make the initial estimates of the mean rewards.
The presence of the slowly growing term $\sigma_l \sqrt{a \ln n / n_l}$ ensures that every arm is used from time to time, even if its estimated mean reward has so far been lower than that of the other arms for some reason. That is how the UCB strategy negotiates the exploration–exploitation dilemma: the gambler strives to maximize the reward by using the arm with the highest estimated mean, but also collects information about the less explored arms.
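A minimal simulation sketch of this rule is given below; the index form (sample mean plus $\sigma_l \sqrt{a \ln n / n_l}$) follows the formula above, while the function name, default parameter value and tie-breaking are assumptions of the sketch:

```python
import numpy as np

def ucb_gaussian_bandit(means, sigmas, horizon, a=1.0, rng=None):
    """One run of the UCB rule for a Gaussian MAB with known variances.
    Returns the total collected reward."""
    rng = rng if rng is not None else np.random.default_rng()
    J = len(means)
    counts = np.zeros(J)   # n_l: how many times each arm was used
    sums = np.zeros(J)     # X_l: cumulative reward of each arm
    total = 0.0
    for n in range(1, horizon + 1):
        if n <= J:
            arm = n - 1    # play every arm once at the beginning
        else:
            ucb = sums / counts + np.asarray(sigmas) * np.sqrt(a * np.log(n) / counts)
            arm = int(np.argmax(ucb))
        reward = means[arm] + sigmas[arm] * rng.standard_normal()
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total
```

Averaging $N \max(m_1, \dots, m_J)$ minus the returned total over many runs and dividing by $\sqrt{DN}$ gives a Monte-Carlo estimate of the normalized regret.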
We aim to build a limiting description (with a system of stochastic differential equations) of the UCB strategy for a Gaussian multi-armed bandit as $N \to \infty$.
2 System of stochastic differential equations to describe UCB strategy for Gaussian two-armed bandit
First, we give a limiting description for the case of the two-armed bandit ($J = 2$).
In [5, 7] an invariant description of the UCB strategy for a MAB was given. The invariant description scales the control horizon to a unit size by scaling the step number to smaller values (proportional to $N^{-1}$). If $N \to \infty$, then the step value becomes infinitesimally small.
Differentials of the cumulative reward for using the $l$-th arm can be expressed as
\[
dX_l(t_l) = m_l\, dt_l + \sigma_l\, dW_l(t_l), \quad l = 1, 2.
\]
Here $W_1(t_1)$ and $W_2(t_2)$ are independent Wiener processes, and $t_l$ is a continuous variable that corresponds to the usage ratio of each arm. Each of these equations corresponds to a Wiener process with a constant drift. At each moment only one of the values $t_1$, $t_2$ can increase, as only one of the arms is used. Therefore, we consider two areas: in the first, $t_1$ increases with time $t$; in the second, $t_2$ increases. We use the indicators $\kappa_1$, $\kappa_2$ for convenience.
If the first arm is chosen, then $\kappa_1 = 1$ and $\kappa_2 = 0$. If the second arm is chosen, then $\kappa_1 = 0$ and $\kappa_2 = 1$.
Note that $dt_1 = \kappa_1\, dt$ and $dt_2 = \kappa_2\, dt$. Also remember that $\kappa_1 + \kappa_2 = 1$: one and only one arm is chosen at every moment.
Each of the areas is determined according to the chosen strategy. The UCB rule can be rewritten as the selection of the arm maximizing
\[
U_l(t) = \frac{X_l(t_l)}{t_l} + \sigma_l \sqrt{\frac{a \ln (tN)}{t_l}}, \quad l = 1, 2.
\]
The area where the first arm is chosen is defined by the condition $U_1(t) > U_2(t)$; otherwise the second arm is chosen. To write this condition formally we use the Heaviside step function $H(\cdot)$, whose value is zero for negative arguments and one for positive arguments.
For the UCB rule the indicator is a function of the arguments $X_1$, $X_2$, $t_1$, $t_2$, $t$:
\[
\kappa_1 = H\big(U_1(t) - U_2(t)\big), \qquad \kappa_2 = 1 - \kappa_1.
\]
Using this notation, we write the system of two Itô stochastic differential equations and one ordinary differential equation that describes the UCB strategy for the Gaussian two-armed bandit as
\[
\begin{cases}
dX_1 = \kappa_1 \left( m_1\, dt + \sigma_1\, dW_1 \right), \\
dX_2 = \kappa_2 \left( m_2\, dt + \sigma_2\, dW_2 \right), \\
dt_1 = \kappa_1\, dt,
\end{cases}
\]
where $t_2 = t - t_1$.
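A possible Euler–Maruyama discretization of this system is sketched below; the initial condition (a small $t_0 > 0$ with both arms used equally), the guard against logarithms of values below one, and all names are assumptions of the sketch rather than part of the derivation:

```python
import numpy as np

def simulate_limiting_ucb(m, sigma, N, a=1.0, dt=1e-3, rng=None):
    """Euler-Maruyama integration of the two-armed limiting system:
    dX_l = kappa_l * (m_l dt + sigma_l dW_l),  dt_1 = kappa_1 dt,  t_2 = t - t_1.
    Returns the total collected reward X_1 + X_2 and the final usage ratios."""
    rng = rng if rng is not None else np.random.default_rng()
    m, sigma = np.asarray(m, float), np.asarray(sigma, float)
    t = 2 * dt
    tl = np.array([dt, dt])          # t_1, t_2 > 0 so the UCB values are defined
    X = m * tl                       # noise-free initialization of cumulative rewards
    while t < 1.0:
        log_term = max(np.log(t * N), 0.0)   # guard: ln(t*N) can be negative early on
        U = X / tl + sigma * np.sqrt(a * log_term / tl)
        kappa = np.array([1.0, 0.0]) if U[0] >= U[1] else np.array([0.0, 1.0])
        dW = np.sqrt(dt) * rng.standard_normal(2)
        X = X + kappa * (m * dt + sigma * dW)
        tl = tl + kappa * dt
        t += dt
    return X.sum(), tl
```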
3 Simulation results
A series of Monte-Carlo simulations was performed to verify how well the obtained system of equations describes the behavior of the Gaussian two-armed bandit when the UCB strategy is used. Without loss of generality, we consider a Gaussian two-armed bandit with zero mean reward for the first arm ($m_1 = 0$). We observe how the normalized regret depends on the value of the mean reward $m_2$ of the second arm. Also, we assume $\sigma_1 = \sigma_2$: in [7] it is shown that the maximum normalized regret is observed when the variances of the arm rewards are equal; accordingly, we set $\sigma_1 = \sigma_2 = 1$.
Further we consider the case of “close” distributions of rewards, as this is when the highest values of regret are observed [17]. Its defining feature is that the difference of mean rewards has order $N^{-1/2}$:
\[
m_2 - m_1 = d N^{-1/2}.
\]
Without loss of generality we set $d \ge 0$.
For the Gaussian multi-armed bandit simulations, we consider several control horizon sizes $N$. Figure 1 contains the plots of the normalized regret vs different values of $d$. Data for the plots are averaged over 10000 simulations. Figure 1 also shows the normalized regrets calculated for numerical solutions of the presented stochastic differential equations for different values of $d$, averaged over 10000 simulations. We used the Euler–Maruyama method with two different step sizes.

We see that the normalized regret is about the same for different horizon sizes. Also, the obtained system of equations gives a fitting description of the UCB strategy. We note that the maximal normalized regret in these simulations (approximately 0.73) is attained at an intermediate value of $d$ (see Figure 1).
Having the limiting description allows us to answer the question of at what horizon size the normalized regret is not noticeably larger than the found maximal normalized regret. This question is of importance because the batch version of the algorithm (reported in [5, 7]) allows varying the batch sizes. Parallel processing is possible in that case, and it can result in shorter processing times.
Figure 2 shows the maximum normalized regret (in the domain of close distributions of rewards) vs the control horizon size; the left and right plots show different ranges of the horizon size. Data are averaged over 10000 simulations.

We see that for moderate horizon sizes and beyond, the normalized regret exceeds the value found via the limiting description only insignificantly. This means that for batch processing (including parallel processing), the batch size may be chosen fairly large as long as the number of processed batches remains moderate (i.e. 50 or more).
4 System of stochastic differential equations to describe UCB strategy for Gaussian MAB
Further we obtain a limiting description with a system of stochastic differential equations for the UCB strategy for a Gaussian multi-armed bandit with $J$ arms, using reasoning similar to the above. The considered MAB has unknown mean rewards $m_1, \dots, m_J$ and known variances of rewards $\sigma_1^2, \dots, \sigma_J^2$.
First, we express the upper confidence bounds as
\[
U_l(t) = \frac{X_l(t_l)}{t_l} + \sigma_l \sqrt{\frac{a \ln (tN)}{t_l}}, \quad l = 1, \dots, J.
\]
The indicator for the $l$-th arm will be equal to one if the corresponding upper confidence bound is greater than the confidence bounds of all other arms. We express it as a product of step functions
\[
\kappa_l = \prod_{k \ne l} H\big(U_l(t) - U_k(t)\big)
\]
for $l = 1, \dots, J$.
Then the system of stochastic differential equations and ordinary differential equations that describes the UCB strategy takes the form
\[
\begin{cases}
dX_l = \kappa_l \left( m_l\, dt + \sigma_l\, dW_l \right), & l = 1, \dots, J, \\
dt_l = \kappa_l\, dt, & l = 1, \dots, J - 1,
\end{cases}
\]
where $W_1, \dots, W_J$ are independent Wiener processes and $t_J = t - t_1 - \dots - t_{J-1}$.
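The indicator built as a product of Heaviside steps, together with a single Euler–Maruyama step of the $J$-armed system, could be sketched as follows (the tie-breaking rule and all names are assumptions of the sketch):

```python
import numpy as np

def indicators(U):
    """kappa_l as a product of Heaviside steps: equals 1 only for the arm whose
    upper confidence bound exceeds those of all other arms."""
    U = np.asarray(U, float)
    J = len(U)
    kappa = np.ones(J)
    for l in range(J):
        for k in range(J):
            if k != l:
                kappa[l] *= 1.0 if U[l] - U[k] > 0 else 0.0
    return kappa

def em_step(X, tl, t, m, sigma, N, a, dt, rng):
    """One Euler-Maruyama step of the J-armed system; X, tl, m, sigma are
    numpy arrays of length J."""
    J = len(m)
    U = X / tl + sigma * np.sqrt(a * max(np.log(t * N), 0.0) / tl)
    kappa = indicators(U)
    if kappa.sum() == 0.0:               # exact tie (probability zero): break by argmax
        kappa[int(np.argmax(U))] = 1.0
    dW = np.sqrt(dt) * rng.standard_normal(J)
    return X + kappa * (m * dt + sigma * dW), tl + kappa * dt, t + dt
```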
The presented system of equations can also be used to determine the horizon size at which the normalized regret converges. That value can then be used as the number of processed batches for the batch version of the strategy described in [5].
We consider the case of a three-armed bandit as an example, again with close distributions of rewards as described in Section 3. The highest normalized regret is observed when the two inferior arms have equal mean rewards that are somewhat lower than the mean reward of the best arm [8]. This happens because in this situation the strategy must distinguish between the best arm and two competing arms with fairly similar but lower rewards. Simulations show that the normalized regret converges already at a moderate horizon size, i.e. it is not noticeably greater than its value for larger horizon sizes.
5 Conclusion
We built and examined a limiting description (for horizon size $N \to \infty$) with a system of stochastic differential equations for the UCB strategy for the Gaussian multi-armed bandit.
A series of Monte-Carlo simulations was performed to verify the validity of the obtained results for the two-armed bandit in the case of “close” distributions of rewards. The maximum normalized regret for this case was determined. The minimal size of the control horizon at which the normalized regret is not noticeably larger than the maximum possible value was estimated.
Acknowledgements
The reported study was funded by RFBR, project number 20-01-00062.
References
- [1] Auer, P.: Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, 397–422 (2002). https://doi.org/10.1162/153244303321897663
- [2] Berry, D.A., Fristedt, B.: Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London/New York (1985)
- [3] Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press (2006). https://doi.org/10.1017/CBO9780511546921
- [4] Gao, Z., Han, Y., Ren, Z., Zhou, Z.: Batched multi-armed bandits problem. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019)
- [5] Garbar, S.: Invariant description for batch version of UCB strategy for multi-armed bandit. Journal of Physics: Conference Series 1658, 012015 (2020). https://doi.org/10.1088/1742-6596/1658/1/012015
- [6] Garbar, S.: Dependency of regret on accuracy of variance estimation for different versions of UCB strategy for Gaussian multi-armed bandits. Journal of Physics: Conference Series 2052, 012013 (2021). https://doi.org/10.1088/1742-6596/2052/1/012013
- [7] Garbar, S.: Invariant description of UCB strategy for multi-armed bandits for batch processing scenario. 24th International Conference on Circuits, Systems, Communications and Computers (CSCC), Chania, Greece, pp. 75–78 (2020). https://doi.org/10.1109/CSCC49995.2020.00021
- [8] Garbar, S., Kolnogorov, A.: Invariant description for regret of UCB strategy for Gaussian multi-arm bandit. Proceedings of the 6th Stochastic Modeling Techniques and Data Analysis International Conference with Demographics Workshop 44, 251–260 (2020)
- [9] Gittins, J.: Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. John Wiley & Sons, Chichester (1989)
- [10] Kolnogorov, A.: Parallel design of robust control in the stochastic environment (the two-armed bandit problem). Automation and Remote Control 73 (2012). https://doi.org/10.1134/S000511791204008X
- [11] Kolnogorov, A.: Gaussian two-armed bandit and optimization of batch data processing. Problems of Information Transmission 54, 84–100 (2018). https://doi.org/10.1134/S0032946018010076
- [12] Lai, T.: Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics 15 (1987). https://doi.org/10.1214/aos/1176350495
- [13] Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press (2020). https://doi.org/10.1017/9781108571401
- [14] Perchet, V., Rigollet, P., Chassang, S., Snowberg, E.: Batched bandit problems. The Annals of Statistics 44 (2015). https://doi.org/10.1214/15-AOS1381
- [15] Sragovich, V.: Mathematical theory of adaptive control. World Scientific Publishing (2005)
- [16] Tsetlin, M.: Automaton theory and the modelling of biological systems. Academic Press, New York (1973)
- [17] Vogel, W.: An asymptotic minimax theorem for the two armed bandit problem. The Annals of Mathematical Statistics 31 (1960). https://doi.org/10.1214/aoms/1177705907