Using Adaptive Experiments to Rapidly Help Students
Abstract
Adaptive experiments can increase the chance that current students obtain better outcomes from a field experiment of an instructional intervention. In such experiments, the probability of assigning students to conditions changes as more data is collected, so that students are more likely to be assigned to interventions that appear to perform better. Digital educational environments lower the barrier to conducting such adaptive experiments, but they are rarely applied in education. One reason might be that researchers have access to few real-world case studies that illustrate the advantages and disadvantages of these experiments in a specific context. We evaluate the effect of homework email reminders on students by conducting an adaptive experiment using the Thompson Sampling algorithm and compare it to a traditional uniform random experiment. We present this as a case study on how to conduct such experiments, and we raise a range of open questions about the conditions under which adaptive randomized experiments may be more or less useful.
Keywords:
Reinforcement learning · Randomized experiments · Multi-armed bandits · A/B testing · Field deployment

1 Introduction
Instructors frequently look for ways to support their students and improve their performance. With access to online learning environments, instructors can gather feedback at a larger scale. Since optimal instructional designs and scaffolds may not be known ahead of time, multiple possibilities can be tested using Uniform Random (UR) A/B experiments, also known as randomized controlled trials. In a UR A/B experiment, students are uniformly assigned to the different conditions that an instructor or researcher would like to test in order to learn about their relative effectiveness. One challenge of this approach is using the data rapidly enough to help current students. To address this, we can aim to maximize overall benefit by assigning most students to the more effective conditions as those become known.
Adaptive randomization is an effective strategy for assigning more students to the condition that currently appears most effective, while retaining the ability to test other conditions. We use a Multi-Armed Bandit (MAB) algorithm that uses machine learning to increase the number of students assigned to the currently most effective condition (or arm) [1], [7]. MABs are commonly used to act rapidly on incoming data in areas such as marketing, optimizing benefits for users while balancing exploration and exploitation [1], [3]. For this study, we used Thompson Sampling (TS), a probability matching algorithm in which the probability of assignment is proportional to the probability that the arm is optimal [1].
In this paper, we present a real-world experiment to illustrate the benefits and limitations of using UR A/B experiments and TS in educational settings. First, we use UR A/B experiments to evaluate different versions of emails in a homework reminder intervention to determine if a more effective version can be identified. We then compare the results of UR A/B experiments with the TS results to study its performance and benefits. Our experimental design allows us to compare classical balanced A/B comparisons side-by-side with a TS adaptive experiment to evaluate the trade-offs of using each of these methods.
2 Multi-Armed Bandit (MAB) Algorithms
To optimize the experience of students, we use the TS algorithm, which is designed to solve MAB problems and is useful for adaptively assigning participants to conditions [7].
The stochastic MAB problem is the problem of sequentially choosing from a discrete set of actions to maximize cumulative reward, where a reward is some measure of the effectiveness of the chosen action (arm). In this paper, we focus on the MAB problem with binary rewards. More precisely, we choose between $K$ versions (arms), and we denote the choice of action at step $t$ of the experiment by $a_t$. Assuming we choose the $k$-th arm, where $k \in \{1, \dots, K\}$, then $a_t = k$, and we receive a reward $r_t = 1$ with probability $p_k$ (and $r_t = 0$ otherwise).
TS shows strong empirical performance in maximizing the cumulative reward [5]. TS is a Bayesian algorithm that maintains a posterior distribution over each arm's reward probability $p_k$. In our case, we use a Beta prior with parameters $\alpha_k$ and $\beta_k$. Arms are chosen by sampling a value from the posterior of each arm and choosing the arm corresponding to the highest sample drawn. The posterior distribution is then updated based on the chosen action $a_t$ and the observed reward $r_t$. We use a uniform prior for each arm, $\mathrm{Beta}(\alpha_k = 1, \beta_k = 1)$, for all $k$.
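As a concrete illustration, below is a minimal Python sketch of TS with Bernoulli rewards and Beta(1, 1) priors. The number of steps and the true click rates are hypothetical and not taken from our study.

```python
# Minimal sketch of Thompson Sampling with Bernoulli rewards and Beta priors.
import numpy as np

rng = np.random.default_rng(0)

K = 4                # number of arms (email versions)
alpha = np.ones(K)   # Beta posterior parameters, starting from a uniform Beta(1, 1) prior
beta = np.ones(K)

def choose_arm():
    """Sample one value from each arm's posterior and pick the arm with the largest sample."""
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

def update(arm, reward):
    """Update the chosen arm's posterior with a binary reward (click / no click)."""
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Simulated interaction loop with hypothetical true click rates (for illustration only).
true_p = [0.20, 0.22, 0.18, 0.19]
for t in range(1000):
    arm = choose_arm()
    reward = rng.binomial(1, true_p[arm])
    update(arm, reward)
```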
3 Methods: Traditional and Adaptive Experimentation
For three consecutive weeks, we tested four different versions of the email to investigate which might be more effective at leading students to click on the homework link included in the email. To evaluate how TS adapts the distribution of students across email versions, we sent the messages in four batches on consecutive days of the week (Tuesday to Friday).
Using UR, each of the four email reminder versions has the same probability of being assigned to a student. For the TS algorithm, the probability of assigning each email reminder version depends on the rewards (in our case, clicks) observed in all previous batches, with the posterior updated after each batch.
For our interventions, we used two different variations of the TS algorithm. For Weeks 1 and 2, we used a UR-TS hybrid in which the TS updating of the probability that an arm has the highest click rate also used data from the UR participants. This hybrid is called [0.5]-TS because, with probability epsilon = 0.5, arms are assigned using UR. This takes inspiration from past algorithms such as epsilon-greedy [6] and top-two TS [4]. The design is interesting to investigate because scientists may want the benefits of data obtained under UR (in case TS introduces biases [2]) for analysis, while also using that data to improve the performance of the TS algorithm. For Week 3, we applied traditional TS that did not use the data from the UR assignment.
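The following sketch illustrates one way the batched [0.5]-TS hybrid described above could be implemented. The batch handling, variable names, and the `pool_ur_data` flag are our own illustrative choices (pooling UR outcomes into the posterior as in Weeks 1 and 2, or ignoring them as in Week 3), not the study's actual code.

```python
# Sketch of a batched UR / Thompson Sampling hybrid ([0.5]-TS): with probability
# epsilon = 0.5 a student is assigned uniformly at random, otherwise TS assigns the arm.
# Posteriors are only updated between batches. Names and batch sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
K, epsilon, pool_ur_data = 4, 0.5, True   # pool_ur_data=False mimics the Week 3 variant
alpha, beta = np.ones(K), np.ones(K)

def assign_batch(n_students):
    """Return (arm, used_ur) pairs for one batch under the current posterior."""
    assignments = []
    for _ in range(n_students):
        if rng.random() < epsilon:                               # UR portion of the hybrid
            arm, used_ur = int(rng.integers(K)), True
        else:                                                    # TS portion
            arm, used_ur = int(np.argmax(rng.beta(alpha, beta))), False
        assignments.append((arm, used_ur))
    return assignments

def update_after_batch(assignments, rewards):
    """Update posteriors once the batch's clicks (0/1 rewards) are observed."""
    for (arm, used_ur), r in zip(assignments, rewards):
        if used_ur and not pool_ur_data:
            continue                                             # Week-3-style TS ignores UR data
        alpha[arm] += r
        beta[arm] += 1 - r

# Example: one batch of 200 students with hypothetical true click rates.
true_p = [0.20, 0.22, 0.18, 0.19]
batch = assign_batch(200)
rewards = [int(rng.binomial(1, true_p[arm])) for arm, _ in batch]
update_after_batch(batch, rewards)
```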
4 Analysis and Results
Using a panel regression with week fixed effects (i.e., including indicators for Week2-[0.5]-TS and Week3-TS), we can aggregate the uniform portion of the experiment across the three weeks to evaluate the effects of the four arms on student click rate. The average click rate in our sample is similar across arms: all four arms' click rates are within 2% of each other, and their differences are not statistically significant regardless of time and participant effects. The absence of an optimal arm suggests that all students were assigned to fairly similar treatments. These results are robust to also including student fixed effects within the panel regression model.
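For concreteness, below is a sketch of such a panel regression using statsmodels. The data file and column names (`click`, `arm`, `week`, `student_id`) are assumptions about the data layout, not the study's actual code or schema.

```python
# Sketch: regress clicks on arm indicators with week fixed effects, assuming a
# long-format DataFrame with one row per sent email. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("email_reminder_clicks.csv")  # hypothetical file: click, arm, week, student_id

# Arm indicators plus week fixed effects (dummies for Week2-[0.5]-TS and Week3-TS),
# with standard errors clustered by student.
model = smf.ols("click ~ C(arm) + C(week)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["student_id"]}
)
print(model.summary())

# Robustness check: also include student fixed effects.
model_fe = smf.ols("click ~ C(arm) + C(week) + C(student_id)", data=df).fit()
print(model_fe.params.filter(like="C(arm)"))
```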
In both of the later weeks, the adaptive algorithm ([0.5]-TS in Week 2 and TS in Week 3) behaves similarly, favouring one particular arm. Some arms are assigned a substantially higher number of students and the others far fewer, but the seemingly favoured arm varies across the two weeks and is not consistent within Week2-[0.5]-TS. In Table 1, we show the cumulative click rate per arm and per batch, which influenced the probability of assignment (see the sketch after Table 1). Even though the probability of assignment aligns with the click rates from the previous batch, the algorithm drastically shifts toward the arm with the highest reward and leaves minimal room for exploration in batch 3. As the click rates were not consistent across batches, we are unable to identify a single most effective arm, which is reflected in the cumulative rewards in Table 1 and could be a result of there being minimal differences between the arms.
Table 1. Cumulative click rate (CCR) and probability of assignment (PA) per arm and per batch.

| Week | Batch | Arm 1 CCR | Arm 1 PA | Arm 2 CCR | Arm 2 PA | Arm 3 CCR | Arm 3 PA | Arm 4 CCR | Arm 4 PA |
|---|---|---|---|---|---|---|---|---|---|
| Week1-[0.5]-TS | Batch 1 | 0.200 | 0.117 | 0.277 | 0.659 | 0.212 | 0.177 | 0.167 | 0.047 |
| | Batch 2 | 0.219 | 0.466 | 0.220 | 0.443 | 0.149 | 0.040 | 0.152 | 0.052 |
| | Batch 3 | 0.206 | 0.434 | 0.209 | 0.452 | 0.137 | 0.017 | 0.163 | 0.097 |
| | Batch 4 | 0.163 | 0.077 | 0.205 | 0.697 | 0.161 | 0.104 | 0.162 | 0.122 |
| Week2-[0.5]-TS | Batch 1 | 0.213 | 0.116 | 0.086 | 0.000 | 0.246 | 0.225 | 0.300 | 0.659 |
| | Batch 2 | 0.198 | 0.082 | 0.140 | 0.004 | 0.244 | 0.311 | 0.266 | 0.603 |
| | Batch 3 | 0.197 | 0.065 | 0.186 | 0.041 | 0.261 | 0.666 | 0.235 | 0.229 |
| | Batch 4 | 0.194 | 0.052 | 0.209 | 0.109 | 0.249 | 0.505 | 0.240 | 0.334 |
| Week3-TS | Batch 1 | 0.231 | 0.477 | 0.153 | 0.056 | 0.220 | 0.389 | 0.157 | 0.078 |
| | Batch 2 | 0.233 | 0.926 | 0.137 | 0.030 | 0.135 | 0.033 | 0.120 | 0.011 |
| | Batch 3 | 0.183 | 0.510 | 0.129 | 0.039 | 0.133 | 0.070 | 0.174 | 0.381 |
| | Batch 4 | 0.181 | 0.405 | 0.152 | 0.086 | 0.144 | 0.076 | 0.181 | 0.433 |
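Under TS, the probability of assignment reported in Table 1 is tied to the probability that each arm is optimal under the current posteriors, which can be estimated by Monte Carlo sampling. The sketch below illustrates one way such values could be computed; the posterior counts are hypothetical, not the study's data.

```python
# Estimate, by Monte Carlo, the probability that each arm's Beta posterior sample is the
# largest -- one way assignment probabilities like those in Table 1 could be computed.
import numpy as np

def prob_arm_is_optimal(alpha, beta, n_draws=100_000, seed=0):
    rng = np.random.default_rng(seed)
    draws = rng.beta(alpha, beta, size=(n_draws, len(alpha)))  # one sample per arm per draw
    best = draws.argmax(axis=1)
    return np.bincount(best, minlength=len(alpha)) / n_draws

# Hypothetical posterior counts (prior 1 + successes, prior 1 + failures) per arm.
alpha = np.array([1 + 20, 1 + 28, 1 + 21, 1 + 17], dtype=float)
beta = np.array([1 + 80, 1 + 72, 1 + 79, 1 + 83], dtype=float)
print(prob_arm_is_optimal(alpha, beta))
```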
5 Discussion & Limitations
We present a case study of adaptive experimentation, using the Thompson Sampling MAB algorithm, in a setting where no difference between arms is observed (i.e., the arms/conditions are equally effective) according to a traditional Uniform Random experiment. We illustrate how TS may arbitrarily favour an arm, even when giving this arm more frequently has no consequences for participants. The TS algorithm minimizes regret, aiming to keep participants from being assigned to sub-optimal arms, so in the case where arms are equivalent, one could argue that any of them could be presented. However, this can be problematic for scientific inference and statistical analysis [2].
One limitation is that unobserved confounding variables may have caused the click rates to change in a particular batch (e.g., students had an assignment due or a test), which would also affect the algorithm; this is one concern to keep in mind when applying MAB algorithms for adaptive experimentation.
6 Conclusion & Future Work
This paper provides an example of a real-world intervention using adaptive experiments, which can help instructors and researchers use the results from experiments to more rapidly benefit students. We illustrate an instance of conducting such experiments using the TS algorithm, where the results suggest there is no difference between the arms/conditions. We hope that this paper provides a first step for instructors and researchers to investigate adaptive experimentation in education. Future work can explore how the algorithm behaves in a wider variety of scenarios, such as different batch sizes and structures, and alternative differences between the arms/conditions.
7 Acknowledgements
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) (#RGPIN-2019-06968), as well as by the Office of Naval Research (ONR) (#N00014-21-1-2576).
References
- [1] Lomas, J.D., Forlizzi, J., Poonwala, N., Patel, N., Shodhan, S., Patel, K., Koedinger, K., Brunskill, E.: Interface design optimization as a multi-armed bandit problem. In: Proceedings of the 2016 CHI conference on human factors in computing systems. pp. 4142–4153 (2016)
- [2] Rafferty, A., Ying, H., Williams, J.: Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. Journal of Educational Data Mining 11(1), 47–79 (2019)
- [3] Rafferty, A.N., Ying, H., Williams, J.J.: Bandit assignment for educational experiments: Benefits to students versus statistical power. In: International Conference on Artificial Intelligence in Education. pp. 286–290. Springer (2018)
- [4] Russo, D.: Simple bayesian algorithms for best arm identification. In: Conference on Learning Theory. pp. 1417–1418 (2016)
- [5] Sutton, R.S., Barto, A.G.: Reinforcement learning: An introduction. MIT press (2018)
- [6] Watkins, C.J.C.H.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge (1989)
- [7] Williams, J.J., Rafferty, A.N., Tingley, D., Ang, A., Lasecki, W.S., Kim, J.: Enhancing online problems through instructor-centered tools for randomized experiments. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. pp. 1–12 (2018)