The Nah Bandit: Modeling User Non-compliance in Recommendation Systems
Abstract
Recommendation systems now pervade the digital world, ranging from advertising to entertainment. However, it remains challenging to implement effective recommendation systems in the physical world, such as in mobility or health. This work focuses on a key challenge: in the physical world, it is often easy for the user to opt out of taking any recommendation if it is not to their liking, and to fall back to their baseline behavior. It is thus crucial in cyber-physical recommendation systems to operate with an interaction model that is aware of such user behavior, lest the user abandon the recommendations altogether. This paper thus introduces the Nah Bandit, a tongue-in-cheek reference to describe a bandit problem where users can say ‘nah’ to the recommendation and opt for their preferred option instead. As such, this problem lies between a typical bandit setup and supervised learning. We model user non-compliance by parameterizing an anchoring effect of recommendations on users. We then propose the Expert with Clustering (EWC) algorithm, a hierarchical approach that incorporates feedback from both recommended and non-recommended options to accelerate user preference learning. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT\epsilon)$, where $\epsilon$ is the average error of representing users by cluster centroids, achieving superior theoretical performance in the short term compared to the LinUCB algorithm. Experimental results also highlight that EWC outperforms both supervised learning and traditional contextual bandit approaches. This advancement reveals that effective use of non-compliance feedback can accelerate preference learning and improve recommendation accuracy. This work lays the foundation for future research on the Nah Bandit problem, providing a robust framework for more effective recommendation systems.
Index Terms:
Online preference learning, Contextual bandit, Non-compliance, Clustering, Recommendation system, Expert advice
I Introduction
Online recommendation systems have been widely applied in the digital world, such as personalized news recommendations [1], advertisement placements [2], and search engines [3]. However, implementing effective recommendation systems in the physical world, such as in shopping [4, 5], mobility [6, 7], or health [8, 9], remains challenging. For instance, while bandit approaches have been widely used in digital interactions by modeling the recommendation process as pulling arms of the bandit [1], they often overlook a key scenario in the physical world: users can easily opt out of taking any recommended option if it is not to their liking and revert to their baseline behavior. It is thus crucial for cyber-physical recommendation systems to operate with an interaction model that is aware of such behavior to prevent users from abandoning the recommendations altogether. We name this problem the Nah Bandit, a tongue-in-cheek reference to the setting where users say ‘nah’ to the recommendation and opt for their preferred option instead. Our research addresses the Nah Bandit problem by considering both the impact of recommendations and the choices users make regarding non-recommended options.
Contextual bandits have been widely applied in digital interactions such as news recommendations and advertisement placements. In these settings, people may only have access to the recommended options. Therefore, the only form of non-compliance is to take none of the options and leave the service. However, picking an option that is not recommended is prevalent in the physical world. For example, consider the scenario where a customer is shopping in a store. All the items are presented to the customer in the showcase. A store clerk might recommend certain items to a customer, but the customer does not always purchase the recommended ones. The customer can choose any item in the showcase that he prefers, and the clerk can observe which items the customer eventually buys. This information allows the clerk to make more personalized recommendations during the customer’s next visit. Examples like these are prevalent, such as in shopping [4, 5] and mobility recommendations [6, 7].
Both supervised learning and traditional contextual bandit methods fail to address the Nah Bandit problem effectively. Supervised learning methods, such as classification-based methods, decision-tree-based methods, and neural networks, assume that the user selects from all options but do not account for the influence of recommendations on users, known as the anchoring effect. Conversely, contextual bandit methods, such as LinUCB [1], Thompson Sampling [10], and NeuralUCB [11], do not incorporate feedback from non-recommended items because they assume that the user only selects from recommended options, which hinders their ability to quickly capture user preferences. The Nah Bandit problem requires both efficiently understanding the anchoring effect and rapidly identifying user preferences from non-compliance.
TABLE I: Relationship between the Nah Bandit, the traditional bandit, and supervised learning.

| | User selects from recommended options | User selects from all options |
| --- | --- | --- |
| User is influenced by recommendations | Bandit | Nah Bandit (This work) |
| User is not influenced by recommendations | N/A | Supervised learning |
It has been widely studied that the anchoring effect can significantly influence user choices [12, 13, 14, 15, 16, 17]. [16] found that, in the aggregate, the anchoring effect is linear and proportional to the size of the recommendation perturbation. Building on the findings of [16], we propose a user non-compliance model that parameterizes the anchoring effect for each user. This is the simplest method for solving the Nah Bandit problem: it reduces the bias introduced by the anchoring effect when learning user preferences. We further prove the sample complexity of parameter estimation in the user non-compliance model by transforming it into logistic regression. This result shows the speed at which we can learn user preference parameters in the Nah Bandit problem.
To rapidly capture user preferences, algorithms such as user-based Collaborative Filtering [18] use the similarity between users to make recommendations. Some approaches further assume a network structure [19, 20] or a hierarchical structure [21, 22] among users to make recommendations. Similar to these works, we assume a hierarchical structure in the Nah Bandit problem. We propose a novel hierarchical contextual bandit algorithm—Expert with Clustering (EWC). EWC uses the user non-compliance model and clustering to determine the preference parameters of each user cluster. It then views each cluster as an expert and uses the Hedge [23] algorithm to select the expert that best predicts user choices. The likelihood that at least one expert accurately predicts the user choice is high, regardless of compliance. This leads to rapid identification of the user's cluster identity in the Nah Bandit problem. We further establish the regret bound of EWC. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ user clusters, we demonstrate that EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT\epsilon)$. This regret bound underscores the theoretical efficacy of EWC in the short term compared to LinUCB [1]. We validate EWC in two different applications: travel route and restaurant recommendations. Experimental results highlight that EWC achieves superior performance compared to both supervised learning and traditional contextual bandit approaches.
This paper extends and subsumes our preliminary work [24], where we introduced the EWC algorithm for a travel route recommendation problem. We proposed a Loss-guided Distance metric to enhance the performance of clustering in EWC and proved a regret bound for EWC. However, our previous work focused on a specific travel problem that included only two options per decision, and it did not formally define the Nah Bandit problem. The support vector machine (SVM) framework used for offline training in that work is not adaptable to scenarios with multiple options and, more importantly, cannot address the anchoring effect in the Nah Bandit problem. In this work, we formally introduce a novel bandit framework—Nah Bandit—for modeling the online preference learning problem. This framework distinctively incorporates observable user non-compliance, offering the potential to accelerate preference learning. We propose the user non-compliance model as the simplest method to solve the Nah Bandit problem. Compared to SVM, this model adapts to scenarios with multiple options by computing the utility of each option, and it reduces the bias from the anchoring effect by parameterizing the user's dependence on recommendations. We further combine the user non-compliance model with EWC, allowing EWC to efficiently utilize non-compliance feedback and thereby enabling rapid and accurate learning in the Nah Bandit problem. Additionally, by incorporating user context into the preference learning process, EWC improves the speed of adapting to user preferences. We validate our proposed method EWC against a comprehensive set of baselines in multiple applications. We also conduct an ablation study to assess the impact of each component. These enhancements and extensive evaluations help us better understand the EWC algorithm and its potential applications. Overall, this research demonstrates that effectively utilizing non-compliance feedback can accelerate preference learning and enhance recommendation accuracy. Our work establishes a foundation for future research into the Nah Bandit problem, providing a robust framework for developing more effective recommendation systems.

I-A Contributions
We summarize our contributions as follows:
1. We introduce a novel bandit framework—Nah Bandit—for modeling the online preference learning problem, in which users may say ‘nah’ to the recommendation and choose their originally preferred option instead. This framework distinctively incorporates observable user non-compliance, offering the potential to accelerate preference learning.
2. We propose a user non-compliance model as the simplest way to solve the Nah Bandit problem, which parameterizes the anchoring effect. We analyze its sample complexity to show the speed at which we can learn user preference parameters. Based on this model, we propose a hierarchical contextual bandit framework, Expert with Clustering (EWC). This framework effectively utilizes non-compliance feedback and hierarchical information, enabling rapid and accurate learning of user preferences.
3. We establish the regret bound of EWC. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ user clusters, EWC achieves a regret bound of $O(N\sqrt{T\log K} + NT\epsilon)$, achieving superior theoretical performance in the short term compared to LinUCB.
II Related Works
II-A Supervised Online Recommendation
The field of recommendation systems has been significantly shaped by various supervised learning methods. A foundational approach is Collaborative Filtering (CF), which bases its recommendations on the similarity between users [18] or items [25]. Matrix Factorization (MF) [26], a specialized form of collaborative filtering, is notable for decomposing the user-item interaction matrix into lower-dimensional matrices representing the latent user and item factors. Recent advancements have led to the introduction of online versions of these traditional methods, like Online Collaborative Filtering [27] and Online Matrix Factorization [28], to address the dynamic nature of user preferences and real-time data.
Decision tree-based methods, such as Gradient Boosting Machines [29] and Random Forests [30], together with their online versions [31, 32], represent another class of supervised learning methods in recommendation systems. These methods are widely used in the area of advertisements due to their ability to learn nonlinear features. XGBoost [33], an important variant of Gradient Boosting Machines, provides fast and accurate recommendations and is used as a baseline in our experiments.
However, these supervised recommendation methods often overlook the potential impact of the anchoring effect. Our research aims to address this oversight by reducing the bias introduced by the anchoring effect to make more accurate recommendations.
II-B Contextual Bandits with Clustering
The contextual bandit framework offers an efficient solution for recommendation systems. This concept was initially explored by [34], focusing on the use of confidence bounds within the exploration-exploitation trade-off, particularly in contextual bandits with linear value functions. Building on this, [1] extended the framework to personalized news recommendations by introducing the LinUCB algorithm, setting a benchmark in the field. More recently, advanced algorithms such as NeuralUCB [11] have been developed, leveraging neural networks to model complex, non-linear relationships in the data, further enhancing the flexibility and effectiveness of contextual bandits. The integration of clustering techniques into the contextual bandit framework represents a significant advancement in the field. The foundational unsupervised learning algorithm, K-Means clustering, introduced by [35], laid the groundwork for this development. Based on this, [36] proposed a personalized recommendation algorithm using hierarchical tag clustering. With the development of contextual bandits, [22] proposed a novel approach to clustering users within the contextual bandit framework, using graphs to depict user similarities. A notable progression in this area is the DYNUCB algorithm [21], a dynamic clustering bandit algorithm that divides the population of users into multiple clusters and customizes the bandits for each cluster. Additionally, [37] proposed the LOCB approach, which enhances performance by focusing on the accuracy of clustering and its impact on regret minimization in bandit algorithms. Building on these works, we also adopt clustering to segment users into several groups and tailor recommendations to each group.
II-C Recommendation systems with user non-compliance
We refer to user non-compliance as any action taken by the user that is not among the recommended action(s). Contextual bandit methods have been extended to consider user abandonment, for example, deleting an app or leaving before re-engaging later. Users may abandon the system for a variety of reasons, including fatigue from overexposure to irrelevant content or boredom from seeing too many similar recommendations [38], lack of trust [39], or non-alignment with the user’s immediate self-interests [40]. Some solutions proposed include incorporating the risk of user abandonment into the recommendation algorithms [41, 42, 43], or never offering recommendations that would yield lower expected reward than a user would have achieved if she acted alone [40]. Our work considers a softer form of user non-compliance, in which the user still selects an option within the same action class (e.g., mobility trip option), albeit not a recommended one. Our algorithm seeks to quickly learn user preferences by acknowledging such non-compliance and learning from these user actions. This novel approach provides a holistic view of user preferences, which is crucial for understanding the comparative utility of options and accelerating preference learning, especially in scenarios with limited data.
III Nah Bandit
III-A Definition of Nah Bandit
In the Nah Bandit problem, users may say ‘nah’ to the recommendation and choose their originally preferred option instead. This means the losses (or rewards) come from the user's choice. We define the Nah Bandit in a recommendation problem as follows. This framework combines elements of both supervised learning and partial feedback in a bandit setting.
Definition 1.
A Nah Bandit is a scenario where a provider is tasked with recommending a set of options $\mathcal{A} = \{a_1, a_2, \dots, a_M\}$ to a user, where $M$ represents the number of options. At each decision round $t$ within the total rounds $T$, the provider recommends one option, labeled $r_t \in \mathcal{A}$, to the user. Subsequently, the provider observes the user's choice $y_t \in \mathcal{A}$, which need not equal $r_t$, and incurs a loss $\ell(r_t, y_t)$ from the user's choice. The objective is to minimize the cumulative regret over all decision rounds $T$.
Conversely, in the traditional bandit framework for recommendation systems, the user is assumed to select only from the recommended options, and therefore, the losses (or rewards) are derived solely from those recommended options.
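To make the interaction protocol concrete, the following is a minimal simulation sketch of one Nah Bandit round. It assumes linear utilities and a scalar anchoring bonus; the names (`nah_bandit_round`, `anchor`) are illustrative, not part of the formal model.

```python
import numpy as np

def nah_bandit_round(theta, options, rec, anchor=0.5):
    """Simulate one Nah Bandit round under a linear-utility user model.

    theta:   (d,) user preference parameter.
    options: (M, d) context vectors of all M options.
    rec:     index of the recommended option.
    anchor:  additive anchoring bonus for the recommended option.
    """
    utilities = options @ theta         # user's private utility of each option
    utilities[rec] += anchor            # anchoring effect of the recommendation
    choice = int(np.argmax(utilities))  # the user may say 'nah' and pick another
    loss = float(choice != rec)         # provider observes the choice either way
    return choice, loss

# Example: the user declines the recommendation when another option is better.
rng = np.random.default_rng(0)
choice, loss = nah_bandit_round(rng.normal(size=3), rng.normal(size=(4, 3)), rec=0)
```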
III-B Comparison with Bandit and Supervised Learning
The Nah Bandit differs from both the traditional contextual bandit and supervised learning. The bandit framework utilizes feedback only from the recommended option, and therefore its exploration-exploitation trade-off inherently handles the anchoring effect of recommendations on users. In contrast, the supervised learning framework assumes the user's choice and feedback are independent of the recommendation, allowing the user to choose any option. In the Nah Bandit problem, the user can choose any option, and we use the user's feedback to learn their preferences. The feedback remains partial because different recommendations can lead to different outcomes. For example, some users might rely heavily on recommendations and are more likely to choose the recommended option, while others may carefully select from all options to choose the one that best matches their preferences. Table I compares the Nah Bandit, the traditional bandit problem, and supervised learning.
If we adopt supervised learning approaches to solve the Nah Bandit problem, they may lead to sub-optimal recommendations because they do not account for the anchoring effect of recommendations on the feedback. Conversely, if we adopt traditional bandit approaches, the recommended options are viewed as the pulled arms of the bandit. In this case, we receive zero reward from the pulled arm if the user chooses none of the recommended options, and the feedback from the user’s actual choice is not used to update the recommendation system. The Nah Bandit framework offers a potential solution concept to accelerate preference learning by effectively utilizing non-compliance feedback.
III-C Problem Formulation
We further incorporate the contexts of users and options in the Nah Bandit. Additionally, we extend the recommendation from a single-user to a multi-user setting and assume a hierarchical structure among users. We formulate our problem as follows.
Consider a scenario where a provider is tasked with recommending a set of options, denoted as $\mathcal{A}$, to a population of users with a hierarchical structure, with the total number of users being $N$. Each user, identified as $u$ in the set $\{1, 2, \dots, N\}$, is represented by a unique user context vector $x_u$. At each decision-making round $t$ within the total rounds $T$, the provider is presented with the user context $x_u$ and a specific subset of options $\mathcal{A}_{u,t} \subseteq \mathcal{A}$, where $M$ represents the number of options in the subset. For simplicity, we assume that the value of $M$ remains constant across all subsets $\mathcal{A}_{u,t}$. Each option, indexed by $i \in \{1, \dots, M\}$, is defined by an option context vector $x_{u,t,i}$. Upon receiving this information, the provider recommends one option, labeled $r_{u,t}$, from the set $\mathcal{A}_{u,t}$ to the user. Subsequently, the provider observes the user's choice, denoted as $y_{u,t}$, and incurs a loss $\ell(r_{u,t}, y_{u,t})$, determined by a predefined loss function $\ell(\cdot, \cdot)$ known to the provider. It is important to note that the user's choice $y_{u,t}$ may be influenced by the recommended option $r_{u,t}$.
The objective of this scenario is to minimize the total cumulative regret over all users and decision rounds. This is mathematically formulated as $\min \sum_{u=1}^{N}\sum_{t=1}^{T}\big[\ell(r_{u,t}, y_{u,t}) - \ell(r^{*}_{u,t}, y^{*}_{u,t})\big]$, where $r^{*}_{u,t}$ and $y^{*}_{u,t}$ denote, respectively, the recommendation and the resulting user choice when the user's preferences are fully known.
IV A User Non-compliance Model
IV-A Model Description
A key assumption in our problem is that the user's choice may be influenced by the anchoring effect. This leads to a scenario of partial feedback akin to a contextual bandit setting, where learning user preferences can be challenging. [16] uses a rating drift, defined as the actual rating minus the predicted rating, to represent the anchoring effect. They found that, in aggregate, the rating drift is linear and proportional to the size of the recommendation perturbation. This means that the more we recommend one option, the higher the rating the user will give to that option. However, the slope of this linear relationship, which represents the user's additional preference for the recommended options, is unknown. Building on [16], we propose a user non-compliance model to discern this additional preference, thereby addressing the issue of the anchoring effect.
Assume that each user $u$ has a fixed but unknown preference parameter $\theta_u$. Given $\theta_u$, we can make predictions using a known function $f(\cdot\,; \theta_u)$. The key idea is that we assume there exists a component $\theta_u^{r}$ within $\theta_u$ that quantifies the additional preference toward recommended options. Users with a high $\theta_u^{r}$ rely heavily on the recommended option, while users with $\theta_u^{r} = 0$ select the option with the highest utility for them, regardless of the recommendation. Our goal is to learn this $\theta_u$ for each user.
We propose a user non-compliance model, which is a linear model that parameterizes the user's dependence on the recommendation. First, we incorporate a context $x^{r}_{u,t,i}$ in each option context $x_{u,t,i}$ which represents the degree to which this item is recommended to the user. For example, $x^{r}_{u,t,i} = \mathbb{1}\{i = r_{u,t}\}$. The utility of each option is then defined as $v_{u,t,i} = \theta_u^\top x_{u,t,i}$. Let $\mathbf{v}_{u,t} = (v_{u,t,1}, \dots, v_{u,t,M})$ represent the utility vector. The probability of selecting each option is predicted as $\hat{p}_{u,t} = \mathrm{softmax}(\mathbf{v}_{u,t})$, where $\mathrm{softmax}(\cdot)$ denotes the softmax function. Let $p_{u,t}$ be the one-hot encoding of $y_{u,t}$. The discrepancy between the predicted probability and the actual choice is quantified using the KL-divergence $D_{\mathrm{KL}}(p_{u,t} \parallel \hat{p}_{u,t})$. The detailed methodology is encapsulated in Algorithm 1.
This approach is the simplest way to solve the Nah Bandit problem. It provides a supervised learning way to learn the user's preference parameter $\theta_u$. It reduces the bias in the learning process that might be introduced by the anchoring effect, thereby preventing the user non-compliance model from falling into sub-optimal recommendations.
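As a concrete illustration, the following is a minimal sketch of this model for a single user, assuming linear utilities and gradient descent on the cross-entropy (equivalent to the KL term above up to a constant); the helper names and hyperparameters are illustrative, not those of Algorithm 1.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax."""
    e = np.exp(v - v.max())
    return e / e.sum()

def fit_noncompliance_model(options, recs, choices, lr=0.1, epochs=200):
    """Fit one user's preference parameter theta by gradient descent.

    options: (T, M, d) option context vectors over T rounds.
    recs:    (T,) indices of the recommended option at each round.
    choices: (T,) indices of the option the user actually chose.
    The recommendation indicator is appended as one extra context
    dimension, so theta[-1] captures the anchoring effect.
    """
    T, M, d = options.shape
    theta = np.zeros(d + 1)
    for _ in range(epochs):
        grad = np.zeros_like(theta)
        for t in range(T):
            # Append the "was recommended" indicator to each option's context.
            rec_flag = (np.arange(M) == recs[t]).astype(float)
            X = np.hstack([options[t], rec_flag[:, None]])  # (M, d+1)
            p_hat = softmax(X @ theta)
            p = np.eye(M)[choices[t]]  # one-hot of the user's choice
            # Gradient of the cross-entropy between p and p_hat w.r.t. theta.
            grad += X.T @ (p_hat - p)
        theta -= lr * grad / T
    return theta
```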
IV-B Sample Complexity Analysis
This section presents a sample complexity analysis of user preference parameter estimation in the user non-compliance model. If we let the model use the contexts of only two options to update the model, where one is the user's choice and the other is not, the model is equivalent to logistic regression.
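To see why, note that with two options the softmax over linear utilities collapses to a sigmoid of the utility gap; the following derivation makes this reduction explicit, writing $x_1$ for the chosen option's context and $x_2$ for the other.

```latex
% Two-option softmax reduces to a sigmoid of the context difference,
% i.e., logistic regression on x = x_1 - x_2.
P(y = a_1 \mid x_1, x_2)
  = \frac{e^{\theta^\top x_1}}{e^{\theta^\top x_1} + e^{\theta^\top x_2}}
  = \frac{1}{1 + e^{-\theta^\top (x_1 - x_2)}}
  = \sigma\!\big(\theta^\top (x_1 - x_2)\big).
```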
Lemma 1 (Sample Complexity of Parameter Estimation in Logistic Regression (Theorem 4 in [44])).
Consider a logistic regression model with input $x \in \mathbb{R}^d$ and output $y \in \{-1, +1\}$. The parameter space is the unit sphere $\mathbb{S}^{d-1} = \{\theta \in \mathbb{R}^d : \lVert\theta\rVert_2 = 1\}$. The model is $P(y = 1 \mid x) = \sigma(\beta \langle \theta, x \rangle)$, where $\sigma(\cdot)$ is the sigmoid function, $\beta$ is the inverse temperature, and $\theta$ is the parameter of the model. The observed data $(x_i, y_i)_{i=1}^{n}$ are independent copies of $x \sim \mathcal{N}(0, I_d)$ and $y$ with unknown parameter $\theta^*$. For any fixed $\epsilon > 0$ and $\delta \in (0, 1)$, assume the sample size satisfies $n \ge C\, n_0(d, \beta, \epsilon, \delta)$, where $n_0$ is polynomial in $d$, $1/\beta$, $1/\epsilon$, and $\log(1/\delta)$, and $C$ is an absolute constant. Then with probability at least $1 - \delta$, the empirical risk minimizer $\hat{\theta}$ achieves $\lVert\hat{\theta} - \theta^*\rVert_2 \le \epsilon$.
Theorem 1 (Sample Complexity of Parameter Estimation in the User Non-compliance Model).
Let the user non-compliance model use the contexts of only two options to update the model, where one is the user's choice and the other is not. Assume $\lVert\theta_u\rVert_2 = 1$. Suppose the observed data $(x_t, y_t)_{t=1}^{T}$ are i.i.d. samples from a distribution determined by $\theta_u$, where $x_t = x_{t,1} - x_{t,2} \sim \mathcal{N}(0, I_d)$ is the difference between the two option contexts. Assume the user non-compliance model outputs the empirical risk minimizer $\hat{\theta}_u$. For any fixed $\epsilon > 0$ and $\delta \in (0, 1)$, if $T \ge C\, n_0(d, \epsilon, \delta)$, with $n_0$ polynomial in $d$, $1/\epsilon$, and $\log(1/\delta)$ and $C$ an absolute constant, then with probability at least $1 - \delta$, we have $\lVert\hat{\theta}_u - \theta_u\rVert_2 \le \epsilon$.
Lemma 1 gives the sample complexity of parameter estimation in a logistic regression model. Using Lemma 1, we obtain Theorem 1, which gives the sample complexity of estimating the user preference parameter in the user non-compliance model. The proof of Theorem 1 is in Appendix A-D. This result demonstrates the speed at which we can learn user preference parameters in the Nah Bandit problem.
V Expert with Clustering
V-A General Framework
Another core aspect of our problem is rapidly identifying user preferences based on both compliance and non-compliance. To address this, we introduce the Expert with Clustering (EWC) algorithm, a novel hierarchical contextual bandit approach. EWC consists of both an offline training phase and an online learning phase. It transforms the recommendation problem into a prediction-with-expert-advice problem, using clustering to construct experts during the offline training phase and employing the Hedge algorithm to select the most effective expert during the online learning phase.
Prediction with expert advice is a classic online learning problem introduced by [23]. Consider a scenario in which a decision maker has access to the advice of $K$ experts. At each decision round $t \in \{1, \dots, T\}$, advice from these $K$ experts is available, and the decision maker selects an expert based on a probability distribution $\mathbf{p}_t$ and follows the selected expert's advice. Subsequently, the decision maker observes the loss of each expert, denoted as $\boldsymbol{\ell}_t = (\ell_{t,1}, \dots, \ell_{t,K})$. The primary goal is to perform as well as the best expert in hindsight, which essentially translates to minimizing the regret $R = \sum_{t=1}^{T} \langle \mathbf{p}_t, \boldsymbol{\ell}_t \rangle - \sum_{t=1}^{T} \ell_{t, k^*}$, where $k^*$ is the best expert throughout the $T$ rounds.
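As a reference point, here is a minimal sketch of the Hedge update in this setting; the learning rate `eta` and the loss matrix are illustrative assumptions.

```python
import numpy as np

def hedge(expert_losses, eta=0.5, seed=0):
    """Run Hedge on a (T, K) matrix of expert losses in [0, 1].

    At each round, follow an expert drawn from the normalized weights,
    then exponentially down-weight every expert by its observed loss.
    """
    T, K = expert_losses.shape
    w = np.ones(K)
    rng = np.random.default_rng(seed)
    picks = []
    for t in range(T):
        p = w / w.sum()
        picks.append(int(rng.choice(K, p=p)))   # follow the sampled expert
        w *= np.exp(-eta * expert_losses[t])    # multiplicative weights update
    return picks, w / w.sum()
```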
We cast the recommendation problem into the framework of prediction with expert advice in the following way. Recall the assumption that each user $u$ has a fixed but unknown preference parameter $\theta_u$. Given this, the EWC algorithm operates under the assumption of a cluster structure within the users' preference parameters $\{\theta_u\}_{u=1}^{N}$.
In the offline training phase, utilizing a set of offline training data, where $N'$ and $T'$ are the number of users and decision rounds in the training data, we initially employ the user non-compliance model as a learning framework to determine each user's preference parameter $\theta_u$. Despite differences between training and testing data, both are sampled from the same distribution. This allows for an approximate determination of $\theta_u$, providing insights into the hierarchical structure among users, albeit with some degree of approximation. Subsequently, a clustering method is applied on $\{\theta_u\}_{u=1}^{N'}$ to identify centroids $\{c_k\}_{k=1}^{K}$ and each user's cluster affiliation $z_u$, where $c_k \in \mathbb{R}^d$ and $z_u \in \{1, \dots, K\}$. The number of clusters $K$ serves as a hyperparameter. We select the value of $K$ that yields the minimum regret on the offline training set.
Each centroid $c_k$ is considered an expert. In the online learning phase, using the Hedge algorithm, we initialize the experts' weights and, at every online decision round $t$, select an expert $k_t$. An expert provides advice suggesting that a user's preference parameters closely resemble the centroid $c_{k_t}$. Consequently, we use this centroid to estimate the user's preferences. The recommendation is then formulated; for example, $r_{u,t} = \arg\max_{i} c_{k_t}^\top x_{u,t,i}$, where $x_{u,t,i}$ is the context of option $i$. Upon receiving the user's chosen option $y_{u,t}$, we calculate the loss for each expert and update the weights in Hedge based on this loss. The loss for each expert $k$ is determined by a known loss function $\ell(\cdot, \cdot)$; for example, $\ell_{t,k} = \mathbb{1}\{\hat{y}_{t,k} \ne y_{u,t}\}$, where $\hat{y}_{t,k} = \arg\max_{i} c_k^\top x_{u,t,i}$ is the choice predicted by expert $k$. The details of this process are encapsulated in Algorithm 2.
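Putting the pieces together, the sketch below runs one user's online phase under the example recommendation and loss rules above; it assumes the centroids come from offline training, and the function and parameter names are illustrative rather than those of Algorithm 2.

```python
import numpy as np

def ewc_online(centroids, rounds, eta=1.0, w0=None, seed=0):
    """Online phase of EWC for a single user (a sketch).

    centroids: (K, d) cluster centroids from offline training, one per expert.
    rounds:    iterable of (X, choice) pairs, where X is the (M, d) option
               context matrix of a round and choice is the user's pick.
    w0:        optional initial expert weights (e.g., from user context).
    """
    centroids = np.asarray(centroids, dtype=float)
    K = len(centroids)
    w = np.ones(K) if w0 is None else np.asarray(w0, dtype=float)
    rng = np.random.default_rng(seed)
    recs = []
    for X, choice in rounds:
        k = int(rng.choice(K, p=w / w.sum()))          # Hedge picks an expert
        recs.append(int(np.argmax(X @ centroids[k])))  # recommend its best option
        preds = np.argmax(X @ centroids.T, axis=0)     # each expert's prediction
        losses = (preds != choice).astype(float)       # 0-1 loss per expert
        w *= np.exp(-eta * losses)                     # Hedge weight update
    return recs, w / w.sum()
```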
The EWC algorithm efficiently utilizes non-compliance feedback in both the offline training and the online learning phases. In offline training, the learning framework within EWC is an interchangeable module that can be implemented using various models, such as SVM or neural networks. Compared to other models, the user non-compliance model captures preferences from both compliance and non-compliance feedback with low bias. In online learning, the likelihood that at least one expert accurately predicts the user choice is high, regardless of compliance. This leads to rapid identification of the user's cluster identity in the Nah Bandit problem.
V-B Clustering with Loss-guided Distance
The core parameter influencing the regret in our model is the set of centroids $\{c_k\}_{k=1}^{K}$. An accurately representative set of centroids can significantly reflect users' behaviors, whereas poorly chosen centroids may lead to suboptimal performance. In our simulations, we observed that the standard K-Means algorithm has limitations, as preference parameters that are similar in Euclidean space do not necessarily yield similar predictions of user choices.
To address the limitation of K-Means clustering, researchers in fields such as federated learning [45, 46] and system identification [47] have devised bespoke objective functions to enhance clustering methodologies. Inspired by [45], we introduce a distance metric guided by the loss function, tailored for online preference learning. Our objective is to ensure that preference parameters within the same cluster exhibit similar performance. Thus, we replace the traditional L2-norm distance with the prediction loss incurred when assigning centroid $c_k$ to user $u$. Let $\hat{y}_{u,t,k}$ denote the choice predicted for user $u$ at round $t$ by centroid $c_k$, and let $p_{u,t}$ and $\hat{p}_{u,t,k}$ be the one-hot encodings of $y_{u,t}$ and $\hat{y}_{u,t,k}$, respectively. The Loss-guided Distance is defined as $d(u, c_k) = \sum_{t=1}^{T'} \ell(\hat{p}_{u,t,k}, p_{u,t})$. The detailed clustering is presented in Algorithm 3.
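The following is a minimal sketch of K-Means with this loss-guided assignment step, assuming 0-1 prediction loss and mean-based centroid updates; names and defaults are illustrative, not those of Algorithm 3.

```python
import numpy as np

def loss_guided_kmeans(thetas, user_rounds, K, iters=20, seed=0):
    """K-Means with a loss-guided assignment step (a sketch).

    thetas:      (N, d) per-user preference parameters from offline training.
    user_rounds: list of N lists of (X, choice) pairs, X of shape (M, d).
    Assignment uses each centroid's prediction loss on the user's own
    choices instead of the Euclidean distance; the update step is the
    usual mean of the assigned users' parameters.
    """
    rng = np.random.default_rng(seed)
    N, d = thetas.shape
    centroids = thetas[rng.choice(N, K, replace=False)].copy()

    def pred_loss(c, rounds):
        # 0-1 loss of recommending the argmax utility under centroid c.
        return sum(int(np.argmax(X @ c) != y) for X, y in rounds)

    for _ in range(iters):
        # Assignment step: loss-guided distance to each centroid.
        z = np.array([
            np.argmin([pred_loss(c, user_rounds[u]) for c in centroids])
            for u in range(N)
        ])
        # Update step: recompute centroids as cluster means.
        for k in range(K):
            if np.any(z == k):
                centroids[k] = thetas[z == k].mean(axis=0)
    return centroids, z
```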
V-C Accelerating Learning with User Context
In our model, we capitalize on the user context to facilitate accelerated preference learning during the initial phase. We hypothesize a latent relationship between the user context and the user's cluster affiliation. In the offline training phase, we utilize user context vectors $x_u$ along with the users' cluster labels $z_u$ to train a logistic regression model, denoted as $g$. This model is designed to map the user context to a probabilistic distribution over the potential cluster affiliations.
During the online learning phase, we employ the trained logistic regression model to predict the probability of each user's group affiliation based on their respective context. These predicted probabilities are then used to initialize the weights of the experts for each user, i.e., $w_{u,k} = g(x_u)_k$ for all $k \in \{1, \dots, K\}$.
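A minimal sketch of this warm start, assuming scikit-learn's LogisticRegression as the probabilistic classifier; the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def init_expert_weights(user_contexts, cluster_labels, new_user_context):
    """Warm-start Hedge weights from user context (a sketch).

    Fit a logistic regression from offline user contexts to cluster
    labels, then use its predicted class probabilities as the initial
    expert weights for a new user.
    """
    clf = LogisticRegression(max_iter=1000).fit(user_contexts, cluster_labels)
    return clf.predict_proba(np.asarray(new_user_context).reshape(1, -1))[0]
```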
VI Regret Analysis
VI-A Regret Bound of EWC
In our problem, we define the loss function as the 0-1 loss on the user's choice, $\ell(\hat{y}, y) = \mathbb{1}\{\hat{y} \ne y\}$. We define the regret of EWC as the performance difference between EWC and recommendation with known user preference parameters $\theta_u$:

$$R_{\text{EWC}} = \mathbb{E}\left[\sum_{u=1}^{N}\sum_{t=1}^{T} \ell(r_{u,t}, y_{u,t}) - \ell(r^{*}_{u,t}, y^{*}_{u,t})\right] \qquad (1)$$

where $r^{*}_{u,t}$ and $y^{*}_{u,t}$ are the recommendation and the resulting user choice when $\theta_u$ is known.
Since the study in [48] analyzes the performance of K-Means clustering using the squared $\ell_2$-norm distance, we similarly adopt the $\ell_2$-norm distance to analyze regret in our framework. Theorem 2 is our main theoretical result. The proof is in Appendix A-A. The regret bound of EWC can be represented as a sum of two components: the regret from the Hedge algorithm and the bias introduced by representing users with centroids.
Theorem 2 (Regret Bound of EWC).
Let $P$ be any distribution of $\theta \in \mathbb{R}^d$ with mean $\mu$, variance $\sigma^2$, and finite kurtosis. Let $k^{*}_u$ be the optimal expert for user $u$, and let $c_{k^{*}_u}$ be the corresponding centroid. If $\ell$ is Lipschitz continuous in $\theta$ for all option contexts, with Lipschitz constant $L$, Frobenius distance, and dimension normalization, then with probability at least $1 - \delta$, the regret of EWC is bounded by:

$$R_{\text{EWC}} \le N\sqrt{\frac{T}{2}\ln K} + LTN\left(\mathbb{E}_{\theta \sim P}\Big[\min_{k}\lVert \theta - c_k \rVert\Big] + \epsilon_K(\delta)\right) \qquad (2)$$

where the first term is the regret from Hedge and $\epsilon_K(\delta)$ is the uniform deviation term of K-Means induced by inequality (5).
The Gaussian Mixture Model (GMM) aligns closely with our hypothesis of a hierarchical structure among users, which is a typical assumption in the analysis of clustering algorithms. By assuming that the distribution of users’ preferences follows a GMM, we derive Corollary 1.
Corollary 1.
Assume $P$ is a Gaussian Mixture Model (GMM) with $K$ Gaussian distributions, each of which has weight $\pi_k$, mean $\mu_k$, and covariance $\Sigma_k$, and the clustering outputs the optimal centroids $c_k = \mu_k$ for $k \in \{1, \dots, K\}$. Define $\epsilon = \sum_{k=1}^{K} \pi_k\, \mathbb{E}_{\theta \sim \mathcal{N}(\mu_k, \Sigma_k)}\big[\lVert\theta - \mu_k\rVert\big]$ to be the average loss caused by the centroids. With probability at least $1 - \delta$, the regret of EWC is bounded by

$$R_{\text{EWC}} \le N\sqrt{\frac{T}{2}\ln K} + LTN\epsilon. \qquad (3)$$
The proof of Corollary 1 is provided in Appendix A-B. EWC does not achieve sublinear regret in the long term because it uses the cluster centroid as an estimate of user-specific preferences at every round. However, if the estimation error is low, indicated by a small $\epsilon$, EWC achieves low regret in the short term.
VI-B Comparison to LinUCB
Lemma 2 (Regret Bound of SupLinUCB (proved by [49])).
Assume that for all rounds $t$ and options $i$, the reward satisfies $r_{t,i} \in [0, 1]$ and there exists $\theta^* \in \mathbb{R}^d$ with $\lVert\theta^*\rVert \le 1$ s.t. $\mathbb{E}[r_{t,i} \mid x_{t,i}] = \theta^{*\top} x_{t,i}$. Define $R_{\text{SupLinUCB}} = \sum_{t=1}^{T}\big(r_{t, i^*_t} - r_{t, i_t}\big)$, where $i^*_t = \arg\max_i \theta^{*\top} x_{t,i}$. If SupLinUCB runs with $\alpha = \sqrt{\tfrac{1}{2}\ln(2TM/\delta)}$, then with probability at least $1 - \delta$, $R_{\text{SupLinUCB}} \le O\big(\sqrt{Td\ln^{3}(MT\ln(T)/\delta)}\big)$.
Corollary 2 (Advantage of EWC).
Assume the average clustering loss $\epsilon > 0$. Then, up to the constant hidden in Lemma 2, when $T \le \Big(\sqrt{d\ln^{3}(MT\ln(T)/\delta)} - \sqrt{\tfrac{1}{2}\ln K}\Big)^{2} / (L\epsilon)^{2}$, we have $R_{\text{EWC}} \le N \cdot R_{\text{SupLinUCB}}$, i.e., EWC incurs less regret than running SupLinUCB independently for each of the $N$ users.
We compare the regret of EWC with that of LinUCB. Lemma 2 is the regret bound of SupLinUCB (a variant of LinUCB). Corollary 2 compares EWC with LinUCB. The proof of Corollary 2 is provided in Appendix A-C. EWC demonstrates superior theoretical performance compared to LinUCB when $T$ is relatively small.
VII Experiment
In this section, we perform empirical analysis to validate our algorithm in two different applications: travel routes and restaurant recommendations.
VII-A Baselines
We compare our EWC algorithm against several baseline methods to determine its performance in online preference learning. The user non-compliance model (Non-compliance) is a linear model that parameterizes user dependence on recommendation (Algorithm 1). LinUCB refines the upper confidence bound method to suit linear reward scenarios, aiming to strike a balance between exploring new actions and exploiting known ones. We adopt the hybrid linear model in LinUCB proposed by [1] to learn from both user context and option context. DYNUCB combines LinUCB with dynamic clustering, which divides the population of users into multiple clusters and customizes the bandits for each cluster. We also use the hybrid linear model in DYNUCB. XGBoost is a highly efficient supervised learning algorithm based on gradient boosting.
VII-B Travel Route Recommendation
We validate our algorithm in travel route recommendations. The data is collected from a community survey first, and then expanded to represent a diverse driving population.
VII-B1 Description
Consider a social planner tasked with recommending travel route options to a population of drivers, where each driver $u$ has a user context vector $x_u$. Each route option at decision round $t$ with index $i$ is parameterized by an option context vector $x_{u,t,i}$. For simplicity, we consider two options ($M = 2$), each with two relevant travel metrics: travel time and emission. Thus, at each decision round of a user, the social planner faces a choice between two options: route 1, the standard route with regular travel time and emissions, and route 2, an eco-friendly alternative that offers reduced emissions while coming with increased travel time.
VII-B2 Experimental Setup
Community Survey. This study involved a community survey conducted in July 2023 on the University of North Carolina at Charlotte campus, in which a total of 43 individuals participated. Participants provided their driving choice preferences as well as demographic data covering age, gender, ethnicity, and educational level. The survey's main component involved a series of questions assessing willingness to adhere to route recommendations under varying scenarios with distinct travel times and carbon dioxide emission levels. Participants rated their likelihood of following these recommendations on a Likert scale, offering insight into their decision-making criteria. Mobility User Simulation. To better represent a diverse driving population, we expanded our dataset using a Bayesian inference model that resembles the original distribution of the survey data [50]. For each user in the survey data, we sample a preference parameter $\theta$ from a multivariate normal distribution. Based on this $\theta$, we calculate the prediction loss against the likelihoods reported in the survey, and then compute an acceptance rate that decreases with this loss, scaled by the acceptance temperature in Table II. We accept each sample with probability equal to this acceptance rate, and the process is repeated until we collect the desired number of samples for each user. In order to incorporate the influence of the recommendation on the user's choice, we concatenate $\theta$ with an additional recommendation-dependence parameter $\theta^r$, which is sampled from a beta distribution and then multiplied by a scaling factor $\alpha$. A higher $\alpha$ means the population has a stronger preference for the recommended option: $\alpha = 0$ represents a supervised learning scenario where the user's choice is independent of the recommendation, while a large $\alpha$ approaches a bandit feedback scenario. Based on the synthetic preference parameters, we sample travel routes and generate user choices on those routes. The detailed context description and parameter setting are shown in Appendix B-A.
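The following is a minimal sketch of this acceptance-sampling expansion; the proposal covariance, temperature, and the placeholder `loss_fn` are illustrative assumptions, not the survey-fitting details.

```python
import numpy as np

def expand_population(survey_thetas, n_per_user, temp=0.1, seed=0):
    """Expand survey users into a synthetic population (a sketch).

    For each surveyed user, propose preference parameters from a
    multivariate normal centered at the fitted parameters, score each
    proposal by its prediction loss against the user's survey answers
    (abstracted here as `loss_fn`), and accept with probability
    exp(-loss / temp).
    """
    rng = np.random.default_rng(seed)

    def loss_fn(theta, base):
        # Placeholder for the prediction loss vs. the survey likelihoods.
        return np.linalg.norm(theta - base)

    population = []
    for base in survey_thetas:
        kept = []
        while len(kept) < n_per_user:
            proposal = rng.multivariate_normal(base, 0.1 * np.eye(len(base)))
            if rng.random() < np.exp(-loss_fn(proposal, base) / temp):
                kept.append(proposal)
        population.append(np.array(kept))
    return population
```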
VII-B3 Results and Interpretation
In this section, we present the relative performance of our proposed algorithm, EWC, by comparing it with various baselines over a series of total rounds. The experiment is repeated with 5 different random seeds. Figure 2 shows the results with the travel route recommendation dataset. The regret represents the cumulative difference between the rewards of the algorithm’s selections and the optimal choices.
Our proposed algorithm, EWC, demonstrates a significantly lower regret than that of other baseline methods in all scenarios. It achieves a very low slope in the early rounds, indicating that EWC algorithm effectively incorporates user preference information and rapidly learns user preferences.
The user non-compliance model can learn users' preferences from both compliance and non-compliance feedback, as its slope increasingly flattens. However, it does not learn rapidly: its slope decreases slowly, and it does not show a significant advantage over LinUCB. Compared to the user non-compliance model, EWC achieves a much lower regret. This is because EWC efficiently uses the hierarchical information within the group of users' preferences learned by the user non-compliance model. It accelerates the preference learning process by clustering and prediction with expert advice.
XGBoost achieves the second lowest regret when $\alpha = 0$, which represents a supervised learning scenario. However, as $\alpha$ increases, representing a bandit feedback scenario where users have a stronger preference for recommended routes, XGBoost's performance deteriorates significantly. This suggests that supervised learning methods overlook the influence of recommendations on user choices, leading to sub-optimal outcomes. In contrast, LinUCB performs well for larger values of $\alpha$, demonstrating that its exploration-exploitation balancing strategy provides an advantage.
DYNUCB shows high regret in all scenarios. We believe that since it learns online, it obtains inaccurate preference estimates in the early rounds, leading to inaccurate clustering and consequently poor performance. In contrast, the EWC algorithm utilizes the relatively accurate preference parameters obtained from offline training. Additionally, the loss-guided distance metric in clustering improves clustering performance.



VII-B4 Ablation Study
In this subsection, we perform an ablation study to assess the impact of each component of our proposed EWC algorithm. We aim to understand their contribution to the overall performance. EWC consists of three main components: (1) the user non-compliance model, (2) clustering and prediction with expert advice, and (3) logistic regression on user context. Without non-compliance is the EWC algorithm without the user non-compliance model; we use a plain linear model to learn each user's preference parameter instead. Non-compliance is EWC without clustering and prediction with expert advice; it is reduced to the user non-compliance model. Without user context is EWC without using user context to accelerate preference learning. We also incorporate oracle methods in this section to show the potential of our EWC algorithm. Oracle Cluster is EWC with precise clustering to integrate group behaviors into user decision-making; we use Oracle Cluster to test the learning speed of prediction with expert advice. Lastly, Oracle uses complete information of user-specific preferences learned by Algorithm 1 to test the user non-compliance model in the offline training of EWC.
Figure 3 shows the results of the ablation study. EWC achieves a much lower regret compared to Non-compliance. As explained in Section VII-B3, the clustering and prediction with expert advice components significantly accelerate preference learning. EWC exhibits a lower regret slope than Without user context in the early rounds, indicating that using user context leads to a good initialization of the weight of each expert, further decreasing regret. EWC shows a lower slope than Without non-compliance across the entire time span, indicating that the user non-compliance model reduces the bias introduced by the anchoring effect when learning user preferences. The low slope of Oracle Cluster indicates that the shared preference in each cluster can represent each user's specific preference well. The long-term slope of EWC mirrors that of Oracle Cluster, suggesting that prediction with expert advice rapidly identifies each user's cluster identity. Oracle shows extremely low regret due to its complete information of user-specific preferences, indicating that the user non-compliance model in the offline training successfully learns users' preferences. These two oracle methods show the potential lower regret bounds that EWC could aspire to achieve.



VII-C Restaurant Recommendation
VII-C1 Data
The dataset for restaurant recommendations were constructed using the Entree Chicago Recommendation Data [51]. This rich dataset is a collection of user interactions with the Entree Chicago restaurant recommendation system, which includes user preferences, selections, and ratings of various dining establishments within the Chicago area. We select four features to be included in the option context: food quality, service level, price, and style. The detailed context description and parameter setting are shown in Appendix B-B.
VII-C2 Results
The experiment is repeated with 10 different random seeds. Figure 4 provides the comparative result of EWC and the baselines in the restaurant recommendation scenario over 1000 total rounds. The EWC algorithm shows the lowest regret across the entire time span, demonstrating its effectiveness in restaurant recommendation. XGBoost shows the highest regret, likely due to its cold start problem: the XGBoost model is more complex than the other baselines, but the limited number of decision rounds per user is insufficient for adequate training. DYNUCB exhibits lower regret than LinUCB, and EWC outperforms the user non-compliance model, indicating that the clustering method efficiently leverages the hierarchical structure among users, thereby accelerating the preference learning process. The user non-compliance model performs better than LinUCB, demonstrating its adaptability to the Nah Bandit problem. This also contributes to EWC's superior performance compared to DYNUCB.

VIII Conclusion
In this paper, we introduce a novel bandit framework—Nah Bandit. This framework offers the potential to accelerate preference learning. To solve this problem efficiently, we first introduce a user non-compliance model that parameterizes the anchoring effect to reduce bias when learning user preferences. Based on the user non-compliance model, we introduce Expert with Clustering (EWC), a novel hierarchical contextual bandit algorithm designed to address the Nah Bandit problem. EWC efficiently utilizes both compliance and non-compliance from users, achieving low regret in different feedback scenarios. This work lays the foundation for future research on the Nah Bandit problem.
However, EWC may not achieve sublinear regret in the long term because it uses the cluster centroid as an estimate of user-specific preferences at every round. In future work, we plan to delve deeper into the Nah Bandit problem, aiming to refine the EWC algorithm to achieve low regret in both the early rounds and the long term. Additionally, clustering algorithms often struggle with high-dimensional data due to the “curse of dimensionality,” which means that EWC may not perform well when the option context dimension is high. We hope to incorporate dimension reduction methods into EWC to handle high-dimensional data. In the sample complexity analysis, we allow the user non-compliance model to use the contexts of only two options for updates, transforming it into a logistic regression. However, there is potential for a sample complexity analysis using the contexts of all options, which could result in a tighter bound.
Acknowledgments
This work was partially supported by the National Science Foundation (NSF) under grant number 2149511 and the Kwanjeong scholarship. The authors would like to thank Prof. Hamed Tabkhi and Babak Rahimi Ardabili for their survey data support, and Prof. Christos Cassandras and Prof. Andreas Malikopoulos for their insightful discussions.
Appendix A Proof details
A-A Proof of Theorem 2
Before describing our theoretical proofs, we first introduce the background. In the expert problem, spanning $T$ total rounds with $K$ experts, we denote the best expert throughout the duration as $k^*$. [52] proved that the regret of Hedge is bounded by:

$$R_{\text{Hedge}} = \sum_{t=1}^{T} \langle \mathbf{p}_t, \boldsymbol{\ell}_t \rangle - \sum_{t=1}^{T} \ell_{t,k^*} \le \sqrt{\frac{T}{2}\ln K}. \qquad (4)$$
The loss of the K-Means algorithm is defined as $\phi(C) = \mathbb{E}_{\theta \sim P}\big[\min_{c \in C}\lVert\theta - c\rVert^2\big]$ in the population and $\hat{\phi}(C) = \frac{1}{m}\sum_{i=1}^{m}\min_{c \in C}\lVert\theta_i - c\rVert^2$ on a sample, where the minimizing $c$ is the cluster centroid assigned to $\theta$. [48] proved the uniform deviation bound of the K-Means algorithm. The result is as follows. Consider any set $C$ of $K$ centroids and any distribution $P$ on $\mathbb{R}^d$ with mean $\mu$ and variance $\sigma^2$. Assuming finite kurtosis (4th moment) and given $\epsilon > 0$, $\delta \in (0, 1)$, and a sufficiently large sample of size $m$ from $P$, the uniform deviation bound of K-Means holds with probability at least $1 - \delta$:

$$\big|\hat{\phi}(C) - \phi(C)\big| \le \frac{\epsilon}{2}\big(\sigma^2 + \phi(C)\big). \qquad (5)$$
Combining the results above, we prove the regret bound of EWC (Algorithm 2).
Proof.
For each user $u$, the regret of EWC decomposes into two parts: the regret of Hedge with respect to the best expert $k^*_u$, and the gap between recommending with the centroid $c_{k^*_u}$ and recommending with the true parameter $\theta_u$ (Equations (6)-(8)). By the regret bound of Hedge and the triangle inequality, the first part is at most $N\sqrt{(T/2)\ln K}$ when summed over all $N$ users (Equation (9)). By the Lipschitz condition, there exists a constant $L$ such that the second part is at most $LT\sum_{u=1}^{N}\lVert\theta_u - c_{k^*_u}\rVert$ (Equation (10)). By inequality (5), with probability at least $1 - \delta$, this sum concentrates around $N$ times its population counterpart $\mathbb{E}_{\theta \sim P}[\min_k \lVert\theta - c_k\rVert]$, up to the uniform deviation term $\epsilon_K(\delta)$ (Equations (11)-(12)). Combining the two parts yields the bound in Equation (2). ∎
A-B Proof of Corollary 1
Proof.
Since the clustering outputs the optimal centroids $c_k = \mu_k$, a user whose preference parameter is drawn from the $k$-th Gaussian component deviates from its centroid by $\mathbb{E}_{\theta \sim \mathcal{N}(\mu_k, \Sigma_k)}[\lVert\theta - \mu_k\rVert]$ in expectation. Averaging over the components with weights $\pi_k$ gives exactly the quantity $\epsilon$ defined in Corollary 1. Substituting this value into the bound of Theorem 2 yields the result. ∎
A-C Proof of Corollary 2
Proof.
By Corollary 1 and Lemma 2, $R_{\text{EWC}} \le N \cdot R_{\text{SupLinUCB}}$ is implied by $\sqrt{(T/2)\ln K} + LT\epsilon \le \sqrt{Td\ln^{3}(MT\ln(T)/\delta)}$. Dividing both sides by $\sqrt{T}$ and solving for $T$ gives the threshold stated in Corollary 2. ∎
A-D Proof of Theorem 1
Proof.
Since we use the contexts of only two options to update the model, where one is the user's choice and the other is not, the softmax over the two utilities reduces to a sigmoid of the utility difference: $P(y = a_1 \mid x_1, x_2) = \sigma\big(\theta_u^\top(x_1 - x_2)\big)$. Therefore, the user non-compliance model is equivalent to logistic regression with input $x = x_1 - x_2$ and inverse temperature $\beta = 1$. Since $\lVert\theta_u\rVert_2 = 1$ and $x \sim \mathcal{N}(0, I_d)$ by assumption, applying Lemma 1 gives that, with probability at least $1 - \delta$, $\lVert\hat{\theta}_u - \theta_u\rVert_2 \le \epsilon$. ∎
Appendix B Experimental Setup Details
B-A Travel Route Recommendation
In the travel route recommendation, the contexts include the travel time and CO2 emission of the two routes: the regular route and the eco-friendly route. The context of the regular route is held fixed, representing its travel time and CO2 emission, while the eco-friendly route has a higher travel time and lower CO2 emission compared to the regular one. We generate the training and testing data based on the survey data. The user preference parameter $\theta$ is initially sampled from a multivariate normal distribution. We assume the recommendation-dependence parameter $\theta^r$ also shows cluster characteristics, so we sample it from a beta distribution. Each dimension of the option context for the eco-friendly route is generated from a normal distribution. The parameters of the data generation and the travel route recommendation experiment are listed in Table II. The user contexts “Age” and “Education level” are transformed into one-hot encodings, while the others are binary variables.
TABLE II: Parameters of data generation and the travel route recommendation experiment.

| Parameter | Value |
| --- | --- |
| Number of decision rounds | 40 |
| User context dimensions | 9 |
| Option context dimensions | 2 |
| Number of options | 2 |
| Parameters for data generation | |
| Acceptance temperature | |
| Mean of $\theta$ | |
| Covariance of $\theta$ | |
| Mean of option context | |
| Standard deviation of option context | |
| Parameters for training | |
| Number of users | 446 |
| Learning rate | 0.5 |
| Regularization parameter | 0.001 |
| Number of clusters | 6 |
| Parameters for testing | |
| Number of users | 298 |
| Exploration rate for LinUCB | 0.05 |
| Learning rate for EWC | 1 |
TABLE III: Context description for the travel route recommendation.

| Context | Range | Dimensions |
| --- | --- | --- |
| User context | | |
| Age | {18-34, 35-49, 50-64} | 3 |
| Gender | {Male, female} | 1 |
| Education level | {High school, bachelor, master or higher} | 3 |
| Number of cars | {One, two or more} | 1 |
| Intercept | {1} | 1 |
| Option context | | |
| Travel time | | 1 |
| CO2 emission | | 1 |
B-B Restaurant recommendation
The parameters used in the restaurant recommendation experiment are listed in Table IV. The description of the option context is shown in Table V. The option context “Style” is transformed into a one-hot encoding, “Food quality” and “Service level” are mapped to ordinal numeric scales, and “Price” is mapped to a numeric scale accordingly.
TABLE IV: Parameters of the restaurant recommendation experiment.

| Parameter | Value |
| --- | --- |
| Option context dimensions | 9 |
| Number of decision rounds | 3–105 |
| Number of options | 2–18 |
| Parameters for training | |
| Number of users | 188 |
| Learning rate | 0.5 |
| Regularization parameter | 0.01 |
| Number of clusters | 8 |
| Parameters for testing | |
| Number of users | 75 |
| Exploration rate for LinUCB | 0.05 |
| Learning rate for EWC | 1 |
TABLE V: Option context description for the restaurant recommendation.

| Option context | Range | Dimensions |
| --- | --- | --- |
| Food quality | {Fair, good, excellent, extraordinary, near-perfect} | 1 |
| Service level | {Fair, good, excellent, extraordinary, near-perfect} | 1 |
| Price | {Below $15, $15-$30, $30-$50, over $50} | 1 |
| Style | {American, Asian, Latin, Middle Eastern, other} | 6 |
References
- [1] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A Contextual-Bandit Approach to Personalized News Article Recommendation,” in Proceedings of the 19th international conference on World wide web, pp. 661–670, Apr. 2010. arXiv:1003.0146 [cs].
- [2] F. Pase, D. Gündüz, and M. Zorzi, “Rate-constrained remote contextual bandits,” 2022.
- [3] M. Gigli and F. Stella, “Parametric bandits for search engine marketing optimisation,” in Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022, Proceedings, Part III, (Berlin, Heidelberg), p. 326–337, Springer-Verlag, 2022.
- [4] H. K. Kim, J. K. Kim, and Y. U. Ryu, “Personalized recommendation over a customer network for ubiquitous shopping,” IEEE Transactions on Services Computing, vol. 2, no. 2, pp. 140–151, 2009.
- [5] P. Wang, J. Guo, and Y. Lan, “Modeling retail transaction data for personalized shopping recommendation,” in Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pp. 1979–1982, 2014.
- [6] J. Jin, H. Guo, J. Xu, X. Wang, and F.-Y. Wang, “An end-to-end recommendation system for urban traffic controls and management under a parallel learning framework,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 3, pp. 1616–1626, 2021.
- [7] O. Massicot and C. Langbort, “Competitive comparisons of strategic information provision policies in network routing games,” IEEE Transactions on Control of Network Systems, vol. 9, no. 4, pp. 1589–1599, 2022.
- [8] A. Tewari and S. A. Murphy, “From ads to interventions: Contextual bandits in mobile health,” Mobile health: sensors, analytic methods, and applications, pp. 495–517, 2017.
- [9] S. Tomkins, P. Liao, P. Klasnja, and S. Murphy, “Intelligentpooling: Practical thompson sampling for mhealth,” Machine learning, vol. 110, no. 9, pp. 2685–2727, 2021.
- [10] S. Agrawal and N. Goyal, “Thompson sampling for contextual bandits with linear payoffs,” in International conference on machine learning, pp. 127–135, PMLR, 2013.
- [11] D. Zhou, L. Li, and Q. Gu, “Neural contextual bandits with ucb-based exploration,” 2020.
- [12] A. Tversky and D. Kahneman, “Judgment under uncertainty: Heuristics and biases: Biases in judgments reveal some heuristics of thinking under uncertainty,” Science, vol. 185, no. 4157, pp. 1124–1131, 1974.
- [13] D. Green, K. E. Jacowitz, D. Kahneman, and D. McFadden, “Referendum contingent valuation, anchoring, and willingness to pay for public goods,” Resource and energy economics, vol. 20, no. 2, pp. 85–116, 1998.
- [14] A. Gunawardana and G. Shani, “A survey of accuracy evaluation metrics of recommendation tasks,” Journal of Machine Learning Research, vol. 10, no. 12, 2009.
- [15] A. Furnham and H. C. Boo, “A literature review of the anchoring effect,” The journal of socio-economics, vol. 40, no. 1, pp. 35–42, 2011.
- [16] G. Adomavicius, J. C. Bockstedt, S. P. Curley, and J. Zhang, “Do recommender systems manipulate consumer preferences? a study of anchoring effects,” Information Systems Research, vol. 24, no. 4, pp. 956–975, 2013.
- [17] W. S. Rossi, J. W. Polderman, and P. Frasca, “The closed loop between opinion formation and personalized recommendations,” IEEE Transactions on Control of Network Systems, vol. 9, no. 3, pp. 1092–1103, 2022.
- [18] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, “Using collaborative filtering to weave an information tapestry,” Communications of the ACM, vol. 35, no. 12, pp. 61–70, 1992.
- [19] X. Cheng and S. Maghsudi, “Distributed consensus algorithm for decision-making in multi-agent multi-armed bandit,” IEEE Transactions on Control of Network Systems, pp. 1–12, 2024.
- [20] A. M. Ospina, A. Simonetto, and E. Dall’Anese, “Time-varying optimization of networked systems with human preferences,” IEEE Transactions on Control of Network Systems, vol. 10, no. 1, pp. 503–515, 2023.
- [21] T. T. Nguyen and H. W. Lauw, “Dynamic clustering of contextual multi-armed bandits,” in Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pp. 1959–1962, 2014.
- [22] C. Gentile, S. Li, and G. Zappella, “Online clustering of bandits,” in Proceedings of the 31st International Conference on Machine Learning (E. P. Xing and T. Jebara, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 757–765, PMLR, 22–24 Jun 2014.
- [23] N. Littlestone and M. Warmuth, “The weighted majority algorithm,” Information and Computation, vol. 108, no. 2, pp. 212–261, 1994.
- [24] T. Zhou, J.-H. Cho, B. R. Ardabili, H. Tabkhi, and C. Wu, “Expert with clustering: Hierarchical online preference learning framework,” 2024.
- [25] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th international conference on World Wide Web, pp. 285–295, 2001.
- [26] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
- [27] A. S. Das, M. Datar, A. Garg, and S. Rajaram, “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th International Conference on World Wide Web, WWW ’07, (New York, NY, USA), p. 271–280, Association for Computing Machinery, 2007.
- [28] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization for online recommendation with implicit feedback,” in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 549–558, 2016.
- [29] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
- [30] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.
- [31] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh, “Mondrian forests: Efficient online random forests,” Advances in neural information processing systems, vol. 27, 2014.
- [32] A. Beygelzimer, E. Hazan, S. Kale, and H. Luo, “Online gradient boosting,” Advances in neural information processing systems, vol. 28, 2015.
- [33] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794, 2016.
- [34] P. Auer, “Using confidence bounds for exploitation-exploration trade-offs,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 397–422, 2002.
- [35] S. Lloyd, “Least squares quantization in pcm,” IEEE transactions on information theory, vol. 28, no. 2, pp. 129–137, 1982.
- [36] A. Shepitsen, J. Gemmell, B. Mobasher, and R. Burke, “Personalized recommendation in social tagging systems using hierarchical clustering,” in Proceedings of the 2008 ACM Conference on Recommender Systems, RecSys ’08, (New York, NY, USA), p. 259–266, Association for Computing Machinery, 2008.
- [37] Y. Ban and J. He, “Local clustering in contextual multi-armed bandits,” 2023.
- [38] J. Cao, W. Sun, Z.-J. M. Shen, and M. Ettl, “Fatigue-Aware Bandits for Dependent Click Models,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3341–3348, Apr. 2020.
- [39] M. Yeomans, A. Shah, S. Mullainathan, and J. Kleinberg, “Making sense of recommendations,” Journal of Behavioral Decision Making, vol. 32, no. 4, pp. 403–414, 2019.
- [40] G. Bahar, O. Ben-Porat, K. Leyton-Brown, and M. Tennenholtz, “Fiduciary Bandits,” in Proceedings of the 37th International Conference on Machine Learning, pp. 518–527, PMLR, Nov. 2020.
- [41] J. Cao, W. Sun, and Z.-J. M. Shen, “Doubly Adaptive Cascading Bandits with User Abandonment,” Mar. 2019.
- [42] Z. Yang, X. Liu, and L. Ying, “Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment,” in 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1–2, Sept. 2022.
- [43] X. Wang, H. Xie, P. Wang, and J. C. S. Lui, “Optimizing recommendations under abandonment risks: Models and algorithms,” Performance Evaluation, vol. 161, p. 102351, Sept. 2023.
- [44] D. Hsu and A. Mazumdar, “On the sample complexity of parameter estimation in logistic regression with normal design,” 2024.
- [45] A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 19586–19597, Curran Associates, Inc., 2020.
- [46] F. Sattler, K.-R. Müller, and W. Samek, “Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3710–3722, 2021.
- [47] L. F. Toso, H. Wang, and J. Anderson, “Learning personalized models with clustered system identification,” in 2023 62nd IEEE Conference on Decision and Control (CDC), pp. 7162–7169, 2023.
- [48] O. Bachem, M. Lucic, S. H. Hassani, and A. Krause, “Uniform deviation bounds for unbounded loss functions like k-means,” 2017.
- [49] W. Chu, L. Li, L. Reyzin, and R. Schapire, “Contextual bandits with linear payoff functions,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208–214, JMLR Workshop and Conference Proceedings, 2011.
- [50] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An Introduction to MCMC for Machine Learning,” Machine Learning, vol. 50, pp. 5–43, Jan. 2003.
- [51] R. Burke, “Entree Chicago Recommendation Data.” UCI Machine Learning Repository, 2000. DOI: https://doi.org/10.24432/C5088P.
- [52] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
Appendix C Biography Section
Tianyue Zhou is an undergraduate student in the School of Information Science and Technology at ShanghaiTech University. His research interests lie at the intersection of machine learning and transportation. He aims to develop efficient machine learning algorithms to address various real-world challenges in transportation.

Jung-Hoon Cho is a Ph.D. candidate in Civil and Environmental Engineering at the Massachusetts Institute of Technology (MIT). He earned both his M.S. and B.S. degrees in Civil and Environmental Engineering from Seoul National University. His primary research interest lies at the intersection of transportation and machine learning. Jung-Hoon aims to develop generalizable machine learning models to optimize traffic flow, thereby reducing urban congestion and greenhouse gas emissions.

Cathy Wu is an Associate Professor at MIT in LIDS, CEE, and IDSS. She holds a Ph.D. from UC Berkeley, and B.S. and M.Eng. from MIT, all in EECS, and completed a Postdoc at Microsoft Research. Her research interests are at the intersection of machine learning, autonomy, and mobility. Her research aims to advance generalizable optimization to enable next-generation mobility systems. Cathy is the recipient of the NSF CAREER, several PhD dissertation awards, and several publications with distinction. She serves on the Board of Governors for the IEEE ITSS and as an Area Chair for ICML and NeurIPS.