

Bayesian Optimization with LLM-Based Acquisition Functions for Natural Language Preference Elicitation

David Eric Austin [email protected] (University of Waterloo, Waterloo, Ontario, Canada); Anton Korikov [email protected], Armin Toroghi, and Scott Sanner (University of Toronto, Toronto, Ontario, Canada)
(2024)
Abstract.

Designing preference elicitation (PE) methodologies that can quickly ascertain a user’s top item preferences in a cold-start setting is a key challenge for building effective and personalized conversational recommendation (ConvRec) systems. While large language models (LLMs) enable fully natural language (NL) PE dialogues, we hypothesize that monolithic LLM NL-PE approaches lack the multi-turn, decision-theoretic reasoning required to effectively balance the exploration and exploitation of user preferences towards an arbitrary item set. In contrast, traditional Bayesian optimization PE methods define theoretically optimal PE strategies, but cannot generate arbitrary NL queries or reason over content in NL item descriptions – requiring users to express preferences via ratings or comparisons of unfamiliar items. To overcome the limitations of both approaches, we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to actively elicit NL feedback to identify the best recommendation. Key challenges in generalizing BO to deal with natural language feedback include determining: (a) how to leverage LLMs to model the likelihood of NL preference feedback as a function of item utilities, and (b) how to design an acquisition function for NL BO that can elicit preferences in the infinite space of language. We demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses: 1) Natural Language Inference (NLI) between user preference utterances and NL item descriptions to maintain Bayesian preference beliefs, and 2) BO strategies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation. We numerically evaluate our methods in controlled simulations, finding that after 10 turns of dialogue, PEBOL can achieve an MRR@10 of up to 0.27 compared to the best monolithic LLM baseline’s MRR@10 of 0.17, despite relying on earlier and smaller LLMs. Our code is publicly available at https://github.com/D3Mlab/llm-pe.

Conversational Recommendation, Preference Elicitation, Bayesian Optimization, Online Recommendation, Query Generation
journalyear: 2024; copyright: acmlicensed; conference: 18th ACM Conference on Recommender Systems, October 14–18, 2024, Bari, Italy; booktitle: 18th ACM Conference on Recommender Systems (RecSys ’24), October 14–18, 2024, Bari, Italy; doi: 10.1145/3640457.3688142; isbn: 979-8-4007-0505-2/24/1; ccs: Information systems – Recommender systems; Information systems – Personalization; Information systems – Language models

1. Introduction

Personalized conversational recommendation (ConvRec) systems require effective natural language (NL) preference elicitation (PE) strategies that can efficiently learn a user’s top item preferences in cold start settings, ideally requiring only an arbitrary set of NL item descriptions. While the advent of large language models (LLMs) has introduced the technology to facilitate NL-PE conversations (Handa et al., 2024; Li et al., 2023), we conjecture that monolithic LLMs have limited abilities to strategically conduct active, multi-turn NL-PE dialogues about a set of arbitrary items. Specifically, we hypothesize that LLMs lack the multi-turn decision-theoretic reasoning to interactively generate queries that avoid over-exploitation or over-exploration of user-item preferences, thus risking over-focusing on already revealed item preferences or wastefully exploring preferences over low-value items. Further challenges faced by monolithic LLM NL-PE approaches include the need to jointly reason over large, potentially unseen sets of item descriptions, and the lack of control and interpretability in system behaviour even after prompt engineering or fine-tuning (Maynez et al., 2020).

Figure 1. PEBOL’s belief updates over a cold-start user’s item utilities during three turns of NL dialogue. Bayesian preference beliefs not only facilitate recommendation, but also enable Bayesian optimization policies to guide LLM query generation, avoiding over-exploration (asking about clearly low-value items) and over-exploitation (over-focusing on known preferences).

In contrast, conventional PE algorithms (Li et al., 2021; Zhang et al., 2020; Lei et al., 2020; Lee et al., 2019; Zhao et al., 2013), including Bayesian optimization methods (Guo and Sanner, 2010; Yang et al., 2021; Christakopoulou et al., 2016; Vendrov et al., 2020; Boutilier, 2002), establish formal decision-theoretic policies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) (Karamanolakis et al., 2018) to balance exploration and exploitation with the goal of quickly identifying the user’s most preferred items. However, these techniques typically assume a user can express preferences via direct item ratings or comparisons – an unrealistic expectation when users are unfamiliar with most items (Biyik et al., 2023). While recent work has extended Bayesian PE to a fixed set of template-based queries over pre-defined keyphrases (Yang et al., 2021), no existing work extends Bayesian methodologies to generative NL-PE over a set of generic NL item descriptions.

In this paper, we make the following contributions:

  • We introduce the first Bayesian optimization formalization of NL-PE for arbitrary NL dialogue over a generic set of NL item descriptions – establishing a new framework for research on steering LLMs with decision-theoretic reasoning.

  • We present PEBOL (Preference Elicitation with Bayesian Optimization augmented LLMs), a novel NL-PE algorithm which 1) infers item preferences via Natural Language Inference (NLI) (Yin et al., 2019) between dialogue utterances and item descriptions to maintain Bayesian preference beliefs and 2) introduces LLM-based acquisition functions, where NL query generation is guided by decision-theoretic strategies such as TS and UCB over the preference beliefs.

  • We numerically evaluate PEBOL against monolithic GPT-3.5 and Gemini-Pro NL-PE methods via controlled NL-PE dialogue experiments over multiple NL item datasets and levels of user noise.

  • We observe that after 10 turns of dialogue, PEBOL can achieve a mean MRR@10 of up to 0.27 compared to the best monolithic LLM baseline’s MRR@10 of 0.17, despite relying on earlier and smaller LLMs.

2. Background and Related Work

2.1. Bayesian Optimization

Given an objective function $f:\mathcal{X}\rightarrow\mathbb{R}$, (standard) optimization systematically searches for a point $x^{*}\in\mathcal{X}$ that maximizes $f$ (we take the maximization direction since this paper searches for items with maximum utility for a person). Bayesian optimization focuses on settings where $f$ is a black-box function which does not provide gradient information and cannot be evaluated exactly – rather, $f$ must be evaluated using indirect or noisy observations which are expensive to obtain (Garnett, 2023; Shahriari et al., 2016). To address these challenges, Bayesian optimization maintains probabilistic beliefs over $f(x)$ and its observations to guide an uncertainty-aware optimization policy which decides where to next observe $f(x)$.

Bayesian optimization begins with a prior $p(f)$ which represents the beliefs about $f$ before any observations are made. Letting $y_{i}$ represent a noisy or indirect observation of $f(x_{i})$, and collecting a sequence of observations into a dataset $\mathcal{D}=(\mathbf{x},\mathbf{y})$, an observation model defines the likelihood $p(\mathcal{D}|f)$. We then use the observed data and Bayes' theorem to update our beliefs and obtain the posterior

(1) $p(f|\mathcal{D})=\frac{p(f)\,p(\mathcal{D}|f)}{p(\mathcal{D})}$.

This posterior informs an acquisition function $\gamma(x|\mathcal{D})$ which determines where to next observe $f(x)$ in a way that balances exploitation (focusing observations where $f$ is likely near its maximum) with exploration (probing areas where $f$ has high uncertainty).
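To make this loop concrete, here is a minimal sketch (ours, not from the paper) of Bayesian optimization over a small discrete candidate set, with Beta-Bernoulli beliefs standing in for $p(f)$ and Thompson sampling as the acquisition function; `true_f` and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box objective over four candidates, observable
# only through noisy Bernoulli feedback (illustrative values).
true_f = np.array([0.2, 0.5, 0.9, 0.4])

# Uniform Beta(1, 1) beliefs over f(x) for each candidate.
alpha = np.ones_like(true_f)
beta = np.ones_like(true_f)

for t in range(50):
    # Acquisition: draw one utility sample per candidate from the
    # posterior and observe f where the sample is maximal.
    x = int(np.argmax(rng.beta(alpha, beta)))
    y = rng.random() < true_f[x]  # noisy observation of f(x)
    # Posterior update via Beta-Bernoulli conjugacy.
    alpha[x] += y
    beta[x] += 1 - y

print("posterior means:", alpha / (alpha + beta))  # should peak at index 2
```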

2.2. Preference Elicitation

PE has witnessed decades of research, and includes approaches based on Bayesian optimization (e.g., (Eric et al., 2007; Guo et al., 2010; Brochu et al., 2010; González et al., 2017; Khajah et al., 2016)), bandits (e.g., (Li et al., 2010, 2011; Zhao et al., 2013; Christakopoulou et al., 2016)), constrained optimization (Rossi and Sperduti, 2004), and POMDPs (Boutilier, 2002). In the standard PE setting, a user is assumed to have some hidden utilities $\mathbf{u}=[u_{1},...,u_{N}]$ over a set $\mathcal{I}$ of $N$ items, where item $i$ is preferred to item $j$ if $u_{i}>u_{j}$. The goal of PE is typically to search for an item $i^{*}\in\arg\max_{i}u_{i}$ that maximizes user utility in a minimal number of PE queries, which most often ask a user to express item preferences as item ratings (e.g., (Li et al., 2010, 2011; Brochu et al., 2010; Zhao et al., 2013; Christakopoulou et al., 2016)) or relative preferences between item pairs or sets (e.g., (Boutilier, 2002; Guo and Sanner, 2010; Handa et al., 2024; Eric et al., 2007; González et al., 2017; Vendrov et al., 2020)). An alternative form of PE asks users to express preferences over predefined item features, also through rating- or comparison-based queries (Li et al., 2021; Zhang et al., 2020; Lei et al., 2020). Central to the above PE methods are query selection strategies that balance the exploration and exploitation of user preferences, with TS and UCB algorithms (cf. Sec. 4.2) often exhibiting strong performance (Zhao et al., 2013; Christakopoulou et al., 2016; Yang et al., 2021; Li et al., 2021; Zhang et al., 2020). However, none of these methods are able to interact with users through NL dialogue or reason about NL item descriptions.

Figure 2. The PEBOL NL-PE algorithm, which maintains a Bayesian belief state over a user’s item preferences given an arbitrary set of NL item descriptions $\mathbf{x}$. This belief is used by a decision-theoretic policy to balance the exploration and exploitation of preferences by strategically selecting an item description $x_{i^{t}}$ as the basis for LLM query generation. Belief updates are computed through Bayesian inference with NLI entailment scores between item descriptions and query-response pairs.

2.3. Language-Based Preference Elicitation

Yang et al. (Yang et al., 2021) introduce Bayesian PE strategies using TS and UCB for keyphrase rating queries, where keyphrases are first mined from NL item reviews and then co-embedded with user-item preferences in a recommendation system. Handa et al. (Handa et al., 2024) propose using LLMs to interface with a conventional Bayesian PE system, suggesting a preprocessing step to extract features from NL descriptions and a verbalization step to fluidly express pairwise item comparison queries. Li et al. (Li et al., 2023) prompt an LLM to generate PE queries for some specific domain (e.g., news content, morals), observe user responses, and evaluate LLM relevance predictions for a single item. While these works make progress towards NL-PE, they do not study how LLM query generation can strategically explore user preferences towards an arbitrary item set outside the realm of item-based or category-based feedback.

2.4. Conversational Recommendation

Recent work on ConvRec uses language models (earlier systems, e.g., (Li et al., 2018; Chen et al., 2019), use relatively small RNN-based language models) to facilitate NL dialogue while integrating calls to a recommender module which generates item recommendations based on user-item interaction history (Li et al., 2018; Chen et al., 2019; Wang et al., 2022; Yang et al., 2022). He et al. (He et al., 2023) report that on common datasets, zero-shot GPT-3.5/4 outperforms these ConvRec methods, which generally use older language models and require user-item interaction history for their recommendation modules.

2.5. Natural Language Inference

Binary Natural Language Inference (NLI) (Yin et al., 2019) models predict the likelihood that one span of text called a premise is entailed by (i.e., can be inferred from) a second span called the hypothesis. For example, an effective NLI model should predict a high likelihood that the premise “I want to watch Iron Man” entails the hypothesis “I want to watch a superhero movie”. As illustrated by this example, the hypothesis typically must be more general than the premise. NLI models are trained by fine-tuning encoder-only LLMs on NLI datasets (Williams et al., 2018; Dagan et al., 2005; Thorne et al., 2018), which typically consist of short text spans for the premise and hypothesis – thus enabling relatively efficient performance on similar tasks with a fairly small number of LLM parameters.
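For illustration, the Iron Man example above can be scored with an off-the-shelf NLI model. A minimal sketch using the Hugging Face zero-shot classification pipeline, which wraps exactly this premise-hypothesis scoring (the model matches the one used later in Sec. 5.2.4; the candidate labels are our illustrative choices):

```python
from transformers import pipeline

# Zero-shot classification treats the input text as the NLI premise and
# turns each candidate label into a hypothesis before scoring entailment.
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = nli(
    "I want to watch Iron Man",
    candidate_labels=["superhero movie", "romantic comedy"],
)
print(result["labels"], result["scores"])  # expect "superhero movie" to rank first
```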

Figure 3. Cherry-picked system-generated dialogues from our NL-PE experiments. The Monolithic GPT-3.5 dialogue (left) demonstrates over-exploitation, with $q^{3}$ directly extending $q^{2}$ after a positive user preference is observed and leading to the extreme case of query repetition ($q^{4}=q^{3}$). In contrast, PEBOL (right) continues exploring even after a positive response, while focusing on promising aspects (three out of four queries elicit a positive response) by using UCB-guided query generation.

3. Problem Definition

We now present a Bayesian optimization formulation of NL-PE. The goal of NL-PE is to facilitate a NL dialogue which efficiently discovers a user’s most preferred items out of a set of $N$ items. Each item $i\in\mathcal{I}$ has a NL description $x_{i}$, which might be a title, long-form description, or even a sequence of reviews, with the item set $\mathcal{I}$ collectively represented by $\mathbf{x}\in\mathcal{X}$ with $\mathbf{x}=[x_{1},...,x_{N}]$. We assume the user has some (unknown) utility function $f:\mathcal{X}\rightarrow\mathbb{R}$ establishing hidden utilities $\mathbf{u}=f(\mathbf{x})$ so that item $i$ is preferred to item $j$ if $u_{i}>u_{j}$. Our goal is to find the most preferred item(s):

(2) $i^{*}\in\arg\max_{i\in\mathcal{I}}u_{i}$.

In contrast to standard Bayesian PE formalisms (cf. Sec. 2.2), we do not assume that the user can effectively convey direct item-level preferences by either: 1) providing item ratings (i.e., utilities) or 2) pairwise or listwise item comparisons. Instead, we must infer user preferences by observing utterances during a NL system-user dialogue. At turn $t$ of a dialogue, we let $q^{t}$ and $r^{t}$ be the system and user utterance, respectively, with $\mathbf{q}^{t}=[q^{1},...,q^{t}]$ and $\mathbf{r}^{t}=[r^{1},...,r^{t}]$ representing all system and user utterances up to $t$. In this paper, we call $q^{t}$ the query and $r^{t}$ the response, though extensions to more generic dialogues (e.g., when users can also ask queries) are discussed in Section 7. We let $\mathcal{H}^{t}=(\mathbf{q}^{t},\mathbf{r}^{t})$ be the conversation history at turn $t$.

To formulate NL-PE as a Bayesian optimization problem, we place a prior belief on the user’s utilities $p(\mathbf{u}|\mathbf{x})$, potentially conditioned on item descriptions since they are available before the dialogue begins. We then assume an observation model that gives the likelihood $p(\mathbf{r}^{t}|\mathbf{x},\mathbf{u},\mathbf{q}^{t})$, letting us define the posterior utility belief as

(3) $p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t})\propto p(\mathbf{r}^{t}|\mathbf{x},\mathbf{u},\mathbf{q}^{t})\,p(\mathbf{u}|\mathbf{x})$.

This posterior informs an acquisition function $\gamma(\mathbf{x},\mathcal{H}^{t})$ which generates a new NL query (deviating from the conventional definition of acquisition functions as mappings to $\mathbb{R}$, in order to represent the generative acquisition of NL outputs)

(4) $q^{t+1}=\gamma(\mathbf{x},\mathcal{H}^{t})$,

to systematically search for $i^{*}$. The preference beliefs also let us define an Expected Utility (EU) $\mu_{i}^{t}$ for every item as

(5) $\mu_{i}^{t}=\mathbb{E}_{p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t})}[u_{i}]$,

which allows the top-$k$ items to be recommended at any turn based on their expected utilities.
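For the Beta beliefs PEBOL adopts in Section 4.1, the expectation in Eq. (5) has the closed form $\mu_{i}^{t}=\alpha_{i}^{t}/(\alpha_{i}^{t}+\beta_{i}^{t})$, so recommendation at any turn reduces to ranking items by posterior mean. A minimal sketch (ours; function names are illustrative):

```python
import numpy as np

def expected_utilities(alpha, beta):
    """Closed-form expected utility E[u_i] of Beta(alpha_i, beta_i) beliefs."""
    return alpha / (alpha + beta)

def recommend_top_k(alpha, beta, k=10):
    """Return the indices of the k items with highest expected utility."""
    mu = expected_utilities(np.asarray(alpha, float), np.asarray(beta, float))
    return np.argsort(-mu)[:k]

# Item 2 has the most positive evidence, so it ranks first.
print(recommend_top_k([1, 2, 5, 1], [1, 1, 1, 3], k=2))  # -> [2 1]
```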

Our Bayesian optimization NL-PE paradigm lets us formalize several key questions, including:

  1. How do we represent beliefs $p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t})$ in user-item utilities $\mathbf{u}$, given NL item descriptions $\mathbf{x}$ and a dialogue $\mathcal{H}^{t}$?

  2. What are effective models for the likelihood $p(\mathbf{r}^{t}|\mathbf{x},\mathbf{u},\mathbf{q}^{t})$ of observed responses $\mathbf{r}^{t}$ given $\mathbf{x}$, $\mathbf{q}^{t}$, and user utilities $\mathbf{u}$?

  3. How can our beliefs inform the generative acquisition of NL queries $q^{t+1}$ given $\mathcal{H}^{t}$ to strategically search for $i^{*}$?

These questions reveal a number of novel research directions discussed further in Section 7. In this paper, we present PEBOL, a NL-PE algorithm based on the above Bayesian optimization NL-PE formalism, and numerically evaluate it against monolithic LLM alternatives through controlled, simulated NL dialogues (cf. Sec. 6).

4. Methodology

Limitations of Monolithic LLM Prompting

An obvious NL-PE approach, described further as a baseline in Section 5.1, is to prompt a monolithic LLM with all item descriptions $\mathbf{x}$, dialogue history $\mathcal{H}^{t}$, and instructions to generate a new query at each turn. However, providing all item descriptions $[x_{1},...,x_{N}]$ in the LLM context window is very computationally expensive for all but the smallest item sets. While item knowledge could be internalized through fine-tuning, each item update would imply system retraining. Critically, an LLM’s preference elicitation behaviour cannot be controlled other than by prompt-engineering or further fine-tuning, with neither option offering any guarantees of predictable or interpretable behaviour that balances the exploitation and exploration of user preferences.

PEBOL Overview

We propose to address these limitations by augmenting LLM reasoning with a Bayesian optimization procedure in a novel algorithm, PEBOL, illustrated in Figure 2. At each turn $t$, our algorithm maintains a probabilistic belief state over user preferences as a Beta belief state (cf. Sec. 4.1). This belief state guides an LLM-based acquisition function to generate NL queries explicitly balancing exploration and exploitation to uncover the top user preferences (cf. Sec. 4.2). In addition, our acquisition function reduces the context needed to prompt the LLM in each turn from all $N$ item descriptions $\mathbf{x}$ to a single strategically selected item description $x_{i^{t}}$. PEBOL then uses NLI over elicited NL preferences and item descriptions to map dialogue utterances to numerical observations (cf. Sec. 4.3).

Figure 4. MRR@10 for MonoLLM and PEBOL-P with uncertainty-informed policies (UCB, TS, ER). All methods show preference learning over time and MonoLLM is generally outperformed by PEBOL.

4.1. Utility Beliefs

4.1.1. Prior Beliefs

Before any dialogue, PEBOL establishes an uninformed prior belief $p(\mathbf{u})$ on user-item utilities. We assume item utilities are independent so that

(6) $p(\mathbf{u})=\prod_{i=1}^{N}p(u_{i})$,

and that the prior for each utility $u_{i}$ is a Beta distribution

(7) $p(u_{i})=\text{Beta}(\alpha_{i}^{0},\beta_{i}^{0})$.

Since this paper focuses on fully cold start settings, we assume a uniform Beta prior with $(\alpha_{i}^{0},\beta_{i}^{0})=(1,1)$. Beta distributions, illustrated in Figure 1, lie in the domain $[0,1]$ – a normalized interval for bounded ratings in classical recommendation systems. We can thus interpret utility values of $u_{i}=1$ or $u_{i}=0$ to represent a complete like or dislike of item $i$, respectively, while values $u_{i}\in(0,1)$ provide a strength of preference between these two extremes.

4.1.2. Observation Model

To perform a posterior update on our utility beliefs given observed responses $\mathbf{r}^{t}$, we need an observation model that represents the likelihood $p(\mathbf{r}^{t}|\mathbf{x},\mathbf{u},\mathbf{q}^{t})$. Modelling the likelihood of $\mathbf{r}^{t}$ is a challenging task, so we will require some simplifying assumptions. Firstly, we assume that the likelihood of a single response $r^{t}$ is independent of any previous dialogue history $\mathcal{H}^{t-1}$, so that:

(8) $p(\mathbf{r}^{t}|\mathbf{x},\mathbf{u},\mathbf{q}^{t})=\prod_{t^{\prime}=1}^{t}p(r^{t^{\prime}}|\mathbf{x},\mathbf{u},q^{t^{\prime}})$.

Note that this independence assumption will allow incremental posterior belief updates, so that

(9) $p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t})\propto p(r^{t}|\mathbf{x},\mathbf{u},q^{t})\,p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t-1})$.

4.1.3. Binary Item Response Likelihoods and Posterior Update

With the factorized distributions over item utilities and observational likelihood history now defined, we simply have to provide a concrete observational model of the response likelihood conditioned on the query, item descriptions, and latent utility: $p(r^{t}|\mathbf{x},\mathbf{u},q^{t})$.

Because the prior is factorized over conditionally independent $u_{i}$ (cf. (6)), we can likewise introduce individual per-item factorized binary responses $r^{t}_{i}\in\{0\,(\mathit{dislike}),1\,(\mathit{like})\}$ to represent the individual relevance of each item $i$ to the preference elicited at turn $t$. Critically, we won’t actually require an individual response per item – this will be computed by a natural language inference (NLI) model (Dagan et al., 2005) to be discussed shortly – but we’ll begin with an individual binary response model for $r_{i}^{t}$ for simplicity:

(10) $p(r_{i}^{t}|x_{i},u_{i},q^{t})=\text{Bernoulli}(u_{i})$.

With our response likelihood defined, this now leads us to our first pass at a full posterior utility update that we term PEBOL-B for observed Binary rating feedback. Specifically, given observed binary ratings $r^{t}_{i}$, the update at $t=1$ uses the Beta prior (7) with the Bernoulli likelihood (10) to form a standard Beta-Bernoulli conjugate pair and compute the posterior utility belief

(11) $p(u_{i}|x_{i},\mathcal{H}^{1})\propto p(u_{i}|x_{i})\,p(r^{1}_{i}|x_{i},u_{i},q^{1})$
(12) $=\text{Beta}(\alpha_{i}^{1},\beta_{i}^{1})$,

where $\alpha_{i}^{1}=\alpha_{i}^{0}+r_{i}^{1}$ and $\beta_{i}^{1}=\beta_{i}^{0}+(1-r_{i}^{1})$. Subsequent incremental updates follow Eq. (9) and use the same conjugacy to give

(13) $p(u_{i}|x_{i},\mathcal{H}^{t})=\text{Beta}(\alpha_{i}^{t},\beta_{i}^{t})$,

where $\alpha_{i}^{t}=\alpha_{i}^{t-1}+r_{i}^{t}$ and $\beta_{i}^{t}=\beta_{i}^{t-1}+(1-r_{i}^{t})$.

4.1.4. Natural Language Inference and Probabilistic Posterior Update

As hinted above, effective inference becomes slightly more nuanced since we don’t need to observe an explicit binary response per item in our PEBOL framework. Rather, we receive general preference feedback $r^{t}$ on whether a user generically prefers a text description $q^{t}$ and then leverage an NLI model (Dagan et al., 2005) to infer whether the description $x_{i}$ of item $i$ would be preferred according to this feedback. For instance, for a $(q^{t},r^{t})$ pair (“Want to watch a children’s movie?”, “Yes”), NLI should infer a rating of $r_{1}^{t}=1$ for $x_{1}=$ “The Lion King” and $r_{2}^{t}=0$ for $x_{2}=$ “Titanic”.

To deal with the fact that NLI models actually return an entailment probability, our probabilistic observation variant, PEBOL-P, leverages the probability that item description $x_{i}$ entails $q^{t}$, which we denote as $w_{i}^{t}$. We provide a full graphical model and derivation of the Bayesian posterior update given this entailment probability in the Supplementary Material, but note that we can summarize the final result as a relaxed version of the binary posterior update of (13) that replaces the binary observation $r_{i}\in\{0,1\}$ with the entailment probability $w_{i}^{t}\in[0,1]$, i.e., $\alpha_{i}^{t}=\alpha_{i}^{t-1}+w_{i}^{t}$, $\beta_{i}^{t}=\beta_{i}^{t-1}+(1-w_{i}^{t})$.
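Both the PEBOL-B update of Eq. (13) and this PEBOL-P relaxation amount to incrementing per-item pseudo-counts, as in the following sketch (ours; array-based over all $N$ items at once):

```python
import numpy as np

def update_beliefs(alpha, beta, w, binary=False):
    """One turn of the Beta posterior update.

    w: per-item NLI entailment probabilities in [0, 1] (PEBOL-P);
    binary=True first rounds w to {0, 1}, giving the PEBOL-B update.
    """
    w = np.asarray(w, dtype=float)
    if binary:
        w = np.round(w)
    return alpha + w, beta + (1.0 - w)

alpha, beta = np.ones(3), np.ones(3)           # uniform Beta(1, 1) priors
alpha, beta = update_beliefs(alpha, beta, [0.9, 0.2, 0.5])
print(alpha, beta)                             # [1.9 1.2 1.5] [1.1 1.8 1.5]
```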

To visually illustrate how this posterior inference process works in practice, Figure 1 shows the effect of PEBOL’s posterior utility belief updates based on NLI for three query-response pairs – we can see the system gaining statistical knowledge about useful items for the user from the dialogue.

Figure 5. MRR@10 for PEBOL-P with various context acquisition policies.

4.2. LLM-Based Acquisition Functions

Recall from Sec. 2.1 that in Bayesian optimization, the posterior informs an acquisition function which determines where to make the next observation. PEBOL generates a new query $q^{t}$ with a two-step acquisition function $\gamma$, first using Bayesian optimization policies (step 1) based on the posterior utility beliefs $p(\mathbf{u}|\mathbf{x},\mathcal{H}^{t})$ to select NL context, and then using this selected context to guide LLM prompting (step 2). We express the overall acquisition function $\gamma=\gamma^{G}\circ\gamma^{C}$ as a composition of a context acquisition function $\gamma^{C}$ (cf. Sec. 4.2.1) and a NL generation function $\gamma^{G}$ (cf. Sec. 4.2.2).

4.2.1. Context Acquisition via Bayesian Optimization Policies

First, PEBOL harnesses Bayesian optimization policies to select an item description $x_{i^{t}}$ which will be used to prompt an LLM to generate a query about an aspect described by $x_{i^{t}}$ (cf. Sec. 4.2.2). Selecting an item $i^{t}$ whose utility $u_{i^{t}}$ is expected to be near the maximum, $u_{i^{*}}$, will generate exploitation queries asking about properties of items that are likely to be preferred by the user. In contrast, selecting an item $i^{t}$ associated with high uncertainty in its utility $u_{i^{t}}$ will generate exploration queries that probe into properties of items for which user preferences are less known. Thus, strategically selecting $x_{i^{t}}$ allows PEBOL to balance the exploration and exploitation behaviour of NL queries, decreasing the risks of becoming stuck in local optima (over-exploitation) or wasting resources exploring low utility item preferences (over-exploration). We define the item selected by the context acquisition function as

(14) $i^{t}=\gamma^{C}(\mathbf{x},\mathcal{H}^{t})$,

and list several alternatives for $\gamma^{C}$, including the well-known strategies of TS and UCB (Shahriari et al., 2016); a code sketch of all five policies follows the list:

  1. Thompson Sampling (TS): First, a sample of each item’s utility $\hat{u}^{t}_{i}$ is taken from the posterior, $\hat{u}^{t}_{i}\sim p(u_{i}|x_{i},\mathcal{H}^{t})$. Then, the item with the highest sampled utility is selected:

     (15) $i^{t}=\arg\max_{i}\hat{u}^{t}_{i}$.

     TS explores more when beliefs have higher uncertainty and exploits more as the system becomes more confident.

  2. Upper Confidence Bound (UCB): Let $P_{k}(\alpha,\beta)$ represent the $k$’th percentile of $\text{Beta}(\alpha,\beta)$, which provides a confidence bound on the posterior. UCB selects the item with the highest confidence bound

     (16) $i^{t}=\arg\max_{i}P_{k}(p(u_{i}|x_{i},\mathcal{H}^{t}))$,

     following a balanced strategy because confidence bounds are increased by both high utility and high uncertainty.

  3. Entropy Reduction (ER): An explore-only strategy that selects the item with the most uncertain utility:

     (17) $i^{t}=\arg\max_{i}\text{Var}(p(u_{i}|x_{i},\mathcal{H}^{t}))$.

  4. Greedy: An exploit-only strategy that selects the item with the highest expected utility $\mu_{i}^{t}$ (Eq. 5):

     (18) $i^{t}=\arg\max_{i}\mu_{i}^{t}$.

  5. Random: An explore-only heuristic that selects the next item randomly.
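A sketch of all five policies over the Beta beliefs of Section 4.1 (ours; scipy supplies the Beta percentile and variance, and the percentile $k=0.9$ is an illustrative choice, not a value reported in the paper):

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)

def select_item(alpha, beta, policy="TS", k=0.9):
    """Context acquisition: choose which item description to prompt the LLM with."""
    if policy == "TS":      # sample each utility from its posterior, take the argmax
        return int(np.argmax(rng.beta(alpha, beta)))
    if policy == "UCB":     # k'th-percentile upper confidence bound
        return int(np.argmax(beta_dist.ppf(k, alpha, beta)))
    if policy == "ER":      # explore-only: highest posterior variance
        return int(np.argmax(beta_dist.var(alpha, beta)))
    if policy == "Greedy":  # exploit-only: highest expected utility
        return int(np.argmax(alpha / (alpha + beta)))
    return int(rng.integers(len(alpha)))       # Random

alpha = np.array([5.0, 1.0, 2.0])
beta = np.array([1.0, 1.0, 2.0])
for p in ["TS", "UCB", "ER", "Greedy", "Random"]:
    print(p, select_item(alpha, beta, p))
```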

4.2.2. Generating Short, Aspect-Based NL Queries

Next, PEBOL prompts an LLM to generate a NL query $q^{t}$ based on the selected item description $x_{i^{t}}$ while also using the dialogue history $\mathcal{H}^{t}$ to avoid repetitive queries. We choose to generate “yes-or-no” queries asking if a user prefers items with some aspect $a^{t}$, which is a short text span extracted dynamically from $x_{i^{t}}$ to be different from any previously queried aspects $a^{1},...,a^{t-1}$. We adopt this query generation strategy to: 1) reduce cognitive load on the user, who may be frustrated by long and specific queries about unfamiliar items, and 2) better facilitate NLI through brief, general phrases (Yin et al., 2019). Letting $\phi$ represent the query generation prompt, we let

(19) $q^{t},a^{t}=\gamma^{G}(x_{i^{t}},\mathcal{H}^{t},\phi)$

be the LLM-generated query and aspect at turn $t$, with prompting details discussed in Section 5.2.3. An example of such a query and aspect (bold) is “Are you interested in movies with patriotic themes?”, generated by PEBOL in our movie recommendation experiments and shown in Figure 2.

4.3. NL Item-Preference Entailment

4.3.1. Preference Descriptions from Query Response Pairs

Next, PEBOL receives a NL user response $r^{t}$, which it must convert to individual item preference observations. Since the LLM is instructed to generate “yes-or-no” queries $q^{t}$ asking a user if they like aspect $a^{t}$, we assume the user response will be a “yes” or a “no”, and create a NL description of the user’s preference $\rho^{t}$, letting $\rho^{t}=a^{t}$ if $r^{t}=$ “yes”, and $\rho^{t}=\text{concat}(\text{“not ”},a^{t})$ if $r^{t}=$ “no”. For example, given a query that asks if the user prefers the aspect “patriotism” in an item, if the user response is “yes”, then the user preference $\rho^{t}$ is “patriotism”, and “not patriotism” otherwise. This approach produces short, general preference descriptions that are well suited for NLI models (Yin et al., 2019).

4.3.2. Inferring Item Ratings from NL Preferences

Given a NL preference $\rho^{t}$, PEBOL must infer whether the user would like an item described by $x_{i}$. Specifically, PEBOL acquires ratings $\mathbf{w}^{t}=[w^{t}_{1},...,w^{t}_{N}]$ (cf. Sec. 4.1.4) by using NLI to predict whether an item description $x_{i}$ entails (i.e., implies) the preference $\rho^{t}$. For example, we expect that an NLI model would predict that $x_{i}=$ “The Lion King” entails $\rho^{t}=$ “animated” while $x_{j}=$ “Titanic” does not, inferring that a user who expressed preference $\rho^{t}$ would like item $i$ but not $j$. We use an NLI model $P_{\omega}(x_{i},\rho^{t})$ to predict the probability $w_{i}^{t}$ that $x_{i}$ entails $\rho^{t}$, and return the rounded value $r_{i}^{t}=\lfloor w_{i}^{t}\rceil$ in the case of binary observations (PEBOL-B) and $w_{i}^{t}$ in the case of probabilistic observations (PEBOL-P).
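Putting Sections 4.3.1 and 4.3.2 together, a hedged end-to-end sketch (ours) that builds $\rho^{t}$ from the aspect and yes/no response, scores every item description against it with the same bart-large-mnli model via the zero-shot pipeline, and rounds for PEBOL-B:

```python
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def infer_ratings(item_descriptions, aspect, response):
    """Map a yes/no response about an aspect to per-item entailment scores w_i."""
    pref = aspect if response == "yes" else f"not {aspect}"   # rho^t
    w = [nli(x, candidate_labels=[pref])["scores"][0] for x in item_descriptions]
    r = [round(wi) for wi in w]                               # PEBOL-B ratings
    return w, r

w, r = infer_ratings(["The Lion King", "Titanic"], "animated", "yes")
print(w, r)  # we expect "The Lion King" to receive the higher score
```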

4.4. The Complete PEBOL System

This concludes the PEBOL specification – the entire process from prior utility belief to the LLM-based acquisition function generation of a query to the posterior utility update is illustrated in Figure 2.

Figure 6. MRR@10 for PEBOL using binary vs. probabilistic entailment scores. PEBOL-P with the best policy (TS on Yelp and MovieLens, UCB on Recipe-MPR) generally outperforms PEBOL-B.

5. Experimental Methods

We numerically evaluate our PEBOL variations through controlled NL-PE dialogue experiments across multiple datasets and response noise levels – comparing against two monolithic LLM (MonoLLM) baselines. Specifically, these baselines directly use GPT-3.5-turbo-0613 (GPT MonoLLM) or Gemini-Pro (Gemini MonoLLM) as the NL-PE system, as described in Section 5.1. We do not compare against ConvRec methods (Li et al., 2018; Chen et al., 2019; Wang et al., 2022; Yang et al., 2022) because they are not cold-start systems, requiring observed user-item interaction data to drive their recommendation modules. We also do not base our experiments on ConvRec datasets such as ReDIAL (Li et al., 2018), since they are made up of pre-recorded conversation histories and cannot be used to evaluate active, cold-start NL-PE systems.

5.1. MonoLLM Baseline

A major challenge of using MonoLLM for NL-PE is that item descriptions $\mathbf{x}$ either need to be internalized through training or be provided in the context window (cf. Sec. 4) – since we focus on fully cold-start settings, we test the latter approach as a baseline. In each turn, given the full conversation history $\mathcal{H}^{t}$ and $\mathbf{x}$, we prompt the MonoLLM to generate a new query to elicit user preferences – all prompts are shown in the Supplementary Materials. We evaluate recommendation performance after each turn by using another prompt to recommend a list of ten item names from $\mathbf{x}$ given $\mathcal{H}^{t}$. Due to context window limits, this MonoLLM approach is only feasible for small item sets with short item descriptions; thus, we have to limit $|\mathcal{I}|$ to 100 for fair comparison to the MonoLLM baseline.

5.2. Simulation Details

We test PEBOL and MonoLLM through NL-PE dialogues with LLM-simulated users, where the simulated users’ item preferences remain hidden from the system. We evaluate recommendation performance over 10 turns of dialogue.

5.2.1. User Simulation

For each experiment, we simulate 100 users, each of which likes a single item $i\in\mathcal{I}$. Each user is simulated by GPT-3.5-turbo-0613, which is given item description $x_{i}$ and instructed to provide only “yes” or “no” responses to a query $q^{t}$ as if it were a user who likes item $i$.

5.2.2. Evaluating Recommendations

We evaluate the top-10 recommendations in each turn using the Mean Reciprocal Rank (MRR@10) of the preferred item, which is equivalent to MAP@10 for the case of a single preferred item.
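For reference, a small sketch of this metric (ours), assuming one preferred item per user and a ranked list of recommended items:

```python
def mrr_at_10(ranked_lists, preferred_items):
    """Mean reciprocal rank of each user's single preferred item, cutoff at 10."""
    total = 0.0
    for ranking, target in zip(ranked_lists, preferred_items):
        if target in ranking[:10]:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(preferred_items)

# Two simulated users: one target ranked 2nd, one missing from the top 10.
print(mrr_at_10([["a", "b", "c"], ["d", "e", "f"]], ["b", "z"]))  # (0.5 + 0) / 2
```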

5.2.3. PEBOL Query Generation

In turn $t$, given an item description $x_{i}$ and previously generated aspects $(a^{1},...,a^{t-1})$, an LLM (GPT-3.5-turbo-0613) is prompted to generate an aspect $a^{t}$ describing item $i$ that is no more than 3 words long. (Experiments with newer LLMs such as Gemini or GPT-4 for PEBOL query generation are left for future work due to the API time and cost requirements needed to simulate the many variants of PEBOL reported in Section 6. We do, however, compare PEBOL against a Gemini MonoLLM baseline.) The LLM is then prompted again to generate a “yes-or-no” query asking if a user prefers $a^{t}$.

5.2.4. NLI

We use the 400M-parameter FAIR MNLI model (https://huggingface.co/facebook/bart-large-mnli) to predict logits for entailment, contradiction, and neutral, and divide these logits by an MNLI temperature $T\in\{1,10,100\}$. As per the FAIR guidelines, we pass the temperature-scaled entailment and contradiction scores through a softmax layer and take the entailment probabilities. We report PEBOL results using the MNLI temperature that performs best on the most datasets.
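A sketch of this scoring step as we read it (ours; the entailment/contradiction indices are taken from the model config rather than hard-coded, and the helper name and default temperature are illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(premise, hypothesis, temperature=10.0):
    """Temperature-scaled softmax over the entailment vs. contradiction logits."""
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0] / temperature
    label2id = {v.lower(): k for k, v in model.config.id2label.items()}
    pair = logits[[label2id["contradiction"], label2id["entailment"]]]
    return torch.softmax(pair, dim=0)[1].item()  # P(entailment)

print(entailment_prob("The Lion King", "animated"))
```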

5.2.5. User Response Noise

We test three user response noise levels $\in\{0,0.25,0.5\}$, corresponding to the proportion of user responses that are randomly selected between “yes” and “no”.
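A sketch of this noise model (ours; helper name is illustrative):

```python
import random

def noisy_response(true_response, noise_level, rng=random):
    """With probability noise_level, replace the simulated user's answer
    with a uniformly random "yes"/"no"; otherwise pass it through."""
    if rng.random() < noise_level:
        return rng.choice(["yes", "no"])
    return true_response

print([noisy_response("yes", 0.5) for _ in range(5)])
```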

5.2.6. Omitting Query History Ablation

We test how tracking query history in PEBOL affects performance with an ablation study that removes previously generated aspects $(a^{1},...,a^{t-1})$ from the aspect extraction prompt.

5.3. Datasets

We obtain item descriptions from three real-world datasets: MovieLens 25M (https://grouplens.org/datasets/movielens/25m/), Yelp (https://www.yelp.com/dataset), and Recipe-MPR (Zhang et al., 2023), with example item descriptions from each shown in Table 1 in the Supplementary Materials. After the filtering steps below for Yelp and MovieLens, we randomly sample 100 items to create $\mathbf{x}$. For Yelp, we filter restaurant descriptions to be from a single major North American city (Philadelphia) and to have at least 50 reviews and five or more category labels. For MovieLens, we filter movies to be in the top 10% by rating count with at least 20 tags, and let movie descriptions use the title, genre labels, and 20 most common user-assigned tags. (For all experiments with MovieLens, we use the 16k version of GPT-3.5-turbo-0613, due to MonoLLM requiring extra context length for $\mathbf{x}$.)

Figure 7. The effect of including the generated aspect history in the aspect generation prompt. Including the history improves performance, which we hypothesize is due to reducing repeated or uninformative queries.
Figure 8. The effect of user response noise on MRR@10 – error bars are 95% confidence intervals.

5.4. Research Questions

Our experiments explore the following research questions (RQs):

  • RQ1: How does PEBOL perform against the MonoLLM baselines?

  • RQ2: Does PEBOL perform better with binary or probabilistic observations, and how sensitive is the latter to temperature?

  • RQ3: How do PEBOL and MonoLLM perform under user response noise?

  • RQ4: How do the context selection policies of TS, UCB, ER, Greedy, and Random affect PEBOL performance?

  • RQ5: How much does PEBOL performance depend on access to the query history during query generation?

6. Experimental Results

6.1. RQ1 - PEBOL vs. MonoLLM

Figure 4 shows MRR@10 over 10 dialogue turns for MonoLLM and PEBOL (UCB, TS, ER), with 95% confidence intervals (CIs) at turn 10 shown in Figure 8 (see Supplementary Materials for CIs for all turns and experiments). For each PEBOL policy, we use the MNLI temperature that performed best on the most datasets with continuous responses (see Supplementary Materials). All methods start near random guessing, reflecting a cold start, and show clear preference learning over time.

Compared to GPT-MonoLLM, which uses the same LLM (GPT-3.5-turbo-0613) as PEBOL does for query generation, after 10 turns of dialogue PEBOL achieves: a mean MRR@10 of 0.27 vs. GPT-MonoLLM’s MRR@10 of 0.12 on Yelp; 0.18 vs 0.09 on MovieLens; and 0.17 vs 0.11 on Recipe-MPR, respectively. Compared to Gemini-MonoLLM, which uses a newer generation LLM (Gemini-Pro) than PEBOL for query generation, after 10 turns PEBOL still achieves a higher mean MRR@10 of 0.27 vs. Gemini-MonoLLM’s mean MRR@10 of 0.17 on Yelp; 0.18 vs. 0.15 on MovieLens, and 0.17 vs 0.16 on Recipe-MPR, respectively. While we did not have the resources to test PEBOL with Gemini query generation, we hypothesize that using a newer LLM for query generation can further improve PEBOL performance, since using the newer LLM (Gemini) shows performance improvements for the MonoLLM baseline.

6.2. RQ2 - Binary vs. Probabilistic Responses

Figure 6 compares PEBOL performance using binary (PEBOL-B) vs. continuous (PEBOL-P) feedback, and shows that performance is typically better when continuous responses are used – indicating that binary feedback models discard valuable information from the entailment probabilities.

6.3. RQ3 - Effect of User Response Noise

Figure 8 shows the impact of user response noise on MRR@10 at turn 10 – PEBOL generally continues to outperform MonoLLM under user response noise. Specifically, at all noise levels, both MonoLLM baselines are outperformed by all PEBOL-P variants on Yelp, and by at least one PEBOL-P variant on MovieLens and Recipe-MPR.

6.4. RQ4 - Comparison of Context Acquisition Policies

Figure 5 compares the performance of various PEBOL context acquisition policies – all policies show active preference learning, other than random item selection on Recipe-MPR. There is considerable overlap between methods; however, for most turns TS does well on Yelp and MovieLens while being beaten by Greedy, ER, and UCB on Recipe-MPR. As expected due to the randomness in sampling, TS performance is correlated with random item selection, while UCB performs quite similarly to Greedy.

6.5. RQ5 - Effect of Aspect History in Query Generation

As shown in Figure 7, we see improvements in PEBOL performance from including a list of previously generated aspects in the aspect generation prompt. For instance, the differences in mean MRR@10 from including vs. excluding the query history for TS after 10 turns were: 0.27 vs 0.16 for Yelp; 0.18 vs 0.14 for MovieLens, and 0.13 vs 0.09 for Recipe-MPR, respectively. Practically, including the aspect generation history also helps to avoid repeat queries, which gain no information and could frustrate a user.

7. Conclusion and Future Work

This paper presents a novel Bayesian optimization formalization of natural language (NL) preference elicitation (PE) over arbitrary NL item descriptions, as well as introducing and evaluating PEBOL, an algorithm for NL Preference Elicitation with Bayesian Optimization augmented LLMs. As discussed below, our study also presents many opportunities for future work, including for addressing some of the limitations of PEBOL and our experiment setup.

User Studies

Firstly, our experiments are limited by their reliance on LLM-simulated users. While the dialogue simulations indicate reasonable behaviour in the observed results, such as initial recommendation performance near random guessing, preference learning over time, and coherent user responses in logs such as those shown in Figure 3, future work would benefit from human user studies.

Multi-Item Belief Updates

While the assumption that item utilities can be updated independently allows the use of a simple and interpretable Beta-Bernoulli update model for each item, it also requires a separate NLI calculation to be performed for each item, which is computationally expensive. A key future direction is thus to explore alternative belief state forms which enable the joint updating of beliefs over all items from a single NLI computation.

Collaborative Belief Updates

Since PEBOL does not leverage any historical interactions with other users, an important future direction is to study NL-PE which leverages collaborative, multi-user data. One possibility is to initialize a cold start user’s prior beliefs based on interaction histories with other users. Another direction is adapting collaborative filtering based belief updating, such as the methods used in item-based feedback PE techniques (e.g., (Christakopoulou et al., 2016)), to NL-PE.

Diverse Query Forms

While PEBOL uses a pointwise query generation strategy that selects one item description at a time for LLM context, future work can explore LLM-based acquisition functions with pairwise and setwise context selection. Such multi-item context selection would enable contrastive query generation that could better discriminate between item preferences.

NL-PE in ConvRec Architectures

Another direction for future research is the integration of NL-PE methodologies such as PEBOL into conversational recommendation (ConvRec) system architectures (e.g., (Friedman et al., 2023; Kemper et al., 2024; Deldjoo et al., 2024; Korikov et al., 2024b)), which must balance many tasks including recommendation, explanation, and personalized question answering. Thus, in contrast to PEBOL’s pointwise queries and “yes-or-no” user responses, the use of PE in ConvRec systems implies that future algorithms will need to elicit preferences based on arbitrary pairs of NL system-user utterances. In these potential extensions, aspect-based NLI could be enabled by extracting aspects from utterances with LLMs (Korikov et al., 2024a).

References

  • Biyik et al. (2023) Erdem Biyik, Fan Yao, Yinlam Chow, Alex Haig, Chih-wei Hsu, Mohammad Ghavamzadeh, and Craig Boutilier. 2023. Preference Elicitation with Soft Attributes in Interactive Recommendation. ArXiv abs/2311.02085 (2023). https://api.semanticscholar.org/CorpusID:265034238
  • Boutilier (2002) Craig Boutilier. 2002. A POMDP formulation of preference elicitation problems. In AAAI/IAAI. Edmonton, AB, 239–246.
  • Brochu et al. (2010) Eric Brochu, Tyson Brochu, and Nando De Freitas. 2010. A Bayesian interactive optimization approach to procedural animation design. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. 103–112.
  • Chen et al. (2019) Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 1803–1813. https://doi.org/10.18653/v1/D19-1189
  • Christakopoulou et al. (2016) Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 815–824.
  • Dagan et al. (2005) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, 177–190.
  • Deldjoo et al. (2024) Yashar Deldjoo, Zhankui He, Julian McAuley, Anton Korikov, Scott Sanner, Arnau Ramisa, René Vidal, Maheswaran Sathiamoorthy, Atoosa Kasirzadeh, and Silvia Milano. 2024. A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys). In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain.
  • Eric et al. (2007) Brochu Eric, Nando Freitas, and Abhijeet Ghosh. 2007. Active preference learning with discrete choice data. Advances in Neural Information Processing Systems 20 (2007).
  • Friedman et al. (2023) Luke Friedman, Sameer Ahuja, David Allen, Terry Tan, Hakim Sidahmed, Changbo Long, Jun Xie, Gabriel Schubiner, Ajay Patel, Harsh Lara, et al. 2023. Leveraging Large Language Models in Conversational Recommender Systems. arXiv preprint arXiv:2305.07961 (2023).
  • Garnett (2023) Roman Garnett. 2023. Bayesian optimization. Cambridge University Press.
  • González et al. (2017) Javier González, Zhenwen Dai, Andreas Damianou, and Neil D Lawrence. 2017. Preferential bayesian optimization. In International Conference on Machine Learning. PMLR, 1282–1291.
  • Guo and Sanner (2010) Shengbo Guo and Scott Sanner. 2010. Real-time multiattribute Bayesian preference elicitation with pairwise comparison queries. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 289–296.
  • Guo et al. (2010) Shengbo Guo, Scott Sanner, and Edwin V Bonilla. 2010. Gaussian process preference elicitation. Advances in Neural Information Processing Systems 23 (2010).
  • Handa et al. (2024) Kunal Handa, Yarin Gal, Ellie Pavlick, Noah Goodman, Jacob Andreas, Alex Tamkin, and Belinda Z. Li. 2024. Bayesian Preference Elicitation with Language Models. arXiv:2403.05534 [cs.CL]
  • He et al. (2023) Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large Language Models as Zero-Shot Conversational Recommenders. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023).
  • Karamanolakis et al. (2018) Giannis Karamanolakis, Kevin Raji Cherian, Ananth Ravi Narayan, Jie Yuan, Da Tang, and Tony Jebara. 2018. Item recommendation with variational autoencoders and heterogeneous priors. In Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems. 10–14.
  • Kemper et al. (2024) Sara Kemper, Justin Cui, Kai Dicarlantonio, Kathy Lin, Danjie Tang, Anton Korikov, and Scott Sanner. 2024. Retrieval-Augmented Conversational Recommendation with Prompt-based Semi-Structured Natural Language State Tracking. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (Washington, DC, USA) (SIGIR ’24). ACM, New York, NY, USA.
  • Khajah et al. (2016) Mohammad M. Khajah, Brett D. Roads, Robert V. Lindsey, Yun-En Liu, and Michael C. Mozer. 2016. Designing Engaging Games Using Bayesian Optimization. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (San Jose, California, USA) (CHI ’16). Association for Computing Machinery, New York, NY, USA, 5571–5582. https://doi.org/10.1145/2858036.2858253
  • Korikov et al. (2024a) Anton Korikov, George Saad, Ethan Baron, Mustafa Khan, Manav Shah, and Scott Sanner. 2024a. Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion. In Proceedings of the 1st SIGIR’24 Workshop on Information Retrieval’s Role in RAG Systems, July 18, 2024, Washington D.C., USA.
  • Korikov et al. (2024b) Anton Korikov, Scott Sanner, Yashar Deldjoo, Francesco Ricci, Zhankui He, Julian McAuley, Arnau Ramisa, Rene Vidal, Maheswaran Sathiamoorthy, Atoosa Kasirzadeh, and Silvia Milano. 2024b. Large Language Model Driven Recommendation. arXiv preprint arXiv:2404.XXXXX (2024).
  • Lee et al. (2019) Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-Learned User Preference Estimator for Cold-Start Recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1073–1082. https://doi.org/10.1145/3292500.3330859
  • Lei et al. (2020) Wenqiang Lei, Gangyi Zhang, Xiangnan He, Yisong Miao, Xiang Wang, Liang Chen, and Tat-Seng Chua. 2020. Interactive Path Reasoning on Graph for Conversational Recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 2073–2083. https://doi.org/10.1145/3394486.3403258
  • Li et al. (2023) Belinda Z. Li, Alex Tamkin, Noah Goodman, and Jacob Andreas. 2023. Eliciting Human Preferences with Language Models. arXiv:2310.11589 [cs.CL]
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. 661–670.
  • Li et al. (2011) Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. 297–306.
  • Li et al. (2018) Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. Advances in Neural Information Processing Systems 31 (2018).
  • Li et al. (2021) Shijun Li, Wenqiang Lei, Qingyun Wu, Xiangnan He, Peng Jiang, and Tat-Seng Chua. 2021. Seamlessly Unifying Attributes and Items: Conversational Recommendation for Cold-start Users. ACM Trans. Inf. Syst. 39, 4, Article 40 (aug 2021), 29 pages. https://doi.org/10.1145/3446427
  • Maynez et al. (2020) Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 1906–1919. https://doi.org/10.18653/v1/2020.acl-main.173
  • Rossi and Sperduti (2004) Francesca Rossi and Allesandro Sperduti. 2004. Acquiring both constraint and solution preferences in interactive constraint systems. Constraints 9, 4 (2004), 311–332.
  • Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. 2016. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 104, 1 (2016), 148–175. https://doi.org/10.1109/JPROC.2015.2494218
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Marilyn Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, New Orleans, Louisiana, 809–819. https://doi.org/10.18653/v1/N18-1074
  • Vendrov et al. (2020) Ivan Vendrov, Tyler Lu, Qingqing Huang, and Craig Boutilier. 2020. Gradient-Based Optimization for Bayesian Preference Elicitation. Proceedings of the AAAI Conference on Artificial Intelligence 34, 06 (Apr. 2020), 10292–10301. https://doi.org/10.1609/aaai.v34i06.6592
  • Wang et al. (2022) Xiaolei Wang, Kun Zhou, Ji-Rong Wen, and Wayne Xin Zhao. 2022. Towards unified conversational recommender systems via knowledge-enhanced prompt learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1929–1937.
  • Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans, Louisiana). Association for Computational Linguistics, 1112–1122. http://aclweb.org/anthology/N18-1101
  • Yang et al. (2022) Bowen Yang, Cong Han, Yu Li, Lei Zuo, and Zhou Yu. 2022. Improving Conversational Recommendation Systems’ Quality with Context-Aware Item Meta-Information. In Findings of the Association for Computational Linguistics: NAACL 2022. 38–48.
  • Yang et al. (2021) Hojin Yang, Scott Sanner, Ga Wu, and Jin Peng Zhou. 2021. Bayesian Preference Elicitation with Keyphrase-Item Coembeddings for Interactive Recommendation. In Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization. 55–64.
  • Yin et al. (2019) Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, China, 3914–3923. https://doi.org/10.18653/v1/D19-1404
  • Zhang et al. (2023) Haochen Zhang, Anton Korikov, Parsa Farinneya, Mohammad Mahdi Abdollah Pour, Manasa Bharadwaj, Ali Pesaranghader, Xi Yu Huang, Yi Xin Lok, Zhaoqi Wang, Nathan Jones, and Scott Sanner. 2023. Recipe-MPR: A Test Collection for Evaluating Multi-aspect Preference-based Natural Language Retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2744–2753. https://doi.org/10.1145/3539618.3591880
  • Zhang et al. (2020) Xiaoying Zhang, Hong Xie, Hang Li, and John C.S. Lui. 2020. Conversational Contextual Bandit: Algorithm and Application. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW ’20). Association for Computing Machinery, New York, NY, USA, 662–672. https://doi.org/10.1145/3366423.3380148
  • Zhao et al. (2013) Xiaoxue Zhao, Weinan Zhang, and Jun Wang. 2013. Interactive collaborative filtering. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 1411–1420.

Appendix A Probabilistic Graphical Model for Posterior Utility Update

Figure 9. Graphical model used for posterior utility updates. $u_{i}$ is the random variable representing the utility of item $i\in\mathcal{I}$ and $x_{i}$ is the item description. $q^{t}$ is the query presented at iteration $t$, $r_{i}^{t}$ is the variable representing the latent (not directly observed) relevance of item $i$ at step $t$, and $e_{i}^{t}$ is the binary observation representing whether the item description entails the user preference. Unshaded and shaded nodes indicate unobserved and observed variables, respectively.

In this section, we present the probabilistic graphical model for the posterior utility updates introduced in the main paper, with a more detailed derivation of the posterior utility belief. As discussed in the paper, the objective of posterior inference is to update the prior belief over the utility of each item $i \in \mathcal{I}$, denoted $p(u_i)$, given the query $q^t$, the item description $x_i$, and $e_i^t$, the binary observation variable representing whether the item description entails the user's response to the queried preference $q^t$.

Figure 9 shows the graphical model representation of these variables. The presented query $q^t$ and the item description $x_i$ are observed (shaded), while the relevance of item $i$ to query $q^t$, denoted $r_i^t \in \{0,1\}$, is latent (unshaded) and conditioned on the latent item utility $u_i$. We observe whether the item description $x_i$ "truly" entails (i.e., $e_i^t = \text{True}$) the user's response to query $q^t$, as determined by the NLI entailment probability $w_i^t$ when the item is relevant (i.e., $r_i^t = 1$). The conditional probability distributions in this graphical model are formally defined as follows:

(20) $p(u_i) = \text{Beta}(u_i;\, \alpha_i, \beta_i)$,
(21) $p(r_i^t \mid u_i) = \text{Bernoulli}(u_i)$,
(22) $p(e_i^t = \text{True} \mid r_i^t, q^t, x_i) = \begin{cases} w_i^t & r_i^t = 1 \\ 1 - w_i^t & r_i^t = 0. \end{cases}$

To further explain the rationale for Eq. (22), we note that $w_i^t$ is the natural language entailment probability that item description $x_i$ entails the aspect queried in the user's response to $q^t$, given that item $i$ is relevant ($r_i^t = 1$). This entailment probability is obtained from the NLI model, which produces the probability that the entailment is true, hence $e_i^t = \text{True}$. If item $i$ is instead irrelevant ($r_i^t = 0$), then we assume for simplicity that $1 - w_i^t$ is the probability of a true entailment. (We note that an alternative approach, not used here, could attempt to use NLI to determine the probability of an incorrect true entailment, i.e., confusion, given that item $i$ is irrelevant ($r_i^t = 0$); there is no inherent requirement for the two cases of Eq. (22) to sum to 1, since $r_i^t$ is on the conditional side.)
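To make the generative semantics of Eqs. (20)-(22) concrete, the following minimal Python sketch samples one observation from the model; the variable names and the example entailment probability are illustrative, not taken from our released code.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 1.0, 1.0     # Beta prior parameters for item i, Eq. (20)
w = 0.9                    # NLI entailment probability w_i^t (illustrative value)

u = rng.beta(alpha, beta)  # latent utility u_i ~ Beta(alpha, beta), Eq. (20)
r = rng.binomial(1, u)     # latent relevance r_i^t ~ Bernoulli(u_i), Eq. (21)

# Eq. (22): P(e_i^t = True) is w if the item is relevant, 1 - w otherwise.
p_e_true = w if r == 1 else 1.0 - w
e = rng.random() < p_e_true   # observed binary entailment e_i^t
```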

To obtain the posterior utility $p(u_i \mid x_i, q^t, e_i^t)$, we need to marginalize the joint distribution $p(u_i, r_i^t \mid x_i, q^t, e_i^t)$ over $r_i^t$. Formally,

(23) $p(u_i \mid x_i, q^t, e_i^t) = \sum_{r_i^t} p(u_i, r_i^t \mid x_i, q^t, e_i^t).$

Considering the conditional independencies determined from the graphical model, the joint distribution factorizes (up to the normalizing constant $p(e_i^t \mid x_i, q^t)$) as

(24) $p(u_i, r_i^t \mid x_i, q^t, e_i^t) \propto p(u_i)\, p(r_i^t \mid u_i)\, p(e_i^t \mid r_i^t, x_i, q^t).$

Next, we substitute the probability distribution of each factor according to Equations (20), (21), and (22) into (23) to obtain

(25) $p(u_i \mid x_i, q^t, e_i^t = \text{True}) \propto \sum_{r_i^t \in \{0,1\}} u_i^{\alpha-1}(1-u_i)^{\beta-1} \left( \begin{cases} u_i & r_i^t = 1 \\ 1 - u_i & r_i^t = 0 \end{cases} \right) \left( \begin{cases} w_i^t & r_i^t = 1 \\ 1 - w_i^t & r_i^t = 0 \end{cases} \right),$
where $u_i^{\alpha-1}(1-u_i)^{\beta-1}$ is the kernel of the $\text{Beta}(u_i; \alpha, \beta)$ prior (dropping the $i$ subscripts on $\alpha_i, \beta_i$ for brevity).

Expanding the summation yields

(26) $p(u_i \mid x_i, q^t, e_i^t = \text{True}) \propto w_i^t\, u_i^{\alpha}(1-u_i)^{\beta-1} + (1-w_i^t)\, u_i^{\alpha-1}(1-u_i)^{\beta} \propto w_i^t\, \text{Beta}(u_i; \alpha+1, \beta) + (1-w_i^t)\, \text{Beta}(u_i; \alpha, \beta+1).$

The final expression is a mixture of Beta distributions, which is challenging to maintain: each posterior update doubles the number of mixture components, so after $m$ query observations the exact posterior has $2^m$ components, leading to substantial computational and memory complexity.
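A small sketch illustrates this blow-up; the helper below (our illustration, not part of the paper's code) tracks every unnormalized component of the exact posterior and doubles the component count on each update.

```python
def exact_mixture(alpha, beta, ws):
    """Exact (unnormalized) Beta-mixture posterior after observing entailment
    probabilities ws = [w^1, ..., w^m]: each update splits every component
    into an r=1 branch, Beta(a+1, b), and an r=0 branch, Beta(a, b+1)."""
    components = [(1.0, alpha, beta)]  # (weight, a, b)
    for w in ws:
        components = [branch
                      for weight, a, b in components
                      for branch in ((weight * w, a + 1, b),
                                     (weight * (1 - w), a, b + 1))]
    return components

print(len(exact_mixture(1.0, 1.0, [0.9, 0.7, 0.8])))  # 2**3 = 8 components
```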

To address this issue, several methods have been proposed to approximate the posterior distribution and allow tractable computation. In this work, we use the Assumed Density Filtering (ADF) approach, a technique widely used in Bayesian filtering and tracking problems to project a complex posterior onto an assumed simpler form (often the same form as the prior, to maintain a closed form). In our case, we project the Beta mixture posterior onto a single Beta distribution in order to maintain a closed-form Beta approximation of the posterior update matching the form of the Beta prior in Eq. (20).

To apply ADF, we assume a Beta distribution with parameters $\alpha'$ and $\beta'$ for the posterior, and approximate the original Beta mixture with this distribution by equating their first moments (i.e., their means):

(27) $\frac{\alpha'}{\alpha' + \beta'} = \frac{w_i^t (1 + \alpha)}{1 + \alpha + \beta} + \frac{\alpha (1 - w_i^t)}{1 + \alpha + \beta}$
(28) $= \frac{[\alpha + w_i^t]}{[\alpha + w_i^t] + [\beta + (1 - w_i^t)]}.$

Equating the numerators yields

(29) $\alpha' = \alpha + w_i^t,$

and substituting this $\alpha'$ into the equality of denominators results in

(30) $\beta' = 1 + \beta - w_i^t.$

Thus, the "mean matched" posterior is $\text{Beta}(u_i;\, \alpha + w_i^t,\; 1 + \beta - w_i^t)$.
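As a quick numerical check of this update (a worked example added here for illustration), start from a uniform prior $\text{Beta}(u_i; 1, 1)$ and suppose the NLI model returns $w_i^t = 0.8$. The update gives

$\text{Beta}(u_i;\, 1 + 0.8,\; 1 + 1 - 0.8) = \text{Beta}(u_i;\, 1.8,\, 1.2),$

whose mean $1.8 / (1.8 + 1.2) = 0.6$ equals the mixture mean $(\alpha + w_i^t)/(1 + \alpha + \beta) = 1.8/3$ from Eq. (28), shifting the belief from the prior mean of $0.5$ toward higher utility.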

Matching two distributions by equating their first moments is a special case of a more general technique called "moment matching," which is widely used to approximate a complex probability distribution with a simpler one by equating their moments. In our work, we adopt the first-moment special case, which we refer to as "mean matching," for its simplicity and intuitive interpretation. This is only one possible solution, however, and a complete moment-matching derivation results in a slightly different posterior.
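For completeness, one standard way to carry out fuller moment matching for the Beta family (added here for illustration; not the derivation used in the paper) is the method of moments: compute the mean $\mu$ and variance $\sigma^2$ of the Beta mixture in Eq. (26) and set

$\alpha' = \mu \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right), \qquad \beta' = (1 - \mu) \left( \frac{\mu (1 - \mu)}{\sigma^2} - 1 \right),$

which matches both the first and second moments and, in general, yields slightly different parameters than Eqs. (29) and (30).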

With this "mean matching" derivation and the current item-$i$ posterior $\text{Beta}(u_i; \alpha, \beta)$ at step $t-1$, we can now perform an incremental posterior update at step $t$: given the probability $w_i^t$ that the item description $x_i$ entails the preference query $q^t$, we obtain the closed-form Beta posterior $\text{Beta}(u_i;\, \alpha + w_i^t,\; 1 + \beta - w_i^t)$, as used in PEBOL-P.
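The sketch below shows how this closed-form update can drive Bayesian optimization policies such as Thompson Sampling (TS) and UCB over per-item Beta beliefs. It is a minimal illustration under our own naming conventions; the class, the method names, and the 0.95 UCB quantile are assumptions, not the released PEBOL implementation.

```python
import numpy as np
from scipy.stats import beta as beta_dist

class BetaBeliefs:
    """Per-item Beta beliefs over utilities, updated with the closed-form
    ADF rule Beta(alpha + w, 1 + beta - w) derived above."""

    def __init__(self, n_items, alpha0=1.0, beta0=1.0):
        self.alpha = np.full(n_items, alpha0)  # prior Beta(alpha0, beta0) per item
        self.beta = np.full(n_items, beta0)

    def update(self, w):
        """w[i]: NLI probability w_i^t that item i's description entails
        the user's response to this turn's query."""
        self.alpha = self.alpha + w
        self.beta = self.beta + (1.0 - w)

    def thompson_sample(self, rng):
        """TS: sample one utility per item from the beliefs, pick the argmax."""
        return int(np.argmax(rng.beta(self.alpha, self.beta)))

    def ucb(self, q=0.95):
        """UCB: pick the item with the highest posterior utility quantile."""
        return int(np.argmax(beta_dist.ppf(q, self.alpha, self.beta)))

beliefs = BetaBeliefs(n_items=4)
beliefs.update(np.array([0.9, 0.2, 0.6, 0.4]))   # one turn of NLI feedback
print(beliefs.thompson_sample(np.random.default_rng(0)), beliefs.ucb())
```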

Table 1. Examples of LLM Aspect-Based Query Generation from an Item Description
Dataset: MovieLens
Item description $d_i$: Movie Title: Meet John Doe (1941). Genres: Comedy, Drama. Tags: Christianity, Frank Capra, acting, anti-fascism, class issues, journalism, patriotic, pro american, thought provoking, AFI 100 (Cheers), BD-R, Barbara Stanwyck, Gary Cooper, baseball player, compare: This Is Our Land (2017), domain, funny, radio broadcast, reviewed in the NYer by Anthony Lanne (2018-04-30), suicide note
Generated aspect $a^t$ → generated query $q^t$:
  patriotism → "Are you interested in movies with patriotic themes?"
  classic → "Do you enjoy classic movies?"

Dataset: Recipe-MPR
Item description $d_i$: Spaghetti with mushrooms, onion, green pepper, chicken breasts, and alfredo sauce
Generated aspect $a^t$ → generated query $q^t$:
  alfredo sauce → "Do you like alfredo sauce?"
  chicken breast → "Do you like chicken breasts?"

Dataset: Yelp
Item description $d_i$: name: Le Pain Quotidien; categories: Restaurants, Bakeries, Breakfast & Brunch, Coffee & Tea, Food, Belgian, French
Generated aspect $a^t$ → generated query $q^t$:
  bakery → "Do you like bakeries?"
  French pastries → "Do you like French pastries?"
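The exact prompting templates used for aspect extraction and query generation appear in Figure 14; as a rough indication of the pattern behind Table 1, a hypothetical template might look like the following. The wording and function name are our illustration, not the paper's actual prompt.

```python
def aspect_query_prompt(item_description, asked_aspects):
    """Hypothetical aspect-based query generation prompt in the spirit of
    Table 1 and Figure 14(c)-(d); not the paper's actual template."""
    history = ", ".join(asked_aspects) if asked_aspects else "none"
    return (
        f"Item description:\n{item_description}\n\n"
        f"Aspects already asked about: {history}\n"
        "Name one new aspect of this item, then ask the user a short yes/no "
        "question about whether they like that aspect."
    )

print(aspect_query_prompt(
    "name: Le Pain Quotidien; categories: Restaurants, Bakeries, Breakfast "
    "& Brunch, Coffee & Tea, Food, Belgian, French",
    ["bakery"],
))
```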
Figure 10. MAP@10 for all turns on all datasets and noise levels.
Figure 11. The effect of MNLI temperature on MAP@10 without response noise; error bars are 95% C.I.s. Other results are reported using the temperatures that perform best across the most datasets for each policy.
Figure 12. The effect of MNLI temperature on MAP@10 with response noise 0.25.
Figure 13. The effect of MNLI temperature on MAP@10 with response noise 0.5.
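Figures 11-13 study the sensitivity of results to the MNLI softmax temperature. A minimal sketch of computing an entailment probability $w_i^t$ with temperature scaling is below; the checkpoint choice and exactly where the temperature enters are our assumptions, not details confirmed by the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "facebook/bart-large-mnli"   # illustrative off-the-shelf MNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(premise, hypothesis, temperature=1.0):
    """Temperature-scaled probability that the premise (item description)
    entails the hypothesis (user preference utterance)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    probs = torch.softmax(logits / temperature, dim=-1)
    return probs[model.config.label2id["entailment"]].item()

w = entailment_prob(
    "Movie Title: Meet John Doe (1941). Genres: Comedy, Drama.",
    "I enjoy classic movies.",
    temperature=2.0,  # a higher temperature flattens the entailment probabilities
)
```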
Figure 14. LLM prompting templates: (a) monolithic LLM query generation; (b) monolithic LLM recommendation generation; (c) aspect generation; (d) aspect-based query generation; (e) user simulation.
Table 2. MRR@10 Mean and 95% C.I. for PEBOL-P vs MonoLLM on MovieLens without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
MonoLLM GPT Mean 0.04 0.06 0.06 0.07 0.08 0.08 0.10 0.11 0.09 0.09
MonoLLM GPT CI LB 0.01 0.02 0.02 0.02 0.03 0.03 0.05 0.06 0.04 0.04
MonoLLM GPT CI UB 0.08 0.10 0.11 0.11 0.12 0.13 0.15 0.16 0.14 0.15
MonoLLM Gemini Mean 0.04 0.02 0.06 0.06 0.07 0.08 0.11 0.10 0.10 0.15
MonoLLM Gemini CI LB 0.01 0.01 0.02 0.03 0.04 0.05 0.07 0.05 0.06 0.09
MonoLLM Gemini CI UB 0.07 0.04 0.10 0.10 0.11 0.12 0.16 0.15 0.14 0.20
ER Mean 0.05 0.05 0.06 0.08 0.09 0.10 0.12 0.12 0.13 0.14
ER CI LB 0.01 0.02 0.03 0.04 0.05 0.05 0.06 0.07 0.07 0.08
ER CI UB 0.08 0.09 0.09 0.12 0.13 0.15 0.17 0.18 0.19 0.20
Greedy Mean 0.05 0.05 0.07 0.07 0.09 0.09 0.10 0.10 0.11 0.13
Greedy CI LB 0.02 0.02 0.03 0.03 0.05 0.04 0.05 0.05 0.06 0.07
Greedy CI UB 0.09 0.09 0.10 0.11 0.14 0.14 0.15 0.15 0.17 0.19
Random Mean 0.05 0.05 0.08 0.11 0.12 0.14 0.14 0.16 0.16 0.14
Random CI LB 0.01 0.02 0.04 0.05 0.06 0.08 0.08 0.09 0.09 0.09
Random CI UB 0.08 0.09 0.13 0.16 0.18 0.21 0.20 0.22 0.22 0.20
TS Mean 0.05 0.08 0.11 0.10 0.12 0.15 0.15 0.16 0.17 0.18
TS CI LB 0.01 0.03 0.06 0.05 0.06 0.09 0.09 0.10 0.10 0.11
TS CI UB 0.08 0.13 0.17 0.15 0.17 0.22 0.21 0.23 0.23 0.24
UCB Mean 0.05 0.05 0.06 0.07 0.09 0.09 0.10 0.10 0.12 0.13
UCB CI LB 0.02 0.02 0.03 0.03 0.05 0.04 0.05 0.05 0.06 0.07
UCB CI UB 0.09 0.09 0.10 0.10 0.13 0.14 0.15 0.15 0.17 0.19
Table 3. MRR@10 Mean and 95% C.I. for PEBOL-P vs MonoLLM on Recipe-MPR without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
MonoLLM GPT Mean 0.05 0.09 0.07 0.09 0.05 0.06 0.07 0.08 0.09 0.11
MonoLLM GPT CI LB 0.01 0.05 0.03 0.05 0.02 0.02 0.03 0.04 0.04 0.06
MonoLLM GPT CI UB 0.08 0.13 0.10 0.14 0.08 0.10 0.11 0.13 0.13 0.17
MonoLLM Gemini Mean 0.02 0.03 0.04 0.07 0.07 0.11 0.11 0.13 0.10 0.16
MonoLLM Gemini CI LB 0.01 0.02 0.02 0.04 0.03 0.06 0.07 0.08 0.06 0.10
MonoLLM Gemini CI UB 0.03 0.05 0.07 0.10 0.10 0.15 0.15 0.18 0.13 0.21
ER Mean 0.06 0.05 0.08 0.11 0.11 0.13 0.14 0.15 0.16 0.16
ER CI LB 0.02 0.02 0.04 0.06 0.06 0.08 0.08 0.09 0.09 0.10
ER CI UB 0.10 0.08 0.13 0.16 0.16 0.18 0.19 0.22 0.22 0.22
Greedy Mean 0.06 0.06 0.07 0.10 0.11 0.13 0.15 0.15 0.16 0.17
Greedy CI LB 0.02 0.02 0.03 0.05 0.06 0.07 0.09 0.09 0.10 0.11
Greedy CI UB 0.10 0.09 0.11 0.15 0.17 0.18 0.21 0.21 0.23 0.24
Random Mean 0.05 0.03 0.03 0.03 0.04 0.04 0.05 0.06 0.07 0.07
Random CI LB 0.02 0.01 0.01 0.00 0.01 0.01 0.02 0.02 0.03 0.02
Random CI UB 0.08 0.06 0.06 0.05 0.06 0.07 0.08 0.10 0.11 0.11
TS Mean 0.06 0.06 0.06 0.08 0.09 0.10 0.12 0.13 0.13 0.13
TS CI LB 0.02 0.02 0.02 0.03 0.05 0.05 0.06 0.07 0.07 0.07
TS CI UB 0.10 0.09 0.10 0.12 0.14 0.14 0.17 0.19 0.18 0.19
UCB Mean 0.06 0.06 0.08 0.11 0.13 0.14 0.16 0.16 0.17 0.17
UCB CI LB 0.02 0.02 0.03 0.06 0.07 0.08 0.10 0.09 0.11 0.11
UCB CI UB 0.10 0.09 0.12 0.16 0.18 0.19 0.23 0.22 0.24 0.24
Table 4. MRR@10 Mean and 95% C.I. for PEBOL-P vs MonoLLM on Yelp without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
MonoLLM GPT Mean 0.03 0.05 0.04 0.05 0.07 0.07 0.07 0.09 0.09 0.12
MonoLLM GPT CI LB 0.01 0.01 0.01 0.02 0.03 0.03 0.03 0.04 0.04 0.06
MonoLLM GPT CI UB 0.05 0.08 0.07 0.08 0.10 0.12 0.12 0.14 0.14 0.17
MonoLLM Gemini Mean 0.06 0.09 0.11 0.09 0.10 0.14 0.13 0.16 0.20 0.17
MonoLLM Gemini CI LB 0.02 0.05 0.06 0.05 0.06 0.08 0.09 0.10 0.14 0.12
MonoLLM Gemini CI UB 0.10 0.14 0.15 0.13 0.14 0.20 0.18 0.21 0.26 0.23
ER Mean 0.06 0.07 0.09 0.11 0.14 0.16 0.18 0.21 0.22 0.26
ER CI LB 0.03 0.03 0.05 0.06 0.08 0.09 0.12 0.14 0.15 0.18
ER CI UB 0.10 0.11 0.14 0.16 0.20 0.22 0.25 0.28 0.29 0.33
Greedy Mean 0.06 0.08 0.11 0.11 0.14 0.15 0.17 0.18 0.19 0.22
Greedy CI LB 0.03 0.04 0.06 0.07 0.08 0.09 0.10 0.12 0.13 0.16
Greedy CI UB 0.10 0.12 0.15 0.16 0.20 0.21 0.23 0.25 0.26 0.29
Random Mean 0.06 0.09 0.11 0.13 0.17 0.17 0.17 0.18 0.20 0.21
Random CI LB 0.03 0.04 0.06 0.08 0.11 0.11 0.11 0.12 0.13 0.15
Random CI UB 0.10 0.13 0.16 0.19 0.23 0.23 0.23 0.24 0.26 0.28
TS Mean 0.06 0.08 0.11 0.11 0.14 0.17 0.20 0.25 0.25 0.27
TS CI LB 0.03 0.04 0.06 0.07 0.09 0.11 0.13 0.18 0.18 0.20
TS CI UB 0.10 0.12 0.17 0.15 0.20 0.23 0.26 0.33 0.32 0.34
UCB Mean 0.07 0.09 0.11 0.12 0.14 0.16 0.18 0.20 0.20 0.23
UCB CI LB 0.03 0.05 0.06 0.07 0.08 0.09 0.12 0.13 0.14 0.16
UCB CI UB 0.10 0.13 0.16 0.17 0.20 0.22 0.25 0.26 0.27 0.30
Table 5. MRR@10 Mean and 95% C.I. for PEBOL-P without Aspect History on MovieLens without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
TS Mean 0.03 0.05 0.06 0.06 0.08 0.11 0.13 0.12 0.13 0.14
TS CI LB 0.01 0.02 0.02 0.02 0.04 0.06 0.08 0.06 0.07 0.08
TS CI UB 0.06 0.09 0.09 0.09 0.13 0.16 0.19 0.17 0.18 0.19
UCB Mean 0.03 0.03 0.04 0.05 0.05 0.05 0.06 0.06 0.06 0.06
UCB CI LB 0.01 0.01 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02
UCB CI UB 0.05 0.06 0.07 0.09 0.08 0.08 0.09 0.10 0.09 0.09
Table 6. MRR@10 Mean and 95% C.I. for PEBOL-P without Aspect History on Recipe-MPR without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
TS Mean 0.04 0.05 0.03 0.04 0.06 0.07 0.06 0.09 0.10 0.09
TS CI LB 0.01 0.02 0.01 0.01 0.03 0.03 0.03 0.04 0.05 0.05
TS CI UB 0.07 0.09 0.05 0.07 0.10 0.12 0.10 0.13 0.14 0.14
UCB Mean 0.04 0.05 0.06 0.08 0.08 0.09 0.08 0.09 0.10 0.10
UCB CI LB 0.01 0.01 0.02 0.04 0.04 0.04 0.04 0.04 0.05 0.05
UCB CI UB 0.07 0.08 0.09 0.12 0.12 0.13 0.12 0.13 0.14 0.14
Table 7. MRR@10 Mean and 95% C.I. for PEBOL-P without Aspect History on Yelp without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
TS Mean 0.05 0.06 0.07 0.08 0.09 0.11 0.14 0.14 0.15 0.16
TS CI LB 0.02 0.02 0.03 0.04 0.05 0.06 0.09 0.08 0.09 0.10
TS CI UB 0.09 0.09 0.10 0.11 0.13 0.16 0.20 0.19 0.21 0.22
UCB Mean 0.05 0.08 0.09 0.11 0.12 0.15 0.15 0.17 0.17 0.19
UCB CI LB 0.02 0.04 0.05 0.06 0.07 0.09 0.10 0.11 0.11 0.13
UCB CI UB 0.09 0.12 0.13 0.16 0.17 0.20 0.21 0.22 0.22 0.25
Table 8. MRR@10 Mean and 95% C.I. for PEBOL-B on MovieLens without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
ER Mean 0.03 0.04 0.04 0.05 0.06 0.08 0.09 0.11 0.11 0.12
ER CI LB 0.01 0.01 0.01 0.02 0.03 0.04 0.04 0.06 0.07 0.07
ER CI UB 0.06 0.07 0.08 0.08 0.10 0.12 0.14 0.16 0.16 0.16
Greedy Mean 0.03 0.04 0.05 0.05 0.07 0.09 0.10 0.11 0.11 0.11
Greedy CI LB 0.01 0.01 0.02 0.02 0.03 0.05 0.05 0.06 0.07 0.07
Greedy CI UB 0.06 0.08 0.08 0.09 0.11 0.13 0.14 0.16 0.16 0.16
Random Mean 0.03 0.04 0.04 0.06 0.06 0.08 0.09 0.10 0.10 0.10
Random CI LB 0.01 0.01 0.01 0.03 0.03 0.04 0.05 0.06 0.06 0.06
Random CI UB 0.06 0.06 0.07 0.09 0.10 0.12 0.13 0.14 0.14 0.14
TS Mean 0.03 0.05 0.06 0.07 0.08 0.07 0.09 0.12 0.12 0.13
TS CI LB 0.01 0.02 0.02 0.03 0.03 0.03 0.04 0.07 0.07 0.07
TS CI UB 0.06 0.08 0.09 0.11 0.12 0.11 0.13 0.17 0.18 0.18
UCB Mean 0.03 0.04 0.05 0.05 0.06 0.08 0.09 0.10 0.11 0.11
UCB CI LB 0.01 0.01 0.02 0.02 0.03 0.04 0.05 0.06 0.06 0.06
UCB CI UB 0.06 0.07 0.08 0.08 0.09 0.13 0.14 0.15 0.15 0.15
Table 9. MRR@10 Mean and 95% C.I. for PEBOL-B on Recipe-MPR without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
ER Mean 0.04 0.06 0.09 0.07 0.09 0.09 0.10 0.13 0.10 0.10
ER CI LB 0.00 0.02 0.05 0.03 0.04 0.04 0.05 0.07 0.05 0.05
ER CI UB 0.07 0.10 0.14 0.10 0.14 0.14 0.16 0.19 0.15 0.15
Greedy Mean 0.04 0.04 0.06 0.07 0.10 0.11 0.12 0.12 0.12 0.13
Greedy CI LB 0.00 0.01 0.02 0.03 0.05 0.06 0.06 0.06 0.06 0.07
Greedy CI UB 0.07 0.08 0.10 0.10 0.15 0.17 0.18 0.18 0.18 0.19
Random Mean 0.04 0.03 0.03 0.05 0.04 0.04 0.05 0.04 0.05 0.06
Random CI LB 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.02 0.02
Random CI UB 0.07 0.06 0.05 0.09 0.07 0.08 0.08 0.07 0.09 0.11
TS Mean 0.04 0.06 0.06 0.07 0.09 0.09 0.10 0.10 0.11 0.12
TS CI LB 0.00 0.02 0.02 0.03 0.05 0.04 0.05 0.05 0.06 0.06
TS CI UB 0.07 0.09 0.10 0.12 0.14 0.14 0.16 0.15 0.17 0.17
UCB Mean 0.04 0.05 0.07 0.07 0.09 0.11 0.12 0.12 0.13 0.13
UCB CI LB 0.00 0.02 0.03 0.03 0.04 0.05 0.06 0.07 0.07 0.08
UCB CI UB 0.07 0.08 0.11 0.11 0.14 0.16 0.17 0.18 0.18 0.19
Table 10. MRR@10 Mean and 95% C.I. for PEBOL-B on Yelp without Response Noise
Method \ Turn: 1 2 3 4 5 6 7 8 9 10
ER Mean 0.05 0.05 0.07 0.09 0.11 0.11 0.11 0.11 0.11 0.11
ER CI LB 0.02 0.02 0.04 0.05 0.07 0.07 0.07 0.06 0.06 0.06
ER CI UB 0.08 0.09 0.11 0.12 0.15 0.15 0.15 0.15 0.15 0.15
Greedy Mean 0.05 0.06 0.08 0.09 0.10 0.10 0.10 0.10 0.10 0.10
Greedy CI LB 0.02 0.02 0.04 0.05 0.06 0.06 0.06 0.06 0.06 0.06
Greedy CI UB 0.08 0.09 0.11 0.13 0.14 0.13 0.13 0.14 0.14 0.13
Random Mean 0.05 0.06 0.07 0.09 0.11 0.10 0.10 0.11 0.12 0.14
Random CI LB 0.02 0.03 0.03 0.05 0.06 0.06 0.06 0.07 0.07 0.09
Random CI UB 0.08 0.09 0.11 0.14 0.15 0.14 0.14 0.15 0.16 0.19
TS Mean 0.05 0.06 0.12 0.12 0.16 0.16 0.18 0.18 0.22 0.24
TS CI LB 0.02 0.02 0.07 0.07 0.10 0.10 0.12 0.12 0.15 0.17
TS CI UB 0.08 0.10 0.17 0.17 0.22 0.22 0.24 0.24 0.28 0.31
UCB Mean 0.05 0.06 0.08 0.10 0.10 0.10 0.10 0.10 0.10 0.10
UCB CI LB 0.02 0.02 0.04 0.06 0.06 0.06 0.06 0.06 0.06 0.06
UCB CI UB 0.08 0.09 0.12 0.13 0.14 0.14 0.14 0.14 0.14 0.13
Table 11. MRR@10 Mean and 95% C.I. at Turn 10 for PEBOL-P and MonoLLM baselines on MovieLens with Different Levels of Response Noise
Method \ Noise Level: 0 0.25 0.5
MonoLLM GPT Mean 0.09 0.07 0.04
MonoLLM GPT CI LB 0.04 0.03 0.01
MonoLLM GPT CI UB 0.15 0.10 0.07
MonoLLM Gemini Mean 0.15 0.06 0.03
MonoLLM Gemini CI LB 0.09 0.02 0.00
MonoLLM Gemini CI UB 0.20 0.09 0.05
ER Mean 0.14 0.09 0.06
ER CI LB 0.08 0.04 0.03
ER CI UB 0.20 0.14 0.10
Greedy Mean 0.13 0.09 0.06
Greedy CI LB 0.07 0.04 0.02
Greedy CI UB 0.19 0.14 0.10
Random Mean 0.14 0.09 0.07
Random CI LB 0.09 0.05 0.03
Random CI UB 0.20 0.13 0.11
TS Mean 0.18 0.07 0.06
TS CI LB 0.11 0.03 0.02
TS CI UB 0.24 0.10 0.10
UCB Mean 0.13 0.08 0.07
UCB CI LB 0.07 0.04 0.03
UCB CI UB 0.19 0.13 0.11
Table 12. MRR@10 Mean and 95% C.I. at Turn 10 for PEBOL-P and MonoLLM baselines on Recipe-MPR with Different Levels of Response Noise
Method \ Noise Level: 0 0.25 0.5
MonoLLM GPT Mean 0.11 0.06 0.05
MonoLLM GPT CI LB 0.06 0.02 0.01
MonoLLM GPT CI UB 0.17 0.11 0.08
MonoLLM Gemini Mean 0.16 0.13 0.08
MonoLLM Gemini CI LB 0.10 0.07 0.03
MonoLLM Gemini CI UB 0.21 0.19 0.12
ER Mean 0.16 0.15 0.09
ER CI LB 0.10 0.09 0.04
ER CI UB 0.22 0.22 0.13
Greedy Mean 0.17 0.13 0.07
Greedy CI LB 0.11 0.07 0.03
Greedy CI UB 0.24 0.19 0.11
Random Mean 0.07 0.04 0.02
Random CI LB 0.02 0.01 0.01
Random CI UB 0.11 0.06 0.04
TS Mean 0.13 0.06 0.06
TS CI LB 0.07 0.03 0.02
TS CI UB 0.19 0.08 0.10
UCB Mean 0.17 0.13 0.08
UCB CI LB 0.11 0.08 0.04
UCB CI UB 0.24 0.19 0.12
Table 13. MRR@10 Mean and 95% C.I. at Turn 10 for PEBOL-P and MonoLLM baselines on Yelp with Different Levels of Response Noise
Method \ Noise Level: 0 0.25 0.5
MonoLLM GPT Mean 0.12 0.08 0.05
MonoLLM GPT CI LB 0.06 0.03 0.01
MonoLLM GPT CI UB 0.17 0.13 0.09
MonoLLM Gemini Mean 0.17 0.09 0.07
MonoLLM Gemini CI LB 0.12 0.05 0.02
MonoLLM Gemini CI UB 0.23 0.14 0.11
ER Mean 0.26 0.14 0.07
ER CI LB 0.18 0.08 0.03
ER CI UB 0.33 0.19 0.11
Greedy Mean 0.22 0.14 0.08
Greedy CI LB 0.16 0.09 0.04
Greedy CI UB 0.29 0.19 0.11
Random Mean 0.21 0.14 0.10
Random CI LB 0.15 0.09 0.05
Random CI UB 0.28 0.19 0.14
TS Mean 0.27 0.14 0.12
TS CI LB 0.20 0.08 0.06
TS CI UB 0.34 0.20 0.17
UCB Mean 0.23 0.13 0.10
UCB CI LB 0.16 0.08 0.05
UCB CI UB 0.30 0.18 0.14