
Incorporating External Knowledge and Goal Guidance for LLM-based Conversational Recommender Systems

Chuang Li1,2, Yang Deng1, Hengchang Hu1, Min-Yen Kan1, Haizhou Li1,3
1National University of Singapore
2NUS Graduate School for Integrative Sciences and Engineering
3Chinese University of Hong Kong, Shenzhen
{lichuang, hengchanghu}@u.nus.edu
{ydeng, kanmy, haizhou.li}@nus.edu.sg
Abstract

This paper aims to efficiently enable large language models (LLMs) to use external knowledge and goal guidance in conversational recommender system (CRS) tasks. Advanced LLMs (e.g., ChatGPT) are limited in domain-specific CRS tasks in 1) generating grounded responses with recommendation-oriented knowledge, and 2) proactively leading conversations through different dialogue goals. In this work, we first analyze those limitations through a comprehensive evaluation, showing the necessity of external knowledge and goal guidance, which contribute significantly to recommendation accuracy and language quality. In light of this finding, we propose a novel ChatCRS framework that decomposes the complex CRS task into several sub-tasks through the implementation of 1) a knowledge retrieval agent using a tool-augmented approach to reason over external knowledge bases and 2) a goal-planning agent for dialogue goal prediction. Experimental results on two multi-goal CRS datasets reveal that ChatCRS sets new state-of-the-art benchmarks, improving the language quality of informativeness by 17% and proactivity by 27%, and achieving a tenfold enhancement in recommendation accuracy. Our code is publicly available at Git4ChatCRS.

1 Introduction

A conversational recommender system (CRS) integrates conversational and recommendation system (RS) technologies, naturally planning and proactively leading conversations from non-recommendation goals (e.g., “chitchat” or “question answering”) to recommendation-related goals (e.g., “movie recommendation”) Jannach et al. (2021); Liu et al. (2023b). Compared with traditional RS, CRS highlights the multi-round interactions between users and systems using natural language. Besides the recommendation task, evaluated by recommendation accuracy as in RS, CRS also focuses on multi-round interactions in response generation tasks, including asking questions, responding to user utterances, and balancing recommendation versus conversation Li et al. (2023).

Refer to caption
Figure 1: An example of CRS tasks with external knowledge and goal guidance. (Blue: CRS tasks; Red: External Knowledge and Goal Guidance)

Large language models (LLMs; e.g., ChatGPT), which are significantly more proficient in response generation, show great potential in CRS applications. However, current research concentrates on evaluating only their recommendation capability Sanner et al. (2023); Dai et al. (2023). Even though LLMs demonstrate competitive zero-shot recommendation proficiency, their recommendation performance primarily depends on content-based information (internal knowledge) and exhibits sensitivity towards demographic data He et al. (2023); Sanner et al. (2023). Specifically, LLMs excel in domains with ample internal knowledge (e.g., English movies). However, in domains with scarce internal knowledge (e.g., Chinese movies, where the CRS datasets are originally sourced from Chinese movie websites and feature both Chinese and international films), we found through our empirical analysis (§ 3) that their recommendation performance notably diminishes. This limitation of LLM-based CRS motivates exploring solutions from prior CRS research to enhance domain coverage and task performance.

Prior work on CRS has employed general language models (LMs; e.g., DialoGPT) as the base architecture, but bridged the gap to domain-specific CRS tasks by incorporating external knowledge and goal guidance Wang et al. (2021); Liu et al. (2023b). Inspired by this approach, we conduct an empirical analysis on the DuRecDial dataset Liu et al. (2021) to understand how external inputs (in this paper, we limit the scope of external inputs to external knowledge and goal guidance) can efficiently adapt LLMs to the experimented domain and enhance their performance on both recommendation and response generation tasks.

Our analysis results (§ 3) reveal that despite their strong language abilities, LLMs exhibit notable limitations when directly applied to CRS tasks without external inputs in the Chinese movie domain. For example, lacking domain-specific knowledge (“Jimmy’s Award”) hinders the generation of pertinent responses, while the absence of explicit goals (“recommendation”) leads to unproductive conversational turns (Figure 1). Identifying and mitigating such constraints is crucial for developing effective LLM-based CRS Li et al. (2023).

Motivated by the empirical evidence that external inputs can significantly boost LLM performance on both CRS tasks, we propose a novel ChatCRS framework. It decomposes the overall CRS problem into sub-components handled by specialized agents for knowledge retrieval and goal planning, all managed by a core LLM-based conversational agent. This design enhances the framework’s flexibility, allowing it to work with different LLMs without additional fine-tuning while capturing the benefits of external inputs (Figure 2b). Our contributions can be summarised as follows:

  • We present the first comprehensive evaluation of LLMs on both CRS tasks, including response generation and recommendation, and underscore the challenges in LLM-based CRS.

  • We propose the ChatCRS framework as the first knowledge-grounded and goal-directed LLM-based CRS using LLMs as conversational agents.

  • Experimental findings validate the efficacy and efficiency of ChatCRS in both CRS tasks. Furthermore, our analysis elucidates how external inputs contribute to LLM-based CRS.

Refer to caption
Figure 2: a) Empirical analysis of LLMs in CRS tasks with DG, COT& Oracle; b) System design of ChatCRS framework using LLMs as a conversational agent to control the goal planning and knowledge retrieval agents.

2 Related Work

Attribute-based/Conversational approaches in CRS. Existing research in CRS has been categorized into two approaches Gao et al. (2021); Li et al. (2023): 1) attribute-based approaches, where the system and users exchange item attributes without conversation Zhang et al. (2018); Lei et al. (2020), and 2) conversational approaches, where the system interacts with users through natural language Li et al. (2018); Deng et al. (2023); Wang et al. (2023a).

LLM-based CRS. LLMs have shown promise in CRS applications as 1) zero-shot conversational recommenders with item-based Palma et al. (2023); Dai et al. (2023) or conversational inputs He et al. (2023); Sanner et al. (2023); Wang et al. (2023b); 2) AI agents controlling pre-trained CRS or LMs for CRS tasks Feng et al. (2023); Liu et al. (2023a); Huang et al. (2023); and 3) user simulators evaluating interactive CRS systems Wang et al. (2023c); Zhang and Balog (2020); Huang et al. (2024). However, there is a lack of prior work integrating external inputs to improve LLM-based CRS models.

Multi-agent and tool-augmented LLMs. LLMs, as conversational agents, can actively pursue specific goals through multi-agent task decomposition and tool augmentation Wang et al. (2023d). This involves delegating subtasks to specialized agents and invoking external tools like knowledge retrieval, enhancing LLMs’ reasoning abilities and knowledge coverage Yao et al. (2023); Wei et al. (2023); Yang et al. (2023); Jiang et al. (2023).

In our work, we focus on the conversational approach, jointly evaluating CRS on both recommendation and response generation tasks Wang et al. (2023a); Li et al. (2023); Deng et al. (2023). Unlike existing methods, ChatCRS uniquely combines goal planning and tool-augmented knowledge retrieval agents within a unified framework. This leverages LLMs’ innate language and reasoning capabilities without requiring extensive fine-tuning.

3 Preliminary: Empirical Analysis

We consider the CRS scenario where a system interacts with a user $u$. Each dialogue contains $T$ conversation turns with user and system utterances, denoted as $C=\{s_{j}^{system}, s_{j}^{u}\}_{j=1}^{T}$. The target function for CRS is expressed in two parts: given the dialogue history $C_{j}$ of the past $j$ turns, it generates 1) the recommendation of item $i$ and 2) the next system response $s_{j+1}^{system}$. In some methods, knowledge $K$ is given as an external input to facilitate both the recommendation and response generation tasks, while dialogue goals $G$ only facilitate the response generation task, since the goal is fixed to “recommendation” in the recommendation task. Given the user’s contextual history $C_{j}$, the system generates the recommendation result $i$ and the system response $s_{j+1}^{system}$ as in Eq. 1.

$y^{*}=\prod_{j=1}^{T} P_{\theta}\,(i, s_{j+1}^{system} \mid C_{j}, K, G)$ (1)

3.1 Empirical Analysis Approaches

Building on the advancements of LLMs over general LMs in language generation and reasoning, we explore their inherent response generation and recommendation capabilities, with and without external knowledge or goal guidance. Our analysis comprises three settings, as shown in Figure 2a:

  • Direct Generation (DG). LLMs directly generate system responses and recommendations without any external inputs (Figure LABEL:ICLa).

  • Chain-of-thought Generation (COT). LLMs internally reason their built-in knowledge and goal-planning scheme for both CRS tasks (Figure LABEL:ICLb).

  • Oracular Generation (Oracle). LLMs leverage gold-standard external knowledge and dialogue goals to enhance performance in both CRS tasks, providing an upper bound (Figure LABEL:ICLc).

Additionally, we conduct an ablation study of different knowledge types on both CRS tasks by analyzing 1) factual knowledge, referring to general facts about entities and expressed as a single triple (e.g., [Jiong–Star sign–Taurus]), and 2) item-based knowledge, related to recommended items and expressed as multiple triples (e.g., [Cecilia–Star in–<movie 1, movie 2, …, movie n>]). Our primary experimental approach utilizes in-context learning (ICL) on the DuRecDial dataset Liu et al. (2021). Figure LABEL:ICL provides an overview of the ICL prompts, with examples detailed in Appendix LABEL:Prompt and experiments detailed in § 5. For response generation, we evaluate content preservation ($bleu$-$n$, $F1$) and diversity ($dist$-$n$), together with knowledge and goal prediction accuracy. For recommendation, we evaluate top-K ranking accuracy ($NDCG@k$, $MRR@k$).
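To make the ranking metrics concrete, here is a minimal pure-Python sketch of NDCG@k and MRR@k for the single-gold-item case typical of CRS recommendation evaluation (function names are illustrative; a library implementation would normally be used):

```python
import math

def ndcg_at_k(ranked_items, gold_item, k):
    """NDCG@k with one gold item: the ideal DCG is 1, so the score is
    1/log2(rank+1) if the gold item appears in the top-k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == gold_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mrr_at_k(ranked_items, gold_item, k):
    """MRR@k: reciprocal rank of the gold item within the top-k, else 0."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == gold_item:
            return 1.0 / rank
    return 0.0
```

Corpus-level scores (as reported in Tables 1 and 5) are averages of these per-turn values over all recommendation turns.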

LLM Task NDCG@10/50 MRR@10/50
ChatGPT DG 0.024/0.035 0.018/0.020
COT-K 0.046/0.063 0.040/0.043
Oracle-K 0.617/0.624 0.613/0.614
LLaMA-7b DG 0.013/0.020 0.010/0.010
COT-K 0.021/0.029 0.018/0.020
Oracle-K 0.386/0.422 0.366/0.370
LLaMA-13b DG 0.027/0.031 0.024/0.024
COT-K 0.037/0.040 0.035/0.036
Oracle-K 0.724/0.734 0.698/0.699
Table 1: Empirical analysis for the recommendation task on the DuRecDial dataset ($K$: knowledge; Red: best result).
LLM Approach K/G bleu1 bleu2 bleu dist1 dist2 F1 Acc_{G/K}
ChatGPT DG 0.448 0.322 0.161 0.330 0.814 0.522 -
COT G 0.397 0.294 0.155 0.294 0.779 0.499 0.587
K 0.467 0.323 0.156 0.396 0.836 0.474 0.095
Oracle G 0.429 0.319 0.172 0.315 0.796 0.519 -
K 0.497 0.389 0.258 0.411 0.843 0.488 -
BOTH 0.428 0.341 0.226 0.307 0.784 0.525 -
LLaMA-7b DG 0.417 0.296 0.145 0.389 0.813 0.495 -
COT G 0.418 0.293 0.142 0.417 0.827 0.484 0.215
K 0.333 0.238 0.112 0.320 0.762 0.455 0.026
Oracle G 0.450 0.322 0.164 0.431 0.834 0.504 -
K 0.359 0.270 0.154 0.328 0.762 0.473 -
BOTH 0.425 0.320 0.187 0.412 0.807 0.492 -
LLaMA-13b DG 0.418 0.303 0.153 0.312 0.786 0.507 -
COT G 0.463 0.332 0.172 0.348 0.816 0.528 0.402
K 0.358 0.260 0.129 0.276 0.755 0.473 0.023
Oracle G 0.494 0.361 0.197 0.373 0.825 0.543 -
K 0.379 0.296 0.188 0.278 0.754 0.495 -
BOTH 0.460 0.357 0.229 0.350 0.803 0.539 -
Table 2: Empirical analysis for the response generation task on the DuRecDial dataset ($K/G$: knowledge or goal; $Acc_{G/K}$: accuracy of knowledge or goal predictions; Red: best result for each model; Underline: best results overall).

3.2 Empirical Analysis Findings

We summarize our three main findings given the results of the response generation and recommendation tasks shown in Tables 1 and 2.

Finding 1: The Necessity of External Inputs in LLM-based CRS. Integrating external inputs significantly enhances performance across all LLM-based CRS tasks (Oracle), underscoring the insufficiency of LLMs alone as effective CRS tools and highlighting the indispensable role of external inputs. Remarkably, the Oracle approach yields over a tenfold improvement in recommendation tasks with only external knowledge compared to DG and COT methods, as the dialogue goal is fixed as “recommendation” (Table 1). Although utilizing internal knowledge and goal guidance (COT) marginally benefits both tasks, we see in Table 2 for the response generation task that the low accuracy of internal predictions adversely affects performance.

Finding 2: Improved Internal Knowledge or Goal Planning Capability in Advanced LLMs. Table 2 reveals that the performance of Chain-of-Thought (COT) by a larger LLM (LLaMA-13b) is comparable to oracular performance of a smaller LLM (LLaMA-7b). This suggests that the intrinsic knowledge and goal-setting capabilities of more sophisticated LLMs can match or exceed the benefits derived from external inputs used by their less advanced counterparts. Nonetheless, such internal knowledge or goal planning schemes are still insufficient for CRS in domain-specific tasks while the integration of more accurate knowledge and goal guidance (Oracle) continues to enhance performance to state-of-the-art (SOTA) outcomes.

Finding 3: Both factual and item-based knowledge jointly improve LLM performance on domain-specific CRS tasks. As shown in Table 3, integrating both factual and item-based knowledge yields performance gains for LLMs on both response generation and recommendation tasks. Our analysis suggests that even though a certain type of knowledge may not directly benefit a CRS task (e.g., factual knowledge may not contain the target items for the recommendation task), it can still benefit LLMs by associating unknown entities with their internal knowledge, thereby adapting the universally pre-trained LLMs to task-specific domains more effectively. Consequently, we leverage both types of knowledge jointly in our ChatCRS framework.

Response Generation Task
Knowledge bleu1/2/F1 dist1/2
Both Types 0.497/0.389/0.488 0.411/0.843
-w/o Factual* 0.407/0.296/0.456 0.273/0.719
-w/o Item-based* 0.427/0.331/0.487 0.277/0.733
Recommendation Task
Knowledge NDCG@10/50 MRR@10/50
Both Types 0.617/0.624 0.613/0.614
-w/o Factual* 0.272/0.290 0.264/0.267
-w/o Item-based* 0.376/0.389 0.371/0.373
Table 3: Ablation study for ChatGPT with different knowledge types in DuRecDial dataset.

4 ChatCRS

Our ChatCRS modelling framework has three components: 1) a knowledge retrieval agent, 2) a goal planning agent and 3) an LLM-based conversational agent (Figure 2b). Given a complex CRS task, an LLM-based conversational agent first decomposes it into subtasks managed by knowledge retrieval or goal-planning agents. The retrieved knowledge or predicted goal from each agent is incorporated into the ICL prompt to instruct LLMs to generate CRS responses or recommendations.

4.1 Knowledge Retrieval agent

Our analysis reveals that integrating both factual and item-based knowledge can significantly boost the performance of LLM-based CRS. However, knowledge-enhanced approaches for LLM-based CRS present unique challenges that have been relatively unexplored compared to prior training-based methods in CRS or retrieval-augmented (RA) methods in NLP Zhang (2023); Di Palma (2023).

Training-based methods, which train LMs to memorize or interpret knowledge representations through techniques like graph propagation, have been widely adopted in prior CRS research Wei et al. (2021); Zhang et al. (2023). However, such approaches are computationally infeasible for LLMs due to their input length constraints and training costs. RA methods, which first collect evidence and then generate responses, face two key limitations in CRS Manzoor and Jannach (2021); Gao et al. (2023). First, without a clear query formulation in CRS, RA methods can only approximate results rather than retrieve the exact relevant knowledge Zhao et al. (2024); Barnett et al. (2024). Especially when multiple similar entries exist in the knowledge base (KB), precisely locating the accurate knowledge for CRS becomes challenging. Second, RA methods retrieve knowledge relevant only to the current dialogue turn, whereas CRS requires planning for potential knowledge needs in future turns, differing from knowledge-based QA systems Mao et al. (2020); Jiang et al. (2023). For instance, when discussing a celebrity without a clear query (e.g., “I love Cecilia…”), the system should anticipate retrieving relevant factual knowledge (e.g., “birth date” or “star sign”) or item-based knowledge (e.g., “acting movies”) for subsequent response generation or recommendations, based on the user’s likely interests.

To address this challenge, we employ a relation-based method that allows LLMs to flexibly plan and quickly retrieve relevant “entity–relation–entity” knowledge triples $K$ by traversing along the relations $R$ of mentioned entities $E$ Moon et al. (2019); Jiang et al. (2023). First, the entities for each utterance are directly provided by extracting entities in the knowledge bases from the dialogue utterance Zou et al. (2022). Relations adjacent to entity $E$ in the KB are then extracted as candidate relations (denoted as $F1$), and the LLM is instructed to plan the knowledge retrieval by selecting the most pertinent relation $R^{*}$ given the dialogue history $C_{j}$. Knowledge triples $K^{*}$ can finally be acquired using entity $E$ and the predicted relation $R^{*}$ (denoted as $F2$). The process is formulated in Figure 3 and demonstrated with an example in Figure LABEL:KG-b. Given the dialogue utterance “I love Cecilia…” and the extracted entity [Cecilia], the system first extracts all potential relations for [Cecilia], from which the LLM selects the most relevant relation, [Star in]. The knowledge retrieval agent then fetches the complete knowledge triple [Cecilia–Star in–<movie 1, movie 2, …, movie n>].

Refer to caption
Figure 3: Knowledge retrieval agent in ChatCRS.
Model N-shot DuRecDial TG-Redial
bleu1 bleu2 dist2 F1 bleu1 bleu2 dist2 F1
MGCG Full 0.362 0.252 0.081 0.420 NA NA NA NA
MGCG-G Full 0.382 0.274 0.214 0.435 NA NA NA NA
TPNet Full 0.308 0.217 0.093 0.363 NA NA NA NA
UniMIND* Full 0.418 0.328 0.086 0.484 0.291 0.070 0.200 0.328
ChatGPT 3 0.448 0.322 0.814 0.522 0.262 0.126 0.987 0.266
LLaMA 3 0.418 0.303 0.786 0.507 0.205 0.096 0.970 0.247
ChatCRS 3 0.460 0.358 0.803 0.540 0.300 0.180 0.987 0.317
Table 4: Results of response generation task on DuRecDial and TG-Redial datasets. (UniMIND*: Results from the ablation study in the original UniMIND paper.)
Model N-shot DuRecDial TG-Redial
NDCG@10/50 MRR@10/50 NDCG@10/50 MRR@10/50
SASRec Full 0.369 / 0.413 0.307 / 0.317 0.009 / 0.018 0.005 / 0.007
UniMIND Full 0.599 / 0.610 0.592 / 0.594 0.031 / 0.050 0.024 / 0.028
ChatGPT 3 0.024 / 0.035 0.018 / 0.020 0.001 / 0.003 0.005 / 0.005
LLaMA 3 0.027 / 0.031 0.024 / 0.024 0.001 / 0.006 0.003 / 0.005
ChatCRS 3 0.549 / 0.553 0.543 / 0.543 0.031 / 0.033 0.082 / 0.083
Table 5: Results of recommendation task on DuRecDial and TG-Redial datasets.

When there are multiple entities in one utterance, we perform knowledge retrieval for each entity in turn. When an entity has multiple item-based knowledge triples, we randomly select at most 50 of them due to input token length limitations. We implement N-shot ICL to guide the LLM in choosing knowledge relations; the detailed ICL prompt and instructions with examples are shown in Table LABEL:table:_KR (§ LABEL:AK).
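The retrieval procedure above can be sketched as follows, assuming a toy dictionary-based KB and a stand-in `select_relation` function in place of the N-shot ICL call (the KB format and all names are illustrative, not the paper's actual implementation):

```python
import random

# Toy KB: entity -> relation -> list of tail entities (illustrative format).
KB = {
    "Cecilia": {
        "Star in": ["movie 1", "movie 2", "movie 3"],
        "Birth date": ["1980-05-24"],
    }
}

def candidate_relations(entity):
    """F1: extract the relations adjacent to an entity in the KB."""
    return list(KB.get(entity, {}))

def fetch_triples(entity, relation, cap=50):
    """F2: fetch triples for (entity, relation), sampling at most
    `cap` item-based tails to respect the input token limit."""
    tails = KB.get(entity, {}).get(relation, [])
    if len(tails) > cap:
        tails = random.sample(tails, cap)
    return [(entity, relation, t) for t in tails]

def retrieve(entities, select_relation):
    """For each mentioned entity, let the LLM pick the most pertinent
    relation R* from the candidates, then fetch the triples K*."""
    triples = []
    for e in entities:
        rels = candidate_relations(e)
        if not rels:
            continue
        r_star = select_relation(e, rels)  # stand-in for the N-shot ICL call
        triples += fetch_triples(e, r_star)
    return triples
```

In the running example, `retrieve(["Cecilia"], ...)` with an LLM that selects [Star in] yields the triples [Cecilia–Star in–movie 1/2/3].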

4.2 Goal Planning agent

Accurately predicting the dialogue goals is crucial for 1) proactive response generation and 2) balancing recommendations versus conversations in CRS. Utilizing the goal annotations for each dialogue utterance from CRS datasets, we leverage an existing language model, adapting it for goal generation by incorporating a Low-Rank Adapter (LoRA) approach Hu et al. (2021); Dettmers et al. (2023). This method enables parameter-efficient fine-tuning by adjusting only the rank-decomposition matrices. For each dialogue history $C^{k}_{j}$ (the $j$-th turn in dialogue $k$; $j \in T$, $k \in N$), the LoRA model is trained to generate the dialogue goal $G^{*}$ for the next utterance using the prompt of the dialogue history, optimizing the loss function in Eq. 2, with $\theta$ representing the trainable parameters of LoRA. The detailed prompt and instructions are shown in Table LABEL:table:_GP (§ LABEL:AG).

$L_{g}=-\sum\nolimits^{N}_{k}\sum\nolimits^{T}_{j}\log P_{\theta}\,(G^{*} \mid C^{k}_{j})$ (2)
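To make Eq. 2 concrete, the following toy sketch computes the same negative log-likelihood over dialogues and turns, with a hypothetical `logit_fn` standing in for the LoRA-tuned model's scoring of a fixed set of candidate goals (the actual model generates the goal string directly; the goal labels here are illustrative):

```python
import math

GOALS = ["chitchat", "question answering", "movie recommendation"]

def goal_nll(dialogues, logit_fn):
    """Eq. 2 as code: sum over dialogues k and turns j of the negative
    log-probability of the gold next-turn goal. Each dialogue is a list
    of (dialogue_history, gold_goal) pairs; logit_fn maps a history to
    one logit per candidate goal, normalized here with a softmax."""
    loss = 0.0
    for turns in dialogues:            # dialogue k
        for context, gold_goal in turns:   # turn j
            logits = logit_fn(context)
            z = sum(math.exp(l) for l in logits)
            p_gold = math.exp(logits[GOALS.index(gold_goal)]) / z
            loss -= math.log(p_gold)
    return loss
```

A uniform scorer assigns each of the three goals probability 1/3, so a dialogue with two annotated turns incurs a loss of 2·log 3.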

4.3 LLM-based Conversational Agent

In ChatCRS, the knowledge retrieval and goal-planning agents serve as essential tools for CRS tasks, while the LLM functions as a tool-augmented conversational agent that utilizes these tools to accomplish the primary CRS objectives. Upon receiving a new dialogue history $C_{j}$, the LLM-based conversational agent employs these tools to determine the dialogue goal $G^{*}$ and relevant knowledge $K^{*}$, which then instruct the generation of either a system response $s_{j+1}^{system}$ or an item recommendation $i$ through a prompting scheme, as formulated in Eq. 3. The detailed ICL prompt can be found in § LABEL:Prompt.

$i,\ s_{j+1}^{system} = LLM(C_{j}, K^{*}, G^{*})$ (3)
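Operationally, Eq. 3 amounts to assembling a prompt from $C_j$, $K^{*}$ and $G^{*}$ and calling the LLM. A hypothetical sketch of the prompt assembly (the template wording is illustrative; the paper's actual ICL prompt is in its appendix):

```python
def build_prompt(dialogue_history, knowledge_triples, goal):
    """Assemble the instruction prompt that conditions the LLM on the
    predicted goal G* and retrieved knowledge K* (hypothetical template)."""
    kb_lines = "\n".join(f"- {h} | {r} | {t}" for h, r, t in knowledge_triples)
    history = "\n".join(dialogue_history)
    return (
        f"Dialogue goal: {goal}\n"
        f"Relevant knowledge:\n{kb_lines}\n"
        f"Dialogue history:\n{history}\n"
        "Generate the next system response or recommendation:"
    )
```

In few-shot use, N in-context examples in the same format would be prepended before the test instance.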

5 Experiments

5.1 Experimental Setups

Datasets. We conduct experiments on two multi-goal Chinese CRS benchmark datasets: a) DuRecDial Liu et al. (2021), in English and Chinese, and b) TG-ReDial Zhou et al. (2020), in Chinese (statistics in Table LABEL:data). Both datasets are annotated for goal guidance, while only DuRecDial contains knowledge annotations; an external KB, CNpedia Zhou et al. (2022), is used for TG-ReDial.
Baselines. We compare our model with ChatGPT (OpenAI API: gpt-3.5-turbo-1106) and LLaMA-7b/13b Touvron et al. (2023) in few-shot settings. We also compare with the fully-trained UniMIND Deng et al. (2023), MGCG-G Liu et al. (2023b), TPNet Wang et al. (2023a), MGCG Liu et al. (2020) and SASRec Kang and McAuley (2018), which are previous SOTA CRS and RS models; we summarise each baseline in § LABEL:A1.
Automatic Evaluation. For response generation evaluation, we adopt $BLEU$ and $F1$ for content preservation and $Dist$ for language diversity. For recommendation evaluation, we adopt $NDCG@k$ and $MRR@k$ to evaluate top-K ranking accuracy. For the knowledge retrieval agent, we adopt Accuracy ($Acc$), Precision ($P$), Recall ($R$) and $F1$ to evaluate the accuracy of relation selection (§ LABEL:AK).
Human Evaluation. For human evaluation, we randomly sample 100 dialogues from DuRecDial, comparing the responses produced by UniMIND, ChatGPT, LLaMA-13b and ChatCRS. Three annotators are asked to score each generated response with {0: poor, 1: ok, 2: good} in terms of a) general language quality in (Flu)ency and (Coh)erence, and b) CRS-specific language qualities of (Info)rmativeness and (Pro)activity. Details of the process and criterion are discussed in § LABEL:A2.
Implementation Details. For both CRS tasks in the empirical analysis, we adopt N-shot ICL prompt settings on ChatGPT and LLaMA* Dong et al. (2022), where $N$ examples from the training data are added to the ICL prompt. In the modelling framework, for the goal planning agent, we adopt QLoRA as a parameter-efficient way to fine-tune LLaMA-7b Dettmers et al. (2023). For the knowledge retrieval agent and the LLM-based conversational agent, we adopt the same N-shot ICL approach on ChatGPT and LLaMA* Jiang et al. (2023). Detailed experimental setups are discussed in § LABEL:A1.

Model General CRS-specific
Flu Coh Info Pro Avg.
UniMIND 1.87 1.69 1.49 1.32 1.60
ChatGPT 1.98 1.80 1.50 1.30 1.65
LLaMA-13b 1.94 1.68 1.21 1.33 1.49
ChatCRS 1.99 1.85 1.76 1.69 1.82
-w/o K* 2.00 1.87 1.49↓ 1.62 1.75
-w/o G* 1.99 1.85 1.72 1.55↓ 1.78
Table 6: Human evaluation and ChatCRS ablations for the language qualities of (Flu)ency, (Coh)erence, (Info)rmativeness and (Pro)activity on DuRecDial ($K^{*}/G^{*}$: knowledge retrieval or goal-planning agent).
Model Knowledge
N-shot Acc P R F1
TPNet Full NA NA NA 0.402
MGCG-G Full NA 0.460 0.478 0.450
ChatGPT 3 0.095 0.031 0.139 0.015
LLaMA-13b 3 0.023 0.001 0.001 0.001
ChatCRS 3 0.560 0.583 0.594 0.553
Table 7: Results for knowledge retrieval on DuRecDial.

5.2 Experimental Results

ChatCRS significantly improves LLM-based conversational systems for CRS tasks, outperforming SOTA baselines in response generation on both datasets and enhancing content preservation and language diversity (Table 4). ChatCRS sets new SOTA benchmarks on both datasets using 3-shot ICL prompts incorporating external inputs. In recommendation tasks (Table 5), LLM-based approaches lag behind fully-trained baselines due to insufficient in-domain knowledge. Remarkably, by harnessing external knowledge, ChatCRS achieves a tenfold increase in recommendation accuracy over existing LLM baselines on both datasets with ICL, without full-data fine-tuning.

Human evaluation highlights ChatCRS’s enhancement in CRS-specific language quality. Table 6 shows the human evaluation and ablation results. ChatCRS outperforms baseline models in both general and CRS-specific language qualities. While all LLM-based approaches uniformly exceed the general LM baseline (UniMIND) in general language quality, ChatCRS notably enhances coherence through its goal guidance feature, enabling response generation more aligned with the dialogue goal. Significant enhancements in CRS-specific language quality, particularly in informativeness and proactivity, underscore the value of integrating external knowledge and goals. Ablation studies, removing either knowledge retrieval or goal planning agent, demonstrate a decline in scores for informativeness and proactivity respectively, confirming the efficacy of both external inputs for CRS-specific language quality.

Refer to caption
Figure 4: Knowledge ratio for each goal type on DuRecDial. (X-axis: knowledge ratio; Y-axis: goal type)

5.3 Detailed Discussion

CRS datasets typically contain a huge volume of knowledge. Analyzing dialogues from the DuRecDial dataset, categorized by goal type, we calculate a “Knowledge Ratio” by dividing the number of utterances with annotated knowledge $N_{K,G}$ by the total number of utterances $N_{G}$ of each goal type (Eq. 4), to measure how necessary relevant knowledge is for CRS task completion. Our analysis, depicted in Figure 4, shows that recommendation tasks rank highly in terms of knowledge necessity, with “POI recommendation” dialogues requiring pertinent knowledge in 75% of cases.

$\text{Knowledge Ratio (KR)}_{G}=\frac{N_{K,G}}{N_{G}}$ (4)
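Eq. 4 can be computed directly from the dataset annotations. A minimal sketch, assuming each utterance is reduced to a (goal type, has-knowledge-annotation) pair:

```python
from collections import Counter

def knowledge_ratios(utterances):
    """Eq. 4: for each goal type G, the fraction of its utterances that
    carry knowledge annotations (N_{K,G} / N_G). `utterances` is a list
    of (goal_type, has_knowledge) pairs."""
    totals, with_k = Counter(), Counter()
    for goal, has_knowledge in utterances:
        totals[goal] += 1
        if has_knowledge:
            with_k[goal] += 1
    return {g: with_k[g] / totals[g] for g in totals}
```

For example, a goal type where three of four utterances carry knowledge annotations has a knowledge ratio of 0.75, matching the “POI recommendation” figure cited above.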
Case Study 1: Factual Knowledge QA Case Study 2: Proactive Response Generation
Dialogue history (Dial_ID 39-1):
[user]: Do you know Jiong He’s zodiac sign ?
Dialogue history (Dial_ID 89-14):
[user]: I like Cecilia Cheung very much. Her acting is very good.
. . .
[system]: Yeah, have you seen Cecilia Cheung’s One Night in Mongkok?
[user]: I’ve seen it. I don’t want to see it again.
System Responses:
[✓] UniMIND: Of course, Taurus
[✗] ChatGPT: Jiong He’s zodiac sign is Cancer.
[✗] LLaMA: Jiong He’s zodiac sign is Aries.
[✓] ChatCRS: Yes, I do! Jiong He’s zodiac sign is Taurus.
[Retrieved Knowledge:
(‘Jiong He’, ‘zodiac sign ’, ‘Taurus’)]
System Responses:
[✓] UniMIND: How about <Left Right…Destiny>, a very touching movie.
[✓] ChatGPT: How about trying her other film, <Lost in Time>, a touching romance that showcases her acting skills?
[✗] LLaMA: Maybe you can try it again, you may find a different feeling.
[✓] ChatCRS: If you haven’t seen it, you should watch <Left Right…Destiny>, it’s Cecilia Cheung’s new movie this year. She played the role of a taxi driver, very cool! Have you seen it?
[Predicted Goal: Movie Recommendation]
Table 8: Case study for ChatCRS with baselines.

The internal knowledge of LLMs is insufficient for handling domain-specific CRS tasks, which is reaffirmed by the analysis of knowledge retrieval accuracy in Table 7. In contrast to traditional RS, which relies on user data for collaborative recommendations, CRS mainly depends on context/content-based recommendation He et al. (2023). This shift highlights the limitations of LLMs in harnessing internal knowledge. ChatCRS overcomes these limitations by interfacing LLMs to plan and reason over external KBs through entities and relations. It therefore largely improves recommendation accuracy, outperforming the training-based approach that uses full data. Given the limitations in LLM-based CRS tasks Zhang (2023); Di Palma (2023), we anticipate future studies to further explore such approaches in CRS.

Factual knowledge guides the response generation process, mitigating the risks of generating implausible or inconsistent responses. The “Asking questions” goal type which has the highest knowledge ratio, demonstrates the advantage of leveraging external knowledge in answering factual questions like “the zodiac sign of an Asian celebrity” (Table 8). Standard LLMs produce responses with fabricated content, but ChatCRS accurately retrieves and integrates external knowledge, ensuring factual and informative responses.

Goal guidance contributes more to the linguistic quality of CRS by managing the dialogue flow. We examine the goal planning proficiency of ChatCRS by showcasing the goal prediction results for the top 5 goal types in each dataset (Figure LABEL:GOALS). The DuRecDial dataset shows a better balance between recommendation and non-recommendation goals, which aligns well with real-world scenarios Hayati et al. (2020). However, the TG-ReDial dataset contains more recommendation-related goals and multi-goal utterances, making goal prediction more challenging. The detailed goal planning accuracy is discussed in § LABEL:A-goal.

Dialogue goals guide LLMs towards a proactive conversational recommender. For a clearer understanding, we present a scenario in Table 8 where a CRS seamlessly transitions between “asking questions” and “movie recommendation”, illustrating how accurate goal direction boosts interaction relevance and efficacy. Specifically, if a recommendation does not succeed, ChatCRS will adeptly pose further questions to refine subsequent recommendation responses while LLMs may keep outputting wrong recommendations, creating unproductive dialogue turns. This further emphasizes the challenges of conversational approaches in CRS, where the system needs to proactively lead the dialogue from non-recommendation goals to approach the users’ interests for certain items or responses Liu et al. (2023b), and underscores the goal guidance in fostering proactive engagement in CRS.

6 Conclusion

This paper conducts an empirical investigation into the LLM-based CRS for domain-specific applications in the Chinese movie domain, emphasizing the insufficiency of LLMs in domain-specific CRS tasks and the necessity of integrating external knowledge and goal guidance. We introduce ChatCRS, a novel framework that employs a unified agent-based approach to more effectively incorporate these external inputs. Our experimental findings highlight improvements over existing benchmarks, corroborated by both automatic and human evaluation. ChatCRS marks a pivotal advancement in CRS research, fostering a paradigm where complex problems are decomposed into subtasks managed by agents, which maximizes the inherent capabilities of LLMs and their domain-specific adaptability in CRS applications.

Limitations

This research explores the application of few-shot learning and parameter-efficient techniques with large language models (LLMs) for generating responses and making recommendations, circumventing the extensive fine-tuning these models usually require. Due to budget and computational constraints, our study is limited to in-context learning with economically viable closed-source LLMs such as ChatGPT, and smaller-scale open-source models such as LLaMA-7b and -13b.

A significant challenge encountered in this study is the scarcity of datasets with adequate turn-level annotations of knowledge and goal guidance. This limitation hampers the development of conversational models capable of effectively understanding and navigating dialogue. We anticipate that future datasets will address this shortfall with detailed annotations, thereby greatly improving conversational models’ ability to comprehend and steer conversations.

Ethical Concerns

The ethical considerations for our study involving human evaluation (§ 5.1) have been addressed through the attainment of an IRB Exemption for the evaluation components involving human subjects. The datasets utilized in our research are accessible to the public Liu et al. (2021); Zhou et al. (2020), and the methodology employed for annotation adheres to a double-blind procedure (§ 5.1). Additionally, annotators receive compensation at a rate of $15 per hour, which is reflective of the actual hours worked.

References

  • Barnett et al. (2024) Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. 2024. Seven failure points when engineering a retrieval augmented generation system.
  • Dai et al. (2023) Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering chatgpt’s capabilities in recommender systems. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, page 1126–1132, New York, NY, USA. Association for Computing Machinery.
  • Deng et al. (2023) Yang Deng, Wenxuan Zhang, Weiwen Xu, Wenqiang Lei, Tat-Seng Chua, and Wai Lam. 2023. A unified multi-task learning framework for multi-goal conversational recommender systems. ACM Trans. Inf. Syst., 41(3).
  • Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Di Palma (2023) Dario Di Palma. 2023. Retrieval-augmented recommender system: Enhancing recommender systems with large language models. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, page 1369–1373, New York, NY, USA. Association for Computing Machinery.
  • Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.
  • Feng et al. (2023) Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai, and Fei Sun. 2023. A large language model enhanced conversational recommender system.
  • Gao et al. (2021) Chongming Gao, Wenqiang Lei, Xiangnan He, Maarten de Rijke, and Tat-Seng Chua. 2021. Advances and challenges in conversational recommender systems: A survey. AI Open, 2:100–126.
  • Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
  • Hayati et al. (2020) Shirley Anugrah Hayati, Dongyeop Kang, Qingxiaoyang Zhu, Weiyan Shi, and Zhou Yu. 2020. Inspired: Toward sociable recommendation dialog systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8142–8152, Online. Association for Computational Linguistics.
  • He et al. (2023) Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, and Julian McAuley. 2023. Large language models as zero-shot conversational recommenders. arXiv preprint arXiv:2308.10053.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models.
  • Huang et al. (2024) Chen Huang, Peixin Qin, Yang Deng, Wenqiang Lei, Jiancheng Lv, and Tat-Seng Chua. 2024. Concept–an evaluation protocol on conversation recommender systems with system-and user-centric factors. arXiv preprint arXiv:2404.03304.
  • Huang et al. (2023) Xu Huang, Jianxun Lian, Yuxuan Lei, Jing Yao, Defu Lian, and Xing Xie. 2023. Recommender ai agent: Integrating large language models for interactive recommendations. arXiv preprint arXiv:2308.16505.
  • Jannach et al. (2021) Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. A survey on conversational recommender systems. ACM Comput. Surv., 54(5).
  • Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251, Singapore. Association for Computational Linguistics.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation.
  • Lei et al. (2020) Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. 2020. Estimation-action-reflection: Towards deep interaction between conversational and recommender systems. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, page 304–312, New York, NY, USA. Association for Computing Machinery.
  • Li et al. (2023) Chuang Li, Hengchang Hu, Yan Zhang, Min-Yen Kan, and Haizhou Li. 2023. A conversation is worth a thousand recommendations: A survey of holistic conversational recommender systems. In KaRS Workshop at ACM RecSys ’23, Singapore.
  • Li et al. (2018) Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards deep conversational recommendations. In Advances in Neural Information Processing Systems 31 (NIPS 2018).
  • Liu et al. (2023a) Yuanxing Liu, Weinan Zhang, Yifan Chen, Yuchi Zhang, Haopeng Bai, Fan Feng, Hengbin Cui, Yongbin Li, and Wanxiang Che. 2023a. Conversational recommender system and large language model are made for each other in E-commerce pre-sales dialogue. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9587–9605, Singapore. Association for Computational Linguistics.
  • Liu et al. (2021) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, and Wanxiang Che. 2021. DuRecDial 2.0: A bilingual parallel corpus for conversational recommendation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4335–4347, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Liu et al. (2020) Zeming Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020. Towards conversational recommendation over multi-type dialogs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1036–1049. Association for Computational Linguistics.
  • Liu et al. (2023b) Zeming Liu, Ding Zhou, Hao Liu, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, Ting Liu, and Hui Xiong. 2023b. Graph-grounded goal planning for conversational recommendation. IEEE Transactions on Knowledge and Data Engineering, 35(5):4923–4939.
  • Manzoor and Jannach (2021) Ahtsham Manzoor and Dietmar Jannach. 2021. Generation-based vs retrieval-based conversational recommendation: A user-centric comparison. In Proceedings of the 15th ACM Conference on Recommender Systems, RecSys ’21, page 515–520, New York, NY, USA. Association for Computing Machinery.
  • Mao et al. (2020) Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. 2020. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553.
  • Moon et al. (2019) Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba. 2019. OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 845–854, Florence, Italy. Association for Computational Linguistics.
  • Palma et al. (2023) Dario Di Palma, Giovanni Maria Biancofiore, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia, and Eugenio Di Sciascio. 2023. Evaluating chatgpt as a recommender system: A rigorous approach.
  • Sanner et al. (2023) Scott Sanner, Krisztian Balog, Filip Radlinski, Ben Wedin, and Lucas Dixon. 2023. Large language models are competitive near cold-start recommenders for language- and item-based preferences. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, page 890–896, New York, NY, USA. Association for Computing Machinery.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
  • Wang et al. (2023a) Jian Wang, Dongding Lin, and Wenjie Li. 2023a. A target-driven planning approach for goal-directed dialog systems. IEEE Transactions on Neural Networks and Learning Systems.
  • Wang et al. (2021) Lingzhi Wang, Huang Hu, Lei Sha, Can Xu, Kam-Fai Wong, and Daxin Jiang. 2021. Finetuning large-scale pre-trained language models for conversational recommendation with knowledge graph. CoRR, abs/2110.07477.
  • Wang et al. (2023b) Xiaolei Wang, Xinyu Tang, Xin Zhao, Jingyuan Wang, and Ji-Rong Wen. 2023b. Rethinking the evaluation for conversational recommendation in the era of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 10052–10065, Singapore. Association for Computational Linguistics.
  • Wang et al. (2023d) Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. 2023d. Recmind: Large language model powered agent for recommendation.
  • Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.
  • Wei et al. (2021) Xiaokai Wei, Shen Wang, Dejiao Zhang, Parminder Bhatia, and Andrew O. Arnold. 2021. Knowledge enhanced pretrained language models: A comprehensive survey. CoRR, abs/2110.08455.
  • Yang et al. (2023) Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023. Gpt4tools: Teaching large language model to use tools via self-instruction.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models.
  • Zhang (2023) Gangyi Zhang. 2023. User-centric conversational recommendation: Adapting the need of user with large language models. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, page 1349–1354, New York, NY, USA. Association for Computing Machinery.
  • Zhang and Balog (2020) Shuo Zhang and Krisztian Balog. 2020. Evaluating conversational recommender systems via user simulation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 1512–1520, New York, NY, USA. Association for Computing Machinery.
  • Zhang et al. (2023) Xiaoyu Zhang, Xin Xin, Dongdong Li, Wenxuan Liu, Pengjie Ren, Zhumin Chen, Jun Ma, and Zhaochun Ren. 2023. Variational reasoning over incomplete knowledge graphs for conversational recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 231–239.
  • Zhang et al. (2018) Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W Bruce Croft. 2018. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th acm international conference on information and knowledge management, pages 177–186.
  • Zhao et al. (2024) Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. 2024. Retrieval-augmented generation for ai-generated content: A survey.
  • Zhou et al. (2020) Kun Zhou, Yuanhang Zhou, Wayne Xin Zhao, Xiaoke Wang, and Ji-Rong Wen. 2020. Towards topic-guided conversational recommender system.
  • Zhou et al. (2022) Yuanhang Zhou, Kun Zhou, Wayne Xin Zhao, Cheng Wang, Peng Jiang, and He Hu. 2022. C2-crs: Coarse-to-fine contrastive learning for conversational recommender system. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pages 1488–1496.
  • Zou et al. (2022) Jie Zou, Evangelos Kanoulas, Pengjie Ren, Zhaochun Ren, Aixin Sun, and Cheng Long. 2022. Improving conversational recommender systems via transformer-based sequential modelling. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2319–2324. ACM.