Measuring “Why” in Recommender Systems: a Comprehensive Survey on the Evaluation of Explainable Recommendation
Abstract
Explainable recommendation has shown great advantages in improving recommendation persuasiveness, user satisfaction, and system transparency, among others.
A fundamental problem of explainable recommendation is how to evaluate the explanations.
In the past few years, various evaluation strategies have been proposed.
However, they are scattered across different papers, and a systematic and detailed comparison between them is lacking.
To bridge this gap, in this paper, we comprehensively review the previous work, and provide different taxonomies according to the evaluation perspectives and evaluation methods.
Beyond summarization, we also analyze the (dis)advantages of existing evaluation methods, and provide a series of guidelines on how to select them.
The contents of this survey are drawn from more than 100 papers from top-tier conferences like IJCAI, AAAI, TheWebConf, Recsys, UMAP and IUI, and the complete comparisons are presented at https://shimo.im/sheets/VKrpYTcwVH6KXgdy/MODOC/.
With this survey, we aim to provide a clear and comprehensive review of the evaluation of explainable recommendation.
1 Introduction
Artificial intelligence (AI) has deeply revolutionized people’s daily life and production, bringing effectiveness, efficiency, reliability, automation and more. Among different AI tasks, many are purely objective, where the ground truths are independent of subjective factors. Computer vision (CV) is a typical domain falling into this category. For example, in the task of image classification, the models are built to accurately predict labels like a cat or a dog. These labels are objective facts, which do not change with human feelings. In many other domains, the tasks are more subjective: there is no rigid ground truth, and the target is to improve the utilities of certain stakeholders. Recommendation is a typical subjective AI task, where the stakeholder utilities not only include recommendation accuracy, but also contain many beyond-accuracy aspects, among which explainability is a significant and widely studied one.
Basically, explainable recommendation solves the problem of “why”, that is, why the items are recommended. By providing explanations, people can make more informed decisions, and the recommendation persuasiveness and algorithm transparency can be improved. In the field of explainable recommendation, evaluation is a fundamental problem, which assesses model quality and helps with model selection. Compared with other evaluation tasks, evaluating recommendation explanations is much more difficult because the ground truth is hard to obtain and human feelings are not easy to approximate. To alleviate these difficulties, the past few decades have witnessed a lot of promising evaluation strategies. However, these strategies are independently proposed in different papers, and have various prerequisites and implementation characteristics. A systematic comparison between them is lacking, which may hinder evaluation standardization and is not friendly to newcomers to this domain.
To bridge the above gap, in this paper, we thoroughly review the previous work, and present a clear summarization of existing evaluation strategies. Specifically, recommendation explanations may serve different targets, for example, the users, providers and model designers. For different targets, the evaluation perspectives also vary, and we summarize four major evaluation perspectives from the previous work, that is, explanation effectiveness, transparency, persuasiveness and scrutability. For evaluating the explanations from the above perspectives, there exist a lot of evaluation methods. We group these methods into four categories: case studies, quantitative metrics, crowdsourcing and online experiments. For each method, we detail its main characteristics, and analyze its advantages and shortcomings. At last, we propose a series of guidelines on how to select these evaluation methods in real-world problems. The main contributions of this paper are summarized as follows:
We provide a systematic and comprehensive survey on the evaluation of explainable recommendation, which, to the best of our knowledge, is the first survey on this topic.
We propose different taxonomies for existing papers according to the evaluation perspectives and evaluation methods. The references cover more than 100 papers from top-tier conferences like IJCAI, AAAI, TheWebConf, Recsys, UMAP and IUI, and their complete summarization is presented at https://shimo.im/sheets/VKrpYTcwVH6KXgdy/MODOC/.
We analyze the advantages and shortcomings of existing evaluation methods and propose several guidelines on how to choose them in different scenarios.
2 Related Survey
Recommender systems are becoming increasingly important, with dozens of papers published each year at top-tier conferences/journals. To better summarize and compare these works, people have conducted many promising survey studies. For general recommendation, early surveys mainly focus on shallow architectures like matrix factorization or heuristic methods Adomavicius (2004). Later on, with the prosperity of deep neural networks, people have designed a large number of neural recommender models. Correspondingly, many comprehensive surveys on deep recommender algorithms have been proposed Zhang (2019); Wu (2021); Fang (2020). Beyond summarizing recommender models from the architecture perspective, many surveys focus on how to leverage side information. For example, Bouraga (2014) summarizes the models incorporating knowledge graphs into recommender systems. Chen (2013) reviews the algorithms for social recommendation. Kanwal (2021) focuses on the combination of text information and recommender models. Ji (2017) summarizes the models for image-based recommendation. In addition, people also conduct surveys according to different recommendation domains. For example, Song (2012) summarizes the algorithms for music recommendation. Goyani (2020) focuses more on movie recommender models. Wei (2007) is tailored to the algorithms in e-commerce. Yu (2015) is a comprehensive survey on point-of-interest (POI) recommendation. We have also noticed that there are some surveys on recommendation evaluation Gunawardana (2009); Silveira (2019). However, they focus more on metrics like accuracy, novelty, serendipity and coverage, which differs from our target of explanation evaluation. In the past, we published a survey on explainable recommender models Zhang (2018), but it aims to summarize this field in a general sense. In this paper, however, we focus on the evaluation part, and include many of the latest studies.
Table 1: Summary of the evaluation perspectives.
Evaluation perspective | Evaluation problem | Representative papers | Serving target
---|---|---|---
Effectiveness | Whether the explanations are useful for the users to make more accurate/faster decisions? | Hada (2021); Guesmi (2021); Wang (2020); Gao et al. (2019); Chen et al. (2013) | Users
Transparency | Whether the explanations can reveal the internal working principles of the recommender models? | Chen (2021a); Sonboli (2021); Li (2021c, 2020c); Fu (2020) | Model designers
Persuasiveness | Whether the explanations can increase the click/purchase rate of the users on the items? | Musto et al. (2019); Tsai and Brusilovsky (2019); Balog and Radlinski (2020); Tsukuda (2020); Chen and Wang (2014) | Providers
Scrutability | Whether the explanations can exactly correspond to the recommendation results? | He et al. (2015); Tsai and Brusilovsky (2019); Cheng et al. (2019); Balog et al. (2019); Liu (2020) | Model designers
3 Evaluation Perspectives
Recommendation explanations may serve different purposes, e.g., providing more recommendation details, so that people can make more informed decisions, or improving recommendation persuasiveness, so that the providers can obtain more profit. For different purposes, the explanations should be evaluated from different perspectives. In the literature, the following evaluation perspectives are usually considered.
3.1 Effectiveness
Effectiveness aims to evaluate “whether the explanations are useful for the users to make more accurate/faster decisions Balog and Radlinski (2020)?” or sometimes “whether the users are more satisfied with the explanations Xian (2021)?” It is targeted at improving the utilities of the users, that is, better serving the users with the explanations. By enhancing the effectiveness, the user experience can be improved, which may consequently increase the user conversion rate and stickiness. Here, we use the term “effectiveness” in a broad sense, which covers the concepts of efficiency, satisfaction and trust in much of the literature Balog and Radlinski (2020).
3.2 Transparency
Transparency aims to evaluate “whether the explanations can reveal the internal working principles of the recommender models Tai (2021); Sonboli (2021); Li (2021b)?” It encourages the algorithms to produce explanations which can open the “black box” of the recommender models. With better transparency, the model designers can know more about how the recommendations are generated, and better debug the system.
3.3 Persuasiveness
Persuasiveness aims to evaluate “whether the explanations can increase the interaction probability of the users on the items Tsukuda (2020); Balog and Radlinski (2020); Musto et al. (2019)?” In general, persuasiveness aims to enhance the utilities of the providers (e.g., sellers on the e-commerce websites, content creators on the micro video applications). Compared with transparency, persuasiveness does not care about whether the explanations can honestly reflect the model principles. It focuses more on whether the explanations can persuade users to interact with the recommendations, so that the business goals can be achieved.
3.4 Scrutability
Scrutability aims to evaluate “whether the explanations can exactly correspond to the recommendation results?” Cheng et al. (2019); Tsai and Brusilovsky (2019); He et al. (2015). In other words, if the explanations are different, then whether and how the recommendation list is altered. This perspective is closely related to transparency, yet one does not imply the other. Transparency pays more attention to the details inside the model, aiming to open up its working mechanisms. Scrutability cares more about the relations between the explanations and the outputs, which can be model agnostic.
We summarize the above evaluation perspectives in Table 1. We can see that recommendation explanations mainly serve three types of stakeholders, that is, the users, providers and model designers. For different stakeholders, the evaluation perspectives are different: effectiveness cares more about the utilities of the users, persuasiveness is more tailored to the providers, and transparency and scrutability serve more for the model designers. If one compares the explanations in recommender systems with those in general machine learning, she may find that the latter mainly focus on transparency and scrutability, since they are not related to the users and providers. In this sense, explainable recommendation can be more complex, since it needs to trade off different stakeholders.
Remark.
(1) The above four perspectives are not always independent; they may have both overlaps and contradictions. For example, the explanations for improving transparency may satisfy the curiosity of some users about how the recommender models work, and therefore the explanation effectiveness can be enhanced. In another example, to persuade users to purchase more items (i.e., persuasiveness), the explanations may not accurately reveal the model working mechanism (i.e., transparency), but have to highlight the features which can promote sales. When designing and evaluating explainable recommender models, it is necessary to predetermine which perspectives to focus on, and how much we can sacrifice on the other perspectives. (2) We have noticed that there are many other evaluation perspectives, such as the fluency of the explanation Chen (2021b) or whether a user-item pair is explainable Liu et al. (2019). However, these perspectives are not common, appearing only in a small number of papers. Thus we do not elaborate on them in this survey.
4 Evaluation Methods
To evaluate recommendation explanations from the above perspectives, people have designed quite a lot of evaluation methods. By going over the previous papers, we find that they mainly fall into four categories: case studies, quantitative metrics, crowdsourcing and online experiments. In the following, we elaborate on these methods in more detail.
4.1 Evaluation with Case Studies
In this category, people usually present some examples to illustrate how the recommender model works and whether the generated explanations are aligned with human intuitions. Usually, the recommender models are first learned based on the training set, and then the examples are generated based on the intermediate or final outputs by feeding the testing samples into the optimized model. More specifically, the examples can be both positive and negative Chen et al. (2019). With the positive examples, the effectiveness of the model can be demonstrated, while with the negative ones, the readers can understand when and why the model fails.
The most popular case studies are conducted by visualizing the attention weights Li (2021b, c); Chen et al. (2019). For example, in fashion recommendation, Li (2021b) presents the matching degree between tops and bottoms across different attributes. In image recommendation, Chen (2017) visualizes the importance of the user’s previously interacted images as well as the regions in each image. In graph-based recommendation, Li (2021c) illustrates which neighbors, attributes and structure contexts are more important for the current user-item interaction prediction.
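To make this concrete, the following minimal sketch renders hypothetical attention weights as a heatmap, with rows for previously interacted items and columns for attributes. It illustrates the general practice of attention-based case studies; the weights, item and attribute names are invented, and this is not the plotting code of any cited paper.

```python
# A minimal sketch (not from any cited paper): rendering hypothetical
# attention weights as a heatmap; rows = previously interacted items,
# columns = item attributes.
import numpy as np
import matplotlib.pyplot as plt

attn = np.array([[0.62, 0.20, 0.18],
                 [0.10, 0.75, 0.15],
                 [0.30, 0.25, 0.45]])              # hypothetical attention weights
history_items = ["item_12", "item_47", "item_89"]  # hypothetical history items
attributes = ["color", "brand", "price"]           # hypothetical attributes

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(attributes)))
ax.set_xticklabels(attributes)
ax.set_yticks(range(len(history_items)))
ax.set_yticklabels(history_items)
fig.colorbar(im, ax=ax, label="attention weight")
ax.set_title("Which history items/attributes matter for the recommendation")
plt.tight_layout()
plt.show()
```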
In many studies, people present template-based or fully generated natural language explanations as case studies. For example, Chen (2020); Zhang et al. (2014) show the explanations by assembling the model-predicted attributes with pre-defined templates. Li (2020b, 2021a) present explanations which are completely generated by the models.
When evaluating knowledge-based recommender models, people usually conduct case studies by showing the reasoning paths between the users and items. For example, in Wang et al. (2019), the authors present the importance of different paths for a given user-item pair. In Xian (2020); Fu (2020), the authors present examples to reveal the model routing process from the users to the items.
In many papers, the explanation items/features are presented as intuitive case studies. For example, Liu (2020); Li (2020c) highlight the items which are significant for the current model prediction. Yang et al. (2019) shows important feature combinations to reveal the model working principles.
In aspect-based recommender models, the case studies are usually conducted by presenting the learned aspects for the users/items. For example, in Tan et al. (2016), the authors show the top words in the learned aspects to demonstrate the effectiveness of the model in capturing user preference.
The major advantage of case studies lies in their intuitiveness, which makes it easy for readers to understand what the explanations look like. However, there are significant weaknesses. To begin with, case studies can be biased, since one cannot evaluate and present all the samples. Moreover, due to the lack of quantitative scores, it is difficult to compare different models and make accurate model selections.
4.2 Evaluation with Quantitative Metrics
In order to more precisely evaluate the explanations, people have designed many quantitative metrics. Basically, this category of methods assumes that the stakeholder utilities can be approximated and evaluated via parameterized functions. In the following, we describe the representative metrics.
Intuitively, if the explanations can accurately reflect the user preferences on the items, then the user satisfaction can be improved. As a result, many papers Li (2021a); Hada (2021); Li (2020a) leverage the user reviews, which pool comprehensive user preferences, as the explanation ground truth. In these studies, the explainable recommendation problem is regarded as a natural language generation (NLG) task, where the goal is to accurately predict the user reviews. Under this formulation, the following metrics are usually leveraged to evaluate the explanation quality.
BLEU and ROUGE scores. BLEU Papineni (2002) and ROUGE Lin (2004) scores are commonly used metrics for natural language generation, where the key insight is to count the word (or n-gram) level overlaps between the model predicted and real user reviews.
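As a rough illustration of the underlying idea, the sketch below computes a clipped n-gram precision between a generated explanation and a real review. It omits BLEU’s brevity penalty and multi-reference handling; in practice the standard BLEU/ROUGE implementations should be used.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a generated explanation against one real
    review -- the core quantity behind BLEU-style scores (no brevity
    penalty, single reference; a simplified illustration only)."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

generated = "the screen is bright and sharp".split()   # model output (toy)
real_review = "the screen is very bright".split()      # ground-truth review (toy)
print(ngram_precision(generated, real_review, n=2))    # bigram overlap ratio
```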
Unique Sentence Ratio (USR), Feature Coverage Ratio (FCR) and Feature Diversity (FD). These metrics aim to evaluate the diversity of the generated reviews Li (2020a). USR evaluates the diversity on the sentence level. Specifically, suppose the set of unique reviews is $\mathcal{S}$ and the total number of reviews is $N$, then this metric is computed as $\mathrm{USR} = |\mathcal{S}|/N$. FCR and FD compute the diversity on the feature level. For the former, suppose the number of distinct features appearing in all the generated reviews is $N_d$ and the total number of features is $N_f$, then $\mathrm{FCR} = N_d/N_f$. For the latter, suppose the feature set in the generated review for user-item pair $(u,i)$ is $\mathcal{F}_{u,i}$, then $\mathrm{FD} = \frac{2}{N(N-1)} \sum_{(u,i) \neq (u',i')} |\mathcal{F}_{u,i} \cap \mathcal{F}_{u',i'}|$, where $N$ is the total number of different user-item pairs.
Feature-level Precision (FP), Recall (FR) and F1 (FF). In many papers Tan (2021); Tai (2021), the explanation quality is evaluated by comparing the features in the predicted and real user reviews. For each user-item pair $(u,i)$, suppose the predicted and real feature sets are $\hat{\mathcal{F}}_{u,i}$ and $\mathcal{F}_{u,i}$, respectively. Then the feature-level precision, recall and F1 for this user-item pair are computed as $\mathrm{FP}_{u,i} = \frac{|\hat{\mathcal{F}}_{u,i} \cap \mathcal{F}_{u,i}|}{|\hat{\mathcal{F}}_{u,i}|}$, $\mathrm{FR}_{u,i} = \frac{|\hat{\mathcal{F}}_{u,i} \cap \mathcal{F}_{u,i}|}{|\mathcal{F}_{u,i}|}$ and $\mathrm{FF}_{u,i} = \frac{2 \cdot \mathrm{FP}_{u,i} \cdot \mathrm{FR}_{u,i}}{\mathrm{FP}_{u,i} + \mathrm{FR}_{u,i}}$, respectively. The final results are averaged across all the user-item pairs, that is, $\mathrm{FP} = \frac{1}{N}\sum_{(u,i)} \mathrm{FP}_{u,i}$, $\mathrm{FR} = \frac{1}{N}\sum_{(u,i)} \mathrm{FR}_{u,i}$ and $\mathrm{FF} = \frac{1}{N}\sum_{(u,i)} \mathrm{FF}_{u,i}$, where $N$ is the number of user-item pairs.
Feature Matching Ratio (FMR). In Li (2020a), the reviews are generated under the control of a feature. FMR evaluates whether this feature can successfully appear in the predicted results. It is formally computed as $\mathrm{FMR} = \frac{1}{N}\sum_{(u,i)} \mathbb{1}(f_{u,i} \in \hat{E}_{u,i})$, where $f_{u,i}$ is the input feature, $\hat{E}_{u,i}$ is the predicted review, $N$ is the number of user-item pairs, and $\mathbb{1}(\cdot)$ is the indicator function.
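To make the set-based definitions above concrete, here is a minimal sketch computing the feature-level precision/recall/F1 and FCR over a toy collection of user-item pairs. The dictionary-based inputs and toy features are assumptions for illustration; USR and FMR follow the same counting pattern.

```python
def feature_metrics(predicted, real, all_features):
    """Feature-level precision/recall/F1 (FP/FR/FF) averaged over user-item
    pairs, plus FCR. `predicted` and `real` map (user, item) -> feature set;
    `all_features` is the full feature vocabulary. Hypothetical helper that
    mirrors the definitions above."""
    n = len(predicted)
    p_sum = r_sum = f_sum = 0.0
    covered = set()                      # distinct features seen in predictions
    for ui, pred in predicted.items():
        truth = real[ui]
        covered |= pred
        hit = len(pred & truth)
        p = hit / len(pred) if pred else 0.0
        r = hit / len(truth) if truth else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        p_sum, r_sum, f_sum = p_sum + p, r_sum + r, f_sum + f
    fcr = len(covered) / len(all_features)
    return p_sum / n, r_sum / n, f_sum / n, fcr

# Toy example with two user-item pairs.
pred = {("u1", "i1"): {"battery", "screen"}, ("u2", "i3"): {"price"}}
real = {("u1", "i1"): {"screen", "camera"}, ("u2", "i3"): {"price", "shipping"}}
print(feature_metrics(pred, real, {"battery", "screen", "camera", "price", "shipping"}))
```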
In counterfactual explainable recommendation Tan (2021), the following two metrics are usually leveraged to evaluate how the model prediction changes when the key features/items (which are used as the explanations) are altered.
Probability of Necessity (PN). This metric aims to evaluate “if the explanation features of an item had been ignored, whether this item will be removed from the recommendation list?” Formally, suppose $\mathcal{E}_{u,i}$ is the explanation feature set of item $i$ when being recommended to user $u$. Let $s_{u,i} = 1$ if item $i$ no longer exists in the recommendation list when one ignores the features in $\mathcal{E}_{u,i}$, otherwise $s_{u,i} = 0$. The final PN score is computed as $\mathrm{PN} = \frac{1}{N}\sum_{(u,i)} s_{u,i}$, where $N$ is the number of explained user-item pairs.
Probability of Sufficiency (PS). This metric aims to evaluate “if only the explanation features of an item are retained, whether this item will still be in the recommendation list?” Let $t_{u,i} = 1$ if item $i$ still exists in the recommendation list when one keeps only the features in $\mathcal{E}_{u,i}$, otherwise $t_{u,i} = 0$. The final PS score is computed as $\mathrm{PS} = \frac{1}{N}\sum_{(u,i)} t_{u,i}$.
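A minimal sketch of how PN and PS might be computed is given below, assuming a hypothetical callback `in_top_k` that re-runs the recommender with a restricted feature set and reports whether the item stays in the top-K list. This illustrates the definitions above rather than the implementation of Tan (2021).

```python
def counterfactual_scores(pairs, item_features, in_top_k):
    """Probability of Necessity (PN) and Sufficiency (PS) as a rough sketch.
    `pairs` lists (user, item, explanation_feature_set); `item_features`
    maps item -> full feature set; `in_top_k(user, item, kept)` is an
    assumed callback that re-runs the recommender with only the `kept`
    features of the item and reports whether it stays in the top-K list."""
    pn_hits = ps_hits = 0
    for user, item, expl in pairs:
        # Necessity: ignore the explanation features -> item should disappear.
        if not in_top_k(user, item, item_features[item] - expl):
            pn_hits += 1
        # Sufficiency: keep only the explanation features -> item should remain.
        if in_top_k(user, item, expl):
            ps_hits += 1
    return pn_hits / len(pairs), ps_hits / len(pairs)

# Toy stand-in for the re-ranking step: the item "survives" only if the
# kept features still contain "brand" (purely hypothetical behaviour).
fake_in_top_k = lambda user, item, kept: "brand" in kept
pairs = [("u1", "i1", {"brand"}), ("u2", "i2", {"price"})]
feats = {"i1": {"brand", "price"}, "i2": {"brand", "price"}}
print(counterfactual_scores(pairs, feats, fake_in_top_k))  # (PN, PS)
```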
A similar evaluation metric is also leveraged in Liu (2020), which measures the explanation quality based on the performance change before and after the key history items/entities are removed from the model input. The key intuition is that if the explanations are the key history items/entities underlying the recommendations, then removing them should greatly change the recommendation performance. The formal definition is as follows:
Performance Shift (PS). Suppose the original recommendation performance is $\rho$; after removing the key history items/entities, the performance is changed to $\rho'$, then $\mathrm{PS} = \rho - \rho'$. Here the performance can be evaluated with different metrics, such as precision, recall, NDCG and so on.
In Abdollahi and Nasraoui (2017), the authors firstly define which items can be explained based on the user-item interaction graph, and then the following two metrics are proposed to evaluate the explainability of the recommendations.
Mean Explainability Precision (MEP) and Mean Explainability Recall (MER). For a user $u$, suppose the set of items which can be explained is $\mathcal{E}_u$ and the recommended item set is $\mathcal{R}_u$, then $\mathrm{MEP} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{|\mathcal{E}_u \cap \mathcal{R}_u|}{|\mathcal{R}_u|}$ and $\mathrm{MER} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{|\mathcal{E}_u \cap \mathcal{R}_u|}{|\mathcal{E}_u|}$, where $\mathcal{U}$ is the user set.
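The following sketch computes MEP and MER from the definitions above, assuming the explainable and recommended items are available as per-user sets; the toy data is invented for illustration.

```python
def mep_mer(explainable, recommended):
    """Mean Explainability Precision / Recall in the spirit of Abdollahi and
    Nasraoui (2017). Both arguments map user -> set of items: the items
    deemed explainable for that user, and the items actually recommended."""
    mep = mer = 0.0
    users = list(recommended)
    for u in users:
        hit = len(explainable[u] & recommended[u])
        mep += hit / len(recommended[u]) if recommended[u] else 0.0
        mer += hit / len(explainable[u]) if explainable[u] else 0.0
    return mep / len(users), mer / len(users)

expl = {"u1": {"i1", "i2", "i4"}, "u2": {"i3"}}
recs = {"u1": {"i1", "i5"}, "u2": {"i3", "i6"}}
print(mep_mer(expl, recs))  # (0.5, 0.666...)
```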
Based on the above quantitative metrics, different models can be compared, which is important for model selection and for benchmarking this field. In addition, quantitative metrics are usually very efficient, since most of them can be computed in a short time. As for the shortcomings, quantitative metrics are sometimes not fully aligned with the evaluation goals. For example, the BLEU score only evaluates the word-level overlaps between the generated and real explanations, but it does not directly tell whether the explanations are reasonable or not. What’s more, when the stakeholder utilities are too complicated, existing quantitative metrics, which are heuristically designed by a small number of people, may risk misleading the evaluation results.
4.3 Evaluation with Crowdsourcing
As mentioned before, recommendation is a subjective AI task, thus involving human feelings in the evaluation process is a direct and natural idea. Such evaluation methods are called crowdsourcing, and there are mainly three strategies, which are elaborated as follows.
Crowdsourcing with public datasets. In this method, the recommender models are trained on public datasets. Then, annotators are recruited to evaluate the model generated explanations based on a series of questions. There are two key points in the evaluation process.
(i) Annotation quality control. In order to accurately evaluate the explanations, controlling the annotation quality is a necessary step. In the previous work, there are two major strategies: one is based on a voting mechanism; for example, in Chen et al. (2019), there are three annotators for each labeling task, and the final result is kept only when more than two annotators have the same judgment. The other strategy is based on computing certain statistical quantities. For example, in Chen et al. (2018), Cohen’s weighted kappa Cohen (1968) is leveraged to assess the inter-annotator agreement and remove the outlier annotations (we sketch a minimal agreement check after the question categories below).
(ii) Annotation question design. We find that the annotation questions in different papers are quite diverse. In general, they can be classified into three categories. The first category is “model-agnostic” questions, where the questions do not depend on any specific model. In Chen et al. (2019), the annotators are required to put themselves in the position of the real users, and then label the ideal explanations from all the possible ones. The labeled explanations are then regarded as the ground truth, and the proposed model and baselines are evaluated by comparing their predicted explanations with the ground truth. The second category is “single-model” questions, where the questions focus on only one explainable model. In Li (2021b), the annotators are asked to up-vote or down-vote the explanations generated from the designed model and the baseline, respectively. In Xian (2021); Li (2020b), the annotators have to evaluate whether the proposed model can help users make better purchase decisions. In Chen (2021b), the authors ask the annotators to label whether the explanations produced by the designed model are fluent and useful. In Tao et al. (2019); Wang et al. (2018), the annotators are required to answer whether they are satisfied with the generated explanations, whether the explanations really match their preferences, and whether the explanations can provide more information for users to make a decision. The third category is “pair-model” questions, where the questions focus on the comparison between a pair of explainable models. In Li (2020b), the annotators need to answer which of the explanations produced by the designed model and the baseline is closer to the ground truth. In Tao et al. (2019), the annotation questions are: between A and B, whose explanations do you think can better help you understand the recommendations, and whose explanations can better help you make a more informed decision? In Chen et al. (2018), the annotators are asked to compare the usefulness of the explanations generated by the proposed model and the baselines.
Among the above questions, the first category is the most difficult, since the annotators have to make decisions from a large number of candidates. However, the labeled datasets are model-agnostic, and can be reused for different models and experiments. In the second and third categories, the annotators only have to answer yes or no, or select from a small number of candidate answers, but the annotations cannot be reused, which is costly and not scalable.
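For the statistical strategy of quality control mentioned in point (i), a minimal agreement check with weighted kappa is sketched below. The ratings are invented, and the use of scikit-learn here is an assumption rather than the tooling of the cited papers.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 usefulness ratings from two annotators on ten explanations.
annotator_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
annotator_b = [5, 3, 4, 2, 2, 5, 2, 4, 3, 1]

# Weighted kappa (Cohen, 1968): disagreements are penalized by how far apart
# the two ratings are, which suits ordinal rating scales.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")  # values near 1 indicate strong agreement
```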
Crowdsourcing by injecting annotator data into a public dataset. A major problem with the above method is that the annotator preferences may deviate from those of the real users, which may introduce noise into the evaluation. As a remedy for this problem, Gao et al. (2019) designs a novel crowdsourcing method by combining the annotator generated data with a public dataset. Specifically, the evaluations are conducted based on the Yelp dataset (https://www.yelp.com/dataset), which is collected from yelp.com. The authors recruit 20 annotators, and require each annotator to write at least 15 reviews on yelp.com. Then, the reviews written by the annotators are infused into the original Yelp dataset, and the explainable models are trained on the augmented dataset. In the evaluation process, the annotators are required to score the explanations from 1 to 5 according to their usefulness. Since the data generated by the annotators is also incorporated into the model training process, the feedback from the annotators is exactly real user feedback.
Table 2: Summary of the evaluation methods.
Evaluation method | Strengths | Shortcomings | Representative papers | Evaluation perspectives
---|---|---|---|---
Case studies | Better intuitiveness | Bias; cannot make quantitative comparisons | Xian (2020); Fu (2020); Liu (2020); Barkan (2020); Li (2020c) | Effectiveness; Transparency
Quantitative metrics | Quantitative evaluation; easy to benchmark; high efficiency | May deviate from the explanation goals; less effective approximation | Li (2021a); Hada (2021); Li (2020a); Tan (2021); Tai (2021) | Effectiveness; Scrutability
Crowdsourcing | Based on real human feelings | High cost | Li (2021b); Xian (2021); Li (2020b); Chen (2021b); Wang et al. (2018) | Effectiveness; Scrutability; Transparency; Persuasiveness
Online experiments | Highly reliable | High cost | Zhang et al. (2014); Xian (2021) | Effectiveness; Persuasiveness
Crowdsourcing with fully constructed datasets. In this category, the datasets are fully constructed by the annotators. In general, there are four steps in the evaluation process, that is, (i) recruiting annotators with different backgrounds (e.g., age, sex, nationality, etc.), (ii) collecting annotator preferences, (iii) generating recommendations and explanations for the annotators, and (iv) evaluating the explanations by asking questions on different evaluation perspectives. Usually, the number of annotators in this category is much larger than in the above two methods, and the experiment settings and studied problems are quite diverse. For example, in Musto et al. (2019), there are 286 annotators (76.3% males, 48.4% PhDs), and they are asked to evaluate the explanation transparency and effectiveness. In Naveed (2020), the authors aim to study whether feature-based collaborative filtering (CF) models can lead to better explanations than conventional CF models. There are 20 annotators, among whom 14 are females, with ages ranging from 21 to 40. In Hernandez-Bocanegra (2020), the authors investigate whether the type and justification level of the explanations influence the user perceptions. To achieve this goal, 152 annotators are recruited, among whom 87 are females, and the average age is 39.84. In Balog and Radlinski (2020), the authors recruit 240 annotators to study the relations between explanation effectiveness, transparency, persuasiveness and scrutability. In Tsukuda (2020), the authors recruit 622 annotators to evaluate the influence of the explanation styles on the explanation persuasiveness.
Among the above three types of crowdsourcing methods, the first is fully based on public datasets, the second is based on a combination of annotator generated data and a public dataset, and the last is completely based on annotator generated data. Compared with case studies and quantitative metrics, crowdsourcing is directly based on real human feelings. However, it can be much more expensive due to the cost of recruiting annotators.
4.4 Evaluation with Online Experiments
In the previous work, we also find a few papers which evaluate recommendation explanations with online experiments. Specifically, in Zhang et al. (2014), the online users are split into three groups. For the users in the first group, the explanations are generated from the proposed model. For the users in the second group, the explanations are produced from the baseline. For the last group, there are no explanations. After running the system for a short time, the authors compute the click-through rate (CTR) of each group to evaluate whether the explanations are useful. A similar method is also leveraged in Xian (2021), where the conversion and revenue are compared before and after presenting the explanations to the users. We believe online experiments can be more reliable. However, their cost is very high, since they may disturb the real production environments and impact user experiences.
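As an illustration of how such group-wise CTR comparisons might be analyzed, the sketch below computes the CTR of two groups and a two-proportion z-test. The traffic numbers are invented, and the cited papers do not necessarily describe their analyses at this level of detail.

```python
from math import sqrt
from statistics import NormalDist

def ctr_comparison(clicks_a, views_a, clicks_b, views_b):
    """Compare the click-through rates of two user groups (e.g., explanations
    from the proposed model vs. no explanations) with a two-proportion
    z-test under the normal approximation."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided test
    return p_a, p_b, z, p_value

# Invented traffic numbers for two groups of users.
print(ctr_comparison(clicks_a=540, views_a=10_000, clicks_b=480, views_b=10_000))
```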
We summarize the above evaluation methods in Table 2, from which we can see that different evaluation methods are usually leveraged to evaluate different perspectives. For example, case studies are mainly used to evaluate explanation effectiveness and transparency, while crowdsourcing can be leveraged to evaluate all the perspectives defined in Section 3, since the questions for the annotators can be flexibly designed. From the second and third columns, we can see that different evaluation methods have their own strengths and shortcomings, and no single method has all the advantages.
Guidelines for selecting the evaluation methods. Based on the different characteristics of the evaluation methods, we propose several guidelines on using them in real-world applications. (i) If the recommender models serve high-stake tasks such as health care and finance, then more reliable evaluation methods should be selected, since the cost of mis-explaining a recommendation, such that the users fail to make correct decisions, can be much larger than the cost of conducting the evaluations. (ii) For general recommendation tasks, if the evaluation budget is limited, one has to trade off the strengths and shortcomings of different evaluation methods. (iii) From Table 2, we can see that the same perspective can be evaluated using different methods, thus one can use these methods simultaneously to take their different advantages. For example, when the model is not yet well optimized, we can use case studies to roughly evaluate the explanation effectiveness. Once the model has been tuned better, quantitative metrics can be leveraged to demonstrate whether the explanations are indeed effective.
5 Conclusion and Outlooks
In this paper, we summarize existing explainable recommendation papers with a focus on explanation evaluation. Specifically, we introduce the main evaluation perspectives and methods in the previous work. For each evaluation method, we detail its characteristics and representative papers, and also highlight its strengths and shortcomings. At last, we propose several guidelines for better selecting the evaluation methods in real-world applications.
With this survey, we would like to provide a clear summarization of the evaluation strategies for explainable recommendation. We believe there is still much room left for improvement. To begin with, since the heuristically designed quantitative metrics can be insufficient for evaluating complex stakeholder utilities, one can automatically learn the metric functions, where a small amount of human labeled data can be incorporated as the ground truth. In addition, for benchmarking this field, one can build a large scale dataset which incorporates the ground truth on the user effectiveness, recommendation persuasiveness and so on. At last, existing methods mostly evaluate the explanations within a specific task. However, we believe human intrinsic preferences should be stable and robust across different tasks. Thus, in order to evaluate whether the derived explanations can indeed reveal such human intrinsic preferences, developing task-agnostic evaluation methods should be an interesting research direction.
References
- Abdollahi and Nasraoui [2017] Behnoush Abdollahi and Olfa Nasraoui. Using explainability for constrained matrix factorization. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 79–83, 2017.
- Adomavicius [2004] Gediminas et al. Adomavicius. Recommendation technologies: Survey of current methods and possible extensions. 2004.
- Balog and Radlinski [2020] Krisztian Balog and Filip Radlinski. Measuring recommendation explanation quality: The conflicting goals of explanations. In SIGIR, 2020.
- Balog et al. [2019] Krisztian Balog, Filip Radlinski, and Shushan Arakelyan. Transparent, scrutable and explainable user models for personalized recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 265–274, 2019.
- Barkan [2020] Oren et al. Barkan. Explainable recommendations via attentive multi-persona collaborative filtering. In Recsys, 2020.
- Bouraga [2014] Sarah et al. Bouraga. Knowledge-based recommendation systems: A survey. IJIIT, 2014.
- Chen and Wang [2014] Li Chen and Feng Wang. Sentiment-enhanced explanation of product recommendations. In Proceedings of the 23rd international conference on World Wide Web, pages 239–240, 2014.
- Chen et al. [2013] Wei Chen, Wynne Hsu, and Mong Li Lee. Tagcloud-based explanation with feedback for recommender systems. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 945–948, 2013.
- Chen et al. [2018] Chong Chen, Min Zhang, Yiqun Liu, and Shaoping Ma. Neural attentional rating regression with review-level explanations. In Proceedings of the 2018 World Wide Web Conference, 2018.
- Chen et al. [2019] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 765–774, 2019.
- Chen [2013] Song et al. Chen. Social network based recommendation systems: A short survey. In ICSC. IEEE, 2013.
- Chen [2017] Jingyuan et al. Chen. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In SIGIR, 2017.
- Chen [2020] Tong et al. Chen. Try this instead: Personalized and interpretable substitute recommendation. In SIGIR, 2020.
- Chen [2021a] Hongxu et al. Chen. Temporal meta-path guided explainable recommendation. In WSDM, 2021.
- Chen [2021b] Zhongxia et al. Chen. Towards explainable conversational recommendation. In IJCAI, 2021.
- Cheng et al. [2019] Weiyu Cheng, Yanyan Shen, Linpeng Huang, and Yanmin Zhu. Incorporating interpretability into latent factor models via fast influence analysis. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 885–893, 2019.
- Cohen [1968] Jacob Cohen. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 1968.
- Fang [2020] Hui et al. Fang. Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations. TOIS, 2020.
- Fu [2020] Zuohui et al. Fu. Fairness-aware explainable recommendation over knowledge graphs. In SIGIR, 2020.
- Gao et al. [2019] Jingyue Gao, Xiting Wang, Yasha Wang, and Xing Xie. Explainable recommendation through attentive multi-view learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3622–3629, 2019.
- Goyani [2020] Mahesh et al. Goyani. A review of movie recommendation system. ELCVIA: electronic letters on computer vision and image analysis, 2020.
- Guesmi [2021] Mouadh et al. Guesmi. On-demand personalized explanation for transparent recommendation. In UMAP, 2021.
- Gunawardana [2009] Asela et al. Gunawardana. A survey of accuracy evaluation metrics of recommendation tasks. JMLR, 2009.
- Hada [2021] Deepesh V. et al. Hada. Rexplug: Explainable recommendation using plug-and-play language model. In SIGIR, 2021.
- He et al. [2015] Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1661–1670, 2015.
- Hernandez-Bocanegra [2020] Diana C et al. Hernandez-Bocanegra. Effects of argumentative explanation types on the perception of review-based recommendations. In UMAP, 2020.
- Ji [2017] Zhenyan et al. Ji. A survey of personalised image retrieval and recommendation. In National Conference of Theoretical Computer Science, 2017.
- Kanwal [2021] Safia et al. Kanwal. A review of text-based recommendation systems. 2021.
- Li [2020a] Lei et al. Li. Generate neural template explanations for recommendation. In CIKM, 2020.
- Li [2020b] Lei et al. Li. Towards controllable explanation generation for recommender systems via neural template. In TheWebConf, 2020.
- Li [2020c] Xueqi et al. Li. Directional and explainable serendipity recommendation. In TheWebConf, 2020.
- Li [2021a] Lei et al. Li. Personalized transformer for explainable recommendation. arXiv preprint arXiv:2105.11601, 2021.
- Li [2021b] Yang et al. Li. Attribute-aware explainable complementary clothing recommendation. World Wide Web, 2021.
- Li [2021c] Zeyu et al. Li. You are what and where you are: Graph enhanced attention network for explainable poi recommendation. In CIKM, 2021.
- Lin [2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
- Liu et al. [2019] Huafeng Liu, Jingxuan Wen, Liping Jing, Jian Yu, Xiangliang Zhang, and Min Zhang. In2rec: Influence-based interpretable recommendation. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1803–1812, 2019.
- Liu [2020] Ninghao et al. Liu. Explainable recommender systems via resolving learning representations. In CIKM, 2020.
- Musto et al. [2019] Cataldo Musto, Pasquale Lops, Marco de Gemmis, and Giovanni Semeraro. Justifying recommendations through aspect-based sentiment analysis of users reviews. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, pages 4–12, 2019.
- Naveed [2020] Sidra et al. Naveed. On the use of feature-based collaborative explanations: An empirical comparison of explanation styles. In UMAP, 2020.
- Papineni [2002] Kishore et al. Papineni. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
- Silveira [2019] Thiago et al. Silveira. How good your recommender system is? a survey on evaluations in recommendation. IJMRC, 2019.
- Sonboli [2021] Nasim et al. Sonboli. Fairness and transparency in recommendation: The users’ perspective. In UMAP, 2021.
- Song [2012] Yading et al. Song. A survey of music recommendation systems and future perspectives. In CMMR, 2012.
- Tai [2021] Chang-You et al. Tai. User-centric path reasoning towards explainable recommendation. In SIGIR, 2021.
- Tan et al. [2016] Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In IJCAI, volume 16, pages 2640–2646, 2016.
- Tan [2021] Juntao et al. Tan. Counterfactual explainable recommendation. In CIKM, 2021.
- Tao et al. [2019] Yiyi Tao, Yiling Jia, Nan Wang, and Hongning Wang. The fact: Taming latent factor models for explainability with factorization trees. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 295–304, 2019.
- Tsai and Brusilovsky [2019] Chun-Hua Tsai and Peter Brusilovsky. Evaluating visual explanations for similarity-based recommendations: User perception and performance. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, pages 22–30, 2019.
- Tsukuda [2020] Kosetsu et al. Tsukuda. Explainable recommendation for repeat consumption. In Recsys, 2020.
- Wang et al. [2018] Nan Wang, Hongning Wang, Yiling Jia, and Yue Yin. Explainable recommendation via multi-task learning in opinionated text data. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 165–174, 2018.
- Wang et al. [2019] Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-Seng Chua. Explainable reasoning over knowledge graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5329–5336, 2019.
- Wang [2020] Chao et al. Wang. Personalized employee training course recommendation with career development awareness. In TheWebConf, 2020.
- Wei [2007] Kangning et al. Wei. A survey of e-commerce recommender systems. In ICSSM, 2007.
- Wu [2021] Le et al. Wu. A survey on neural recommendation: From collaborative filtering to content and context enriched recommendation. arXiv preprint arXiv:2104.13030, 2021.
- Xian [2020] Yikun et al. Xian. Cafe: Coarse-to-fine neural symbolic reasoning for explainable recommendation. In CIKM, 2020.
- Xian [2021] Yikun et al. Xian. Ex3: Explainable attribute-aware item-set recommendations. In Recsys, 2021.
- Yang et al. [2019] Xun Yang, Xiangnan He, Xiang Wang, Yunshan Ma, Fuli Feng, Meng Wang, and Tat-Seng Chua. Interpretable fashion matching with rich attributes. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 775–784, 2019.
- Yu [2015] Yonghong et al. Yu. A survey of point-of-interest recommendation in location-based social networks. In AAAIW, 2015.
- Zhang et al. [2014] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 83–92, 2014.
- Zhang [2018] Yongfeng et al. Zhang. Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192, 2018.
- Zhang [2019] Shuai et al. Zhang. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 2019.