User Decision Guidance with Selective Explanation Presentation from Explainable-AI
Abstract
This paper addresses the challenge of selecting explanations for XAI (Explainable AI)-based Intelligent Decision Support Systems (IDSSs). IDSSs have shown promise in improving user decisions through XAI-generated explanations along with AI predictions, and advances in XAI have made it possible to generate a variety of such explanations. However, how IDSSs should select explanations to enhance user decision-making remains an open question. This paper proposes X-Selector, a method for selectively presenting XAI explanations. It enables IDSSs to strategically guide users to an AI-suggested decision by predicting the impact of different combinations of explanations on a user’s decision and selecting the combination that is expected to minimize the discrepancy between an AI suggestion and a user decision. We compared the efficacy of X-Selector with two naive strategies (all possible explanations and explanations only for the most likely prediction) and two baselines (no explanation and no AI support). The results suggest the potential of X-Selector to guide users to AI-suggested decisions and improve task performance when AI accuracy is high.
I Introduction
Intelligent Decision Support Systems (IDSSs) [1], empowered by Artificial Intelligence (AI), have the potential to help users make better decisions by introducing explainability into their support. An increasing number of methods have been proposed for achieving explainable AIs (XAIs) [2], and the development of large language models (LLMs) has also made it possible to generate various post-hoc explanations that justify AI predictions. Previous studies have integrated such XAI methods into IDSSs and shown their effectiveness in presenting explanations along with AI predictions in diverse applications [3, 4].
Now that IDSSs can have a variety of explanation candidates, a new question arises as to which explanations an IDSS should provide in dynamic interaction. Explanation is a complex cognitive process [5]. Although XAI explanations can potentially guide users to make better decisions, they also risk negatively affecting explainees’ decisions. Various causes, including explanation uninterpretability [6, 7], information overload [8, 9], and contextual inaccuracy [10], can affect users and thus the performance of decision-making. Even a subtle difference in the nuance of a linguistic explanation can change its impact and sometimes mislead user decisions, depending on the context, the status of the task, and the cognitive and psychological state of the user. Conversely, we can expect that IDSSs can be greatly enhanced if they can strategically select explanations that are likely to lead users to better decisions while taking the situation into account.
To address the question of how IDSSs can select explanations to improve user decisions, this paper proposes X-Selector, a method for dynamically selecting which explanations to provide along with an AI prediction. The main characteristic of X-Selector is that it predicts how explanations affect user decision-making for each trial and attempts to guide users to an AI-suggested decision by referring to the prediction results. The design of guiding users through explanation selection is inspired by libertarian paternalism [11], the idea of steering people toward better choices while preserving the autonomy of their decision-making.
This paper also reports user experiments that simulated stock trading with an XAI-based IDSS. In a preliminary experiment, we compared two naive but common strategies—ALL (providing all possible explanations) and ARGMAX (providing only explanations for the AI’s most probable prediction)—against baseline scenarios providing no explanations or no decision support. The results suggest that the ARGMAX strategy works better with high AI accuracy, and ALL is more effective when AI accuracy is lower, indicating that the strategy for selecting explanations affects user performance. In the second experiment, we compared the results of explanations selected by X-Selector with ARGMAX and ALL. The results indicate the potential of X-Selector’s selective explanations to more strongly lead users to suggested decisions and to outperform ARGMAX when AI accuracy is high.
II Background
II-A XAIs for deep learning models
While various methods such as Fuzzy Logic and Evolutionary Computing have been introduced to IDSSs, this paper targets IDSSs with Deep Learning (DL) models. IDSSs driven by DL models are capable of dealing with high-dimensional data such as visual images and are actively studied in diverse fields [12, 13, 14]. Due to their black-box nature, explainability for DL models is also an area of active research, and this can potentially offer benefits for IDSSs.
There are various forms of explanations depending on the nature of the target AI. Common explanations for visual information processing AIs include presenting saliency maps. The class activation map (CAM) is a widely used method for visualizing a saliency map of convolutional neural network (CNN) layers [15]. It identifies the regions of an input image that contribute most to the model’s classification of the image into a particular class.
Language is also a common modality of XAIs, and free-text explanations are rapidly becoming available thanks to advances in LLMs [16]. LLMs can generate post-hoc explanations for AI predictions. Here, post-hoc means that the explanations are generated after the AI’s decision-making process has occurred, as opposed to intrinsic methods that generate explanations as an integral part of that process [2].
II-B Human-XAI interaction
The theme of this study involves how to facilitate users’ appropriate use of AI. Avoiding human over/under-reliance on an AI is a fundamental problem of human-AI interaction [17]. Here, over-reliance is a state in which a human overestimates the capability of an AI and blindly follows its decision, whereas under-reliance is a state in which a human underuses or distrusts an AI even though it can perform well.
Although explanation is generally believed to help people appropriately use AI by providing transparency in AI predictions, previous studies suggest that XAI explanations do not always work positively [18]. Maehigashi et al. demonstrated that presenting AI saliency maps has different effects on user trust in an AI depending on the task difficulty and the interpretability of the saliency map [6]. Herm revealed that the type of XAI explanation strongly influences users’ cognitive load, task performance, and task time [9]. Panigutti et al. conducted a user study with an ontology-based clinical IDSS and found that users were influenced by the explanations despite the low perceived explanation quality [19]. These results suggest potential risks of triggering under-reliance with explanations or, conversely, leading users to blindly follow explanations from an IDSS even when the conclusion drawn from them is incorrect.
This study aims to computationally predict how explanations affect user decisions in order to avoid misleading users and to encourage them to make better decisions through explanation selection. Work by Wiegreffe et al. [16] shares a similar concept with this study. They propose a method of evaluating explanations generated by LLMs by predicting human ratings of their acceptability. This approach is pivotal in understanding how users perceive AI-generated explanations. However, our study diverges by focusing on the behavioral impacts of these explanations on human decision-making. We are particularly interested in how these explanations can alter the decisions made by users receiving an IDSS’s support, rather than just their perceptions of the explanations. Another relevant study, Pred-RC [20, 21], aims to predict the effect of explanations of AI performance so that users can avoid over/under-reliance. It dynamically predicts a user’s binary judgment of whether to assign a task to the AI and selects explanations that guide the user to better assignments. X-Selector takes a further step: it predicts concrete decisions while taking the effects of explanations into account and proactively influences those decisions to improve the performance of each decision.
III X-Selector
III-A Overview
This paper proposes X-Selector, a method for dynamically selecting explanations for AI predictions. The idea of X-Selector is that it predicts user decisions under possible combinations of explanations and chooses the best one that is predicted to minimize the discrepancy between the decision that the user is likely to make and the AI-suggested one.
III-B Algorithm
The main components of X-Selector are UserModel and a policy $\pi$. UserModel is a model of a user who makes decision $d$. X-Selector uses it for user decision prediction:

$P(d \mid X, c) = \mathrm{UserModel}(X, c)$  (1)

The output of UserModel is represented as a probability distribution of $d$ conditioned on $X$ and $c$, where $X$ is a combination of explanations to be presented to the user, and $c$ represents all the other contextual information, including AI predictions, task status, and user status. In this paper, we developed a dataset of $(d, X, c)$ tuples and prepared a machine learning model trained on this dataset to implement UserModel.

The policy $\pi$ considers a decision $d_{\mathrm{AI}}$ based on $c$. This inference is done in parallel with user decision-making:

$d_{\mathrm{AI}} = \pi(c)$  (2)

X-Selector aims to minimize the discrepancy between $d$ and $d_{\mathrm{AI}}$ by comparing the effect of each $X$ on $d$. The selected combination $X^{*}$ is calculated as:

$X^{*} = \operatorname*{arg\,min}_{X} \left| \mathbb{E}_{d \sim P(d \mid X, c)}[d] - d_{\mathrm{AI}} \right|$  (3)

To calculate this equation, X-Selector simulates how $d$ will change for each candidate $X$ using UserModel and chooses the combination that is predicted to guide $d$ closest to $d_{\mathrm{AI}}$.
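To make the selection step concrete, here is a minimal Python sketch of Equations 1–3. It is not the authors’ implementation: `user_model.expected_decision`, `policy`, and the flat list of explanation candidates are hypothetical stand-ins for UserModel, $\pi$, and the combination space.

```python
from itertools import product

def select_explanations(user_model, policy, context, explanations):
    """Pick the combination X* that minimizes |E[d | X, c] - d_AI| (Eq. 3)."""
    d_ai = policy(context)                                   # Eq. 2: AI-suggested decision
    best_combo, best_gap = None, float("inf")
    for mask in product([0, 1], repeat=len(explanations)):   # every show/hide pattern
        combo = list(zip(explanations, mask))                # candidate X
        d_user = user_model.expected_decision(combo, context)  # E[d | X, c] from Eq. 1
        gap = abs(d_user - d_ai)
        if gap < best_gap:
            best_combo, best_gap = combo, gap
    return best_combo
```

Because the search enumerates every show/hide pattern, its cost grows as $2^{|X|}$; this stays tractable in the implementation below because only nine explanations are available per trading day.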
III-C Implementation
III-C1 Task
We implemented X-Selector in a stock trading simulator in which users get support from an XAI-based IDSS. Figure 1 shows screenshots of the simulator. The simulation was conducted on a website. Participants were virtually given three million JPY and traded stocks for 60 days with a stock price chart, AI prediction of the future stock price, and explanations for the prediction.
In the simulation, participants checked the opening price and a price chart for each day and decided whether to buy stocks with the funds they had, sell stocks they held, or hold their position. In accordance with Japan’s general stock trading system, participants could trade stocks in units of 100 shares. Participants were asked to indicate their decision twice a day to clarify the influence of the explanations. They were first asked to decide an initial order $d_{\mathrm{init}}$, that is, the amount of trade based only on chart information and without the support of the IDSS. Then, the IDSS showed a bar graph that indicated the output of a stock price prediction model and its explanations. We did not explicitly show $d_{\mathrm{AI}}$ in order to enhance the autonomy of users’ decision-making, which is inspired by libertarian paternalism [11], the idea of influencing behavior while respecting freedom of choice. However, X-Selector can easily be extended to a setting in which $d_{\mathrm{AI}}$ is given to users by including it in $c$ (when $d_{\mathrm{AI}}$ is always shown) or in $X$ (when $d_{\mathrm{AI}}$ is shown selectively). Finally, participants input their final order $d_{\mathrm{final}}$. After this, the simulator immediately transitioned to the next day. Positions carried over from the final day were converted into cash on the basis of the average stock price over the next five days to calculate each participant’s total performance.
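As an illustration of the bookkeeping described above, the following sketch applies an order in 100-share units and converts the final position at the five-day average price. It is a hypothetical re-implementation of the stated rules, not the simulator’s code.

```python
LOT = 100  # shares per trading unit, per Japan's general stock trading system

def apply_order(cash: float, shares: int, price: float, order_units: int):
    """Apply a buy (positive) or sell (negative) order given in 100-share units."""
    if order_units > 0:                               # buying is capped by available cash
        order_units = min(order_units, int(cash // (price * LOT)))
    else:                                             # selling is capped by held shares
        order_units = max(order_units, -(shares // LOT))
    cash -= order_units * LOT * price
    shares += order_units * LOT
    return cash, shares

def final_assets(cash: float, shares: int, next_five_prices: list) -> float:
    """Convert any remaining position at the average price of the next five days."""
    return cash + shares * (sum(next_five_prices) / len(next_five_prices))
```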
[Figure 1: Screenshots of the stock trading simulator.]
III-C2 StockAI
In the task, the IDSS provides the prediction of a stock price prediction model (StockAI) as user support. StockAI is a machine-learning model designed to predict the average stock price over the next five business days, and we used its prediction as the target of the explanations provided to users. StockAI predicts future stock prices on the basis of an image of a candlestick chart. Although using a candlestick chart as input does not necessarily lead to better performance than modern approaches proposed in the latest studies [22], we chose this input because the saliency maps generated from the model are easier to understand as explanations of AI predictions. Note that the aim of this research is not to build a high-performance prediction model but to investigate the interaction between a human and an AI whose performance is not necessarily perfect.
For the implementation of StockAI, we used ResNet-18 [23], a deep-learning visual information processing model, with the PyTorch library (https://pytorch.org/). StockAI is trained in a supervised manner; it classifies the ratio of the future stock price to the opening price of the day into three classes: BULL (over +2%), NEUTRAL (from -2% to +2%), and BEAR (under -2%). The prediction results are presented as a bar graph of the probability distribution over the classes, which is hereafter denoted as $y_{\mathrm{pred}}$. For the training, we collected historical stock data (from 2018/5/18 to 2023/5/16) of companies included in the Japanese stock index Nikkei225. We split the data by stock code, with three-quarters of the data as the training dataset and the remainder as the test dataset. On the test dataset, the accuracy with which the model predicted the correct class among the three was 0.474, and the accuracy for binary classification, i.e., the matching rate between the sign of the expected value of the model’s prediction and that of the actual fluctuation, was 0.63.
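A minimal sketch of this setup is shown below. It assumes candlestick charts are already rendered as image tensors; the learning rate, optimizer, and image size are illustrative choices, not the authors’ training configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def label_from_return(ratio: float) -> int:
    """Map the 5-day-ahead average price change into BULL / NEUTRAL / BEAR."""
    if ratio > 0.02:
        return 0          # BULL: over +2%
    if ratio < -0.02:
        return 2          # BEAR: under -2%
    return 1              # NEUTRAL: between -2% and +2%

model = resnet18(num_classes=3)            # ResNet-18 with a 3-class head
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(chart_batch: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised update on a batch of candlestick-chart images."""
    optimizer.zero_grad()
    logits = model(chart_batch)            # (B, 3) class scores
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference time, y_pred shown to users is the softmax over the three classes.
model.eval()
with torch.no_grad():
    y_pred = torch.softmax(model(torch.randn(1, 3, 224, 224)), dim=1)  # dummy chart tensor
```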
III-C3 Explanations
We prepared two types of explanations: saliency maps and free-text explanations. We applied CAM-based methods available in the pytorch-grad-cam package (https://github.com/frgfm/torch-cam) to StockAI and adopted Score-CAM [24] because it most clearly visualizes saliency maps of StockAI. Because CAM-based methods can generate a saliency map for each prediction class, three maps were acquired for each prediction. Let $X_{\mathrm{map}}$ be the set of the acquired maps.
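The per-class map generation could look like the sketch below. Note that the paper names the pytorch-grad-cam package while linking the torch-cam repository; this sketch assumes the pytorch-grad-cam interface (ScoreCAM with ClassifierOutputTarget) and a ResNet-18 `layer4` target layer, so treat the exact calls as assumptions rather than the authors’ code.

```python
import torch
from pytorch_grad_cam import ScoreCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def class_wise_saliency(model: torch.nn.Module, chart: torch.Tensor):
    """Generate one Score-CAM saliency map per prediction class (BULL, NEUTRAL, BEAR)."""
    cam = ScoreCAM(model=model, target_layers=[model.layer4[-1]])  # last ResNet-18 block
    maps = []
    for cls in range(3):
        grayscale = cam(input_tensor=chart, targets=[ClassifierOutputTarget(cls)])
        maps.append(grayscale[0])   # drop the batch dimension; X_map holds three maps
    return maps
```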
In addition, we created a set of free-text explanations with the GPT-4V model via the OpenAI API [25], which allows images as input. We input a chart with a prompt that asked GPT-4V to generate two explanation sentences justifying each prediction class (BULL, NEUTRAL, and BEAR). Therefore, we acquired six sentences in total for each chart. Let us denote the set of them by $X_{\mathrm{text}}$.
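A sketch of such a request through the OpenAI Python SDK follows; the prompt wording and model identifier are illustrative, not the ones used in the study.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def free_text_explanations(chart_png_path: str) -> str:
    """Ask a GPT-4V-class model for two sentences justifying each class."""
    with open(chart_png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "For the attached candlestick chart, write two sentences that justify "
        "each of the following predictions about the 5-day average price: "
        "BULL (over +2%), NEUTRAL (-2% to +2%), and BEAR (under -2%)."
    )
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",   # illustrative choice of vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content  # six sentences, two per class (X_text)
```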
As a result, three saliency maps and six free-text explanations were available for each trading day, and X-Selector considered all show/hide combinations of these explanations ($2^{9} = 512$ per day).
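Assuming each of the nine explanations can be shown or hidden independently (the mask encoding described in Section III-C4), the candidate set can be enumerated as follows; the placeholder lists are hypothetical.

```python
from itertools import product

saliency_maps = ["map_bull", "map_neutral", "map_bear"]        # placeholders for X_map
text_explanations = [f"sentence_{i}" for i in range(6)]        # placeholders for X_text
explanations = saliency_maps + text_explanations               # nine candidates per day

candidates = [list(zip(explanations, mask))
              for mask in product([0, 1], repeat=len(explanations))]
assert len(candidates) == 2 ** 9                                # 512 show/hide combinations
```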
III-C4 Models
[Figure 2: Architecture of UserModel.]
We implemented UserModel with a deep learning model (Figure 2). The input of UserModel is a tuple $(c, X)$. $c$ includes four variables: date $t$, StockAI’s prediction $y_{\mathrm{pred}}$, total rate $r$, and initial order $d_{\mathrm{init}}$. $t$ is a categorical variable that embeds the context of the day such as the stock price. $y_{\mathrm{pred}}$ is a three-dimensional vector that corresponds to the values in the bar graph (Fig. 1). $r$ is the percentage increase or decrease of the user’s total assets from the initial amount. $t$ and the other variables are encoded into 2048-dimensional vectors with the Embedding and Linear modules implemented in PyTorch, respectively.
Let us denote $X_{\mathrm{map}}$ and $X_{\mathrm{text}}$ as sets of pairs $\{(x_i, m_i)\}$, where $x_i$ is the raw data of an explanation and $m_i \in \{0, 1\}$: $m_i = 1$ if $x_i$ is to be presented, and $m_i = 0$ when it is hidden. $X_{\mathrm{map}}$ and $X_{\mathrm{text}}$ are also encoded into 2048-dimensional vectors $e_{\mathrm{map}}$ and $e_{\mathrm{text}}$:

$e_{\mathrm{map}} = \sum_{(x_i, m_i) \in X_{\mathrm{map}}} m_i \odot \mathrm{Enc}_{\mathrm{map}}(x_i), \qquad e_{\mathrm{text}} = \sum_{(x_i, m_i) \in X_{\mathrm{text}}} m_i \odot \mathrm{Enc}_{\mathrm{text}}(x_i)$

where $\odot$ denotes an element-wise product. $\mathrm{Enc}_{\mathrm{map}}$ is a three-layer CNN model. For $\mathrm{Enc}_{\mathrm{text}}$, we used the E5 (embeddings from bidirectional encoder representations) model [26] with pretrained parameters (https://huggingface.co/intfloat/multilingual-e5-large).
All embedding vectors ($e_t$, $e_{\mathrm{pred}}$, $e_r$, $e_{\mathrm{init}}$, $e_{\mathrm{map}}$, and $e_{\mathrm{text}}$) are concatenated and input to a three-layer linear model. To extract the influence of explanations, the model was trained to predict not $d_{\mathrm{final}}$ itself but the difference $\Delta d = d_{\mathrm{final}} - d_{\mathrm{init}}$. In our initial trial, UserModel always predicted $\Delta d$ to be nearly 0 due to the distributional bias, so we added an auxiliary task of predicting whether $\Delta d = 0$ and trained the model to predict $\Delta d$ only when $\Delta d \neq 0$. The expected value of $d$ in Equation 3 is then computed as $d_{\mathrm{init}} + P(\Delta d \neq 0)\,\mathbb{E}[\Delta d \mid \Delta d \neq 0]$.
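The description above can be sketched as the following PyTorch module. The layer widths, the CNN shape, and the 1024-dimensional E5 output are assumptions, and the module is not the authors’ exact architecture; it only mirrors the described structure: per-input encoders, masked sums over explanation encodings, a three-layer head, and two outputs for the auxiliary change-probability task and $\Delta d$.

```python
import torch
import torch.nn as nn

D = 2048  # embedding width used for every encoded input

class UserModel(nn.Module):
    """Sketch of the decision-change predictor: two heads estimate
    P(change) and the change Δd = d_final - d_init."""

    def __init__(self, n_days: int, text_dim: int = 1024):
        super().__init__()
        self.date_emb = nn.Embedding(n_days, D)      # categorical day context t
        self.pred_enc = nn.Linear(3, D)              # StockAI's 3-class distribution y_pred
        self.rate_enc = nn.Linear(1, D)              # total rate r
        self.init_enc = nn.Linear(1, D)              # initial order d_init
        self.map_enc = nn.Sequential(                # three-layer CNN for saliency maps
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, D),
        )
        self.text_enc = nn.Linear(text_dim, D)       # projects precomputed E5 embeddings
        self.head = nn.Sequential(                   # three-layer head over the concatenation
            nn.Linear(6 * D, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        self.p_change = nn.Linear(128, 1)            # auxiliary head: P(d_final != d_init)
        self.delta = nn.Linear(128, 1)               # main head: Δd given a change

    def forward(self, day, y_pred, rate, d_init, maps, map_mask, texts, text_mask):
        # Masked sums implement the show/hide flags m_i for each explanation.
        e_map = (self.map_enc(maps) * map_mask.unsqueeze(-1)).sum(dim=0, keepdim=True)
        e_text = (self.text_enc(texts) * text_mask.unsqueeze(-1)).sum(dim=0, keepdim=True)
        feats = torch.cat([
            self.date_emb(day), self.pred_enc(y_pred), self.rate_enc(rate),
            self.init_enc(d_init), e_map, e_text,
        ], dim=-1)
        h = self.head(feats)
        return torch.sigmoid(self.p_change(h)), self.delta(h)
```

Consistent with the combination described above, the expected decision fed into Equation 3 would then be recovered as $d_{\mathrm{init}} + p_{\mathrm{change}} \cdot \Delta d$.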
$\pi$ was acquired with the deep deterministic policy gradient (DDPG), a deep reinforcement learning method [27]. We simply trained $\pi$ to decide $d_{\mathrm{AI}}$ so as to maximize assets on the basis of $y_{\mathrm{pred}}$ over the training dataset. The reward for the policy is calculated as the difference in total assets between the current day and the previous day.
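For illustration only, a hypothetical actor head and the reward described above might look as follows; the full DDPG machinery (critic, replay buffer, target networks) is omitted, and the action semantics are an assumption rather than the authors’ design.

```python
import torch.nn as nn

# Hypothetical DDPG actor: maps StockAI's 3-class distribution y_pred to a
# continuous action in [-1, 1] (fraction of cash/shares to trade as d_AI).
actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

def reward(prev_total_assets: float, cash: float, shares: int, price: float) -> float:
    """Reward for the policy: day-over-day change in total assets."""
    return (cash + shares * price) - prev_total_assets
```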
IV Experiments
IV-A Preliminary experiment
IV-A1 Procedure
We conducted a preliminary experiment to investigate the performance of users who were provided explanations with two naive strategies (ALL and ARGMAX). ALL shows all nine explanations available for each day, and ARGMAX selects only the explanations for StockAI’s most probable prediction class. To isolate the contribution of the explanations, we also prepared two baselines: ONLY_PRED shows $y_{\mathrm{pred}}$ but does not provide any explanations, and in PLAIN, participants received no support from the IDSS and acted on their own.
For simulation, we chose a Japanese general trading company (code: 2768) from the test dataset on the basis of the common stock price range (1,000 - 3,000 JPY) and its high volatility compared with the other Nikkei225 companies.
Because we anticipated that the accuracy of StockAI would affect the results, we prepared two scenarios: high-accuracy and low-accuracy. We calculated the moving average of StockAI’s accuracy with a window size of 60 days and chose one 60-day section for each scenario. The accuracy of StockAI in the high-accuracy scenario was 0.750, the highest observed, and that in the low-accuracy scenario was 0.333, the chance level of three-class classification.
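The scenario selection can be sketched as below, assuming a 0/1 per-day correctness array for StockAI; the exact selection rule used by the authors may differ.

```python
import numpy as np

def pick_scenarios(correct: np.ndarray, window: int = 60):
    """Given a 0/1 array of whether StockAI's class prediction was correct each day,
    return start indices of the highest-accuracy window and of a window whose
    accuracy is closest to the three-class chance level (1/3)."""
    moving_acc = np.convolve(correct, np.ones(window) / window, mode="valid")
    return int(moving_acc.argmax()), int(np.abs(moving_acc - 1 / 3).argmin())
```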
We recruited participants through Lancers (https://lancers.jp/), a Japanese crowdsourcing platform, with compensation of 220 JPY, and 336 people joined. The participants were first provided pertinent information, and 325 consented to participate. We gave them instructions on the task and basic explanations about stock charts and the price prediction AI. We instructed the participants to increase the given three million JPY as much as possible by trading with the IDSS’s support. To motivate them, we told them that additional rewards would be given to the top performers; we did not disclose the amount of the additional rewards or the number of participants who would receive them. We asked six questions to check their comprehension of the task, and 34 participants who failed to answer correctly were excluded. After familiarization with the trading simulator, the participants traded for 60 virtual days successively. 242 participants completed the task (152 males, 88 females, and 2 did not answer; aged 14-77). Table I gives details on the sample sizes.
TABLE I: Number of participants per condition

| Scenario | ALL | ARGMAX | ONLY_PRED | PLAIN |
|---|---|---|---|---|
| High-accuracy | 39 | 40 | 34 | 41 |
| Low-accuracy | 31 | 34 | 34 | 38 |
IV-A2 Result
Figure 3 shows the changes in the participants’ performance. A conspicuous result is the underperformance of PLAIN, particularly in the high-accuracy scenario. ONLY_PRED performed well in the high-accuracy scenario but could not outperform PLAIN in the low-accuracy one. This suggests that presenting $y_{\mathrm{pred}}$ alone contributes to improving performance only when it is sufficiently accurate.
ALL and ARGMAX showed different results between the scenarios. For high-accuracy, ARGMAX outperformed ALL, and ALL slightly underperformed the ONLY_PRED baseline as well. This suggests that ARGMAX explanations successfully guided users to follow StockAI’s prediction, whereas ALL toned down the guidance, which worked negatively in this scenario. On the other hand, ALL outperformed ARGMAX and the baselines for low-accuracy. Interestingly, ARGMAX also outperformed the baselines, which suggests that explanations give users insight into the situation and the AI’s accuracy and can contribute to better decision-making. ALL worked positively for low-accuracy by providing multiple perspectives.
[Figure 3: Changes in participants’ performance in the preliminary experiment (high-accuracy and low-accuracy scenarios).]
IV-B Experiment with X-Selector
IV-B1 Procedure
[Figure 4: Results for the high-accuracy scenario: distribution of per-participant correlation coefficients between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$, and changes in user performance.]
To evaluate X-Selector, we conducted a simulation with its selected explanations. To train UserModel, we used the data from the preliminary experiment and additional data acquired in another experiment in which explanations were randomly selected; we added these data to broaden the variety of explanation combinations in the dataset. The numbers of additional participants were 54 and 45 for the high- and low-accuracy scenarios, respectively. We conducted a 4-fold cross-validation for UserModel, and the correlation coefficient between the model’s predictions and the ground truths was 0.429 on average.
We obtained participation from 97 participants. Finally, 39 and 35 participants completed the task for the high-accuracy and low-accuracy scenarios, respectively (46 males, 26 females, and 2 did not answer; aged 23-64).
To analyze the results, we computed the correlation coefficient between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$ for each participant as a measure of whether X-Selector successfully guided users to $d_{\mathrm{AI}}$, in addition to the comparison of user performance that we did in the preliminary experiment.
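The per-participant guidance measure is a plain Pearson correlation, computed for example as below (array names are hypothetical):

```python
import numpy as np

def guidance_score(d_final: np.ndarray, d_ai: np.ndarray) -> float:
    """Per-participant Pearson correlation between final orders and AI suggestions
    over the 60 trading days."""
    return float(np.corrcoef(d_final, d_ai)[0, 1])
```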
IV-B2 Result
Figure 4 shows the distribution of the correlation coefficients between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$ in the high-accuracy condition. The results for ALL and ARGMAX are also shown for comparison. Notably, while the peaks for ALL and ARGMAX are centered around zero, X-Selector shifted the peak to the right, indicating a stronger correlation between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$ for a greater number of participants. This means that X-Selector effectively guided users to $d_{\mathrm{AI}}$ not by coercion but by presenting explanations selectively.
Figure 4 also illustrates user performance. X-Selector generally outperformed ALL and ARGMAX, meaning that X-Selector enabled users to trade better with selectively presented explanations. In more detail, X-Selector initially underperformed ARGMAX, but the ranking reversed on day 16. The gap narrowed once near day 39 but broadened again toward the end.
A possible reason for X-Selector’s better performance is that it can predict which combination of explanations guides participants to sell or buy more shares. For example, the stock price around day 16 dropped steeply, so the IDSS needed to guide participants to reverse their position. Here, whereas ARGMAX showed explanations only for BEAR, X-Selector showed explanations for NEUTRAL as well as BEAR, which may have helped users sell more of their shares. Similarly, X-Selector also attempted to guide users to buy a moderate amount when $y_{\mathrm{pred}}$ was positive but not high by, for example, showing only a saliency map for BULL and no text explanations. Another reason is that X-Selector can overcome the ambiguity in the interpretation of $y_{\mathrm{pred}}$. $y_{\mathrm{pred}}$ reflects the momentum of the stock price in the high-accuracy scenario and should provide some insight for trading, but it was up to the participants how to use it to actually decide their order. $d_{\mathrm{AI}}$ sometimes suggested buying shares even though NEUTRAL or BEAR was the most likely class in $y_{\mathrm{pred}}$. Thus, we can say that $y_{\mathrm{pred}}$ was poorly calibrated, but by referring to $d_{\mathrm{AI}}$, X-Selector could avoid misleading participants and instead lead them to more promising decisions.
On the other hand, X-Selector underperformed ARGMAX until day 16. The stock price was in an uptrend until day 14, and ARGMAX continuously presented explanations for BULL for 12 days in a row, which may have strongly guided participants to buy stocks and led to large profits. In our implementation, UserModel considers only the explanations of the current day and does not take into account the history of previously provided explanations, which is a target for future work.
[Figure 5: Results for the low-accuracy scenario: changes in user performance and the distribution of per-participant correlation coefficients between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$.]
X-Selector could not improve user performance in the low-accuracy scenario (Figure 5). Overall, its result was similar to ARGMAX and underperformed ALL. We further examined the correlation coefficient between $d_{\mathrm{final}}$ and $d_{\mathrm{AI}}$. Figure 5 shows that, contrary to the high-accuracy scenario, X-Selector did not increase the score. The different results between the high- and low-accuracy scenarios indicate the possibility that participants actively assessed the reliability of the AI and autonomously decided whether to follow X-Selector’s guidance. This itself highlights a positive aspect of introducing libertarian paternalism to human-AI interaction: users can potentially avoid AI failures depending on its reliability. However, this did not result in improved performance in this scenario. The lack of correlation between the score and the final asset amounts in the X-Selector condition suggests that merely disregarding the AI’s guidance does not guarantee performance improvement. A future direction for this problem is to develop a mechanism that controls the strength of AI guidance and provides explanations in a more neutral way depending on AI accuracy.
V Conclusion
This paper investigated the question of how IDSSs can select explanations, and we proposed X-Selector, a method for dynamically selecting which explanations to provide along with an AI prediction. In X-Selector, UserModel predicts the effect of presenting each possible combination of explanations on a user’s decision, and X-Selector then selects the combination that minimizes the difference between the predicted user decision and the AI’s suggestion. We applied X-Selector to a stock trading simulation with the support of an XAI-based IDSS. The results indicated that X-Selector can select explanations that effectively guide users to suggested decisions and improve performance when the accuracy of the AI is high, and they also revealed a new challenge for X-Selector in low-accuracy cases.
References
- [1] G. Phillips-Wren, Intelligent Decision Support Systems, 02 2013, pp. 25–44.
- [2] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (xai),” IEEE Access, vol. 6, pp. 52 138–52 160, 2018.
- [3] M. H. Lee and C. J. Chew, “Understanding the effect of counterfactual explanations on trust and reliance on ai for human-ai collaborative clinical decision making,” Proc. ACM Hum.-Comput. Interact., vol. 7, no. CSCW2, oct 2023.
- [4] D. P. Panagoulias, E. Sarmas, V. Marinakis, M. Virvou, G. A. Tsihrintzis, and H. Doukas, “Intelligent decision support for energy management: A methodology for tailored explainability of artificial intelligence analytics,” Electronics, vol. 12, no. 21, 2023.
- [5] T. Miller, “Explanation in artificial intelligence: Insights from the social sciences,” Artificial Intelligence, vol. 267, pp. 1–38, 2019.
- [6] A. Maehigashi, Y. Fukuchi, and S. Yamada, “Modeling reliance on xai indicating its purpose and attention,” in Proc. Annu. Meet. of CogSci, vol. 45, 2023, pp. 1929–1936.
- [7] ——, “Empirical investigation of how robot’s pointing gesture influences trust in and acceptance of heatmap-based xai,” in 2023 32nd IEEE Intl. Conf. RO-MAN, 2023, pp. 2134–2139.
- [8] A. N. Ferguson, M. Franklin, and D. Lagnado, “Explanations that backfire: Explainable artificial intelligence can cause information overload,” in Proc. Annu. Meet. of CogSci, vol. 44, no. 44, 2022.
- [9] L.-V. Herm, “Impact of explainable ai on cognitive load: Insights from an empirical study,” in 31st Euro. Conf. Info. Syst., 2023, 269.
- [10] U. Ehsan, P. Tambwekar, L. Chan, B. Harrison, and M. O. Riedl, “Automated rationale generation: A technique for explainable ai and its effects on human perceptions,” in Proc. 24th Int. Conf. IUI, 2019, p. 263–274.
- [11] C. R. Sunstein, Why Nudge?: The Politics of Libertarian Paternalism. Yale University Press, 2014.
- [12] M. Kraus and S. Feuerriegel, “Decision support from financial disclosures with deep neural networks and transfer learning,” Decision Support Systems, vol. 104, pp. 38–48, 2017.
- [13] A. Chernov, M. Butakova, and A. Kostyukov, “Intelligent decision support for power grids using deep learning on small datasets,” in 2020 2nd Intl. Conf. SUMMA, 2020, pp. 958–962.
- [14] C.-Y. Hung, C.-H. Lin, T.-H. Lan, G.-S. Peng, and C.-C. Lee, “Development of an intelligent decision support system for ischemic stroke risk assessment in a population-based electronic health record database,” PLOS ONE, vol. 14, no. 3, pp. 1–16, 03 2019.
- [15] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proc. IEEE Conf. CVPR, June 2016.
- [16] S. Wiegreffe, J. Hessel, S. Swayamdipta, M. Riedl, and Y. Choi, “Reframing human-AI collaboration for generating free-text explanations,” in Proc. of the 2022 Conf. of NAACL. ACL, Jul. 2022, pp. 632–658.
- [17] R. Parasuraman and V. Riley, “Humans and automation: Use, misuse, disuse, abuse,” Human Factors, vol. 39, no. 2, pp. 230–253, 1997.
- [18] H. Vasconcelos, M. Jörke, M. Grunde-McLaughlin, T. Gerstenberg, M. S. Bernstein, and R. Krishna, “Explanations can reduce overreliance on ai systems during decision-making,” Proc. ACM Hum.-Comput. Interact., vol. 7, no. CSCW1, apr 2023. [Online]. Available: https://doi.org/10.1145/3579605
- [19] C. Panigutti, A. Beretta, F. Giannotti, and D. Pedreschi, “Understanding the impact of explanations on advice-taking: A user study for ai-based clinical decision support systems,” in Proc. of 2022 CHI. ACM, 2022.
- [20] Y. Fukuchi and S. Yamada, “Selectively providing reliance calibration cues with reliance prediction,” in Proc. Annu. Meet. of CogSci, vol. 45, 2023, pp. 1579–1586.
- [21] ——, “Dynamic selection of reliance calibration cues with ai reliance model,” IEEE Access, vol. 11, pp. 138 870–138 881, 2023.
- [22] J.-F. Chen, W.-L. Chen, C.-P. Huang, S.-H. Huang, and A.-P. Chen, “Financial time-series data analysis using deep convolutional neural networks,” in 2016 7th Intl. Conf. on CCBD, 2016, pp. 87–92.
- [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. CVPR, 2016, pp. 770–778.
- [24] H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in Proc. IEEE/CVF Conf. CVPR Workshops, June 2020.
- [25] OpenAI, “Gpt-4 technical report,” 2023.
- [26] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei, “Text embeddings by weakly-supervised contrastive pre-training,” 2022.
- [27] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in Proc. 33rd Intl. Conf. on Machine Learning, vol. 48, 20–22 Jun 2016, pp. 2829–2838.