
Journey of Hallucination-minimized Generative AI Solutions for Financial Decision Makers

Sohini Roychowdhury ([email protected]), Corporate Data and Analytics Office, Accenture LLP, San Francisco, USA 94105
Abstract.

Generative AI has significantly reduced the entry barrier to the domain of AI owing to its ease of use and core capabilities of automation, translation, and intelligent actions in our day-to-day lives. Currently, large language models (LLMs) that power such chatbots are utilized primarily for their automation capabilities, such as software monitoring and report generation, and for personalized question answering on a limited scope and scale. One major limitation of the currently evolving family of LLMs is hallucination, wherein inaccurate responses are reported as factual. Hallucinations are primarily caused by biased training data, ambiguous prompts, and inaccurate LLM parameters, and they occur most often when mathematical facts are combined with language-based context. Thus, monitoring and controlling for hallucinations becomes necessary when designing solutions that are meant for decision makers. In this work we present the three major stages in the journey of designing hallucination-minimized LLM-based solutions specialized for decision makers in the financial domain, namely: prototyping, scaling, and LLM evolution using human feedback. These three stages, together with the novel data-to-answer generation modules presented in this work, are necessary to ensure that Generative AI chatbots, autonomous reports, and alerts are reliable and of high enough quality to aid key decision-making processes.

LLMs, prompt engineering, hallucinations, LLMOps
conference: ACM International Conference on Web Search and Data Mining, 2024

1. Introduction

Generative AI-based chatbots have become increasingly popular over the last year with the launch of OpenAI’s ChatGPT in November 2022. Although language-based AI models have been built and researched for decades, the genesis of the ChatGPT model can be traced back to the formation of OpenAI in 2015, followed by the launch of the Generative Pre-trained Transformer (GPT-1) model in 2018 with 117 million parameters (Marr, 2023). The GPT-1 model was the first of its kind in the area of unsupervised learning models that could understand tasks and use books as training data to complete sentences or predict a few follow-up sentences while maintaining context (Marr, 2023). The GPT-2 model, released in 2019, was an upgrade with 1.5 billion parameters and significantly improved natural language generation (NLG) features, with the capability of generating several paragraphs of contextual and sensible text. The launch of GPT-3 in 2022 demonstrated significantly superior natural language comprehension and question answering capabilities owing to its 175-billion-parameter model size. The faster turbo versions of GPT-3.5 and the following evolved version, GPT-4, estimated to have 1.76 trillion parameters (Prakash, 2023), showed significant enhancement in language translation and in the comprehension of large texts and multimodal data (OpenAI, 2023). This evolution in the class of LLMs that perform inferencing only can be adapted to a variety of multimodal data processing applications, such as audio, video, document processing and conversions, through prompt engineering. In this work we present the challenges around the prompt engineering process to ensure reliability and trustworthiness in such Generative AI-based systems and solutions.

The evolution in the current genre of LLMs with ethical and safety considerations (Thompson, 2023) has enabled their widespread usage and early adoption into products. LLMs significantly reduce the entry barrier to AI since they are easier to program using natural language instructions as opposed to depending on programming languages (Guo et al., 2023). Some noteworthy products that are completely language based and utilize generative AI are: the explain-my-answer feature in Duolingo (Duolingo, 2023), automated content creation with AIcontentfy (AIcontentfy, 2023), and job hunting and recruitment support with products like Occupop and SkillGPT (Doyle, 2023). However, LLMs pose a serious issue, known as hallucination, wherein inaccurate facts are presented to the user, specifically in use cases that involve numerical or tabular data and non-language sources (Oviedo-Trespalacios et al., 2023). In this work, we present a novel framework that minimizes and controls hallucinations for such numerical and data table interactions to generate reliable and accurate answers for decision-making tasks. Additionally, we present the development-to-launch journey for reliable and trustworthy LLM-based products for numerical and analytical domains such as finance and sales.

2. Journey Stages for Finance-based LLM Products

Building and deploying LLM-based systems and products involves three major stages, namely, the prototyping stage, the scaling stage, and the evolution stage, as shown in Fig. 1.

Figure 1. Stages in the journey of LLM based products for numerical and analytical data sources.

In the first, prototyping, stage, business-case value realization drives the build plan of the minimum viable product. The major development areas include the following: Data Science (prompt engineering, modular builds), UI/UX considerations, and LLMOps setup for infrastructure (Mcmahon, 2023). It is noteworthy that the prototyping stage may face fundamental limitations with regard to reliability for certain user queries. For instance, LLMs are safeguarded and limited against making predictions (Kumoai, 2023). Thus, for the analytical finance domain, queries such as “Which stocks in NYSE should I invest in?” will remain out of scope until a preferred prediction model is combined with the LLM prompts. In this prototyping stage, we build four novel components that monitor and control for hallucinations in the LLM responses and ensure repeatable and reliable answers.

The second stage for the hallucination-minimized solution is scaling the prototype across a variety of user question types, also known as intentions (such as why, what, where, how, trend, anomaly, what-if, etc.), while benchmarking the choice of LLM to ensure accuracy, reliability, repeatability, and optimal response times. The third and final stage of the solution is fine-tuning the LLMs based on an already curated set of user queries and sample responses, so that the question answering capabilities evolve in accordance with reinforcement learning from human feedback (RLHF) criteria (Ouyang et al., 2022).
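The curation step in the third stage can be sketched as follows. This is an illustrative sketch only: the record schema and rating convention are assumptions, not the paper's; it shows how thumbs-up/down feedback on chatbot responses could be filtered into supervised fine-tuning pairs in the spirit of RLHF.

```python
import json

# Hypothetical feedback log: each record pairs a user query with an LLM
# response and a human rating (1 = thumbs-up, 0 = thumbs-down).
feedback_log = [
    {"query": "Why did EMEA margin fall in Q2?",
     "response": "EMEA margin fell 2 points on higher logistics costs.",
     "rating": 1},
    {"query": "Why did EMEA margin fall in Q2?",
     "response": "Margins are always stable in EMEA.",
     "rating": 0},
]

# Keep only positively rated pairs as supervised fine-tuning examples.
sft_examples = [
    {"prompt": r["query"], "completion": r["response"]}
    for r in feedback_log if r["rating"] == 1
]

print(json.dumps(sft_examples, indent=2))
```

A preference-based variant would instead keep both responses per query as a (chosen, rejected) pair for reward modeling.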

3. Solution System Design: LLMOps

For a finance-question and answering system we propose a novel Langchain-based framework (Pandya and Holia, 2023) with custom modules to minimize hallucinations as shown in Fig. 2.

Figure 2. LLMOps system design for data-to-question-answering solutions.

The novel components designed for this solution are as follows:

  (1) Question intention classification: This module creates separate customizable prompts for each user query type. Thus, the instructions for Why, What, How, trend, anomaly, and What-if queries can be designed separately. For every user query, the first step is to categorize the intent, which defines the generic steps to process the specific user request. Incorrect intent classifications can lead to hallucinations.
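A minimal sketch of this routing step, assuming a keyword-based classifier as a stand-in (the paper does not specify the classification mechanism; the keyword lists and template texts below are invented for illustration):

```python
# Map each intent category to illustrative trigger keywords.
INTENT_KEYWORDS = {
    "why": ["why", "reason", "cause"],
    "what": ["what", "which"],
    "trend": ["trend", "over time", "growth"],
    "anomaly": ["anomaly", "outlier", "unusual"],
    "what_if": ["what if", "scenario", "suppose"],
}

# One customizable prompt template per intent (abbreviated here).
PROMPT_TEMPLATES = {
    "trend": "You are a financial analyst. Describe the trend in: {data}",
    "why": "You are a financial analyst. Explain the drivers behind: {data}",
}

def classify_intent(query: str) -> str:
    """Return the first intent whose keywords appear in the query."""
    q = query.lower()
    # "what_if" is checked before "what" to avoid a misclassification,
    # which, as noted above, would itself be a source of hallucination.
    for intent in ["what_if", "why", "trend", "anomaly", "what"]:
        if any(kw in q for kw in INTENT_KEYWORDS[intent]):
            return intent
    return "what"  # default intent

print(classify_intent("Why did Q3 revenue drop?"))         # why
print(classify_intent("What if interest rates rise 1%?"))  # what_if
```

A production version would replace the keyword rules with a trained classifier or an LLM call, but the routing contract (query in, intent and template out) stays the same.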

  (2) Data chunk generation and filtering module: Since LLMs are trained on text data, converting tabular data to sentences and paragraphs is the optimal mechanism to pass data to LLMs. Each data table value is converted to sentences and stored as “data chunks”. Data chunks are further hierarchically categorized to support aggregated querying. Lack of granularity in data chunks can cause hallucinations.
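The table-to-sentence conversion with a hierarchical aggregate layer can be sketched as follows; the column names and revenue figures are invented for the example:

```python
# Toy finance table: one dict per row.
rows = [
    {"region": "EMEA", "quarter": "Q1-2023", "revenue_musd": 120.5},
    {"region": "EMEA", "quarter": "Q2-2023", "revenue_musd": 133.2},
    {"region": "APAC", "quarter": "Q1-2023", "revenue_musd": 98.7},
]

def row_to_chunk(row: dict) -> str:
    """Render one table row as a natural-language sentence."""
    return (f"Revenue for {row['region']} in {row['quarter']} "
            f"was {row['revenue_musd']:.1f} million USD.")

# Granular chunks: one sentence per row.
chunks = [row_to_chunk(r) for r in rows]

# Hierarchical (aggregated) chunk, supporting questions such as
# "What was total EMEA revenue in 2023?"
emea_total = sum(r["revenue_musd"] for r in rows if r["region"] == "EMEA")
chunks.append(f"Total revenue for EMEA across 2023 "
              f"was {emea_total:.1f} million USD.")

for c in chunks:
    print(c)
```

Keeping both the per-row and the aggregated sentences is what supplies the granularity the module description above calls for.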

  (3) Custom prompt generation module: For each user query, the most pertinent data from the existing data chunks needs to be selected to create a customized prompt that is then sent to the LLM. A customized prompt has four major components: the persona, key definitions, relevant data chunks, and a sample question and answer. Filtering for the “most similar” data chunks per user query is necessary to minimize hallucinations. A customized data chunk ranking mechanism, optimized for run-time, is crucial to assemble a prompt per user query that best represents the user’s intent and data requirements.
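The four-component prompt assembly can be sketched as below. The token-overlap similarity is a deliberately simple stand-in for the paper's run-time-optimized ranking mechanism (which would typically use embeddings), and the persona, definitions, and example strings are invented:

```python
def similarity(query: str, chunk: str) -> float:
    """Crude token-overlap score between query and chunk."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Assemble persona, definitions, top-ranked chunks, and an example."""
    ranked = sorted(chunks, key=lambda c: similarity(query, c), reverse=True)
    persona = "You are a meticulous financial analyst."
    definitions = "Definitions: revenue figures are in million USD."
    example = ("Example Q: What was Q1 revenue? "
               "Example A: Q1 revenue was 120.5 million USD.")
    data = "\n".join(ranked[:top_k])  # only the most similar chunks
    return "\n".join([persona, definitions, data, example, f"Q: {query}"])

chunks = [
    "Revenue for EMEA in Q1-2023 was 120.5 million USD.",
    "Headcount in APAC grew by 4 percent in 2023.",
]
print(build_prompt("What was EMEA revenue in Q1-2023?", chunks))
```

Restricting the prompt to the top-ranked chunks is what keeps irrelevant figures out of the LLM's context, which is the hallucination-minimization lever this module provides.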

  (4) Response quality scoring module: This module assesses each LLM response for hallucinations using standard language-processing libraries (such as nltk and spacy). This novel component evaluates the question, the prompt sent to the LLM, and the returned response together, scoring the response for contextual, numeric, uniqueness, and grammatical accuracy. These four binary quality metrics categorize each response as Low/Medium/High confidence, which provides the user an automated estimate of possible hallucinations.
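A minimal sketch of the four binary checks rolled up into a Low/Medium/High label. The simple string and regex checks below are placeholders for the library-based metrics the paper references; the thresholds and aggregation rule are assumptions:

```python
import re

def score_response(question: str, prompt: str, response: str) -> str:
    """Return Low/Medium/High confidence from four binary quality checks."""
    numbers_in_prompt = set(re.findall(r"\d+(?:\.\d+)?", prompt))
    numbers_in_response = set(re.findall(r"\d+(?:\.\d+)?", response))

    # Contextual: response shares vocabulary with the question.
    contextual = any(w in response.lower() for w in question.lower().split())
    # Numeric: every figure in the response appears in the prompt data,
    # i.e., no invented numbers.
    numeric = numbers_in_response <= numbers_in_prompt
    # Uniqueness: response is not dominated by repeated tokens.
    words = response.split()
    unique = len(set(words)) / max(len(words), 1) > 0.5
    # Grammatical: a crude completeness check on sentence termination.
    grammatical = response.strip().endswith((".", "!", "?"))

    passed = sum([contextual, numeric, unique, grammatical])
    return "High" if passed == 4 else "Medium" if passed >= 2 else "Low"

prompt = "Revenue for EMEA in Q1-2023 was 120.5 million USD."
print(score_response("What was EMEA revenue?",
                     prompt,
                     "EMEA revenue in Q1-2023 was 120.5 million USD."))  # High
```

The numeric check is the most direct hallucination guard here: a response quoting a figure absent from the prompt data is downgraded automatically.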

4. Conclusions and Discussion

Hallucinations are an unwanted outcome of LLMs that need to be further studied and scored for non-language and multi-modal data use cases. While most hallucinations are caused by biased training data, the abstract nature of questions/prompts, and LLM parameters (Rawte et al., 2023), approaches such as advanced modular prompting can minimize hallucinations. In this work we present a novel LLMOps system design and the three stages of developing LLM-based products for analytical and finance domains, where hallucinations can have an extremely detrimental impact on decision-making tasks.

5. Company Portrait

Accenture is a leading global professional services company of 738,000 people in 120 countries. It helps businesses, governments and organizations build their digital core, optimize operations, accelerate revenue growth, and enhance citizen services. Accenture is one of the global leaders in helping drive change with technology at its core through strong ecosystem relationships, unmatched industry experience, functional expertise, and global delivery capability. In June 2023, Accenture announced that the company would invest $3 billion in its Data and AI practice to help clients across all industries rapidly and responsibly advance and use AI to achieve greater growth, efficiency and resilience.

6. Speaker Biography

Dr. Sohini Roychowdhury is the Global Head of AI/ML at the Corporate Data and Analytics Office, Accenture, USA. Her global team builds Generative AI solutions, including a finance chatbot and automated report generation for decision makers. Prior to this, she formed the founding team and served as Director of Curriculum and Machine Learning at an Ed-Tech startup called FourthBrain that provides specialized hands-on courses in the field of Machine Learning and AI. Prior to her entrepreneurial venture she was the Sr. Manager of Autonomous Drive and Head of University Relations at Volvo Cars USA, and before that a tenure-track Assistant Professor in Electrical and Computer Engineering at a University of Washington campus. Dr. Roychowdhury’s latest research directions include benchmarking Large Language Models for scalable product development. To date she has over 60 academic research papers and 20 granted patents to her name, and a YouTube channel, AI with Sohini.

Acknowledgements.
This work is funded by the Corporate Data and Analytics Office (CDAO) at Accenture. This work would not be possible without the efforts of all the members of the Generative AI team at CDAO Accenture. Many thanks to our leader Priya Raman for all the continued support and encouragement.

References

  • AIcontentfy (2023) Team AIcontentfy. 2023. ChatGPT and Language Translation. Retrieved 2023 from https://aicontentfy.com/en/blog/chatgpt-and-language-translation
  • Doyle (2023) O. Doyle. 2023. The ultimate guide to job posting. Retrieved 2023 from https://www.occupop.com/blog/the-ultimate-guide-to-job-posting-chatgpt
  • Duolingo (2023) Duolingo. 2023. Introducing Duolingo Max, a learning experience powered by GPT-4. Retrieved 2023 from https://blog.duolingo.com/duolingo-max
  • Guo et al. (2023) Yiduo Guo, Yaobo Liang, Chenfei Wu, Wenshan Wu, Dongyan Zhao, and Nan Duan. 2023. Learning to Program with Natural Language. arXiv:2304.10464 [cs.CL]
  • Kumoai (2023) Team Kumoai. 2023. LLMs today cannot predict on your enterprise data.
  • Marr (2023) B. Marr. 2023. A Short History Of ChatGPT: How We Got To Where We Are Today. Retrieved 2023 from https://www.forbes.com/sites/bernardmarr/2023/05/19/a-short-history-of-chatgpt-how-we-got-to-where-we-are-today/?sh=d11e5674f141
  • Mcmahon (2023) A. Mcmahon. 2023. Building the future with LLMOps the main challenges. Retrieved 2023 from https://mlops.community/building-the-future-with-llmops-the-main-challenges
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. arXiv:2203.02155 [cs.CL]
  • Oviedo-Trespalacios et al. (2023) Oscar Oviedo-Trespalacios, Amy E Peden, Thomas Cole-Hunter, Arianna Costantini, Milad Haghani, JE Rod, Sage Kelly, Helma Torkamaan, Amina Tariq, James David Albert Newton, et al. 2023. The risks of using chatgpt to obtain common safety-related information and advice. Safety science 167 (2023), 106244.
  • Pandya and Holia (2023) Keivalya Pandya and Mehfuza Holia. 2023. Automating Customer Service using LangChain: Building custom open-source GPT Chatbot for organizations. arXiv:2310.05421 [cs.CL]
  • Prakash (2023) A. Prakash. 2023. GPT-4 early impressions and how it compares to GPT-3.5. Retrieved 2023 from https://www.thoughtspot.com/data-trends/ai/gpt-4-vs-gpt-3-5
  • Rawte et al. (2023) Vipula Rawte, Prachi Priya, S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Amit Sheth, and Amitava Das. 2023. Exploring the Relationship between LLM Hallucinations and Prompt Linguistic Nuances: Readability, Formality, and Concreteness. arXiv:2309.11064 [cs.AI]
  • Thompson (2023) A. Thompson. 2023. GPT-3.5 + ChatGPT: An illustrated overview. https://lifearchitect.ai/chatgpt