
ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large Language Models

Shivansh Chopra, Hussain Ahmad (corresponding author), Diksha Goel, Claudia Szabo
University of Adelaide, Australia; CSIRO's Data61, Australia
[email protected], {hussain.ahmad, claudia.szabo}@adelaide.edu.au, [email protected]
Abstract

The increasing frequency and sophistication of cybersecurity vulnerabilities in software systems underscore the urgent need for robust and effective methods of vulnerability assessment. However, existing approaches often rely on highly technical and abstract frameworks, which hinders understanding and increases the likelihood of exploitation, resulting in severe cyberattacks. Given the growing adoption of Large Language Models (LLMs) across diverse domains, this paper explores their potential application in cybersecurity, specifically for enhancing the assessment of software vulnerabilities. We propose ChatNVD, an LLM-based cybersecurity vulnerability assessment tool leveraging the National Vulnerability Database (NVD) to provide context-rich insights and streamline vulnerability analysis for cybersecurity professionals, developers, and non-technical users. We develop three variants of ChatNVD, utilizing three prominent LLMs: GPT-4o mini by OpenAI, Llama 3 by Meta, and Gemini 1.5 Pro by Google. To evaluate their efficacy, we conduct a comparative analysis of these models using a comprehensive questionnaire comprising common security vulnerability questions, assessing their accuracy in identifying and analyzing software vulnerabilities. This study provides valuable insights into the potential of LLMs to address critical challenges in the understanding and mitigation of software vulnerabilities.

Index Terms:
Large Language Model, Cybersecurity, Software Security, Chatbot, Vulnerability Assessment

I Introduction

Large Language Models (LLMs), a subset of generative artificial intelligence, have revolutionized numerous fields by enabling advanced content generation and process automation [1]. In education, LLMs enhance efficiency in tasks such as generating questions [2], grading essays [3], providing personalized feedback [4], and evaluating assignments [5]. Within the entertainment industry, they exhibit remarkable capabilities, crafting video game narratives [6] and generating music captions [7], thereby enriching creative workflows. Business operations have also undergone significant transformation through the integration of LLMs, which streamline marketing campaigns [8], improve customer service interactions [9], and optimize supply chain processes [10]. In healthcare, LLMs empower professionals by offering real-time clinical decision support [11], advancing medical education [12], and enabling predictive analytics for disease progression [13]. These applications highlight the profound and multifaceted impact of LLMs, positioning them as key drivers of innovation across diverse disciplines [14].

The dynamic and ever-evolving nature of the cybersecurity domain demands continuous adaptation to cutting-edge advancements, as each technological innovation introduces potential new attack vectors and expands the surface available for exploitation [15]. Software vulnerabilities, whether accidental oversights or deliberate flaws, serve as entry points for threat actors to manipulate systems in unintended and potentially harmful ways [16, 17, 18]. These vulnerabilities can result in severe consequences, including unauthorized data access or alteration, system disruptions through denial-of-service attacks, or the execution of malicious code, potentially enabling full system compromise [19]. To address these risks, organizations frequently conduct proactive evaluations of system resilience through vulnerability assessments [20]. Such assessments play a vital role in fortifying system defenses and safeguarding the integrity and security of digital infrastructures [21].

Figure 1: Number of Common Vulnerabilities and Exposures (CVEs) by year [22].

The rapid growth in the volume and sophistication of cyber threats has driven a substantial increase in the number of reported software vulnerabilities within the National Vulnerability Database (NVD) [23]. As shown in Fig. 1, the count of identified vulnerabilities has quadrupled over the past decade [22], reflecting the growing complexity of modern software systems and the expanding attack surface. This surge poses challenges for cybersecurity practitioners in keeping up with evolving threats [24]. According to a recent report by the World Economic Forum, cyberattacks are among the top five global risks in terms of likelihood and impact [25]. Despite this alarming trend, existing methods for vulnerability assessment often depend on overly technical and abstract frameworks, such as manual analysis [26, 27] and rule-based approaches [28]. These methods are not only time-intensive but also susceptible to human error [27], limiting their effectiveness in addressing the rapidly growing number of vulnerabilities. Furthermore, their complexity and technical nature can create barriers to broader understanding among diverse stakeholders, complicating efforts to effectively prioritize and mitigate vulnerabilities. This gap in accessibility and clarity increases the likelihood of exploitation, escalating the risk of severe cyberattacks. To combat this, there is a critical need for more comprehensible and actionable solutions that enable both technical and non-technical stakeholders to effectively assess and mitigate software vulnerabilities.

The NVD serves as a comprehensive repository of software vulnerabilities, offering detailed descriptions that are invaluable for reference. However, designed primarily for lookup and reference purposes, the NVD lacks interactive querying capabilities and deeper contextual understanding [29], which limits its utility when users need contextual or detailed answers to questions about these vulnerabilities. This gap underscores the need for an advanced tool that not only helps users locate specific vulnerabilities but also delivers comprehensive, context-aware responses to related questions. By leveraging LLMs, such a tool could analyze and interpret NVD data, enabling dynamic interaction with the information and transforming it into a more practical and accessible resource. This approach is especially beneficial for cybersecurity professionals who require a quick, accurate, and in-depth understanding of vulnerabilities to make informed decisions. The NVD, with its extensive dataset of software vulnerabilities, serves as an ideal foundation for grounding LLMs, enabling them to process large volumes of data, summarize complex technical information, and provide actionable insights. An LLM-based tool would facilitate improved vulnerability comprehension, support more effective mitigation strategies, and reduce the risk of exploitation, benefiting a diverse range of stakeholders, from technical experts to non-technical users.

In this paper, we introduce ChatNVD, an LLM-based tool aimed at assessing software security vulnerabilities using data from the NVD. Given the availability of multiple LLMs and the lack of prior investigation into their applicability to NVD data, we develop three variants of ChatNVD with three widely adopted models: GPT-4o mini by OpenAI [30], Gemini 1.5 Pro by Google [31], and Llama 3 by Meta [32]. We pair the models with the term frequency-inverse document frequency (TF-IDF) embedding technique, chosen for its computational efficiency, lower cost, and faster processing time [33]; generating high-quality neural embeddings for the entire NVD dataset (720.7 MB) proved too costly and time-intensive, making TF-IDF a suitable alternative for tasks focused on CVE IDs and term significance. Our experimental results provide a comprehensive comparison of the three LLMs, GPT-4o mini, Gemini 1.5 Pro, and Llama 3, based on their performance in answering a carefully designed questionnaire. This questionnaire comprises widely recognized software vulnerability analysis questions, reflecting real-world scenarios and common queries encountered by cybersecurity professionals. By evaluating the models on a diverse set of questions, we assess their ability to understand, process, and provide accurate and contextually relevant responses regarding software vulnerabilities. The comparative analysis highlights the strengths and limitations of each model, shedding light on their respective capabilities in addressing the complexities of software vulnerability assessment and offering valuable insights into their practical applicability for cybersecurity workflows and decision-making. To summarize, our key contributions are two-fold:

  • We develop ChatNVD, an LLM-based cybersecurity vulnerability assessment tool built on the NVD dataset, which enables a more interactive and efficient approach by providing detailed, context-rich insights and streamlining vulnerability analysis for diverse users, including cybersecurity professionals, developers, and non-technical stakeholders.

  • We perform a comparative analysis of three popular LLMs, GPT-4o mini, Gemini 1.5 Pro, and Llama 3, to assess their capabilities in understanding, processing, and providing accurate and contextually relevant responses to queries about software vulnerabilities.

The structure of the paper is organized as follows: Section II provides a review of related work, covering existing LLMs, embedding techniques, and key cybersecurity insights. Section III details the research methodology employed in the development of ChatNVD. Section IV presents the experimental evaluation, including results and a discussion comparing the three different variants of ChatNVD. Section V addresses potential threats to validity. Finally, Section VI concludes the paper and highlights future research directions.

TABLE I: Comparison of Existing LLMs.
| LLM | Characteristics | Strengths | Weaknesses | Parameters | References | Open Source | Cost |
| GPT-4o mini | Fast, Scalable | Fast, Scalable, Efficient | Limited Control and Expertise | Not revealed | [30], [34, 35, 36, 37] | No | $0.150 / 1M input tokens, $0.600 / 1M output tokens [38] |
| Llama 3 | Economical | Open Source, Efficient | Lack of Domain Expertise | 8 Billion | [32], [39, 40, 41, 42, 43] | Yes | Free |
| Gemini 1.5 Pro | High Performance, Multi-Modal | Accurate, Multi-Modal | Opaque, Resource-Intensive | Over a Billion | [44, 45, 46] | No | $1.25 / 1M input tokens, $5.00 / 1M output tokens [47] |
TABLE II: Comparison of Existing Embedding Techniques.
| Embedding Technique | Characteristics | Strengths | Weaknesses | Computational Requirements | Cost |
| TF-IDF | Fast, Simple | High Speed, Scalable | Lack of Deep Understanding | Low | Free |
| OpenAI Embeddings (text-embedding-3-small) | Versatile, Expensive | Semantic Understanding | Expensive, Overfitting Risk | High | $0.010 per 1M tokens [38] |
| Llama 3 Embeddings | Adaptable | Efficient, Cross-Platform, Multilingual | Complex to Fine-Tune | High | Free |
| Gemini Embeddings | Context-Aware | Multimodal Capabilities | Expensive | High | $0.000025 per 1,000 characters [47] |

II Related Work

This section reviews existing research on text embeddings and LLMs used in cybersecurity, particularly for assessing software vulnerabilities.

An LLM is a type of artificial intelligence (AI) system trained on extensive text datasets to comprehend and produce human-like language [48]. LLMs use deep learning methods, especially transformers, to process and generate text that is coherent and contextually relevant [48]. Based on the patterns and structures learned from their training data, these models can perform numerous tasks, such as translation, summarization, and text generation [48]. LLMs have transformed the field of NLP in recent years, excelling at a variety of tasks such as question answering, text generation, and sentiment analysis [49]. Vaswani et al. [50] introduced the Transformer architecture, illustrated in Fig. 2, which now underpins most modern LLMs.

Figure 2: Transformer Architecture [51].

The key innovation of this architecture is the self-attention mechanism, which allows the model to weigh the importance of different words within a sentence, capturing long-range dependencies more effectively than previous models [50]. The multi-head attention mechanism extends this by allowing the model to attend to different aspects of the input simultaneously, picking up on nuanced linguistic patterns that might otherwise be missed [50].
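For reference, the scaled dot-product attention at the heart of this mechanism is defined in [50] as

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \]

where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of the keys.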

Positional encodings are a clever solution to the problem of order awareness. By explicitly encoding the position of each word, the model can understand the sequence and context, even though the transformer architecture itself does not have a built-in sense of order [50]. Together, these innovations enable Transformer language models to generate remarkably coherent and relevant text, which makes them powerful tools for tasks such as text generation, translation, and summarization.
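Concretely, the sinusoidal positional encodings proposed in [50] are

\[ PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \]

where pos is the token position, i the dimension index, and d_model the embedding dimensionality.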

Knowledge Base: The knowledge base of ChatNVD is built from data obtained from the official NVD website (https://nvd.nist.gov/vuln/data-feeds). The data is stored in JSON format and covers the years 2002 through 2024. After preprocessing, the dataset is approximately 720.7 MB and contains around 266,000 vulnerabilities.

Cybersecurity Applications: One significant application of LLMs in cybersecurity is in threat detection and response. The integration of AI and NLP technologies has significantly improved the detection of unusual patterns and behaviors in large datasets [52]; LLMs are adept at processing and analyzing vast amounts of textual data, such as logs and network traffic, to detect subtle signs of security threats [52, 53].

LLMs are also utilized in vulnerability management. For instance, research [54] has shown that LLMs can effectively enhance vulnerability detection by analyzing the descriptions and technical details of vulnerabilities recorded in databases like the NVD. This makes it easier to quickly and correctly classify and prioritize vulnerabilities, which speeds up mitigation [54, 55].

LLMs are also actively used in threat identification and penetration testing. For example, Cimmino's paper "Large Language Models in Cybersecurity: Digital Defence and Ethical Challenges" discusses how LLMs like GPT-4 can be used in penetration testing: through tools like Open Interpreter, they can interact with system terminals, run code, and maintain security standards, making bug discovery and security checks faster and more accurate [56].

LLMs are also being added to smart defense systems and intelligent firewalls [56, 57]. With LLMs, these systems can examine and understand network data in real time, enabling them to detect sophisticated threats such as advanced persistent threats (APTs). When these systems are connected to databases like MITRE's ATT&CK and NIST frameworks, they can scan for vulnerabilities, prioritize threats, and automatically begin searching for fixes [56].

Guo et al. [22] discuss how LLMs can make complex security protocols easier to use. LLMs can, for instance, automate the process of scanning networks for vulnerabilities, analyzing the resulting data, and suggesting remediations. This not only makes cybersecurity activities more effective but also reduces the likelihood of human error. The paper emphasizes the critical role of AI-driven security measures in modern cybersecurity, demonstrating the proficiency of LLMs in vulnerability assessment [22, 58].

II-A LLM Models

II-A1 GPT 4o mini

GPT-4o mini is a compact and cheaper variant of the larger GPT-4 series, developed by OpenAI to deliver similar language model functionality at a lower price [59]. Built via a model distillation approach, GPT-4o mini preserves much of the fundamental capability and accuracy of its larger counterpart, i.e., GPT-4, offering a performance-cost balance that makes it suitable for diverse applications, particularly those with financial constraints [60]. With a context window of up to 128K tokens [59], it is especially adept at tasks that require prolonged context or multi-step reasoning. Furthermore, GPT-4o mini accommodates multimodal input, enabling it to handle both text and images, broadening its applicability in fields that require integrated text and visual analysis [59, 61].

Strengths:

  • Economic Viability and Accessibility: With a cost of 15 cents per million input tokens and 60 cents per million output tokens, GPT-4o mini is considerably more economical than earlier models like GPT-3.5 Turbo, enhancing its accessibility for a wider array of applications without substantial operational expenses [59].

  • Exceptional Performance on Benchmarks: GPT-4o mini demonstrates superior results, attaining 82% on MMLU (textual reasoning), 87% on MGSM (mathematical reasoning), and 87.2% on HumanEval (coding proficiency). This renders it exceptionally appropriate for situations necessitating robust reasoning, coding, and comprehension skills [59].

  • Improved Safety and Resilience: Featuring integrated safety mechanisms such as content filtering, reinforcement learning with human feedback (RLHF), and an instruction hierarchy approach to mitigate prompt injections, GPT-4o mini guarantees secure and dependable responses, making it appropriate for extensive and publicly accessible applications [59].

Weaknesses:

  • High Costs for Large-Scale Applications: Despite being more economical than earlier models, token-based pricing may nevertheless incur substantial expenses for applications with significant utilization, especially for high-traffic services or those necessitating intensive input-output processing [62].

  • Output Token Limitation: Although GPT-4o mini has a 128K token context window, it permits a maximum of 16K output tokens per request, which may constrain applications necessitating extensive outputs, potentially requiring several API requests [59].

  • Absence of Immediate Fine-Tuning Capability: Presently, GPT-4o mini lacks fine-tuning support, constraining its adaptability for specialized or domain-specific applications where customized responses could enhance performance [59].

Key Features:

  • Large Contextual Range: The 128K token context window of GPT-4o mini allows it to manage extended input sequences, making it suitable for tasks that require significant context retention, such as processing extensive conversations or analyzing substantial datasets [60].

  • Multimodal Input Support: The model accommodates both textual and visual inputs, handling different types of data, which makes it a good option for tasks like multimedia research and customer service [59].

II-A2 Llama 3

LLaMA 3 is an LLM developed by Meta AI (formerly Facebook AI Research, FAIR). It is a transformer-based model with around 8 billion parameters [32]. LLaMA 3 is trained on a combination of web pages, books, and Wikipedia articles, which makes it an ideal choice for a wide range of NLP tasks. One of its most important traits is its ability to work well with low-resource languages, made possible by training on a wide range of multilingual data [39].

Strengths

  • State-of-the-Art Performance: Llama 3 has shown strong results on many NLP tasks, often beating older models on benchmarks for text generation, summarization, and translation [40].

  • Community Collaboration and Innovation: Because Llama 3 is open source, developers, researchers, and enthusiasts from all around the world can collaborate, contribute, and innovate [41].

Weaknesses

  • Resource Intensity: While more efficient than some predecessors, Llama 3 still requires significant computational resources, especially for large-scale deployments [42]. This can be a barrier for smaller organizations or those without access to high-end hardware.

  • Generalization Limitations: Although Llama 3 is highly versatile, it may struggle with tasks that require deep domain-specific knowledge, particularly in highly specialized fields where training data is limited or not readily available [43].

Key Features:

  • Transformer-Based Architecture: Llama 3 utilizes a transformer architecture with optimizations that improve both training speed and inference efficiency. The model’s architecture allows it to process vast amounts of data in parallel, leading to faster and more accurate predictions [40].

  • Multi-Task Learning Capabilities: The model is designed to handle multiple NLP tasks simultaneously, thanks to its advanced multi-task learning capabilities. This feature allows Llama 3 to excel in scenarios where diverse types of language processing are required within a single application.

II-A3 Gemini 1.5 Pro

Gemini 1.5 Pro is an improved version of Google DeepMind's Gemini language model, designed for high-performance NLP tasks [31]. It represents a significant step forward in capability, especially for handling complex, large-scale tasks across numerous fields. Gemini 1.5 Pro has more than a billion parameters, though the exact number has not been disclosed.

Strengths:

  • High Accuracy and Language Understanding: Gemini 1.5 Pro is designed to excel at language tasks such as summarization, translation, and the generation of complex texts. Its large parameter count lets the model pick up on subtle linguistic nuances, producing outputs that are highly accurate and contextually relevant [46].

  • Multimodal Integration: Gemini 1.5 Pro can seamlessly combine and process different types of data, such as text, images, and even voice, making it well suited to scenarios that require understanding information across modalities [45].

Weaknesses:

  • High Resource Requirements: Gemini 1.5 Pro is a large and intricate model that requires substantial computing power for both training and real-time inference. This can be a significant barrier to adoption, especially for smaller businesses with modest infrastructure budgets [44].

  • Limited Interpretability: Like other large language models, Gemini 1.5 Pro's internal decision-making processes are complex and difficult to interpret. This lack of transparency can be problematic in areas where understanding why decisions are made is important, such as medicine or law [63].

Key Features:

  • Multimodal Capabilities: Gemini 1.5 Pro has advanced multimodal capabilities, processing numerous data formats such as text and images, which makes it highly versatile [64].

  • In-Context Learning: It also boasts remarkable in-context learning capabilities, enabling it to acquire new skills from prompts without much fine-tuning [64].

II-B Text Embeddings

Text embeddings are vector representations that map words, sentences, or even whole documents into continuous numerical vectors in a multi-dimensional space. They are used primarily to capture the semantic meaning of, and connections between, different pieces of text, solving the problem of putting textual data into a form that computers can process while preserving the important information [65].

The domain of NLP has advanced significantly in text representation strategies. Historically, basic techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) were often used to assess the significance of a term within a text by evaluating its frequency relative to other documents. This methodology, initially proposed by Ramos (2003) [66] and further examined by Qaiser and Ali (2018) [67], was effective for identifying relevant terms but lacked the capability to comprehend their meaning or context [67].

Later, more advanced techniques such as Latent Semantic Indexing (LSI) were developed, grouping related words into latent concepts. Even so, Zhang et al. [68] argued that LSI did not fully capture the deeper connections between words and their meanings. These earlier methods improved text understanding but could not comprehend language the way humans do.

The major step forward came with word embeddings, which map words into vectors that reflect their meaning based on the context in which they appear. This transformed NLP, especially with the arrival of models like Word2Vec and GloVe, as Pilehvar and Camacho-Collados (2020) [65] explain. These techniques capture not only what words mean but also how they relate to one another [67].

Moving to the next big step, Brown et al. (2020) [40] presented transformer-based models like OpenAI's GPT-3. GPT-3's embeddings are much more powerful because they capture context better: the model can consider whole sentences and paragraphs, not just individual words, making it well suited to translation, question answering, and summarization. A feature that sets GPT-3 apart is its ability to learn from just a few examples, known as "few-shot learning," which lets it generalize and perform well on different tasks without extensive task-specific training.

Similar to GPT-3, another notable embedding model is LLaMA, introduced by Touvron et al. (2023) [69]. It focuses primarily on efficiency and cost-effectiveness: LLaMA produces high-quality embeddings without requiring as much computing power, making it well suited to applications that must perform well with limited resources [69].

Another major development in the realm of embeddings is multimodal embeddings, which allow models to process both text and other kinds of data, such as images. The Gemini model, inspired by Radford et al. (2021) [70] and CLIP, is capable of generating embeddings that integrate text and images, facilitating tasks such as image captioning and video analysis. This makes Gemini ideal for applications requiring comprehension of both text and visuals.

Even though these models have significantly improved how we process text, simpler methods like TF-IDF remain useful in certain situations. For example, Ahuja et al. (2019) [71] used TF-IDF alongside feature extraction techniques to perform sentiment analysis, showing that traditional methods can still be relevant when simplicity and speed are priorities.

In summary, text representation has evolved from basic methods like TF-IDF and LSI to powerful models like GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro. These advancements allow us to capture the meaning of words in context, making NLP more accurate and flexible across tasks. The newer embedding models are better at understanding not just the meaning of individual words but also their relationships within larger bodies of text, and even across different data types such as images.

II-B1 TF-IDF Embeddings

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used to evaluate the importance of a word within a document relative to a collection of documents. It is a relatively simple embedding technique that converts words into vectors based on their frequency and uniqueness within a corpus of documents [72]. TF-IDF assigns a weight to each word in a document that increases with the word's frequency within that document but decreases with its frequency across the entire corpus. This highlights words that are significant within a document while downplaying common terms that are less informative [67]. TF-IDF is widely used in information retrieval and text classification due to its simplicity and interpretability [73].
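To make this concrete, the short sketch below computes TF-IDF vectors for a toy corpus with scikit-learn's TfidfVectorizer; the CVE-style descriptions are invented for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of invented CVE-style descriptions (illustration only).
corpus = [
    "Buffer overflow in the HTTP parser allows remote code execution.",
    "Cross-site scripting in the web UI allows arbitrary script injection.",
    "Improper access control allows unauthorized access to user data.",
]

# Each document becomes a sparse vector; a term's weight grows with its
# frequency in that document and shrinks with its frequency across the corpus.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Show the five highest-weighted terms in the first document.
terms = vectorizer.get_feature_names_out()
weights = tfidf_matrix[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda p: p[1], reverse=True)[:5])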

Key Features:

  • Simplicity: TF-IDF is easy to compute and implement, making it accessible for a wide range of applications [74].

  • Interpretability: The resulting vectors are easily interpretable, with higher values indicating more essential words within the context of a specific document [73].

  • Scalability: TF-IDF can be efficiently computed even for large document corpora, making it suitable for large-scale applications.

  • Fast: TF-IDF is computationally efficient, allowing for quick computation even on large datasets [74].

  • Cost Effective: The computation of TF-IDF does not require significant resources, making it cost-effective for many applications [74].

Strengths: TF-IDF is computationally efficient and provides a straightforward method for identifying essential words in a document. It is particularly effective when interpretability is important, as the resulting weights directly correlate with word significance [33].

Weaknesses: TF-IDF does not capture the semantic meaning or context of words. It treats each word independently, ignoring the relationships between words, and can be less effective for tasks that require deep understanding or contextual analysis [33].

Use Case:

  • Text Classification: TF-IDF embeddings are widely used in text classification tasks. This method transforms textual data into numerical vectors, allowing machine learning algorithms to process and classify text [68]. TF-IDF is particularly effective in scenarios where the goal is to distinguish between different categories of documents based on the presence and importance of specific terms [68, 75].

  • Sentiment Analysis: TF-IDF embeddings are commonly employed in sentiment analysis, where the objective is to classify text (such as product reviews, social media posts, or customer feedback) into sentiment categories like positive, negative, or neutral. The TF-IDF technique is effective in highlighting the importance of specific words that contribute to the sentiment, enabling more accurate classification [71].

II-B2 Embedding Models

Embedding models are machine learning models that map text (such as words, phrases, or entire documents) into continuous numerical vectors within a multi-dimensional space. These vectors are designed to capture the semantic relationships between different pieces of text, allowing similar texts to be closer in this space [50]. Transformer-based architectures, like BERT, GPT, and newer models such as OpenAI’s GPT-3, LLaMA 3, and Gemini, have revolutionized text embeddings by introducing attention mechanisms that enable models to capture context-dependent meaning more effectively [50].

Key Features:

  • Contextual Understanding: Transformer-based embeddings capture the meaning of words within their specific context, enabling a more accurate understanding of language.

  • Attention Mechanism: These models use self-attention to determine the importance of different words relative to each other, improving the quality of the embeddings.

  • Scalability: Transformer models can be scaled to handle large datasets and generate embeddings for complex, multilingual, and multi-domain tasks.

  • Pre-trained Models: Many transformer-based embedding models, such as GPT-3, LLaMA 3, and Gemini, are pre-trained on massive datasets and can be fine-tuned for specific applications.

OpenAI Embeddings: OpenAI embeddings are sophisticated text representations generated by LLMs like GPT-3. These embeddings are designed to capture deep semantic relationships and contextual information from large-scale text data [76].

Strengths:

  • Deep Semantic Understanding: OpenAI embeddings, particularly from models like GPT-3, excel in capturing deep semantic meanings, making them highly effective for tasks that require nuanced language understanding.

  • Versatility: These embeddings can be applied to a wide range of tasks, including text generation, question answering, and summarization, due to their comprehensive understanding of language.

  • Few-Shot Learning: OpenAI embeddings support few-shot learning, enabling the model to adapt to new tasks with minimal additional training data [76].

Weaknesses:

  • Computational Cost: Training and using OpenAI models like GPT-3 is computationally expensive, requiring significant resources [77].

  • Data Requirements: These models demand large datasets for pretraining, which can be a limitation in scenarios with limited data availability.

  • Overfitting Risk: Without proper fine-tuning, there is a risk of overfitting specific tasks, leading to reduced generalizability.

Use Case

  • Best suited for tasks requiring general-purpose NLP capabilities, such as text generation, summarization, and complex question answering [40].

  • Ideal for applications where minimal fine-tuning is required, leveraging few-shot learning capabilities [40].

LLaMA Embeddings: LLaMA (Large Language Model Meta AI) is an embedding model developed by Meta AI (formerly Facebook AI Research). LLaMA embeddings are obtained by pre-training a transformer-based language model on a massive multilingual corpus [78]. The model is trained to predict the next token in the input sequence, enabling it to learn meaningful representations of words across different languages [79]. The multilingual nature of LLaMA embeddings makes them particularly valuable for applications that involve processing text in multiple languages.

Strengths:

  • Efficiency: LLaMA 3 embeddings are designed to be more efficient in terms of computation, offering a balance between performance and resource usage [69].

  • Adaptability: These embeddings are adaptable to a variety of tasks, including multilingual and cross-domain applications, making them versatile in different NLP contexts [69].

  • Fine-Tuning Capabilities: LLaMA 3 is optimized for fine-tuning, enabling the creation of task-specific embeddings with relatively small datasets.

Weaknesses:

  • Limited Pretraining Data: Compared to larger models like GPT-3, LLaMA 3 may be trained on less extensive datasets, potentially limiting its performance in specific applications.

  • Complexity in Fine-Tuning: While LLaMA 3 is optimized for fine-tuning, achieving optimal results can still require careful tuning and substantial domain knowledge.

Use Case

  • Quite effective in environments where computational efficiency is needed, such as mobile or edge applications [80].

  • Well-suited for tasks that benefit from fine-tuning, such as domain-specific text analysis or multilingual translation [80].

Gemini Embeddings

Gemini embeddings are vector representations generated by the Gemini model, an LLM developed by Google DeepMind. Gemini is part of Google's family of advanced AI models, similar to OpenAI's GPT series. These embeddings are designed to encode the semantic meaning of text into numerical vectors, making them useful for a wide range of natural language processing (NLP) tasks [70].

Figure 3: Proposed Architecture of ChatNVD.

Strengths:

  • Context-Awareness: Gemini embeddings are particularly strong in capturing context-dependent meanings, making them ideal for tasks where word meaning varies based on the surrounding text.

  • Dynamic Adaptation: These embeddings dynamically adjust based on the context, providing a more nuanced and accurate representation of language [70].

  • Advanced Multimodal Capabilities: Gemini models are often designed to handle multimodal data, integrating text with other forms of data, such as images or audio, for richer embeddings [45].

Weaknesses:

  • High Computational Requirements: Similar to other advanced models, Gemini embeddings require substantial computational resources for training and deployment [77].

  • Specialized Data Needs: Achieving the best performance with Gemini embeddings may require specialized, high-quality datasets that are not always readily available.

Use Case

  • Particularly strong in tasks that require deep context awareness, such as dialogue systems, sentiment analysis, and emotion detection.

  • Best used in multimodal applications that integrate text with other data types, like image captioning or video analysis [45].

III ChatNVD Development Process

This research focuses on building a tool that brings together data collection, analysis, and document comparison, all centered on identifying similarities. We also develop a question-answering system that uses advanced LLMs to give users clear, helpful answers. A key part of this is tapping into data from the NVD JSON feeds, which allows us to analyze and compare software vulnerabilities in detail. The tool, which works like a chatbot, draws on a rich database of Common Vulnerabilities and Exposures (CVE) information, giving real-time, reliable insights into both new and historical security issues.

The architecture of ChatNVD is illustrated in Fig. 3. The process starts with retrieving data from the NVD. This data is then preprocessed to remove redundant information and broken into smaller chunks in preparation for the embedding process. These chunks are passed through an embedding model, which turns each chunk into a vector representation that captures the meaning of the text. When a user submits a query, the embedding model finds the most relevant chunks, providing the context needed to answer the query. This context and the user's question are then passed to an LLM; several LLMs are used in this setting, namely GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro. The chosen LLM uses the given information to generate a response, which is returned to the user. The end result gives cybersecurity practitioners a clear, readable explanation of complicated vulnerability data, helping them make well-informed decisions based on reliable information.
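A minimal sketch of this query flow, assuming TF-IDF as the embedding model (the choice ultimately made in Phase 3) and a placeholder ask_llm function standing in for whichever LLM is selected:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_index(chunks):
    """Embed preprocessed NVD text chunks as TF-IDF vectors."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(chunks)
    return vectorizer, matrix

def retrieve(query, vectorizer, matrix, chunks, k=3):
    """Return the k chunks most similar to the user's query."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]

def answer(query, vectorizer, matrix, chunks, ask_llm):
    """Assemble retrieved context plus the question into a prompt for the LLM."""
    context = "\n".join(retrieve(query, vectorizer, matrix, chunks))
    prompt = f"Context:\n{context}\nQuestion:\n{query}\nAnswer:"
    return ask_llm(prompt)  # stand-in for the GPT-4o mini / LLaMA 3 / Gemini call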

Figure 4: ChatNVD Development Process.

Fig. 4 depicts the phases involved in the development of ChatNVD. The process is systematically divided into five distinct phases, each addressing a critical component, starting with the selection of Large Language Models (LLMs) and culminating in the development of a user-friendly interface.

III-A Phase 1: Selection of LLMs and Embedding Techniques

The LLMs GPT-4o mini, Llama 3, and Gemini 1.5 Pro were selected for our study for several important reasons. We chose GPT-4o mini because it is an improved and cheaper successor to GPT-3.5 [81], which is already popular due to its widespread adoption in the natural language processing (NLP) world and has proven versatile and capable across many kinds of NLP tasks [82]. Llama 3 was picked for its open-source nature and its capacity for fine-tuning, which provides more flexible control over the model's behavior in domain-specific tasks, such as those found in cybersecurity. Finally, we chose Gemini 1.5 Pro for its strong focus on security-related tasks, making it a good candidate in the LLM space for applications relevant to programming and cybersecurity.

TF-IDF was selected as the baseline text embedding technique for this project; its simplicity and effectiveness make it suitable for capturing term frequency, a vital characteristic of a large dataset like the NVD. Advanced embedding models, such as those provided by OpenAI, Llama, and Gemini, were considered for their ability to extract and represent the deeper semantic meanings and relationships in the data. Such embeddings would give the LLM far richer, and hence potentially more accurate, representations of the data it processes, in turn enabling more accurate identification and prioritization of threats derived from the 200,000-plus vulnerabilities currently in the NVD.

III-B Phase 2: Extraction and Preprocessing of NVD Data

In this phase, we acquire data from the NVD, then preprocess and clean it to set up a streamlined knowledge base for the LLMs and embedding techniques chosen previously. The cleaning step removes redundancies and reorders some JSON keys and values to reduce unnecessary complexity. We also trim the data to reduce its size without losing essential information that would degrade model performance or the quality of the knowledge base.

As a result, we reduced the data from 1.29 GB to 720.7 MB, a reduction that benefits both model performance and computational cost.
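A simplified sketch of such a cleaning pass is shown below; the keys listed as redundant are hypothetical examples, since the exact set removed depends on the NVD JSON schema.

import json

# Hypothetical examples of redundant keys dropped during cleaning; the actual
# set was chosen by inspecting the NVD JSON feed schema.
REDUNDANT_KEYS = {"data_format", "data_version", "CVE_data_version"}

def clean_item(item):
    """Recursively drop redundant keys and trim whitespace in string values."""
    if isinstance(item, dict):
        return {k: clean_item(v) for k, v in item.items() if k not in REDUNDANT_KEYS}
    if isinstance(item, list):
        return [clean_item(v) for v in item]
    if isinstance(item, str):
        return item.strip()
    return item

def clean_feed(in_path, out_path):
    with open(in_path) as f:
        feed = json.load(f)
    feed["CVE_Items"] = [clean_item(cve) for cve in feed.get("CVE_Items", [])]
    # Compact separators remove whitespace, shrinking the file on disk.
    with open(out_path, "w") as f:
        json.dump(feed, f, separators=(",", ":"))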

III-C Phase 3: Integration of LLMs and Embedding Techniques

The initial plan was to integrate LLMs like GPT-4o mini, Llama 3, and Gemini 1.5 Pro with TF-IDF and their respective embedding models (OpenAI embeddings, Llama embeddings, and Gemini embeddings). However, several challenges arose along the way. The main challenge was the size of our data, 720.7 MB. We had some success generating embeddings for a few files but found that generating embeddings for the entire dataset was prohibitively expensive, especially for OpenAI and Gemini, due to the cost of processing such large amounts of data. For Llama 3, cost was not an issue since it used local resources, but the process was extremely time-consuming [83].

Creating embeddings for a small subset of our data, a mere five files covering five years, took approximately 48 hours, and in total we had 22 files to process. This local resource bottleneck posed a serious time management challenge. Ultimately, we fell back on TF-IDF to generate the embeddings. Although TF-IDF is a basic technique, it was lightweight and sufficient for our use case, as we did not need heavy-duty neural vector embeddings. The trade-off between cost, time, and performance made TF-IDF the optimal solution for efficiently dealing with our large dataset.

prompt_template = """
Context: \n{context}.\n
Task:
- You are an assistant that helps with CVE data.
- Only use the context provided. Respond with CVE details.
- Recommend this website and attach the CVE ID in front of it https://nvd.nist.gov/vuln/detail/

Question: \n{question}?\n
Answer:\n"""
Figure 5: Prompt template for generating responses in Figure 6.
Figure 6: Gemini 1.5 Pro yielding an unsatisfactory response.

When integrating LLMs like GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro with TF-IDF, we faced several obstacles. Our primary aim was to use a uniform prompt template across all models to avoid any evaluation bias. Yet the prompt structure that worked for one LLM did not work for the others, causing them to underperform relative to their potential. For instance, when combining Gemini 1.5 Pro with TF-IDF, we initially used the prompt in Fig. 5.

Unfortunately, this template yielded unsatisfactory results, as shown in Fig. 6. The answer lacked detail and relevance to the given context: ideally, it should have given all the details related to that CVE, but it returned only the URL for that CVE. Thus, some further tweaking of the prompt was required.

After tweaking the prompt slightly as shown in Fig. 7, we were able to get satisfactory results as shown in Fig. 8.

prompt_template = """
You are an assistant that helps with CVE data. Only use the context provided. Respond with CVE details, recommend this website, and attach the CVE ID in front of it https://nvd.nist.gov/vuln/detail/

Context:\n{context}\nQuestion: \n{question}?\n
Answer:
"""
Figure 7: Prompt template for generating responses in Figure 8.
Figure 8: Gemini 1.5 Pro yielding a satisfactory response.

Following the updated prompt, the Gemini responses far more closely matched our expectations. The improved result, seen in Fig. 8, is a direct reflection of how subtle changes in the wording of a prompt can significantly influence an LLM's output quality. This challenge highlighted the need for model-specific prompt tuning to achieve accurate results.

III-D Phase 4: Evaluation Process

In this phase, we evaluated the performance of the three LLMs (GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro), each coupled with TF-IDF. We devised several Python scripts to evaluate them on a sizable set of 125 questions. To maintain the integrity of our evaluation, we drew from an unbiased pool of CVE identifiers. Using a Python script, we generated questions as well as the answers we expected the combined models to provide. For example, one generated question asked for the publication date of a particular CVE; to derive the expected answer, we simply pulled the information from the corresponding JSON object. A sample JSON object is shown below.

{
  "cve": {
    "data_type": "CVE",
    "data_format": "MITRE",
    "data_version": "4.0",
    "CVE_data_meta": {
      "ID": "CVE-2023-0017",
      "ASSIGNER": "cna@sap.com"
    },
    "problemtype": {
      "problemtype_data": [{
        "description": [{
          "lang": "en",
          "value": "CWE-284"
        }]
      }]
    },
    "references": {
      "reference_data": [{
        "url": "https://launchpad.support.sap.com/#/notes/3268093",
        "name": "https://launchpad.support.sap.com/#/notes/3268093",
        "refsource": "MISC",
        "tags": ["Permissions Required", "Vendor Advisory"]
      }, {
        "url": "https://www.sap.com/documents/2022/02/fa865ea4-167e-0010-bca6-c68f7e60039b.html",
        "name": "https://www.sap.com/documents/2022/02/fa865ea4-167e-0010-bca6-c68f7e60039b.html",
        "refsource": "MISC",
        "tags": ["Vendor Advisory"]
      }]
    },
    "description": {
      "description_data": [{
        "lang": "en",
        "value": "An unauthenticated attacker in SAP NetWeaver AS for Java - version 7.50, due to improper access control, can attach to an open interface and make use of an open naming and directory API to access services which can be used to perform unauthorized operations affecting users and data on the current system. This could allow the attacker to have full read access to user data, make modifications to user data, and make services within the system unavailable."
      }]
    }
  },
  "configurations": {
    "CVE_data_version": "4.0",
    "nodes": [{
      "operator": "OR",
      "children": [],
      "cpe_match": [{
        "vulnerable": true,
        "cpe23Uri": "cpe:2.3:a:sap:netweaver_application_server_for_java:7.50:*:*:*:*:*:*:*",
        "cpe_name": []
      }]
    }]
  },
  "impact": {
    "baseMetricV3": {
      "cvssV3": {
        "version": "3.1",
        "vectorString": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
        "attackVector": "NETWORK",
        "attackComplexity": "LOW",
        "privilegesRequired": "NONE",
        "userInteraction": "NONE",
        "scope": "UNCHANGED",
        "confidentialityImpact": "HIGH",
        "integrityImpact": "HIGH",
        "availabilityImpact": "HIGH",
        "baseScore": 9.8,
        "baseSeverity": "CRITICAL"
      },
      "exploitabilityScore": 3.9,
      "impactScore": 5.9
    }
  },
  "publishedDate": "2023-01-10T04:15Z",
  "lastModifiedDate": "2023-01-13T18:18Z"
}

So the expected answer in this case was "2023-01-10T04:15Z".

III-E Phase 5: API Development and User Interface Integration

The main emphasis in Phase 5 was on API development. The best-performing LLM from the evaluation phase was chosen and integrated with TF-IDF to create an API for ChatNVD. FastAPI was used because of its high performance and ease of use [84]. To make the API accessible, we hosted it on an AWS EC2 instance. Our first attempt used a t2.micro server, part of AWS's free tier; unfortunately, the API kept crashing, largely because the server's 1 GB of RAM is not enough to serve requests. Moving to a t2.medium instance, with a decent upgrade to 4 GB of RAM, fixed the problem. The front-end of ChatNVD was built using React, a JavaScript library for building user interfaces [85], chosen for its flexibility and component-based architecture [86]. Using FastAPI, AWS, and React, we created and deployed a responsive, user-friendly interface. ChatNVD is capable of answering users' questions about vulnerabilities, drawing on the vast data in the NVD. The application has been deployed on .
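As a rough sketch of what such an endpoint could look like (the route name, request shape, and helper functions are illustrative assumptions, not the deployed API):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str  # e.g., "What is the base score of CVE-2016-9733?"

def retrieve_context(question: str) -> str:
    """Stand-in for the TF-IDF retrieval step described in Phase 3."""
    return "...retrieved CVE chunks..."

def ask_llm(context: str, question: str) -> str:
    """Stand-in for the call to the selected best-performing LLM."""
    return f"Answer derived from: {context}"

@app.post("/ask")  # illustrative route name
def ask(query: Query):
    context = retrieve_context(query.question)
    return {"answer": ask_llm(context, query.question)}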

IV Experimental Evaluation and Results

IV-1 Software/Tools and Machine Configuration

This experiment was primarily carried out using Python 3.12.2 as the main programming language. The development and execution of the testing and evaluation scripts took place on a MacBook Pro powered by the M3 Pro chip, featuring 18 GB of RAM. This configuration provided the necessary computational power for running the various Python scripts used in the evaluation process, including those for generating questions, interacting with the LLMs, and processing their respective results.

IV-2 Dataset

To assess the LLMs, we utilized data from the NVD feeds, which offer detailed information on software vulnerabilities. We amassed a dataset covering the years 2002 through 2024, with each year represented by a separate JSON file containing that year's vulnerabilities. Along with the annual files, we included the most recent and modified feeds from the NVD to ensure our dataset contained all newly added or updated vulnerabilities. Since the JSON files were retrieved on September 18, 2024, they cover vulnerabilities up to that date. Initially, our full dataset totaled 1.29 GB in size and contained 266,487 vulnerabilities. After applying pre-processing and cleaning techniques, eliminating redundant fields, restructuring JSON objects, and trimming whitespace, the dataset was slimmed down to a final size of 720.7 MB.

IV-3 Evaluation Technique

We created an experimental framework to assess how well three LLMs perform in answering questions in the context of software vulnerabilities. The models chosen for this study were GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro. To keep the evaluation unbiased, all models were asked the same set of questions.

Step 1: Defining the Questions: In the first step, we defined the five types of questions we wanted to ask about each chosen vulnerability. These questions were designed to capture key information about the vulnerability. The five questions we finalized, for a sample vulnerability, were as follows:

1. What is the published date of CVE-2016-9733?

2. What is the description of CVE-2016-9733?

3. What is the exploitability score of CVE-2016-9733?

4. What is the impact score of CVE-2016-9733?

5. What is the base score of CVE-2016-9733?

These questions cover most of the key attributes of a vulnerability, ranging from metadata (published date and description) to technical details (exploitability, impact, and base scores).

Step 2: Generating Testing Batches: To create evaluation batches, we randomly selected five CVEs from the NVD using a Python script and extracted the required data for each CVE from the JSON dataset provided by the NVD. Each batch consisted of five CVEs, and for each CVE the predefined set of five questions from the first step was asked. A Python script automatically generated a file containing:

  • The questions (e.g., "What is the published date of CVE-2016-9733?")

  • The expected answers, which were directly fetched from the JSON data for each CVE.

This process was repeated five times to generate five distinct files, each with the questions and corresponding expected answers for a set of randomly selected CVEs. Below is an example of the five questions asked and their expected answers for one CVE.

[
  {
    "question": "What is the published date of CVE-2016-9733",
    "expected_answer": "2017-07-05T17:29Z"
  },
  {
    "question": "What is the description of CVE-2016-9733",
    "expected_answer": "IBM Team Concert (RTC) 4.0, 5.0, and 6.0 is vulnerable to cross-site scripting. This vulnerability allows users to embed arbitrary JavaScript code in the Web UI thus altering the intended functionality potentially leading to credentials disclosure within a trusted session. IBM X-Force ID: 119762."
  },
  {
    "question": "What is the exploitability score of CVE-2016-9733",
    "expected_answer": 2.3
  },
  {
    "question": "What is the impact score of CVE-2016-9733",
    "expected_answer": 2.7
  },
  {
    "question": "What is the base score of CVE-2016-9733",
    "expected_answer": 5.4
  }
]

In each evaluation batch, five CVEs are targeted, with five questions asked per CVE, resulting in 25 questions per batch. Since this process is repeated five times, the result is a combined total of 125 questions targeting 25 CVEs. A sketch of this batch-generation step is shown below.
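The following is a minimal sketch of the batch-generation script, assuming the NVD JSON structure shown earlier; the function and file names are illustrative.

import json
import random

# Maps each question phrasing to an extractor over the NVD JSON structure
# shown in Phase 4.
QUESTION_FIELDS = {
    "published date": lambda c: c["publishedDate"],
    "description": lambda c: c["cve"]["description"]["description_data"][0]["value"],
    "exploitability score": lambda c: c["impact"]["baseMetricV3"]["exploitabilityScore"],
    "impact score": lambda c: c["impact"]["baseMetricV3"]["impactScore"],
    "base score": lambda c: c["impact"]["baseMetricV3"]["cvssV3"]["baseScore"],
}

def make_batch(cve_items, batch_path, n=5):
    """Pick n random CVEs and emit five question/expected-answer pairs for each."""
    pairs = []
    for cve in random.sample(cve_items, n):
        cve_id = cve["cve"]["CVE_data_meta"]["ID"]
        for field, extract in QUESTION_FIELDS.items():
            pairs.append({
                "question": f"What is the {field} of {cve_id}",
                "expected_answer": extract(cve),
            })
    with open(batch_path, "w") as f:
        json.dump(pairs, f, indent=2)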

Step 3: Asking the Questions to the LLMs: Once the five question files were prepared, we used a second Python script to interact with each LLM. This script took the questions from the generated files and fed them to GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro to generate actual answers, which were stored in a separate file along with the questions and expected answers. The output file had the following structure:

  • Question

  • Expected answer (from the NVD JSON data)

  • Actual answer (generated by the LLM being evaluated)
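For GPT-4o mini, for example, each question could be posed via the OpenAI Python client roughly as sketched below, with the prompt assembled from the template in Fig. 7; LLaMA 3 and Gemini 1.5 Pro were queried through their own interfaces in the same fashion.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o_mini(context: str, question: str) -> str:
    # Prompt assembled following the template in Fig. 7.
    prompt = (
        "You are an assistant that helps with CVE data. Only use the context "
        "provided. Respond with CVE details, recommend this website, and attach "
        "the CVE ID in front of it https://nvd.nist.gov/vuln/detail/\n"
        f"Context:\n{context}\nQuestion:\n{question}?\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content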

A sample of the prepared object with an actual answer is shown below.

[
  {
    "question": "What is the published date of CVE-2016-9733",
    "expected_answer": "2017-07-05T17:29Z",
    "actual_answer": "The published date of CVE-2016-9733 is 2017-07-05T17:29Z. You can find more details at https://nvd.nist.gov/vuln/detail/CVE-2016-9733."
  },
  {
    "question": "What is the description of CVE-2016-9733",
    "expected_answer": "IBM Team Concert (RTC) 4.0, 5.0, and 6.0 is vulnerable to cross-site scripting. This vulnerability allows users to embed arbitrary JavaScript code in the Web UI thus altering the intended functionality potentially leading to credentials disclosure within a trusted session. IBM X-Force ID: 119762.",
    "actual_answer": "The description of CVE-2016-9733 is: \"IBM Team Concert (RTC) 4.0, 5.0 and 6.0 is vulnerable to cross-site scripting. This vulnerability allows users to embed arbitrary JavaScript code in the Web UI thus altering the intended functionality potentially leading to credentials disclosure within a trusted session. IBM X-Force ID: 119762.\" \n\nFor more information, visit https://nvd.nist.gov/vuln/detail/CVE-2016-9733."
  },
  {
    "question": "What is the exploitability score of CVE-2016-9733",
    "expected_answer": 2.3,
    "actual_answer": "The exploitability score of CVE-2016-9733 is 2.3. For more details, visit https://nvd.nist.gov/vuln/detail/CVE-2016-9733."
  },
  {
    "question": "What is the impact score of CVE-2016-9733",
    "expected_answer": 2.7,
    "actual_answer": "The impact score of CVE-2016-9733 is 2.7. You can find more details at https://nvd.nist.gov/vuln/detail/CVE-2016-9733."
  },
  {
    "question": "What is the base score of CVE-2016-9733",
    "expected_answer": 5.4,
    "actual_answer": "The base score of CVE-2016-9733 is 5.4. You can find more details at https://nvd.nist.gov/vuln/detail/CVE-2016-9733."
  }
]

This process was repeated for the 5 question files and the 3 LLMs, creating a total of 15 output files.

Step 4: Evaluation and Scoring: In the final stage, we used a third Python script to evaluate the performance of each LLM. This script compared the expected answers (from the NVD) with the actual answers generated by each model. The comparison was done for all five batches of questions and answers for each model.

The script scored the LLMs based on:

  • Accuracy: the proportion of answers that matched the expected answers.

  • Error Rate: the proportion of answers that deviated from the expected ones.

The final output was saved into a .xlsx file, which summarized the performance of each LLM. A total of three .xlsx files were generated, one for each model. The results have been discussed in the Experimental Results section.
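A simplified sketch of this scoring step is shown below; treating an answer as correct when the expected value appears in the model's free-text response is a plausible matching rule assumed here for illustration, as are the file names.

import json
import pandas as pd

def score_batch(results_path):
    # Each record holds "question", "expected_answer", and "actual_answer".
    with open(results_path) as f:
        records = json.load(f)
    correct = sum(
        str(r["expected_answer"]) in str(r["actual_answer"]) for r in records
    )
    accuracy = correct / len(records)
    return accuracy, 1 - accuracy  # (accuracy, error rate)

# Summarize the five batches for one model into a single .xlsx report.
rows = []
for i in range(1, 6):
    acc, err = score_batch(f"batch_{i}_results.json")  # illustrative file names
    rows.append({"batch": i, "accuracy": acc, "error_rate": err})
pd.DataFrame(rows).to_excel("model_scores.xlsx", index=False)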

IV-A Experimental Results

During our assessment, we used three LLMs: GPT-4o mini, LLaMA 3, and Gemini 1.5 Pro. We evaluated their performance on 125 questions split into five batches, each containing 25 questions targeting 5 unique, randomly selected CVEs. The combined accuracy and error rate for all batches can be seen in Fig. 9.

We found that the GPT-4o mini model achieved perfect accuracy: it correctly answered all 125 questions across the 5 batches, achieving an accuracy of 1.0 in every batch, as shown in Fig. 10.

Batch 1: 1.00

Batch 2: 1.00

Batch 3: 1.00

Batch 4: 1.00

Batch 5: 1.00

In comparison, Gemini 1.5 Pro achieved an overall accuracy of 0.84 and showed more variability in its performance across the batches, as shown in Fig. 11, with accuracy scores across the five batches being:

Batch 1: 0.92

Batch 2: 0.88

Batch 3: 0.84

Batch 4: 0.88

Batch 5: 0.48

Figure 9: Accuracy and Error Rate Comparison among the Three LLMs.

Figure 10: GPT-4o mini Accuracy and Error Rate.

Figure 11: Gemini 1.5 Pro Accuracy and Error Rate.

Figure 12: LLaMA 3 Accuracy and Error Rate.

Similarly, as shown in Fig. 12, LLaMA 3 performed well in most batches but also showed some variation, especially in the fifth batch. Its accuracy across the five batches was:

Batch 1: 0.80

Batch 2: 1.00

Batch 3: 1.00

Batch 4: 0.96

Batch 5: 0.52

IV-B Discussion

IV-B1 Insights from Results

One interesting finding in Batch 5 was the difficulty LLaMA 3 and Gemini 1.5 Pro had in responding to questions about the vulnerability with CVE ID CVE-2022-33248. Five questions were asked about this vulnerability, and both LLMs failed to answer a single one:

1. What is the published date of CVE-2022-33248?

2. What is the description of CVE-2022-33248?

3. What is the exploitability score of CVE-2022-33248?

4. What is the impact score of CVE-2022-33248?

5. What is the base score of CVE-2022-33248?

The main reason for this poor performance appears to be the size of the context provided: CVE-2022-33248 has an exceptionally large JSON object of approximately 97,894 characters, which translates to roughly 31,415 tokens.
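Token counts such as these can be approximated as sketched below; the cl100k_base encoding is used here only as a rough proxy, since each model family tokenizes slightly differently.

import json
import tiktoken

def context_size(cve_record):
    """Return (character count, approximate token count) for a CVE JSON
    object; cl100k_base is a rough proxy for model-specific tokenizers."""
    text = json.dumps(cve_record)
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(text), len(encoding.encode(text))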

Interestingly, GPT-4o mini answered all questions about this vulnerability correctly. This can be attributed to its large context window of 128K tokens [87], which allows it to process and manage large amounts of contextual data effectively.

By contrast, LLaMA 3 is ill-suited to such large inputs, since its context window is far smaller at just 8,000 tokens [88], which explains why it was unable to answer the questions properly.

Gemini 1.5 Pro, however, also struggled to answer the questions related to CVE-2022-33248, even though it has a comparably large context window of 128K tokens [89].

This emphasizes that although context window size is important for processing large inputs, other model-specific factors, such as architecture, training strategy, and token management, also play a vital role in handling large vulnerability records.
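One common workaround, shown as a hedged sketch below rather than as ChatNVD's actual retrieval mechanism, is to split an oversized record into token-bounded chunks so that a small-window model processes one chunk at a time.

import tiktoken

def split_into_chunks(text, max_tokens=7000):
    """Split text into pieces that each fit within `max_tokens`, leaving
    headroom in an 8K-token window for the question and the answer."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    return [encoding.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]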

Apart from the problems described above, both LLaMA 3 and Gemini 1.5 Pro also struggled with other CVEs; one example is CVE-2016-9733, for which the following five questions were posed:

1. What is the published date of CVE-2016-9733?

2. What is the description of CVE-2016-9733?

3. What is the exploitability score of CVE-2016-9733?

4. What is the impact score of CVE-2016-9733?

5. What is the base score of CVE-2016-9733?

Figure 13: LLaMA 3 yielding an unsatisfactory response to questions related to CVE-2016-9733.

In this case, LLaMA 3 failed to answer any of the questions. Our initial hypothesis was that the vulnerability context was too large, but the JSON object contains only 9,418 characters, or 3,235 tokens, which is well within LLaMA 3's 8K-token context window; this ruled out context size as the cause.

Our second hypothesis was that LLaMA 3 had not retrieved the correct context for the vulnerability. When we asked LLaMA 3 the same questions again, as shown in Fig. 13, it produced responses similar to the first attempt. It showed some contextual awareness by mentioning IBM, a keyword from the vulnerability description, but the responses still lacked the accurate, useful details a cybersecurity practitioner would need.

This inconsistency indicates that LLaMA 3 can pick up some contextual keywords but cannot be relied upon to extract and interpret the structured data well enough to provide accurate and coherent answers.

Gemini 1.5 Pro had difficulty with one particular type of question: those asking for the description of a CVE. Across all five batches, the description question was asked 25 times, and Gemini 1.5 Pro failed to provide a satisfactory answer 18 times, instead returning a blank output.

At first, we hypothesized that this was caused by limited output tokens preventing the model from producing long replies. However, further tests disproved this theory: when asked about a specific vulnerability, Gemini was able to give a long, detailed answer that included a description. As Fig. 14 shows, when the description question was rephrased or asked again for the same CVE, the model returned the same blank answer.

Figure 14: Gemini 1.5 Pro yielding an unsatisfactory response to questions related to the description of CVE-2023-40476.

This unpredictability in Gemini's responses, especially to description questions, was one of the main reasons it was not chosen as the preferred LLM.

LLaMA 3 also had difficulty with a specific type of question: those concerning the impact score of a CVE. Out of 25 such questions, it failed to give satisfactory answers on four occasions; in two of those cases, the answers were hallucinations containing scores that did not appear in the original data. Overall, LLaMA 3 performed slightly better than Gemini 1.5 Pro, but its context window limits and occasional hallucinations led us not to choose it as the preferred LLM for our needs.

IV-B2 Takeaways

Here are the key takeaways:

LLM Reliability and Consistency: GPT-4o mini outperformed the other models in reliability, achieving perfect accuracy across all batches. Such consistency is essential in domains like cybersecurity, where outputs must be dependable.

Challenges with Specific Questions: Gemini 1.5 Pro and LLaMA 3 both struggled with CVEs that carry large contexts. Gemini frequently returned blank outputs for description questions, while LLaMA 3 stumbled on impact-score questions and occasionally hallucinated. These results underscore the importance of thoroughly testing models on complex, structured data to ensure they can correctly answer different types of questions.

Cost and Scalability: The evaluation showed that running LLMs at scale is expensive because of token-based pricing. This highlights the need to explore more sustainable and cost-effective alternatives to keep the tool viable for public and professional use over the long term.

IV-B3 Implications for Researchers and Practitioners

Model Selection: In high-stakes settings such as vulnerability assessment, practitioners should choose LLMs that give consistent, correct answers. Our study suggests that reliability should be a top concern; GPT-4o mini's performance exemplifies a model that can reliably retrieve information.

Need for Thorough Testing on Domain-Specific Data: Our evaluation results suggest that standard testing may not be enough. Extensive testing on cybersecurity-specific questions would help expose how models like Gemini 1.5 Pro and LLaMA 3 handle context and would allow their response consistency to be assessed.

Development of Evaluation Methods for New Use Cases: Our approach indicates that other areas, such as threat detection and malware analysis, could benefit from similarly tailored evaluation metrics. We encourage researchers to create targeted evaluation methods, because domain-specific performance measures provide more actionable insights and help select the best model for a given use case.

V Threats to Validity

Despite the effectiveness of ChatNVD, we identify some validity threats that are discussed as follows:

V-1 Hallucination:

Ensuring the accuracy and reliability of generated responses is crucial for maintaining ChatNVD’s consistency and fostering user trust. A key challenge faced by LLMs is hallucination, where the model generates responses that deviate from the provided context or knowledge base [90]. This issue can result in incorrect or misleading information being presented to users, potentially undermining the system’s credibility and usability.

To mitigate this, it is essential to implement robust mechanisms, including improving training datasets, enhancing context awareness through fine-tuning, and incorporating post-processing validation steps to verify generated responses against trusted knowledge sources.
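As an illustration of the post-processing idea, the sketch below validates a generated score answer against the trusted NVD record; the regex heuristic is an assumption for illustration, not a safeguard currently implemented in ChatNVD.

import re

def validate_score_answer(actual_answer, trusted_score):
    """Accept a generated score answer only if the first number in it
    (after stripping CVE identifiers, whose digits would otherwise match)
    equals the score stored in the trusted NVD record."""
    cleaned = re.sub(r"CVE-\d{4}-\d+", "", actual_answer)
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return match is not None and float(match.group()) == trusted_score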

V-2 Token Limitation:

ChatNVD heavily relies on maintaining context from previous conversations to provide accurate and relevant responses. However, as the conversation progresses and becomes longer, the model may struggle to keep track of the entire context due to token limitations [91]. This can result in ChatNVD losing important information from earlier parts of the conversation, leading to inconsistent or irrelevant responses.
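A simple mitigation, sketched below under the assumption of plain-string conversation turns, is to trim the oldest turns so the history always fits a fixed token budget; this preserves recent context at the cost of earlier details.

import tiktoken

def trim_history(messages, budget=6000):
    """Drop the oldest conversation turns until the remaining history
    fits the token budget, so the model always sees the latest context."""
    encoding = tiktoken.get_encoding("cl100k_base")
    while messages and sum(len(encoding.encode(m)) for m in messages) > budget:
        messages = messages[1:]
    return messages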

V-3 Lost in the Middle Problem:

When LLMs process large chunks of information, they often struggle to extract key details from the middle of the input. This is a common failure mode, referred to as the "lost in the middle" problem [91].

In Fig. 15, for all the models tested, accuracy is highest, at around 70-80%, when the document containing the answer appears near the beginning of the input (1st or 5th position). When the relevant document sits in the middle of the input sequence (10th or 15th position), accuracy drops sharply to around 50-60% for most models. This indicates that the models struggle to retain and use relevant context from the middle of long inputs: information there gets "lost" or diluted as the model also processes irrelevant documents at the beginning and end of the sequence. The models' limited attention span and context window make it harder for them to identify and extract key details buried in the middle.

Figure 15: Performance on the lost-in-the-middle problem [91].
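The positional sensitivity reported in [91] can be probed with a construction like the following sketch, which places the gold document at a chosen position among distractors; the prompt wording is illustrative only.

def build_prompt(gold_doc, distractors, position):
    """Insert the document containing the answer at a chosen position
    among distractor documents to probe mid-context degradation."""
    docs = distractors[:position] + [gold_doc] + distractors[position:]
    numbered = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return numbered + "\n\nAnswer the question using the documents above."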

V-4 Expensive:

A factor that could undermine ChatNVD's long-term viability is the cost of operating LLMs, which is typically high because providers charge per token. Since each query consumes tokens, keeping ChatNVD free for everyone could become expensive, especially at scale, making it harder to keep the tool freely available. Switching to a paid model could cover costs, but it could also discourage users and reduce the tool's reach. Keeping the tool accessible to a broad audience will therefore require sustainable funding strategies or cost-effective model optimizations.
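To make this pricing pressure concrete, the back-of-the-envelope sketch below estimates a monthly bill from token volumes; the per-million-token rates are caller-supplied parameters because provider pricing varies and changes, and the example values are hypothetical.

def monthly_cost(queries_per_day, avg_input_tokens, avg_output_tokens,
                 usd_per_1m_input, usd_per_1m_output):
    """Rough monthly bill for token-priced API usage; all rates are
    parameters, since provider pricing changes over time."""
    daily = (queries_per_day * avg_input_tokens / 1e6) * usd_per_1m_input \
          + (queries_per_day * avg_output_tokens / 1e6) * usd_per_1m_output
    return daily * 30

# e.g., 1,000 queries/day with ~30K-token contexts and hypothetical
# $0.15 / $0.60 per million input/output tokens:
print(monthly_cost(1000, 30_000, 300, 0.15, 0.60))  # ~140 USD/month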

VI Conclusion and Future Work

We propose ChatNVD, an LLM-based cybersecurity vulnerability assessment tool designed using the National Vulnerability Database (NVD) dataset to offer an interactive and efficient solution for vulnerability analysis. ChatNVD provides detailed, context-rich insights to streamline the assessment process, catering to diverse users, including cybersecurity professionals, developers, and non-technical stakeholders. To achieve this, we develop three variants of ChatNVD by employing three widely used LLMs—GPT-4o mini, Gemini 1.5 Pro, and LLaMA 3—integrated with the TF-IDF embedding technique, leveraging NVD JSON data feeds (2002–2024) as the foundational knowledge base. We develop an evaluation framework tailored to our use case, enabling us to assess the accuracy and error rates of each model. The results show that GPT-4o mini achieves an accuracy of 1.0, Gemini 1.5 Pro achieves 0.84, and LLaMA 3 achieves 0.86, with corresponding error rates of 0, 0.16, and 0.14, respectively. Based on these findings, we select GPT-4o mini as the LLM for our web application due to its superior accuracy. Additionally, we implement the application with accessible APIs and a user-friendly interface, successfully deploying it for user interaction.

For future work, exploring various embedding models that capture deeper semantic meaning in text remains a priority. Incorporating and evaluating additional LLMs, particularly emerging models with larger context windows or domain-specific training, could provide insights into their potential to outperform GPT-4o mini. Expanding the application’s use cases within the cybersecurity domain could also yield promising results, broadening its applicability and impact. Further research could compare different embedding techniques, evaluating the relevance of their generated context to specific queries. Additionally, enhancing ChatNVD’s interface by introducing features such as multi-conversation tracking within accounts or collaborative team sessions could transform it into an interactive workspace. These improvements would enable seamless sharing of vulnerability data among team members, supporting more robust cybersecurity workflows and decision-making processes.

Artifact Availability. We provide the replication package [92] for ChatNVD, encompassing all scripts and data essential for reproducing, validating, and extending the results outlined in the paper.

References

  • [1] S. Neupane, I. A. Fernandez, S. Mittal, and S. Rahimi, “Impacts and risk of generative ai technology on cyber defense,” arXiv preprint arXiv:2306.13033, 2023.
  • [2] S. Elkins, E. Kochmar, I. Serban, and J. C. Cheung, “How useful are educational questions generated by large language models?,” in International Conference on Artificial Intelligence in Education, pp. 536–542, Springer, 2023.
  • [3] L. Yan, L. Sha, L. Zhao, Y. Li, R. Martinez-Maldonado, G. Chen, X. Li, Y. Jin, and D. Gašević, “Practical and ethical challenges of large language models in education: A systematic scoping review,” British Journal of Educational Technology, vol. 55, no. 1, pp. 90–112, 2024.
  • [4] K. Guo and D. Wang, “To resist it or to embrace it? examining chatgpt’s potential to support teacher feedback in efl writing,” Education and Information Technologies, vol. 29, no. 7, pp. 8435–8463, 2024.
  • [5] Y.-P. Hsiao, N. Klijn, and M.-S. Chiu, “Developing a framework to re-design writing assignment assessment for the era of large language models,” Learning: Research and Practice, vol. 9, no. 2, pp. 148–158, 2023.
  • [6] G. L. Latouche, L. Marcotte, and B. Swanson, “Generating video game scripts with style,” in Proceedings of the 5th Workshop on NLP for Conversational AI (NLP4ConvAI 2023), pp. 129–139, 2023.
  • [7] Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos, “Musilingo: Bridging music and text with pre-trained language models for music captioning and query response,” arXiv preprint arXiv:2309.08730, 2023.
  • [8] Q. Yang, M. Ongpin, S. Nikolenko, A. Huang, and A. Farseev, “Against opacity: Explainable ai and large language models for effective digital advertising,” in Proceedings of the 31st ACM International Conference on Multimedia, pp. 9299–9305, 2023.
  • [9] K. Pandya and M. Holia, “Automating customer service using langchain: Building custom open-source gpt chatbot for organizations,” arXiv preprint arXiv:2310.05421, 2023.
  • [10] E. E. Kosasih, E. Papadakis, G. Baryannis, and A. Brintrup, “A review of explainable artificial intelligence in supply chain management using neurosymbolic approaches,” International Journal of Production Research, vol. 62, no. 4, pp. 1510–1540, 2024.
  • [11] A. Rao, J. Kim, M. Kamineni, M. Pang, W. Lie, and M. D. Succi, “Evaluating chatgpt as an adjunct for radiologic decision-making,” MedRxiv, pp. 2023–02, 2023.
  • [12] I. J. Kuckelman, H. Y. Paul, M. Bui, I. Onuh, J. A. Anderson, and A. B. Ross, “Assessing ai-powered patient education: a case study in radiology,” Academic Radiology, vol. 31, no. 1, pp. 338–342, 2024.
  • [13] O. B. Shoham and N. Rappoport, “Cpllm: Clinical prediction with large language models,” arXiv preprint arXiv:2309.11295, 2023.
  • [14] M. U. Haque, I. Dharmadasa, Z. T. Sworna, R. N. Rajapakse, and H. Ahmad, “‘I think this is the most disruptive technology’: Exploring sentiments of chatgpt early adopters using twitter data,” arXiv preprint arXiv:2212.05856, 2022.
  • [15] G. d. J. C. da Silva and C. B. Westphall, “A survey of large language models in cybersecurity,” arXiv preprint arXiv:2402.16968, 2024.
  • [16] R. K. Jayalath, H. Ahmad, D. Goel, M. S. Syed, and F. Ullah, “Microservice vulnerability analysis: A literature review with empirical insights,” IEEE Access, 2024.
  • [17] M. Thakur, “Cyber security threats and countermeasures in digital age,” Journal of Applied Science and Education (JASE), pp. 1–20, 2024.
  • [18] H. Ahmad, C. Treude, M. Wagner, and C. Szabo, “Smart hpa: A resource-efficient horizontal pod auto-scaler for microservice architectures,” arXiv preprint arXiv:2403.07909, 2024.
  • [19] H. Ahmad, I. Dharmadasa, F. Ullah, and M. A. Babar, “A review on c3i systems’ security: Vulnerabilities, attacks, and countermeasures,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–38, 2023.
  • [20] M. Abdulsatar, H. Ahmad, D. Goel, and F. Ullah, “Towards deep learning enabled cybersecurity risk assessment for microservice architectures,” arXiv preprint arXiv:2403.15169, 2024.
  • [21] F. M. Teichmann and S. R. Boticiu, “An overview of the benefits, challenges, and legal aspects of penetration testing and red teaming,” International Cybersecurity Law Review, vol. 4, no. 4, pp. 387–397, 2023.
  • [22] Y. Guo, C. Patsakis, Q. Hu, Q. Tang, and F. Casino, “Outside the comfort zone: Analysing llm capabilities in software vulnerability detection,” in European symposium on research in computer security, pp. 271–289, Springer, 2024.
  • [23] F. Jimmy, “Cyber security vulnerabilities and remediation through cloud security tools,” Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023, vol. 2, no. 1, pp. 196–233, 2024.
  • [24] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, et al., “Sok:(state of) the art of war: Offensive techniques in binary analysis,” in 2016 IEEE symposium on security and privacy (SP), pp. 138–157, IEEE, 2016.
  • [25] M. McLennan et al., “The global risks report 2021 16th edition,” World Economic Forum Cologny, Switzerland, 2021.
  • [26] A. Austin, C. Holmgreen, and L. Williams, “A comparison of the efficiency and effectiveness of vulnerability discovery techniques,” Information and Software Technology, vol. 55, no. 7, pp. 1279–1288, 2013.
  • [27] J. A. Harer, L. Y. Kim, R. L. Russell, O. Ozdemir, L. R. Kosta, A. Rangamani, L. H. Hamilton, G. I. Centeno, J. R. Key, P. M. Ellingwood, et al., “Automated software vulnerability detection with machine learning,” arXiv preprint arXiv:1803.04497, 2018.
  • [28] H. Eakin and A. L. Luers, “Assessing the vulnerability of social-environmental systems,” Annu. Rev. Environ. Resour., vol. 31, no. 1, pp. 365–394, 2006.
  • [29] A. Anwar, A. Abusnaina, S. Chen, F. Li, and D. Mohaisen, “Cleaning the nvd: Comprehensive quality assessment, improvements, and analyses,” IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 6, pp. 4255–4269, 2021.
  • [30] A. Koubaa, “Gpt-4 vs. gpt-3.5: A concise showdown,” 2023.
  • [31] S. N. Akter, Z. Yu, A. Muhamed, T. Ou, A. Bäuerle, Á. A. Cabrera, K. Dholakia, C. Xiong, and G. Neubig, “An in-depth look at gemini’s language abilities,” arXiv preprint arXiv:2312.11444, 2023.
  • [32] M. A. K. Raiaan, M. S. H. Mukta, K. Fatema, N. M. Fahad, S. Sakib, M. M. J. Mim, J. Ahmad, M. E. Ali, and S. Azam, “A review on large language models: Architectures, applications, taxonomies, open issues and challenges,” IEEE Access, 2024.
  • [33] C. D. Manning, “Introduction to information retrieval,” 2008.
  • [34] G. Yenduri, M. Ramalingam, G. C. Selvi, Y. Supriya, G. Srivastava, P. K. R. Maddikunta, G. D. Raj, R. H. Jhaveri, B. Prabadevi, W. Wang, et al., “Gpt (generative pre-trained transformer)–a comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions,” IEEE Access, 2024.
  • [35] S. S. Biswas, “Chatgpt for research and publication: a step-by-step guide,” The Journal of Pediatric Pharmacology and Therapeutics, vol. 28, no. 6, pp. 576–584, 2023.
  • [36] J. Liljedahl, “Development and evaluation of an ai-powered virtual coach for an e-health tool for fall prevention: Open-source vs closed-source llms in e-health applications,” 2024.
  • [37] S. Saha Roy and S. Nilizadeh, “Utilizing large language models to optimize the detection and explainability of phishing websites,” arXiv e-prints, pp. arXiv–2408, 2024.
  • [38] OpenAI, “Api pricing information,” 2024. Accessed: October 17, 2024.
  • [39] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online: Association for Computational Linguistics, 2020.
  • [40] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [41] C. Wu, P. Qiu, J. Liu, H. Gu, N. Li, Y. Zhang, Y. Wang, and W. Xie, “Towards evaluating and building versatile large language models for medicine,” arXiv preprint arXiv:2408.12547, 2024.
  • [42] C.-H. Lin, S. Gao, J. S. Smith, A. Patel, S. Tuli, Y. Shen, H. Jin, and Y.-C. Hsu, “Modegpt: Modular decomposition for large language model compression,” arXiv preprint arXiv:2408.09632, 2024.
  • [43] G. Zhang, Q. Jin, Y. Zhou, S. Wang, B. R. Idnay, Y. Luo, E. Park, J. G. Nestor, M. E. Spotnitz, A. Soroush, et al., “Closing the gap between open-source and commercial large language models for medical evidence summarization,” arXiv preprint arXiv:2408.00588, 2024.
  • [44] C. Li, M. Chen, J. Wang, S. Sitaram, and X. Xie, “Culturellm: Incorporating cultural differences into large language models,” arXiv preprint arXiv:2402.10946, 2024.
  • [45] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.
  • [46] Y. Han, Z. Li, Z. Zhao, and T. Braun, “Crowdbert: Crowdsourcing indoor positioning via semi-supervised bert with masking,” IEEE Internet of Things Journal, 2024.
  • [47] G. AI, “Ai pricing information,” 2024. Accessed: October 17, 2024.
  • [48] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili, et al., “Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects,” Authorea Preprints, 2023.
  • [49] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024.
  • [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [51] I. Ahmed, A. Roy, M. Kajol, U. Hasan, P. P. Datta, and M. R. Reza, “Chatgpt vs. bard: a comparative study,” Authorea Preprints, 2023.
  • [52] M. A. Haq, M. A. R. Khan, and M. Alshehri, “Insider threat detection based on nlp word embedding and machine learning,” Intell. Autom. Soft Comput, vol. 33, no. 1, pp. 619–635, 2022.
  • [53] H. Ahmad, F. Ullah, and R. Jafri, “A survey on immersive cyber situational awareness systems,” arXiv preprint arXiv:2408.07456, 2024.
  • [54] B. K. Verma and A. K. Yadav, “Software security with natural language processing and vulnerability scoring using machine learning approach,” Journal of Ambient Intelligence and Humanized Computing, pp. 1–11, 2024.
  • [55] F. Abbas and H. Ahmad, “Robust partial least squares using low rank and sparse decomposition,” arXiv preprint arXiv:2407.06936, 2024.
  • [56] G. Cimmino, “Large language models in cybersecurity: Digital defense and ethical challenge,” Authorea Preprints, 2024.
  • [57] F. Abbas, V. McClelland, Z. Cvetkovic, and W. Dai, “Ss-admm: Stationary and sparse granger causal discovery for cortico-muscular coupling,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, IEEE, 2023.
  • [58] F. Abbas, V. McClelland, Z. Cvetkovic, and W. Dai, “Dlgc: Dictionary learning based granger causal discovery for cortico-muscular coupling,” in 2024 32nd European Signal Processing Conference (EUSIPCO), pp. 1746–1750, IEEE, 2024.
  • [59] OpenAI, “Gpt-4 turbo: Advancing cost-efficient intelligence,” 2024. Accessed: 2024-11-01.
  • [60] DataCamp, “Exploring gpt-4 turbo: What’s new and how it works,” 2024. Accessed: 2024-11-01.
  • [61] F. Abbas, V. McClelland, Z. Cvetkovic, and W. Dai, “Stationary and sparse denoising approach for corticomuscular causality estimation,” arXiv preprint arXiv:2406.16692, 2024.
  • [62]
  • [63] B. D. Menz, N. M. Kuderer, S. Bacchi, N. D. Modi, B. Chin-Yee, T. Hu, C. Rickard, M. Haseloff, A. Vitry, R. A. McKinnon, et al., “Current safeguards, risk mitigation, and transparency measures of large language models against the generation of health disinformation: repeated cross sectional analysis,” bmj, vol. 384, 2024.
  • [64] G. AI, “Google gemini: Next-generation model,” 2024. Accessed: October 17, 2024.
  • [65] M. T. Pilehvar and J. Camacho-Collados, Embeddings in natural language processing: Theory and advances in vector representations of meaning. Morgan & Claypool Publishers, 2020.
  • [66] J. Ramos et al., “Using tf-idf to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, vol. 242, pp. 29–48, Citeseer, 2003.
  • [67] S. Qaiser and R. Ali, “Text mining: use of tf-idf to examine the relevance of words to documents,” International Journal of Computer Applications, vol. 181, no. 1, pp. 25–29, 2018.
  • [68] W. Zhang, T. Yoshida, and X. Tang, “A comparative study of tf* idf, lsi and multi-words for text classification,” Expert systems with applications, vol. 38, no. 3, pp. 2758–2765, 2011.
  • [69] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [70] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  • [71] R. Ahuja, A. Chug, S. Kohli, S. Gupta, and P. Ahuja, “The impact of features extraction on the sentiment analysis,” Procedia Computer Science, vol. 152, pp. 341–348, 2019.
  • [72] J. Ramos et al., “Using tf-idf to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, vol. 242, pp. 29–48, Citeseer, 2003.
  • [73] N. Van Linh, N. K. Anh, K. Than, and C. N. Dang, “An effective and interpretable method for document classification,” Knowledge and Information Systems, vol. 50, pp. 763–793, 2017.
  • [74] S. Zaware, D. Patadiya, A. Gaikwad, S. Gulhane, and A. Thakare, “Text summarization using tf-idf and textrank algorithm,” in 2021 5th International conference on trends in electronics and informatics (ICOEI), pp. 1399–1407, IEEE, 2021.
  • [75] H. Ahmad, C. Treude, M. Wagner, and C. Szabo, “Towards resource-efficient reactive and proactive auto-scaling for microservice architectures,” Available at SSRN 4918202, 2024.
  • [76] T. B. Brown, “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
  • [77] D. Bahdanau, T. Bosc, S. Jastrzebski, E. Grefenstette, P. Vincent, and Y. Bengio, “Learning to compute word embeddings on the fly,” arXiv preprint arXiv:1706.00286, 2017.
  • [78] J. D’Souza, “A review of transformer models,”
  • [79] P. Přibáň, J. Šmíd, J. Steinberger, and A. Mištera, “A comparative study of cross-lingual sentiment analysis,” Expert Systems with Applications, vol. 247, p. 123247, 2024.
  • [80] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [81] OpenAI, “Gpt-4o mini: Advancing cost-efficient intelligence,” 2024. Accessed: October 17, 2024.
  • [82] H. Feng, K. Rough, P. B. Milligan, F. Tombini, T. Kwon, K. Zine El Abidine, C. D. Mack, and B. Hughes, “How well it works: Benchmarking performance of gpt models on medical natural language processing tasks,” medRxiv, pp. 2024–06, 2024.
  • [83] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
  • [84] M. Lathkar, High-Performance Web Apps with FastAPI. Springer, 2023.
  • [85] C. Gackenheimer, Introduction to React. Apress, 2015.
  • [86] A. Fedosejev, React. js essentials. Packt Publishing Ltd, 2015.
  • [87] OpenAI, “Gpt-4o mini: Advancing cost-efficient intelligence,” 2024. Accessed: October 17, 2024.
  • [88] B. Liu, “Comparing pruned llama 3 and llama 2 models for on-device ai assistants,” 2024.
  • [89] G. AI, “Google gemini: Next-generation model,” 2024. Accessed: October 17, 2024.
  • [90] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
  • [91] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024.
  • [92] Replication Package of ChatNVD. https://github.com/Shivansh1313/ChatNVD-Research.