Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks
(This working paper is an ongoing research project, and feedback is greatly appreciated.)
Abstract
Recently, large language models (LLMs) like ChatGPT have shown impressive zero-shot performance on many natural language processing tasks. In this paper, we investigate the effectiveness of zero-shot LLMs in the financial domain. We compare the performance of ChatGPT and several open-source generative LLMs in zero-shot mode with RoBERTa fine-tuned on annotated data. We address three inter-related research questions on data annotation, performance gaps, and the feasibility of employing generative models in the finance domain. Our findings demonstrate that ChatGPT performs well even without labeled data, but fine-tuned models generally outperform it. Our research also highlights how annotating with generative models can be time-intensive. Our codebase is publicly available on GitHub under the CC BY-NC 4.0 license (the code is available on the FinTech Lab GitHub).
1 Introduction
On November 30th, 2022, OpenAI released ChatGPT (https://chat.openai.com/) and intrigued the world with its capabilities. Within a few months of its release, researchers started testing its zero-shot capabilities for financial domain tasks. Shah et al. (2023a) and Hansen and Kazinnik (2023) demonstrated the use of ChatGPT to decode communications from the Federal Open Market Committee (FOMC) and how it can be used to understand financial markets. Saggu and Ante (2023) present evidence that ChatGPT is already influencing AI-related crypto assets. Despite these impressive prompt-based demonstrations, it is still important to further understand ChatGPT's capabilities across various NLP tasks in the financial domain.
Pikuliak (2023) conducted a survey of the linguistic capabilities of ChatGPT in solving natural language processing (NLP) tasks and found that ChatGPT outperforms fine-tuned models on only 22.5% (34 out of 151) of tasks. Qin et al. (2023) empirically analyze the zero-shot learning ability of ChatGPT across 20 NLP datasets, highlighting its strengths (e.g., arithmetic reasoning) and limitations (e.g., sequence tagging), and provide qualitative case studies for further analysis. Although researchers have studied the capabilities of ChatGPT on general-domain NLP tasks, no comparable work exists for financial NLP. To fill this gap, we benchmark the zero-shot performance of ChatGPT and compare it with fine-tuned RoBERTa on various financial NLP tasks. Unlike previous work that reports fine-tuning numbers from existing studies, we run all experiments ourselves.
The recent op-ed by Rogers et al. (2023) argues that closed AI models like ChatGPT make bad baselines and that such models should not be a requisite baseline in scientific work. To provide open-source alternatives to ChatGPT, several organizations have released competing LLMs. Databricks fine-tuned the pythia-12b (EleutherAI, 2023) model on approximately 15k instruction fine-tuning records generated by Databricks employees and released dolly-v2-12b (Databricks, 2023). Following the release of dolly-v2-12b, H2O.ai released h2ogpt-oasst1-512-12b (H2O.ai, 2023a), developed by fine-tuning pythia-12b on an open-source instruct-type dataset (H2O.ai, 2023b). To understand where these open-source models stand in comparison with ChatGPT, we also benchmark them on all the tasks in our study. To assess how well these instruction-following models actually follow instructions, we also report the percentage of test samples on which they fail to do so.
In the last few years, language models have scaled up dramatically: BERT-large (Devlin et al., 2018) has 345 million parameters, LLaMA (Touvron et al., 2023) has up to 65 billion parameters, GPT-3 (Brown et al., 2020) has 175 billion parameters, and GPT-4 is rumored to have a few trillion parameters. For a comprehensive review of LLMs, we direct the reader to the extensive surveys by Mialon et al. (2023) and Zhao et al. (2023). Given that latency is of high importance in many financial applications, we compare not only model performance but also the time it takes to label one sentence. This also helps us understand how feasible these models are for research projects that require labeling large datasets.
In summary, through this work we try to answer the following research questions:
• RQ1: For financial domain tasks, is it better to annotate data and fine-tune a medium-sized model than to use a generative LLM with a zero-shot prompt?
• RQ2: What is the performance gap between closed models like ChatGPT and open-source LLMs on financial domain tasks?
• RQ3: Is it feasible to employ generative LLMs for research when the text data is large?
In order to answer these questions, we use four financial NLP tasks in our study and benchmark various models on all of them. We employ RoBERTa-base and RoBERTa-large models for fine-tuning benchmarks while using ChatGPT-3.5-Turbo, Dolly-V2-12B, and H2O-12B as zero-shot models.
Key insights
To the best of our knowledge, this is the first study that not only evaluates how well ChatGPT performs zero-shot on multiple NLP tasks in the financial domain but also compares it with other open-source generative LLMs and fine-tuned PLMs (throughout the paper, we refer to earlier models built on the BERT architecture as PLMs and to the latest generative models similar to GPT as LLMs). The key takeaways are summarized as follows:
• Even though zero-shot ChatGPT fails to outperform fine-tuned PLMs, it provides impressive performance across all the tasks without having access to any labeled data.
• The performance gap between fine-tuned PLMs and ChatGPT is larger on the dataset that is not yet publicly available. It would be an interesting future study to investigate plausible contamination issues as a potential explanation.
• The performance of fully open-source LLMs on all financial tasks is significantly lower than that of ChatGPT.
• In certain scenarios, even if a user is willing to accept the performance difference between zero-shot LLMs and fine-tuned PLMs, the time required to assign labels to data is roughly 1,000 times greater when using generative LLMs.
2 Datasets and Tasks
We include the hawkish-dovish sequence classification task from Shah et al. (2023a), the financial sentiment analysis task from Malo et al. (2014), the financial numerical claim detection task from Shah et al. (2022a), and the named entity recognition dataset from Shah et al. (2023b). A summary of the datasets used, with the train-validation-test splits, is provided in Table 1.
Task | Source | Train | Valid | Test
Hawkish-Dovish-Neutral Classification | Shah et al. (2023a) | 1587 | 397 | 496
Sentiment Classification | Malo et al. (2014) | 1449 | 362 | 453
Claim Detection | Shah et al. (2022a) | 1715 | 429 | 537
Named Entity Recognition | Shah et al. (2023b) | 80.5k | 10.2k | 26k
Table 1: Summary of the datasets used, with train-validation-test splits.
2.1 FOMC Communication
Understanding and deciphering the monetary policy stance of central banks is an important NLP task for better understanding financial markets. Several studies (Rozkrut et al., 2007; Tobback et al., 2017; Hansen et al., 2018; Cieslak et al., 2019; Tsukioka and Yamasaki, 2020; Bennani et al., 2020; Shah et al., 2023a) have found that the words used by central banks have an impact on the market, although the extent of this impact varies with the communication style and the chair. Recent work by Hansen and Kazinnik (2023) uses ChatGPT to decode FOMC post-meeting statements. To understand the capabilities of LLMs in decoding central bank communications, we use open-source labeled data that combines meeting minutes, press conference transcripts, and speeches (Shah et al., 2023a). In this dataset, each sentence is assigned one of three labels (hawkish, dovish, or neutral), which we treat as a sequence classification task.
2.2 Sentiment Analysis
Sentiment analysis is a popular task in the financial domain as it correlates with market sentiment, which influences price movements. In the finance literature, the sentiment dictionary developed by Loughran and McDonald (2011) has been used extensively for bag-of-words sentiment analysis. Garcia et al. (2023) refine the dictionary and claim it to be the SOTA in the finance literature. Meanwhile, the computational linguistics literature has grown rapidly over the last decade. After the introduction of sentence-level financial sentiment data by Malo et al. (2014), many models have been trained for sentiment analysis. Soon after the release of BERT (Devlin et al., 2018), FinBERT (Araci, 2019) was developed for financial sentiment analysis. To understand how well generative LLMs can do on this important task, we use the Financial Phrasebank sentiment analysis data (https://huggingface.co/datasets/financial_phrasebank) developed by Malo et al. (2014) for the financial sentiment classification task. The dataset has multiple versions; we use the subset with 100% annotator agreement and perform three-class (positive, negative, and neutral) sequence classification.
2.3 Numerical Claim Detection
Extraction of numerical claims from financial text such as analysts' reports, earnings calls, and news is helpful in forecasting stock price volatility. For the claim detection task, we use a dataset developed by Shah et al. (2022a). The dataset contains binary labels ("in-claim" and "out-of-claim") for sentences extracted from a heterogeneous set of analysts' reports. Here, the term "in-claim" refers to sentences in the financial domain that contain specific and measurable financial claims rather than factual statements. For instance, the sentence "Operating income is expected to be between $2.1 billion and $3.6 billion" is considered an "in-claim" sentence because it predicts a future outcome. On the other hand, the sentence "Revenues increased by 48.6% compared to the previous year, reaching $5.44 billion, primarily due to the expansion of the customer base" is categorized as "out-of-claim" since it presents factual information about the past.
2.4 Named Entity Recognition
Named Entity Recognition (NER) involves the identification and classification of named entities in text, such as person names, organization names, and locations, among others. Given a sentence with N tokens and a set of entity types denoted as S with a size of #S, the NER model assigns an entity label to each token. Following the BIO convention, the label set L has a size of 2 × #S + 1 and includes "O" to indicate tokens that do not belong to any category in the entity list. NER plays a crucial role in financial NLP due to its significant implications for various applications in this domain. NER enables the extraction and categorization of key financial entities such as company names, person names, and locations. This facilitates the identification of important entities involved in financial news and reports, helping analysts and investors gain insights into market trends, company performance, and potential investment opportunities. For instance, NER can aid in tracking mentions of specific companies in news articles, social media posts, and financial reports, enabling the monitoring of market sentiment and the detection of emerging patterns that may impact stock prices. Therefore, the accurate identification and classification of named entities through NER is crucial for extracting meaningful information and enhancing decision-making processes in financial NLP. For financial NER, we use the dataset developed in Shah et al. (2023b).
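As an illustration with the three entity types used in this dataset (the exact label strings below follow the convention of our NER prompt in Section 3.2 and are illustrative):

```latex
S = \{\text{Person}, \text{Location}, \text{Organisation}\}, \qquad \#S = 3,
L = \{\text{Person\_B}, \text{Person\_I}, \text{Location\_B}, \text{Location\_I},
     \text{Organisation\_B}, \text{Organisation\_I}, \text{O}\}, \qquad |L| = 2 \cdot \#S + 1 = 7.
```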
3 Experiments
We run all the experiments for this paper ourselves and do not report numbers from any prior work. We use 3 different seeds to split the datasets into train and test parts, except for FiNER-ORD, which already comes with a train-validation-test split.
3.1 Fine-Tuning PLM
In order to set the benchmark, we use the base ("roberta-base") and large ("roberta-large") versions of the RoBERTa (Liu et al., 2019) model. No additional pre-training is conducted before fine-tuning. To determine the optimal hyper-parameters for each model, we perform a grid search over four learning rates (1e-4, 1e-5, 1e-6, 1e-7) and four batch sizes (32, 16, 8, 4). We train for a maximum of 100 epochs with an early stopping criterion: if the validation F1 score does not improve by at least 1e-2 within the next 7 epochs, we stop training and use the best model stored so far as the final fine-tuned model. The experiments are carried out using PyTorch (Paszke et al., 2019) on an NVIDIA RTX A6000 GPU. Each model is initialized from the pre-trained version available in the Huggingface Transformers library (Wolf et al., 2020).
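A minimal sketch of this grid search with F1-based early stopping, using the Huggingface Trainer, is shown below. It is illustrative rather than the exact training script used in this paper: the dataset objects (train_ds, val_ds) and NUM_LABELS are placeholders, and details such as tokenization and checkpointing are simplified.

```python
# Sketch: grid search over learning rates and batch sizes with early stopping
# on the validation weighted F1 score (patience 7, improvement threshold 1e-2).
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

MODEL_NAME = "roberta-base"   # or "roberta-large"
NUM_LABELS = 3                # e.g., hawkish / dovish / neutral (placeholder)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="weighted")}

best_f1, best_config = -1.0, None
for lr in (1e-4, 1e-5, 1e-6, 1e-7):
    for bs in (32, 16, 8, 4):
        model = AutoModelForSequenceClassification.from_pretrained(
            MODEL_NAME, num_labels=NUM_LABELS)
        args = TrainingArguments(
            output_dir=f"out_lr{lr}_bs{bs}",
            learning_rate=lr,
            per_device_train_batch_size=bs,
            num_train_epochs=100,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="f1",
        )
        trainer = Trainer(
            model=model, args=args,
            train_dataset=train_ds, eval_dataset=val_ds,  # placeholder datasets
            compute_metrics=compute_metrics,
            # stop if validation F1 does not improve by >= 1e-2 for 7 epochs
            callbacks=[EarlyStoppingCallback(early_stopping_patience=7,
                                             early_stopping_threshold=0.01)],
        )
        trainer.train()
        f1 = trainer.evaluate()["eval_f1"]
        if f1 > best_f1:
            best_f1, best_config = f1, (lr, bs)
```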
3.2 Zero-Shot with Generative LLMs
In the generative LLM category, we utilize ChatGPT ("gpt-3.5-turbo") with specific settings including a maximum token limit of 1000 and a temperature value of 0.0. As for the open-source LLM category, we employ "dolly-v2-12b" and "h2ogpt-oasst1-512-12b" along with their dedicated text generation pipelines and models, which can be found on their respective Huggingface pages.
For each task, we design separate prompts and write functions that extract labels from the prompt output. As zero-shot learning does not require any labeled data for training, we use only the test split for prompting. The prompt template for each task is discussed below.
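For illustration, the sketch below shows how such a prompt-and-parse step could look for the sequence classification tasks, using the gpt-3.5-turbo settings above; the function name, parsing heuristic, and usage example are assumptions rather than the exact code used here.

```python
# Sketch: send one sentence at a time to gpt-3.5-turbo with a task prompt and
# read the label from the first line of the response.
import openai  # assumes the pre-1.0 OpenAI client with OPENAI_API_KEY configured

def zero_shot_label(prompt_template, sentence, valid_labels):
    """Query gpt-3.5-turbo with a task prompt and parse the first output line as the label."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=1000,
        temperature=0.0,
        messages=[{"role": "user",
                   "content": prompt_template.format(sentence=sentence)}],
    )
    lines = response["choices"][0]["message"]["content"].strip().splitlines()
    first_line = lines[0].upper() if lines else ""
    for label in valid_labels:
        if label in first_line:
            return label
    return -1  # instruction not followed; such instances count towards "Missing %"

# Illustrative usage with the sentiment prompt given below:
# zero_shot_label(SENTIMENT_PROMPT, "Profit rose 20% year over year.",
#                 ["NEGATIVE", "POSITIVE", "NEUTRAL"])
```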
FOMC Communication
We use the following zero-shot prompt for hawkish-dovish-neutral classification:
“Discard all the previous instructions. Behave like you are an expert sentence classifier. Classify the following sentence from FOMC into ‘HAWKISH’, ‘DOVISH’, or ‘NEUTRAL’ class. Label ‘HAWKISH’ if it is corresponding to tightening of the monetary policy, ‘DOVISH’ if it is corresponding to easing of the monetary policy, or ‘NEUTRAL’ if the stance is neutral. Provide the label in the first line and provide a short explanation in the second line. The sentence: {sentence}”
Sentiment Analysis
We use the following zero-shot prompt for sentiment classification:
“Discard all the previous instructions. Behave like you are an expert sentence sentiment classifier. Classify the following sentence into ‘NEGATIVE’, ‘POSITIVE’, or ‘NEUTRAL’ class. Label ‘NEGATIVE’ if it is corresponding to negative sentiment, ‘POSITIVE’ if it is corresponding to positive sentiment, or ‘NEUTRAL’ if the sentiment is neutral. Provide the label in the first line and provide a short explanation in the second line. The sentence: {sentence}”
Numerical Claim Detection
We use the following zero-shot prompt for numerical claim detection:
“Discard all the previous instructions. Behave like you are an expert sentence sentiment classifier. Classify the following sentence into ‘INCLAIM’, or ‘OUTOFCLAIM’ class. Label ‘INCLAIM’ if consist of a claim and not just factual past or present information, or ‘OUTOFCLAIM’ if it has just factual past or present information. Provide the label in the first line and provide a short explanation in the second line. The sentence: {sentence}”
Named Entity Recognition
We use the following zero-shot prompt for named entity recognition:
“Discard all the previous instructions. Behave like you are an expert named entity identifier. Below a sentence is tokenized and each line contains a word token from the sentence. Identify ‘Person’, ‘Location’, and ‘Organisation’ from them and label them. If the entity is multi token use post-fix _B for the first label and _I for the remaining token labels for that particular entity. The start of the separate entity should always use _B post-fix for the label. If the token doesn’t fit in any of those three categories or is not a named entity label it ‘Other’. Do not combine words yourself. Use a colon to separate token and label. So the format should be token:label. \n\n {word tokens separated by \n}”
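Decoding this token:label output back into per-token labels requires additional post-processing. The sketch below is a simplified, hypothetical parser (not the exact function used here): output lines are aligned positionally with the input tokens, and anything that cannot be mapped to a valid label is treated as missing.

```python
# Simplified, hypothetical parser for the token:label output of the NER prompt.
VALID_NER_LABELS = {"Person_B", "Person_I", "Location_B", "Location_I",
                    "Organisation_B", "Organisation_I", "Other"}

def parse_ner_output(response_text, tokens):
    """Return one label per input token; None marks tokens with missing/invalid labels."""
    labeled_lines = [line for line in response_text.strip().splitlines()
                     if ":" in line]
    labels = []
    for i, _token in enumerate(tokens):
        if i < len(labeled_lines):
            label = labeled_lines[i].rsplit(":", 1)[1].strip()
            labels.append(label if label in VALID_NER_LABELS else None)
        else:
            labels.append(None)  # model produced fewer lines than input tokens
    return labels
```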
4 Results
In this section, we benchmark and evaluate all the models and tasks discussed in the previous sections. For performance analysis, we report the mean and standard deviation of the weighted F1 score. Along with performance, we also report the average time each model takes to label a sentence in the test dataset. For fine-tuned PLMs, we additionally report the fine-tuning time for the best hyperparameter setting. For generative LLMs, in many cases the model does not follow the instruction; we assign the label "-1" to those instances (which gives them zero weight when calculating the weighted F1 score) and report the percentage of such instances as "Missing %".
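As a small illustration (not the exact evaluation code used here) of why the "-1" label carries zero weight: it never occurs among the true labels, so its class support is zero under the weighted average, while the affected instances still count against the recall of their true classes.

```python
# Weighted F1 with unparseable predictions mapped to -1: the -1 class has zero
# support among the true labels, so it receives zero weight in the average.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, -1, 1, -1]  # two responses could not be parsed into a valid label
print(f1_score(y_true, y_pred, average="weighted"))
```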
4.1 FOMC Communication
Performance and other metrics for the FOMC tone classification task are reported in Table 2. Fine-tuned RoBERTa-base and RoBERTa-large have similar performance, and both outperform the zero-shot LLMs. In the zero-shot LLM category, ChatGPT outperforms the open-source LLMs by a good margin. ChatGPT also follows the instruction 100% of the time, while Dolly fails to follow instructions 6.05% of the time and H2O 42.14% of the time. We ran all the code before we made the data for this task public as part of our other work (Shah et al., 2023a) to ensure that there is no contamination issue. The fine-tuning results differ slightly from those reported in Shah et al. (2023a) because we employ a different early stopping mechanism. Even if performance is not an issue for a particular use case, the time it takes for a generative LLM to annotate the data is roughly 1,000 times higher compared to fine-tuned PLMs. A latency of a few seconds is not useful when markets might adjust prices within a few milliseconds.
Model | Fine-Tuning Time | Test Labeling Time | F1 Score mean (std) | Missing %
Panel A: Fine-Tuning with PLM
RoBERTa-base | 5.85 minutes | 3.85 milliseconds | 0.6990 (0.0182) | -
RoBERTa-large | 18.11 minutes | 9.87 milliseconds | 0.6977 (0.0110) | -
Panel B: Zero-Shot with Generative LLM
ChatGPT-3.5-Turbo | - | **3.91 seconds | 0.5837 (0.0155) | 0.00%
Dolly-V2-12B | - | 5.89 seconds | 0.1195 (0.0100) | 6.05%
H2O-12B | - | 29.64 seconds | 0.0915 (0.0105) | 42.14%
Table 2: Results for the FOMC communication (hawkish-dovish-neutral) classification task.
4.2 Sentiment Analysis
For the sentiment analysis task, performance and other metrics are reported in Table 3. The results follow a similar trend, but for this task the H2O model does not follow the instruction at all, which is surprising. It would be an interesting future study to understand why this is the case and how open-source LLMs can be improved along this dimension. ChatGPT achieves an impressive F1 score close to 0.9 for sentiment analysis, which is not far from the performance of fine-tuned RoBERTa.
Model | Fine-Tuning Time | Test Labeling Time | F1 Score mean (std) | Missing %
Panel A: Fine-Tuning with PLM
RoBERTa-base | 4.89 minutes | 2.62 milliseconds | 0.9735 (0.0041) | -
RoBERTa-large | 6.50 minutes | 4.50 milliseconds | 0.9757 (0.0077) | -
Panel B: Zero-Shot with Generative LLM
ChatGPT-3.5-Turbo | - | **4.11 seconds | 0.8929 (0.0078) | 0.00%
Dolly-V2-12B | - | 6.93 seconds | 0.1070 (0.0132) | 3.16%
H2O-12B | - | 54.66 seconds | 0.0000 (0.0000) | 100.00%
Table 3: Results for the financial sentiment classification task.
4.3 Numerical Claim Detection
Performance and other metrics for the numerical claim detection task are reported in Table 4. The gap between fine-tuned PLMs and ChatGPT is larger here than for the sentiment analysis task, while the gap between Dolly and ChatGPT is smaller. The exact reason is uncertain, but one possibility is contamination, considering that the sentiment analysis dataset is publicly available while the claim dataset is not. Exploring potential contamination issues as an explanation for this discrepancy would be an intriguing avenue for future research. The other results follow a similar trend.
Model | Fine-Tuning Time | Test Labeling Time | F1 Score mean (std) | Missing %
Panel A: Fine-Tuning with PLM
RoBERTa-base | 2.44 minutes | 1.60 milliseconds | 0.9615 (0.0091) | -
RoBERTa-large | 10.41 minutes | 6.31 milliseconds | 0.9642 (0.0069) | -
Panel B: Zero-Shot with Generative LLM
ChatGPT-3.5-Turbo | - | **5.51 seconds | 0.8136 (0.0079) | 0.00%
Dolly-V2-12B | - | 5.66 seconds | 0.5250 (0.0203) | 5.59%
H2O-12B | - | 54.97 seconds | 0.0536 (0.0054) | 42.89%
Table 4: Results for the numerical claim detection task.
4.4 Named Entity Recognition
Table 5 presents the performance metrics for the Named Entity Recognition (NER) task. Similar to the previous tasks, the results exhibit a consistent pattern. However, the NER prompt proves to be more challenging compared to other tasks, leading both open-source models to struggle in adhering to the given instruction. Moreover, decoding labels from the prompt output is relatively more difficult in token classification compared to sequence classification tasks, resulting in a higher percentage of samples with missing labels for generative LLMs in the NER task.
Model | Fine-Tuning Time | Test Labeling Time | F1 Score mean (std) | Missing %
Panel A: Fine-Tuning with PLM
RoBERTa-base | 14.22 minutes | 3.37 milliseconds | 0.9696 (0.0060) | -
RoBERTa-large | 29.04 minutes | 10.57 milliseconds | 0.9754 (0.0047) | -
Panel B: Zero-Shot with Generative LLM
ChatGPT-3.5-Turbo | - | **9.49 seconds | 0.8509 | 17.59%
Dolly-V2-12B | - | 6.59 seconds | 0.0023 | 99.88%
H2O-12B | - | 44.82 seconds | 0.0056 | 99.67%
Table 5: Results for the named entity recognition task.
5 Conclusion
In conclusion, we investigate the effectiveness of using ChatGPT in zero-shot mode compared to other open-source generative LLMs and fine-tuned PLMs in the financial domain. The key findings and implications derived from the research are as follows:
Firstly, regarding RQ1, we observe that while fine-tuned PLMs generally outperform zero-shot ChatGPT, the latter still demonstrates impressive performance across various tasks without the need for labeled data. This indicates that ChatGPT has the potential to be a “hero” for financial domain tasks even without explicit fine-tuning.
Secondly, in relation to RQ2, a notable performance gap exists between fine-tuned PLMs and ChatGPT, particularly in cases where the dataset is not publicly available. This discrepancy could be attributed to potential contamination issues within the data.
Additionally, concerning RQ3, we examine the feasibility of employing generative LLMs for research purposes when dealing with large volumes of textual data requiring annotation. Our findings indicate that, even if the performance gap between zero-shot LLMs and fine-tuned PLMs is acceptable, the time required to label a single sample using generative LLMs is significantly higher, potentially by a factor of 1,000.
Overall, to our knowledge, this is the first paper to evaluate the performance of ChatGPT in zero-shot mode across multiple NLP tasks and compare it with other open-source LLMs and fine-tuned PLMs in the financial domain. Our results emphasize the potential and limitations of ChatGPT, highlighting the trade-offs between performance, availability of labeled data, and labeling efficiency. Future research in this area could explore strategies to mitigate contamination issues in closed models and address the labeling time challenges posed by generative LLMs.
Limitations and Future Work
The range of tasks used here is not exhaustive, and we plan to keep adding more tasks from the Financial Language Understanding Evaluation (FLUE) benchmark (Shah et al., 2022b) and other sources. In future drafts, we also plan to add GPT-4, Alpaca, MPT, and other models. In this study, we do not benchmark few-shot prompting, as the focus is on understanding how far zero-shot models are from fine-tuned models. We also do not include or discuss finance domain-specific LLMs like BloombergGPT (Wu et al., 2023), as we have no way of accessing it.
For all the experiments involving ChatGPT, we use API calls, which limits our ability to accurately compare the time it takes to label a sentence. All other models run on the same hardware with an NVIDIA RTX A6000 48GB GPU, whereas we do not know what hardware or GPU OpenAI uses. We also note that this study neither tries to understand biases nor examines contamination issues with LLMs. It instead asks: if there is no contamination issue, where do current open and closed LLMs stand compared to fine-tuned models when it comes to solving a task in the financial domain?
Acknowledgements
We appreciate the generous infrastructure support provided by Georgia Tech's Office of Information Technology, especially Robert Griffin. We would also like to thank Agoston Reguly and Arnav Hiray for their comments and suggestions.
References
- Araci (2019) Dogu Araci. 2019. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063.
- Bennani et al. (2020) Hamza Bennani, Nicolas Fanta, Pavel Gertler, and Roman Horvath. 2020. Does central bank communication signal future monetary policy in a (post)-crisis era? the case of the ecb. Journal of International Money and Finance, 104:102167.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Cieslak et al. (2019) Anna Cieslak, Adair Morse, and Annette Vissing-Jorgensen. 2019. Stock returns over the fomc cycle. The Journal of Finance, 74(5):2201–2248.
- Databricks (2023) Databricks. 2023. Databricks’ dolly-v2-12b, an instruction-following large language model. https://huggingface.co/databricks/dolly-v2-12b.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- EleutherAI (2023) EleutherAI. 2023. Pythia scaling suite. https://huggingface.co/EleutherAI/pythia-12b.
- Garcia et al. (2023) Diego Garcia, Xiaowen Hu, and Maximilian Rohrer. 2023. The colour of finance words. Journal of Financial Economics, 147(3):525–549.
- H2O.ai (2023a) H2O.ai. 2023a. H2o.ai’s h2ogpt-oasst1-512-12b, a 12 billion parameter instruction-following large language model. https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b.
- H2O.ai (2023b) H2O.ai. 2023b. H2o.ai’s openassistant_oasst1_h2ogpt_graded, an open-source instruct-type dataset for fine-tuning of large language models. https://huggingface.co/datasets/h2oai/openassistant_oasst1_h2ogpt_graded.
- Hansen and Kazinnik (2023) Anne Lundgaard Hansen and Sophia Kazinnik. 2023. Can chatgpt decipher fedspeak? Available at SSRN.
- Hansen et al. (2018) Stephen Hansen, Michael McMahon, and Andrea Prat. 2018. Transparency and deliberation within the fomc: a computational linguistics approach. The Quarterly Journal of Economics, 133(2):801–870.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Loughran and McDonald (2011) Tim Loughran and Bill McDonald. 2011. When is a liability not a liability? textual analysis, dictionaries, and 10-ks. The Journal of finance, 66(1):35–65.
- Malo et al. (2014) Pekka Malo, Ankur Sinha, Pekka Korhonen, Jyrki Wallenius, and Pyry Takala. 2014. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65(4):782–796.
- Mialon et al. (2023) Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
- Pikuliak (2023) Matúš Pikuliak. 2023. Chatgpt survey: Performance on nlp datasets. https://www.opensamizdat.com/posts/chatgpt_survey.
- Qin et al. (2023) Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
- Rogers et al. (2023) Anna Rogers, Niranjan Balasubramanian, Leon Derczynski, Jesse Dodge, Alexander Koller, Sasha Luccioni, Maarten Sap, Roy Schwartz, Noah A. Smith, and Emma Strubell. 2023. Closed ai models make bad baselines.
- Rozkrut et al. (2007) Marek Rozkrut, Krzysztof Rybiński, Lucyna Sztaba, and Radosław Szwaja. 2007. Quest for central bank communication: Does it pay to be “talkative”? European Journal of Political Economy, 23(1):176–206.
- Saggu and Ante (2023) Aman Saggu and Lennart Ante. 2023. The influence of chatgpt on artificial intelligence related crypto assets: Evidence from a synthetic control analysis. Finance Research Letters, page 103993.
- Shah et al. (2023a) Agam Shah, Suvan Paturi, and Sudheer Chava. 2023a. Trillion dollar words: A new financial dataset, task & market analysis. arXiv preprint arXiv:2305.07972.
- Shah et al. (2023b) Agam Shah, Ruchit Vithani, Abhinav Gullapalli, and Sudheer Chava. 2023b. Finer: Financial named entity recognition dataset and weak-supervision model. arXiv preprint arXiv:2302.11157.
- Shah et al. (2022a) Pratvi Shah, Arkaprabha Banerjee, Agam Shah, Bhaskar Chaudhury, and Sudheer Chava. 2022a. Numerical claim detection in finance: A weak-supervision approach. TechRxiv preprint 21288087.
- Shah et al. (2022b) Raj Sanjay Shah, Kunal Chawla, Dheeraj Eidnani, Agam Shah, Wendi Du, Sudheer Chava, Natraj Raman, Charese Smiley, Jiaao Chen, and Diyi Yang. 2022b. When flue meets flang: Benchmarks and large pre-trained language model for financial domain. arXiv preprint arXiv:2211.00083.
- Tobback et al. (2017) Ellen Tobback, Stefano Nardelli, and David Martens. 2017. Between hawks and doves: measuring central bank communication. SSRN.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Tsukioka and Yamasaki (2020) Yasutomo Tsukioka and Takahiro Yamasaki. 2020. The tone of the beige book and the pre-fomc announcement drift. Available at SSRN 3306011.
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- Wu et al. (2023) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564.
- Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.