To ChatGPT, or not to ChatGPT:
That is the question!
Abstract
ChatGPT has become a global sensation.
As ChatGPT and other Large Language Models (LLMs) emerge, concerns about their misuse grow, ranging from disseminating fake news and plagiarism to manipulating public opinion, cheating, and fraud.
Hence, distinguishing AI-generated text from human-generated text becomes increasingly essential. Researchers have proposed various detection methodologies, ranging from basic binary classifiers to more complex deep-learning models. Some detection techniques rely on statistical characteristics or syntactic patterns, while others incorporate semantic or contextual information to improve accuracy.
The primary objective of this study is to provide a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection. Additionally, we evaluate how well AI-generated text detection tools that make no specific claims about ChatGPT perform on ChatGPT-generated content.
For our evaluation, we curated a benchmark dataset consisting of responses from ChatGPT and humans, covering diverse questions from the medical, open Q&A, and finance domains as well as user-generated responses from popular social networking platforms. The dataset serves as a reference for assessing how well various techniques detect ChatGPT-generated content. Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
I Introduction
ChatGPT, developed by OpenAI, has garnered significant attention and sparked extensive discourse in the Natural Language Processing (NLP) community and several other fields. ChatGPT is an AI chatbot introduced by OpenAI in November 2022. It harnesses OpenAI’s LLMs belonging to the GPT-3.5 and GPT-4 families. However, ChatGPT is not a simple extension of these models: it has undergone a fine-tuning process combining supervised learning with reinforcement learning from human feedback [7, 23]. This transfer-learning approach has allowed ChatGPT to learn from existing data and optimize its performance for conversational applications, and it has facilitated ChatGPT’s exceptional performance on various challenging NLP tasks [5, 11, 27, 34, 38]. The media’s promotion of ChatGPT has triggered a chain of reactions: news and media companies use it for content creation, teachers and academics use it to prepare course objectives and goals, and individuals use it to translate content between languages. Unfortunately, as is often the case with such technologies, misuse has also ensued. Students employ it to generate their projects and coding assignments [6, 8], while scholars use it to produce papers [15]. Malicious actors use it to propagate fake news on social media platforms [19, 9], and educational institutions have employed it to provide mental health counseling to students without their consent [2], among other uses. Furthermore, ChatGPT can generate seemingly realistic stories that could deceive unsuspecting readers [4, 32]. Hence, developing an efficient detection algorithm capable of distinguishing AI-generated text, particularly ChatGPT-generated text, from human-generated text has attracted many researchers.
In general, machine-learning approaches to detecting AI-generated text fall into two types: black-box and white-box detection. Black-box detection relies on API-level access to language models, which limits its capability to detect synthetic texts [35]. It involves data collection, feature extraction, and building a classifier for detection, and includes simple classifiers such as binary logistic regression [33]. In contrast, white-box detection has full access to the language model, enabling control of the model’s behavior and traceable results [35]. It includes zero-shot detection [25, 33, 39], which leverages pre-trained generative models such as GPT-2 [33] or Grover [39], as well as pre-trained language models fine-tuned for the detection task.
A large body of research is devoted to building detectors for text generated by AI bots [8, 15, 16, 17, 20, 21, 22, 25, 29, 33, 39, 40]. Furthermore, some works claim that their AI-text detector can distinguish ChatGPT-generated text from human-generated text [1, 3, 6, 10, 12, 13, 14, 18, 26, 28, 36, 37]. On that account, our motivation is to test all the tools (generalized AI-text detectors plus detectors targeting ChatGPT-generated text) against a benchmark dataset (Section III-A) comprising ChatGPT prompts and human responses spanning different domains. We elaborate on each tool and its functionality in the following section.
Our goals in this paper are as follows:
- We explore the research conducted on AI-generated text detection, focusing on ChatGPT detection. We outline the different white-box and black-box detection schemes proposed in the literature. We also explore detection schemes in education and academic/scientific writing, and the detection tools available online.
- We evaluate the effectiveness of various tools that aim to distinguish ChatGPT-generated from human-generated responses and compare their accuracy and reliability. Our assessment includes tools that claim to detect ChatGPT prompts as well as AI-generated text detection tools that do not target ChatGPT-generated content. Our analysis reveals that the most effective online tool for detecting generated text achieves a success rate of less than 50%, as depicted in Table I.
- Our research aims to inspire further inquiry into this critical area of study and to promote the development of more effective and accurate detection methods for AI-generated text. Further, our findings underscore the importance of thorough testing and verification when assessing AI detection tools.
II Related works
This section provides an overview of current research on distinguishing AI-generated text from human-generated text. To categorize most automated machine-learning-based detection methods for synthetic texts, we follow OpenAI’s classification [33], which divides these methods into three main categories: i) simple classifiers [18, 33], ii) zero-shot detection techniques [22, 25, 39], and iii) fine-tuning-based detection [26]. Simple classifiers fall under black-box detection, whereas zero-shot and fine-tuning-based detection come under the umbrella of white-box detection. Other approaches do not fit into these three categories yet are still significant and merit consideration. These alternative methods include testing ChatGPT-generated text against various plagiarism tools [20], a Deep Neural Network-based AI detection tool [6], a sampling-based approach [16], and online detection tools [1, 3, 10, 13, 17, 28, 29, 36, 40]. In the following sections, we analyze the existing approaches in these categories and the alternative methods, focusing on their effectiveness in detecting AI-generated text.
II-A Simple Classifiers
OpenAI [33], an artificial intelligence research company, analyzed both human detection and automated ML-based detection of synthetic texts. For human detection of the synthetic datasets, the authors showed that models trained on the GPT-2 datasets tend to increase the perceived humanness of GPT-2-generated text. They then tested a simple logistic regression model, a zero-shot detection model (explained in Section II-B), and a fine-tuning-based detection model (described in Section II-C). The simple logistic regression model was trained on TF-IDF (Term Frequency-Inverse Document Frequency) unigram and bigram features and later analyzed under different generation strategies and model parameters. The simple classifiers reached accuracies of up to 97%; however, detecting shorter outputs is more difficult for these models than detecting longer ones. Guo et al. [18] conducted human evaluations and compared datasets generated by ChatGPT and human experts, analyzing the linguistic and stylistic characteristics of their responses and highlighting differences between them. The authors then attempted to detect whether a text was ChatGPT-generated or human-generated by deploying, first, a simple logistic regression model on GLTR Test-2 features and, second, a pre-trained deep classifier model based on RoBERTa [24] for single-text and Q&A detection (as explained in Section II-C). However, the proposed detection models are ineffective at distinguishing ChatGPT-generated text from human-generated text due to the highly unbalanced corpus, which did not capture all of ChatGPT’s text-generating styles. Another study, by Kushnareva et al. [22], utilized Topological Data Analysis (TDA) to extract three types of interpretable topological features, such as the number of connected components, edges, and cycles present in the attention graph, for artificial text recognition.
The authors then trained a logistic regression classifier on these features and tested the approach on datasets from WebText & GPT-2, Amazon Reviews & GPT-2, and RealNews & GROVER [39]. However, it is unclear whether this approach is effective for ChatGPT, as it was not tested on that model.
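As a hedged sketch of the simple-classifier approach above (assuming scikit-learn and a made-up four-sentence corpus; the actual experiments used the GPT-2 output datasets), a logistic regression over TF-IDF unigram/bigram features can be built as follows:

```python
# Sketch of a black-box simple classifier: logistic regression trained on
# TF-IDF unigram + bigram features. The tiny corpus below is purely
# illustrative and stands in for real human/AI training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = [
    "I walked to the market and bought fresh bread this morning.",
    "Honestly, the movie dragged on, but the ending saved it for me.",
]
ai_texts = [
    "As an AI language model, I can provide a comprehensive overview.",
    "In conclusion, there are several key factors to consider carefully.",
]

texts = human_texts + ai_texts
labels = [0, 0, 1, 1]  # 0 = human-written, 1 = AI-generated

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram and bigram TF-IDF features
    LogisticRegression(),
)
detector.fit(texts, labels)

# Predict the class of a new sentence (no claim about real-world accuracy)
print(detector.predict(["As an AI language model, I cannot do that."]))
```

On realistic corpora, a pipeline of this shape is what reaches the reported accuracies on long outputs while degrading on short ones, since short texts yield very sparse TF-IDF vectors.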
II-B Zero-shot detection techniques
OpenAI [33] also developed a GPT-2 detector using the 1.5-billion-parameter GPT-2 model, which can identify outputs generated with Top-40 sampling with an accuracy of 83% to 85%. However, when the model was fine-tuned on the Amazon reviews dataset, the accuracy dropped to 76%. In a different study [25], the authors explored zero-shot detection of AI-generated text and released an online detection tool (DetectGPT) to distinguish GPT-2-generated text from human-generated text using the generative model’s log probabilities. The authors experimented and demonstrated that AI-generated text occupies the negative-curvature regions of the model’s log-probability function. However, it should be noted that the authors assumed that one can evaluate the log probabilities of the model(s) under consideration, which may not always be possible. Moreover, as mentioned by the authors, this approach is only practical for GPT-2 prompts.
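The curvature idea behind DetectGPT can be illustrated with a toy sketch. Everything below is a stand-in: the actual method scores texts with the generative model's real log-likelihood and perturbs them with a T5 mask-filling model, whereas here `log_prob` and `perturb` are deliberately simple functions chosen only to make the perturb-and-compare logic runnable:

```python
# Toy sketch of the DetectGPT curvature test: a text is flagged as
# machine-generated when small perturbations of it consistently lower its
# log-probability, i.e. the text sits near a local maximum of log_prob.
import random

def log_prob(text: str) -> float:
    """Toy scorer: -1 for every adjacent word pair out of alphabetical order.
    (Stand-in for a real language model's log-likelihood.)"""
    words = text.split()
    return -float(sum(a > b for a, b in zip(words, words[1:])))

def perturb(words: list[str], rng: random.Random) -> str:
    """Toy perturbation: swap two randomly chosen words.
    (Stand-in for T5 mask-filling rewrites.)"""
    w = list(words)
    i, j = rng.sample(range(len(w)), 2)
    w[i], w[j] = w[j], w[i]
    return " ".join(w)

def detectgpt_score(text: str, n_perturbations: int = 50, seed: int = 0) -> float:
    """Average drop in log-probability under perturbation; larger => more AI-like."""
    rng = random.Random(seed)
    base = log_prob(text)
    words = text.split()
    drops = [base - log_prob(perturb(words, rng)) for _ in range(n_perturbations)]
    return sum(drops) / len(drops)

# An alphabetically ordered text is a local maximum of this toy log_prob, so
# it behaves like "machine-generated" text: every perturbation lowers the score.
print(detectgpt_score("brown dog fox jumps lazy over quick the"))
```

The design point is that only relative log-probabilities are needed, which is why the method presumes (white-box) access to the model's scores.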
Zellers et al. [39] utilized a transformer identical to the one used for GPT-2, except that they used nucleus sampling instead of top-k sampling to select the next word during text generation. Their model, known as Grover, can generate text such as fake news and detect its own generated text, and it is available online. The authors used Grover, GPT-2 (124M or 355M), BERT (BERT-Base or BERT-Large), and FastText verification tools to classify the news articles generated by Grover, and showed that Grover is the best among these detectors at verifying its self-generated fake news. However, it is not evident that it will work for text generated by the GPT models. Moreover, it has been shown that the bi-directional transformer model RoBERTa outperforms Grover models of equivalent parameter size in detecting GPT-2 texts [33].
II-C Fine-tuning based detection
In [33], the authors conducted experiments on fine-tuning pre-trained language models to detect AI-generated texts, basing the classifiers on GPT-2 and RoBERTa. They found that fine-tuning RoBERTa consistently outperformed fine-tuning an equivalent-capacity GPT-2 model. However, the approach could not detect text generated by ChatGPT, as demonstrated in [31]. In a separate study, Mitrovic et al. [26] investigated the feasibility of training an ML model to distinguish between queries generated by ChatGPT and those generated by humans. The authors attempted to detect ChatGPT-generated two-line restaurant reviews using a framework based on DistilBERT, a lightweight Transformer model distilled from BERT, which they fine-tuned for the task. Additionally, the model’s predictions were explained using the SHAP method. The authors concluded that an ML model could not reliably identify texts generated by ChatGPT. Guo et al. [18] also developed a pre-trained deep classifier model based on RoBERTa for single-text and Q&A detection; its limitations are the same as described in Section II-A.
TABLE I: Summary of the evaluated detection approaches. *GPT-3.5 and above.

| Approach | Published in | Grover | GPT-2 | GPT-3 | ChatGPT* | Publicly Available | Free/Paid | ChatGPT detection (TPR %) | Human-text detection (TNR %) |
|---|---|---|---|---|---|---|---|---|---|
| Kumarage et al. [21] | 2023 | | ✓ | | | ✓ | Free | 23.3 | 94.7 |
| Bleumink et al. [6] | 2023 | | | ✓ | ✓ | ✓ | Paid | 13.4 | 95.4 |
| ZeroGPT [40] | 2023 | | | | ✓ | ✓ | Paid | 45.7 | 92.2 |
| OpenAI Classifier [28] | 2023 | | | | ✓ | ✓ | Free | 31.9 | 91.8 |
| Mitchell et al. [25] | 2023 | | ✓ | | | ✓ | Free | 18.1 | 80.0 |
| GPTZero [29] | 2023 | | ✓ | ✓ | ✓ | ✓ | Paid | 27.3 | 93.5 |
| Hugging Face [13] | 2023 | | | | ✓ | ✓ | Free | 10.7 | 62.9 |
| Guo et al. [18] | 2023 | | | | ✓ | ✓ | Free | 47.3 | 98.0 |
| Perplexity (PPL) [17] | 2023 | | | | ✓ | ✓ | Free | 44.4 | 98.3 |
| Writefull GPT [36] | 2023 | | | ✓ | ✓ | ✓ | Paid | 21.6 | 99.3 |
| Copyleaks [10] | 2023 | | | ✓ | ✓ | ✓ | Paid | 22.9 | 92.1 |
| Cotton et al. [8] | 2023 | | | | ✓ | ✓ | - | - | - |
| Khalil et al. [20] | 2023 | | | | ✓ | | - | - | - |
| Mitrovic et al. [26] | 2023 | | | | ✓ | ✓ | - | - | - |
| Content at Scale [3] | 2022 | | ✓ | ✓ | ✓ | ✓ | Paid | 38.4 | 79.8 |
| Originality.ai [1] | 2022 | | | | ✓ | ✓ | Paid | 7.6 | 95.0 |
| Writer AI Detector [37] | 2022 | | | ✓ | ✓ | ✓ | Paid | 6.9 | 94.5 |
| Draft and Goal [12] | 2022 | | | ✓ | ✓ | ✓ | Free | 23.7 | 91.1 |
| Gao et al. [15] | 2022 | | | | ✓ | | - | - | - |
| Fröhling et al. [14] | 2021 | ✓ | ✓ | ✓ | | ✓ | Free | 27.8 | 89.2 |
| Kushnareva et al. [22] | 2021 | ✓ | ✓ | | | ✓ | Free | 25.1 | 96.3 |
| Solaiman et al. [33] | 2019 | | ✓ | | | ✓ | Free | 7.2 | 96.4 |
| Gehrmann et al. [16] | 2019 | | ✓ | | | ✓ | Free | 32.0 | 98.4 |
| Zellers et al. [39] | 2019 | ✓ | | | | ✓ | Free | 43.1 | 91.3 |
II-D Other approaches proposed in literature
Extensive academic research has investigated the adverse impact of ChatGPT on education, showing that students and scholars can use ChatGPT to engage in plagiarism. To address this issue, detectors built on existing models such as RoBERTa, Grover, or GPT-2 have been used to check the uniqueness of educational content against ChatGPT-generated text. In [6], the authors proposed AICheatCheck, a transformer-based, web-based AI detection tool designed to identify whether a given text was written by a human or by ChatGPT. AICheatCheck examines a sentence or group of sentences for patterns to determine their origin. The authors used the data collected by Guo et al. [18] (with its limitations) together with data from the education field. However, the paper does not specify on what basis, or from which features, AICheatCheck achieves its reported high accuracy.
The study in [20] evaluates the effectiveness of two popular plagiarism-detection tools, iThenticate and Turnitin, on 50 essays generated by ChatGPT. The authors also asked ChatGPT to judge its own output and found it more effective than the traditional plagiarism-detection tools. Another study, by Gao et al. [15], compared ChatGPT-generated academic paper abstracts with originals using the RoBERTa-based GPT-2 Output Detector, a plagiarism checker, and human review. The authors collected ten research abstracts from five high-impact medical journals and then used ChatGPT to output research abstracts based on their titles and journals. However, the tool used in the study is not available online for verification. In recent work, Cotton et al. [8] investigate the pros and cons of using ChatGPT in the academic field, particularly concerning plagiarism. In a different work, the authors of [16] exploited the distributional properties of the text underlying a language model. They built a tool called GLTR that highlights the input text in different colors to indicate its likely authenticity. GLTR was tested on prompts from the GPT-2 1.5B-parameter model [30] and on human-generated articles from social media platforms. The authors also conducted a human study, asking students to distinguish fake news from real news.
II-E Online tools
Below, we examine multiple online tools and briefly describe their functionality.
1. Stylometric Detection of AI-generated Text [21]: This tool utilizes stylometric signals to examine the writing style of a text, identifying patterns and features unique to it. These signals are extracted from the input text to enable sequence-based detection of AI-generated tweets.
2. ZeroGPT [40]: This tool is specifically developed to detect OpenAI-generated text but has limited capability on shorter texts.
3. OpenAI Text Classifier [28]: This fine-tuned GPT model developed by OpenAI predicts the likelihood that a text was AI-generated from various sources, including ChatGPT. However, the tool requires at least 1,000 characters and is less reliable at determining whether a text was artificially generated.
4. GPTZero [29]: This classification model determines whether a document was written by an LLM, providing predictions at the sentence, paragraph, and document levels. However, it mainly works for English content and only accepts text between 250 and 5,000 characters.
5. Hugging Face [13]: This detector, hosted on Hugging Face, targets text generated by ChatGPT. However, it tends to over-classify text as ChatGPT-written.
6. Perplexity (PPL) [17]: Perplexity is a widely employed metric for assessing Large Language Models (LLMs), calculated as the exponential of the negative average log-likelihood of a text under the LLM. A lower PPL value implies that the language model is more confident in its predictions. Because LLMs are trained on vast text corpora and learn common language patterns and text structures, PPL can be used to gauge how closely a given text conforms to such typical characteristics.
7. Writefull GPT Detector [36]: Primarily a plagiarism-detection tool, it can also identify whether a piece of text was generated by GPT-3 or ChatGPT. However, its percentage-based scoring carries a degree of uncertainty for both human-written and ChatGPT-generated samples.
8. Copyleaks [10]: This tool claims to detect whether a text was generated by GPT-3, ChatGPT, humans, or a combination of humans and AI. It accepts only text of 150 or more characters.
9. Content at Scale [3]: This online tool detects text generated by ChatGPT. However, it can only analyze samples of 25,000 characters or fewer.
10. Originality.ai [1]: This paid tool is designed to work with the GPT-3, GPT-3.5 (DaVinci-003), and ChatGPT models. However, it requires at least 100 words and is prone to classifying ChatGPT-generated content as human-written.
11. Writer AI Content Detector [37]: This tool is designed to work with the GPT-3 and ChatGPT models. However, it restricts each check to a maximum of 1,500 characters.
12. Draft and Goal [12]: This tool detects content generated by the GPT-3 and ChatGPT models and supports both English and French. However, it requires the input text to be at least 600 characters long to work effectively.
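The perplexity metric used by tool 6 above follows directly from its definition as the exponential of the negative average log-likelihood. The per-token probabilities below are made-up numbers for illustration; in practice they come from an LLM's next-token distribution:

```python
# Perplexity (PPL) of a token sequence from its next-token probabilities:
# PPL = exp(-(1/N) * sum(log p_i)). Lower PPL = the model finds the text
# more predictable (more "typical" for the model).
import math

def perplexity(token_probs: list[float]) -> float:
    """Exponential of the negative average log-likelihood."""
    n = len(token_probs)
    avg_log_likelihood = sum(math.log(p) for p in token_probs) / n
    return math.exp(-avg_log_likelihood)

confident = [0.9, 0.8, 0.95, 0.85]   # model finds the text predictable
uncertain = [0.1, 0.05, 0.2, 0.15]   # model is frequently surprised

print(perplexity(confident))  # low PPL: text matches typical patterns
print(perplexity(uncertain))  # high PPL: text is atypical for the model
```

Detectors built on PPL exploit exactly this gap: text a model itself would likely have produced tends to score a low perplexity under that model.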
III Evaluation
This section evaluates publicly available tools and code that can differentiate between AI-generated and human-generated responses. Our primary focus is on tools claiming to detect ChatGPT-generated content. However, we also evaluate (to the best of our abilities) the performance on ChatGPT prompts of other AI-generated text detection tools that make no explicit claims about detecting ChatGPT-generated content. To assess the effectiveness of these tools, we employ a benchmark dataset (Section III-A) that comprises prompts from ChatGPT and humans. We then measure the detection capabilities of these tools on both ChatGPT-generated and human-generated content and present the results in Table I.
III-A Benchmark Dataset
We fed the inquiry prompts proposed by Guo et al. [18] to the OpenAI API (https://openai.com/blog/introducing-chatgpt-and-whisper-apis) to generate a benchmark dataset. This dataset comprises 58,546 responses generated by humans and 72,966 responses generated by the ChatGPT model, resulting in 131,512 unique samples that address 24,322 distinct questions from various fields, including medicine, open-domain Q&A, and finance. Furthermore, the dataset incorporates responses from popular social networking platforms, which provide a wide range of user-generated perspectives. To assess the similarity between human-generated and ChatGPT-generated responses, we employed the sentence transformer all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). We then selected the responses with the highest and lowest similarity to assemble a benchmark dataset of approximately 10% of the primary dataset we generated. This benchmark dataset serves as a standardized reference for evaluating the ability of different techniques to detect ChatGPT-generated content.
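The selection step above can be sketched as follows. The embedding vectors here are random stand-ins (the actual pipeline encodes each response with all-MiniLM-L6-v2), and the 5%-per-extreme split is an illustrative choice, so only the similarity-ranking logic is meant literally:

```python
# Sketch of benchmark selection: rank paired (human, ChatGPT) answers by
# cosine similarity of their embeddings, then keep the most and least
# similar pairs. Embeddings are random stand-ins for sentence-transformer
# outputs so the logic is runnable without downloading a model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n_pairs, dim = 100, 8
human_emb = rng.normal(size=(n_pairs, dim))     # stand-in for encoded human answers
chatgpt_emb = rng.normal(size=(n_pairs, dim))   # stand-in for encoded ChatGPT answers

sims = np.array([cosine_similarity(h, c) for h, c in zip(human_emb, chatgpt_emb)])

# Keep the extremes of the similarity ranking (illustrative 5% per extreme):
k = n_pairs // 20
order = np.argsort(sims)                        # ascending similarity
benchmark_idx = np.concatenate([order[:k], order[-k:]])
print(len(benchmark_idx))  # 10 pairs selected out of 100
```

Selecting both extremes stresses detectors with the hardest cases: pairs where ChatGPT closely mimics the human answer and pairs where the two diverge sharply.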
III-B Evaluation Metrics
To measure and compare the effectiveness of each approach, we utilized the following metrics:
- True Positive Rate (TPR): This metric represents the tool’s sensitivity in detecting text generated by ChatGPT. True Positives (TP) count the correctly identified ChatGPT-generated samples, while False Negatives (FN) count the ChatGPT-generated samples not classified as generated text, i.e., incorrectly identified as human text. Therefore, TPR = TP / (TP + FN).
- True Negative Rate (TNR): This metric indicates the tool’s specificity in detecting human-generated texts. True Negatives (TN) count the correctly identified human-generated samples, while False Positives (FP) count the human-generated samples incorrectly classified as produced by ChatGPT. Therefore, TNR = TN / (TN + FP).
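These two metrics can be computed directly from a detector's binary predictions; the following is a minimal sketch with a made-up set of labels:

```python
# Compute TPR and TNR from ground-truth labels and detector predictions
# (1 = flagged as ChatGPT-generated, 0 = flagged as human-written).
def tpr_tnr(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 4 ChatGPT samples followed by 4 human samples
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
tpr, tnr = tpr_tnr(y_true, y_pred)
print(tpr, tnr)  # 0.25 0.75
```

The made-up numbers deliberately mirror the pattern in Table I: a detector that labels almost everything human-written scores a high TNR but a poor TPR.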
III-C Evaluated Tools and Algorithms
We evaluate the tools and algorithms summarized in Section II. Table I outlines the detection capability of these tools on ChatGPT-generated content in terms of TPR and TNR. We observe that none of the evaluated approaches can consistently detect ChatGPT-generated text: the most effective tool achieves a TPR of less than 50%.
IV Conclusion
This study delved into the various methods employed for detecting ChatGPT-generated text. Through a comprehensive review of the literature and an examination of existing approaches, we assessed the ability of these techniques to differentiate between responses generated by ChatGPT and those produced by humans. Furthermore, our study included testing and validating online detection tools and algorithms on a benchmark dataset that covers various topics, such as finance and medicine, and user-generated responses from popular social networking platforms. Our experiments highlight ChatGPT’s exceptional ability to deceive detectors and further indicate that most of the analyzed detectors are prone to classifying any text as human-written, with a generally high TNR (around 90% or above) and a low TPR. These findings have significant implications for enhancing the quality and credibility of online discussions. Ultimately, our results underscore the need for continued efforts to improve the accuracy and robustness of text detection techniques in the face of increasingly sophisticated AI-generated content.
References
- [1] Originality.ai. https://originality.ai/, 2022. [Online; accessed 28-Mar-2023].
- [2] Hidden use of chatgpt in online mental health counseling raises ethical concerns. https://www.psychiatrist.com/news/hidden-use-of-chatgpt-in-online-mental-health-counseling-raises-ethical-concerns/, 2023. [Online; accessed 24-Mar-2023].
- [3] Content at Scale. AI DETECTOR. https://contentatscale.ai/ai-content-detector/, 2023. [Online; accessed 23-Mar-2023].
- [4] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- [5] Som Biswas. Chatgpt and the future of medical writing, 2023.
- [6] Arend Groot Bleumink and Aaron Shikhule. Keeping ai honest in education: Identifying gpt-generated text, 2023. https://www.aicheatcheck.com/.
- [7] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- [8] Debby RE Cotton, Peter A Cotton, and J Reuben Shipway. Chatting and cheating: Ensuring academic integrity in the era of chatgpt. Innovations in Education and Teaching International, pages 1–12, 2023.
- [9] Luigi De Angelis, Francesco Baglivo, Guglielmo Arzilli, Gaetano Pierpaolo Privitera, Paolo Ferragina, Alberto Eugenio Tozzi, and Caterina Rizzo. Chatgpt and the rise of large language models: The new ai-driven infodemic threat in public health. Available at SSRN 4352931, 2023.
- [10] Copyleaks AI Content Detector. Copyleaks. https://copyleaks.com/ai-content-detector, 2023. [Online; accessed 23-Mar-2023].
- [11] Michael Dowling and Brian Lucey. Chatgpt for (finance) research: The bananarama conjecture. Finance Research Letters, page 103662, 2023.
- [12] Draft and Goal. ChatGPT - GPT3 Content Detector. https://detector.dng.ai/, 2023. [Online; accessed 23-Mar-2023].
- [13] Hugging Face. Hugging Face ChatGPT-Detection. https://huggingface.co/spaces/imseldrith/ChatGPT-Detection, 2023. [Online; accessed 23-Mar-2023].
- [14] Leon Fröhling and Arkaitz Zubiaga. Feature-based detection of automated language models: tackling gpt-2, gpt-3 and grover. PeerJ Computer Science, 7:e443, April 2021. https://peerj.com/articles/cs-443/#supplementary-material.
- [15] Catherine A Gao, Frederick M Howard, Nikolay S Markov, Emma C Dyer, Siddhi Ramesh, Yuan Luo, and Alexander T Pearson. Comparing scientific abstracts generated by chatgpt to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. bioRxiv, pages 2022–12, 2022.
- [16] Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. Gltr: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019. http://gltr.io/.
- [17] Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. Chatgpt Detector Using Linguistic Features. https://huggingface.co/spaces/Hello-SimpleAI/chatgpt-detector-ling, 2023. [Online; accessed 23-Mar-2023].
- [18] Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. How close is chatgpt to human experts? comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023. https://github.com/Hello-SimpleAI/chatgpt-comparison-detection.
- [19] Philipp Hacker, Andreas Engel, and Marco Mauer. Regulating chatgpt and other large generative ai models. arXiv preprint arXiv:2302.02337, 2023.
- [20] Mohammad Khalil and Erkan Er. Will chatgpt get you caught? rethinking of plagiarism detection. arXiv preprint arXiv:2302.04335, 2023.
- [21] Tharindu Kumarage, Joshua Garland, Amrita Bhattacharjee, Kirill Trapeznikov, Scott Ruston, and Huan Liu. Stylometric detection of ai-generated text in twitter timelines. arXiv preprint arXiv:2303.03697, 2023. https://github.com/TSKumarage/Stylo-Det-AI-Gen-Twitter-Timelines.
- [22] Laida Kushnareva, Daniil Cherniavskii, Vladislav Mikhailov, Ekaterina Artemova, Serguei Barannikov, Alexander Bernstein, Irina Piontkovskaya, Dmitri Piontkovski, and Evgeny Burnaev. Artificial text detection via examining the topology of attention maps. arXiv preprint arXiv:2109.04825, 2021. https://github.com/danchern97/tda4atd.
- [23] Nathan Lambert, Louis Castricato, Leandro von Werra, and Alex Havrilla. Illustrating reinforcement learning from human feedback (rlhf). Hugging Face Blog, 2022. https://huggingface.co/blog/rlhf.
- [24] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- [25] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305, 2023. https://github.com/eric-mitchell/detect-gpt https://detectgpt.ericmitchell.ai/.
- [26] Sandra Mitrović, Davide Andreoletti, and Omran Ayoub. Chatgpt or human? detect and explain. explaining decisions of machine learning model for detecting short chatgpt-generated text. arXiv preprint arXiv:2301.13852, 2023.
- [27] Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. Chatgpt versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. arXiv preprint arXiv:2302.06466, 2023.
- [28] OpenAI. https://beta.openai.com/ai-text-classifier, January 2023.
- [29] Edward Tian (Princeton). GPTZero. https://gptzero.me/, 2023. [Online; accessed 23-Mar-2023].
- [30] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [31] Jürgen Rudolph, Samson Tan, and Shannon Tan. Chatgpt: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching, 6(1), 2023.
- [32] Yiqiu Shen, Laura Heacock, Jonathan Elias, Keith D Hentel, Beatriu Reig, George Shih, and Linda Moy. Chatgpt and other large language models are double-edged swords, 2023.
- [33] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019. https://openai-openai-detector.hf.space/ https://github.com/openai/gpt-2-output-dataset/tree/master/detector.
- [34] Teo Susnjak. Applying bert and chatgpt for sentiment analysis of lyme disease in scientific literature. arXiv preprint arXiv:2302.06474, 2023.
- [35] Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. The science of detecting llm-generated texts. arXiv preprint arXiv:2303.07205, 2023.
- [36] writefull. GPT Detector. https://x.writefull.com/gpt-detector, 2023. [Online; accessed 23-Mar-2023].
- [37] Writer.com. AI Content Detector. https://writer.com/ai-content-detector/, 2023. [Online; accessed 23-Mar-2023].
- [38] Yee Hui Yeo, Jamil S Samaan, Wee Han Ng, Peng-Sheng Ting, Hirsh Trivedi, Aarshi Vipani, Walid Ayoub, Ju Dong Yang, Omer Liran, Brennan Spiegel, et al. Assessing the performance of chatgpt in answering questions regarding cirrhosis and hepatocellular carcinoma. medRxiv, pages 2023–02, 2023.
- [39] Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. Advances in neural information processing systems, 32, 2019. https://rowanzellers.com/grover/.
- [40] ZeroGPT. https://www.zerogpt.com, January 2023.