The Detection and Understanding of Fictional Discourse

Andrew Piper
McGill University
[email protected]
Haiqi Zhou
McGill University
[email protected]
Abstract

In this paper, we present a variety of classification experiments related to the task of fictional discourse detection. We utilize a diverse array of datasets, including contemporary professionally published fiction, historical fiction from the Hathi Trust, fanfiction, stories from Reddit, folk tales, GPT-generated stories, and anglophone world literature. Additionally, we introduce a new feature set of word “supersenses” that facilitate the goal of semantic generalization. The detection of fictional discourse can help enrich our knowledge of large cultural heritage archives and assist with the process of understanding the distinctive qualities of fictional storytelling more broadly.

1 Introduction

Written fiction continues to play an important role in the lives of readers. Despite predictions about the end of the book Phillips (2019) or the death of the novel Aravamudan (2011), fiction remains a central medium of communication that contributes to a sense of meaning, joy and imagination on the part of readers over their life-spans, from childhood to old age.

Written fiction can take many forms. It can be found within a multi-billion dollar industry of major publishing houses (“big fiction,” Sinykin (2023)), self-published books by amateur writers, vibrant online communities devoted to fan fiction, independent presses seeking to promote more diverse styles and voices, or major digital cultural heritage archives.

The computational detection of fictional discourse – identifying whether a text is telling an imaginary story versus a true one – can be a useful task for two reasons. First, it can help locate fiction within large, unmarked digitized cultural heritage collections, thus enriching our knowledge about the past Underwood et al. (2020); Bagga and Piper (2022); Hamilton and Piper (2023).

Second, the use of predictive modeling can also facilitate the identification of distinctive qualities of fictional discourse thereby highlighting its potential value to readers and society. While prior theoretical work has made strong claims about the absence of distinctiveness surrounding fictional discourse Searle (1975); Currie (1990), computational research in this area has shown that fictional discourse represents a very coherent and historically consistent set of linguistic practices Underwood (2014, 2019); Piper (2016).

In this paper, we seek to further contribute to knowledge about the detectability and understanding of fictional discourse. Prior work has focused on using lexical features (i.e., bag-of-words) Underwood (2014, 2019) and LIWC features Piper (2016) to detect fictional documents within large and small collections of historical texts, respectively. First, we expand the number of datasets used to test the accuracy of our models beyond prior work. In addition to using historical collections derived from the Hathi Trust Digital Library Bagga and Piper (2022), we also include data from contemporary professional publishing Piper (2022), fan fiction, social media Ouyang and McKeown (2015), and historical folk tales from around the world drawn from Project Gutenberg, along with a collection of contemporary anglophone fiction from non-Western sources.

Second, where prior work has used either lexical features or LIWC’s psycho-linguistic categories to represent model features, here we rely on word “supersenses” to capture more general semantic categories of words (see Table 1). Supersenses are derived from Wordnet’s taxonomies and generated through the latest model of bookNLP Bamman (2021). We find that word supersenses serve two important functions for our analysis.

First, they generalize our understanding of the semantic behavior of fictional discourse beyond individual keywords. Second, they help operationalize our understanding of fictional writing as a distinctive form of “discourse,” which we define as a socially constructed form of communication Halliday and Matthiessen (2013); Taylor (2013); Berger and Luckmann (2023). We assume that fictional discourse is strongly shaped by the distinctiveness of its agents, actions, and worlds, all of which have a strong semantic aspect (though there may be other kinds of distinguishing features that could be the subject of future work).

Finally, we also explore prediction at the sentence-level using a fine-tuned BERT model to better understand the minimal tokens necessary to distinguish fictional discourse.

In the sections that follow we describe the principal data sets used along with the experimental set-ups employed to better understand both the predictability of fictional discourse and its distinctive qualities as seen from a general semantic viewpoint. We imagine future work could continue to compare these results across more diverse data sets as well as test higher-level formal features.

2 Data and Methodology

2.1 Data

In this paper we use the following data sets to run our classification exercises.

CONLIT: 2,754 books belonging to 12 different categories split into fiction (1,934) and non-fiction (820) narratives published since 2001 Piper (2022).

CONLIT_Page: The same dataset, represented by 350 sequential tokens sampled from each document.

CONLIT_1P/3P: Subsets restricted to first-person / third-person fiction and “Memoir” / “Biography” for non-fiction.

HATHI1M: 1,671,370 randomly sampled pages of English-language prose drawn from the Hathi Trust Digital Library divided between fiction (765,920) and non-fiction (905,450) Bagga and Piper (2022).

HATHI1M_19C/20C: Pages with an original publication date of 1825-1875 / 1875-2000.

FANFIC: 9,948 stories sampled from the top 15 most popular fandoms on Archive of Our Own (AO3) as of 2020. All texts are between 2,000 and 10,000 words in length.

REDDIT: 2,643 stories drawn from sub-reddits that focus on non-fictional storytelling (such as “What is your scariest real life story?”) Ouyang and McKeown (2015).

FOLK: 3,136 world folktale collections downloaded from Project Gutenberg.

WORLDLIT_EN: 243 works of Anglophone fiction told in the third-person from three countries: South Africa, Nigeria and India.

GPT_Neutral/Confounding: 100 fictional stories generated by GPT-4 using either the prompt “Can you tell me a short story?” (Neutral) or a prompt requesting features typically indicative of non-fictional narratives within a fictional story: “Can you tell me a made-up short story (i.e. not a true story) where you: 1. use a lot of historical dates, 2. talk about social groups rather than individual characters (such as Americans or Germans but use invented groups), and 3. do not describe anyone’s personal appearance or their bodies?” (Confounding). A sketch of this generation setup follows below.
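
For illustration, the following is a minimal sketch of how such stories could be generated with the OpenAI Python client using the two prompts above. Only the prompts come from the paper; the model name string, sampling settings, and client version are assumptions, as the paper does not report them.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NEUTRAL_PROMPT = "Can you tell me a short story?"
CONFOUNDING_PROMPT = (
    "Can you tell me a made-up short story (i.e. not a true story) where you: "
    "1. use a lot of historical dates, 2. talk about social groups rather than "
    "individual characters (such as Americans or Germans but use invented "
    "groups), and 3. do not describe anyone's personal appearance or their bodies?"
)

def generate_stories(prompt: str, n: int = 100) -> list[str]:
    """Request n independent stories from GPT-4 using the given prompt."""
    stories = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4",  # exact model identifier is an assumption
            messages=[{"role": "user", "content": prompt}],
        )
        stories.append(response.choices[0].message.content)
    return stories

neutral_stories = generate_stories(NEUTRAL_PROMPT)          # GPT_Neutral
confounding_stories = generate_stories(CONFOUNDING_PROMPT)  # GPT_Confounding
```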

2.2 Methodology

All data was processed using the large model of bookNLP Bamman (2021). We condition on the frequency of supersense types per document normalized by the token count of the document. Table 1 illustrates a sample of supersense types and the top associated tokens.

Nouns           Top Tokens            Verbs                Top Tokens
noun.act        way, work, job        verb.body            smiled, wearing, laughed
noun.animal     dog, horse, animal    verb.change          began, get, make
noun.artifact   door, room, house     verb.cognition       know, think, knew
noun.attribute  way, voice, power     verb.communication   said, say, asked
noun.body       eyes, head, hand      verb.competition     fight, play, protect
Table 1: Sample of the 40 total supersense types with their most frequent tokens for the fiction subset of CONLIT.
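
To make this feature construction concrete, the sketch below shows one way of turning bookNLP output into normalized supersense frequencies. It is a minimal illustration rather than our exact pipeline: it assumes bookNLP's tab-separated .supersense and .tokens output files, and the directory name, book identifiers, and column names are placeholders that may differ across bookNLP versions.

```python
import pandas as pd
from pathlib import Path

def supersense_features(book_dir: str, book_id: str) -> pd.Series:
    """Per-document supersense frequencies normalized by token count.

    Assumes bookNLP has already been run, producing tab-separated
    <book_id>.supersense and <book_id>.tokens files (column names can
    vary across bookNLP versions; adjust as needed).
    """
    base = Path(book_dir)
    supersenses = pd.read_csv(base / f"{book_id}.supersense", sep="\t")
    tokens = pd.read_csv(base / f"{book_id}.tokens", sep="\t", quoting=3)

    n_tokens = len(tokens)  # normalize by the document's token count
    counts = supersenses["supersense_category"].value_counts()
    return counts / n_tokens

# Build a document-by-feature matrix: rows are documents, columns are
# the ~40 supersense types (missing types filled with 0).
book_ids = ["book_001", "book_002"]  # hypothetical identifiers
feature_matrix = pd.DataFrame(
    {bid: supersense_features("booknlp_output", bid) for bid in book_ids}
).T.fillna(0.0)
```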

We then engage in pairwise classification of various combinations of fictional and non-fictional documents using the random forests algorithm, which has been shown to perform well for text-classification purposes Xu et al. (2012) and exhibits little difference from other textual classifiers Piper and Bagga (2022). For each experiment, we run five-fold cross-validation and report the mean F1. We then extract the top-weighted features for each classifier.
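
As an illustration of this classification step, the sketch below uses scikit-learn's random forest with five-fold cross-validation, mean F1, and feature importances for the top-weighted features. The synthetic feature matrix stands in for the real supersense frequencies, and settings such as the number of trees are assumptions, since they are not specified above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 documents x 40 supersense frequency features.
# In practice X is the bookNLP feature matrix and y the fiction labels.
rng = np.random.default_rng(0)
feature_names = [f"supersense_{i}" for i in range(40)]
X = pd.DataFrame(rng.random((200, 40)), columns=feature_names)
y = rng.integers(0, 2, size=200)  # 1 = fiction, 0 = non-fiction

clf = RandomForestClassifier(n_estimators=500, random_state=42)

# Five-fold cross-validation, reporting the mean F1 across folds.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(f"mean F1: {scores.mean():.3f}")

# Refit on all data to inspect the most strongly weighted features.
clf.fit(X, y)
top5 = sorted(zip(feature_names, clf.feature_importances_),
              key=lambda kv: kv[1], reverse=True)[:5]
print("top features:", top5)
```

Note that feature_importances_ only measures a feature's overall weight; determining whether a feature positively predicts fiction or non-fiction (as reported in Table 2) requires an additional check, such as comparing class-conditional means.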

To determine the minimum amount of text required to predict fiction, we train BERT on five datasets randomly sampled from CONLIT, each consisting of 5,000 passages with lengths of one to five sentences. Note that the longer passages are not newly sampled but contain the sentences from the shorter sets (i.e. the five-sentence set contains all the sentences in the four-sentence set plus the next sentence). We reference the original BERT paper Devlin et al. (2019) to select a set of hyperparameters for grid search. The best-performing hyperparameters are a batch size of 16, a learning rate of 4e-5, and 5 epochs. Each dataset has balanced classes and is divided into a training set of 3,200 passages, a validation set of 800, and a test set of 1,000 for fine-tuning and evaluation.
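
A minimal sketch of this fine-tuning setup using the Hugging Face transformers Trainer with the best-performing hyperparameters reported above (batch size 16, learning rate 4e-5, 5 epochs). The passages.csv file and its columns, the simplified 80/20 split, and the maximum sequence length are placeholders rather than our exact configuration.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# "passages.csv" with columns "text" and "label" (1 = fiction) is a placeholder.
dataset = load_dataset("csv", data_files="passages.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)  # simplified split

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert_fiction",
    per_device_train_batch_size=16,  # best hyperparameters reported above
    learning_rate=4e-5,
    num_train_epochs=5,
    evaluation_strategy="epoch",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```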

3 Results

Table 2 provides a snapshot of our document-level classification experiments with the full results reported in the Appendix. In Table 3 we provide an overview of our sentence-level classification results.

Dataset    F1     Top Features
CONLIT     0.973  verb.perception, noun.body
FANFIC     0.996  noun.body, verb.contact
HATHI1M    0.886  verb.motion, verb.perception
FOLK       0.991  noun.plant, noun.animal
WORLDLIT   0.893  noun.body, verb.contact
Table 2: Overview of comparisons with the two most strongly weighted positive predictive features. See Appendix for full list.
Sentence Length Precision Recall F1
One sentence 0.78 0.81 0.80
Two sentences 0.81 0.87 0.84
Three sentences 0.85 0.89 0.87
Four sentences 0.87 0.88 0.87
Five sentences 0.89 0.91 0.90
Table 3: BERT Sentence-Level Classification Report

4 Discussion

4.1 Predictability of Fictional Discourse

Overall, our work shows that fictional discourse exhibits extremely strong distinguishing features, with F1 scores ranging from a high of .99 for fan fiction and folk tales to a low of .89 for historical fiction. Even at the sentence level, our BERT-based classifier identified fictional discourse with an F1 of at least .80 for single sentences, rising to .90 with a five-sentence context (though, interestingly, it never achieves the accuracy of our random forests model). One of the core insights provided here is that fictional discourse signals its semantic distinctiveness in very clear and overt ways, such that our classifiers have little trouble detecting the difference.

This consistency also exhibits interesting historical continuity, at least since the early nineteenth century (i.e. the onset of literary Romanticism). As we can see in the Appendix, models trained on nineteenth-century fiction predict late-twentieth-century fiction as well as the global average for the entire HATHI1M collection. From a semantic point of view, the distinctiveness of fictional writing appears to have remained reliably consistent over time.

When we test non-North American fiction, we find that our models lose accuracy, but this appears to be due to the smaller training size: when we use a similarly sized subset of the CONLIT data, accuracy is comparable. When we test countries individually through a hold-out model (training on WORLDLIT+CONLIT minus one country at a time), we find that the three non-Western collections do exhibit different levels of accuracy, with a low of 0.894 for India and a high of 1.0 for South Africa.

We also note two interesting results from our experiments with GPT-generated stories (see Appendix). When trained on contemporary fiction, our models struggle to adequately classify GPT-generated stories, even the simple, straightforward kind (GPT_Neutral). However, when trained on folk tales, all stories were accurately classified as fiction. This was true even for the “confounding” stories, which were prompted to exhibit high levels of features typically indicative of non-fiction.

These results suggest two things: (a) the default GPT theory of “story” is highly dependent on the folk-tale genre, and (b) stories that exhibit confounding features, i.e. that try to sound non-fictional, are still easily detectable as fiction due to other semantic qualities, given the right training data.

4.2 Distinctive Qualities of Fictional Discourse

In terms of the distinctive features our models surface, we note a high level of consistency across datasets, with some interesting deviations.

In CONLIT, the most strongly weighted features for predicting fictional discourse belonged to sensorimotor categories (verb.perception, noun.body, verb.contact); the same was true for historical fiction, except that verb.motion replaced verb.contact. Fanfiction likewise replicated the features of CONLIT, but with a lower emphasis on perception. Folk tales, not surprisingly, looked the most anomalous, with a far stronger emphasis on noun entities like plants, animals, and food, though body language remained important.

Finally, the WORLDLIT anglophone collection exhibits very similar leading features to CONLIT, suggesting cross-cultural semantic norms in fictional storytelling.

These findings strongly support prior research showing that an emphasis on the embodied behavior of fictional characters has become an increasingly core aspect of fictional storytelling over time (Piper, 2023). Across numerous kinds of datasets and fictional storytelling scenarios, embodiment remains a core dimension of constructing and communicating imaginary stories.

5 Conclusion

This paper has shown that many earlier findings about fictional discourse’s predictability and its emphasis on embodied behavior hold true across a diverse array of corpora. As we discuss in our limitations, future work will want to continue to expand the diversity of genres, languages, and cultures, as well as move to more non-semantic features, to further deepen our understanding of the role that fictional storytelling plays around the world.

Limitations

One of the core limitations of our work is its restriction to English-language texts. While we have expanded the range of text types, historical periods, and regional cultures used in fictional discourse analysis compared to prior work, multilingual analysis remains for future work.

Our insights also only pertain to the semantic dimension of fictional discourse. While we find the use of “supersenses” a valuable tool for generalizing about semantic behavior, future work will want to focus on the unique formal or structural qualities of fictional discourse when compared to non-fictional narratives.

Acknowledgments

References

  • Aravamudan (2011) Srinivas Aravamudan. 2011. Refusing the death of the novel. In Novel: A Forum on Fiction, volume 44, pages 20–22. JSTOR.
  • Bagga and Piper (2022) Sunyam Bagga and Andrew Piper. 2022. Hathi 1M: Introducing a million page historical prose dataset in English from the HathiTrust. Journal of Open Humanities Data, 8(7).
  • Bamman (2021) David Bamman. 2021. BookNLP: A natural language processing pipeline for books. https://github.com/booknlp/booknlp. Accessed: 2022-01-30.
  • Berger and Luckmann (2023) Peter Berger and Thomas Luckmann. 2023. The social construction of reality. In Social theory re-wired, pages 92–101. Routledge.
  • Currie (1990) Gregory Currie. 1990. The nature of fiction. Cambridge University Press.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Halliday and Matthiessen (2013) Michael Alexander Kirkwood Halliday and Christian MIM Matthiessen. 2013. Halliday’s introduction to functional grammar. Routledge.
  • Hamilton and Piper (2023) Sil Hamilton and Andrew Piper. 2023. MultiHathi: A complete collection of multilingual prose fiction in the HathiTrust Digital Library. Journal of Open Humanities Data, 9.
  • Ouyang and McKeown (2015) Jessica Ouyang and Kathleen McKeown. 2015. Modeling reportable events as turning points in narrative. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2149–2158.
  • Phillips (2019) Angus Phillips. 2019. Does the book have a future? A Companion to the History of the Book, pages 841–855.
  • Piper (2016) Andrew Piper. 2016. Fictionality. Journal of Cultural Analytics, 2(2).
  • Piper (2022) Andrew Piper. 2022. The CONLIT dataset of contemporary literature. Journal of Open Humanities Data, 8.
  • Piper (2023) Andrew Piper. 2023. What do characters do? the embodied agency of fictional characters. Journal of Computational Literary Studies, 2(1).
  • Piper and Bagga (2022) Andrew Piper and Sunyam Bagga. 2022. Toward a data-driven theory of narrativity. New Literary History, 54(1):879–901.
  • Searle (1975) John R Searle. 1975. The logical status of fictional discourse. New Literary History, 6(2):319–332.
  • Sinykin (2023) Dan Sinykin. 2023. Big Fiction: How Conglomeration Changed the Publishing Industry and American Literature. Columbia University Press.
  • Taylor (2013) Stephanie Taylor. 2013. What is discourse analysis? Bloomsbury Academic.
  • Underwood (2014) Ted Underwood. 2014. Understanding genre in a collection of a million volumes.
  • Underwood (2019) Ted Underwood. 2019. Distant horizons: digital evidence and literary change. University of Chicago Press.
  • Underwood et al. (2020) Ted Underwood, Patrick Kimutis, and Jessica Witte. 2020. NovelTM datasets for English-language fiction, 1700-2009. Journal of Cultural Analytics, 5(2).
  • Xu et al. (2012) Baoxun Xu, Xiufeng Guo, Yunming Ye, and Jiefeng Cheng. 2012. An improved random forest classifier for text categorization. J. Comput., 7(12):2913–2920.

Appendix A Appendix

We include figures of the distribution of the top 15 feature weights for three of our primary experiments. Table 4 provides a list of all experiments, F1 scores, and top 5 features.

Figure 1: CONLIT
Figure 2: HATHI1M
Figure 3: FANFIC
Table 4: Full list of experiments undertaken for fiction detection. Note for Train/Test experiments we report “Accuracy” not F1 because the test sets are a single class. Bold features are positive predictors of fiction and underline features are positive predictors of non-fiction.
Datasets                                              F1     Top 5 Features                                                                     Support
CONLIT_FIC / CONLIT_NON                               0.973  verb.perception, noun.body, noun.act, noun.group, verb.contact                    1,934 / 627
CONLIT_1P / CONLIT_MEM                                0.964  noun.act, noun.time, verb.contact, noun.body, noun.group                          1,025 / 229
CONLIT_3P / CONLIT_BIO                                0.987  verb.perception, noun.body, verb.contact, noun.act, verb.body                     900 / 193
FANFIC / CONLIT_NON                                   0.996  noun.group, noun.body, noun.act, noun.location, verb.contact                      9,948 / 627
FANFIC / REDDIT                                       0.998  noun.feeling, verb.competition, noun.relation, noun.phenomenon, noun.possession   9,948 / 2,643
HATHI1M_FIC / HATHI1M_NON                             0.886  verb.motion, verb.perception, verb.emotion, noun.group, noun.body                 2,500 / 2,500
HATHI1M_19C (Train) / HATHI1M_20C (Test)              0.903  verb.motion, noun.group, verb.perception, noun.body, verb.emotion                 2,500 / 2,500
FOLK / HATHI1M_NON (19C)                              0.991  noun.plant, noun.animal, verb.body, verb.competition, noun.food                   5,000 / 3,136
WORLDLIT_EN / CONLIT_NON                              0.893  noun.body, noun.act, verb.contact, verb.perception, noun.time                     243 / 627
CONLIT_NYT / CONLIT_NON                               0.901  noun.group, noun.act, verb.relation, verb.perception, noun.body                   243 / 627
CONLIT_PAGE (Train) / GPT_NEUTRAL (Test)              0.675  noun.body, noun.group, verb.contact, noun.act, verb.motion                        2,561 / 100
CONLIT_PAGE (Train) / GPT_CONFOUNDING (Test)          0.0    noun.body, noun.group, verb.contact, noun.act, verb.motion                        2,561 / 100
FOLK & CONLIT_NON (Train) / GPT_NEUTRAL (Test)        1.0    noun.location, noun.animal, verb.social, noun.time, noun.group                    3,763 / 100
FOLK & CONLIT_NON (Train) / GPT_CONFOUNDING (Test)    0.995  noun.location, noun.animal, verb.social, noun.time, noun.group                    3,763 / 100
WORLDLIT & CONLIT (Train) / India (Test)              0.894  noun.body, noun.act, verb.perception, noun.group, verb.contact                    2,710 / 94
WORLDLIT & CONLIT (Train) / Nigeria (Test)            0.931  noun.body, noun.act, verb.perception, noun.group, verb.contact                    2,710 / 72
WORLDLIT & CONLIT (Train) / S. Africa (Test)          1.0    noun.body, noun.act, verb.perception, noun.group, verb.contact                    2,710 / 77