Nece: Narrative Event Chain Extraction Toolkit
Abstract
To understand a narrative, it is essential to comprehend the temporal flow of events, especially those associated with the main characters; however, this can be challenging with lengthy and unstructured narrative texts. To address this, we introduce Nece, an open-access, document-level toolkit that automatically extracts and aligns narrative events in the temporal order of their occurrence. Through extensive evaluations, we show the high quality of the Nece toolkit and demonstrate its downstream application in analyzing narrative bias regarding gender. We also openly discuss the shortcomings of the current approach and the potential of leveraging generative models in future work. Lastly, the Nece toolkit includes both a Python library and a user-friendly web interface, which offer equal access to professionals and lay audiences alike to visualize event chains, obtain narrative flows, or study narrative bias.
1 Introduction
In this paper, we introduce Nece, a document-level narrative event chain extraction system with an easily accessible online interface. Nece offers visualization of structured event chains extracted from unstructured text corpora, making it easier for readers to access and comprehend information from narrative sources such as fairy tales, news reports, and memoirs. While there are existing text processing approaches, such as temporal dependency graph parsing (Mathur et al., 2022), document-level AMR (abstract meaning representation) parsing (Naseem et al., 2022), and scripts and event schemas (Chambers and Jurafsky, 2008; Dror et al., 2022), none offers an end-to-end visualization of the narrative flow as Nece does (Figure 4). The narrative event chain of Nece is a salience-filtered, temporally ordered, linear chain of events organized by gender and character groups. It provides a detailed, document-level temporal chain, in contrast to the more abstractive, prototypical chains of Chambers and Jurafsky (2008).

In order to construct a chain of narrative events, we devised narrative feature extraction algorithms to mine salient events and associate them with characters and gender groups (Section 4). The salience score is computed as a combination of a tf-idf score (which penalizes generic events and rewards locally high-frequency events), a character mention score, and a location mention score. We also designed a gender identification algorithm that leverages document-level coreference resolution to find gendered names or pronouns. For the temporal ordering task, we leveraged the SOTA temporal relation (TempRel) model ECONET (Han et al., 2021) to predict adjacent pairwise orderings, and devised greedy and window-sliding approaches to generate ordered chains of events from the pairwise relations.
In summary, this work contributes in the following respects. First, we propose a version of the narrative event chain that offers a detailed visualization of narrative event flow, suitable for event understanding and analysis. Second, we integrate the capabilities of excellent toolkits, such as BookNLP (Bamman et al., 2014) and ECONET, and design novel custom algorithms to build the proposed event chain construction pipeline with validated performance. Third, we open-source the pip package of Nece to extract and visualize document-level narrative event chains, and present our website for non-expert users (the Nece toolkit is licensed under the MIT license and can be accessed here; a screenshot of the demo is here). An experimental application in temporal-based bias analysis sheds light on the application potential of the Nece toolkit. Lastly, we discuss the technical challenges of existing methods and explore the few-shot temporal reasoning capability of recent popular LLMs.
2 Related Work
Narrative Event Chain is inspired by works that model text for different specialties and usages. Pichotta (2015) uses scripts, linear chains of events, to model stereotypical sequences of events; there are also hierarchical scripts (Dror et al., 2022), which create separate event chains for each detected hierarchy. Beyond linear chains, document-level semantic graphs (Naseem et al., 2022) can offer a more detailed view of text content.
Salience detection is an important component that underpins the utility of our narrative chain. Different from the salience algorithms in Jindal et al. (2020) and Zhang et al. (2021), which extract summarizing/abstractive events by training on news summarization corpora, our salience filtering only aims to remove generic words, auxiliary verbs, and non-narrative events, while the bulk of the event content is kept. Salience filtering is the pre-processing step that reduces distractions for the temporal ordering model and renders the resulting narrative flow more readable and useful.

Temporal Ordering models are also a key component of Nece; we adopt the SOTA TempRel model ECONET (Han et al., 2021), which fine-tunes BERT to anchor events onto a temporal axis and resolve ordering by event start time on the MATRES dataset (Ning et al., 2018b). Related TempRel models include TCR (Ning et al., 2018a), which trains on annotated data with ILP constraints, and ROCK (Zhang et al., 2022), which fine-tunes RoBERTa (Zhuang et al., 2021) on a large (400k), automatically curated temporal dataset. DocTime (Mathur et al., 2022) combines the training of a dependency parser and a TempRel predictor by formulating an edge prediction objective to generate temporal dependency graphs. Generative LLMs such as ChatGPT and Flan-T5 (Chung et al., 2022) may also be prompted to make temporal predictions.
Online Interfaces. EventPlus (Ma et al., 2021) features a sentence-level online temporal reasoning interface. The StoryAnalyzer website (Mitri) offers narrative analysis capabilities built upon CoreNLP (Manning et al., 2014). Nece expands on both to offer a narrative-centric investigation of temporal event flows at the document level. Our user-friendly website and interfaces can be used for industry adoption or educational purposes.

3 Interactive Interface
Salient Event Chain. Figure 2 shows the first window of the Nece toolkit, which presents the full salient event chain for the input narrative text. Filtering out non-salient events reduces distracting items and makes the resulting chain more condensed and informative. This interface offers a bird's-eye view of the document's full event flow, in the temporal order of occurrence.
Gender Event Chain. Figure 3 shows the second interface of Nece, which further organizes the salient event chain by gender association. Users can select and view event chains associated with male characters, female characters, or plural/group characters. This division is motivated by concerns about gender fairness in children's books (McCabe et al., 2011). We build an automatic bias visualization based on the odds ratio; we also use Nece to perform a temporally anchored bias analysis to shed light on its application potential (Section 5.3).
Event Chain for Characters. The third interface of Nece displays salient event chains for the main characters (Figure 4). Characters are another key element of narrative text; a per-character event chain gives users easy access to the plots involving the character they are interested in. Main characters are determined from combined name and pronoun mention counts; the participant role of a character is also differentiated by the shape of the event module: a square indicates an agent/subject participant, and an oval indicates a patient/direct-object participant of an event.
Table 1: Precision and recall of salience detection on the dev and test sets.

| Approach | Dev Prec. | Dev Recall | Test Prec. | Test Recall |
|---|---|---|---|---|
| No-Filtering | 0.800 | 1.000 | 0.695 | 0.979 |
| Auxiliary-Verb-Filtering | 0.910 | 1.000 | 0.842 | 0.979 |
| Salience-Filtering | 0.921 | 0.978 | 0.857 | 0.979 |
Table 2: Pairwise TempRel performance (precision/recall on the Before and After labels, micro/macro F1) and document-level chain quality (Kendall's τ). The last two columns give the number of gold Before and After labels.

| Method | Before Prec. | Before Rec. | After Prec. | After Rec. | Micro-F1 | Macro-F1 | Kendall's τ | # Gold Before | # Gold After |
|---|---|---|---|---|---|---|---|---|---|
| random-order (ref.) | 0.729 | 0.653 | 0.203 | 0.260 | 0.557 | 0.456 | 0.709 | 262 | 85 |
| text-order | 0.751 | 1.0 | 0.0 | 0.0 | 0.750 | 0.428 | 0.725 | 262 | 85 |
| flan-t5-large (fs) | 0.777 | 0.901 | 0.270 | 0.210 | 0.733 | 0.549 | 0.724 | 262 | 85 |
| chatgpt | 0.763 | 0.737 | 0.274 | 0.318 | 0.631 | 0.519 | 0.717 | 262 | 85 |
| (ours) econet | 0.933 | 0.858 | 0.689 | 0.807 | 0.848 | 0.810 | 0.735 (greedy) / 0.728 (harmony) | 262 | 85 |
4 Event Chain Extraction Algorithms
4.1 Salience Scoring
Salience identification is an algorithm that underpins the utility of the event chain pipeline. An unfiltered event chain, which takes all verb events from a semantic role labeling (SRL) model (Gardner et al., 2017), results in a flood of auxiliary verbs (is, are, am) and stative verbs (has, had) that not only distract our temporal ordering model but also make reading the event chain less interesting for users.
Table 1 shows that removing auxiliary verbs alone improves salience detection by a large margin. We further apply an additional layer of filtering that considers multiple dimensions, including word frequency, character reference, and location reference, inspired by Jindal et al. (2020). We trained an inverse document frequency (idf) dictionary on 2000 articles from the Wikipedia dataset (Foundation) to calculate tf-idf scores for each event. In addition, we aggregate factors such as whether argument-0 or argument-1 of the event trigger involves main characters or locations. The threshold for salience grouping is kept relatively low to achieve high recall. The specific weights on the components of the salience score are obtained through grid search on a 3-story dev set. Our final salience filtering algorithm achieves a modest improvement in precision over auxiliary-verb filtering, at no cost in recall, as shown in Table 1.
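As a rough illustration of how these signals can be combined, the sketch below scores each SRL-extracted event with a weighted sum of tf-idf, character-mention, and location-mention signals. The weights, threshold, and event representation here are illustrative placeholders, not the grid-searched values or data structures used in the toolkit.

```python
import math

def tf_idf(lemma, doc_lemma_counts, idf):
    # penalizes generic verbs (low idf) and rewards locally frequent events (high tf)
    tf = doc_lemma_counts.get(lemma, 0) / max(sum(doc_lemma_counts.values()), 1)
    return tf * idf.get(lemma, math.log(2000))  # unseen lemmas get the maximum idf

def salience_score(event, doc_lemma_counts, idf, main_characters, locations,
                   w_tfidf=1.0, w_char=0.5, w_loc=0.25):
    """Weighted combination of tf-idf, character-mention, and location-mention signals.
    The weights are placeholders; the toolkit tunes them by grid search on a dev set."""
    args = (event.get("ARG0", ""), event.get("ARG1", ""))  # SRL arguments of the trigger
    char_hit = any(c in a for a in args for c in main_characters)
    loc_hit = any(l in a for a in args for l in locations)
    return (w_tfidf * tf_idf(event["lemma"], doc_lemma_counts, idf)
            + w_char * char_hit + w_loc * loc_hit)

def filter_salient(events, threshold=0.05, **score_kwargs):
    # a deliberately low threshold keeps recall high; mostly generic events fall below it
    return [e for e in events if salience_score(e, **score_kwargs) >= threshold]
```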
4.2 Character Feature Extraction
Organizing event chains by gender and character groups depends on successful tagging of those features. We use a modified version of BookNLP (Bamman et al., 2014) to extract and cluster characters; each event is then tagged with its characters as either agent or patient participants. By counting the total name and pronoun mentions for each character, we assign primary, secondary, or tertiary importance status to characters.
Having events tagged with character associations, we can then create their gender tags by predicting the characters' gender identities. The predictions are derived from coreference resolution, which is used to look up the pronouns associated with each character. We also created a dictionary of common names for different genders, which applies directly when character names are explicitly gendered. Due to limitations of the current method, we only offer an imperfect gender categorization as female, male, group/non-binary, or unknown (see Limitations).
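For illustration, a pronoun- and name-based heuristic of this kind could look like the sketch below; the pronoun sets and the tiny name dictionary are toy placeholders, and the toolkit's actual implementation operates on BookNLP coreference clusters.

```python
FEMALE_PRON = {"she", "her", "hers", "herself"}
MALE_PRON = {"he", "him", "his", "himself"}
PLURAL_PRON = {"they", "them", "their", "themselves"}
NAME_GENDER = {"cinderella": "female", "jack": "male"}  # toy name dictionary

def resolve_gender(cluster_mentions):
    """Assign a coarse gender tag to a coreference cluster of mentions.
    Returns one of: female, male, group/non-binary, unknown."""
    mentions = [m.lower() for m in cluster_mentions]
    for m in mentions:                       # explicitly gendered names take priority
        if m in NAME_GENDER:
            return NAME_GENDER[m]
    votes = {"female": sum(m in FEMALE_PRON for m in mentions),
             "male": sum(m in MALE_PRON for m in mentions),
             "group/non-binary": sum(m in PLURAL_PRON for m in mentions)}
    label, count = max(votes.items(), key=lambda kv: kv[1])
    return label if count > 0 else "unknown"

print(resolve_gender(["the girl", "she", "her"]))  # female
```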
Having events tagged with gender roles, we compute odds ratios to illustrate potential gender bias in a story, as in Sun and Peng (2021) and Monroe et al. (2017). For example, in a given story, the occurrence of the event "kill" may have an odds ratio of four for male vs. female characters, meaning that male characters are four times more likely than female characters to be involved in killing. We apply a common correction, Haldane-Anscombe (Lawson, 2004), to account for cases in which one group has no observed counts of an event.
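To make the computation concrete, the sketch below applies the Haldane-Anscombe correction (adding 0.5 to every cell of the 2x2 table when any cell is zero) before taking the ratio of odds; the counts in the usage line are hypothetical.

```python
def odds_ratio(event_m, other_m, event_f, other_f):
    """Odds ratio of an event for male vs. female characters, with the
    Haldane-Anscombe correction applied when any cell of the 2x2 table is zero."""
    cells = [event_m, other_m, event_f, other_f]
    if 0 in cells:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    return (a / b) / (c / d)

# hypothetical counts: "kill" occurs 8 times with male and 2 times with female
# characters, out of 100 male-associated and 100 female-associated events
print(odds_ratio(8, 92, 2, 98))  # ~4.3, i.e. male characters are ~4x more likely
```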
4.3 Document-level Temporal Event Ordering
Pairwise TempRel. Temporal order is a key feature of narrative event chains. We first obtain pairwise temporal relations (TempRel) between events using ECONET (Han et al., 2021). ECONET is one of the SOTA neural temporal models; it uses a BERT architecture (Devlin et al., 2019) and is trained with a joint objective of temporal-relation mask-filling and a contrastive loss from discriminating corrupted temporal tokens.
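Schematically, pairwise TempRel classification with a fine-tuned encoder looks like the sketch below. The checkpoint name, event-marking scheme, and label order are placeholders rather than ECONET's released interface; in practice the ECONET checkpoint from Han et al. (2021) is loaded in place of the base model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["BEFORE", "AFTER", "SIMULTANEOUS", "VAGUE"]  # MATRES label set

# placeholder base model; a fine-tuned ECONET checkpoint is required for real predictions
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=len(LABELS))

def predict_temprel(sentence, trigger1, trigger2):
    """Mark the two event triggers in context and classify their temporal relation."""
    marked = sentence.replace(trigger1, f"[E1] {trigger1} [/E1]", 1)
    marked = marked.replace(trigger2, f"[E2] {trigger2} [/E2]", 1)
    inputs = tokenizer(marked, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```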
Greedy Event Chain Creation. Pairwise relations between adjacent events do not directly yield a ranking of events. We therefore propose an ILP (Integer Linear Programming) constraint that first assumes the textual order of events and only changes the order of adjacent events based on the pairwise relations predicted by the model. The process can also be viewed as an insertion sort: let X be a sorted sequence of events of length n, and let e be a new event to be inserted into X. The algorithm consists of the following steps. Initialize a hashmap R of temporal relations, where R[i, j] represents the relation between events e_i and e_j. Populate R with pairwise relations between neighboring events in the document, i.e., R[i, i+1] = r_{i,i+1}, where r_{i,i+1} is the relation between events e_i and e_{i+1} predicted by the ECONET model (Han et al., 2021). With this hashmap of known relations, we use the insertion sort procedure in Algorithm 1 to recursively insert each new event into the sorted chain and generate the final ordered event chain.
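A minimal sketch of this insertion procedure is given below; it is not the toolkit's exact Algorithm 1, and it assumes that `relations` maps adjacent event-id pairs (i, j) to the label ECONET predicts for event i relative to event j.

```python
def greedy_order(events, relations):
    """Greedy insertion-sort chain construction from adjacent pairwise relations."""
    chain = []
    for event in events:                     # events arrive in textual order
        pos = len(chain)                     # default: keep textual order
        # move left while the predicted relation says the left neighbor comes AFTER
        while pos > 0 and relations.get((chain[pos - 1], event)) == "AFTER":
            pos -= 1
        chain.insert(pos, event)
    return chain

# toy usage with event ids 0..3: the model predicts that event 1 happens AFTER event 2
pairwise = {(0, 1): "BEFORE", (1, 2): "AFTER", (2, 3): "BEFORE"}
print(greedy_order([0, 1, 2, 3], pairwise))  # [0, 2, 1, 3]
```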
Harmony Predict: Window-Sliding. While the greedy method above works well when the textual order heavily overlaps with the temporal order, it considers only an event's immediate neighbors during chain ordering. To capture longer dependencies, we increase the window size to 5: for each event e_i, pairwise TempRels are also computed with e_{i-2} and e_{i+2}, in addition to its immediate neighbors. We expand the ILP rules to include the transitive property of temporal relations and deduce additional relations beyond the size-5 context window; for instance, if e_i is before e_j and e_j is before e_k, then e_i is before e_k. However, conflicts may arise from the co-existence of transitively inferred labels and model-predicted labels. Harmony Predict applies the following ILP rules to reconcile differences: 1) textual order is preferred when the relation between two events is unpredicted; 2) transitive rules are applied to infer new relations from existing relations; 3) in case of conflict, transitive rules take precedence over the window-size-5 model predictions, since ECONET is more reliable for adjacent events.
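The sketch below illustrates the three reconciliation rules on plain dictionaries; the function name and data layout are hypothetical, and the toolkit's actual ILP formulation is more involved.

```python
def reconcile(adjacent_rel, window_rel, num_events):
    """Harmony Predict reconciliation (sketch): adjacent_rel holds trusted relations
    between neighboring events; window_rel holds longer-range size-5 window predictions."""
    rel = dict(window_rel)
    rel.update(adjacent_rel)                 # adjacent ECONET predictions are most reliable
    # Rules 2 and 3: relations inferred transitively from adjacent pairs override
    # any conflicting window-model prediction for the same pair.
    for i in range(num_events - 2):
        for j in range(i + 2, num_events):
            if all(adjacent_rel.get((k, k + 1)) == "BEFORE" for k in range(i, j)):
                rel[(i, j)] = "BEFORE"
    # Rule 1: any remaining unpredicted pair falls back to textual order.
    for i in range(num_events):
        for j in range(i + 1, num_events):
            rel.setdefault((i, j), "BEFORE")
    return rel

adjacent = {(0, 1): "BEFORE", (1, 2): "BEFORE"}
window = {(0, 2): "AFTER"}                   # long-range prediction conflicting with transitivity
print(reconcile(adjacent, window, 3)[(0, 2)])  # BEFORE: the transitive rule wins
```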
5 Evaluation
We carry out a series of experiments to evaluate the performance of Nece. An in-house annotation team annotated event salience and the gender and character associations of stories taken from the FairytaleQA dataset (Xu et al., 2022), using the instructions and interface shown in Figure 6. For the temporal ordering (TempRel) annotation, we sampled 4 stories from FairytaleQA and 4 from the CNN-DailyMail dataset (See et al., 2017) to diversify the domains (fairy-tale stories mostly adhere to textual order, while news stories have more complex timelines). The detailed annotation guidelines and example annotation interfaces can be found in Appendix A.
5.1 Temporal Ordering
Pairwise Temporal Ordering. Nece employs ECONET (Han et al., 2021) for TempRel prediction between event pairs; ECONET is fine-tuned on the small (36-document) but expert-annotated MATRES dataset (Ning et al., 2018b). We add random prediction as a reference and the textual order of events as a strong heuristic baseline. We also include few-shot prompted Flan-T5-large (Chung et al., 2022) and ChatGPT (gpt-3.5-turbo) to explore the temporal reasoning capability of instruction-finetuned LLMs. Pairwise temporal relations are measured by F1 scores, as shown in Table 2.
LLMs. Generative AI, spearheaded by ChatGPT, has made ripples in many domains by demonstrating strong zero-shot and few-shot capabilities. We tried to leverage gpt-3.5-turbo with few-shot examples and dialog-style prompts to extract temporal predictions, though with little success. Table 4 in the Appendix gives detailed examples of its failure to generate meaningful temporal predictions: it tends to produce a single label (yes or no), or to always pick the first event when prompted to choose the earlier one. Its poor performance is reflected in the low F-scores in Table 2. We then used the same prompts and setup on Flan-T5-large (Chung et al., 2022), which gives promising results close to the text-order baseline.
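For reference, a minimal version of this few-shot querying for Flan-T5 is sketched below; the prompt wording and in-context examples are illustrative, not the exact prompts used in our experiments (Appendix B shows the ChatGPT queries). As in Appendix B, each event pair is queried in both directions and the pair of yes/no answers is mapped to a relation label.

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-large")

# illustrative in-context examples, not the exact prompts from our experiments
FEW_SHOT = (
    'Text: "After she finished dinner, she went to bed." '
    'Does event "finished" happen before event "went"? Answer: yes\n'
    'Text: "He slept while the storm raged outside." '
    'Does event "slept" happen before event "raged"? Answer: no\n'
)

def ask_before(text, e1, e2):
    prompt = (FEW_SHOT +
              f'Text: "{text}" Does event "{e1}" happen before event "{e2}"? Answer:')
    answer = generator(prompt, max_new_tokens=5)[0]["generated_text"]
    return answer.strip().lower().startswith("yes")
```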
Document-level Temporal Event Chain. To benchmark the quality of the constructed event chains, we use Kendall's τ coefficient, which measures the similarity of two orderings of the same items (Kumar and Vassilvitskii, 2010). A higher Kendall's τ indicates better alignment between the predicted chain and the gold chain of events. Holistically, our greedy+econet approach not only scores high in pairwise prediction but also produces the event chain closest to gold. harmony+econet, which uses the window-sliding method to construct the event chain, has a lower average Kendall's τ, but scores higher on some stories with complex timelines (Table 2).
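Concretely, the chain-level score can be computed as in the sketch below, which compares the ranks that the gold and predicted chains assign to each event; the exact variant and normalization used in our evaluation may differ.

```python
from scipy.stats import kendalltau

def chain_kendall_tau(gold_chain, pred_chain):
    """Kendall's tau between the gold ordering and a predicted chain of the same event ids."""
    gold_rank = {e: r for r, e in enumerate(gold_chain)}
    pred_rank = {e: r for r, e in enumerate(pred_chain)}
    events = sorted(gold_rank)
    tau, _ = kendalltau([gold_rank[e] for e in events],
                        [pred_rank[e] for e in events])
    return tau

print(chain_kendall_tau([0, 2, 1, 3], [0, 1, 2, 3]))  # ~0.67: one swapped pair
```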
Discussion. In Table 2, we find the text-order model to be a strong baseline in terms of micro-F1, due to the label imbalance favoring the before relation. Flan-T5's few-shot performance is surprisingly good, achieving 0.777 precision and 0.901 recall on the before label, especially compared to ChatGPT. The temporal ordering task typically involves complex definitions to resolve overlapping time spans and multi-axis scenarios, making direct knowledge transfer difficult. Our ECONET-based methods outperform the strong baselines on most metrics, including pairwise F-score and chain-level Kendall's τ.
Table 3: Evaluation of character and gender resolution.

| Event-chain feature | Accuracy | Macro-F1 | # Samples |
|---|---|---|---|
| Character Resolution | 0.872 | - | 188 |
| Gender Resolution | 0.974 | 0.951 | 188 |
5.2 Feature Extraction
We evaluate three key narrative features that are crucial for constructing event chains. The salience feature is used to filter out redundant and generic events, while gender and character tagging allows us to organize event chains by their gender and character associations. Results for salience filtering are shown in Table 1 and discussed in Section 4.1. Gender prediction, a relatively easy task solvable through gendered-pronoun identification and coreference resolution, enjoys high accuracy, as shown in Table 3. Character identification is slightly more challenging due to the large number of possible characters in a narrative; its performance also depends on successful character clustering. We still achieve good performance thanks to the robustness of the BookNLP toolkit (Bamman et al., 2014) in narrative domains.

5.3 Application in Temporal Event Chain-Anchored Gender Bias Analysis
Nece can be used to investigate gender bias that arises from the order in which events happen. We tested bigram event comparisons to locate such biases within the FairytaleQA dataset (Xu et al., 2022). We integrated temporal ordering as a dimension of bias and compared unigram and bigram bias at different temporal locations of the narrative (beginning, middle, end); we also investigated bigram bias with respect to the anchor, i.e., as the prior event (before) or the later event (after).
The problem is formulated as follows. Bigrams (chains of two events) are extracted from each event chain; for example, a bigram could be ("cry", "have"). The events are further grouped into custom supersense categories (e.g., "cry" and "have" map to the typical VerbNet supersense categories "emotion" and "possession", respectively). We then select one supersense category, such as possession, as the anchor, and calculate the odds ratio for all events that occur prior to the anchor. An example analysis is shown in Figure LABEL:fig:analysis_units: biases at the before and after locations appear roughly even; however, there is significantly more bias against female characters at the beginning of the story than at the end. The analysis is based on 278 fairy tales (Xu et al., 2022).
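As an illustration of the anchored analysis, the sketch below collects the events that immediately precede a possession-anchored event in a temporally ordered chain, split by the gender of the associated character; the resulting per-gender counts feed the odds-ratio computation of Section 4.2. The supersense dictionary here is a toy stand-in for the VerbNet-based mapping.

```python
from collections import Counter

SUPERSENSE = {"cry": "emotion", "have": "possession", "give": "possession"}  # toy mapping

def events_before_anchor(chain, anchor="possession"):
    """From an ordered chain of (event, gender) pairs, count the events that occur
    immediately before an anchor-supersense event, split by gender."""
    counts = {"male": Counter(), "female": Counter()}
    for (e1, g1), (e2, _) in zip(chain, chain[1:]):
        if SUPERSENSE.get(e2) == anchor and g1 in counts:
            counts[g1][e1] += 1
    return counts

chain = [("cry", "female"), ("have", "female"), ("give", "male"), ("have", "male")]
print(events_before_anchor(chain))
# {'male': Counter({'give': 1}), 'female': Counter({'cry': 1, 'have': 1})}
```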
6 Conclusion
Nece is a document-level narrative event chain extraction toolkit built with neural temporal ordering models and novel character feature tagging algorithms. It boasts a user-friendly online interface along with a built-in gender bias analysis capability. Nece is built to serve event-based analysis of a wide range of narrative documents, including novels, short stories, and fairy tales. Human evaluation is conducted on the key algorithmic components of Nece, demonstrating their robustness and accuracy. While our system has been primarily designed for event chain extraction, its functionalities can be adopted for various downstream applications, including temporally anchored bias analysis. Classifier-based temporal ordering has its limits, though directly leveraging generative models for temporal classification is also non-trivial. Future work is needed to investigate generative AI's temporal ordering ability.
Limitations
The overall performance of Nece is constrained by the performance of its component algorithms; temporal relation prediction models are trained at the sentence level rather than the document level, and the pairwise predictions within a document span are not always harmonious, requiring ILP rules to resolve conflicts. Even if we obtained all the correct pairwise relations, it could still be hard to build the various temporal axes in the narrative and assemble them into the final event chain. We explored using ChatGPT and Flan-T5 models to make temporal order predictions, albeit with limited success. It merits future work to unlock the capabilities of LLMs for the temporal ordering task.
Ethics Statement
In addition to being an event chain extraction toolkit, Nece includes a gender bias analysis component. We make the normative assumption that any substantial, measured numerical difference between two groups is indicative of bias within a story. However, there is the caveat that distributional imbalances found in single stories are normally not statistically significant. The odds ratios computed are only indicative of potential bias, not conclusions of bias. Moreover, rather than conducting bias analysis at the event level, it is better to first abstract events into VerbNet supersense categories and conduct odds-ratio analysis at the supersense level, as done in Section 5.3. We do not make any claims as to the polarity of bias, nor do we contextualize such bias in the rich body of work in literary criticism, media studies, gender studies, or feminist literature. We are aware that numerical measures of bias can be used to obfuscate nuance or wave away concerns of harmful representation. We do not intend for our tool to replace qualitative analyses of stories, but rather to supplement existing bias analysis frameworks. Lastly, we host a website to demonstrate our toolkit; however, due to the expense of hosting a GPU-backed server, we only uploaded samples of processed content to the website and do not support online inference. We do not collect any user data from the website.
References
- Bamman et al. (2014) David Bamman, Ted Underwood, and Noah A. Smith. 2014. A bayesian mixed effects model of literary character. In ACL.
- Chambers and Jurafsky (2008) Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797, Columbus, Ohio. Association for Computational Linguistics.
- Chung et al. (2022) Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Wei Yu, Vincent Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed Huai hsin Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022. Scaling instruction-finetuned language models. ArXiv, abs/2210.11416.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.
- Dror et al. (2022) Rotem Dror, Haoyu Wang, and Dan Roth. 2022. Zero-shot on-the-fly event schema induction. ArXiv, abs/2210.06254.
- Wikimedia Foundation. Wikimedia downloads.
- Gardner et al. (2017) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. Allennlp: A deep semantic natural language processing platform.
- Han et al. (2021) Rujun Han, Xiang Ren, and Nanyun Peng. 2021. Econet: Effective continual pretraining of language models for event temporal reasoning. In EMNLP.
- Jindal et al. (2020) Disha Jindal, Daniel Deutsch, and Dan Roth. 2020. Is killed more significant than fled? a contextual model for salient event detection. In Proceedings of the 28th International Conference on Computational Linguistics, pages 114–124, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Kumar and Vassilvitskii (2010) Ravi Kumar and Sergei Vassilvitskii. 2010. Generalized distances between rankings. In WWW ’10.
- Lawson (2004) Raef Lawson. 2004. Small sample confidence intervals for the odds ratio. Communications in Statistics - Simulation and Computation.
- Ma et al. (2021) Mingyu Derek Ma, Jiao Sun, Mu Yang, Kung-Hsiang Huang, Nuan Wen, Shikhar Singh, Rujun Han, and Nanyun Peng. 2021. Eventplus: A temporal event understanding pipeline. In NAACL.
- Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
- Mathur et al. (2022) Puneet Mathur, Vlad Morariu, Verena Kaynig-Fittkau, Jiuxiang Gu, Franck Dernoncourt, Quan Tran, Ani Nenkova, Dinesh Manocha, and Rajiv Jain. 2022. DocTime: A document-level temporal dependency graph parser. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 993–1009, Seattle, United States. Association for Computational Linguistics.
- McCabe et al. (2011) Janice McCabe, Emily Fairchild, Liz Grauerholz, Bernice A. Pescosolido, and Daniel Tope. 2011. Gender in twentieth-century children’s books: Patterns of disparity in titles and central characters. Gender & Society, 25(2):197–226.
- Mike Mitri. Story Analyzer.
- Monroe et al. (2017) Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2017. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.
- Naseem et al. (2022) Tahira Naseem, Austin Blodgett, Sadhana Kumaravel, Tim O’Gorman, Young-Suk Lee, Jeffrey Flanigan, Ramón Astudillo, Radu Florian, Salim Roukos, and Nathan Schneider. 2022. DocAMR: Multi-sentence AMR representation and evaluation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3496–3505, Seattle, United States. Association for Computational Linguistics.
- Ning et al. (2018a) Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. 2018a. Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2278–2288, Melbourne, Australia. Association for Computational Linguistics.
- Ning et al. (2018b) Qiang Ning, Hao Wu, and Dan Roth. 2018b. A multi-axis annotation scheme for event temporal relations. In ACL.
- Pichotta (2015) Karl Pichotta. 2015. Statistical script learning with recurrent neural nets. Proceedings of the AAAI Conference on Artificial Intelligence.
- See et al. (2017) A. See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. ArXiv, abs/1704.04368.
- Sun and Peng (2021) Jiao Sun and Nanyun Peng. 2021. Men are elected, women are married: Events gender bias on wikipedia. In ACL.
- Xu et al. (2022) Ying Xu, Dakuo Wang, Mo Yu, Daniel Ritchie, Bingsheng Yao, Tongshuang Wu, Zheng Zhang, Toby Li, Nora Bradford, Branda Sun, Tran Hoang, Yisi Sang, Yufang Hou, Xiaojuan Ma, Diyi Yang, Nanyun Peng, Zhou Yu, and Mark Warschauer. 2022. Fantastic questions and where to find them: FairytaleQA – an authentic dataset for narrative comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 447–460, Dublin, Ireland. Association for Computational Linguistics.
- Zhang et al. (2022) Jiayao Zhang, Hongming Zhang, Dan Roth, and Weijie J. Su. 2022. Causal inference principles for reasoning about commonsense causality. ArXiv.
- Zhang et al. (2021) Xiyang Zhang, Muhao Chen, and Jonathan May. 2021. Salience-aware event chain modeling for narrative understanding. In EMNLP.
- Zhuang et al. (2021) Liu Zhuang, Lin Wayne, Shi Ya, and Zhao Jun. 2021. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, pages 1218–1227, Huhhot, China. Chinese Information Processing Society of China.
Appendix A Annotations
Narrative feature annotations were conducted by an in-house annotation team; the temporal ordering annotation was performed by an in-house annotator and cross-checked by the authors of the paper to ensure adherence to the annotation rules. Our in-house annotators are compensated fairly, in accordance with compensation law in the state of New York.
A.1 Temporal Event Chain Annotations
Task Formulation: Given a passage of the story and a list of verb events that can be mapped back to the passage, the annotators are tasked to create a ranking of the listed verb events by anchoring the events onto the story timeline. The result is a chain of event ids, e.g., 2->1->3->4->6->5. We ask annotators to put the results into prepared Excel sheets.
Annotation rules: We trained the annotators on the rules listed below to resolve ambiguities that arise during annotation. As pointed out by Ning et al. (2018b), not all events fall on the main story axis; there are diverging hypothetical, negation, and conversational timelines, or even past timelines that cannot be clearly located. This makes the creation of one uniform event chain difficult, since there are many ambiguities. During annotation, we apply the following rules to ensure a uniform way of resolving them:
- Hypotheticals: Hypothetical events can be in the past or in the future; place the hypothetical timeline before its immediate neighbors if it is in the past, and after its immediate neighbors if it is in the future.
- Contexts: Some events are used to provide context for later conversations or plots, e.g., "a bridge stood above the river"; make sure such context events are placed temporally earlier than the events or conversations that depend on them.
- Cause and Effect: When there are cause-and-effect relations (signaled by "because", "so", "since"), always put the cause first and then the effect.
- Quotations: The relations between events inside and outside a quotation can often be inferred; if it is ambiguous, annotate in text order.
A.2 Narrative Features Annotations
Narrative features, including salience detection, gender role resolution, and character role resolution, are annotated using the interfaces shown in Figures 5 and 6. We provide the entire story, and for each question we provide the corresponding sentence and three questions about salience, gender, and character role.


Appendix B Use ChatGPT for temporal classification
Table 4: Example ChatGPT queries and predictions for pairwise temporal relations; each event pair is queried in both directions, and the yes/no answers are mapped to a relation label.

| query1 | query2 | answer1 | answer2 | model_pred | gold_label |
|---|---|---|---|---|---|
"in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’happened’ happened before event ’wrapped’ starts? Please only answer yes or no." | "in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’wrapped’ happened before event ’happened’ starts? Please only answer yes or no." | No | No | SIMUL | BEFORE |
"in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’happened’ happened before event ’seemed’ starts? Please only answer yes or no." | "in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’seemed’ happened before event ’happened’ starts? Please only answer yes or no." | No | No | SIMUL | BEFORE |
"in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’happened’ happened before event ’yield’ starts? Please only answer yes or no." | "in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’yield’ happened before event ’happened’ starts? Please only answer yes or no." | No | No | SIMUL | BEFORE |
"in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’wrapped’ happened before event ’seemed’ starts? Please only answer yes or no." | "in the following text: The call , which happened as President Barack Obama wrapped up his first presidential visit to Israel , was an unexpected outcome from a Mideast trip that seemed to yield few concrete steps. is it possible that event ’seemed’ happened before event ’wrapped’ starts? Please only answer yes or no." | Yes | No | BEFORE | VAGUE |
Table 4 shows the queries and the resulting answers from ChatGPT, which fails to reason sensibly about the temporal relationship between the queried events. The few-shot prompt given is shown in the description of the table. We used the same few-shot prompts for ChatGPT as for the Flan-T5 models, which give much better performance. ChatGPT, though powerful in dialog and creative capabilities, is unfortunately not as capable in temporal reasoning. We tested other prompts as well, but still could not get ChatGPT to produce better temporal predictions. Our results should not be taken as conclusive, but they add data points for interpreting its temporal ability.
Appendix C Temporal Chain Cases Analysis
Figure 7 shows an example of a successful ordering by Nece and an example of a failed ordering. Current temporal reasoning models are only trained on pairwise relations, without a systematic way to reconcile conflicts and ensure a reasonable ordering at the paragraph or document level. Nece applies ILP constraints to merge pairwise relations into chains, but this is not the most robust solution. We hope future work can further explore the problem and propose better algorithms to holistically process the temporal ordering of narrative events.
