
Chapter Captor: Text Segmentation in Novels

Charuta Pethe, Allen Kim, Steven Skiena
Department of Computer Science,
Stony Brook University, NY, USA
{cpethe,allekim,skiena}@cs.stonybrook.edu
Abstract

Books are typically segmented into chapters and sections, representing coherent sub-narratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.

1 Introduction

Text segmentation Hearst (1994); Beeferman et al. (1999) is a fundamental task in natural language processing, which seeks to partition texts into sequences of coherent segments or episodes. Segmentation tasks differ widely in scale, from partitioning sentences into clauses to dividing large texts into coherent parts, where each segment is ideally an independent event occurring in the narrative.

Text segmentation plays an important role in many NLP applications including summarization, information retrieval, and question answering. In the context of literary works, event detection is a central concern in discourse analysis Joty et al. (2019). In order to obtain representations of events, it is essential to identify narrative boundaries in the text, where one event ends and another begins.

In novels and related literary works, authors often define such coherent segments by means of sections and chapters. Chapter boundaries are typically denoted by formatting conventions such as page breaks, white-space, chapter numbers, and titles. This physical segmentation improves the readability of long texts for human readers, providing transition cues for breaks in the story.

In this paper, we investigate the task of identifying chapter boundaries in literary works, as a proxy for that of large-scale text segmentation. The texts of thousands of scanned books are available in repositories such as Project Gutenberg Gutenberg (n.d.), making the chapter boundaries of these texts an attractive source of annotations to study text segmentation. Unfortunately, the physical manifestations of the printed book have been lost in the Gutenberg texts, limiting their usefulness for such studies. Chapter titles and numbers are retained in the texts but not systematically annotated: indeed they sit as hidden obstacles for most NLP analysis of these texts.

We develop methods for extracting ground truth chapter segmentation from Gutenberg texts, and use this as training/evaluation data to build text segmentation systems to predict the natural boundaries of long narratives. Our primary contributions (all code and links to data are available at https://github.com/cpethe/chapter-captor) include:

  • Project Gutenberg Chapter Segmentation Resource: To create a ground-truth data set for chapter segmentation, we developed a hybrid approach to recognizing chapter formatting which is of independent interest. It combines a neural model with a regular expression based rule matching system. Evaluation on a (noisy) silver-standard chapter partitioning yields a mean F1 score of 0.77 on a test set of 640 books, but manual investigation shows this evaluation receives an artificially low recall score due to incorrect header tags in the silver-standard.

    Our data set consists of 9,126 English fiction books in the Project Gutenberg corpus. To encourage further work on text segmentation for narratives, we make the annotated chapter boundaries data publicly available for future research.

  • Local Methods for Chapter Segmentation: By concatenating chapter text following the removal of all explicit signals of chapter boundaries (white space and header notations), we create a natural test bed to develop and evaluate algorithms for large-document text segmentation. We develop two distinct approaches for predicting the location of chapter breaks: an unsupervised weighted-cut approach minimizing cross-boundary cross-references, and a supervised neural network building on the BERT language model Devlin et al. (2019). Both prove effective at identifying likely boundary sites, with F1 scores of 0.164 and 0.447 respectively on the test set.

  • Global Break Prediction using Optimization: Social conventions encourage authors to maintain chapters of modest yet roughly equal length. Incorporating length criteria into the optimization objective and using dynamic programming to find the best global solution enables us to control how strongly the segments are pushed toward equal length. We find that a balance between equal segments and model-influenced segments gives us the best segmentation, with minimal error. Indeed, augmenting the BERT-based local classifier with dynamic programming yielded an F1 score of 0.453 on the challenging task of exact break prediction over book-length documents, while simultaneously beating challenging baselines on two other error metrics.

    Incorporating chapter length criteria requires an independent estimate of the number of chapters in a given text. We demonstrate that there are approximately five times as many likely break candidates as there are chapter breaks in the weighted cut approach, reflecting the number of sub-events within an average book chapter.

  • Historical Analysis of Segmentation Conventions: We exploit our data analysis of segmented books in two directions. We demonstrate that novels grew in length to an average of roughly 30 chapters/book by 1800, and retained this length until 1875 before beginning a steady decline. Second, an analysis of regular expression patterns reveals the wide variety of chapter header conventions and which forms dominate.

2 Previous Work

Many approaches have been developed in recent years to address variants of the task of identifying structural elements in books.

McConnaughey et al. (2017) attempt this task at the page-level, by assigning a label (e.g. Preface, Index, Table of Contents, etc.) to each page of the book. Wu et al. (2013) address the task of recognizing and extracting tables of contents from book documents, with a focus on identifying its style. Participants of the Book Structure Extraction competition at ICDAR 2013 Doucet et al. (2013) attempted to use various approaches for the task. These include making use of the table of contents, OCR information, whitespace, and indentation. Déjean and Meunier (2005) present approaches to identify a table of contents in a book, and Déjean and Meunier (2009) attempt to structure a document according to its table of contents.

However, our approach relies only on text, and does not require positional information or OCR coordinates to extract front matter and headings from book texts.

For text segmentation, many approaches have been developed over the years, suited to different types of data such as news articles, scientific articles, Wikipedia pages, and conversation transcripts.

The TextTiling algorithm Hearst (1994) makes use of lexical frequency distributions across blocks of a fixed number of words. Dotplotting Reynar (1994) is a graphical technique to locate discourse boundaries using lexical cohesion across the entire document.

Yamron et al. (1998) and Beeferman et al. (1999) propose methods to identify story boundaries in news transcripts.

The C99 algorithm Choi (2000) uses a global lexical similarity matrix and a ranking scheme for divisive clustering. Choi et al. (2001) further proposed the use of Latent Semantic Analysis (LSA) to compute inter-sentence similarity.

Utiyama and Isahara (2001) proposed a statistical model to find the maximum probability segmentation. The Minimum Cut model Barzilay and Malioutov (2006) addresses segmentation as a graph partitioning task.

This problem has also been addressed in a Bayesian setting Eisenstein and Barzilay (2008); Eisenstein (2009). TopicTiling Riedl and Biemann (2012) is a modification of the TextTiling algorithm, and makes use of LDA for topic modeling.

Segmentation using sentence similarity has been extensively explored using affinity propagation Kazantseva and Szpakowicz (2011); Sakahara et al. (2014). More recent approaches Alemi and Ginsparg (2015); Glavaš et al. (2016) involve the use of semantic representations of words to compute sentence similarities. Koshorek et al. (2018) and Badjatiya et al. (2018) propose neural models to identify break points within the text.

Sims et al. (2019) address the slightly different, but relevant task of event prediction using a neural model, on a human-annotated dataset of short events.

3 Header Annotation

In order to create a ground-truth dataset for chapter segmentation, we first build a system to recognize chapter headings, using a hybrid approach combining a neural model with a regular expression (regex)-based rule matching system.

3.1 Data

In the absence of human-annotated gold standard data with annotated front matter and chapter headings, we derive silver-standard ground truth from Project Gutenberg. We identify 8,400 English fiction books available in HTML format, and extract (noisy) HTML header elements from these books. We use a train-test split of 90-10%.

3.2 Methodology

Figure 1: Header Annotation Pipeline

The annotation pipeline has five components, as shown in Figure 1. First, we make use of white-space cues and string matching for keywords such as ‘Preface’, ‘Table of contents’ etc. to identify front matter. We tag all such content up to the first chapter heading as the front matter, and identify the remaining content as body.

3.2.1 BERT Inference

We fine-tune a pretrained BERT model Devlin et al. (2019) with a token classification head, to identify the lines which are likely to be headers.

Training:

For each header extracted from the Project Gutenberg HTML files, we append content from before and after the header, to generate training sequences of fixed length. We empirically select a sequence length of 120. We use a custom BERT Cased Tokenizer with a special token for the newline character, to tokenize the input sequences. The training samples are of the format:

Sequence: $[p_{1},\ldots,p_{x},h_{1},\ldots,h_{k},q_{1},\ldots,q_{y}]$

Labels:     $[0,\ldots,0,1,\ldots,1,0,\ldots,0]$

where $p_{1},\ldots,p_{x}$ are the $x$ tokens before the header, $h_{1},\ldots,h_{k}$ are the $k$ tokens of the header, and $q_{1},\ldots,q_{y}$ are the $y$ tokens after the header. $x$ and $y$ are randomly generated numbers, such that $x+k+y=120$. This is done in order to prevent header tokens from appearing only in the center of the input sequence.

We fine-tune a pre-trained model for token classification using headers from 6,515 books in our training set for 4 epochs, using the BertAdam optimizer. A compute server with a 2.30 GHz CPU and a Tesla V100 GPU was used for all experiments.
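As an illustration, the following sketch (not the authors' released code) shows one way to construct such a training example around a known header; the tokenizer and surrounding data handling are assumed, and only the random left/right context split follows the description above.

```python
# Sketch: build one 120-token training example around a known header.
import random

SEQ_LEN = 120

def make_example(tokens_before, header_tokens, tokens_after, seq_len=SEQ_LEN):
    """tokens_* are lists of tokens (with newline kept as a special token)."""
    k = len(header_tokens)
    budget = max(seq_len - k, 0)
    x = random.randint(0, budget)   # context tokens taken from before the header
    y = budget - x                  # context tokens taken from after the header
    left = tokens_before[-x:] if x > 0 else []
    right = tokens_after[:y]
    sequence = left + header_tokens + right
    labels = [0] * len(left) + [1] * k + [0] * len(right)
    return sequence, labels
```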

Inference:

For inference on a test set example, we tokenize the text using the custom BERT Cased Tokenizer, and use the model to generate a confidence score for each token. We do this using a sliding window approach, wherein we run inference on a text window of 120 tokens, and slide the window forward by 60 tokens in each iteration. We then perform token-wise max pooling to obtain a single confidence score per token. Further, we detokenize the output by concatenating sub-word tokens and mean-pooling their confidence scores.

We choose the top 10% tokens with the highest confidence scores, and use the lines containing these tokens as potential header candidates for regex matching.
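A minimal sketch of this sliding-window scoring is given below; `score_window` is a hypothetical stand-in for a forward pass of the fine-tuned model, and the sub-word detokenization with mean pooling is omitted.

```python
# Sketch: sliding-window inference with token-wise max pooling.
import numpy as np

WINDOW, STRIDE = 120, 60

def token_scores(tokens, score_window):
    """score_window(window_tokens) -> per-token header confidences (same length)."""
    scores = np.full(len(tokens), -np.inf)
    for start in range(0, max(len(tokens) - WINDOW, 0) + 1, STRIDE):
        window = tokens[start:start + WINDOW]
        window_scores = np.asarray(score_window(window))
        end = start + len(window)
        # keep the maximum score each token receives across overlapping windows
        scores[start:end] = np.maximum(scores[start:end], window_scores)
    return scores

def candidate_lines(line_of_token, scores, top_frac=0.10):
    """Return the lines containing the top 10% highest-scoring tokens."""
    n_keep = max(1, int(top_frac * len(scores)))
    top_tokens = np.argsort(scores)[-n_keep:]
    return sorted({line_of_token[t] for t in top_tokens})
```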

3.2.2 Regex Rule Matching

We compile a list of regular expressions for constituent elements in chapter headings:

  • Keywords like ‘Chapter’, ‘Section’, ‘Volume’

  • Punctuation marks and whitespace

  • Title (uppercase and mixed case)

  • Roman numerals (uppercase and lowercase)

  • Cardinal, ordinal, and digital numbers.

Using the rules for these constituent elements, we further generate a list of 1,015 regex rules for valid permutations of these elements.

For every potential header candidate generated using the BERT model, we pick the best matching regex rule as the longest rule that captures constituent elements in order of priority, and discard the candidate if there is no matching rule.
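A toy version of the rule matching is sketched below, with a handful of illustrative patterns rather than the full 1,015-rule set, and a simple first-match policy instead of the longest-rule priority described above.

```python
import re

# A few illustrative header patterns (assumptions, not the paper's rule list).
ROMAN = r"[IVXLCDMivxlcdm]+"
PATTERNS = [
    re.compile(r"CHAPTER\s+" + ROMAN + r"\.?", re.IGNORECASE),        # Chapter XII
    re.compile(r"CHAPTER\s+\d+\.?(\s+[A-Z][^\n]*)?", re.IGNORECASE),  # Chapter 3. The Storm
    re.compile(ROMAN + r"\.\s+[A-Z][A-Z '\-]+"),                      # XII. THE STORM
    re.compile(r"(BOOK|VOLUME|PART|SECTION)\s+(" + ROMAN + r"|\d+)", re.IGNORECASE),
]

def matching_rule(candidate_line):
    """Return a matching pattern for a candidate header line, or None
    (in which case the candidate is discarded)."""
    line = candidate_line.strip()
    for pattern in PATTERNS:
        if pattern.fullmatch(line):
            return pattern
    return None
```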

3.2.3 Missing Chapter Hunt

Once we have the list of candidates and their corresponding matching rules, we search for chapter headings the BERT model may have missed. For each matched rule that contains a number in some format, we search for chapter headings in the same format with the missing number. In order to account for chapter numbering restarts in different sections of the book, we search for missing headers within all increasing subsequences in the list of chapter numbers found.
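A minimal sketch of the missing-number search follows, using contiguous increasing runs of the detected chapter numbers as a simple stand-in for the increasing subsequences described above.

```python
def missing_numbers(found_numbers):
    """Given chapter numbers in order of appearance (e.g. [1, 2, 4, 1, 3]),
    return numbers to hunt for, restricted to increasing runs so that
    numbering restarts (new volumes or parts) are handled separately."""
    missing = []
    run_start = 0
    for i in range(1, len(found_numbers) + 1):
        # a run ends when the sequence stops increasing, or at the end of the list
        if i == len(found_numbers) or found_numbers[i] <= found_numbers[i - 1]:
            run = found_numbers[run_start:i]
            expected = range(run[0], run[-1] + 1)
            missing.extend(n for n in expected if n not in set(run))
            run_start = i
    return missing

# missing_numbers([1, 2, 4, 1, 3]) -> [3, 2]
```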

3.2.4 Refinement

We remove false positive matches by discarding headers that appear between consecutive chapter numbers but do not match the same rule.

3.3 Evaluation

Table 1 shows the stage-wise performance of the annotation pipeline. Stage 1 contains all candidates generated using the BERT model, Stage 2 contains headers predicted after applying regex rules and searching for missing chapters, and Stage 3 contains headers after removing false positives.

Stage Precision Recall F1
1 0.02 0.67 0.05
2 0.75 0.79 0.76
3 0.78 0.78 0.77
Table 1: Stage-wise performance for header annotation
Figure 2: F1 score distribution for 640 test set books

Figure 2 shows the distribution of evaluation metrics on the test set of 640 books, evaluated on the ground truth extracted from HTML files. The mean F1 score is 0.77. Manual investigation of a sub-sample of the test set shows that several books receive a low recall score due to false negatives caused by incorrect header tags in the silver-standard ground truth. Thus we have even greater confidence in our testbed than the F1 score suggests.

3.4 Popularly used rule formats

Figure 3: Number of books in which the most frequent header formats occur the most frequently

For each book, we count the number of occurrences of each header format. Figure 3 shows, for each common format, the number of books in which that format is the most frequent one; the most popular format is “Chapter # TITLE”.

3.5 Historical Trends

Figure 4: Trend in the number of chapters in a book

Figure 4 presents the number of chapters in each book as obtained by our annotation pipeline, plotted against the author’s year of birth. For authors born before 1875, novels averaged roughly 30 chapters; thereafter the number of chapters per book has steadily declined.

(a) WOC density (local minima point sizes are proportional to their prominences)
(b) Processed BERT confidence scores
Figure 5: Breakpoint probability scores as a function of sentence number, for a sample book (“The Rover Boys Out West”, by Edward Stratemeyer). Vertical red lines denote chapter breaks in ground truth. Predictions are computed using the dynamic programming approach (described in Section 5) with $\alpha=0.8$.

4 Local Methods for Segmentation

After removing all explicit signals of chapter boundaries from the texts, we now evaluate algorithms for segmenting text into chapters.

We formulate our task as follows:

Given: Sentences $S_{0}, S_{1}, \ldots, S_{N-1}$ in the book, and $P$, the number of breaks to insert

Compute: $P$ break points $B_{0}, B_{1}, \ldots, B_{P-1} \in \{0, 1, \ldots, N-1\}$ corresponding to chapter breaks.

4.1 Weighted Overlap Cut (WOC)

The motivation behind this technique is based on the intuition that chapters are relatively self-contained in the words that they use. For example, consider a chapter that refers to a “cabin in the woods”. We would expect references to this cabin to be higher within the same chapter as compared to other chapters. Hence, our hypothesis is that there will be fewer words in common across a break point separating two chapters, as compared to words within the same chapter.

Considering sentences as nodes, and common words as edges, we can compute the density of a potential break point as the sum of the number of edges going across it, weighted inversely by their distance from the break point. As per our hypothesis, we expect the break point between two chapters to appear as a local minimum in density as a function of sentence number.

We restrict potential break points to the points between paragraphs, and compute the local minima in density. For each local minimum, we compute its prominence as the vertical distance between the minimum and its highest contour line. We then pick the top PP most prominent local minima as the break points.

Note that the same hypothesis can also be made at the paragraph level. However, a major limitation of this approach is that paragraph sizes vary widely, ranging from a single word to a very large block of text. Hence, we have taken the approach of computing sentence-level density and then restricting the potential break points to points between paragraphs.

Preprocessing:

We use the Stanford CoreNLP pipeline Manning et al. (2014) for sentence tokenization and lemmatization. We consider paragraphs as text separated by two or more newline characters.

Computation:

For every potential break point $i$ between sentences $S_{i}$ and $S_{i+1}$, we compute the density of the break point, which is essentially a weighted sum of the number of overlapping word lemmas within a certain window before and after the break point (weighted inversely by the distance of the word occurrences from the break point). We compute the density $d_{i}$ of break point candidate $i$ as:

$$d_{i}=\sum_{x=i-w}^{i}\left(\sum_{y=\max(x+1,\,i)}^{x+w}\frac{\textrm{overlap}_{xy}}{\left|i-x\right|\left|i-y\right|}\right)$$

where $w$ is the window size and $\textrm{overlap}_{xy}$ is the number of common lemmas in sentences $S_{x}$ and $S_{y}$, excluding stopwords and punctuation. (Note that we use only valid sentence indices during summation, considering the first and last sentences of the book as cutoffs.)
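A sketch of the density computation and the prominence-based selection follows. The exact index conventions (where the break sits relative to sentence $i$, and how zero distances at the boundary sentences are avoided) are assumptions of this sketch, and SciPy's `find_peaks` is used as a stand-in for the prominence computation.

```python
import numpy as np
from scipy.signal import find_peaks

def density(sent_lemmas, i, w):
    """One reading of d_i: lemma overlap between sentence pairs that straddle
    the break after sentence i, weighted by the inverse product of their
    distances to the break (a sentence adjacent to the break is at distance 1).
    sent_lemmas is a list of sets of content-word lemmas, one set per sentence."""
    n = len(sent_lemmas)
    d = 0.0
    for x in range(max(i - w + 1, 0), i + 1):           # sentences before the break
        for y in range(i + 1, min(x + w, n - 1) + 1):   # sentences after it, within the window
            overlap = len(sent_lemmas[x] & sent_lemmas[y])
            d += overlap / ((i - x + 1) * (y - i))
    return d

def top_breaks(densities, candidate_sentences, P):
    """Pick the P candidates whose density minima are most prominent.
    densities[j] is the density at candidate paragraph boundary j."""
    # find_peaks locates maxima, so negate the densities to find minima
    minima, props = find_peaks(-np.asarray(densities), prominence=0)
    order = np.argsort(props["prominences"])[::-1][:P]
    return sorted(candidate_sentences[m] for m in minima[order])
```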

Experiments:

We perform experiments on 2,546 books in the test set, using window sizes of 50, 100, 150, and 200 sentences.

Figure 5(a) shows the computed densities and local minima for window size 200, for a sample book (“The Rover Boys Out West”, by Edward Stratemeyer). The figure shows that chapter breaks roughly correspond to prominent local minima in density.

4.2 BERT for Break Prediction (BBP)

We fine-tune a pre-trained BERT model on the Next Sentence Prediction task, to classify pairs of sequences according to whether the second sequence is a coherent continuation of the first. Intuitively, for text sequences separated by a chapter break, we expect the second sequence not to be a continuation of the first, i.e. the output label should be 0, whereas for consecutive text sequences within the same chapter, the output label should be 1, denoting a logical continuation.

Training:

We generate training sequences from 7,582 books in the training set. We generate training examples in the following format:
[CLS]<Seq A>[SEP]<Seq B>[SEP]

To generate negative training samples (i.e. class 0, meaning chapter break), we consider all the chapter boundaries, and construct the input using the text just before the chapter break as Seq A, and text just after the break as Seq B. To generate positive training samples (i.e. class 1, meaning no break) we consider the break points between paragraphs within the same chapter, and construct the input sequence similarly. We use these sequence pairs to fine-tune a pre-trained model for next sentence prediction. Note that class 0 is of interest to us in this task, as lack of continuity between the sequences denotes the possibility of a chapter break.
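A minimal sketch of the training-pair construction is shown below, assuming the HuggingFace transformers tokenizer interface and plain truncation instead of the 254-token windowing described later.

```python
# Sketch: NSP-style sequence pairs for chapter-break classification.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

def make_pair(seq_a, seq_b, is_continuation, max_len=512):
    """Encode '[CLS] Seq A [SEP] Seq B [SEP]'; label 1 = same chapter
    (continuation), label 0 = chapter break between the two sequences."""
    enc = dict(tokenizer(seq_a, seq_b, truncation=True,
                         max_length=max_len, padding="max_length"))
    enc["label"] = int(is_continuation)
    return enc

def pairs_from_book(chapters):
    """chapters: list of chapters, each given as a list of paragraph strings."""
    examples = []
    for ci, chap in enumerate(chapters):
        # positive samples: consecutive paragraphs inside the same chapter
        for para_a, para_b in zip(chap, chap[1:]):
            examples.append(make_pair(para_a, para_b, is_continuation=True))
        # negative samples: last paragraph of this chapter vs. first of the next
        if ci + 1 < len(chapters) and chap and chapters[ci + 1]:
            examples.append(make_pair(chap[-1], chapters[ci + 1][0],
                                      is_continuation=False))
    return examples
```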

Inference:

During inference on a book, we consider all break points between paragraphs, and generate input sequences as described above. We run each pair of input sequences through the classifier, and generate confidence scores per class. We then use the confidence score for class 0 as the probability of a break. We select the top $P$ break points with the highest confidence scores.

Experiments:

We perform experiments using the following variants of training sequences to fine-tune the BERT model:

  • Single paragraph: We use only one paragraph from before, and one paragraph from after the break point.

  • Full window: We use 254 tokens each, from before and after the break point. (If the paragraph length exceeds 254 tokens, we cut off the text before/after that point, depending on which side of the break point the paragraph lies.)

Figure 5(b) shows the modified BERT scores for the full window configuration, for a sample book in the test set. The figure shows that BERT is able to capture points close to chapter breaks in most cases, indicating a good recall as well as precision.

4.3 Evaluation

We evaluate our algorithms using three metrics:

$P_k$

Beeferman et al. (1999): To compute this metric, $k$ is set to half of the average true segment size. Using a moving window of length $k$, a penalty is computed based on whether the two ends of the window are in the same or different segments, and whether the ground truth segmentation is in agreement.

WindowDiff (WD)

Pevzner and Hearst (2002): This metric also uses a moving window, and compares the number of ground truth segmentation boundaries that fall in the window, with the number of boundaries assigned by the algorithm. A penalty is added if the counts are not equal.

F1 score

We use the F1 score to evaluate exact break prediction, and consider a match only if the break matches the ground truth exactly, i.e. predictions near the true break points are not counted.

Lower values of $P_k$ and WindowDiff, and a higher F1 score, are indicative of better performance.
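For concreteness, these metrics could be computed with NLTK's reference implementations of $P_k$ and WindowDiff; the boundary-string encoding and the choice of $k$ below are assumptions of this sketch rather than details from the paper.

```python
from nltk.metrics.segmentation import pk, windowdiff

def boundary_string(break_points, n_candidates):
    """'1' where a break is placed, '0' elsewhere, over all candidate positions."""
    marks = set(break_points)
    return "".join("1" if i in marks else "0" for i in range(n_candidates))

def evaluate(true_breaks, pred_breaks, n_candidates):
    ref = boundary_string(true_breaks, n_candidates)
    hyp = boundary_string(pred_breaks, n_candidates)
    # k = half the average true segment size
    k = max(2, round(0.5 * n_candidates / (ref.count("1") + 1)))
    exact = len(set(true_breaks) & set(pred_breaks))
    precision = exact / max(len(pred_breaks), 1)
    recall = exact / max(len(true_breaks), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"Pk": pk(ref, hyp, k=k), "WD": windowdiff(ref, hyp, k), "F1": f1}
```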

Table 2 shows the evaluation metrics $P_k$, WD (WindowDiff), and the F1 score for the WOC and BBP configurations described above.

Algorithm $P_k$ WD F1
Equidistant breaks 0.482 0.492 0.052
TextTile Hearst (1994) 0.587 0.714 0.085
C99 Choi (2000) 0.493 0.517 0.049
P Badjatiya et al. (2018) 0.485 0.555 0.111
L Badjatiya et al. (2018) 0.493 0.569 0.087
WOC (window=50) 0.442 0.465 0.144
WOC (window=100) 0.425 0.450 0.158
WOC (window=150) 0.418 0.447 0.162
WOC (window=200) 0.416 0.446 0.164
BBP (single para.) 0.454 0.509 0.126
BBP (full window) 0.303 0.384 0.447
Table 2: Evaluation metrics for chapter break insertion approaches (For $P_k$, WD: lower is better. For F1: higher is better.)

We compare our approaches against the following baselines:

  • Equidistant: We divide the book into $P+1$ segments, such that each segment has the same number of sentences.

  • TextTiling: We run the TextTiling algorithm Hearst (1994), using mean words per sentence as pseudosentence size, and number of paragraphs as block size for each book. The average number of breaks per book inserted by this algorithm is 574, which clearly does not reflect the actual number of chapters, resulting in poor performance.

  • C99: We run the C99 algorithm Choi (2000) on our dataset, and choose the first $P$ breaks obtained while performing divisive clustering.

  • Perceptron (P): We train a 3-layer baseline perceptron model with 300 neurons in each layer, for 10 epochs, as described by Badjatiya et al. (2018). We use mean-pooled 300-dimensional word2vec embeddings Mikolov et al. (2013) trained on the Google News dataset, as input to the perceptron.

  • LSTM (L): We train a neural model as described by Badjatiya et al. (2018), using the same pre-trained word2vec embedding matrix. The network consists of an Embedding layer, followed by an LSTM layer, a dropout layer, a dense layer and finally, a sigmoid activation layer.

Our models outperform the baselines on all metrics, with the BBP (full window) model giving the best results. The approaches by Reynar (1994) and Utiyama and Isahara (2001), and the neural models proposed by Badjatiya et al. (2018) and Koshorek et al. (2018), are global models, and are prohibitively expensive on long documents.

5 Global Break Prediction

In the approaches described above, we simply select the $P$ highest scoring points. However, this selection does not conform to spatial constraints. For example, the model may place two breaks close to one another, when realistically, chapter breaks are spaced fairly far apart in practice.

To validate this, we compute the coefficient of variation (CV) of the number of sentences per chapter for each book. Figure 6 shows the distribution of the CV over all books in our dataset. Most books in our dataset have a low CV (distributions with CV less than 1 are considered low-variance), reflecting the fact that chapter breaks are spaced fairly equally apart.

Figure 6: Distribution of the coefficient of variation of the number of sentences per segment

Hence, we propose a dynamic programming approach, in order to incorporate a weight for keeping the chapter breaks equidistant.

We formulate the task in the same way as described previously, with an additional parameter $\alpha$, which determines the importance of the confidence scores as compared to equidistant breaks. $\alpha$ ranges from 0 to 1, where 1 indicates that full weight is given to the confidence scores, and 0 indicates that full weight is given to keeping the breaks equidistant. We define the cost of inserting a break at point $n$, with $k$ breaks inserted among points $0$ to $n-1$, recursively as:

$$\textrm{cost}(n,k)=\min_{i\in\left[0,\,n-1\right]}\left(\textrm{cost}(i,k-1)+(1-\alpha)\frac{\left|n-i\right|}{L}\right)-\alpha\cdot s_{n}$$

where $s_{n}$ is the confidence score for $n$ being a break point, and $L$ is the ideal chapter length, i.e. the number of sentences in each chapter if the book were split into $P+1$ equal parts. At each step, we use the break point which results in cost minimization as the next break point, and repeat the recursive call.
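A sketch of the dynamic program follows. The base case, the handling of the final segment, and the backtracking conventions are not specified above and are assumptions of this sketch; restricting the candidates to high-confidence break points, as described later, keeps the $O(N^{2}P)$ recursion tractable.

```python
import numpy as np

def global_breaks(scores, P, alpha, L):
    """cost[n, k]: cost of placing a break at candidate n with k breaks before it.
    Assumed base case: the first break pays only its -alpha * s_n term."""
    N = len(scores)
    cost = np.full((N, P), np.inf)
    back = np.full((N, P), -1, dtype=int)
    cost[:, 0] = -alpha * np.asarray(scores)
    for k in range(1, P):
        for n in range(N):
            for i in range(n):
                c = cost[i, k - 1] + (1 - alpha) * (n - i) / L - alpha * scores[n]
                if c < cost[n, k]:
                    cost[n, k], back[n, k] = c, i
    # the best chain of P breaks ends at the lowest-cost final break
    n = int(np.argmin(cost[:, P - 1]))
    breaks = [n]
    for k in range(P - 1, 0, -1):
        n = int(back[n, k])
        breaks.append(n)
    return sorted(breaks)
```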

5.1 Experiments

We apply dynamic programming for global break prediction to both of the approaches described above. We conduct experiments for $\alpha$ from 0 to 1, in steps of 0.2.

5.1.1 WOC

We use the prominences of local minima obtained using WOC, with window sizes 50, 100, 150, and 200 respectively. We apply min-max normalization on the prominences.

Figure 7: WindowDiff error metric for WOC

Figure 7 shows the WindowDiff error metric for the WOC approach, with differing window sizes, for different values of $\alpha$. (Note that the window sizes here refer to the number of neighboring sentences used to compute density, not the window used in calculating the WindowDiff metric.) The figure shows that an increase in window size results in lower error, and for all window sizes, $\alpha=0.8$ gives the best performance.

5.1.2 BBP

We use the confidence scores for class 0 obtained using the BERT model. We observe that the confidence scores are clustered close to 0 and 1. Higher confidence scores are of more interest to us, as they are indicative of potential chapter boundaries. Hence, in order to spread the values close to 1 further apart, we apply the log function and compute the modified confidence score as $-\ln(1-\textrm{score})/c$, where $c$ is a normalizing constant. In practice, we use $c=10$ to keep the majority of the values between 0 and 1. We optimize for the best value of $\alpha$ independently of this constant.

Figure 8: WindowDiff error metric for BBP

Figure 8 shows the WindowDiff error metric for the BERT-based approach for the single paragraph and full window models respectively. We use thresholds of 0.9 and 0.99 for each of the models, meaning that we consider only those break points with confidence scores above the threshold as potential break point candidates.

The full window model shows the least error at $\alpha=0.8$. Note that the higher threshold of 0.99 performs almost as well as 0.9, while yielding fewer potential break point candidates and hence a lower runtime. Figures 5(a) and 5(b) depict predictions from the WOC and BBP approaches respectively, with $\alpha=0.8$.

Algorithm $P_k$ WD F1
Best BBP (local) 0.303 0.384 0.447
WOC (window=50) 0.443 0.456 0.144
WOC (window=100) 0.426 0.440 0.158
WOC (window=150) 0.421 0.434 0.162
WOC (window=200) 0.420 0.433 0.164
BBP (single para.) 0.441 0.455 0.128
BBP (full window) 0.284 0.305 0.453
Table 3: Metrics for global chapter break insertion
Algorithm MSE MAE $R^2$ $P_k$ WD F1
Baseline (# sent) 205.97 8.928 0.44 - - -
WOC (win=50) 203.26 8.797 0.45 0.46 0.50 0.13
WOC (win=100) 203.23 8.804 0.45 0.45 0.49 0.14
WOC (win=150) 203.19 8.805 0.45 0.44 0.49 0.14
WOC (win=200) 203.17 8.804 0.45 0.44 0.49 0.14
BBP (thr=0.9) 192.22 8.366 0.48 0.33 0.38 0.41
BBP (thr=0.99) 188.08 8.155 0.49 0.32 0.38 0.41
Table 4: Evaluation metrics for regression to predict number of chapter breaks in a book (Window size [WOC] and threshold [BBP] denoted in parentheses)
Figure 9: Error distribution over test set using predictions for number of chapters from BBP (threshold=0.99)

Table 3 shows the evaluation metrics for global chapter break insertion. The dynamic programming approach consistently improves the WindowDiff and F1 metrics. The BERT model (full window) gives the best performance in terms of all three metrics.

5.2 Estimating the Number of Breaks

The models described above require the number of chapter boundaries to be specified. We now address the independent question of estimating how many chapter breaks to insert.

(a) WOC local minima (window size 200; slope = 0.045)
(b) BBP (full-window) candidates (threshold = 0.99; slope = 0.213)
Figure 10: Number of chapter breaks as a function of the number of candidate break points

Figure 10 shows the number of chapters against the number of break point candidates for both approaches. The number of local minima in WOC is approximately 20 times the number of chapter breaks, reflecting potential event boundaries within chapters. The number of break point candidates obtained using BERT is approximately 5 times the number of chapter breaks. This can also be seen in Figures 5(a) and 5(b). Although the BBP model performs better at exact break prediction, the WOC model provides more information in terms of events within chapters.

We now use a regression model to predict the number of breaks, with the number of candidate break points and the total number of sentences in the book as features (a sketch follows the list below). For the number of candidate breaks, we use:

  • WOC: The total number of local minima

  • BBP: The number of candidate break points above a certain threshold.
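A minimal sketch of this regression is given below, using an ordinary linear regression as a stand-in since the regressor is not specified above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_break_count_model(candidate_counts, sentence_counts, true_break_counts):
    """Regress the number of chapter breaks on (#candidate break points, #sentences)."""
    X = np.column_stack([candidate_counts, sentence_counts])
    return LinearRegression().fit(X, np.asarray(true_break_counts))

def predict_break_count(model, n_candidates, n_sentences):
    """Round the regression output to a whole (positive) number of breaks."""
    pred = model.predict(np.array([[n_candidates, n_sentences]]))[0]
    return max(1, int(round(pred)))
```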

We perform experiments on 2,626 books in the test set, so as to keep the results comparable across both approaches. We use a train-test split of 67-33%. We predict the number of chapter breaks using this regression, and further evaluate global break prediction with $\alpha=0.8$.

Table 4 shows the evaluation metrics on the test set, for regression using the models described above. The full-window BERT model shows the best performance in predicting the number of chapter breaks as well as break locations. Figure 9 shows the error distribution over the test set for the best performing model.

6 Conclusion and Future Work

We build a chapter segmentation dataset resource consisting of 9,126 English fiction novels, using a hybrid approach combining neural inference and regular expression-based rule matching, achieving an F1 score of 0.77 on this task. We then remove structural cues from this dataset and address the task of predicting chapter boundaries. We present two methods for chapter segmentation: our supervised approach achieves the best performance in exact break prediction, while our unsupervised approach provides information about potential sub-chapter break points.

Our work opens up avenues for further research in text segmentation, with potential applications in summarization and discourse analysis. Potential future work includes combining the neural and cut-based approaches into a stronger method. Finally, it would be interesting to do a deeper dive into variations of author strategies in chapterization, focusing more intently on books with large numbers of short chapters as being more reflective of episode boundaries.

Acknowledgments

We thank the anonymous reviewers for their helpful feedback. This work was partially supported by NSF grants IIS-1926751, IIS-1927227, and IIS-1546113.

References

  • Alemi and Ginsparg (2015) Alexander A Alemi and Paul Ginsparg. 2015. Text Segmentation based on Semantic Word Embeddings. arXiv preprint arXiv:1503.05543.
  • Badjatiya et al. (2018) Pinkesh Badjatiya, Litton J Kurisinkel, Manish Gupta, and Vasudeva Varma. 2018. Attention-based neural text segmentation. In European Conference on Information Retrieval, pages 180–193. Springer.
  • Barzilay and Malioutov (2006) Regina Barzilay and Igor Malioutov. 2006. Minimum Cut Model for Spoken Lecture Segmentation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (COLING-ACL 2006). Citeseer.
  • Beeferman et al. (1999) Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical Models for Text Segmentation. Machine learning, 34(1-3):177–210.
  • Choi (2000) Freddy YY Choi. 2000. Advances in Domain Independent Linear Text Segmentation. arXiv preprint cs/0003083.
  • Choi et al. (2001) Freddy YY Choi, Peter Wiemer-Hastings, and Johanna D Moore. 2001. Latent Semantic Analysis for Text Segmentation. In Proceedings of the 2001 conference on empirical methods in natural language processing.
  • Déjean and Meunier (2005) Hervé Déjean and Jean-Luc Meunier. 2005. Structuring Documents According to Their Table of Contents. In Proceedings of the 2005 ACM symposium on Document engineering, pages 2–9.
  • Déjean and Meunier (2009) Hervé Déjean and Jean-Luc Meunier. 2009. On tables of contents and how to recognize them. International Journal of Document Analysis and Recognition (IJDAR), 12(1):1–20.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Doucet et al. (2013) Antoine Doucet, Gabriella Kazai, Sebastian Colutto, and Günter Mühlberger. 2013. Overview of the ICDAR 2013 Competition on Book Structure Extraction. In 2013 12th International Conference on Document Analysis and Recognition, pages 1438–1443. IEEE.
  • Eisenstein (2009) Jacob Eisenstein. 2009. Hierarchical Text Segmentation from Multi-scale Lexical Cohesion. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 353–361.
  • Eisenstein and Barzilay (2008) Jacob Eisenstein and Regina Barzilay. 2008. Bayesian Unsupervised Topic Segmentation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 334–343.
  • Glavaš et al. (2016) Goran Glavaš, Federico Nanni, and Simone Paolo Ponzetto. 2016. Unsupervised Text Segmentation using Semantic Relatedness Graphs. Association for Computational Linguistics.
  • Gutenberg (n.d.) Project Gutenberg. n.d. www.gutenberg.org. Accessed: May 2020.
  • Hearst (1994) Marti A Hearst. 1994. Multi-Paragraph Segmentation of Expository Text. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 9–16. Association for Computational Linguistics.
  • Joty et al. (2019) Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. 2019. Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 12–17, Florence, Italy. Association for Computational Linguistics.
  • Kazantseva and Szpakowicz (2011) Anna Kazantseva and Stan Szpakowicz. 2011. Linear Text Segmentation using Affinity Propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 284–293. Association for Computational Linguistics.
  • Koshorek et al. (2018) Omri Koshorek, Adir Cohen, Noam Mor, Michael Rotman, and Jonathan Berant. 2018. Text Segmentation as a Supervised Learning Task. arXiv preprint arXiv:1803.09337.
  • Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
  • McConnaughey et al. (2017) Lara McConnaughey, Jennifer Dai, and David Bamman. 2017. The Labeled Segmentation of Printed Books. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 737–747.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Pevzner and Hearst (2002) Lev Pevzner and Marti A Hearst. 2002. A Critique and Improvement of an Evaluation Metric for Text Segmentation. Computational Linguistics, 28(1):19–36.
  • Reynar (1994) Jeffrey C Reynar. 1994. An Automatic Method of Finding Topic Boundaries. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pages 331–333. Association for Computational Linguistics.
  • Riedl and Biemann (2012) Martin Riedl and Chris Biemann. 2012. Text Segmentation with Topic Models. Journal for Language Technology and Computational Linguistics, 27(1):47–69.
  • Sakahara et al. (2014) Makoto Sakahara, Shogo Okada, and Katsumi Nitta. 2014. Domain-independent Unsupervised Text Segmentation for Data Management. In 2014 IEEE International Conference on Data Mining Workshop, pages 481–487. IEEE.
  • Sims et al. (2019) Matthew Sims, Jong Ho Park, and David Bamman. 2019. Literary Event Detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3623–3634.
  • Utiyama and Isahara (2001) Masao Utiyama and Hitoshi Isahara. 2001. A Statistical Model for Domain-independent Text Segmentation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 499–506.
  • Wu et al. (2013) Zhaohui Wu, Prasenjit Mitra, and C Lee Giles. 2013. Table of Contents Recognition and Extraction for Heterogeneous Book Documents. In 2013 12th International Conference on Document Analysis and Recognition, pages 1205–1209. IEEE.
  • Yamron et al. (1998) Jonathan P Yamron, Ira Carp, Larry Gillick, Steve Lowe, and Paul van Mulbregt. 1998. A Hidden Markov Model Approach to Text Segmentation and Event Tracking. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 1, pages 333–336. IEEE.