On an Unknown Ancestor of Burrows’ Delta Measure
Abstract
This article points out some surprising similarities between a 1944 study by George Udny Yule and modern approaches to authorship attribution.
1 Introduction
Review articles usually divide the history of using quantitative methods of authorship attribution into two main periods (cf. e.g. [7, 6]):
1. The univariate approach era of the 19th and the first half of the 20th century, which focused mainly on the search for a single textual measure that could distinguish documents written by different authors.
2. The multivariate approach era, launched by the groundbreaking 1964 study by Mosteller and Wallace [8]. Researchers of this era have relied instead on the combined effect of multiple measures and have employed multivariate statistical and advanced machine-learning methods.
In what follows, I return to a little-known chapter from an otherwise influential 1944 study by George Udny Yule. Although Yule’s work is usually seen as an instance of the older univariate approach, in this particular case he seems to have been quite prescient, treating data in a way that resembles modern multivariate approaches, in particular John F. Burrows’ well-known Delta measure [2, 3]. I begin with a brief summary of the Delta principle and then explain the connection with Yule’s study.
2 Burrows’ Delta
In a nutshell, Burrows’ Delta addresses cases where there is a target text of unknown or disputed authorship ($t_0$) and a finite set of texts $t_1, t_2, \ldots, t_m$ produced by candidate authors. It proceeds as follows:
1. We extract the $n$ most common words ($w_1, w_2, \ldots, w_n$) in the entire corpus (i.e. $\{t_0, t_1, \ldots, t_m\}$).
2. Each text $t_i$ is represented as a vector
$$\mathbf{t}_i = \big( z_i(w_1), z_i(w_2), \ldots, z_i(w_n) \big) \qquad (1)$$
where $z_i(w_j)$ stands for the $z$-score of the relative frequency of word $w_j$ in the text $t_i$.
3. The stylistic dissimilarity between $t_0$ and $t_i$ (the Delta measure $\Delta$) is calculated as the mean of the absolute differences between the $z$-scores of the frequencies of particular words in $t_0$ and $t_i$:
$$\Delta(t_0, t_i) = \frac{1}{n} \sum_{j=1}^{n} \left| z_0(w_j) - z_i(w_j) \right| \qquad (2)$$
4.
The target text is attributed to the candidate which shows the least stylistic dissimilarity from the target, i.e. yields the lowest value of .
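The four steps above can be sketched in a few lines of Python. This is a toy illustration with invented frequency data; all names are mine, not Burrows’:

```python
import statistics

def z_scores(freqs_per_text):
    """Convert relative word frequencies to z-scores across the corpus (step 2)."""
    n_words = len(freqs_per_text[0])
    means = [statistics.mean(t[j] for t in freqs_per_text) for j in range(n_words)]
    sds = [statistics.stdev(t[j] for t in freqs_per_text) for j in range(n_words)]
    return [[(t[j] - means[j]) / sds[j] for j in range(n_words)] for t in freqs_per_text]

def delta(z_target, z_candidate):
    """Burrows' Delta: mean absolute difference of z-scores (Eq. 2)."""
    return statistics.mean(abs(a - b) for a, b in zip(z_target, z_candidate))

# toy corpus: relative frequencies of the 3 most common words in 3 texts
target, cand_a, cand_b = z_scores([
    [0.050, 0.030, 0.020],   # t0, the disputed text
    [0.048, 0.031, 0.021],   # candidate author A
    [0.030, 0.045, 0.010],   # candidate author B
])
# step 4: attribute t0 to the candidate with the lowest Delta
assert delta(target, cand_a) < delta(target, cand_b)
```

With real data the vectors would of course cover hundreds of most frequent words rather than three.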
As Shlomo Argamon [1] has shown, so long as the Delta serves solely as a ranking metric, the division by $n$ (a constant, the number of words analysed) is irrelevant, as it in no way affects the ranking of candidate authors. The formula may thus be simplified as the Manhattan distance ($d_1$) between the vectors $\mathbf{t}_0$ and $\mathbf{t}_i$:
$$d_1(\mathbf{t}_0, \mathbf{t}_i) = \sum_{j=1}^{n} \left| z_0(w_j) - z_i(w_j) \right| = n \, \Delta(t_0, t_i) \qquad (3)$$
Finding the candidate author with the lowest Delta value turns out, then, to mean finding the nearest neighbour according to the Manhattan metric.
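Argamon’s observation is easy to verify numerically: dividing every distance by the same constant never reorders the candidates. A minimal sketch with made-up $z$-score vectors:

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def burrows_delta(u, v):
    """Delta = Manhattan distance divided by the constant n (Eq. 2)."""
    return manhattan(u, v) / len(u)

t0 = [0.1, -0.5, 1.2]
candidates = {"A": [0.0, -0.4, 1.0], "B": [1.5, 0.3, -0.2]}
rank_delta = sorted(candidates, key=lambda c: burrows_delta(t0, candidates[c]))
rank_l1 = sorted(candidates, key=lambda c: manhattan(t0, candidates[c]))
assert rank_delta == rank_l1  # division by n never changes the order
```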
Several modifications of Burrows’ Delta have been proposed. Along with the original metric, two such changes have become somewhat standard in authorship recognition studies (cf. e.g. [5]):
1. The Quadratic Delta ($\Delta_Q$), as proposed in the above-mentioned article by Argamon, which replaces the Manhattan distance with the Euclidean distance ($d_2$), or more precisely with its square:
$$\Delta_Q(t_0, t_i) = \sum_{j=1}^{n} \left( z_0(w_j) - z_i(w_j) \right)^2 = d_2(\mathbf{t}_0, \mathbf{t}_i)^2 \qquad (4)$$
2.
The Cosine Delta () as suggested by Smith and Aldrigde, [9] which is based on the size of the angle between the vectors (cosine similarity):
(5)
The Manhattan metric, Euclidean metric and cosine similarity are illustrated in Fig. 1.
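For concreteness, the three notions of (dis)similarity can be written out directly. A sketch; the function names are mine:

```python
import math

def manhattan(u, v):
    """d1: sum of absolute coordinate differences (Eq. 3)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean_sq(u, v):
    """d2 squared: sum of squared coordinate differences (Eq. 4)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine_dissimilarity(u, v):
    """1 minus the cosine of the angle between u and v (the basis of Eq. 5)."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

# parallel vectors differ in length but not in direction:
u, v = [1.0, 2.0], [2.0, 4.0]
assert abs(cosine_dissimilarity(u, v)) < 1e-12
assert manhattan(u, v) == 3.0 and euclidean_sq(u, v) == 5.0
```

The last two assertions illustrate the contrast the figure is meant to convey: the angular measure ignores differences in vector magnitude that both distance metrics register.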

3 Yule’s Word-Initial Character Method
Now let us go back several decades. In 1944, George Udny Yule published his book The Statistical Study of Literary Vocabulary, which is now widely recognised for introducing Yule’s characteristic $K$ (probably the earliest metric of vocabulary richness). This work, however, also contained a little-known chapter in which Yule proposed using the frequencies of word-initial characters to discern authorship.
Yule mentions [10, p.183] that he stumbled on this method quite by accident. His survey of vocabulary richness involved a large card catalogue of the nouns found in particular texts. When two drawers were opened at once—the first containing cards on John Bunyan, the second cards on three essays by Thomas Macaulay—he noticed that the distributions were substantially different. A brief inspection of the drawer for another Macaulay essay showed a card distribution similar to that for the author’s other three essays. This led Yule to consider using frequencies of word-initial characters for the purpose of authorship recognition.
Yule tested this approach with samples from Bunyan’s and Macaulay’s respective works. In particular, he investigated whether ranking word-initial characters by their frequencies in a sample by author A produced a result more closely resembling the one for the rest of A’s data than the one for B’s data. In other words, he considered Bunyan’s works ($B$), Macaulay’s works ($M$), a sample ($S$) extracted from one of them, and the 26 letters of the English alphabet ($c_1, c_2, \ldots, c_{26}$). Here the sample and both sets of works are represented by the vectors
$$\mathbf{x} = \big( r_x(c_1), r_x(c_2), \ldots, r_x(c_{26}) \big), \qquad x \in \{S, B, M\},$$
where $r_x(c_j)$ stands for the rank of $c_j$ in the frequency-rank distribution of the sample/set of works $x$.
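Such rank vectors are straightforward to reproduce. A sketch on toy text; the helper name is mine, and ties are broken alphabetically here rather than averaged, as Yule did:

```python
import string

def initial_letter_ranks(text):
    """Rank the 26 letters by how often they begin a word (rank 1 = most common).
    Ties are broken alphabetically; Yule instead averaged tied ranks."""
    counts = {c: 0 for c in string.ascii_lowercase}
    for word in text.lower().split():
        if word and word[0] in counts:
            counts[word[0]] += 1
    # sort letters by descending frequency and assign ranks 1..26
    ordered = sorted(counts, key=counts.get, reverse=True)
    return [ordered.index(c) + 1 for c in string.ascii_lowercase]

ranks = initial_letter_ranks("the pilgrim went the way the wind was blowing")
# 'w' begins the most words (went, way, wind, was), so it gets rank 1
assert ranks[string.ascii_lowercase.index("w")] == 1
```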
The goal was to determine which candidate vector was more similar to $\mathbf{S}$. Importantly, in pursuing this inquiry, Yule diverged from the then standard stylometric practice of comparing isolated pairs of values. Instead, he aimed to compare the vectors as a whole. He explained the procedure as follows:
We write down the differences of the ranks in Bunyan sample A from the ranks in the total Bunyan vocabulary, paying no attention to sign; the sum at the foot is a rough measure of the badness of agreement between the sample ranking and for the total of Bunyan vocabulary. In exactly the same way we enter […] the differences between the sample A ranking and the ranking for the total Macaulay vocabulary, and enter the sum, without regard to sign, at the foot. These respective sums are 10 and 37: we have found that the ranking of the given sample differs much less from that of the Bunyan vocabulary than from that of the Macaulay vocabulary, and are left in practically no doubt that the given sample (if we did not know from which author it had come) should be assigned to Bunyan. [10, p.190]
What Yule describes as “the sum at the foot” based on the “differences of the ranks […] paying no attention to sign” is nothing other than what we now call the Manhattan distance between the rank vectors, here $\mathbf{S}$ and $\mathbf{B}$ or $\mathbf{S}$ and $\mathbf{M}$:
$$d_1(\mathbf{S}, \mathbf{x}) = \sum_{j=1}^{26} \left| r_S(c_j) - r_x(c_j) \right| \qquad (6)$$
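Yule’s decision rule then amounts to the following. A toy sketch with four-letter rank vectors standing in for the 26-letter ones:

```python
def rank_distance(sample, author):
    """Yule's 'sum at the foot' (Eq. 6): Manhattan distance between rank vectors."""
    return sum(abs(a - b) for a, b in zip(sample, author))

sample = [1, 2, 3, 4]
author_a = [1, 2, 4, 3]   # nearly the same ranking -> small sum
author_b = [4, 3, 2, 1]   # reversed ranking -> large sum
assert rank_distance(sample, author_a) < rank_distance(sample, author_b)
# attribute the sample to author A, its nearest neighbour
```

In Yule’s own example the two sums were 10 (Bunyan) and 37 (Macaulay), leading to the same kind of verdict.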
Interestingly enough, Yule himself noted that this method, “though serving well to bring out the points required, [is] of a very elementary kind and the statistically minded reader may desire to see the results given by more general methods” [10, p.191]. For this purpose he also offers Spearman’s rank correlation coefficient:¹
¹ Yule noted: “I have followed the usual, but inexact, practice of using this formula even when some of the ranks have been averaged.” [10, p.191]
$$\rho(\mathbf{S}, \mathbf{x}) = 1 - \frac{6 \sum_{j=1}^{26} \left( r_S(c_j) - r_x(c_j) \right)^2}{N (N^2 - 1)} \qquad (7)$$
Notice that the numerator of the fraction in formula 7 equals six times the squared Euclidean distance between $\mathbf{S}$ and $\mathbf{x}$. Since $N$ is a constant (the number of vector space dimensions, $N = 26$), ranking the candidates by increasing value of Spearman’s rank correlation coefficient necessarily yields the same result as ranking them by decreasing Euclidean distance ($d_2$):
$$\rho(\mathbf{S}, \mathbf{x}) = 1 - \frac{6 \, d_2(\mathbf{S}, \mathbf{x})^2}{N (N^2 - 1)} \qquad (8)$$
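This equivalence is easy to check numerically. A sketch assuming untied integer ranks, for which the textbook Spearman formula holds exactly:

```python
def spearman_rho(u, v):
    """Eq. 7 for untied ranks over N = len(u) items."""
    n = len(u)
    d_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

def euclidean_sq(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

sample = [1, 2, 3, 4, 5]
candidates = [[1, 2, 3, 5, 4], [5, 4, 3, 2, 1], [3, 2, 1, 4, 5]]
by_rho = sorted(candidates, key=lambda c: spearman_rho(sample, c), reverse=True)
by_dist = sorted(candidates, key=lambda c: euclidean_sq(sample, c))
assert by_rho == by_dist  # decreasing rho <=> increasing Euclidean distance
```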
4 Discussion
Although Yule’s feature set (word-initial characters) may seem deficient and arbitrarily chosen from a contemporary perspective, the classification methods he employed were ahead of his time. In particular, his study implicitly introduced the nearest neighbour decision rule decades before its appearance in stylometry and some twenty years before it was formally established in pattern recognition (see [4]).
On the other hand, we should not overstate Yule’s contribution. The Manhattan metric is a highly intuitive way of comparing multidimensional data (a simple sum of the absolute values of differences) and might be arrived at even without considering its geometrical implications. (Indeed, not even Burrows was initially aware of it; as noted above, it was Shlomo Argamon who connected the dots.) The relationship between the Euclidean metric and Spearman’s rank correlation coefficient is only indirect. Nevertheless, this study remains noteworthy as an early instance of the multivariate approach in stylometry.
References
- [1] Shlomo Argamon “Interpreting Burrows’s Delta: Geometric and probabilistic foundations” In Literary and Linguistic Computing 23.2, 2008, pp. 131–147 DOI: 10.1093/llc/fqn003
- [2] John Frederick Burrows “‘Delta’: a measure of stylistic difference and a guide to likely authorship” In Literary and Linguistic Computing 17.3, 2002, pp. 267–287 DOI: 10.1093/llc/17.3.267
- [3] John Frederick Burrows “Questions of authorship: attribution and beyond” In Computers and the Humanities 37.1, 2003, pp. 5–32 DOI: 10.1023/A:1021814530952
- [4] T. Cover and P. Hart “Nearest neighbor pattern classification” In IEEE Transactions on Information Theory 13.1, 1967, pp. 21–27
- [5] Stefan Evert et al. “Understanding and explaining Delta measures for authorship attribution” In Digital Scholarship in the Humanities 32.suppl2, 2017, pp. ii4–ii16 DOI: 10.1093/llc/fqx023
- [6] David Holmes and Judit Kardos “Who was the author? An introduction to stylometry” In Chance 16.2, 2003, pp. 5–8 DOI: 10.1080/09332480.2003.10554842
- [7] Moshe Koppel, Jonathan Schler and Shlomo Argamon “Computational methods in authorship attribution” In Journal of the Association for Information Science and Technology 60.1, 2009, pp. 9–26 DOI: 10.1002/asi.20961
- [8] Frederick Mosteller and David Wallace “Inference and Disputed Authorship: The Federalist” Reading: Addison-Wesley, 1964
- [9] P. W. H. Smith and W. Aldridge “Improving authorship attribution: Optimizing Burrows’ Delta method” In Journal of Quantitative Linguistics 18.1, 2011, pp. 63–88 DOI: 10.1080/09296174.2011.533591
- [10] George Udny Yule “The Statistical Study of Literary Vocabulary” Cambridge: Cambridge University Press, 1944