
Traditional Readability Formulas Compared for English

Bruce W. Lee (1,2)
University of Pennsylvania (1)
Pennsylvania, USA
[email protected]

Jason Hyung-Jong Lee (2)
LXPER AI Research (LAIR) (2)
Seoul, South Korea
[email protected]
Abstract

Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications, presumably because of their convenience and straightforwardness. In this work, we contribute to the NLP community by 1. introducing the New English Readability Formula (NERF), 2. recalibrating the coefficients of "old" readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluating the readability formulas for use in text simplification studies and on medical texts, and 4. developing a Python-based program for wide application across NLP projects.

1 Introduction

Readability Assessment (RA) quantitatively measures the ease of understanding, or comprehension, of any written text (Feng et al., 2010; Klare, 2000). Understanding text readability, or difficulty, is essential for research on any originated, studied, or shared idea (Collins-Thompson, 2014). This inherent property leads to RA's close application to various areas of healthcare (Wu et al., 2013), education (Dennis, 2018), communication (Zhou et al., 2017), and Natural Language Processing (NLP), such as text simplification (Aluisio et al., 2010).

Machine learning (ML) and transformer-based methods have been reasonably successful in RA. The RoBERTa-RF-T1 model by Lee et al. (2021) achieves 99% classification accuracy on the OneStopEnglish dataset (Vajjala and Lučić, 2018), and the BERT-based ReadNet model from Meng et al. (2020) achieves about 92% accuracy on the WeeBit dataset (Vajjala and Meurers, 2012). However, "traditional readability formulas" still seem to be actively used throughout the research published in popular NLP venues like ACL and EMNLP (Uchendu et al., 2020; Shardlow and Nawaz, 2019; Scarton and Specia, 2018; Schwartz et al., 2017; Xu et al., 2016). The tendency to opt for traditional readability formulas is likely due to their convenience and straightforwardness.

In this work, we hope to assist the NLP community by recalibrating five traditional readability formulas, originally developed on 20th-century military or technical documents. The formulas are adjusted for the modern, standard U.S. education curriculum. We utilize the Appendix B (Text Exemplars and Sample Performance Tasks) dataset provided by the U.S. Common Core State Standards (corestandards.org). Then, we evaluate the performance and applications of these formulas. Lastly, we develop a Python-based program for convenient application of the recalibrated versions.

However, traditional readability formulas lack wide linguistic coverage (Feng et al., 2010). Therefore, we create a new formula that is mainly motivated by the lexico-semantic and syntactic linguistic branches, as identified by Collins-Thompson (2014). From each branch, we search for the representative features. The resulting formula, named the New English Readability Formula (NERF), aims to give the most generally and commonly accepted approach to calculating English readability.

To sum up, we make the contributions below. The related public resources are in appendix A.

1. We recalibrate five traditional readability formulas to show higher prediction accuracy on modern texts in the U.S. curriculum.

2. We develop NERF, a generalized and easy-to-use readability assessment formula.

3. We evaluate and cross-compare six readability formulas on several datasets. These datasets are carefully selected to collectively represent the diverse audiences, education curricula, and reading levels.

4. We develop <Anonymous>, a fast open-source readability assessment software based on Python.

2 Related Work

The earliest attempt to "calculate" text readability was by Lively and Pressey (1923), in response to the practical problem of selecting science textbooks for high school students (DuBay, 2004). In the following decades, many well-known readability formulas were developed, including the Flesch-Kincaid Grade Level (Kincaid et al., 1975), Gunning Fog Count (or Index) (Gunning et al., 1952), SMOG Index (Mc Laughlin, 1969), Coleman-Liau Index (Coleman and Liau, 1975), and Automated Readability Index (Smith and Senter, 1967).

These formulas are mostly linear models with two or three variables, largely based on superficial properties of words or sentences (Feng et al., 2010). Hence, they can easily be combined with other systems, without the burden of a large trained model (Xu et al., 2016). This property has also proved helpful in research fields outside computational linguistics, with some applications directly related to public medical knowledge, such as measuring the difficulty of patient materials (Gaeta et al., 2021; van Ballegooie and Hoang, 2021; Bange et al., 2019; Haller et al., 2019; Hansberry et al., 2018; Kiwanuka et al., 2017).

3 Datasets

3.1 Common Core - Appendix B (CCB)

We use the CCB corpus to calibrate formulas. The article excerpts in CCB are divided into the categories of story, poetry, informational text, and drama. To simplify our approach, we limit our research to story-type texts, which leaves only 69 items to train with. However, these items come directly from the U.S. Common Core State Standards. Hence, we assume with confidence that the item classification is generally accepted in the U.S.

Properties   CCB    WBT    CAM    CKC    OSE   NSL
audience     Ntve   Ntve   ESL    ESL    ESL   Ntve
grade        K1-12  K2-10  A2-C2  S7-12  N/A   N/A
curriculum?  Yes    No     Yes    Yes    No    No
balanced?    No     Yes    Yes    No     Yes   No
#class       6      5      5      6      3     5
#item/class  11.5   625    60.0   554    189   2125
#word/item   362    213    508    117    669   752
#sent/item   25.8   17.0   28.4   54.0   35.6  50.9
Table 1: Statistics of the modified datasets, based on the respective original versions. S: South Korea grade; Ntve: native.

CCB is the only dataset that we use to calibrate our formulas. All datasets below are used mainly for feature selection.

3.2 WeeBit (WBT)

WBT, the largest native dataset available in RA, contains articles from the Weekly Reader magazine and the BBC-Bitesize website, targeted at readers of different age groups. In Table 1, we translate those age groups into the U.S. schools' K-* format. We downsample to 625 items/class, as per common practice.
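
For reference, class-balanced downsampling of this kind is a one-step operation in pandas; the file and column names below are hypothetical stand-ins, not our actual preprocessing script:

import pandas as pd

# Hypothetical WBT dump with "text" and "label" columns.
wbt = pd.read_csv("weebit.csv")
balanced = wbt.groupby("label").sample(n=625, random_state=42)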

3.3 Cambridge English (CAM)

CAM (Xia et al., 2016) classifies 300 items under the Common European Framework of Reference (CEFR) (Verhelst et al., 2001). The passages come from past reading tasks in the five main-suite Cambridge English exams (KET, PET, FCE, CAE, CPE), targeted at learners at the A2-C2 levels of CEFR.

3.4 Corpus of the Korean ELT (English Lang. Train.) Curriculum (CKC)

CKC (Lee and Lee, 2020b, a) is a less-explored dataset built upon the reading passages appearing in the Korean English education curriculum. The passages' classifications come from official Korean Ministry sources. CKC represents a non-native country's official ESL education curriculum.

3.5 OneStopEnglish (OSE)

OSE is a recently developed RA dataset. It targets ESL (English as a Second Language) learners and consists of three paraphrased versions of articles from The Guardian newspaper. Along with the original OSE dataset, we created a paired version (OSE-Pair). This variation has 189 items, and each item consists of an advanced, an intermediate, and an elementary version.

In addition, OSE-Sent is a sentence-paired version of OSE. The dataset consists of three parts: adv-ele (1,674 pairs), adv-int (2,166), and int-ele (2,154).

3.6 Newsela (NSL)

NSL (Xu et al., 2015) is a dataset developed particularly for text simplification studies. It consists of 1,130 articles, each re-written 4 times for children at different grade levels. We create a paired version (NSL-Pair) with 2,125 pairs.

3.7 ASSET

ASSET (Alva-Manchego et al., 2020) is a paired sentence dataset. The dataset consists of 360 sentences, with each item simplified 10 times.

4 Recalibration

4.1 Choosing Traditional Read. Formulas

We start by recalibrating five readability formulas. We considered Zhou et al. (2017) and Google Scholar citation counts to sort out the most popular traditional readability formulas. Further, to make a fair performance comparison with our adjusted variations, we choose formulas that were originally intended to output U.S. school grades but are based on 20th-century texts and test subjects.

Flesch-Kincaid Grade Level (FKGL) was developed primarily for U.S. Navy personnel. The readability levels of 18 passages from Navy technical training manuals were calculated. The criterion was that 50% of subjects with reading abilities at a specific level had to score ≥35% on a cloze test for a text item to be classified at that reading level. Responses from 531 Navy personnel were used.

\text{FKGL} = a\cdot\frac{\text{\#word}}{\text{\#sent}} + b\cdot\frac{\text{\#syllable}}{\text{\#word}} + c

where sent is sentence, and # refers to "count of."
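
Since FKGL (and FOGI and SMOG below) depends on syllable counts, any implementation needs a syllable counter. The vowel-group heuristic below is a common approximation offered for illustration; it is not necessarily the exact routine used in our software:

import re

def count_syllables(word: str) -> int:
    # Approximate syllables as runs of vowels, with a crude
    # correction for a trailing silent "e" (e.g., "make").
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)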

The genius of the Gunning Fog Index (FOGI) is the idea that word difficulty correlates highly with the number of syllables, a conclusion drawn from inspection of Dale's list of easy words (Zhou et al., 2017; Dale and Chall, 1948). The shortcoming of FOGI, however, is the over-generalization that "all" words with more than two syllables are difficult. Indeed, "banana" is quite an easy word.

\text{FOGI} = a\cdot\left(\frac{\text{\#word}}{\text{\#sent}} + b\cdot\frac{\text{\#difficult word}}{\text{\#word}}\right) + c

Simple Measure of Gobbledygook (SMOG) Index, known for its simplicity, resembles FOGI in that both use syllable counts to classify a word's difficulty; SMOG counts polysyllabic words, i.e., words of three or more syllables. Additionally, SMOG incorporates a square root instead of a linear regression model.

\text{SMOG} = a\cdot\sqrt{b\cdot\frac{\text{\#polysyllable word}}{\text{\#sent}}} + c

Coleman-Liau Index (COLE) is the least used of the five. Still, we could find multiple studies outside computational linguistics that partly depend on COLE (Kue et al., 2021; Szmuda et al., 2020; Joseph et al., 2020; Powell et al., 2020). The novelty of COLE is that it calculates readability without counting syllables, which was viewed as a time-consuming approach.

\text{COLE} = a\cdot 100\cdot\frac{\text{\#letter}}{\text{\#word}} + b\cdot 100\cdot\frac{\text{\#sent}}{\text{\#word}} + c

Automated Readability Index (AUTO) was developed for the U.S. Air Force to handle documents more technical than textbooks. Like COLE, AUTO relies on the number of letters per word instead of the more commonly used syllables per word. Another quirk is that non-integer scores are all rounded up.

\text{AUTO} = a\cdot\frac{\text{\#letter}}{\text{\#word}} + b\cdot\frac{\text{\#word}}{\text{\#sent}} + c

4.2 Recalibration & Performance

4.2.1 Traditional Formulas, Other Text Types

We only recalibrate formulas on the CCB dataset. As stated in section 3.1, we limit ourselves to CCB's story-type items. In a preliminary investigation, we obtained low r2 scores (<0.3, before and after recalibration) between the traditional readability formulas and poetry, informational text, and drama.

4.2.2 Details on Recalibration

We started with LingFeat (Lee et al., 2021), a large feature extraction package, and expanded it to include further necessary features. From the CCB texts, we extracted the surface-level features used in the traditional readability formulas (i.e., #letter/#word, #word/#sent, #syllable/#word) and put them in a dataframe.

CCB has 6 readability classes, given as ranges: K1, K2-3, K4-5, K6-8, K9-10, and K11-CCR (college and career readiness). During calibration and evaluation, we mapped these classes to K1, K2.5, K4.5, K7, K9.5, and K12 to model the general trend of CCB.

Using the class estimates as true labels and the created dataframe as features, we ran an optimization function to find the best coefficients (a, b, c in §4.1). We used non-linear least squares fitting (Virtanen et al., 2020). Additional details are available in appendix B.
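
As a minimal sketch of this step, scipy.optimize.curve_fit (the routine listed in appendix B) can recover (a, b, c) for FKGL from the surface features and class estimates. The arrays below are toy stand-ins, not our actual CCB values:

import numpy as np
from scipy.optimize import curve_fit

# Toy stand-ins for per-item surface features and estimated K-* labels.
words_per_sent = np.array([8.0, 12.0, 15.0, 20.0, 24.0])
syll_per_word = np.array([1.20, 1.30, 1.40, 1.50, 1.60])
grade = np.array([1.0, 2.5, 4.5, 7.0, 9.5])

def fkgl(X, a, b, c):
    # X is a tuple of feature arrays; curve_fit passes it through unchanged.
    wps, spw = X
    return a * wps + b * spw + c

(a, b, c), _ = curve_fit(fkgl, (words_per_sent, syll_per_word), grade)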

4.2.3 Coefficients & Performances

a) Coef.s    FKGL     FOGI    SMOG   COLE     AUTO
original-a   0.390    0.4000  1.043  0.05880  4.710
adjusted-a   0.1014   0.1229  2.694  0.03993  6.000
original-b   11.80    100.0   30.00  -0.2960  0.5000
adjusted-b   20.89    415.7   8.815  -0.4976  0.1035
original-c   -15.59   0.0000  3.129  -15.80   -21.43
adjusted-c   -21.94   1.866   3.367  -5.747   -19.61

b) Perf.              FKGL      FOGI     SMOG    COLE    AUTO
r2 score (original)   -0.03835  -0.3905  0.1613  0.4341  -0.5283
r2 score (adjusted)   0.4423    0.4072   0.3192  0.4830  0.4263
Pearson r (original)  0.5698    0.5757   0.5649  0.6800  0.5684
Pearson r (adjusted)  0.6651    0.6381   0.5649  0.6949  0.6529
Table 2: a) Original and adjusted coefficients. b) Performance on CCB, measured on the U.S. standard curriculum's K-* output. Rows marked (adjusted) are our new versions.

Table 2-a shows the original coefficients and the adjusted variations, with the adjusted values rounded to four significant figures. The adjusted traditional readability formulas are obtained by simply plugging these values into the formulas in section 4.1.
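
For example, plugging the adjusted FKGL coefficients from Table 2-a into the formula of section 4.1 gives the following (a minimal sketch, not the library's implementation):

def fkgl_adjusted(n_words: int, n_sents: int, n_syllables: int) -> float:
    # Adjusted coefficients from Table 2-a: a=0.1014, b=20.89, c=-21.94.
    return 0.1014 * (n_words / n_sents) + 20.89 * (n_syllables / n_words) - 21.94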

5 The New English Readability Formula

5.1 Criteria

Considering that the value of traditional readability formulas lies in serving as a generalized definition of readability for non-experts (section 1), what really matters is the set of included features. The coefficients (or weights) can be recalibrated anytime to fit a specific use. It is therefore important to first identify handcrafted linguistic features that universally affect readability. Additionally, to ensure breadth and usability, we set the following guides:

1. We avoid surface-level features that lack linguistic value (Feng et al., 2010), such as #letter/#word.

2. We include at most one linguistic feature from each linguistic subgroup. We use the classifications from Lee et al. (2021); Collins-Thompson (2014).

3. We stick to a simplistic linear equation format.

Score  Branch  Subgroup  LingFeat Code  Brief Explanation  CCB (r, rk)  WBT (r, rk)  CAM (r, rk)  CKC (r, rk)  OSE (r, rk)
43 LxSem Psycholinguistic as_AAKuL_C Kuperman Lemma AoA per Sent 0.540 25 0.505 1 0.722 42 0.711 4 0.601 25
43 LxSem Psycholinguistic as_AAKuW_C Kuperman Word AoA per Sent 0.537 28 0.503 2 0.722 43 0.711 6 0.602 24
40 LxSem Psycholinguistic at_AAKuW_C Kuperman Word AoA per Word 0.703 5 0.308 36 0.784 20 0.643 21 0.455 66
40 Synta Tree Structure as_TreeH_C Tree Height per Sent 0.550 21 0.341 30 0.686 51 0.699 9 0.541 44
40 Synta Part-of-Speech as_ContW_C # Content Words per Sent 0.534 29 0.453 13 0.667 56 0.688 14 0.544 43
39 LxSem Psycholinguistic at_AAKuL_C Kuperman Lemma AoA per Word 0.723 4 0.323 35 0.785 19 0.650 20 0.453 67
39 Synta Phrasal as_NoPhr_C # Noun Phrases per Sent 0.550 20 0.406 25 0.660 58 0.673 18 0.582 35
39 Synta Phrasal to_PrPhr_C Total # Prepositional Phrases 0.470 47 0.189 58 0.808 11 0.580 36 0.729 3
39 Synta Part-of-Speech as_FuncW_C # Function Words per Sent 0.468 48 0.471 8 0.662 57 0.673 17 0.614 19
38 LxSem Psycholinguistic to_AAKuL_C Total Sum Kuperman Lemma AoA 0.428 71 0.189 59 0.835 3 0.627 22 0.716 5
38 LxSem Psycholinguistic to_AAKuW_C Total Sum Kuperman Word AoA 0.427 72 0.189 60 0.835 4 0.625 23 0.715 6
36 Synta Phrasal as_PrPhr_C # Prepositional Phrases per Sent 0.513 35 0.417 23 0.607 70 0.608 28 0.590 34
36 LxSem Word Familiarity as_SbL1C_C SubtlexUS Lg10CD Value per Sent 0.467 49 0.430 20 0.612 69 0.699 10 0.533 45
35 LxSem Type Token Ratio CorrTTR_S Corrected Type Token Ratio 0.745 1 0.006 228 0.846 1 0.445 65 0.692 7
35 LxSem Word Familiarity as_SbL1W_C SubtlexUS Lg10WF Value per Sent 0.462 52 0.437 19 0.605 71 0.693 12 0.523 48
Table 3: Top 15 (score \geq 35) handcrafted linguistic features under Approach A. r: Pearson’s correlation between the feature and the dataset. rk: the feature’s correlation ranking on the specific dataset. Full version in appendix D.
Score  Branch  Subgroup  LingFeat Code  Brief Explanation  CCB (r, rk)  WBT (r, rk)  CAM (r, rk)  CKC (r, rk)  OSE (r, rk)
35 LxSem Psycholinguistic as_AAKuL_C Kuperman Lemma AoA per Sent 0.540 25 0.505 1 0.722 42 0.711 4 0.601 25
35 LxSem Psycholinguistic as_AAKuW_C Kuperman Word AoA per Sent 0.537 28 0.503 2 0.722 43 0.711 6 0.602 24
32 LxSem Psycholinguistic at_AAKuL_C Kuperman Lemma AoA per Word 0.723 2 0.323 35 0.785 42 0.650 22 0.453 67
32 LxSem Psycholinguistic at_AAKuW_C Kuperman Word AoA per Word 0.703 5 0.308 36 0.784 20 0.643 21 0.455 66
31 Synta Phrasal as_NoPhr_C # Noun Phrases per Sent 0.550 20 0.406 25 0.660 58 0.673 18 0.582 35
31 Synta Part-of-Speech as_ContW_C # Content Words per Sent 0.534 29 0.453 13 0.667 56 0.688 14 0.544 43
31 Synta Phrasal as_PrPhr_C # Prepositional Phrases per Sent 0.513 35 0.417 23 0.607 70 0.608 28 0.590 34
31 Synta Part-of-Speech as_FuncW_C # Function Words per Sent 0.468 48 0.471 8 0.662 57 0.673 17 0.614 19
31 LxSem Psycholinguistic to_AAKuL_C Total Sum Kuperman Lemma AoA 0.428 71 0.189 59 0.835 3 0.627 22 0.716 5
31 LxSem Psycholinguistic to_AAKuW_C Total Sum Kuperman Word AoA 0.427 72 0.189 60 0.835 4 0.625 23 0.715 6
30 LxSem Type Token Ratio CorrTTR_S Corrected Type Token Ratio 0.745 1 0.006 228 0.846 1 0.445 65 0.692 7
30 LxSem Variation Ratio CorrNoV_S Corrected Noun Variation-1 0.717 3 0.0858 131 0.842 2 0.406 78 0.612 21
30 Synta Tree Structure as_TreeH_C Tree Height per Sent 0.550 21 0.341 30 0.686 51 0.699 9 0.541 44
30 Synta Phrasal to_PrPhr_C Total # Prepositional Phrases 0.470 47 0.189 58 0.808 11 0.580 36 0.729 3
30 LxSem Word Familiarity as_SbL1C_C SubtlexUS Lg10CD Value per Sent 0.467 49 0.430 20 0.612 69 0.699 10 0.533 45
Table 4: Top 15 (score \geq 30) handcrafted linguistic features under Approach B. CorrNoV_S (italicized in the original) is the only feature not in Table 3.

5.2 Feature Extraction & Ranking

We utilize LingFeat for feature extraction. It is a public package that supports 255 handcrafted linguistic features across the branches of advanced semantics, discourse, syntax, lexico-semantics, and shallow traditional features. These branches further divide into 14 subgroups. We study the linguistically meaningful branches: discourse (entity density, entity grid), syntax (phrasal, tree structure, part-of-speech), and lexico-semantics (variation ratio, type token ratio, psycholinguistics, word familiarity).

After extracting the features from CCB, WBT, CAM, CKC, and OSE, we first create a feature performance ranking by Pearson's correlation, using Sci-Kit Learn (Pedregosa et al., 2011). We take two extra measures (Approaches A & B) to model the features' general performance across datasets. Each approach runs under a different premise:
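
As a sketch of the per-dataset ranking step (shown with pandas here for brevity; feats is a hypothetical dataframe of extracted numeric features plus a numeric readability label):

import pandas as pd

def rank_features(feats: pd.DataFrame, label_col: str = "label") -> pd.Series:
    # Absolute Pearson correlation of each feature with the label,
    # sorted so the strongest feature on this dataset comes first.
    corr = feats.corr(method="pearson")[label_col].drop(label_col).abs()
    return corr.sort_values(ascending=False)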

Premise A: "Human experts’ dataset creation and labeling are partially faulty. The weak performance of a feature in a dataset does not necessarily indicate its weak performance in other data settings".

Premise B: "All datasets are perfect. The weak performance of a feature in a dataset indicates the feature’s weakness to be used universally."

After 78 hours of running, we decided not to extract features from NSL. Computing details are in appendix E. Among the features included in LingFeat are traditional readability formulas, like FKGL and COLE. These formulas performed generally well, but a single killer feature, like type token ratio (TTR), often outperformed them. Traditional readability formulas and shallow traditional features are excluded from the rankings.

5.3 Approach A - Comparative Ranking

Under premise A, each dataset poses a different linguistic environment for feature performance. Further, premise A takes human error into consideration, accepting that data labeling is most likely inconsistent in some way. The literal correlation value itself is not too important under premise A.

\begin{aligned}
\text{NERF} &= \text{(analogous to) Lexical Difficulty} + \text{Syntactic Complexity} + \text{Lexical Richness} + \text{Bias}\\
&= \frac{0.04876\cdot\sum\text{Word Age-of-Acquisition} - 0.1145\cdot\sum\text{Word Familiarity}}{\text{\#Sentence}}\\
&\quad + \frac{0.3091\cdot\text{\#Content Word} + 0.1866\cdot\text{\#Noun Phrase} + 0.2645\cdot\text{Constituency Parse Tree Height}}{\text{\#Sentence}}\\
&\quad + \frac{1.1017\cdot\text{\#Unique Word}}{\sqrt{\text{\#Word}}} - 4.125
\end{aligned}

Equation: New English Readability Formula (NERF)

Rather, we look for features that perform better than others under the same test settings. Thus, approach A's rewarding system is rank-dependent: in a dataset, features ranked 1-10 are rewarded 10 points, ranks 11-20 get 9 points, ..., and ranks 91-100 get 1 point. Since there are five feature correlation rankings (one per dataset), the maximum score is 50. The results are in Table 3, in order of score.

5.4 Approach B - Absolute Correlation

Under premise B, the weak correlation of a feature in a dataset is solely due to the feature’s weakness to generalize. This is because all datasets are supposedly perfect. Hence, we only measure the feature’s absolute correlation across datasets.

Approach B's rewarding system is correlation-dependent: in a dataset, features with correlation values between 0.9-1.0 are rewarded 10 points, values between 0.8-0.89 get 9 points, ..., and values between 0.0-0.09 get 1 point. Like approach A, the maximum score is 50. The results are in Table 4.
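
Both rewarding systems reduce to simple bucketing. A sketch under the rules above, where we assume ranks beyond 100 score zero:

def score_a(rank: int) -> int:
    # Approach A: ranks 1-10 earn 10 points, 11-20 earn 9, ..., 91-100 earn 1.
    return max(0, 10 - (rank - 1) // 10)

def score_b(r: float) -> int:
    # Approach B: |r| in 0.9-1.0 earns 10 points, 0.8-0.89 earns 9, ...,
    # 0.0-0.09 earns 1.
    return min(int(abs(r) * 10) + 1, 10)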

5.5 Analysis & Manual Feature Selection

First and most noticeably, the top features under premises A & B are similar. In fact, the two results are almost replications of each other, except for minor changes in order. We initially set two premises to introduce differing views (and hence results) on feature rankings, planning to choose the features that perform well under both.

But there is a strong, seemingly inseparable correlation between the ranking-based (premise A) and correlation-based (premise B) approaches. CorrNoV_S (Corrected Noun Variation) was the only new top feature introduced under premise B.

Second, discourse-based features (mostly entity-related) performed too poorly to be used in our final NERF. As an exception, ra_NNToT_C (noun-noun transitions : total) scored 28 under premise A and 26 under premise B. On the other hand, a majority of lexico-semantic and syntactic features performed well throughout. This strongly suggests that universally effective readability features are most likely to be found in lexico-semantics or syntax.

Third, the difficulty of a document heavily depended on the difficulty of individual words. In detail, as_AAKuL_C, as_AAKuW_C, to_AAKuL_C, and to_AAKuW_C showed consistently high correlations across the five datasets. As shown in Section 3, these five datasets have different authors, target audiences, average lengths, labeling techniques, and numbers of classes. Each dataset had at least one of these features among its top 5 performers.

These four features come from the age-of-acquisition research by Kuperman et al. (2012), which now proves to be an important resource for RA. Such direct classification of word difficulty consistently outperformed frequency-based approaches like SubtlexUS (Brysbaert and New, 2009). Returning to feature selection, we follow the steps below.

1. From top to bottom, go through the rankings (Tables 3 & 4) and sort out the features that performed best in each linguistic subgroup.

2. Conduct step 1 on both rankings and compare the results. Through this process, we keep only the features that appear in both rankings.

The steps above produce the same results for both approaches A and B. The final selected features are as_AAKuL_C (psycholinguistic), as_TreeH_C (tree structure), as_ContW_C (part-of-speech), as_NoPhr_C (phrasal), as_SbL1C_C (word familiarity), and CorrTTR_S (type token ratio). CorrNoV_S (variation) only appeared under approach B, so we did not include it.

5.6 More on NERF & Calibration

The final NERF (shown above) comes in three parts. The first is lexico-semantics, which measures lexical difficulty. It combines the total sum of each word's age of acquisition (Kuperman's) and the sum of word familiarity scores (Lg10CD in SubtlexUS), divided by the number of sentences.

The second is syntactic complexity, which deals with how each sentence is structured. We look at the number of content words, the number of noun phrases, and the total sum of sentence tree heights. Here, content words (CW) are words that possess semantic content and contribute to the meaning of the specific sentence. Following LingFeat, we consider a word a content word if it has "NOUN", "VERB", "NUM", "ADJ", or "ADV" as a POS tag. A sentence's tree height (TH) is calculated from a constituency parse tree, which we obtained with the CRF parser (Zhang et al., 2020). The related algorithms from NLTK (Bird et al., 2009) were used to calculate tree height. The same CRF parser was also used to count noun phrase (NP) occurrences.
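
As a rough sketch of how TH and NP counts can be obtained, SuPar and NLTK combine as below. The model identifier and the .trees attribute follow SuPar's documented usage as we understand it and should be treated as assumptions that may differ across versions:

from nltk import Tree
from supar import Parser

# Assumed identifier of the released CRF constituency model.
parser = Parser.load("crf-con-en")

def tree_stats(tokens):
    # Parse one tokenized sentence, then read off tree height and NP count.
    dataset = parser.predict([tokens], verbose=False)
    tree = Tree.fromstring(str(dataset.trees[0]))
    n_noun_phrases = sum(1 for sub in tree.subtrees() if sub.label() == "NP")
    return tree.height(), n_noun_phrases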

The third is lexical richness, given through the type token ratio (TTR). This is the only part of NERF that is normalized by word count. TTR measures how many unique words appear relative to the total word count. TTR is often used as a measure of lexical richness (Malvern and Richards, 2012) and ranked best on two curriculum-based datasets (CCB and CAM). Importantly, these two datasets represent U.S. and U.K. school curricula, for which TTR seems a good evaluator. Interestingly, out of the five TTR variations from Lee et al. (2021); Vajjala and Meurers (2012), corrected TTR generalized particularly well.
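
Once the component counts are in hand, NERF itself is a direct transcription of the equation above; obtaining the counts requires the resources described in this section:

import math

def nerf(aoa_sum, familiarity_sum, n_content_words, n_noun_phrases,
         tree_height_sum, n_unique_words, n_words, n_sents):
    # Lexical difficulty: summed Kuperman AoA minus summed SubtlexUS
    # familiarity, per sentence.
    lexical_difficulty = (0.04876 * aoa_sum - 0.1145 * familiarity_sum) / n_sents
    # Syntactic complexity: content words, noun phrases, and tree heights,
    # per sentence.
    syntactic_complexity = (0.3091 * n_content_words + 0.1866 * n_noun_phrases
                            + 0.2645 * tree_height_sum) / n_sents
    # Lexical richness: corrected type token ratio.
    lexical_richness = 1.1017 * n_unique_words / math.sqrt(n_words)
    return lexical_difficulty + syntactic_complexity + lexical_richness - 4.125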

As in section 4.2, we use non-linear least squares fitting on CCB to calibrate NERF. The results match our expectations. For example, the coefficient for word familiarity, which measures how frequently a word is used in American English, is negative, since common words have faster lexical comprehension times (Brysbaert et al., 2011).

6 Evaluation, against Human

Metric                Human    NERF    FKGL      FOGI     SMOG    COLE    AUTO
MAE (original)        N.A.     N.A.    2.844     3.413    3.114   2.537   3.377
MAE (adjusted)        3.509    2.154   2.457     2.516    2.728   2.378   2.514
r2 score (original)   N.A.     N.A.    -0.03835  -0.3905  0.1613  0.4341  -0.5283
r2 score (adjusted)   -0.0312  0.5536  0.4423    0.4072   0.3192  0.4830  0.4263
Pearson r (original)  N.A.     N.A.    0.5698    0.5757   0.5649  0.6800  0.5684
Pearson r (adjusted)  0.0838   0.7440  0.6651    0.6381   0.5649  0.6949  0.6530
Table 5: Scores on CCB, measured on the U.S. standard curriculum's K-* output. Rows marked (adjusted) contain our new or adjusted versions; Human and NERF have no original variants.

Here, we check the human-perceived difficulty of each item in CCB. We used Amazon Mechanical Turk to ask U.S. Bachelor's degree holders, "Which U.S. grade does this text belong to?" Every item was answered by 10 different workers to ensure breadth. Details on the survey & datasets are in appendices B and C.

Table 5 compares the performance of NERF against the other traditional readability formulas and against human performance. The human predictions were made by U.S. Bachelor's degree holders living in the U.S.; the ten human predictions per item were averaged to obtain the final prediction, for comparison against CCB.

The recalibrated formulas show a particularly large increase in r2 score. This likely means that they capture the variance of the original CCB classifications much better than the original formulas. We believe this improvement stems from the change in datasets. The original formulas are mostly built on human tests over 20th-century military or technical documents, whereas the recalibration dataset (CCB) comes from a student-targeted school curriculum. Further, CCB is classified by trained professionals. Hence, the standards for readability can differ. The recalibrated versions are more suitable for analyzing modern general documents and giving K-* output by modernized standards.

MAE (Mean Absolute Error), r2 score, and Pearson's r improve once more with NERF. Even though the same dataset, fitting function, and evaluation techniques (no split, all train) were used, the critical difference was in the features. The shallow surface-level features from the traditional readability formulas also showed top rankings across all datasets but lacked linguistic coverage. Hence, NERF could capture more of the textual properties that lead to differences in readability.

Lastly, we observe that it is highly difficult for the general human population to guess the exact readability of a text. Out of 690 predictions, only 286 were correct. We carefully posit that this is because: 1. the concept of "readability" is vague, and 2. everyone goes through a different education. It may be easier to choose which item is more readable than to guess how readable an item is. For the general population, it is better to use a quantified model than to trust individual human judgment.

7 Evaluation, for Application

7.1 Text Simplification - Passage-based

All readability formulas, whether recalibrated or not, show near-perfect performance in ranking the simplicity of texts. On both OSE-Pair & NSL-Pair, we designed a simple task of ranking the simplicity of an item. Both paired datasets include multiple simplified versions of an original item, and each row consists of the various simplifications. A prediction is correct when the readability formula's outputs match the simplification levels (e.g., original: highest prediction, ..., simplest: lowest prediction).
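
A sketch of the correctness criterion described above, for the three-level OSE-Pair case:

def pair_accuracy(predictions):
    # predictions: list of (adv, int, ele) score triples, one per item.
    # A triple is correct when scores strictly decrease with simplification.
    correct = sum(adv > mid > ele for adv, mid, ele in predictions)
    return correct / len(predictions)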

In OSE-Pair, a correct prediction must properly rank three simplified items. NERF showed a meaningful improvement over the five traditional readability formulas before recalibration: NERF correctly ranked 98.7% of pairs, while the others stayed \leq95% (FKGL: 93.4%, FOGI: 92.6%, SMOG: 94.4%, COLE: 94.9%, AUTO: 92.6%). Recalibration generally helped the traditional readability formulas, but NERF still performed best (FKGL: 97.8%, FOGI: 97.1%, SMOG: 94.4%, COLE: 89.9%, AUTO: 95.8%).

In NSL-Pair, a correct prediction must properly rank five simplified items, a more difficult task than the previous one. Nonetheless, all six formulas achieved 100% accuracy, both before and after CCB recalibration. This hints that NSL-Pair is thoroughly simplified.

Readability formulas thus seem to perform well in ranking several simplifications at the passage level. But there certainly are limits. First, one must understand that estimating "how much simpler" a text is remains a much more difficult task (Table 5). Second, the good results could be because sufficient simplification was done; for more fine-grained simplifications, readability formulas may not be enough.

7.2 Text Simplification - Sentence-based

a) Adv-Ele             NERF   FKGL   FOGI   SMOG   COLE   AUTO
Accuracy (original)    N.A.   74.2%  64.9%  11.4%  66.0%  78.0%
Accuracy (adjusted)    77.4%  62.7%  51.8%  11.4%  71.1%  65.2%
b) Adv-Int             NERF   FKGL   FOGI   SMOG   COLE   AUTO
Accuracy (original)    N.A.   70.2%  63.0%  12.2%  63.6%  74.7%
Accuracy (adjusted)    77.8%  60.4%  51.3%  12.2%  67.7%  65.9%
c) Int-Ele             NERF   FKGL   FOGI   SMOG   COLE   AUTO
Accuracy (original)    N.A.   69.8%  61.3%  9.02%  61.9%  73.2%
Accuracy (adjusted)    73.1%  59.7%  48.9%  9.02%  66.5%  62.1%
Table 6: Scores on OSE-Sent. Rows marked (adjusted) contain our new or adjusted versions.

We were surprised that some existing text simplification studies directly use traditional readability formulas for sentence difficulty evaluation. Our results show that a formula-based approach is largely ineffective for evaluating single sentences.

We tested both the CCB-recalibrated and original formulas on ASSET. Here, a correct prediction must properly rank eleven items (the original and its ten simplifications). Despite the task difficulty, we anticipated some correct predictions, as there were 360 pairs. SMOG guessed 37 (after recalibration) and 89 (before recalibration) correctly out of 360, but all other formulas failed to make any correct prediction.

OSE-Sent poses an easier task. Since the dataset is divided into adv-int, adv-ele, and int-ele parts, the readability formulas only have to guess which of the two given sentences is more difficult. We do obtain some positive results, showing that readability formulas can be useful when only two sentences are compared. On ranking two sentences, NERF performs better by a large margin.

7.3 Medical Documents

Figure 1: NERF against the five traditional readability formulas, on medical texts.

We argue that NERF is effective in fixing the over-inflated difficulty predictions on medical texts. Such sudden inflation is widely reported (Zheng and Yu, 2017) as a common weakness of traditional readability formulas on medical documents.

The U.S. National Institutes of Health (NIH) recommends that patient documents be at or below K-6 difficulty. The most distinctive characteristic of medical documents is the use of lengthy medical terms, like otolaryngology, urogynecology, and rheumatology. This makes traditional, syllable-based formulas unreliable. NERF, in contrast, uses familiarity and age of acquisition to penalize and reward word difficulty.

A medical term not found in Kuperman's and SubtlexUS resources has no effect on lexical difficulty; it is simply labeled a content word. In traditional formulas, however, the repetitive use of medical terms (which is likely) results in an insensible inflation of text difficulty. When various distinct medical terms appear, NERF counts each as a unique word.

Among recent studies is Haller et al. (2019), which analyzed the readability of urogynecology patient education documents using FKGL, SMOG, and Fry Readability. We analyze the same 18 documents from the American Urogynecologic Society (AUGS), collected by manual OCR-based scraping. As Figure 1 shows, NERF helps regulate the traditional readability formulas' tendency to over-inflate on medical texts. An example of the collected resource is given in appendix B.

8 Conclusion

So far, we have recalibrated five traditional readability formulas and assessed their performance. We evaluated them on CCB and showed that the adjusted variations help traditional readability formulas give output more in line with CCB, a common English education curriculum used throughout the United States. Further, we evaluated the recalibrated formulas' application to text simplification research. On ranking passage difficulty, our recalibrated formulas showed good performance. However, the formulas performed poorly at ranking sentence difficulty, because they were calibrated on passage-length instances. We leave sentence difficulty ranking as an open task.

Apart from recalibrating traditional readability formulas, we also developed a new, linguistically rich readability formula named NERF. We showed that NERF can be much more useful for text simplification studies and for analyzing the readability of medical documents. Our paper also serves as a cross-comparison among readability metrics. Lastly, we developed a public Python-based software package for fast dissemination of the results.

9 Limitations

Our work's limitations mainly come from CCB. It is manifestly difficult to obtain a solid, gold-standard readability-labeled dataset from an officially accredited organization. CCB, the main dataset we used to calibrate the traditional readability formulas, has only 69 items available. Thus, we reasonably anticipate that variation in dialect, individual differences, and general ability cannot be captured.

However, we highlight that NERF is developed upon several more datasets that represent diverse backgrounds, audiences, and reading levels. Hence, we believe that NERF can counter some of the shallowness of the traditional readability formulas, despite its remaining weaknesses.

One aspect of readability formulas that has not been deeply investigated is how the output changes with text length. As we show in section 7, readability formulas fail to perform well on sentence-level items. But what about a passage of three sentences? Does performance depend on the average number of words in the recalibration dataset? Is there a sensible text-length range in which readability formulas work well? These are open questions we do not address in this work.

References

  • Aluisio et al. (2010) Sandra Aluisio, Lucia Specia, Caroline Gasperin, and Carolina Scarton. 2010. Readability assessment for text simplification. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–9.
  • Alva-Manchego et al. (2020) Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. arXiv preprint arXiv:2005.00481.
  • Bange et al. (2019) Matthew Bange, Eric Huh, Sherwin A Novin, Ferdinand K Hui, and Paul H Yi. 2019. Readability of patient education materials from radiologyinfo.org: has there been progress over the past 5 years? American Journal of Roentgenology, 213(4):875–879.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.
  • Brysbaert et al. (2011) Marc Brysbaert, Matthias Buchmeier, Markus Conrad, Arthur M Jacobs, Jens Bölte, and Andrea Böhl. 2011. The word frequency effect. Experimental psychology.
  • Brysbaert and New (2009) Marc Brysbaert and Boris New. 2009. Moving beyond kučera and francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for american english. Behavior research methods, 41(4):977–990.
  • Coleman and Liau (1975) Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.
  • Collins-Thompson (2014) Kevyn Collins-Thompson. 2014. Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics, 165(2):97–135.
  • Dale and Chall (1948) Edgar Dale and Jeanne S Chall. 1948. A formula for predicting readability: Instructions. Educational research bulletin, pages 37–54.
  • Dennis (2018) Murphy Odo Dennis. 2018. A comparison of readability and understandability in second language acquisition textbooks for pre-service efl teachers. Journal of Asia TEFL, 15(3):750–765.
  • DuBay (2004) William H DuBay. 2004. The principles of readability. Online Submission.
  • Feng et al. (2010) Lijun Feng, Martin Jansche, Matt Huenerfauth, and Noémie Elhadad. 2010. A comparison of features for automatic readability assessment.
  • Gaeta et al. (2021) Laura Gaeta, Edward Garcia, and Valeria Gonzalez. 2021. Readability and suitability of spanish-language hearing aid user guides. American Journal of Audiology, 30(2):452–457.
  • Gunning et al. (1952) Robert Gunning et al. 1952. Technique of clear writing.
  • Haller et al. (2019) Jasmine Haller, Zachary Keller, Susan Barr, Kristie Hadden, and Sallie S Oliphant. 2019. Assessing readability: are urogynecologic patient education materials at an appropriate reading level? Female pelvic medicine & reconstructive surgery, 25(2):139–144.
  • Hansberry et al. (2018) David R Hansberry, Michael D’Angelo, Michael D White, Arpan V Prabhu, Mougnyan Cox, Nitin Agarwal, and Sandeep Deshmukh. 2018. Quantitative analysis of the level of readability of online emergency radiology-based patient education resources. Emergency radiology, 25(2):147–152.
  • Honnibal and Johnson (2015) Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.
  • Joseph et al. (2020) Pradeep Joseph, Nicole A Silva, Anil Nanda, and Gaurav Gupta. 2020. Evaluating the readability of online patient education materials for trigeminal neuralgia. World Neurosurgery, 144:e934–e938.
  • Kincaid et al. (1975) J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Technical report, Naval Technical Training Command Millington TN Research Branch.
  • Kiwanuka et al. (2017) Elizabeth Kiwanuka, Raman Mehrzad, Adnan Prsic, and Daniel Kwan. 2017. Online patient resources for gender affirmation surgery: an analysis of readability. Annals of plastic surgery, 79(4):329–333.
  • Klare (2000) George R Klare. 2000. The measurement of readability: useful information for communicators. ACM Journal of Computer Documentation (JCD), 24(3):107–121.
  • Kue et al. (2021) Jennifer Kue, Dori L Klemanski, and Kristine K Browning. 2021. Evaluating readability scores of treatment summaries and cancer survivorship care plans. JCO Oncology Practice, pages OP–20.
  • Kuperman et al. (2012) Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 english words. Behavior research methods, 44(4):978–990.
  • Lee et al. (2021) Bruce W Lee, Yoo Sung Jang, and Jason Lee. 2021. Pushing on text readability assessment: A transformer meets handcrafted linguistic features. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10669–10686.
  • Lee and Lee (2020a) Bruce W. Lee and Jason Lee. 2020a. LXPER index 2.0: Improving text readability assessment model for L2 English students in Korea. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 20–24, Suzhou, China. Association for Computational Linguistics.
  • Lee and Lee (2020b) Bruce W. Lee and Jason Hyung-Jong Lee. 2020b. Lxper index: A curriculum-specific text readability assessment model for efl students in korea. International Journal of Advanced Computer Science and Applications, 11(8).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lively and Pressey (1923) Bertha A Lively and Sidney L Pressey. 1923. A method for measuring the vocabulary burden of textbooks. Educational administration and supervision, 9(7):389–398.
  • Malvern and Richards (2012) David Malvern and Brian Richards. 2012. Measures of lexical richness. The encyclopedia of applied linguistics.
  • Mc Laughlin (1969) G Harry Mc Laughlin. 1969. Smog grading-a new readability formula. Journal of reading, 12(8):639–646.
  • Meng et al. (2020) Changping Meng, Muhao Chen, Jie Mao, and Jennifer Neville. 2020. Readnet: A hierarchical transformer framework for web article readability analysis. Advances in Information Retrieval, 12035:33.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • Powell et al. (2020) Lauren E Powell, Emily S Andersen, and Andrea L Pozez. 2020. Assessing readability of patient education materials on breast reconstruction by major us academic institutions. Plastic and Reconstructive Surgery–Global Open, 8(9S):127–128.
  • Scarton and Specia (2018) Carolina Scarton and Lucia Specia. 2018. Learning simplifications for specific target audiences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 712–718.
  • Schwartz et al. (2017) H. Andrew Schwartz, Masoud Rouhizadeh, Michael Bishop, Philip Tetlock, Barbara Mellers, and Lyle Ungar. 2017. Assessing objective recommendation quality through political forecasting. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2348–2357, Copenhagen, Denmark. Association for Computational Linguistics.
  • Shardlow and Nawaz (2019) Matthew Shardlow and Raheel Nawaz. 2019. Neural text simplification of clinical letters with a domain specific phrase table. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 380–389.
  • Smith and Senter (1967) Edgar A Smith and RJ Senter. 1967. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories (US), pages 1–14.
  • Szmuda et al. (2020) T Szmuda, C Özdemir, S Ali, A Singh, MT Syed, and P Słoniewski. 2020. Readability of online patient education material for the novel coronavirus disease (covid-19): a cross-sectional health literacy study. Public Health, 185:21–25.
  • Uchendu et al. (2020) Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. 2020. Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384–8395, Online. Association for Computational Linguistics.
  • Vajjala and Lučić (2018) Sowmya Vajjala and Ivana Lučić. 2018. Onestopenglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications, pages 297–304.
  • Vajjala and Meurers (2012) Sowmya Vajjala and Detmar Meurers. 2012. On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the seventh workshop on building educational applications using NLP, pages 163–173.
  • van Ballegooie and Hoang (2021) Courtney van Ballegooie and Peter Hoang. 2021. Assessment of the readability of online patient education material from major geriatric associations. Journal of the American Geriatrics Society, 69(4):1051–1056.
  • Verhelst et al. (2001) N Verhelst, Piet Van Avermaet, S Takala, N Figueras, and B North. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.
  • Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
  • Weller et al. (2020) Orion Weller, Jordan Hildebrandt, Ilya Reznik, Christopher Challis, E Shannon Tass, Quinn Snell, and Kevin Seppi. 2020. You don’t have time to read this: An exploration of document reading time prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1789–1794.
  • Wes McKinney (2010) Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56 – 61.
  • Wu et al. (2013) Danny TY Wu, David A Hanauer, Qiaozhu Mei, Patricia M Clark, Lawrence C An, Jianbo Lei, Joshua Proulx, Qing Zeng-Treitler, and Kai Zheng. 2013. Applying multiple methods to assess the readability of a large corpus of medical documents. Studies in health technology and informatics, 192:647.
  • Xia et al. (2016) Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2016. Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22.
  • Xu et al. (2015) Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
  • Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
  • Zhang et al. (2020) Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In Proceedings of IJCAI, pages 4046–4053.
  • Zheng and Yu (2017) Jiaping Zheng and Hong Yu. 2017. Readability formulas and user perceptions of electronic health records difficulty: a corpus study. Journal of medical Internet research, 19(3):e59.
  • Zhou et al. (2017) Shixiang Zhou, Heejin Jeong, and Paul A Green. 2017. How consistent are the best-known readability equations in estimating the readability of design standards? IEEE Transactions on Professional Communication, 60(1):97–111.

Appendix A Public Resources We Developed

A.1 Python Library

A.1.1 As a Readability Tool

<Anonymous> supports six readability formulas: NERF, FKGL, FOGI, SMOG, COLE, and AUTO. All formulas other than NERF are also available in recalibrated variations. A particularly useful feature of this library is that all formulas are fitted to give the U.S. standard school grading system as output. Compared to some other traditional readability formulas, where a user has to refer to a table to understand the output, K-* based numbers are intuitive.

A.1.2 As a General Tool

We plan to expand <Anonymous> to support various menial tasks in text analysis, focusing on tasks that can be performed well with simplistic approaches. One feature we have already implemented is text reading time estimation. Weller et al. (2020) showed in a large-scale study that a commonly used rule of thumb for online reading estimates, 240 words per minute (WPM), gives better RMSE and MAE results than more modern approaches using XLNet (Yang et al., 2019), ELMo (Peters et al., 2018), and RoBERTa (Liu et al., 2019). We implement 175, 240, and 300 WPM.
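
Under this rule of thumb, reading time estimation is a one-line computation. A minimal sketch of what RT() (see A.1.3 below) computes, with the default and supported speeds following the text above:

def read_time(n_words: int, wpm: int = 240) -> float:
    # Estimated reading time in minutes; the library supports 175, 240,
    # and 300 WPM, with 240 as the rule-of-thumb default (assumed here).
    return n_words / wpm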

A.1.3 Basic Usage

For straightforward maintenance, we keep <Anonymous>'s architecture as simple as possible. Only a few steps are required of the user:

import <Anonymous>

new_object = <Anonymous>.request(…)

readability_score1 = new_object.NERF()
readability_score2 = new_object.FKGL()
readability_score3 = new_object.FOGI()
readability_score4 = new_object.SMOG()
readability_score5 = new_object.COLE()
readability_score6 = new_object.AUTO()
time_to_read = new_object.RT()

NERF(), FKGL(), FOGI(), SMOG(), COLE(), AUTO(), and RT() are shortcut functions. It can be slightly faster to call their full forms directly:

new_english_readability_formula()
flesch_kincaid_grade_level()
fog_index()
smog_index()
coleman_liau_index()
automated_readability_index()
read_time()

Further, all readability formula functions (except NERF) have an option to choose the original or the adjusted variation. The default is adjusted = True.

A.1.4 <Anonymous> Calculation Speed

We care about the library's calculation speed so that it can be of practical use in research implementations. We chose the following items for evaluation.

ITEM A

In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind. There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king. Then, in addition to these concealed or public, secret or open wars, there were robbers, mendicants, Huguenots, wolves, and scoundrels, who made war upon everybody. The citizens always took up arms readily against thieves, wolves or scoundrels, often against nobles or Huguenots, sometimes against the king, but never against the cardinal or Spain. It resulted, then, from this habit that on the said first Monday of April, 1625, the citizens, on hearing the clamor, and seeing neither the red-and-yellow standard nor the livery of the Duc de Richelieu, rushed toward the hostel of the Jolly Miller. When arrived there, the cause of the hubbub was apparent to all.

The Three Musketeers, Alexandre Dumas

ITEM B

The vaccine contains lipids (fats), salts, sugars and buffers. COVID-19 vaccines do not contain eggs, gelatin (pork), gluten, latex, preservatives, antibiotics, adjuvants or aluminum. The vaccines are safe, even if you have food, drug, or environmental allergies. Talk to a health care provider first before getting a vaccine if you have allergies to the following vaccine ingredients: polyethylene glycol (PEG), polysorbate 80 and/or tromethamine (trometamol or Tris).

COVID-19 Vaccine Information Sheet, Ministry of Health, Ontario Canada

ITEM C

BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model”(MLM) pre-training objective, inspired by the Cloze task.

Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

a) ITEM A NERF FKGL FOGI SMOG COLE AUTO
item * 1 0.6371 0.0002 0.0001 0.0001 0.0000 0.0000
item * 5 2.6450 0.0006 0.0005 0.0004 0.0001 0.0001
item * 10 5.5175 0.0011 0.0010 0.0010 0.0004 0.0004
item * 15 7.8088 0.0016 0.0016 0.0013 0.0003 0.0004
item * 20 10.226 0.0021 0.0021 0.0018 0.0004 0.0004
b) ITEM B NERF FKGL FOGI SMOG COLE AUTO
item * 1 0.3531 0.0000 0.0000 0.0000 0.0000 0.0000
item * 5 1.2842 0.0003 0.0003 0.0002 0.0000 0.0000
item * 10 2.5178 0.0005 0.0005 0.0004 0.0001 0.0001
item * 15 3.6545 0.0009 0.0007 0.0006 0.0002 0.0002
item * 20 4.8308 0.0010 0.0010 0.0009 0.0002 0.0002
c) ITEM C NERF FKGL FOGI SMOG COLE AUTO
item * 1 0.1373 0.0000 0.0000 0.0000 0.0000 0.0000
item * 5 0.1888 0.0001 0.0000 0.0000 0.0000 0.0000
item * 10 0.2528 0.0002 0.0002 0.0002 0.0000 0.0000
item * 15 0.3420 0.0003 0.0003 0.0002 0.0000 0.0000
item * 20 0.3886 0.0004 0.0003 0.0003 0.0000 0.0000
Table 8: Speeds in seconds, on Items A, B and C.

First, it is clear that AUTO does a great job of keeping calculation time short for longer texts, as originally intended. Second, NERF's calculation time increases linearly with text length. Though we believe NERF's speed is decent given its wide linguistic coverage, speed is a weakness compared to the other readability formulas.

A.2 Research Archive

Our datasets, preprocessing codes and evaluation codes can be found in <Anonymous>. Copyrighted resources are given upon request to the first author.

Appendix B External Resources

B.1 Python Libraries

pandas v.1.3.4 (Wes McKinney, 2010)

Computations over Kuperman's AoA CSV and the SubtlexUS word familiarity CSV; managing and manipulating data. For feature study purposes, correlating and ranking features in Tables 3 and 4.

SuPar v.1.1.3 - CRF Parser

Constituency parsing on input sentences -> calculate tree height and count noun phrases.

spaCy v.3.2.0 (Honnibal and Johnson, 2015)

Sentence/dependency parsing on documents -> send input into SuPar and count content words (POS).

Sci-Kit Learn v.1.0.1

Calculation of r2 score and MAE in Tables 2 and 5.

SciPy v.1.7.3

Calculation of Pearson’s r for Tables 2 and 5. Fitting function (scipy.optimize.curve_fit()) used to recalibrate traditional readability formulas and give coefficients for NERF in Table 2.

NLTK v.3.6.5

Calculation of tree height for NERF.

LingFeat v.1.0.0-beta.19

Extraction of handcrafted linguistic features.

B.2 Datasets

New Class CCB WBT
K1.0 K1 (Age 6-7) N/A
K2.0 N/A Level 2 (Age 7-8)
K2.5 K2-3 (Age 7-9) N/A
K3.0 N/A Level 3 (Age 8-9)
K4.0 N/A Level 4 (Age 9-10)
K4.5 K4-5 (Age 9-11) N/A
K7.0 K6-8 (Age 11-14) KS3 (Age 11-14)
K9.5 K9-10 (Age 14-16) GCSE (Age 14-16)
K12.0 K11-CCR (Age 16+) N/A
Table 9: Age-based conversions for CCB and WBT.

We collected CCB by manually going through an official source (corestandards.org/assets/Appendix_B.pdf). WBT was obtained from the authors (Dr. Sowmya Vajjala, National Research Council, Canada) in HTML format. We conducted basic preprocessing and converted WBT to CSV format. CAM was retrieved from an existing archive (ilexir.co.uk/datasets/index.html). CKC was retrieved from a South Korean educational company (Bruce W. Lee, LXPER Inc., South Korea). OSE was retrieved from a public archive (github.com/nishkalavallabhi/OneStopEnglishCorpus). NSL was obtained from an American educational company (Luke Orland, Newsela Inc., New York, U.S.A.). AUGS medical texts (refer to Section 6.3) were manually scraped from the official website (augs.org/patient-fact-sheets/). ASSET was obtained from a public repository (github.com/facebookresearch/asset). Lastly, Table 9 shows how we converted WBT class labels to fit CCB, as reported in Table 1. All datasets were used consistently with their intended use.
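The conversion in Table 9 amounts to a simple lookup; the sketch below mirrors the WBT column of the table (the dict and function names are our own).

# Age-based WBT label -> new class conversion, mirroring Table 9.
WBT_TO_NEW_CLASS = {
    'Level 2': 2.0,   # Age 7-8
    'Level 3': 3.0,   # Age 8-9
    'Level 4': 4.0,   # Age 9-10
    'KS3':     7.0,   # Age 11-14
    'GCSE':    9.5,   # Age 14-16
}

def convert_wbt_label(label):
    return WBT_TO_NEW_CLASS[label]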

Further, to give more background on Section 6.2, we provide example pairs from ASSET and OSE-Sent.

ASSET

0: Gable earned an Academy Award nomination for portraying Fletcher Christian in Mutiny on the Bounty.

1: Gable also earned an Oscar nomination when he portrayed Fletcher Christian in 1935’s Mutiny on the Bounty.

2: Gable won an Academy Award vote when he acted in 1935’s Mutiny on the Bounty as Fletcher Christian.

3: Gable also won an Academy Award nomination when he played Fletcher Christian in the 1935 film Mutiny on the Bounty.

4: Gable was nominated for an Academy Award for portraying Fletcher Christian in 1935’s Mutiny on the Bounty.

5: Gable also earned an Academy Award nomination in 1935 for playing Fletcher Christian in "Mutiny on the Bounty.

6: Gable also earned an Academy Award nomination when he played Fletcher Christian in 1935’s Mutiny on the Bounty.

7: Gable recieved an Academy Award nomination for his role as Fletcher Christian. The film was Mutiny on the Bounty (1935).

8: Gable earned an Academy Award nomination for his role as Fletcher Christian in the 1935 film Mutiny on the Bounty.

9: Gable also got an Academy Award nomination when he played Fletcher Christian in 1935’s movie, Mutiny on the Bounty.

10: Gable also earned an Academy Award nomination when he portrayed Fletcher Christian in 1935’s Mutiny on the Bounty.

OSE-Sent (ADV-ELE)

ADV: The Seattle-based company has applied for its brand to be a top-level domain name (currently .com), but the South American governments argue this would prevent the use of this internet address for environmental protection, the promotion of indigenous rights and other public interest uses.

ELE: Amazon has asked for its company name to be a top-level domain name (currently .com), but the South American governments say this would stop the use of this internet address for environmental protection, indigenous rights and other public interest uses.

OSE-Sent (ADV-INT)

ADV: Brazils latest funk sensation, Anitta, has won millions of fans by taking the favela sound into the mainstream, but she is at the centre of a debate about skin colour.

INT: Brazils latest funk sensation, Anitta, has won millions of fans by making the favela sound popular, but she is at the centre of a debate about skin colour.

OSE-Sent (INT-ELE)

INT: Allowing private companies to register geographical names as gTLDs to strengthen their brand or to profit from the meaning of these names is not, in our view, in the public interest, the Brazilian Ministry of Science and Technology said.

ELE: Allowing private companies to register geographical names as gTLDs to profit from the meaning of these names is not, in our view, in the public interest, the Brazilian Ministry of Science and Technology said.

The following is an example of the AUGS medical documents used in Section 6.3 and Figure 1.

Interstitial Cystitis: Interstitial Cystitis/ Bladder Pain Syndrome Interstitial cystitis/bladder pain syndrome (IC/BPS) is a condition with symptoms including burning, pressure, and pain in the bladder along with urgency and frequency. About IC/BPS IC/BPS occurs in three to seven percent of women, and can affect men as well. Though usually diagnosed among women in their 40s, younger and older women have IC/BPS, too. It can feel like a constant bladder infection. Symptoms may become severe (called a "flare") for hours, days or weeks, and then disappear. Or, they may linger at a very low level during other times. Individuals with IC/BPS may also have other health issues such as irritable bowel syndrome, fibromyalgia, chronic headaches, and vulvodynia. Depression and anxiety are also common among women with this condition. The cause of IC/BPS is unknown. It is likely due to a combination of factors. IC/BPS runs in families and so may have a genetic factor. On cystoscopy, the doctor may see damage to the wall of the bladder. This may allow toxins from the urine to seep into the delicate layers of the bladder lining, causing the pain of IC/BPS. Other research found that nerves in and around the bladder of people with IC/BPS are hypersensitive. This may also contribute to IC/BPS pain. There may also be an allergic component.

Appendix C CCB Human Predictions

In Section 2.1, we mention that human predictions were collected on Amazon Mechanical Turk. In Table 5, we compared this human performance to the readability formulas. Here, we describe how the surveys were designed.

Description: workers must choose which difficulty level the text belongs to; they were reminded that "difficulty does not correlate with text length".

Qualification Requirement(s): Location is one of US; HIT Approval Rate (%) for all Requesters’ HITs greater than 80; Number of HITs Approved greater than 50; US Bachelor’s Degree equal to true; Masters has been granted.

All 69 story-type items from CCB were given. Each item had to be completed by at least 10 different individuals, resulting in 690 responses in total. Workers were given 6 representative examples. Payments were adequate, and workers were informed that their responses would be used for research.

Appendix D Handcrafted Linguistic Features and Their Respective Generalizability

We give the full generalizability rankings that we obtained through LingFeat. Considering that much work remains to be done on the generalizability of RA, we believe these rankings are particularly helpful. Table 10, Table 11, Table 12, Table 13, Table 14, Table 15, and Table 16 are expanded versions of Table 3 and Table 4. The features not shown scored 0.

From the full rankings, it is clear that shallow traditional (surface-level), lexico-semantic, and syntactic features are effective across all datasets. Advanced semantic and discourse features show somewhat similar mid-to-low performances. However, it should be acknowledged that some of the worst-performing features are also lexico-semantic and syntactic. This is perhaps because LingFeat itself has a collection of handcrafted linguistic features that is heavily focused on lexico-semantics and syntax. Thus, more study is needed.

Even when two features belong to the same subgroup (e.g., phrasal), they can show drastically different performances: # Noun phrases per Sent (as_NoPhr_C) scored 39 under Approach A, while # Verb phrases per Word (at_VePhr_C) scored 1. Hence, a thorough feature study must always be conducted during research. For feature selection in a readability-related model, we recommend cherry-picking the best-performing feature from each feature group (a sketch follows).
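A minimal sketch of this per-subgroup selection with pandas; the DataFrame columns mirror Tables 10-16, and the literal rows below are placeholders rather than the full ranking.

import pandas as pd

# Placeholder rows; in practice, load the full ranking from Tables 10-16.
ranking = pd.DataFrame({
    'subgroup': ['Phrasal', 'Phrasal', 'Shallow', 'Shallow'],
    'feature':  ['as_NoPhr_C', 'at_VePhr_C', 'as_Sylla_C', 'as_Token_C'],
    'score':    [39, 1, 43, 40],
})

# Keep the single best-scoring feature from each subgroup.
best_per_group = (ranking.sort_values('score', ascending=False)
                         .groupby('subgroup', as_index=False)
                         .first())
print(best_per_group)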

Appendix E Computing Power

Single CPU chip. Architecture: x86_64; CPU(s): 16; Model name: Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz; CPU MHz: 800.024

Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
43 ShaTr Shallow as_Sylla_C # syllables per Sent 0.541 24 0.461 10 0.686 50 0.697 11 0.59 31
43 LxSem Psycholinguistic as_AAKuL_C lemmas AoA of lemmas per Sent 0.54 25 0.505 1 0.722 42 0.711 4 0.601 25
43 ShaTr Shallow as_Chara_C # characters per Sent 0.539 27 0.487 4 0.696 46 0.711 5 0.613 20
43 LxSem Psycholinguistic as_AAKuW_C AoA of words per Sent 0.537 28 0.502 2 0.722 41 0.711 6 0.602 24
42 Synta Tree Structure as_FTree_C length of flattened Trees per Sent 0.505 37 0.485 5 0.677 54 0.719 2 0.622 16
40 LxSem Psycholinguistic at_AAKuW_C AoA of words per Word 0.703 5 0.308 36 0.784 20 0.643 21 0.455 66
40 Synta Tree Structure as_TreeH_C Tree height per Sent 0.55 21 0.341 30 0.686 51 0.699 9 0.541 44
40 Synta Part-of-Speech as_ContW_C # Content words per Sent 0.534 29 0.453 13 0.667 56 0.688 14 0.544 43
40 ShaTr Shallow as_Token_C # tokens per Sent 0.494 40 0.464 9 0.65 60 0.709 7 0.58 36
39 LxSem Psycholinguistic at_AAKuL_C lemmas AoA of lemmas per Word 0.723 2 0.323 35 0.785 19 0.65 20 0.453 67
39 Synta Phrasal as_NoPhr_C # Noun phrases per Sent 0.55 20 0.406 25 0.66 58 0.673 18 0.582 35
39 Synta Phrasal to_PrPhr_C total # prepositional phrases 0.47 47 0.189 58 0.808 11 0.58 36 0.729 3
39 Synta Part-of-Speech as_FuncW_C # Function words per Sent 0.468 48 0.471 8 0.662 57 0.673 17 0.614 19
38 LxSem Psycholinguistic to_AAKuL_C total lemmas AoA of lemmas 0.428 71 0.189 59 0.835 3 0.627 22 0.716 5
38 LxSem Psycholinguistic to_AAKuW_C total AoA (Age of Acquisition) of words 0.427 72 0.189 60 0.835 4 0.625 23 0.715 6
36 Synta Phrasal as_PrPhr_C # prepositional phrases per Sent 0.513 35 0.417 23 0.607 70 0.608 28 0.59 32
36 LxSem Word Familiarity as_SbL1C_C SubtlexUS Lg10CD value per Sent 0.467 49 0.43 20 0.612 69 0.699 10 0.533 45
35 LxSem Type Token Ratio CorrTTR_S Corrected TTR 0.745 1 0.006 228 0.846 1 0.445 65 0.692 7
35 LxSem Word Familiarity as_SbL1W_C SubtlexUS Lg10WF value per Sent 0.462 52 0.437 19 0.605 71 0.693 12 0.523 48
34 Synta Part-of-Speech as_NoTag_C # Noun POS tags per Sent 0.551 19 0.304 38 0.624 65 0.608 29 0.48 61
34 LxSem Psycholinguistic as_AACoL_C AoA of lemmas, Cortese and Khanna norm per Sent 0.532 30 0.339 32 0.649 61 0.597 32 0.499 58
34 LxSem Psycholinguistic as_AABrL_C lemmas AoA of lemmas, Bristol norm per Sent 0.532 31 0.339 31 0.649 62 0.597 31 0.499 57
34 LxSem Psycholinguistic to_AABrL_C total lemmas AoA of lemmas, Bristol norm 0.451 56 0.134 100 0.808 10 0.561 38 0.637 12
33 LxSem Psycholinguistic as_AABiL_C lemmas AoA of lemmas, Bird norm per Sent 0.459 55 0.458 11 0.582 73 0.653 19 0.443 69
33 Synta Phrasal to_NoPhr_C total # Noun phrases 0.416 76 0.148 84 0.809 8 0.527 52 0.659 9
33 Synta Part-of-Speech to_ContW_C total # Content words 0.402 81 0.163 71 0.804 14 0.558 40 0.654 11
32 LxSem Variation Ratio CorrNoV_S Corrected Noun Variation-1 0.717 3 0.086 131 0.842 2 0.406 78 0.612 21
32 LxSem Variation Ratio CorrVeV_S Corrected Verb Variation-1 0.602 11 0.058 155 0.801 15 0.393 86 0.737 2
32 LxSem Psycholinguistic to_AACoL_C total AoA of lemmas, Cortese and Khanna norm 0.451 57 0.134 101 0.808 9 0.561 39 0.637 13
32 Synta Part-of-Speech as_VeTag_C # Verb POS tags per Sent 0.428 70 0.476 6 0.578 74 0.588 34 0.505 55
32 Synta Tree Structure to_FTree_C total length of flattened Trees 0.396 87 0.166 69 0.805 12 0.538 49 0.676 8
31 LxSem Variation Ratio SquaNoV_S Squared Noun Variation-1 0.645 9 0.124 109 0.815 7 0.401 84 0.583 34
31 LxSem Variation Ratio CorrAjV_S Corrected Adjective Variation-1 0.591 12 0.078 134 0.779 21 0.422 70 0.584 33
31 Synta Part-of-Speech to_AjTag_C total # Adjective POS tags 0.441 62 0.191 57 0.777 23 0.504 54 0.525 46
30 LxSem Variation Ratio SquaVeV_S Squared Verb Variation-1 0.559 17 0.076 138 0.777 22 0.384 90 0.716 4
30 Synta Part-of-Speech to_NoTag_C total # Noun POS tags 0.441 61 0.129 107 0.805 13 0.55 44 0.636 15
30 Synta Phrasal as_VePhr_C # Verb phrases per Sent 0.383 90 0.455 12 0.59 72 0.586 35 0.505 54
29 LxSem Word Familiarity as_SbCDL_C SubtlexUS CDlow value per Sent 0.432 65 0.441 14 0.527 82 0.623 26 0.401 85
28 Synta Part-of-Speech as_AjTag_C # Adjective POS tags per Sent 0.506 36 0.353 28 0.553 76 0.533 51 0.404 84
28 Disco Entity Grid ra_NNTo_C ratio of nn transitions to total 0.476 44 0.078 135 0.754 35 0.451 64 0.602 23
28 Synta Tree Structure at_TreeH_C Tree height per Word 0.476 45 0.419 22 0.416 104 0.597 33 0.41 81
28 LxSem Word Familiarity as_SbCDC_C SubtlexUS CD# value per Sent 0.431 67 0.437 17 0.525 84 0.624 24 0.404 82
28 LxSem Word Familiarity as_SbSBC_C SubtlexUS SUBTLCD value per Sent 0.431 68 0.437 18 0.525 85 0.624 25 0.404 83
28 LxSem Word Familiarity to_SbL1C_C total SubtlexUS Lg10CD value 0.37 93 0.14 95 0.797 16 0.491 56 0.621 17
27 LxSem Variation Ratio SquaAjV_S Squared Adjective Variation-1 0.531 32 0.141 94 0.754 34 0.407 77 0.573 37
27 LxSem Word Familiarity as_SbFrL_C SubtlexUS FREQlow value per Sent 0.443 60 0.426 21 0.52 86 0.552 42 0.425 77
26 LxSem Word Familiarity as_SbSBW_C SubtlexUS SUBTLWF value per Sent 0.44 63 0.441 15 0.509 91 0.542 48 0.425 76
26 LxSem Word Familiarity as_SbFrQ_C SubtlexUS FREQ# value per Sent 0.44 64 0.441 16 0.509 90 0.542 47 0.425 75
26 LxSem Word Familiarity to_SbL1W_C total SubtlexUS Lg10WF value 0.365 99 0.144 93 0.795 17 0.477 58 0.611 22
25 LxSem Psycholinguistic to_AABiL_C total lemmas AoA of lemmas, Bird norm 0.365 98 0.155 79 0.786 18 0.473 59 0.565 39
25 LxSem Word Familiarity to_SbFrL_C total SubtlexUS FREQlow value 0.348 109 0.201 51 0.774 24 0.414 74 0.555 40
24 LxSem Word Familiarity to_SbFrQ_C total SubtlexUS FREQ# value 0.34 116 0.206 48 0.77 26 0.403 82 0.551 41
24 LxSem Word Familiarity to_SbSBW_C total SubtlexUS SUBTLWF value 0.34 115 0.206 47 0.77 27 0.403 81 0.551 42
23 ShaTr Shallow at_Sylla_C # syllables per Word 0.66 7 0.106 120 0.627 64 0.505 53 0.37 91
23 Synta Phrasal to_SuPhr_C total # Subordinate Clauses 0.367 96 0.202 50 0.721 43 0.462 61 0.419 78
23 Synta Phrasal to_VePhr_C total # Verb phrases 0.324 127 0.169 68 0.76 31 0.416 72 0.57 38
22 AdSem Wiki Knowledge WTopc15_S Number of topics, 150 topics extracted from Wiki 0.58 15 0.007 227 0.645 63 0.605 30 0.191 122
22 LxSem Variation Ratio CorrAvV_S Corrected AdVerb Variation-1 0.542 23 0.059 154 0.71 44 0.333 99 0.474 63
22 ShaTr Shallow at_Chara_C # characters per Word 0.443 59 0.2 52 0.619 67 0.402 83 0.443 68
22 Synta Part-of-Speech to_CoTag_C total # Coordinating Conjunction POS tags 0.364 101 0.268 43 0.728 39 0.406 80 0.434 72
22 Synta Part-of-Speech to_FuncW_C total # Function words 0.33 126 0.159 77 0.773 25 0.385 89 0.636 14
22 Synta Part-of-Speech to_VeTag_C total # Verb POS tags 0.288 138 0.173 63 0.738 38 0.383 91 0.597 27
21 AdSem Wiki Knowledge WTopc20_S Number of topics, 200 topics extracted from Wiki 0.584 14 0.015 214 0.616 68 0.617 27 0.137 138
20 LxSem Variation Ratio SquaAvV_S Squared AdVerb Variation-1 0.515 34 0.093 128 0.686 52 0.326 102 0.46 65
19 Synta Phrasal as_SuPhr_C # Subordinate Clauses per Sent 0.387 89 0.357 26 0.532 80 0.495 55 0.265 112
19 LxSem Word Familiarity to_SbCDL_C total SubtlexUS CDlow value 0.348 107 0.148 87 0.764 30 0.394 85 0.513 53
18 LxSem Type Token Ratio UberTTR_S Uber Index 0.646 8 0.041 174 0.369 112 0.109 173 0.599 26
18 AdSem Wiki Knowledge WTopc10_S Number of topics, 100 topics extracted from Wiki 0.52 33 0.004 229 0.532 79 0.552 43 0.075 180
18 AdSem Wiki Knowledge WNois20_S Semantic Noise, 200 topics extracted from Wiki 0.492 41 0.032 190 0.566 75 0.572 37 0.025 221
18 Synta Part-of-Speech to_SuTag_C total # Subordinating Conjunction POS tags 0.4 83 0.193 56 0.691 48 0.406 79 0.299 106
18 LxSem Word Familiarity to_SbSBC_C total SubtlexUS SUBTLCD value 0.347 111 0.146 91 0.764 28 0.392 88 0.515 52
18 LxSem Word Familiarity to_SbCDC_C total SubtlexUS CD# value 0.347 110 0.146 90 0.764 29 0.392 87 0.515 51
18 Synta Part-of-Speech to_AvTag_C total # Adverb POS tags 0.342 114 0.17 67 0.726 40 0.352 96 0.469 64
Table 10: Part A. The full generalizability ranking of handcrafted linguistic features under Approach A. r: Pearson’s correlation between the feature and the dataset. rk: the feature’s correlation ranking on the specific dataset.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
17 AdSem Wiki Knowledge WTopc05_S Number of topics, 50 topics extracted from Wiki 0.549 22 0.033 186 0.514 89 0.533 50 0.042 203
17 Synta Part-of-Speech as_AvTag_C # Adverb POS tags per Sent 0.32 129 0.292 41 0.526 83 0.43 67 0.415 79
16 LxSem Type Token Ratio BiLoTTR_S Bi-Logarithmic TTR 0.591 13 0.062 149 0.07 200 0.001 229 0.523 47
16 AdSem Wiki Knowledge WRich15_S Semantic Richness, 150 topics extracted from Wiki 0.495 39 0.02 208 0.48 95 0.549 45 0.037 209
16 Synta Part-of-Speech as_CoTag_C # Coordinating Conjunction POS tags per Sent 0.38 91 0.411 24 0.463 97 0.442 66 0.293 107
15 Synta Phrasal to_AvPhr_C total # Adverb phrases 0.356 105 0.17 66 0.705 45 0.298 111 0.432 73
15 ShaTr Shallow TokSenL_S log(total # tokens)/log(total # sentence) 0.293 137 0.352 29 0.297 130 0.544 46 0.198 121
14 AdSem Wiki Knowledge WRich20_S Semantic Richness, 200 topics extracted from Wiki 0.465 50 0.029 195 0.446 102 0.556 41 0.027 219
13 Synta Phrasal at_PrPhr_C # prepositional phrases per Word 0.57 16 0.133 103 0.316 124 0.323 105 0.366 92
13 Synta Phrasal ra_NoPrP_C ratio of Noun phrases # to Prep phrases # 0.477 43 0.149 83 0.34 120 0.345 97 0.389 87
13 Disco Entity Grid ra_SNTo_C ratio of sn transitions to total 0.448 58 0.019 210 0.514 88 0.196 133 0.518 49
13 LxSem Word Familiarity at_SbL1C_C SubtlexUS Lg10CD value per Word 0.408 78 0.161 75 0.541 78 0.204 130 0.392 86
13 Synta Part-of-Speech as_SuTag_C # Subordinating Conjunction POS tags per Sent 0.366 97 0.295 39 0.407 105 0.427 68 0.151 131
13 ShaTr Shallow TokSenS_S sqrt(total # tokens x total # sentence) 0.241 154 0.064 147 0.758 32 0.249 121 0.498 59
13 Synta Tree Structure to_TreeH_C total Tree height of all sentences 0.27 145 0.069 143 0.755 33 0.309 108 0.515 50
13 Synta Phrasal as_AvPhr_C # Adverb phrases per Sent 0.244 152 0.328 34 0.427 103 0.38 92 0.356 93
12 Disco Entity Grid ra_NSTo_C ratio of ns transitions to total 0.426 73 0.033 187 0.516 87 0.266 117 0.505 56
12 Synta Phrasal to_AjPhr_C total # Adjective phrases 0.339 120 0.182 62 0.682 53 0.327 101 0.271 111
11 AdSem Wiki Knowledge WNois05_S Semantic Noise, 50 topics extracted from Wiki 0.462 53 0.061 150 0.455 100 0.412 75 0.118 151
11 Synta Phrasal ra_PrNoP_C ratio of Prep phrases # to Noun phrases # 0.421 75 0.162 74 0.276 135 0.344 98 0.37 90
11 ShaTr Shallow TokSenM_S total # tokens x total # sentence 0.189 173 0.112 116 0.674 55 0.177 140 0.486 60
10 Synta Phrasal ra_VeNoP_C ratio of Verb phrases # to Noun phrases # 0.46 54 0.164 70 0.124 174 0.041 209 0.027 220
10 Disco Entity Density at_UEnti_C number of unique Entities per Word 0.127 197 0.307 37 0.548 77 0.253 119 0.124 149
9 LxSem Variation Ratio SimpNoV_S Noun Variation-1 0.499 38 0.087 130 0.038 212 0.031 213 0.337 95
9 Synta Part-of-Speech at_VeTag_C # Verb POS tags per Word 0.431 69 0.187 61 0.076 196 0.111 171 0.011 224
9 LxSem Word Familiarity at_SbL1W_C SubtlexUS Lg10WF value per Word 0.399 84 0.089 129 0.531 81 0.24 123 0.412 80
9 Synta Part-of-Speech ra_VeNoT_C ratio of Verb POS # to Noun POS # 0.397 86 0.198 53 0.234 142 0.171 142 0.067 186
9 LxSem Word Familiarity at_SbSBC_C SubtlexUS SUBTLCD value per Word 0.37 94 0.032 192 0.492 93 0.324 103 0.435 71
9 LxSem Word Familiarity at_SbCDC_C SubtlexUS CD# value per Word 0.37 95 0.032 191 0.492 94 0.324 104 0.435 70
9 Synta Phrasal as_AjPhr_C # Adjective phrases per Sent 0.323 128 0.239 46 0.387 106 0.357 95 0.157 127
9 AdSem WB Knowledge BClar15_S Semantic Clarity, 150 topics extracted from WeeBit 0.025 221 0.161 76 0.38 108 0.481 57 0.315 100
8 AdSem Wiki Knowledge WNois15_S Semantic Noise, 150 topics extracted from Wiki 0.388 88 0.033 188 0.454 101 0.454 63 0.006 226
8 Disco Entity Density at_EntiM_C number of Entities Mentions #s per Word 0.17 180 0.204 49 0.501 92 0.292 112 0.127 146
8 AdSem WB Knowledge BClar20_S Semantic Clarity, 200 topics extracted from WeeBit 0.004 227 0.147 88 0.3 129 0.462 60 0.308 104
7 Synta Phrasal ra_PrVeP_C ratio of Prep phrases # to Verb phrases # 0.485 42 0.055 157 0.184 158 0.189 136 0.219 117
7 LxSem Word Familiarity at_SbCDL_C SubtlexUS CDlow value per Word 0.362 102 0.047 166 0.474 96 0.31 107 0.431 74
7 Synta Part-of-Speech ra_CoNoT_C ratio of Coordinating Conjunction POS # to Noun POS # 0.02 224 0.277 42 0.159 163 0.013 222 0.132 142
7 Synta Part-of-Speech at_CoTag_C # Coordinating Conjunction POS tags per Word 0.218 161 0.267 44 0.02 220 0.111 172 0.087 169
7 Synta Part-of-Speech ra_NoCoT_C ratio of Noun POS # to Coordinating Conjunction # 0.022 222 0.254 45 0.019 221 0.053 201 0.109 157
6 Synta Phrasal ra_VePrP_C ratio of Verb phrases # to Prep phrases # 0.475 46 0.018 211 0.301 127 0.255 118 0.249 114
6 Disco Entity Grid ra_XNTo_C ratio of xn transitions to total 0.339 119 0.103 124 0.658 59 0.327 100 0.29 108
6 AdSem WB Knowledge BTopc15_S Number of topics, 150 topics extracted from WeeBit 0.133 193 0.146 92 0.209 151 0.416 73 0.03 217
6 LxSem Word Familiarity at_SbSBW_C SubtlexUS SUBTLWF value per Word 0.181 175 0.196 54 0.095 184 0.021 220 0.109 156
6 LxSem Word Familiarity at_SbFrQ_C SubtlexUS FREQ# value per Word 0.181 174 0.196 55 0.095 183 0.021 219 0.109 155
5 Synta Part-of-Speech ra_NoVeT_C ratio of Noun POS # to Verb POS # 0.432 66 0.118 111 0.149 168 0.112 170 0.051 197
5 AdSem Wiki Knowledge WRich10_S Semantic Richness, 100 topics extracted from Wiki 0.364 100 0.002 232 0.33 123 0.411 76 0.041 206
5 Disco Entity Grid ra_NXTo_C ratio of nx transitions to total 0.339 118 0.097 127 0.62 66 0.28 116 0.278 110
5 Synta Part-of-Speech at_FuncW_C # Function words per Word 0.28 142 0.04 175 0.181 159 0.461 62 0.032 215
5 AdSem WB Knowledge BTopc20_S Number of topics, 200 topics extracted from WeeBit 0.25 150 0.135 99 0.025 215 0.418 71 0.044 198
5 LxSem Variation Ratio SimpVeV_S Verb Variation-1 0.286 139 0.048 165 0.081 193 0.003 226 0.48 62
5 Synta Part-of-Speech ra_VeCoT_C ratio of Verb POS # to Coordinating Conjunction # 0.192 172 0.172 64 0.134 171 0.022 218 0.054 194
5 LxSem Word Familiarity at_SbFrL_C SubtlexUS FREQlow value per Word 0.176 178 0.171 65 0.061 203 0.001 228 0.09 165
4 Synta Phrasal at_NoPhr_C # Noun phrases per Word 0.424 74 0.066 146 0.089 188 0.005 224 0.042 202
4 LxSem Type Token Ratio SimpTTR_S unique tokens/total tokens (TTR) 0.375 92 0.025 200 0.367 113 0.163 147 0.344 94
4 AdSem Wiki Knowledge WNois10_S Semantic Noise, 100 topics extracted from Wiki 0.34 117 0.021 207 0.376 109 0.426 69 0.03 216
4 Synta Phrasal at_SuPhr_C # Subordinate Clauses per Word 0.204 165 0.157 78 0.246 140 0.314 106 0.073 182
4 Synta Phrasal ra_SuNoP_C ratio of Subordinate Clauses # to Noun phrases # 0.081 203 0.163 72 0.224 146 0.307 109 0.086 170
4 AdSem WB Knowledge BNois15_S Semantic Noise, 150 topics extracted from WeeBit 0.035 214 0.162 73 0.341 119 0.221 127 0.091 164
3 Synta Part-of-Speech ra_AjVeT_C ratio of Adjective POS # to Verb POS # 0.411 77 0.034 185 0.133 172 0.156 150 0.005 227
3 Synta Phrasal ra_NoVeP_C ratio of Noun phrases # to Verb phrases # 0.406 79 0.068 145 0.069 201 0.031 212 0.019 223
3 AdSem Wiki Knowledge WRich05_S Semantic Richness, 50 topics extracted from Wiki 0.405 80 0.063 148 0.347 117 0.301 110 0.035 211
3 Synta Phrasal ra_AvPrP_C ratio of Adv phrases # to Prep phrases # 0.4 82 0.014 217 0.222 147 0.196 135 0.115 152
3 LxSem Variation Ratio SimpAjV_S Adjective Variation-1 0.398 85 0.109 118 0.279 134 0.073 192 0.201 120
3 Synta Phrasal ra_NoSuP_C ratio of Noun phrases # to Subordinate Clauses # 0.157 185 0.153 80 0.228 145 0.052 205 0.04 207
3 Synta Part-of-Speech ra_NoAjT_C ratio of Noun POS # to Adjective POS # 0.121 199 0.152 81 0.125 173 0.114 169 0.004 228
3 Synta Part-of-Speech ra_SuNoT_C ratio of Subordinating Conjunction POS # to Noun POS # 0.085 202 0.149 82 0.039 211 0.155 151 0.158 126
3 AdSem WB Knowledge BNois20_S Semantic Noise, 200 topics extracted from WeeBit 0.129 196 0.148 85 0.202 153 0.167 144 0.032 214
2 Synta Phrasal ra_VeSuP_C ratio of Verb phrases # to Subordinate Clauses # 0.349 106 0.137 98 0.307 126 0.127 167 0.043 200
2 Synta Phrasal ra_SuVeP_C ratio of Subordinate Clauses # to Verb phrases # 0.345 113 0.052 160 0.343 118 0.376 93 0.083 172
2 Synta Part-of-Speech ra_CoFuW_C ratio of Content words to Function words 0.284 141 0.023 203 0.2 154 0.376 94 0.042 201
2 Disco Entity Grid ra_ONTo_C ratio of on transitions to total 0.333 123 0.04 178 0.288 133 0.06 199 0.383 88
2 Disco Entity Grid ra_NOTo_C ratio of no transitions to total 0.348 108 0.022 204 0.383 107 0.056 200 0.378 89
2 AdSem WB Knowledge BRich10_S Semantic Richness, 100 topics extracted from WeeBit 0.196 170 0.044 171 0.369 111 0.035 210 0.336 96
2 Disco Entity Density to_UEnti_C total number of unique Entities 0.308 134 0.132 105 0.3 128 0.023 216 0.31 102
2 Synta Part-of-Speech ra_AjCoT_C ratio of Adjective POS # to Coordinating Conjunction # 0.0 229 0.148 86 0.049 207 0.091 181 0.077 177
2 Synta Part-of-Speech ra_AjNoT_C ratio of Adjective POS # to Noun POS # 0.074 205 0.146 89 0.031 213 0.068 195 0.041 205
Table 11: Part B. The full generalizability ranking of handcrafted linguistic features under Approach A.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
1 Synta Part-of-Speech ra_SuVeT_C ratio of Subordinating Conjunction POS # to Verb POS # 0.36 103 0.053 159 0.109 177 0.282 115 0.137 139
1 Synta Part-of-Speech ra_AjAvT_C ratio of Adjective POS # to Adverb POS # 0.357 104 0.042 172 0.056 204 0.091 180 0.044 199
1 LxSem Psycholinguistic at_AABrL_C lemmas AoA of lemmas, Bristol norm per Word 0.333 124 0.029 194 0.462 98 0.284 113 0.217 118
1 LxSem Psycholinguistic at_AACoL_C AoA of lemmas, Cortese and Khanna norm per Word 0.333 125 0.029 193 0.462 99 0.284 114 0.217 119
1 AdSem WB Knowledge BNois10_S Semantic Noise, 100 topics extracted from WeeBit 0.193 171 0.036 180 0.37 110 0.161 149 0.33 97
1 AdSem WB Knowledge BNois05_S Semantic Noise, 50 topics extracted from WeeBit 0.158 184 0.011 219 0.351 116 0.15 153 0.325 98
1 AdSem WB Knowledge BTopc10_S Number of topics, 100 topics extracted from WeeBit 0.197 169 0.038 179 0.364 114 0.166 145 0.323 99
1 Disco Entity Density to_EntiM_C total number of Entities Mentions #s 0.139 191 0.02 209 0.335 122 0.0 230 0.312 101
1 AdSem WB Knowledge BRich05_S Semantic Richness, 50 topics extracted from WeeBit 0.126 198 0.051 162 0.24 141 0.051 207 0.309 103
1 LxSem Psycholinguistic at_AABiL_C lemmas AoA of lemmas, Bird norm per Word 0.203 166 0.11 117 0.266 138 0.053 202 0.302 105
1 Synta Tree Structure at_FTree_C length of flattened Trees per Word 0.28 143 0.14 96 0.097 182 0.1 177 0.152 130
1 Synta Phrasal at_VePhr_C # Verb phrases per Word 0.31 132 0.138 97 0.079 194 0.032 211 0.009 225
1 Synta Part-of-Speech ra_NoAvT_C ratio of Noun POS # to Adverb POS # 0.261 147 0.133 102 0.101 180 0.052 204 0.034 212
1 Synta Part-of-Speech ra_CoVeT_C ratio of Coordinating Conjunction POS # to Verb POS # 0.302 135 0.133 104 0.023 218 0.133 164 0.088 168
Table 12: Part C. The full generalizability ranking of handcrafted linguistic features under Approach A.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
35 LxSem Psycholinguistic as_AAKuL_C lemmas AoA of lemmas per Sent 0.54 25 0.505 1 0.722 42 0.711 4 0.601 25
35 LxSem Psycholinguistic as_AAKuW_C AoA of words per Sent 0.537 28 0.502 2 0.722 41 0.711 6 0.602 24
33 ShaTr Shallow as_Chara_C # characters per Sent 0.539 27 0.487 4 0.696 46 0.711 5 0.613 20
33 Synta Tree Structure as_FTree_C length of flattened Trees per Sent 0.505 37 0.485 5 0.677 54 0.719 2 0.622 16
32 LxSem Psycholinguistic at_AAKuL_C lemmas AoA of lemmas per Word 0.723 2 0.323 35 0.785 19 0.65 20 0.453 67
32 LxSem Psycholinguistic at_AAKuW_C AoA of words per Word 0.703 5 0.308 36 0.784 20 0.643 21 0.455 66
31 Synta Phrasal as_NoPhr_C # Noun phrases per Sent 0.55 20 0.406 25 0.66 58 0.673 18 0.582 35
31 ShaTr Shallow as_Sylla_C # syllables per Sent 0.541 24 0.461 10 0.686 50 0.697 11 0.59 31
31 Synta Part-of-Speech as_ContW_C # Content words per Sent 0.534 29 0.453 13 0.667 56 0.688 14 0.544 43
31 Synta Phrasal as_PrPhr_C # prepositional phrases per Sent 0.513 35 0.417 23 0.607 70 0.608 28 0.59 32
31 ShaTr Shallow as_Token_C # tokens per Sent 0.494 40 0.464 9 0.65 60 0.709 7 0.58 36
31 Synta Part-of-Speech as_FuncW_C # Function words per Sent 0.468 48 0.471 8 0.662 57 0.673 17 0.614 19
31 LxSem Psycholinguistic to_AAKuL_C total lemmas AoA of lemmas 0.428 71 0.189 59 0.835 3 0.627 22 0.716 5
31 LxSem Psycholinguistic to_AAKuW_C total AoA (Age of Acquisition) of words 0.427 72 0.189 60 0.835 4 0.625 23 0.715 6
30 LxSem Type Token Ratio CorrTTR_S Corrected TTR 0.745 1 0.006 228 0.846 1 0.445 65 0.692 7
30 LxSem Variation Ratio CorrNoV_S Corrected Noun Variation-1 0.717 3 0.086 131 0.842 2 0.406 78 0.612 21
30 Synta Tree Structure as_TreeH_C Tree height per Sent 0.55 21 0.341 30 0.686 51 0.699 9 0.541 44
30 Synta Phrasal to_PrPhr_C total # prepositional phrases 0.47 47 0.189 58 0.808 11 0.58 36 0.729 3
30 LxSem Word Familiarity as_SbL1C_C SubtlexUS Lg10CD value per Sent 0.467 49 0.43 20 0.612 69 0.699 10 0.533 45
30 LxSem Word Familiarity as_SbL1W_C SubtlexUS Lg10WF value per Sent 0.462 52 0.437 19 0.605 71 0.693 12 0.523 48
29 LxSem Variation Ratio SquaNoV_S Squared Noun Variation-1 0.645 9 0.124 109 0.815 7 0.401 84 0.583 34
29 LxSem Variation Ratio CorrVeV_S Corrected Verb Variation-1 0.602 11 0.058 155 0.801 15 0.393 86 0.737 2
29 Synta Part-of-Speech as_NoTag_C # Noun POS tags per Sent 0.551 19 0.304 38 0.624 65 0.608 29 0.48 61
29 LxSem Psycholinguistic to_AABrL_C total lemmas AoA of lemmas, Bristol norm 0.451 56 0.134 100 0.808 10 0.561 38 0.637 12
29 LxSem Psycholinguistic to_AACoL_C total AoA of lemmas, Cortese and Khanna norm 0.451 57 0.134 101 0.808 9 0.561 39 0.637 13
29 Synta Part-of-Speech to_NoTag_C total # Noun POS tags 0.441 61 0.129 107 0.805 13 0.55 44 0.636 15
29 Synta Phrasal to_NoPhr_C total # Noun phrases 0.416 76 0.148 84 0.809 8 0.527 52 0.659 9
29 Synta Part-of-Speech to_ContW_C total # Content words 0.402 81 0.163 71 0.804 14 0.558 40 0.654 11
28 LxSem Psycholinguistic as_AACoL_C AoA of lemmas, Cortese and Khanna norm per Sent 0.532 30 0.339 32 0.649 61 0.597 32 0.499 58
28 LxSem Psycholinguistic as_AABrL_C lemmas AoA of lemmas, Bristol norm per Sent 0.532 31 0.339 31 0.649 62 0.597 31 0.499 57
28 LxSem Psycholinguistic as_AABiL_C lemmas AoA of lemmas, Bird norm per Sent 0.459 55 0.458 11 0.582 73 0.653 19 0.443 69
28 LxSem Word Familiarity as_SbCDL_C SubtlexUS CDlow value per Sent 0.432 65 0.441 14 0.527 82 0.623 26 0.401 85
28 LxSem Word Familiarity as_SbCDC_C SubtlexUS CD# value per Sent 0.431 67 0.437 17 0.525 84 0.624 24 0.404 82
28 LxSem Word Familiarity as_SbSBC_C SubtlexUS SUBTLCD value per Sent 0.431 68 0.437 18 0.525 85 0.624 25 0.404 83
28 Synta Part-of-Speech as_VeTag_C # Verb POS tags per Sent 0.428 70 0.476 6 0.578 74 0.588 34 0.505 55
28 Synta Tree Structure to_FTree_C total length of flattened Trees 0.396 87 0.166 69 0.805 12 0.538 49 0.676 8
27 LxSem Variation Ratio SquaVeV_S Squared Verb Variation-1 0.559 17 0.076 138 0.777 22 0.384 90 0.716 4
27 LxSem Variation Ratio SquaAjV_S Squared Adjective Variation-1 0.531 32 0.141 94 0.754 34 0.407 77 0.573 37
27 Synta Part-of-Speech as_AjTag_C # Adjective POS tags per Sent 0.506 36 0.353 28 0.553 76 0.533 51 0.404 84
27 LxSem Word Familiarity as_SbFrL_C SubtlexUS FREQlow value per Sent 0.443 60 0.426 21 0.52 86 0.552 42 0.425 77
27 Synta Part-of-Speech to_AjTag_C total # Adjective POS tags 0.441 62 0.191 57 0.777 23 0.504 54 0.525 46
27 LxSem Word Familiarity as_SbSBW_C SubtlexUS SUBTLWF value per Sent 0.44 63 0.441 15 0.509 91 0.542 48 0.425 76
27 LxSem Word Familiarity as_SbFrQ_C SubtlexUS FREQ# value per Sent 0.44 64 0.441 16 0.509 90 0.542 47 0.425 75
27 Synta Phrasal as_VePhr_C # Verb phrases per Sent 0.383 90 0.455 12 0.59 72 0.586 35 0.505 54
26 ShaTr Shallow at_Sylla_C # syllables per Word 0.66 7 0.106 120 0.627 64 0.505 53 0.37 91
26 LxSem Variation Ratio CorrAjV_S Corrected Adjective Variation-1 0.591 12 0.078 134 0.779 21 0.422 70 0.584 33
26 Disco Entity Grid ra_NNTo_C ratio of nn transitions to total 0.476 44 0.078 135 0.754 35 0.451 64 0.602 23
26 Synta Tree Structure at_TreeH_C Tree height per Word 0.476 45 0.419 22 0.416 104 0.597 33 0.41 81
26 LxSem Word Familiarity to_SbL1C_C total SubtlexUS Lg10CD value 0.37 93 0.14 95 0.797 16 0.491 56 0.621 17
26 LxSem Word Familiarity to_SbL1W_C total SubtlexUS Lg10WF value 0.365 99 0.144 93 0.795 17 0.477 58 0.611 22
26 LxSem Word Familiarity to_SbFrL_C total SubtlexUS FREQlow value 0.348 109 0.201 51 0.774 24 0.414 74 0.555 40
26 LxSem Word Familiarity to_SbSBW_C total SubtlexUS SUBTLWF value 0.34 115 0.206 47 0.77 27 0.403 81 0.551 42
26 LxSem Word Familiarity to_SbFrQ_C total SubtlexUS FREQ# value 0.34 116 0.206 48 0.77 26 0.403 82 0.551 41
Table 13: Part A. The full generalizability ranking of handcrafted linguistic features under Approach B. r: Pearson’s correlation between the feature and the dataset. rk: the feature’s correlation ranking on the specific dataset.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
25 ShaTr Shallow at_Chara_C # characters per Word 0.443 59 0.2 52 0.619 67 0.402 83 0.443 68
25 Synta Phrasal to_SuPhr_C total # Subordinate Clauses 0.367 96 0.202 50 0.721 43 0.462 61 0.419 78
25 LxSem Psycholinguistic to_AABiL_C total lemmas AoA of lemmas, Bird norm 0.365 98 0.155 79 0.786 18 0.473 59 0.565 39
25 Synta Part-of-Speech to_CoTag_C total # Coordinating Conjunction POS tags 0.364 101 0.268 43 0.728 39 0.406 80 0.434 72
25 Synta Part-of-Speech to_FuncW_C total # Function words 0.33 126 0.159 77 0.773 25 0.385 89 0.636 14
25 Synta Phrasal to_VePhr_C total # Verb phrases 0.324 127 0.169 68 0.76 31 0.416 72 0.57 38
24 LxSem Variation Ratio CorrAvV_S Corrected AdVerb Variation-1 0.542 23 0.059 154 0.71 44 0.333 99 0.474 63
24 LxSem Word Familiarity to_SbCDL_C total SubtlexUS CDlow value 0.348 107 0.148 87 0.764 30 0.394 85 0.513 53
24 LxSem Word Familiarity to_SbCDC_C total SubtlexUS CD# value 0.347 110 0.146 90 0.764 29 0.392 87 0.515 51
24 LxSem Word Familiarity to_SbSBC_C total SubtlexUS SUBTLCD value 0.347 111 0.146 91 0.764 28 0.392 88 0.515 52
23 AdSem Wiki Knowledge WTopc20_S Number of topics, 200 topics extracted from Wikipedia 0.584 14 0.015 214 0.616 68 0.617 27 0.137 138
23 AdSem Wiki Knowledge WTopc15_S Number of topics, 150 topics extracted from Wikipedia 0.58 15 0.007 227 0.645 63 0.605 30 0.191 122
23 LxSem Variation Ratio SquaAvV_S Squared AdVerb Variation-1 0.515 34 0.093 128 0.686 52 0.326 102 0.46 65
23 Synta Part-of-Speech to_AvTag_C total # Adverb POS tags 0.342 114 0.17 67 0.726 40 0.352 96 0.469 64
23 Synta Part-of-Speech as_AvTag_C # Adverb POS tags per Sent 0.32 129 0.292 41 0.526 83 0.43 67 0.415 79
23 Synta Part-of-Speech to_VeTag_C total # Verb POS tags 0.288 138 0.173 63 0.738 38 0.383 91 0.597 27
22 Synta Phrasal as_SuPhr_C # Subordinate Clauses per Sent 0.387 89 0.357 26 0.532 80 0.495 55 0.265 112
22 Synta Part-of-Speech as_CoTag_C # Coordinating Conjunction POS tags per Sent 0.38 91 0.411 24 0.463 97 0.442 66 0.293 107
22 Synta Phrasal to_AvPhr_C total # Adverb phrases 0.356 105 0.17 66 0.705 45 0.298 111 0.432 73
22 Synta Tree Structure to_TreeH_C total Tree height of all sentences 0.27 145 0.069 143 0.755 33 0.309 108 0.515 50
21 Disco Entity Grid ra_NSTo_C ratio of ns transitions to total 0.426 73 0.033 187 0.516 87 0.266 117 0.505 56
21 Synta Part-of-Speech to_SuTag_C total # Subordinating Conjunction POS tags 0.4 83 0.193 56 0.691 48 0.406 79 0.299 106
20 LxSem Type Token Ratio UberTTR_S Uber Index 0.646 8 0.041 174 0.369 112 0.109 173 0.599 26
20 Synta Phrasal at_PrPhr_C # prepositional phrases per Word 0.57 16 0.133 103 0.316 124 0.323 105 0.366 92
20 AdSem Wiki Knowledge WTopc05_S Number of topics, 50 topics extracted from Wiki 0.549 22 0.033 186 0.514 89 0.533 50 0.042 203
20 AdSem Wiki Knowledge WTopc10_S Number of topics, 100 topics extracted from Wiki 0.52 33 0.004 229 0.532 79 0.552 43 0.075 180
20 Disco Entity Grid ra_SNTo_C ratio of sn transitions to total 0.448 58 0.019 210 0.514 88 0.196 133 0.518 49
20 LxSem Word Familiarity at_SbL1C_C SubtlexUS Lg10CD value per Word 0.408 78 0.161 75 0.541 78 0.204 130 0.392 86
20 Disco Entity Grid ra_XNTo_C ratio of xn transitions to total 0.339 119 0.103 124 0.658 59 0.327 100 0.29 108
20 Synta Phrasal to_AjPhr_C total # Adjective phrases 0.339 120 0.182 62 0.682 53 0.327 101 0.271 111
20 Synta Phrasal as_AvPhr_C # Adverb phrases per Sent 0.244 152 0.328 34 0.427 103 0.38 92 0.356 93
20 ShaTr Shallow TokSenS_S sqrt(total # tokens x total # sentence) 0.241 154 0.064 147 0.758 32 0.249 121 0.498 59
19 AdSem Wiki Knowledge WNois20_S Semantic Noise, 200 topics extracted from Wiki 0.492 41 0.032 190 0.566 75 0.572 37 0.025 221
19 Synta Phrasal ra_NoPrP_C ratio of Noun phrases # to Prep phrases # 0.477 43 0.149 83 0.34 120 0.345 97 0.389 87
19 LxSem Word Familiarity at_SbL1W_C SubtlexUS Lg10WF value per Word 0.399 84 0.089 129 0.531 81 0.24 123 0.412 80
19 LxSem Word Familiarity at_SbSBC_C SubtlexUS SUBTLCD value per Word 0.37 94 0.032 192 0.492 93 0.324 103 0.435 71
19 LxSem Word Familiarity at_SbCDC_C SubtlexUS CD# value per Word 0.37 95 0.032 191 0.492 94 0.324 104 0.435 70
19 Synta Part-of-Speech as_SuTag_C # Subordinating Conjunction POS tags per Sent 0.366 97 0.295 39 0.407 105 0.427 68 0.151 131
19 LxSem Word Familiarity at_SbCDL_C SubtlexUS CDlow value per Word 0.362 102 0.047 166 0.474 96 0.31 107 0.431 74
18 AdSem Wiki Knowledge WRich15_S Semantic Richness, 150 topics extracted from Wiki 0.495 39 0.02 208 0.48 95 0.549 45 0.037 209
18 AdSem Wiki Knowledge WRich20_S Semantic Richness, 200 topics extracted from Wiki 0.465 50 0.029 195 0.446 102 0.556 41 0.027 219
18 AdSem Wiki Knowledge WNois05_S Semantic Noise, 50 topics extracted from Wiki 0.462 53 0.061 150 0.455 100 0.412 75 0.118 151
18 Synta Phrasal ra_PrNoP_C ratio of Prep phrases # to Noun phrases # 0.421 75 0.162 74 0.276 135 0.344 98 0.37 90
18 Disco Entity Grid ra_NXTo_C ratio of nx transitions to total 0.339 118 0.097 127 0.62 66 0.28 116 0.278 110
18 ShaTr Shallow TokSenL_S log(total # tokens)/log(total # sentence) 0.293 137 0.352 29 0.297 130 0.544 46 0.198 121
18 ShaTr Shallow TokSenM_S total # tokens x total # sentence 0.189 173 0.112 116 0.674 55 0.177 140 0.486 60
17 Synta Phrasal as_AjPhr_C # Adjective phrases per Sent 0.323 128 0.239 46 0.387 106 0.357 95 0.157 127
17 Disco Entity Density at_UEnti_C number of unique Entities per Word 0.127 197 0.307 37 0.548 77 0.253 119 0.124 149
16 Synta Phrasal ra_VePrP_C ratio of Verb phrases # to Prep phrases # 0.475 46 0.018 211 0.301 127 0.255 118 0.249 114
16 AdSem Wiki Knowledge WNois15_S Semantic Noise, 150 topics extracted from Wiki 0.388 88 0.033 188 0.454 101 0.454 63 0.006 226
16 LxSem Psycholinguistic at_AABrL_C lemmas AoA of lemmas, Bristol norm per Word 0.333 124 0.029 194 0.462 98 0.284 113 0.217 118
16 LxSem Psycholinguistic at_AACoL_C AoA of lemmas, Cortese and Khanna norm per Word 0.333 125 0.029 193 0.462 99 0.284 114 0.217 119
16 Disco Entity Density at_EntiM_C number of Entities Mentions #s per Word 0.17 180 0.204 49 0.501 92 0.292 112 0.127 146
16 AdSem WB Knowledge BClar15_S Semantic Clarity, 150 topics extracted from WeeBit 0.025 221 0.161 76 0.38 108 0.481 57 0.315 100
15 LxSem Type Token Ratio BiLoTTR_S Bi-Logarithmic TTR 0.591 13 0.062 149 0.07 200 0.001 229 0.523 47
15 AdSem Wiki Knowledge WRich05_S Semantic Richness, 50 topics extracted from Wiki 0.405 80 0.063 148 0.347 117 0.301 110 0.035 211
15 LxSem Type Token Ratio SimpTTR_S TTR 0.375 92 0.025 200 0.367 113 0.163 147 0.344 94
15 AdSem Wiki Knowledge WRich10_S Semantic Richness, 100 topics extracted from Wiki 0.364 100 0.002 232 0.33 123 0.411 76 0.041 206
15 AdSem Wiki Knowledge WNois10_S Semantic Noise, 100 topics extracted from Wiki 0.34 117 0.021 207 0.376 109 0.426 69 0.03 216
15 Disco Entity Density to_UEnti_C total number of unique Entities 0.308 134 0.132 105 0.3 128 0.023 216 0.31 102
15 AdSem WB Knowledge BClar20_S Semantic Clarity, 200 topics extracted from WeeBit 0.004 227 0.147 88 0.3 129 0.462 60 0.308 104
14 Disco Entity Grid ra_NOTo_C ratio of no transitions to total 0.348 108 0.022 204 0.383 107 0.056 200 0.378 89
14 Synta Phrasal ra_SuVeP_C ratio of Subordinate Clauses # to Verb phrases # 0.345 113 0.052 160 0.343 118 0.376 93 0.083 172
13 Synta Phrasal ra_PrVeP_C ratio of Prep phrases # to Verb phrases # 0.485 42 0.055 157 0.184 158 0.189 136 0.219 117
13 LxSem Variation Ratio SimpAjV_S Adjective Variation-1 0.398 85 0.109 118 0.279 134 0.073 192 0.201 120
13 Synta Phrasal ra_VeSuP_C ratio of Verb phrases # to Subordinate Clauses # 0.349 106 0.137 98 0.307 126 0.127 167 0.043 200
13 Synta Part-of-Speech at_NoTag_C # Noun POS tags per Word 0.347 112 0.104 122 0.295 131 0.148 154 0.107 159
13 Disco Entity Grid ra_ONTo_C ratio of on transitions to total 0.333 123 0.04 178 0.288 133 0.06 199 0.383 88
13 Synta Phrasal at_SuPhr_C # Subordinate Clauses per Word 0.204 165 0.157 78 0.246 140 0.314 106 0.073 182
13 LxSem Psycholinguistic at_AABiL_C lemmas AoA of lemmas, Bird norm per Word 0.203 166 0.11 117 0.266 138 0.053 202 0.302 105
13 AdSem WB Knowledge BTopc10_S Number of topics, 100 topics extracted from WeeBit 0.197 169 0.038 179 0.364 114 0.166 145 0.323 99
13 AdSem WB Knowledge BNois10_S Semantic Noise, 100 topics extracted from WeeBit 0.193 171 0.036 180 0.37 110 0.161 149 0.33 97
13 AdSem WB Knowledge BNois05_S Semantic Noise, 50 topics extracted from WeeBit 0.158 184 0.011 219 0.351 116 0.15 153 0.325 98
13 AdSem WB Knowledge BTopc15_S Number of topics, 150 topics extracted from WeeBit 0.133 193 0.146 92 0.209 151 0.416 73 0.03 217
Table 14: Part B. The full generalizability ranking of handcrafted linguistic features under Approach B.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
12 LxSem Variation Ratio SimpNoV_S Noun Variation-1 0.499 38 0.087 130 0.038 212 0.031 213 0.337 95
12 Synta Part-of-Speech ra_NoVeT_C ratio of Noun POS # to Verb POS # 0.432 66 0.118 111 0.149 168 0.112 170 0.051 197
12 Synta Phrasal ra_AvPrP_C ratio of Adv phrases # to Prep phrases # 0.4 82 0.014 217 0.222 147 0.196 135 0.115 152
12 Synta Part-of-Speech ra_VeNoT_C ratio of Verb POS # to Noun POS # 0.397 86 0.198 53 0.234 142 0.171 142 0.067 186
12 Synta Part-of-Speech ra_SuVeT_C ratio of Subordinating Conjunction POS # to Verb POS # 0.36 103 0.053 159 0.109 177 0.282 115 0.137 139
12 Disco Entity Density as_UEnti_C number of unique Entities per Sent 0.337 121 0.114 113 0.273 136 0.066 196 0.157 128
12 Synta Part-of-Speech at_AjTag_C # Adjective POS tags per Word 0.334 122 0.117 112 0.216 149 0.197 132 0.037 210
12 Synta Phrasal ra_SuAvP_C ratio of Subordinate Clauses # to Adv phrases # 0.309 133 0.008 226 0.141 170 0.241 122 0.111 153
12 Synta Part-of-Speech at_FuncW_C # Function words per Word 0.28 142 0.04 175 0.181 159 0.461 62 0.032 215
12 AdSem WB Knowledge BTopc20_S Number of topics, 200 topics extracted from WeeBit 0.25 150 0.135 99 0.025 215 0.418 71 0.044 198
12 AdSem Wiki Knowledge WClar05_S Semantic Clarity, 50 topics extracted from Wiki 0.212 164 0.014 218 0.214 150 0.235 124 0.102 161
12 AdSem WB Knowledge BRich10_S Semantic Richness, 100 topics extracted from WeeBit 0.196 170 0.044 171 0.369 111 0.035 210 0.336 96
12 AdSem WB Knowledge BClar05_S Semantic Clarity, 50 topics extracted from WeeBit 0.14 190 0.041 173 0.339 121 0.164 146 0.289 109
12 Disco Entity Density to_EntiM_C total number of Entities Mentions 0.139 191 0.02 209 0.335 122 0.0 230 0.312 101
11 Synta Phrasal ra_VeNoP_C ratio of Verb phrases # to Noun phrases # 0.46 54 0.164 70 0.124 174 0.041 209 0.027 220
11 Synta Part-of-Speech at_VeTag_C # Verb POS tags per Word 0.431 69 0.187 61 0.076 196 0.111 171 0.011 224
11 Synta Part-of-Speech ra_AjVeT_C ratio of Adjective POS # to Verb POS # 0.411 77 0.034 185 0.133 172 0.156 150 0.005 227
11 Synta Part-of-Speech ra_SuAvT_C ratio of Subordinating Conjunction POS # to Adverb POS # 0.314 131 0.021 206 0.106 178 0.148 156 0.18 124
11 LxSem Variation Ratio SimpVeV_S Verb Variation-1 0.286 139 0.048 165 0.081 193 0.003 226 0.48 62
11 Synta Part-of-Speech ra_CoFuW_C ratio of Content words to Function words 0.284 141 0.023 203 0.2 154 0.376 94 0.042 201
11 Synta Part-of-Speech at_SuTag_C # Subordinating Conjunction POS tags per Word 0.259 148 0.13 106 0.085 192 0.252 120 0.135 141
11 AdSem Wiki Knowledge WClar20_S Semantic Clarity, 200 topics extracted from Wikipedia 0.144 187 0.016 212 0.308 125 0.23 125 0.034 213
11 AdSem WB Knowledge BTopc05_S Number of topics, 50 topics extracted from WeeBit 0.139 192 0.009 224 0.291 132 0.144 160 0.222 116
11 AdSem WB Knowledge BRich05_S Semantic Richness, 50 topics extracted from WeeBit 0.126 198 0.051 162 0.24 141 0.051 207 0.309 103
11 Synta Phrasal ra_SuNoP_C ratio of Subordinate Clauses # to Noun phrases # 0.081 203 0.163 72 0.224 146 0.307 109 0.086 170
11 AdSem WB Knowledge BNois15_S Semantic Noise, 150 topics extracted from WeeBit 0.035 214 0.162 73 0.341 119 0.221 127 0.091 164
10 Synta Part-of-Speech ra_CoVeT_C ratio of Coordinating Conjunction POS # to Verb POS # 0.302 135 0.133 104 0.023 218 0.133 164 0.088 168
10 Synta Phrasal ra_AvSuP_C ratio of Adv phrases # to Subordinate Clauses # 0.299 136 0.06 151 0.256 139 0.128 165 0.077 176
10 Synta Tree Structure at_FTree_C length of flattened Trees per Word 0.28 143 0.14 96 0.097 182 0.1 177 0.152 130
10 Disco Entity Density as_EntiM_C number of Entities Mentions #s per Sent 0.242 153 0.015 215 0.219 148 0.051 206 0.168 125
10 Disco Entity Grid LoCoDPW_S Local Coherence distance for PW score 0.239 155 0.002 230 0.195 156 0.143 161 0.141 136
10 Disco Entity Grid LoCoDPA_S Local Coherence distance for PA score 0.239 156 0.002 231 0.195 157 0.143 162 0.141 135
10 Synta Part-of-Speech at_CoTag_C # Coordinating Conjunction POS tags per Word 0.218 161 0.267 44 0.02 220 0.111 172 0.087 169
10 LxSem Variation Ratio SimpAvV_S AdVerb Variation-1 0.214 163 0.098 126 0.353 115 0.021 221 0.089 166
10 Synta Phrasal ra_AjPrP_C ratio of Adj phrases # to Prep phrases # 0.201 168 0.036 181 0.155 164 0.095 178 0.252 113
10 AdSem WB Knowledge BNois20_S Semantic Noise, 200 topics extracted from WeeBit 0.129 196 0.148 85 0.202 153 0.167 144 0.032 214
10 AdSem WB Knowledge BRich20_S Semantic Richness, 200 topics extracted from WeeBit 0.047 211 0.104 121 0.112 176 0.221 126 0.143 134
9 Synta Phrasal at_NoPhr_C # Noun phrases per Word 0.424 74 0.066 146 0.089 188 0.005 224 0.042 202
9 Synta Phrasal ra_NoVeP_C ratio of Noun phrases # to Verb phrases # 0.406 79 0.068 145 0.069 201 0.031 212 0.019 223
9 Synta Phrasal ra_PrAvP_C ratio of Prep phrases # to Adv phrases # 0.32 130 0.027 196 0.021 219 0.176 141 0.071 183
9 Synta Phrasal at_VePhr_C # Verb phrases per Word 0.31 132 0.138 97 0.079 194 0.032 211 0.009 225
9 Synta Part-of-Speech ra_CoAvT_C ratio of Coordinating Conjunction POS # to Adverb POS # 0.284 140 0.04 176 0.16 162 0.079 189 0.119 150
9 Synta Part-of-Speech ra_NoAvT_C ratio of Noun POS # to Adverb POS # 0.261 147 0.133 102 0.101 180 0.052 204 0.034 212
9 Disco Entity Grid LoCohPW_S Local Coherence for PW score 0.229 159 0.034 183 0.012 227 0.146 157 0.148 133
9 Disco Entity Grid LoCohPA_S Local Coherence for PA score 0.229 160 0.034 184 0.012 226 0.146 158 0.148 132
9 Synta Phrasal ra_SuPrP_C ratio of Subordinate Clauses # to Prep phrases # 0.218 162 0.048 164 0.015 224 0.07 194 0.227 115
9 Synta Part-of-Speech ra_VeAjT_C ratio of Verb POS # to Adjective POS # 0.177 177 0.059 153 0.203 152 0.162 148 0.042 204
9 Synta Part-of-Speech at_ContW_C # Content words per Word 0.161 183 0.057 156 0.23 143 0.183 139 0.055 193
9 Synta Phrasal ra_NoSuP_C ratio of Noun phrases # to Subordinate Clauses # 0.157 185 0.153 80 0.228 145 0.052 205 0.04 207
9 Synta Phrasal ra_PrAjP_C ratio of Prep phrases # to Adj phrases # 0.142 189 0.035 182 0.017 223 0.207 128 0.136 140
9 Synta Part-of-Speech ra_NoAjT_C ratio of Noun POS # to Adjective POS # 0.121 199 0.152 81 0.125 173 0.114 169 0.004 228
9 AdSem WB Knowledge BClar10_S Semantic Clarity, 100 topics extracted from WeeBit 0.079 204 0.015 216 0.269 137 0.148 155 0.181 123
9 Synta Part-of-Speech ra_CoNoT_C ratio of Coordinating Conjunction POS # to Noun POS # 0.02 224 0.277 42 0.159 163 0.013 222 0.132 142
8 Synta Part-of-Speech ra_AjAvT_C ratio of Adjective POS # to Adverb POS # 0.357 104 0.042 172 0.056 204 0.091 180 0.044 199
8 Synta Part-of-Speech ra_SuCoT_C ratio of Subordinating Conj POS # to Coordinating Conj # 0.274 144 0.054 158 0.019 222 0.143 163 0.077 179
8 Synta Part-of-Speech ra_VeSuT_C ratio of Verb POS # to Subordinating Conjunction # 0.266 146 0.046 169 0.09 186 0.105 175 0.065 188
8 Synta Phrasal ra_AvNoP_C ratio of Adv phrases # to Noun phrases # 0.257 149 0.128 108 0.072 199 0.044 208 0.051 196
8 Synta Part-of-Speech ra_SuAjT_C ratio of Subordinating Conjunction POS # to Adjective POS # 0.244 151 0.008 225 0.074 197 0.082 187 0.138 137
8 Synta Phrasal ra_NoAvP_C ratio of Noun phrases # to Adv phrases # 0.235 157 0.102 125 0.09 187 0.082 188 0.071 185
8 Synta Phrasal ra_AjAvP_C ratio of Adj phrases # to Adv phrases # 0.232 158 0.016 213 0.046 209 0.094 179 0.156 129
8 Synta Part-of-Speech ra_AvSuT_C ratio of Adverb POS # to Subordinating Conjunction # 0.202 167 0.024 201 0.003 230 0.114 168 0.067 187
8 Synta Part-of-Speech ra_VeCoT_C ratio of Verb POS # to Coordinating Conjunction # 0.192 172 0.172 64 0.134 171 0.022 218 0.054 194
8 LxSem Word Familiarity at_SbFrQ_C SubtlexUS FREQ# value per Word 0.181 174 0.196 55 0.095 183 0.021 219 0.109 155
8 LxSem Word Familiarity at_SbSBW_C SubtlexUS SUBTLWF value per Word 0.181 175 0.196 54 0.095 184 0.021 220 0.109 156
8 AdSem Wiki Knowledge WClar10_S Semantic Clarity, 100 topics extracted from Wiki 0.178 176 0.01 223 0.153 167 0.171 143 0.084 171
8 AdSem Wiki Knowledge WClar15_S Semantic Clarity, 150 topics extracted from Wiki 0.165 182 0.011 221 0.161 161 0.185 138 0.074 181
8 Disco Entity Grid LoCohPU_S Local Coherence for PU score 0.129 195 0.023 202 0.103 179 0.084 184 0.13 144
8 Synta Part-of-Speech ra_VeAvT_C ratio of Verb POS # to Adverb POS # 0.108 200 0.078 136 0.229 144 0.025 215 0.079 174
8 Synta Part-of-Speech ra_SuNoT_C ratio of Subordinating Conjunction POS # to Noun POS # 0.085 202 0.149 82 0.039 211 0.155 151 0.158 126
8 AdSem WB Knowledge BRich15_S Semantic Richness, 150 topics extracted from WeeBit 0.025 220 0.059 152 0.154 166 0.145 159 0.1 162
8 Synta Part-of-Speech ra_NoCoT_C ratio of Noun POS # to Coordinating Conjunction # 0.022 222 0.254 45 0.019 221 0.053 201 0.109 157
8 LxSem Type Token Ratio MTLDTTR_S Measure of Textual Lexical Diversity (default TTR = 0.72) 0.0 230 0.103 123 0.119 175 0.151 152 0.0 231
Table 15: Part C. The full generalizability ranking of handcrafted linguistic features under Approach B.
Feature CCB WBT CAM CKC OSE
Score Branch Subgroup LingFeat Code Brief Explanation r rk r rk r rk r rk r rk
7 LxSem Word Familiarity at_SbFrL_C SubtlexUS FREQlow value per Word 0.176 178 0.171 65 0.061 203 0.001 228 0.09 165
7 Synta Part-of-Speech ra_AvNoT_C ratio of Adverb POS # to Noun POS # 0.171 179 0.108 119 0.076 195 0.084 185 0.023 222
7 Disco Entity Grid LoCoDPU_S Local Coherence distance for PU score 0.154 186 0.032 189 0.086 191 0.087 182 0.111 154
7 Synta Phrasal at_AvPhr_C # Adverb phrases per Word 0.144 188 0.113 115 0.047 208 0.029 214 0.058 191
7 Synta Phrasal ra_AjSuP_C ratio of Adj phrases # to Subordinate Clauses # 0.133 194 0.04 177 0.195 155 0.001 227 0.079 173
7 Synta Phrasal ra_AjVeP_C ratio of Adj phrases # to Verb phrases # 0.104 201 0.01 222 0.055 205 0.083 186 0.124 148
7 Synta Part-of-Speech ra_CoAjT_C ratio of Coordinating Conjunction POS # to Adjective POS # 0.068 206 0.051 161 0.176 160 0.074 191 0.104 160
7 Synta Part-of-Speech ra_AvCoT_C ratio of Adverb POS # to Coordinating Conjunction # 0.029 216 0.119 110 0.024 216 0.022 217 0.107 158
7 Synta Part-of-Speech ra_AjSuT_C ratio of Adjective POS # to Subordinating Conjunction # 0.025 219 0.001 233 0.024 217 0.204 131 0.057 192
7 Synta Phrasal ra_SuAjP_C ratio of Subordinate Clauses # to Adj phrases # 0.02 223 0.022 205 0.05 206 0.204 129 0.029 218
7 Synta Phrasal ra_PrSuP_C ratio of Prep phrases # to Subordinate Clauses # 0.002 228 0.076 139 0.143 169 0.07 193 0.13 143
6 Synta Part-of-Speech ra_AvVeT_C ratio of Adverb POS # to Verb POS # 0.168 181 0.011 220 0.097 181 0.053 203 0.053 195
6 Synta Part-of-Speech ra_AjNoT_C ratio of Adjective POS # to Noun POS # 0.074 205 0.146 89 0.031 213 0.068 195 0.041 205
6 Synta Phrasal ra_VeAjP_C ratio of Verb phrases # to Adj phrases # 0.067 207 0.072 142 0.087 190 0.104 176 0.064 189
6 Synta Part-of-Speech ra_AvAjT_C ratio of Adverb POS # to Adjective POS # 0.061 208 0.049 163 0.088 189 0.107 174 0.039 208
6 Synta Phrasal ra_NoAjP_C ratio of Noun phrases # to Adj phrases # 0.05 209 0.084 132 0.073 198 0.128 166 0.062 190
6 Synta Part-of-Speech ra_NoSuT_C ratio of Noun POS # to Subordinating Conjunction # 0.049 210 0.075 140 0.004 229 0.186 137 0.077 178
6 Synta Phrasal ra_VeAvP_C ratio of Verb phrases # to Adv phrases # 0.039 213 0.084 133 0.155 165 0.065 198 0.097 163
6 Synta Part-of-Speech ra_CoSuT_C ratio of Coordinating Conj POS # to Subordinating Conj # 0.03 215 0.076 137 0.044 210 0.196 134 0.001 229
6 Synta Phrasal at_AjPhr_C # Adjective phrases per Word 0.027 218 0.046 167 0.029 214 0.076 190 0.126 147
6 Synta Phrasal ra_AjNoP_C ratio of Adj phrases # to Noun phrases # 0.01 226 0.046 168 0.013 225 0.066 197 0.127 145
6 Synta Part-of-Speech ra_AjCoT_C ratio of Adjective POS # to Coordinating Conjunction # 0.0 229 0.148 86 0.049 207 0.091 181 0.077 177
5 Synta Phrasal ra_AvAjP_C ratio of Adv phrases # to Adj phrases # 0.044 212 0.044 170 0.066 202 0.086 183 0.088 167
5 Synta Part-of-Speech at_AvTag_C # Adverb POS tags per Word 0.029 217 0.072 141 0.095 185 0.011 223 0.078 175
5 Synta Phrasal ra_AvVeP_C ratio of Adv phrases # to Verb phrases # 0.02 225 0.068 144 0.005 228 0.003 225 0.071 184
5 Disco Entity Grid ra_XXTo_C ratio of xx transitions to total 0.0 231 0.025 198 0.0 231 0.0 231 0.0 230
5 Disco Entity Grid ra_XSTo_C ratio of xs transitions to total 0.0 232 0.025 197 0.0 232 0.0 232 0.0 232
5 Disco Entity Grid ra_SSTo_C ratio of ss transitions to total 0.0 233 0.025 199 0.0 233 0.0 233 0.0 233
Table 16: Part D. The full generalizability ranking of handcrafted linguistic features under Approach B.