Traditional Readability Formulas Compared for English
Abstract
Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications. This phenomenon is presumably due to the convenience and straightforwardness of readability formulas. In this work, we contribute to the NLP community by 1. introducing New English Readability Formula (NERF), 2. recalibrating the coefficients of “old” readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluating the readability formulas, for use in text simplification studies and medical texts, and 4. developing a Python-based program for the wide application to various NLP projects.
1 Introduction
Readability Assessment (RA) quantitatively measures the ease of understanding or comprehension of any written text (Feng et al., 2010; Klare, 2000). Understanding text readability, or difficulty, is essential for research on any originated, studied, or shared ideas (Collins-Thompson, 2014). This inherent property makes RA closely applicable to areas such as healthcare (Wu et al., 2013), education (Dennis, 2018), communication (Zhou et al., 2017), and Natural Language Processing (NLP), for example text simplification (Aluisio et al., 2010).
Machine learning (ML) or transformer-based methods have been reasonably successful in RA. The RoBERTa-RF-T1 model by Lee et al. (2021) reports high classification accuracy on the OneStopEnglish dataset (Vajjala and Lučić, 2018), and the BERT-based ReadNet model from Meng et al. (2020) reports high accuracy on the WeeBit dataset (Vajjala and Meurers, 2012). However, “traditional readability formulas” still seem to be actively used throughout the research published in popular NLP venues like ACL or EMNLP (Uchendu et al., 2020; Shardlow and Nawaz, 2019; Scarton and Specia, 2018; Schwartz et al., 2017; Xu et al., 2016). The tendency to opt for traditional readability formulas is likely due to their convenience and straightforwardness.
In this work, we hope to assist the NLP community by recalibrating five traditional readability formulas, which were originally developed on 20th-century military or technical documents. The formulas are adjusted for the modern, standard U.S. education curriculum. We use the Appendix B (Text Exemplars and Sample Performance Tasks) dataset provided by the U.S. Common Core State Standards (corestandards.org). Then, we evaluate the performance and applications of these formulas. Lastly, we develop a Python-based program for convenient application of the recalibrated versions.
But traditional readability formulas lack wide linguistic coverage (Feng et al., 2010). Therefore, we create a new formula that is mainly motivated by lexico-semantic and syntactic linguistic branches, as identified by Collins-Thompson (2014). From each, we search for the representative features. The resulting formula is named the New English Readability Formula, or simply NERF, and it aims to give the most generally and commonly accepted approach to calculating English readability.
To sum up, we make the contributions below. The related public resources are in appendix A.
1. We recalibrate five traditional readability formulas to show higher prediction accuracy on modern texts in the U.S. curriculum.
2. We develop NERF, a generalized and easy-to-use readability assessment formula.
3. We evaluate and cross-compare six readability formulas on several datasets. These datasets are carefully selected to collectively represent the diverse audiences, education curricula, and reading levels.
4. We develop <Anonymous>, a fast open-source readability assessment software based on Python.
2 Related Work
The earliest attempt to "calculate" text readability was by Lively and Pressey (1923), in response to their practical problem of selecting science textbooks for high school students (DuBay, 2004). In the following years, many well-known readability formulas were developed, including Flesch-Kincaid Grade Level (Kincaid et al., 1975), Gunning Fog Count (or Index) (Gunning et al., 1952), SMOG Index (McLaughlin, 1969), Coleman-Liau Index (Coleman and Liau, 1975), and Automated Readability Index (Smith and Senter, 1967).
These formulas are mostly linear models with two or three variables, largely based on superficial properties concerning words or sentences (Feng et al., 2010). Hence, they can easily be combined with other systems without the burden of a large trained model (Xu et al., 2016). This property also proved helpful in research fields outside computational linguistics, with some applications directly related to public medical knowledge, such as measuring the difficulty of patient materials (Gaeta et al., 2021; van Ballegooie and Hoang, 2021; Bange et al., 2019; Haller et al., 2019; Hansberry et al., 2018; Kiwanuka et al., 2017).
3 Datasets
3.1 Common Core - Appendix B (CCB)
We use the CCB corpus to calibrate formulas. The article excerpts included in CCB are divided into the categories of story, poetry, informational text, and drama. To simplify our approach, we limit our research to story-type texts. This left us with only 69 items to train with. But those come directly from the U.S. Common Core Standards. Hence, we assume with confidence that the item classification is generally agreeable in the U.S.
Properties | CCB | WBT | CAM | CKC | OSE | NSL |
---|---|---|---|---|---|---|
audience | Ntve | Ntve | ESL | ESL | ESL | Ntve |
grade | K1-12 | K2-10 | A2-C2 | S7-12 | N/A | N/A |
curriculum? | Yes | No | Yes | Yes | No | No |
balanced? | No | Yes | Yes | No | Yes | No |
#class | 6 | 5 | 5 | 6 | 3 | 5 |
#item/class | 11.5 | 625 | 60.0 | 554 | 189 | 2125 |
#word/item | 362 | 213 | 508 | 117 | 669 | 752 |
#sent/item | 25.8 | 17.0 | 28.4 | 54.0 | 35.6 | 50.9 |
CCB is the only dataset that we use in the calibration of our formulas. All datasets below are used mainly for feature selection.
3.2 WeeBit (WBT)
WBT, the largest native dataset available in RA, contains articles targeted at readers of different age groups from the Weekly Reader magazine and the BBC-Bitesize website. In table 1, we translate those age groups into U.S. schools’ K-* format. We downsample to 625 items per class, as per common practice.
3.3 Cambridge English (CAM)
3.4 Corpus of the Korean ELT (English Lang. Train.) Curriculum (CKC)
3.5 OneStopEnglish (OSE)
OSE is a recently developed dataset in RA. It aims at ESL (English as Second Language) learners and consists of three paraphrased versions of an article from The Guardian Newspaper. Along with the original OSE dataset, we created a paired version (OSE-Pair). This variation has 189 items and each item has advanced-intermediate-elementary pairs.
In addition, OSE-Sent is a sentence-paired version of OSE. The dataset consists of three parts: adv-ele (1674 pairs), adv-int (2166), int-ele (2154).
3.6 Newsela (NSL)
NSL (Xu et al., 2015) is a dataset particularly developed for text simplification studies. The dataset consists of 1,130 articles, with each item re-written 4 times for children at different grade levels. We create a paired version (NSL-Pair) (2125 pairs).
3.7 ASSET
ASSET (Alva-Manchego et al., 2020) is a paired sentence dataset. The dataset consists of 360 sentences, with each item simplified 10 times.
4 Recalibration
4.1 Choosing Traditional Read. Formulas
We start by recalibrating five readability formulas. We considered Zhou et al. (2017) and the number of Google Scholar citations to sort out the most popular traditional readability formulas. Further, to make a fair performance comparison with our adjusted variations, we choose formulas that were originally intended to output U.S. school grades but are based on 20th-century texts and test subjects.
Flesch-Kincaid Grade Level (FKGL) was primarily developed for U.S. Navy personnel. The readability level of 18 passages from Navy technical training manuals was calculated. For a text item to be classified at a specific reading level, a set share of subjects with reading abilities at that level had to reach a threshold score on a cloze test. Responses from 531 Navy personnel were used.
$$\mathrm{FKGL} = a \cdot \frac{\#\mathrm{word}}{\#\mathrm{sent}} + b \cdot \frac{\#\mathrm{syllable}}{\#\mathrm{word}} + c$$
where sent is sentence, and # refers to "count of."
The genius of Gunning Fog Index (FOGI) is the idea that word difficulty highly correlates with the number of syllables. Such a conclusion was deduced upon the inspection of Dale’s list of easy words (Zhou et al., 2017; Dale and Chall, 1948). However, the shortcoming of FOGI is the over-generalization that "all" words with more than two syllables are difficult. Indeed, "banana" is quite an easy word.
Simple Measure of Gobbledygook (SMOG) Index, known for its simplicity, resembles FOGI in that both use the number of syllables to classify a word’s difficulty. But SMOG sets its criterion a little higher, at more than three syllables per word. Additionally, SMOG incorporates a square root approach instead of a linear regression model.
Coleman-Liau Index (COLE) is the least used of the five. But we could find multiple studies outside computational linguistics that still partly depend on COLE (Kue et al., 2021; Szmuda et al., 2020; Joseph et al., 2020; Powell et al., 2020). The novelty of COLE is that it calculates readability without counting syllables, which was viewed as a time-consuming approach.
Automated Readability Index (AUTO) was developed for the U.S. Air Force to handle documents more technical than textbooks. Like COLE, AUTO relies on the number of letters per word, instead of the more commonly used syllables per word. Another quirk is that non-integer scores are all rounded up.
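The five formulas can be sketched in code with their coefficients exposed as a, b, and c. This is a minimal sketch, not the released software: the FKGL form follows the definition above, while the placement of a, b, and c in the other four functions is our assumption, chosen so that the original coefficients in Table 2-a reproduce the standard published forms. The function and argument names are hypothetical, and the final rounding step of AUTO is omitted.

```python
import math

# Original coefficients (a, b, c), as listed in Table 2-a.
ORIGINAL = {
    "FKGL": (0.39, 11.8, -15.59),
    "FOGI": (0.4, 100.0, 0.0),
    "SMOG": (1.043, 30.0, 3.129),
    "COLE": (0.0588, -0.296, -15.8),
    "AUTO": (4.71, 0.5, -21.43),
}

def fkgl(n_word, n_sent, n_syllable, a, b, c):
    # Words per sentence and syllables per word.
    return a * n_word / n_sent + b * n_syllable / n_word + c

def fogi(n_word, n_sent, n_difficult, a, b, c):
    # "Difficult" words are those with more than two syllables.
    return a * (n_word / n_sent + b * n_difficult / n_word) + c

def smog(n_sent, n_polysyllable, a, b, c):
    # Square-root form instead of a purely linear one.
    return a * math.sqrt(b * n_polysyllable / n_sent) + c

def cole(n_word, n_sent, n_letter, a, b, c):
    # Letters and sentences are both counted per 100 words.
    return a * (100 * n_letter / n_word) + b * (100 * n_sent / n_word) + c

def auto(n_word, n_sent, n_letter, a, b, c):
    # Letters per word and words per sentence.
    return a * n_letter / n_word + b * n_word / n_sent + c

# Toy counts: 100 words, 5 sentences, 140 syllables.
print(round(fkgl(100, 5, 140, *ORIGINAL["FKGL"]), 2))  # 8.73
```

Passing a different (a, b, c) triple, such as an adjusted one, changes the output without touching the feature extraction.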
4.2 Recalibration & Performance
4.2.1 Traditional Formulas, Other Text Types
We only recalibrate formulas on the CCB dataset. As stated in section 3.1, we limit ourselves to CCB’s story-type items. In a preliminary investigation, we obtained low r2 scores, both before and after recalibration, between the traditional readability formulas and poetry, informational text, and drama.
4.2.2 Details on Recalibration
We started with a large feature extraction software, LingFeat (Lee et al., 2021), and expanded it to include more necessary features. From CCB texts, we extracted the surface-level features used in traditional readability formulas (e.g., counts of words, sentences, and syllables) and put them in a dataframe.
CCB has 6 readability classes, but they are given as ranges: K1, K2-3, K4-5, K6-8, K9-10, and K11-CCR (college and above). During calibration and evaluation, we approximated these classes as K1, K2.5, K4.5, K7, K9.5, and K12 to model the general trend of CCB.
Using the class estimations as true labels and the created dataframe as features, we ran an optimization function to calculate the best coefficients (a, b, c in §4.1). We used non-linear least squares in fitting functions (Virtanen et al., 2020). Additional details are available in appendix B.
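The fitting step can be illustrated with SciPy's `curve_fit`, which performs non-linear least squares. The sketch below is an assumption-laden illustration rather than the exact procedure used here: it fits only the FKGL-shaped model, starts from the original coefficients, and uses made-up toy feature values in place of the CCB dataframe. The names `fkgl_form` and `recalibrate` are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def fkgl_form(features, a, b, c):
    """FKGL-shaped model: a * (#word/#sent) + b * (#syllable/#word) + c."""
    words_per_sent, sylls_per_word = features
    return a * words_per_sent + b * sylls_per_word + c

def recalibrate(words_per_sent, sylls_per_word, grades, p0=(0.39, 11.8, -15.59)):
    """Fit (a, b, c) by least squares, starting from the original coefficients."""
    features = np.vstack([words_per_sent, sylls_per_word])
    coefs, _ = curve_fit(fkgl_form, features, np.asarray(grades, dtype=float), p0=p0)
    return tuple(float(v) for v in coefs)

# Hypothetical toy features (NOT taken from CCB), one text per class estimate.
wps = [6.0, 14.0, 9.0, 21.0, 12.0, 25.0]
spw = [1.20, 1.25, 1.45, 1.35, 1.60, 1.55]
grades = [1.0, 2.5, 4.5, 7.0, 9.5, 12.0]  # K1, K2.5, K4.5, K7, K9.5, K12

a, b, c = recalibrate(wps, spw, grades)
print(f"a={a:.4f}, b={b:.4f}, c={c:.4f}")
```

The same pattern applies to the other formulas by swapping in a different model function, which is where the non-linear solver matters (for example, the square root in SMOG).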
4.2.3 Coefficients & Performances
a) Coef.s | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|
original-a | 0.390 | 0.4000 | 1.043 | 0.05880 | 4.710 |
adjusted-a | 0.1014 | 0.1229 | 2.694 | 0.03993 | 6.000 |
original-b | 11.80 | 100.0 | 30.00 | -0.2960 | 0.5000 |
adjusted-b | 20.89 | 415.7 | 8.815 | -0.4976 | 0.1035 |
original-c | -15.59 | 0.0000 | 3.129 | -15.80 | -21.43 |
adjusted-c | -21.94 | 1.866 | 3.367 | -5.747 | -19.61 |
b) Perf. | FKGL | FOGI | SMOG | COLE | AUTO |
r2 score (original) | -0.03835 | -0.3905 | 0.1613 | 0.4341 | -0.5283 |
r2 score (adjusted) | 0.4423 | 0.4072 | 0.3192 | 0.4830 | 0.4263 |
Pearson r (original) | 0.5698 | 0.5757 | 0.5649 | 0.6800 | 0.5684 |
Pearson r (adjusted) | 0.6651 | 0.6381 | 0.5649 | 0.6949 | 0.6529 |
Table 2-a shows the original coefficients and the adjusted variations, rounded to match significant figures. The adjusted traditional readability formulas can be obtained by simply plugging these values into the formulas in section 4.1.
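As a worked illustration of plugging in the coefficients, consider a hypothetical text (not from any dataset here) with 15 words per sentence and 1.4 syllables per word, scored with the FKGL form a · #word/#sent + b · #syllable/#word + c and the FKGL column of Table 2-a:

```latex
\begin{align*}
\text{original: } & 0.390 \cdot 15 + 11.80 \cdot 1.4 - 15.59 = 5.85 + 16.52 - 15.59 = 6.78 \\
\text{adjusted: } & 0.1014 \cdot 15 + 20.89 \cdot 1.4 - 21.94 = 1.521 + 29.246 - 21.94 \approx 8.83
\end{align*}
```

In this example the adjusted coefficients put more weight on syllables per word and less on sentence length, so the same text is placed about two grades higher.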
5 The New English Readability Formula
5.1 Criteria
Considering the value of traditional readability formulas as essentially the generalized definition of readability for the non-experts (section 1), what really matters is the included features. The coefficients (or weights) can be recalibrated anytime to fit a specific use. Therefore, it is important to first identify handcrafted linguistic features that universally affect readability. Additionally, to ensure breadth and usability, we set the following guides:
1. We avoid surface-level features that lack linguistic value (Feng et al., 2010). They include .
2. We include at most one linguistic feature from each linguistic subgroup. We use the classifications from Lee et al. (2021) and Collins-Thompson (2014).
3. We stick to a simplistic linear equation format.
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | LxSem | Psycholinguistic | as_AAKuL_C | Kuperman Lemma AoA per Sent | 0.540 | 25 | 0.505 | 1 | 0.722 | 42 | 0.711 | 4 | 0.601 | 25 |
43 | LxSem | Psycholinguistic | as_AAKuW_C | Kuperman Word AoA per Sent | 0.537 | 28 | 0.503 | 2 | 0.722 | 43 | 0.711 | 6 | 0.602 | 24 |
40 | LxSem | Psycholinguistic | at_AAKuW_C | Kuperman Word AoA per Word | 0.703 | 5 | 0.308 | 36 | 0.784 | 20 | 0.643 | 21 | 0.455 | 66 |
40 | Synta | Tree Structure | as_TreeH_C | Tree Height per Sent | 0.550 | 21 | 0.341 | 30 | 0.686 | 51 | 0.699 | 9 | 0.541 | 44 |
40 | Synta | Part-of-Speech | as_ContW_C | # Content Words per Sent | 0.534 | 29 | 0.453 | 13 | 0.667 | 56 | 0.688 | 14 | 0.544 | 43 |
39 | LxSem | Psycholinguistic | at_AAKuL_C | Kuperman Lemma AoA per Word | 0.723 | 2 | 0.323 | 35 | 0.785 | 19 | 0.650 | 20 | 0.453 | 67 |
39 | Synta | Phrasal | as_NoPhr_C | # Noun Phrases per Sent | 0.550 | 20 | 0.406 | 25 | 0.660 | 58 | 0.673 | 18 | 0.582 | 35 |
39 | Synta | Phrasal | to_PrPhr_C | Total # Prepositional Phrases | 0.470 | 47 | 0.189 | 58 | 0.808 | 11 | 0.580 | 36 | 0.729 | 3 |
39 | Synta | Part-of-Speech | as_FuncW_C | # Function Words per Sent | 0.468 | 48 | 0.471 | 8 | 0.662 | 57 | 0.673 | 17 | 0.614 | 19 |
38 | LxSem | Psycholinguistic | to_AAKuL_C | Total Sum Kuperman Lemma AoA | 0.428 | 71 | 0.189 | 59 | 0.835 | 3 | 0.627 | 22 | 0.716 | 5 |
38 | LxSem | Psycholinguistic | to_AAKuW_C | Total Sum Kuperman Word AoA | 0.427 | 72 | 0.189 | 60 | 0.835 | 4 | 0.625 | 23 | 0.715 | 6 |
36 | Synta | Phrasal | as_PrPhr_C | # Prepositional Phrases per Sent | 0.513 | 35 | 0.417 | 23 | 0.607 | 70 | 0.608 | 28 | 0.590 | 34 |
36 | LxSem | Word Familiarity | as_SbL1C_C | SubtlexUS Lg10CD Value per Sent | 0.467 | 49 | 0.430 | 20 | 0.612 | 69 | 0.699 | 10 | 0.533 | 45 |
35 | LxSem | Type Token Ratio | CorrTTR_S | Corrected Type Token Ratio | 0.745 | 1 | 0.006 | 228 | 0.846 | 1 | 0.445 | 65 | 0.692 | 7 |
35 | LxSem | Word Familiarity | as_SbL1W_C | SubtlexUS Lg10WF Value per Sent | 0.462 | 52 | 0.437 | 19 | 0.605 | 71 | 0.693 | 12 | 0.523 | 48 |
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | LxSem | Psycholinguistic | as_AAKuL_C | Kuperman Lemma AoA per Sent | 0.540 | 25 | 0.505 | 1 | 0.722 | 42 | 0.711 | 4 | 0.601 | 25 |
35 | LxSem | Psycholinguistic | as_AAKuW_C | Kuperman Word AoA per Sent | 0.537 | 28 | 0.503 | 2 | 0.722 | 43 | 0.711 | 6 | 0.602 | 24 |
32 | LxSem | Psycholinguistic | at_AAKuL_C | Kuperman Lemma AoA per Word | 0.723 | 2 | 0.323 | 35 | 0.785 | 19 | 0.650 | 20 | 0.453 | 67 |
32 | LxSem | Psycholinguistic | at_AAKuW_C | Kuperman Word AoA per Word | 0.703 | 5 | 0.308 | 36 | 0.784 | 20 | 0.643 | 21 | 0.455 | 66 |
31 | Synta | Phrasal | as_NoPhr_C | # Noun Phrases per Sent | 0.550 | 20 | 0.406 | 25 | 0.660 | 58 | 0.673 | 18 | 0.582 | 35 |
31 | Synta | Part-of-Speech | as_ContW_C | # Content Words per Sent | 0.534 | 29 | 0.453 | 13 | 0.667 | 56 | 0.688 | 14 | 0.544 | 43 |
31 | Synta | Phrasal | as_PrPhr_C | # Prepositional Phrases per Sent | 0.513 | 35 | 0.417 | 23 | 0.607 | 70 | 0.608 | 28 | 0.590 | 34 |
31 | Synta | Part-of-Speech | as_FuncW_C | # Function Words per Sent | 0.468 | 48 | 0.471 | 8 | 0.662 | 57 | 0.673 | 17 | 0.614 | 19 |
31 | LxSem | Psycholinguistic | to_AAKuL_C | Total Sum Kuperman Lemma AoA | 0.428 | 71 | 0.189 | 59 | 0.835 | 3 | 0.627 | 22 | 0.716 | 5 |
31 | LxSem | Psycholinguistic | to_AAKuW_C | Total Sum Kuperman Word AoA | 0.427 | 72 | 0.189 | 60 | 0.835 | 4 | 0.625 | 23 | 0.715 | 6 |
30 | LxSem | Type Token Ratio | CorrTTR_S | Corrected Type Token Ratio | 0.745 | 1 | 0.006 | 228 | 0.846 | 1 | 0.445 | 65 | 0.692 | 7 |
30 | LxSem | Variation Ratio | CorrNoV_S | Corrected Noun Variation-1 | 0.717 | 3 | 0.0858 | 131 | 0.842 | 2 | 0.406 | 78 | 0.612 | 21 |
30 | Synta | Tree Structure | as_TreeH_C | Tree Height per Sent | 0.550 | 21 | 0.341 | 30 | 0.686 | 51 | 0.699 | 9 | 0.541 | 44 |
30 | Synta | Phrasal | to_PrPhr_C | Total # Prepositional Phrases | 0.470 | 47 | 0.189 | 58 | 0.808 | 11 | 0.580 | 36 | 0.729 | 3 |
30 | LxSem | Word Familiarity | as_SbL1C_C | SubtlexUS Lg10CD Value per Sent | 0.467 | 49 | 0.430 | 20 | 0.612 | 69 | 0.699 | 10 | 0.533 | 45 |
5.2 Feature Extraction & Ranking
We utilize LingFeat for feature extraction. It is public software that supports 255 handcrafted linguistic features in the branches of advanced semantic, discourse, syntactic, lexico-semantic, and shallow traditional. These are further classified into 14 subgroups. We study the linguistically meaningful branches: discourse (entity density, entity grid), syntax (phrasal, tree structure, part-of-speech), and lexico-semantics (variation ratio, type token ratio, psycholinguistics, word familiarity).
After extracting the features from CCB, WBT, CAM, CKC, and OSE, we first create a feature performance ranking by Pearson’s correlation. We used scikit-learn (Pedregosa et al., 2011). We take extra measures (Approach A & B) to model the features’ general performances across datasets. Each approach runs under differing premises:
Premise A: "Human experts’ dataset creation and labeling are partially faulty. The weak performance of a feature in a dataset does not necessarily indicate its weak performance in other data settings".
Premise B: "All datasets are perfect. The weak performance of a feature in a dataset indicates the feature’s weakness to be used universally."
After 78 hours of running, we decided not to extract features from NSL. Computing details are in appendix E. Among the features included in LingFeat, there are traditional readability formulas, like FKGL and COLE. These formulas performed generally well but a single killer feature, like type token ratio (TTR), often outperformed formulas. Traditional readability formulas and shallow traditional features are excluded from the rankings.
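The per-dataset ranking described in this section can be sketched as follows. This is a standard-library sketch under stated assumptions, not the pipeline itself: Pearson's r is computed by hand, features are ranked by the absolute value of r (an assumption on our part), and the feature names and values are placeholders.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(feature_table, labels):
    """Map each feature name to (rank, |r|) within one dataset; rank 1 is best."""
    strength = {name: abs(pearson_r(values, labels))
                for name, values in feature_table.items()}
    ordered = sorted(strength, key=strength.get, reverse=True)
    return {name: (rank, strength[name]) for rank, name in enumerate(ordered, start=1)}

# Placeholder feature values for four texts with readability labels 1..4.
table = {"feat_a": [2, 4, 6, 8], "feat_b": [1, 3, 2, 4]}
print(rank_features(table, [1, 2, 3, 4]))
```

Running this once per dataset yields the five rankings that Approach A and Approach B then aggregate.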
5.3 Approach A - Comparative Ranking
Under premise A, each dataset poses a different linguistic environment to feature performance. Further, premise A takes human error into consideration and agrees that data labeling is most likely inconsistent in some way. The literal correlation value itself is not too important under premise A.
Rather, we look for features that perform better than the others, under the same test settings. Thus, approach A’s rewarding system is rank-dependent. In a dataset, features that rank 1-10 are rewarded 10 points, rank 11-20 get 9 points, … and rank 91-100 get 1 point. Since there are five feature correlation rankings (one per dataset), the maximum score is 50. The results are in Table 3, in the order of score.
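The rank-dependent reward can be written as a small function. In this sketch we assume that ranks beyond 100 earn no points, which the description above does not state explicitly. The ranks in the example are hypothetical.

```python
def approach_a_points(rank):
    """Ranks 1-10 earn 10 points, 11-20 earn 9, ..., 91-100 earn 1.

    Assumption: ranks above 100 earn 0 points.
    """
    return max(0, 10 - (rank - 1) // 10)

def approach_a_score(ranks):
    """Sum the points over one rank per dataset (five datasets, so at most 50)."""
    return sum(approach_a_points(r) for r in ranks)

# Hypothetical ranks of one feature in five datasets.
print(approach_a_score([1, 15, 95, 200, 50]))  # 10 + 9 + 1 + 0 + 6 = 26
```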
5.4 Approach B - Absolute Correlation
Under premise B, the weak correlation of a feature in a dataset is solely due to the feature’s weakness to generalize. This is because all datasets are supposedly perfect. Hence, we only measure the feature’s absolute correlation across datasets.
Approach B’s rewarding system is correlation-dependent. In a dataset, features that show a correlation value between 0.9-1.0 are rewarded 10 points, a value between 0.8-0.89 gets 9 points, … and a value between 0.0-0.09 gets 1 point. Like approach A, the maximum score is 50. The result is in Table 4.
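The correlation-dependent reward can be sketched in the same way. We assume correlations are read at three decimals, as printed in the tables, and that the absolute value is used. With the CorrTTR_S and as_AAKuL_C rows of Table 4 as input, the sketch reproduces the listed scores of 30 and 35.

```python
def approach_b_points(r):
    """|r| in [0.9, 1.0] earns 10 points, [0.8, 0.9) earns 9, ..., [0.0, 0.1) earns 1."""
    tenths = round(abs(r) * 1000) // 100  # integer arithmetic avoids float edge cases
    return min(10, tenths + 1)

def approach_b_score(correlations):
    """Sum the points over one correlation per dataset (at most 50 for five datasets)."""
    return sum(approach_b_points(r) for r in correlations)

# CorrTTR_S correlations on CCB, WBT, CAM, CKC, OSE (Table 4).
print(approach_b_score([0.745, 0.006, 0.846, 0.445, 0.692]))  # 8 + 1 + 9 + 5 + 7 = 30
```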
5.5 Analysis & Manual Feature Selection
First and most noticeably, the top features under premises A and B are similar. In fact, the two results are almost replications of each other except for minor changes in order. We initially set two premises to introduce differing views (and hence differing results) into the feature rankings. We would then choose the features that perform well in both.
But there seems to be an inseparable correlation between ranking-based (premise A) and correlation-based (premise B) approaches. CorrNoV_S (Corrected Noun Variation) was the only new top feature introduced under premise B.
Second, discourse-based features (mostly entity-related) performed too poorly to be used in our final NERF. As an exception, ra_NNToT_C (noun-noun transitions : total) scored 28 under premise A and 26 under premise B. On the other hand, a majority of lexico-semantic and syntactic features performed well throughout. This strongly suggests that universally effective features for readability are most likely to be found in lexico-semantics or syntax.
Third, the difficulty of a document heavily depended on the difficulty of individual words. In detail, as_AAKuL_C, as_AAKuW_C, to_AAKuL_C, and to_AAKuW_C showed consistently high correlations across the five datasets. As shown in Section 3, these five datasets have different authors, target audiences, average lengths, labeling techniques, and numbers of classes. Each dataset had at least one of these features among the top 5 performances.
The four features come from age-of-acquisition research by Kuperman et al. (2012), which now proves to be an important resource for RA. Such direct classification of word difficulties always outperformed frequency-based approaches like SubtlexUS (Brysbaert and New, 2009). Back to feature selection, we follow the steps below.
1. From top to bottom, go through ranking (table 3 & 4) to sort out the features that performed the best in each linguistic subgroup.
2. Conduct step 1 on both rankings and compare the results to each other. Through this process, we keep only the features that appear in both rankings.
The steps above produce the same results for both approach A and B. The final selected features are as_AAKuL_C (psycholinguistic), as_TreeH_C (tree structure), as_ContW_C (part-of-speech), as_NoPhr_C (phrasal), as_SbL1C_C (word familiarity), and CorrTTR_S (type token ratio). CorrNoV_S (variation) only appeared under approach B, and we did not include it.
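The two selection steps can be sketched as below. The rankings are abbreviated versions of Tables 3 and 4 (top entries only, in table order), and the helper names are hypothetical.

```python
def best_per_subgroup(ranking):
    """ranking: (feature, subgroup) pairs ordered best first; keep the first per subgroup."""
    best = {}
    for feature, subgroup in ranking:
        best.setdefault(subgroup, feature)
    return best

def select_features(ranking_a, ranking_b):
    """Keep a subgroup's best feature only if both rankings agree on it."""
    best_a = best_per_subgroup(ranking_a)
    best_b = best_per_subgroup(ranking_b)
    return [f for subgroup, f in best_a.items() if best_b.get(subgroup) == f]

# Abbreviated rankings following the order of Table 3 (A) and Table 4 (B).
ranking_a = [
    ("as_AAKuL_C", "Psycholinguistic"), ("as_AAKuW_C", "Psycholinguistic"),
    ("as_TreeH_C", "Tree Structure"), ("as_ContW_C", "Part-of-Speech"),
    ("as_NoPhr_C", "Phrasal"), ("as_FuncW_C", "Part-of-Speech"),
    ("as_SbL1C_C", "Word Familiarity"), ("CorrTTR_S", "Type Token Ratio"),
]
ranking_b = [
    ("as_AAKuL_C", "Psycholinguistic"), ("as_NoPhr_C", "Phrasal"),
    ("as_ContW_C", "Part-of-Speech"), ("CorrTTR_S", "Type Token Ratio"),
    ("CorrNoV_S", "Variation Ratio"), ("as_TreeH_C", "Tree Structure"),
    ("as_SbL1C_C", "Word Familiarity"),
]
print(select_features(ranking_a, ranking_b))
```

CorrNoV_S is dropped because its subgroup has no counterpart in the abbreviated approach A ranking.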
5.6 More on NERF & Calibration
The final NERF (section 5.5) consists of three parts. The first is lexico-semantics, which measures lexical difficulty. It adds the total sum of each word’s age-of-acquisition (Kuperman’s) and the sum of word familiarity scores (Lg10CD in SubtlexUS). The sum is divided by # sentences.
The second is syntactic complexity, which deals with how each sentence is structured. We look at the number of content words, noun phrases, and the total sum of sentence tree height. Here, content words (CW) are words that possess semantic content and contribute to the meaning of the specific sentence. Following LingFeat, we consider a word to be a content word if it has "NOUN", "VERB", "NUM", "ADJ", "ADV" as a POS tag. Also, a sentence’s tree height (TH) is calculated from a constituency-parsed tree, which we used the CRF parser (Zhang et al., 2020) to obtain. The related algorithms from NLTK (Bird et al., 2009) were used in calculating tree height. The same CRF parser was also used to count the number of noun phrase (NP) occurrences.
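The two syntactic counts can be illustrated with a standard-library sketch. It assumes tokens are already POS-tagged and that the constituency parse is available as a bracketed string in which tokens contain no literal parentheses. The height convention used (a tree of depth d has height d + 1) is, to our understanding, the one used by NLTK's `Tree.height()`, but the sketch does not call NLTK or the CRF parser.

```python
CONTENT_POS = {"NOUN", "VERB", "NUM", "ADJ", "ADV"}

def count_content_words(tagged_tokens):
    """Count tokens whose POS tag marks them as content words."""
    return sum(1 for _, pos in tagged_tokens if pos in CONTENT_POS)

def tree_height(bracketed):
    """Height of a bracketed constituency tree: deepest bracket nesting plus one."""
    depth = max_depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth + 1

tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")]
parse = "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"
print(count_content_words(tagged), tree_height(parse))  # 2 4
```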
The third is lexical richness, given through type token ratio (TTR). This is the only section of NERF that is averaged on the word count. TTR measures how many unique words appear with respect to the total word count. TTR is often used as a measure of lexical richness (Malvern and Richards, 2012) and ranked best on two native datasets (CCB and CAM). Importantly, these two datasets represent US and UK school curriculums, and TTR seems to be a good evaluator. Interestingly, out of the five TTR variations from Lee et al. (2021) and Vajjala and Meurers (2012), corrected TTR generalized particularly well.
As in section 4, we use the non-linear least squares fitting method on CCB to calibrate NERF. The results match what we expected. For example, the coefficient for word familiarity, which measures how frequently a word is used in American English, is negative, since common words often have faster lexical comprehension times (Brysbaert et al., 2011).
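Corrected TTR is commonly defined as the number of unique words divided by the square root of twice the total word count. The sketch below uses that common definition as an assumption; the exact tokenization and normalization in LingFeat's CorrTTR_S may differ.

```python
import math

def corrected_ttr(tokens):
    """Corrected type token ratio: #types / sqrt(2 * #tokens), case-insensitive."""
    if not tokens:
        return 0.0
    types = {t.lower() for t in tokens}
    return len(types) / math.sqrt(2 * len(tokens))

tokens = "the cat sat on the mat".split()
print(round(corrected_ttr(tokens), 4))  # 5 types, 6 tokens -> 5 / sqrt(12) = 1.4434
```

Unlike the plain ratio of types to tokens, this variant shrinks more slowly as texts get longer, which may be part of why it generalizes across datasets with different item lengths.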
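The three-part structure described in this section can be summarized schematically. The weights w_0 to w_6 below are symbolic placeholders rather than the calibrated values, and the presence of an intercept w_0 is our assumption:

```latex
\mathrm{NERF} =
\underbrace{\frac{w_1 \sum \mathrm{AoA} + w_2 \sum \mathrm{Fam}}{\#\mathrm{sent}}}_{\text{lexico-semantics}}
+ \underbrace{\frac{w_3\,\#\mathrm{CW} + w_4\,\#\mathrm{NP} + w_5 \sum \mathrm{TH}}{\#\mathrm{sent}}}_{\text{syntactic complexity}}
+ \underbrace{w_6 \cdot \mathrm{CTTR}}_{\text{lexical richness}}
+ w_0
```

Here AoA is Kuperman's age-of-acquisition, Fam is the SubtlexUS Lg10CD familiarity value, CW, NP, and TH are content words, noun phrases, and tree height, and CTTR is corrected TTR. Under this reading, the negative familiarity coefficient discussed above corresponds to w_2 < 0.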
6 Evaluation, against Human
Metric | Human | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|---|---|
MAE (original) | N.A. | N.A. | 2.844 | 3.413 | 3.114 | 2.537 | 3.377 |
MAE (adjusted) | 3.509 | 2.154 | 2.457 | 2.516 | 2.728 | 2.378 | 2.514 |
r2 score (original) | N.A. | N.A. | -0.03835 | -0.3905 | 0.1613 | 0.4341 | -0.5283 |
r2 score (adjusted) | -0.0312 | 0.5536 | 0.4423 | 0.4072 | 0.3192 | 0.4830 | 0.4263 |
Pearson r (original) | N.A. | N.A. | 0.5698 | 0.5757 | 0.5649 | 0.6800 | 0.5684 |
Pearson r (adjusted) | 0.0838 | 0.7440 | 0.6651 | 0.6381 | 0.5649 | 0.6949 | 0.6530 |
Here, we check the human-perceived difficulty of each item in CCB. We used Amazon Mechanical Turk to ask U.S. Bachelor’s degree holders, "Which U.S. grade does this text belong to?" Every item was answered by different workers to ensure breadth. Details on survey & datasets are in appendix B, C.
Table 5 gives a performance comparison of NERF against other traditional readability formulas and human performance. The human predictions were made by U.S. Bachelor’s degree holders living in the U.S. Ten human predictions were averaged to obtain the final prediction for each item, for comparison against CCB.
The calibrated formulas show a particularly large increase in r2 score. This likely means that the recalibrated formulas capture the variance of the original CCB classifications much better than the original formulas do. We believe that this improvement stems from the change in datasets. The original formulas are mostly built on human tests of 20th-century military or technical documents, whereas the recalibration dataset (CCB) is from the student-targeted school curriculum. Further, CCB is classified by trained professionals. Hence, the standards for readability can differ. The recalibrated versions are more suitable for analyzing modern general documents and giving K-* output by modernized standards.
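The averaging of human predictions and two of the reported metrics can be sketched with the standard library. The ratings below are made up for illustration, and `r2_score` follows the usual coefficient-of-determination definition, which becomes negative when predictions fit worse than simply predicting the mean label.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between labels and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical ratings: each inner list holds the human predictions for one item.
ratings = [[2, 3, 1], [4, 2, 3], [5, 3, 4], [7, 6, 5]]
human_pred = [mean(r) for r in ratings]  # one averaged prediction per item
labels = [1.0, 2.5, 4.5, 7.0]            # class estimates used as true labels

print(mae(labels, human_pred), r2_score(labels, human_pred))
```

A negative r2 score, as in several rows of Table 5, is therefore possible even when Pearson's r is positive.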
MAE (Mean Absolute Error), r2 score, and Pearson’s r improve once more with NERF. Even though the same dataset, same fitting function, and same evaluation techniques (no split, all train) were used, the critical difference was in the features. The shallow surface-level features from the traditional readability formulas also showed top rankings across all datasets but lacked linguistic coverage. Hence, NERF could capture more textual properties that led to a difference in readability.
Lastly, we observe that it is highly difficult for the general human population to exactly guess the readability of a text. Out of 690 predictions, only 286 were correct. We cautiously posit that this is because: 1. the concept of "readability" is vague and 2. everyone goes through varying education. It could be easier to choose which item is more readable, instead of guessing how readable an item is. For the general population, it is always better to use a quantified model than to trust human judgment.
7 Evaluation, for Application
7.1 Text Simplification - Passage-based
All readability formulas, whether recalibrated or not, show near-perfect performances in ranking the simplicity of texts. On both OSE-Pair & NSL-Pair, we designed a simple task of ranking the simplicity of an item. Both paired datasets include multiple simplified versions of an original item. Each row consists of various simplifications. A prediction is correct when the readability formula outputs follow the simplification levels (e.g. original: highest prediction, …, simplest: lowest prediction).
In OSE-Pair, a correct prediction must properly rank three simplified items. NERF showed meaningfully better performance than the other five traditional readability formulas before recalibration. NERF correctly classified 98.7% of pairs, while the others stayed below 95% (FKGL: 93.4%, FOGI: 92.6%, SMOG: 94.4%, COLE: 94.9%, AUTO: 92.6%). Recalibration generally helped the traditional readability formulas but NERF still showed better performance (FKGL: 97.8%, FOGI: 97.1%, SMOG: 94.4%, COLE: 89.9%, AUTO: 95.8%).
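The correctness criterion described above can be sketched as follows. We assume a strict ordering is required, so ties count as incorrect, and the scores in the example are hypothetical. A sentence pair is simply the two-item special case.

```python
def is_correctly_ranked(scores):
    """scores: formula outputs ordered from the original text to the simplest version.

    Correct only if every version scores strictly higher than the next simpler one.
    """
    return all(harder > easier for harder, easier in zip(scores, scores[1:]))

def ranking_accuracy(rows):
    """Share of rows whose versions are ranked in the expected order."""
    return sum(is_correctly_ranked(row) for row in rows) / len(rows)

# Hypothetical outputs for two items with three versions each.
rows = [[9.1, 7.4, 5.2], [8.0, 8.3, 4.1]]
print(ranking_accuracy(rows))  # 0.5
```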
In NSL-Pair, a correct prediction must properly rank five simplified items, which is a more difficult task than the previous. Nonetheless, all six formulas achieved 100% accuracies. The same results were achieved before and after CCB-recalibration. This hints that NSL-Pair is thoroughly simplified.
Readability formulas seem to perform well in ranking several simplifications on a passage level. But there certainly are limits. First, one must understand that calculating "how much simpler" is a much more difficult task (Table 5). Second, the good results could be because sufficient simplification was done. For more fine-grained simplifications, readability formulas may not be enough.
7.2 Text Simplification - Sentence-based
a) Adv-Ele | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|---|
Accuracy (original) | N.A. | 74.2% | 64.9% | 11.4% | 66.0% | 78.0% |
Accuracy (adjusted) | 77.4% | 62.7% | 51.8% | 11.4% | 71.1% | 65.2% |
b) Adv-Int | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
Accuracy (original) | N.A. | 70.2% | 63.0% | 12.2% | 63.6% | 74.7% |
Accuracy (adjusted) | 77.8% | 60.4% | 51.3% | 12.2% | 67.7% | 65.9% |
c) Int-Ele | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
Accuracy (original) | N.A. | 69.8% | 61.3% | 9.02% | 61.9% | 73.2% |
Accuracy (adjusted) | 73.1% | 59.7% | 48.9% | 9.02% | 66.5% | 62.1% |
We were surprised that some existing text simplification studies are directly using traditional readability formulas for sentence difficulty evaluation. Our results show that using a formula-based approach is particularly useless in evaluating a sentence.
We tested both CCB-recalibrated and original formulas on ASSET. Here, a correct prediction must properly rank eleven simplified items. Despite the task difficulty, we anticipated seeing some correct predictions as there were 360 pairs. SMOG guessed 37 (after recalibration) and 89 (before recalibration) correct out of 360. But all the other formulas failed to make any correct prediction.
OSE-Sent poses an easier task. Since the dataset is divided into adv-int, adv-ele, and int-ele, the readability formulas now had to guess which is more difficult, out of the given two. We do obtain some positive results, showing that readability formulas can be useful in the cases where only two sentences are compared. On ranking two sentences, NERF performs better by a large margin.
7.3 Medical Documents

We argue that NERF is effective in fixing the over-inflated prediction of difficulty on medical texts. Such sudden inflation is widely reported (Zheng and Yu, 2017) as a common weakness of traditional readability formulas on medical documents.
The U.S. National Institutes of Health (NIH) recommends that patient documents be at a K-6 level of difficulty. The most distinct characteristic of medical documents is the use of lengthy medical terms, like otolaryngology, urogynecology, and rheumatology. This makes traditional formulas, based on syllables, unreliable. But NERF uses familiarity and age-of-acquisition to penalize and reward word difficulty.
A medical term not found in Kuperman’s and SubtlexUS will have no effect. Instead, it will simply be labeled a content word. But in traditional formulas, the repetitive use of medical terms (which is likely the case) results in an insensible aggregation of text difficulty. In case various medical terms appear, NERF rewards each as a unique word.
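The "no effect" behavior for out-of-lexicon terms can be illustrated with a small lookup sketch. The lexicon values are hypothetical placeholders, not actual Kuperman or SubtlexUS entries, and the function name is ours.

```python
def lexical_sum(tokens, lexicon):
    """Sum lexicon values over tokens; tokens missing from the lexicon add nothing."""
    return sum(lexicon.get(token.lower(), 0.0) for token in tokens)

# Hypothetical age-of-acquisition values for illustration only.
toy_aoa = {"the": 3.0, "doctor": 5.0}

tokens = ["The", "doctor", "studies", "otolaryngology"]
print(lexical_sum(tokens, toy_aoa))  # 8.0: the two unknown words contribute nothing
```

A syllable-based formula would instead add the long medical term's full syllable count every time it is repeated.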
Among recent studies is Haller et al. (2019), which analyzed the readability of urogynecology patient education documents in FKGL, SMOG, and Fry Readability. We also analyze the same 18 documents from the American Urogynecologic Society (AUGS) by manual OCR-based scraping. As Figure 1 shows, it is evident that NERF helps regulate the traditional readability formulas’ tendencies to over-inflate on medical texts. An example of the collected resource is given in appendix B.
8 Conclusion
So far, we have recalibrated five traditional readability formulas and assessed their performance. We evaluated them on CCB and showed that the adjusted variations help traditional readability formulas give output more in line with CCB, a common English education curriculum used throughout the United States. Further, we evaluated the recalibrated formulas’ application to text simplification research. On ranking passage difficulty, our recalibrated formulas showed good performance. However, the formulas performed poorly on ranking sentence difficulty because they were calibrated on passage-length instances. We leave sentence difficulty ranking as an open task.
Apart from recalibrating traditional readability formulas, we also develop a new, linguistically rich readability formula named NERF. We show that NERF can be much more useful for text simplification studies and for analyzing the readability of medical documents. Our paper also serves as a cross-comparison among readability metrics. Lastly, we develop public Python-based software for the fast dissemination of the results.
9 Limitations
Our work’s limitations mainly come from CCB. It is manifestly difficult to obtain a solid, gold readability-labelled dataset from an officially accredited organization. CCB, the main dataset that we used to calibrate traditional readability formulas, has only 69 items available. Thus, we reasonably anticipate that variation in dialect, individual differences and general ability cannot be captured.
However, we highlight that NERF is developed upon several more datasets that represent diverse backgrounds, audiences, and reading levels. Hence, we believe that NERF can counter some of the shallowness of the traditional readability formulas, despite the weaknesses that remain.
One aspect of readability formulas that has not been deeply investigated is how the output changes depending on the text length. As we show in Section 7, readability formulas fail to perform well on sentence-level items. But how about a passage of three sentences? Does the performance have to do with the average number of words in the recalibration dataset? Is there some sensible range that the readability formulas work well for? These are some open questions that we do not address in this work.
References
- Aluisio et al. (2010) Sandra Aluisio, Lucia Specia, Caroline Gasperin, and Carolina Scarton. 2010. Readability assessment for text simplification. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 1–9.
- Alva-Manchego et al. (2020) Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. Asset: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. arXiv preprint arXiv:2005.00481.
- Bange et al. (2019) Matthew Bange, Eric Huh, Sherwin A Novin, Ferdinand K Hui, and Paul H Yi. 2019. Readability of patient education materials from radiologyinfo.org: has there been progress over the past 5 years? American Journal of Roentgenology, 213(4):875–879.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- Brysbaert et al. (2011) Marc Brysbaert, Matthias Buchmeier, Markus Conrad, Arthur M Jacobs, Jens Bölte, and Andrea Böhl. 2011. The word frequency effect. Experimental psychology.
- Brysbaert and New (2009) Marc Brysbaert and Boris New. 2009. Moving beyond kučera and francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for american english. Behavior research methods, 41(4):977–990.
- Coleman and Liau (1975) Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.
- Collins-Thompson (2014) Kevyn Collins-Thompson. 2014. Computational assessment of text readability: A survey of current and future research. ITL-International Journal of Applied Linguistics, 165(2):97–135.
- Dale and Chall (1948) Edgar Dale and Jeanne S Chall. 1948. A formula for predicting readability: Instructions. Educational research bulletin, pages 37–54.
- Dennis (2018) Murphy Odo Dennis. 2018. A comparison of readability and understandability in second language acquisition textbooks for pre-service efl teachers. Journal of Asia TEFL, 15(3):750–765.
- DuBay (2004) William H DuBay. 2004. The principles of readability. Online Submission.
- Feng et al. (2010) Lijun Feng, Martin Jansche, Matt Huenerfauth, and Noémie Elhadad. 2010. A comparison of features for automatic readability assessment.
- Gaeta et al. (2021) Laura Gaeta, Edward Garcia, and Valeria Gonzalez. 2021. Readability and suitability of spanish-language hearing aid user guides. American Journal of Audiology, 30(2):452–457.
- Gunning et al. (1952) Robert Gunning et al. 1952. Technique of clear writing.
- Haller et al. (2019) Jasmine Haller, Zachary Keller, Susan Barr, Kristie Hadden, and Sallie S Oliphant. 2019. Assessing readability: are urogynecologic patient education materials at an appropriate reading level? Female pelvic medicine & reconstructive surgery, 25(2):139–144.
- Hansberry et al. (2018) David R Hansberry, Michael D’Angelo, Michael D White, Arpan V Prabhu, Mougnyan Cox, Nitin Agarwal, and Sandeep Deshmukh. 2018. Quantitative analysis of the level of readability of online emergency radiology-based patient education resources. Emergency radiology, 25(2):147–152.
- Honnibal and Johnson (2015) Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.
- Joseph et al. (2020) Pradeep Joseph, Nicole A Silva, Anil Nanda, and Gaurav Gupta. 2020. Evaluating the readability of online patient education materials for trigeminal neuralgia. World Neurosurgery, 144:e934–e938.
- Kincaid et al. (1975) J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Technical report, Naval Technical Training Command Millington TN Research Branch.
- Kiwanuka et al. (2017) Elizabeth Kiwanuka, Raman Mehrzad, Adnan Prsic, and Daniel Kwan. 2017. Online patient resources for gender affirmation surgery: an analysis of readability. Annals of plastic surgery, 79(4):329–333.
- Klare (2000) George R Klare. 2000. The measurement of readability: useful information for communicators. ACM Journal of Computer Documentation (JCD), 24(3):107–121.
- Kue et al. (2021) Jennifer Kue, Dori L Klemanski, and Kristine K Browning. 2021. Evaluating readability scores of treatment summaries and cancer survivorship care plans. JCO Oncology Practice, pages OP–20.
- Kuperman et al. (2012) Victor Kuperman, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. Age-of-acquisition ratings for 30,000 english words. Behavior research methods, 44(4):978–990.
- Lee et al. (2021) Bruce W Lee, Yoo Sung Jang, and Jason Lee. 2021. Pushing on text readability assessment: A transformer meets handcrafted linguistic features. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10669–10686.
- Lee and Lee (2020a) Bruce W. Lee and Jason Lee. 2020a. LXPER index 2.0: Improving text readability assessment model for L2 English students in Korea. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications, pages 20–24, Suzhou, China. Association for Computational Linguistics.
- Lee and Lee (2020b) Bruce W. Lee and Jason Hyung-Jong Lee. 2020b. Lxper index: A curriculum-specific text readability assessment model for efl students in korea. International Journal of Advanced Computer Science and Applications, 11(8).
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Lively and Pressey (1923) Bertha A Lively and Sidney L Pressey. 1923. A method for measuring the vocabulary burden of textbooks. Educational administration and supervision, 9(7):389–398.
- Malvern and Richards (2012) David Malvern and Brian Richards. 2012. Measures of lexical richness. The encyclopedia of applied linguistics.
- Mc Laughlin (1969) G Harry Mc Laughlin. 1969. Smog grading-a new readability formula. Journal of reading, 12(8):639–646.
- Meng et al. (2020) Changping Meng, Muhao Chen, Jie Mao, and Jennifer Neville. 2020. Readnet: A hierarchical transformer framework for web article readability analysis. Advances in Information Retrieval, 12035:33.
- Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Powell et al. (2020) Lauren E Powell, Emily S Andersen, and Andrea L Pozez. 2020. Assessing readability of patient education materials on breast reconstruction by major us academic institutions. Plastic and Reconstructive Surgery–Global Open, 8(9S):127–128.
- Scarton and Specia (2018) Carolina Scarton and Lucia Specia. 2018. Learning simplifications for specific target audiences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 712–718.
- Schwartz et al. (2017) H. Andrew Schwartz, Masoud Rouhizadeh, Michael Bishop, Philip Tetlock, Barbara Mellers, and Lyle Ungar. 2017. Assessing objective recommendation quality through political forecasting. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2348–2357, Copenhagen, Denmark. Association for Computational Linguistics.
- Shardlow and Nawaz (2019) Matthew Shardlow and Raheel Nawaz. 2019. Neural text simplification of clinical letters with a domain specific phrase table. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 380–389.
- Smith and Senter (1967) Edgar A Smith and RJ Senter. 1967. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories (US), pages 1–14.
- Szmuda et al. (2020) T Szmuda, C Özdemir, S Ali, A Singh, MT Syed, and P Słoniewski. 2020. Readability of online patient education material for the novel coronavirus disease (covid-19): a cross-sectional health literacy study. Public Health, 185:21–25.
- Uchendu et al. (2020) Adaku Uchendu, Thai Le, Kai Shu, and Dongwon Lee. 2020. Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384–8395, Online. Association for Computational Linguistics.
- Vajjala and Lučić (2018) Sowmya Vajjala and Ivana Lučić. 2018. Onestopenglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications, pages 297–304.
- Vajjala and Meurers (2012) Sowmya Vajjala and Detmar Meurers. 2012. On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the seventh workshop on building educational applications using NLP, pages 163–173.
- van Ballegooie and Hoang (2021) Courtney van Ballegooie and Peter Hoang. 2021. Assessment of the readability of online patient education material from major geriatric associations. Journal of the American Geriatrics Society, 69(4):1051–1056.
- Verhelst et al. (2001) N Verhelst, Piet Van Avermaet, S Takala, N Figueras, and B North. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.
- Virtanen et al. (2020) Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272.
- Weller et al. (2020) Orion Weller, Jordan Hildebrandt, Ilya Reznik, Christopher Challis, E Shannon Tass, Quinn Snell, and Kevin Seppi. 2020. You don’t have time to read this: An exploration of document reading time prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1789–1794.
- Wes McKinney (2010) Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, pages 56 – 61.
- Wu et al. (2013) Danny TY Wu, David A Hanauer, Qiaozhu Mei, Patricia M Clark, Lawrence C An, Jianbo Lei, Joshua Proulx, Qing Zeng-Treitler, and Kai Zheng. 2013. Applying multiple methods to assess the readability of a large corpus of medical documents. Studies in health technology and informatics, 192:647.
- Xia et al. (2016) Menglin Xia, Ekaterina Kochmar, and Ted Briscoe. 2016. Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22.
- Xu et al. (2015) Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
- Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
- Zhang et al. (2020) Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In Proceedings of IJCAI, pages 4046–4053.
- Zheng and Yu (2017) Jiaping Zheng and Hong Yu. 2017. Readability formulas and user perceptions of electronic health records difficulty: a corpus study. Journal of medical Internet research, 19(3):e59.
- Zhou et al. (2017) Shixiang Zhou, Heejin Jeong, and Paul A Green. 2017. How consistent are the best-known readability equations in estimating the readability of design standards? IEEE Transactions on Professional Communication, 60(1):97–111.
Appendix A Public Resources We Developed
A.1 Python Library
A.1.1 As a Readability Tool
<Anonymous> supports six readability formulas: NERF, FKGL, FOGI, SMOG, COLE and AUTO. All formulas, other than NERF, are also available in recalibrated variations. A particularly useful feature of this library is that all formulas are fitted to give the U.S. standard school grading system as output. Compared to some other traditional readability formulas, where a user has to refer to a table to understand the output, K-* based numbers are intuitive.
A.1.2 As a General Tool
We plan to expand <Anonymous> to support various menial tasks in text analysis, focusing on tasks that are better served by simple approaches. One feature that we have already implemented is text reading time estimation. Weller et al. (2020) have previously shown in a large-scale study that a commonly used rule of thumb for online reading estimates, 240 words per minute (WPM), yields better RMSE and MAE results than more modern approaches using XLNet (Yang et al., 2019), ELMo (Peters et al., 2018) and RoBERTa (Liu et al., 2019). We implement 175, 240 and 300 WPM.
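The words-per-minute rule of thumb amounts to a single division, as the following minimal sketch shows. The function name `read_time_minutes`, the whitespace tokenization, and the choice of minutes as the output unit are assumptions made for illustration; they are not necessarily how the library computes or reports reading time.

```python
def read_time_minutes(text: str, wpm: int = 240) -> float:
    """Estimate reading time in minutes as word count divided by words per minute."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    # Whitespace splitting is a simplifying assumption about what counts as a word.
    return len(text.split()) / wpm


passage = "word " * 480  # a synthetic 480-word passage
for rate in (175, 240, 300):
    print(rate, round(read_time_minutes(passage, wpm=rate), 2))
```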
A.1.3 Basic Usage
For straightforward maintenance, we keep <Anonymous>’s architecture as simple as possible. There are not many steps for the user to take:
import <Anonymous>
new_object = <Anonymous>.request(…)
readability_score1 = new_object.NERF()
readability_score2 = new_object.FKGL()
readability_score3 = new_object.FOGI()
readability_score4 = new_object.SMOG()
readability_score5 = new_object.COLE()
readability_score6 = new_object.AUTO()
time_to_read = new_object.RT()
NERF(), FKGL(), FOGI(), SMOG(), COLE(), AUTO() and RT() are shortcut functions. It can be slightly faster to call the full forms directly:
new_english_readability_formula()
flesch_kincaid_grade_level()
fog_index()
smog_index()
coleman_liau_index()
automated_readability_index()
read_time()
Further, all readability formula functions (except for NERF) have an option to choose the original or the adjusted variation. The default is adjusted = True.
A.1.4 <Anonymous> Calculation Speed
We care for the library’s calculation speed so that it can be of practical use for research implementations. We chose the following items for evaluation.
ITEM A
In those times panics were common, and few days passed without some city or other registering in its archives an event of this kind. There were nobles, who made war against each other; there was the king, who made war against the cardinal; there was Spain, which made war against the king. Then, in addition to these concealed or public, secret or open wars, there were robbers, mendicants, Huguenots, wolves, and scoundrels, who made war upon everybody. The citizens always took up arms readily against thieves, wolves or scoundrels, often against nobles or Huguenots, sometimes against the king, but never against the cardinal or Spain. It resulted, then, from this habit that on the said first Monday of April, 1625, the citizens, on hearing the clamor, and seeing neither the red-and-yellow standard nor the livery of the Duc de Richelieu, rushed toward the hostel of the Jolly Miller. When arrived there, the cause of the hubbub was apparent to all.
The Three Musketeers, Alexandre Dumas
ITEM B
The vaccine contains lipids (fats), salts, sugars and buffers. COVID-19 vaccines do not contain eggs, gelatin (pork), gluten, latex, preservatives, antibiotics, adjuvants or aluminum. The vaccines are safe, even if you have food, drug, or environmental allergies. Talk to a health care provider first before getting a vaccine if you have allergies to the following vaccine ingredients: polyethylene glycol (PEG), polysorbate 80 and/or tromethamine (trometamol or Tris).
COVID-19 Vaccine Information Sheet, Ministry of Health, Ontario Canada
ITEM C
BERT alleviates the previously mentioned unidirectionality constraint by using a “masked language model”(MLM) pre-training objective, inspired by the Cloze task.
Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
a) ITEM A | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|---|
item * 1 | 0.6371 | 0.0002 | 0.0001 | 0.0001 | 0.0000 | 0.0000 |
item * 5 | 2.6450 | 0.0006 | 0.0005 | 0.0004 | 0.0001 | 0.0001 |
item * 10 | 5.5175 | 0.0011 | 0.0010 | 0.0010 | 0.0004 | 0.0004 |
item * 15 | 7.8088 | 0.0016 | 0.0016 | 0.0013 | 0.0003 | 0.0004 |
item * 20 | 10.226 | 0.0021 | 0.0021 | 0.0018 | 0.0004 | 0.0004 |
b) ITEM B | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|---|
item * 1 | 0.3531 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
item * 5 | 1.2842 | 0.0003 | 0.0003 | 0.0002 | 0.0000 | 0.0000 |
item * 10 | 2.5178 | 0.0005 | 0.0005 | 0.0004 | 0.0001 | 0.0001 |
item * 15 | 3.6545 | 0.0009 | 0.0007 | 0.0006 | 0.0002 | 0.0002 |
item * 20 | 4.8308 | 0.0010 | 0.0010 | 0.0009 | 0.0002 | 0.0002 |
c) ITEM C | NERF | FKGL | FOGI | SMOG | COLE | AUTO |
---|---|---|---|---|---|---|
item * 1 | 0.1373 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
item * 5 | 0.1888 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
item * 10 | 0.2528 | 0.0002 | 0.0002 | 0.0002 | 0.0000 | 0.0000 |
item * 15 | 0.3420 | 0.0003 | 0.0003 | 0.0002 | 0.0000 | 0.0000 |
item * 20 | 0.3886 | 0.0004 | 0.0003 | 0.0003 | 0.0000 | 0.0000 |
First, it is clear that AUTO keeps calculation time short even for longer texts, as originally intended. Second, NERF’s calculation time increases roughly linearly with text length. Although we believe that NERF’s speed is reasonable given its wide linguistic coverage, speed is a weakness when compared to the other readability formulas.
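A timing comparison of this kind can be reproduced with a small harness such as the sketch below. It assumes that "item * n" means the item text repeated n times, and it uses a trivial stand-in scoring function (`words_per_sentence`) rather than any formula from the library; `time_formula` is a hypothetical helper name.

```python
import re
import time
from typing import Callable, Dict, Sequence


def words_per_sentence(text: str) -> float:
    """Stand-in scoring function; any readability formula could be passed instead."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / max(1, len(sentences))


def time_formula(
    score_fn: Callable[[str], float],
    item: str,
    repeats: Sequence[int] = (1, 5, 10, 15, 20),
) -> Dict[int, float]:
    """Return wall-clock seconds taken by score_fn on the item repeated n times."""
    timings = {}
    for n in repeats:
        text = " ".join([item] * n)  # assumption: "item * n" is simple repetition
        start = time.perf_counter()
        score_fn(text)
        timings[n] = time.perf_counter() - start
    return timings


item = "In those times panics were common. The citizens always took up arms readily."
for n, seconds in time_formula(words_per_sentence, item).items():
    print(f"item * {n}: {seconds:.4f}")
```

A single call per setting is noisy; averaging several runs per text length would give more stable numbers.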
A.2 Research Archive
Our datasets, preprocessing code and evaluation code can be found in <Anonymous>. Copyrighted resources are available upon request to the first author.
Appendix B External Resources
B.1 Python Libraries
pandas v.1.3.4 (Wes McKinney, 2010)
Calculations on Kuperman’s AoA CSV and the SubtlexUS word familiarity CSV; data management and manipulation. For the feature study, correlating and ranking features in Tables 3 and 4.
SuPar v.1.1.3 - CRF Parser
Constituency parsing on input sentences -> calculate tree height and count noun phrases.
spaCy v.3.2.0 (Honnibal and Johnson, 2015)
Sentence/dependency parsing on documents -> send input into SuPar and count content words (POS).
scikit-learn v.1.0.1
Calculation of r2 score and MAE in Tables 2 and 5.
SciPy v.1.7.3
Calculation of Pearson’s r for Tables 2 and 5. Fitting function (scipy.optimize.curve_fit()) used to recalibrate traditional readability formulas and give coefficients for NERF in Table 2.
NLTK v.3.6.5
Calculation of tree height for NERF.
LingFeat v.1.0.0-beta.19
Extraction of handcrafted linguistic features.
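To illustrate how scipy.optimize.curve_fit() can refit the coefficients of a formula with a fixed functional form, the sketch below recalibrates an FKGL-shaped function. All data here are synthetic and generated inside the script; the "target" coefficients are arbitrary values chosen for the demonstration and are not the recalibrated coefficients reported in this paper. The function name `fkgl_form` and the use of the original FKGL coefficients as the starting point are likewise assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr


def fkgl_form(X, a, b, c):
    """FKGL-shaped formula: a * words/sentence + b * syllables/word + c."""
    words_per_sent, sylls_per_word = X
    return a * words_per_sent + b * sylls_per_word + c


# Synthetic stand-in for a labelled calibration set (69 rows, mirroring CCB's size only).
rng = np.random.default_rng(0)
words_per_sent = rng.uniform(5, 30, size=69)
sylls_per_word = rng.uniform(1.1, 2.0, size=69)
target_coeffs = (0.2, 9.0, -8.0)  # arbitrary demonstration values
grades = fkgl_form((words_per_sent, sylls_per_word), *target_coeffs)

# Start the search from the original FKGL coefficients and refit to the labels.
original_coeffs = (0.39, 11.8, -15.59)
X = np.vstack([words_per_sent, sylls_per_word])
fitted_coeffs, _ = curve_fit(fkgl_form, X, grades, p0=original_coeffs)

predictions = fkgl_form(X, *fitted_coeffs)
r, _ = pearsonr(predictions, grades)
mae = float(np.mean(np.abs(predictions - grades)))
print(np.round(fitted_coeffs, 3), round(float(r), 3), round(mae, 3))
```

With real, noisy grade labels the fit would not be exact, and the resulting Pearson’s r and MAE would be the quantities of interest.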
B.2 Datasets
New Class | CCB | WBT |
---|---|---|
K1.0 | K1 (Age 6-7) | N/A |
K2.0 | N/A | Level 2 (Age 7-8) |
K2.5 | K2-3 (Age 7-9) | N/A |
K3.0 | N/A | Level 3 (Age 8-9) |
K4.0 | N/A | Level 4 (Age 9-10) |
K4.5 | K4-5 (Age 9-11) | N/A |
K7.0 | K6-8 (Age 11-14) | KS3 (Age 11-14) |
K9.5 | K9-10 (Age 14-16) | GCSE (Age 14-16) |
K12.0 | K11-CCR (Age 16+) | N/A |
We collected CCB by manually going through an official source (corestandards.org/assets/Appendix_B.pdf). WBT was obtained from the authors (Dr. Sowmya Vajjala, National Research Council, Canada) in HTML format. We conducted basic preprocessing and converted WBT to CSV format. CAM was retrieved from an existing archive (ilexir.co.uk/datasets/index.html). CKC was retrieved from a South Korean educational company (Bruce W. Lee, LXPER Inc., South Korea). OSE was retrieved from a public archive (github.com/nishkalavallabhi/OneStopEnglishCorpus). NSL was obtained from an American educational company (Luke Orland, Newsela Inc., New York, U.S.A.). AUGS medical texts (refer to Section 7.3) were manually scraped from the official website (augs.org/patient-fact-sheets/). ASSET was obtained from a public repository (github.com/facebookresearch/asset). Lastly, Table 8 shows how we converted WBT class labels to fit CCB, as shown in Table 1. All uses were consistent with the intended use of each resource.
Further, to give more background to Section 7.2, we give example pairs from ASSET and OSE-Sent.
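The label conversion in the table above can be expressed as a pair of lookup tables. The sketch below copies the class values from that table; the dictionary and function names (`CCB_TO_NEW`, `WBT_TO_NEW`, `to_new_class`) are hypothetical and chosen only for illustration.

```python
# Original class label -> new numeric class, copied from the conversion table.
CCB_TO_NEW = {
    "K1": 1.0,
    "K2-3": 2.5,
    "K4-5": 4.5,
    "K6-8": 7.0,
    "K9-10": 9.5,
    "K11-CCR": 12.0,
}
WBT_TO_NEW = {
    "Level 2": 2.0,
    "Level 3": 3.0,
    "Level 4": 4.0,
    "KS3": 7.0,
    "GCSE": 9.5,
}


def to_new_class(label: str, dataset: str) -> float:
    """Map an original CCB or WBT label to the shared numeric class."""
    tables = {"CCB": CCB_TO_NEW, "WBT": WBT_TO_NEW}
    if dataset not in tables:
        raise KeyError(f"unknown dataset: {dataset}")
    return tables[dataset][label]


print(to_new_class("KS3", "WBT"), to_new_class("K6-8", "CCB"))
```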
ASSET
0: Gable earned an Academy Award nomination for portraying Fletcher Christian in Mutiny on the Bounty.
1: Gable also earned an Oscar nomination when he portrayed Fletcher Christian in 1935’s Mutiny on the Bounty.
2: Gable won an Academy Award vote when he acted in 1935’s Mutiny on the Bounty as Fletcher Christian.
3: Gable also won an Academy Award nomination when he played Fletcher Christian in the 1935 film Mutiny on the Bounty.
4: Gable was nominated for an Academy Award for portraying Fletcher Christian in 1935’s Mutiny on the Bounty.
5: Gable also earned an Academy Award nomination in 1935 for playing Fletcher Christian in "Mutiny on the Bounty.
6: Gable also earned an Academy Award nomination when he played Fletcher Christian in 1935’s Mutiny on the Bounty.
7: Gable recieved an Academy Award nomination for his role as Fletcher Christian. The film was Mutiny on the Bounty (1935).
8: Gable earned an Academy Award nomination for his role as Fletcher Christian in the 1935 film Mutiny on the Bounty.
9: Gable also got an Academy Award nomination when he played Fletcher Christian in 1935’s movie, Mutiny on the Bounty.
10: Gable also earned an Academy Award nomination when he portrayed Fletcher Christian in 1935’s Mutiny on the Bounty.
OSE-Sent (ADV-ELE)
ADV: The Seattle-based company has applied for its brand to be a top-level domain name (currently .com), but the South American governments argue this would prevent the use of this internet address for environmental protection, the promotion of indigenous rights and other public interest uses.
ELE: Amazon has asked for its company name to be a top-level domain name (currently .com), but the South American governments say this would stop the use of this internet address for environmental protection, indigenous rights and other public interest uses.
OSE-Sent (ADV-INT)
ADV: Brazils latest funk sensation, Anitta, has won millions of fans by taking the favela sound into the mainstream, but she is at the centre of a debate about skin colour.
INT: Brazils latest funk sensation, Anitta, has won millions of fans by making the favela sound popular, but she is at the centre of a debate about skin colour.
OSE-Sent (INT-ELE)
INT: Allowing private companies to register geographical names as gTLDs to strengthen their brand or to profit from the meaning of these names is not, in our view, in the public interest, the Brazilian Ministry of Science and Technology said.
ELE: Allowing private companies to register geographical names as gTLDs to profit from the meaning of these names is not, in our view, in the public interest, the Brazilian Ministry of Science and Technology said.
The following is an example of the AUGS medical documents used in Section 7.3 and Figure 1.
Interstitial Cystitis: Interstitial Cystitis/ Bladder Pain Syndrome Interstitial cystitis/bladder pain syndrome (IC/BPS) is a condition with symptoms including burning, pressure, and pain in the bladder along with urgency and frequency. About IC/BPS IC/BPS occurs in three to seven percent of women, and can affect men as well. Though usually diagnosed among women in their 40s, younger and older women have IC/BPS, too. It can feel like a constant bladder infection. Symptoms may become severe (called a "flare") for hours, days or weeks, and then disappear. Or, they may linger at a very low level during other times. Individuals with IC/BPS may also have other health issues such as irritable bowel syndrome, fibromyalgia, chronic headaches, and vulvodynia. Depression and anxiety are also common among women with this condition. The cause of IC/BPS is unknown. It is likely due to a combination of factors. IC/BPS runs in families and so may have a genetic factor. On cystoscopy, the doctor may see damage to the wall of the bladder. This may allow toxins from the urine to seep into the delicate layers of the bladder lining, causing the pain of IC/BPS. Other research found that nerves in and around the bladder of people with IC/BPS are hypersensitive. This may also contribute to IC/BPS pain. There may also be an allergic component.
Appendix C CCB Human Predictions
In Section 2.1, we mention that human predictions were collected on Amazon Mechanical Turk. We then compared human performance to the readability formulas in Table 5. The survey was designed as follows.
Description: must choose which difficulty level does the text belong, "difficulty does not correlate with text length"
Qualification Requirement(s): Location is one of US, HIT Approval Rate (%) for all Requesters’ HITs greater than 80, Number of HITs Approved greater than 50, US Bachelor’s Degree equal to true, Masters has been granted
All 69 story-type items from CCB were given. Each item had to be completed by at least 10 different individuals, resulting in 690 responses in total. Participants were given 6 representative examples. Payment was adequate, and participants were informed that their responses would be used for research.
Appendix D Handcrafted Linguistic Features and the Respective Generalizability
We give the full generalizability rankings that we obtained through LingFeat. Considering that much work remains to be done on the generalizability of RA, we believe that these rankings are particularly helpful. Table 9, Table 10, Table 11, Table 12, Table 13, Table 14, and Table 15 are expanded versions of Table 3 and Table 4. The features not shown scored a 0.
From the full rankings, it is clear that shallow traditional (surface-level), lexico-semantic and syntactic features are effective throughout all datasets. Advanced semantics and discourse features show somewhat similar mid-to-low performance. However, it should be acknowledged that lexico-semantic and syntactic features are also among the worst performing. This is perhaps because LingFeat itself has a very lexico-semantics- and syntax-focused collection of handcrafted linguistic features. Thus, more study is needed.
Even if two features are from the same group (phrasal), they can show drastically different performance (# Noun phrases per Sent scored 39 in approach A vs. # Verb phrases per Sent, which scored 1 in approach A). Hence, a thorough feature study must always be conducted during research. When selecting features for a readability-related model, we recommend picking the best-performing feature from each feature group.
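The per-dataset "r" and "rk" columns in the tables that follow pair each feature with a correlation and a rank. A minimal sketch of that kind of per-dataset ranking is given below. It assumes that features are ranked by the absolute value of Pearson’s r against the readability label, which may differ from the exact procedure used here; the feature names and values are toy data invented for the example, and the aggregate Score column is not reproduced.

```python
import numpy as np


def rank_features(features: dict, labels) -> list:
    """Return (name, |r|, rank) tuples sorted by absolute Pearson correlation."""
    labels = np.asarray(labels, dtype=float)
    scored = []
    for name, values in features.items():
        r = np.corrcoef(np.asarray(values, dtype=float), labels)[0, 1]
        scored.append((name, abs(float(r))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(name, r, rank) for rank, (name, r) in enumerate(scored, start=1)]


# Toy data: five documents with made-up feature values and grade labels.
toy_labels = [1, 2, 3, 4, 5]
toy_features = {
    "tokens_per_sent": [2, 4, 6, 8, 10],
    "noise_feature": [3, 1, 4, 1, 5],
}
for name, r, rank in rank_features(toy_features, toy_labels):
    print(rank, name, round(r, 3))
```

Running the same function once per dataset and comparing the resulting ranks is one simple way to see which features generalize.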
Appendix E Computing Power
Single CPU chip. Architecture: x86_64; CPU(s): 16; Model name: Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz; CPU MHz: 800.024
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | ShaTr | Shallow | as_Sylla_C | # syllables per Sent | 0.541 | 24 | 0.461 | 10 | 0.686 | 50 | 0.697 | 11 | 0.59 | 31 |
43 | LxSem | Psycholinguistic | as_AAKuL_C | lemmas AoA of lemmas per Sent | 0.54 | 25 | 0.505 | 1 | 0.722 | 42 | 0.711 | 4 | 0.601 | 25 |
43 | ShaTr | Shallow | as_Chara_C | # characters per Sent | 0.539 | 27 | 0.487 | 4 | 0.696 | 46 | 0.711 | 5 | 0.613 | 20 |
43 | LxSem | Psycholinguistic | as_AAKuW_C | AoA of words per Sent | 0.537 | 28 | 0.502 | 2 | 0.722 | 41 | 0.711 | 6 | 0.602 | 24 |
42 | Synta | Tree Structure | as_FTree_C | length of flattened Trees per Sent | 0.505 | 37 | 0.485 | 5 | 0.677 | 54 | 0.719 | 2 | 0.622 | 16 |
40 | LxSem | Psycholinguistic | at_AAKuW_C | AoA of words per Word | 0.703 | 5 | 0.308 | 36 | 0.784 | 20 | 0.643 | 21 | 0.455 | 66 |
40 | Synta | Tree Structure | as_TreeH_C | Tree height per Sent | 0.55 | 21 | 0.341 | 30 | 0.686 | 51 | 0.699 | 9 | 0.541 | 44 |
40 | Synta | Part-of-Speech | as_ContW_C | # Content words per Sent | 0.534 | 29 | 0.453 | 13 | 0.667 | 56 | 0.688 | 14 | 0.544 | 43 |
40 | ShaTr | Shallow | as_Token_C | # tokens per Sent | 0.494 | 40 | 0.464 | 9 | 0.65 | 60 | 0.709 | 7 | 0.58 | 36 |
39 | LxSem | Psycholinguistic | at_AAKuL_C | lemmas AoA of lemmas per Word | 0.723 | 2 | 0.323 | 35 | 0.785 | 19 | 0.65 | 20 | 0.453 | 67 |
39 | Synta | Phrasal | as_NoPhr_C | # Noun phrases per Sent | 0.55 | 20 | 0.406 | 25 | 0.66 | 58 | 0.673 | 18 | 0.582 | 35 |
39 | Synta | Phrasal | to_PrPhr_C | total # prepositional phrases | 0.47 | 47 | 0.189 | 58 | 0.808 | 11 | 0.58 | 36 | 0.729 | 3 |
39 | Synta | Part-of-Speech | as_FuncW_C | # Function words per Sent | 0.468 | 48 | 0.471 | 8 | 0.662 | 57 | 0.673 | 17 | 0.614 | 19 |
38 | LxSem | Psycholinguistic | to_AAKuL_C | total lemmas AoA of lemmas | 0.428 | 71 | 0.189 | 59 | 0.835 | 3 | 0.627 | 22 | 0.716 | 5 |
38 | LxSem | Psycholinguistic | to_AAKuW_C | total AoA (Age of Acquisition) of words | 0.427 | 72 | 0.189 | 60 | 0.835 | 4 | 0.625 | 23 | 0.715 | 6 |
36 | Synta | Phrasal | as_PrPhr_C | # prepositional phrases per Sent | 0.513 | 35 | 0.417 | 23 | 0.607 | 70 | 0.608 | 28 | 0.59 | 32 |
36 | LxSem | Word Familiarity | as_SbL1C_C | SubtlexUS Lg10CD value per Sent | 0.467 | 49 | 0.43 | 20 | 0.612 | 69 | 0.699 | 10 | 0.533 | 45 |
35 | LxSem | Type Token Ratio | CorrTTR_S | Corrected TTR | 0.745 | 1 | 0.006 | 228 | 0.846 | 1 | 0.445 | 65 | 0.692 | 7 |
35 | LxSem | Word Familiarity | as_SbL1W_C | SubtlexUS Lg10WF value per Sent | 0.462 | 52 | 0.437 | 19 | 0.605 | 71 | 0.693 | 12 | 0.523 | 48 |
34 | Synta | Part-of-Speech | as_NoTag_C | # Noun POS tags per Sent | 0.551 | 19 | 0.304 | 38 | 0.624 | 65 | 0.608 | 29 | 0.48 | 61 |
34 | LxSem | Psycholinguistic | as_AACoL_C | AoA of lemmas, Cortese and Khanna norm per Sent | 0.532 | 30 | 0.339 | 32 | 0.649 | 61 | 0.597 | 32 | 0.499 | 58 |
34 | LxSem | Psycholinguistic | as_AABrL_C | lemmas AoA of lemmas, Bristol norm per Sent | 0.532 | 31 | 0.339 | 31 | 0.649 | 62 | 0.597 | 31 | 0.499 | 57 |
34 | LxSem | Psycholinguistic | to_AABrL_C | total lemmas AoA of lemmas, Bristol norm | 0.451 | 56 | 0.134 | 100 | 0.808 | 10 | 0.561 | 38 | 0.637 | 12 |
33 | LxSem | Psycholinguistic | as_AABiL_C | lemmas AoA of lemmas, Bird norm per Sent | 0.459 | 55 | 0.458 | 11 | 0.582 | 73 | 0.653 | 19 | 0.443 | 69 |
33 | Synta | Phrasal | to_NoPhr_C | total # Noun phrases | 0.416 | 76 | 0.148 | 84 | 0.809 | 8 | 0.527 | 52 | 0.659 | 9 |
33 | Synta | Part-of-Speech | to_ContW_C | total # Content words | 0.402 | 81 | 0.163 | 71 | 0.804 | 14 | 0.558 | 40 | 0.654 | 11 |
32 | LxSem | Variation Ratio | CorrNoV_S | Corrected Noun Variation-1 | 0.717 | 3 | 0.086 | 131 | 0.842 | 2 | 0.406 | 78 | 0.612 | 21 |
32 | LxSem | Variation Ratio | CorrVeV_S | Corrected Verb Variation-1 | 0.602 | 11 | 0.058 | 155 | 0.801 | 15 | 0.393 | 86 | 0.737 | 2 |
32 | LxSem | Psycholinguistic | to_AACoL_C | total AoA of lemmas, Cortese and Khanna norm | 0.451 | 57 | 0.134 | 101 | 0.808 | 9 | 0.561 | 39 | 0.637 | 13 |
32 | Synta | Part-of-Speech | as_VeTag_C | # Verb POS tags per Sent | 0.428 | 70 | 0.476 | 6 | 0.578 | 74 | 0.588 | 34 | 0.505 | 55 |
32 | Synta | Tree Structure | to_FTree_C | total length of flattened Trees | 0.396 | 87 | 0.166 | 69 | 0.805 | 12 | 0.538 | 49 | 0.676 | 8 |
31 | LxSem | Variation Ratio | SquaNoV_S | Squared Noun Variation-1 | 0.645 | 9 | 0.124 | 109 | 0.815 | 7 | 0.401 | 84 | 0.583 | 34 |
31 | LxSem | Variation Ratio | CorrAjV_S | Corrected Adjective Variation-1 | 0.591 | 12 | 0.078 | 134 | 0.779 | 21 | 0.422 | 70 | 0.584 | 33 |
31 | Synta | Part-of-Speech | to_AjTag_C | total # Adjective POS tags | 0.441 | 62 | 0.191 | 57 | 0.777 | 23 | 0.504 | 54 | 0.525 | 46 |
30 | LxSem | Variation Ratio | SquaVeV_S | Squared Verb Variation-1 | 0.559 | 17 | 0.076 | 138 | 0.777 | 22 | 0.384 | 90 | 0.716 | 4 |
30 | Synta | Part-of-Speech | to_NoTag_C | total # Noun POS tags | 0.441 | 61 | 0.129 | 107 | 0.805 | 13 | 0.55 | 44 | 0.636 | 15 |
30 | Synta | Phrasal | as_VePhr_C | # Verb phrases per Sent | 0.383 | 90 | 0.455 | 12 | 0.59 | 72 | 0.586 | 35 | 0.505 | 54 |
29 | LxSem | Word Familiarity | as_SbCDL_C | SubtlexUS CDlow value per Sent | 0.432 | 65 | 0.441 | 14 | 0.527 | 82 | 0.623 | 26 | 0.401 | 85 |
28 | Synta | Part-of-Speech | as_AjTag_C | # Adjective POS tags per Sent | 0.506 | 36 | 0.353 | 28 | 0.553 | 76 | 0.533 | 51 | 0.404 | 84 |
28 | Disco | Entity Grid | ra_NNTo_C | ratio of nn transitions to total | 0.476 | 44 | 0.078 | 135 | 0.754 | 35 | 0.451 | 64 | 0.602 | 23 |
28 | Synta | Tree Structure | at_TreeH_C | Tree height per Word | 0.476 | 45 | 0.419 | 22 | 0.416 | 104 | 0.597 | 33 | 0.41 | 81 |
28 | LxSem | Word Familiarity | as_SbCDC_C | SubtlexUS CD# value per Sent | 0.431 | 67 | 0.437 | 17 | 0.525 | 84 | 0.624 | 24 | 0.404 | 82 |
28 | LxSem | Word Familiarity | as_SbSBC_C | SubtlexUS SUBTLCD value per Sent | 0.431 | 68 | 0.437 | 18 | 0.525 | 85 | 0.624 | 25 | 0.404 | 83 |
28 | LxSem | Word Familiarity | to_SbL1C_C | total SubtlexUS Lg10CD value | 0.37 | 93 | 0.14 | 95 | 0.797 | 16 | 0.491 | 56 | 0.621 | 17 |
27 | LxSem | Variation Ratio | SquaAjV_S | Squared Adjective Variation-1 | 0.531 | 32 | 0.141 | 94 | 0.754 | 34 | 0.407 | 77 | 0.573 | 37 |
27 | LxSem | Word Familiarity | as_SbFrL_C | SubtlexUS FREQlow value per Sent | 0.443 | 60 | 0.426 | 21 | 0.52 | 86 | 0.552 | 42 | 0.425 | 77 |
26 | LxSem | Word Familiarity | as_SbSBW_C | SubtlexUS SUBTLWF value per Sent | 0.44 | 63 | 0.441 | 15 | 0.509 | 91 | 0.542 | 48 | 0.425 | 76 |
26 | LxSem | Word Familiarity | as_SbFrQ_C | SubtlexUS FREQ# value per Sent | 0.44 | 64 | 0.441 | 16 | 0.509 | 90 | 0.542 | 47 | 0.425 | 75 |
26 | LxSem | Word Familiarity | to_SbL1W_C | total SubtlexUS Lg10WF value | 0.365 | 99 | 0.144 | 93 | 0.795 | 17 | 0.477 | 58 | 0.611 | 22 |
25 | LxSem | Psycholinguistic | to_AABiL_C | total lemmas AoA of lemmas, Bird norm | 0.365 | 98 | 0.155 | 79 | 0.786 | 18 | 0.473 | 59 | 0.565 | 39 |
25 | LxSem | Word Familiarity | to_SbFrL_C | total SubtlexUS FREQlow value | 0.348 | 109 | 0.201 | 51 | 0.774 | 24 | 0.414 | 74 | 0.555 | 40 |
24 | LxSem | Word Familiarity | to_SbFrQ_C | total SubtlexUS FREQ# value | 0.34 | 116 | 0.206 | 48 | 0.77 | 26 | 0.403 | 82 | 0.551 | 41 |
24 | LxSem | Word Familiarity | to_SbSBW_C | total SubtlexUS SUBTLWF value | 0.34 | 115 | 0.206 | 47 | 0.77 | 27 | 0.403 | 81 | 0.551 | 42 |
23 | ShaTr | Shallow | at_Sylla_C | # syllables per Word | 0.66 | 7 | 0.106 | 120 | 0.627 | 64 | 0.505 | 53 | 0.37 | 91 |
23 | Synta | Phrasal | to_SuPhr_C | total # Subordinate Clauses | 0.367 | 96 | 0.202 | 50 | 0.721 | 43 | 0.462 | 61 | 0.419 | 78 |
23 | Synta | Phrasal | to_VePhr_C | total # Verb phrases | 0.324 | 127 | 0.169 | 68 | 0.76 | 31 | 0.416 | 72 | 0.57 | 38 |
22 | AdSem | Wiki Knowledge | WTopc15_S | Number of topics, 150 topics extracted from Wiki | 0.58 | 15 | 0.007 | 227 | 0.645 | 63 | 0.605 | 30 | 0.191 | 122 |
22 | LxSem | Variation Ratio | CorrAvV_S | Corrected AdVerb Variation-1 | 0.542 | 23 | 0.059 | 154 | 0.71 | 44 | 0.333 | 99 | 0.474 | 63 |
22 | ShaTr | Shallow | at_Chara_C | # characters per Word | 0.443 | 59 | 0.2 | 52 | 0.619 | 67 | 0.402 | 83 | 0.443 | 68 |
22 | Synta | Part-of-Speech | to_CoTag_C | total # Coordinating Conjunction POS tags | 0.364 | 101 | 0.268 | 43 | 0.728 | 39 | 0.406 | 80 | 0.434 | 72 |
22 | Synta | Part-of-Speech | to_FuncW_C | total # Function words | 0.33 | 126 | 0.159 | 77 | 0.773 | 25 | 0.385 | 89 | 0.636 | 14 |
22 | Synta | Part-of-Speech | to_VeTag_C | total # Verb POS tags | 0.288 | 138 | 0.173 | 63 | 0.738 | 38 | 0.383 | 91 | 0.597 | 27 |
21 | AdSem | Wiki Knowledge | WTopc20_S | Number of topics, 200 topics extracted from Wiki | 0.584 | 14 | 0.015 | 214 | 0.616 | 68 | 0.617 | 27 | 0.137 | 138 |
20 | LxSem | Variation Ratio | SquaAvV_S | Squared AdVerb Variation-1 | 0.515 | 34 | 0.093 | 128 | 0.686 | 52 | 0.326 | 102 | 0.46 | 65 |
19 | Synta | Phrasal | as_SuPhr_C | # Subordinate Clauses per Sent | 0.387 | 89 | 0.357 | 26 | 0.532 | 80 | 0.495 | 55 | 0.265 | 112 |
19 | LxSem | Word Familiarity | to_SbCDL_C | total SubtlexUS CDlow value | 0.348 | 107 | 0.148 | 87 | 0.764 | 30 | 0.394 | 85 | 0.513 | 53 |
18 | LxSem | Type Token Ratio | UberTTR_S | Uber Index | 0.646 | 8 | 0.041 | 174 | 0.369 | 112 | 0.109 | 173 | 0.599 | 26 |
18 | AdSem | Wiki Knowledge | WTopc10_S | Number of topics, 100 topics extracted from Wiki | 0.52 | 33 | 0.004 | 229 | 0.532 | 79 | 0.552 | 43 | 0.075 | 180 |
18 | AdSem | Wiki Knowledge | WNois20_S | Semantic Noise, 200 topics extracted from Wiki | 0.492 | 41 | 0.032 | 190 | 0.566 | 75 | 0.572 | 37 | 0.025 | 221 |
18 | Synta | Part-of-Speech | to_SuTag_C | total # Subordinating Conjunction POS tags | 0.4 | 83 | 0.193 | 56 | 0.691 | 48 | 0.406 | 79 | 0.299 | 106 |
18 | LxSem | Word Familiarity | to_SbSBC_C | total SubtlexUS SUBTLCD value | 0.347 | 111 | 0.146 | 91 | 0.764 | 28 | 0.392 | 88 | 0.515 | 52 |
18 | LxSem | Word Familiarity | to_SbCDC_C | total SubtlexUS CD# value | 0.347 | 110 | 0.146 | 90 | 0.764 | 29 | 0.392 | 87 | 0.515 | 51 |
18 | Synta | Part-of-Speech | to_AvTag_C | total # Adverb POS tags | 0.342 | 114 | 0.17 | 67 | 0.726 | 40 | 0.352 | 96 | 0.469 | 64 |
17 | AdSem | Wiki Knowledge | WTopc05_S | Number of topics, 50 topics extracted from Wiki | 0.549 | 22 | 0.033 | 186 | 0.514 | 89 | 0.533 | 50 | 0.042 | 203 |
17 | Synta | Part-of-Speech | as_AvTag_C | # Adverb POS tags per Sent | 0.32 | 129 | 0.292 | 41 | 0.526 | 83 | 0.43 | 67 | 0.415 | 79 |
16 | LxSem | Type Token Ratio | BiLoTTR_S | Bi-Logarithmic TTR | 0.591 | 13 | 0.062 | 149 | 0.07 | 200 | 0.001 | 229 | 0.523 | 47 |
16 | AdSem | Wiki Knowledge | WRich15_S | Semantic Richness, 150 topics extracted from Wiki | 0.495 | 39 | 0.02 | 208 | 0.48 | 95 | 0.549 | 45 | 0.037 | 209 |
16 | Synta | Part-of-Speech | as_CoTag_C | # Coordinating Conjunction POS tags per Sent | 0.38 | 91 | 0.411 | 24 | 0.463 | 97 | 0.442 | 66 | 0.293 | 107 |
15 | Synta | Phrasal | to_AvPhr_C | total # Adverb phrases | 0.356 | 105 | 0.17 | 66 | 0.705 | 45 | 0.298 | 111 | 0.432 | 73 |
15 | ShaTr | Shallow | TokSenL_S | log(total # tokens)/log(total # sentence) | 0.293 | 137 | 0.352 | 29 | 0.297 | 130 | 0.544 | 46 | 0.198 | 121 |
14 | AdSem | Wiki Knowledge | WRich20_S | Semantic Richness, 200 topics extracted from Wiki | 0.465 | 50 | 0.029 | 195 | 0.446 | 102 | 0.556 | 41 | 0.027 | 219 |
13 | Synta | Phrasal | at_PrPhr_C | # prepositional phrases per Word | 0.57 | 16 | 0.133 | 103 | 0.316 | 124 | 0.323 | 105 | 0.366 | 92 |
13 | Synta | Phrasal | ra_NoPrP_C | ratio of Noun phrases # to Prep phrases # | 0.477 | 43 | 0.149 | 83 | 0.34 | 120 | 0.345 | 97 | 0.389 | 87 |
13 | Disco | Entity Grid | ra_SNTo_C | ratio of sn transitions to total | 0.448 | 58 | 0.019 | 210 | 0.514 | 88 | 0.196 | 133 | 0.518 | 49 |
13 | LxSem | Word Familiarity | at_SbL1C_C | SubtlexUS Lg10CD value per Word | 0.408 | 78 | 0.161 | 75 | 0.541 | 78 | 0.204 | 130 | 0.392 | 86 |
13 | Synta | Part-of-Speech | as_SuTag_C | # Subordinating Conjunction POS tags per Sent | 0.366 | 97 | 0.295 | 39 | 0.407 | 105 | 0.427 | 68 | 0.151 | 131 |
13 | ShaTr | Shallow | TokSenS_S | sqrt(total # tokens x total # sentence) | 0.241 | 154 | 0.064 | 147 | 0.758 | 32 | 0.249 | 121 | 0.498 | 59 |
13 | Synta | Tree Structure | to_TreeH_C | total Tree height of all sentences | 0.27 | 145 | 0.069 | 143 | 0.755 | 33 | 0.309 | 108 | 0.515 | 50 |
13 | Synta | Phrasal | as_AvPhr_C | # Adverb phrases per Sent | 0.244 | 152 | 0.328 | 34 | 0.427 | 103 | 0.38 | 92 | 0.356 | 93 |
12 | Disco | Entity Grid | ra_NSTo_C | ratio of ns transitions to total | 0.426 | 73 | 0.033 | 187 | 0.516 | 87 | 0.266 | 117 | 0.505 | 56 |
12 | Synta | Phrasal | to_AjPhr_C | total # Adjective phrases | 0.339 | 120 | 0.182 | 62 | 0.682 | 53 | 0.327 | 101 | 0.271 | 111 |
11 | AdSem | Wiki Knowledge | WNois05_S | Semantic Noise, 50 topics extracted from Wiki | 0.462 | 53 | 0.061 | 150 | 0.455 | 100 | 0.412 | 75 | 0.118 | 151 |
11 | Synta | Phrasal | ra_PrNoP_C | ratio of Prep phrases # to Noun phrases # | 0.421 | 75 | 0.162 | 74 | 0.276 | 135 | 0.344 | 98 | 0.37 | 90 |
11 | ShaTr | Shallow | TokSenM_S | total # tokens x total # sentence | 0.189 | 173 | 0.112 | 116 | 0.674 | 55 | 0.177 | 140 | 0.486 | 60 |
10 | Synta | Phrasal | ra_VeNoP_C | ratio of Verb phrases # to Noun phrases # | 0.46 | 54 | 0.164 | 70 | 0.124 | 174 | 0.041 | 209 | 0.027 | 220 |
10 | Disco | Entity Density | at_UEnti_C | number of unique Entities per Word | 0.127 | 197 | 0.307 | 37 | 0.548 | 77 | 0.253 | 119 | 0.124 | 149 |
9 | LxSem | Variation Ratio | SimpNoV_S | Noun Variation-1 | 0.499 | 38 | 0.087 | 130 | 0.038 | 212 | 0.031 | 213 | 0.337 | 95 |
9 | Synta | Part-of-Speech | at_VeTag_C | # Verb POS tags per Word | 0.431 | 69 | 0.187 | 61 | 0.076 | 196 | 0.111 | 171 | 0.011 | 224 |
9 | LxSem | Word Familiarity | at_SbL1W_C | SubtlexUS Lg10WF value per Word | 0.399 | 84 | 0.089 | 129 | 0.531 | 81 | 0.24 | 123 | 0.412 | 80 |
9 | Synta | Part-of-Speech | ra_VeNoT_C | ratio of Verb POS # to Noun POS # | 0.397 | 86 | 0.198 | 53 | 0.234 | 142 | 0.171 | 142 | 0.067 | 186 |
9 | LxSem | Word Familiarity | at_SbSBC_C | SubtlexUS SUBTLCD value per Word | 0.37 | 94 | 0.032 | 192 | 0.492 | 93 | 0.324 | 103 | 0.435 | 71 |
9 | LxSem | Word Familiarity | at_SbCDC_C | SubtlexUS CD# value per Word | 0.37 | 95 | 0.032 | 191 | 0.492 | 94 | 0.324 | 104 | 0.435 | 70 |
9 | Synta | Phrasal | as_AjPhr_C | # Adjective phrases per Sent | 0.323 | 128 | 0.239 | 46 | 0.387 | 106 | 0.357 | 95 | 0.157 | 127 |
9 | AdSem | WB Knowledge | BClar15_S | Semantic Clarity, 150 topics extracted from WeeBit | 0.025 | 221 | 0.161 | 76 | 0.38 | 108 | 0.481 | 57 | 0.315 | 100 |
8 | AdSem | Wiki Knowledge | WNois15_S | Semantic Noise, 150 topics extracted from Wiki | 0.388 | 88 | 0.033 | 188 | 0.454 | 101 | 0.454 | 63 | 0.006 | 226 |
8 | Disco | Entity Density | at_EntiM_C | number of Entities Mentions #s per Word | 0.17 | 180 | 0.204 | 49 | 0.501 | 92 | 0.292 | 112 | 0.127 | 146 |
8 | AdSem | WB Knowledge | BClar20_S | Semantic Clarity, 200 topics extracted from WeeBit | 0.004 | 227 | 0.147 | 88 | 0.3 | 129 | 0.462 | 60 | 0.308 | 104 |
7 | Synta | Phrasal | ra_PrVeP_C | ratio of Prep phrases # to Verb phrases # | 0.485 | 42 | 0.055 | 157 | 0.184 | 158 | 0.189 | 136 | 0.219 | 117 |
7 | LxSem | Word Familiarity | at_SbCDL_C | SubtlexUS CDlow value per Word | 0.362 | 102 | 0.047 | 166 | 0.474 | 96 | 0.31 | 107 | 0.431 | 74 |
7 | Synta | Part-of-Speech | ra_CoNoT_C | ratio of Coordinating Conjunction POS # to Noun POS # | 0.02 | 224 | 0.277 | 42 | 0.159 | 163 | 0.013 | 222 | 0.132 | 142 |
7 | Synta | Part-of-Speech | at_CoTag_C | # Coordinating Conjunction POS tags per Word | 0.218 | 161 | 0.267 | 44 | 0.02 | 220 | 0.111 | 172 | 0.087 | 169 |
7 | Synta | Part-of-Speech | ra_NoCoT_C | ratio of Noun POS # to Coordinating Conjunction # | 0.022 | 222 | 0.254 | 45 | 0.019 | 221 | 0.053 | 201 | 0.109 | 157 |
6 | Synta | Phrasal | ra_VePrP_C | ratio of Verb phrases # to Prep phrases # | 0.475 | 46 | 0.018 | 211 | 0.301 | 127 | 0.255 | 118 | 0.249 | 114 |
6 | Disco | Entity Grid | ra_XNTo_C | ratio of xn transitions to total | 0.339 | 119 | 0.103 | 124 | 0.658 | 59 | 0.327 | 100 | 0.29 | 108 |
6 | AdSem | WB Knowledge | BTopc15_S | Number of topics, 150 topics extracted from WeeBit | 0.133 | 193 | 0.146 | 92 | 0.209 | 151 | 0.416 | 73 | 0.03 | 217 |
6 | LxSem | Word Familiarity | at_SbSBW_C | SubtlexUS SUBTLWF value per Word | 0.181 | 175 | 0.196 | 54 | 0.095 | 184 | 0.021 | 220 | 0.109 | 156 |
6 | LxSem | Word Familiarity | at_SbFrQ_C | SubtlexUS FREQ# value per Word | 0.181 | 174 | 0.196 | 55 | 0.095 | 183 | 0.021 | 219 | 0.109 | 155 |
5 | Synta | Part-of-Speech | ra_NoVeT_C | ratio of Noun POS # to Verb POS # | 0.432 | 66 | 0.118 | 111 | 0.149 | 168 | 0.112 | 170 | 0.051 | 197 |
5 | AdSem | Wiki Knowledge | WRich10_S | Semantic Richness, 100 topics extracted from Wiki | 0.364 | 100 | 0.002 | 232 | 0.33 | 123 | 0.411 | 76 | 0.041 | 206 |
5 | Disco | Entity Grid | ra_NXTo_C | ratio of nx transitions to total | 0.339 | 118 | 0.097 | 127 | 0.62 | 66 | 0.28 | 116 | 0.278 | 110 |
5 | Synta | Part-of-Speech | at_FuncW_C | # Function words per Word | 0.28 | 142 | 0.04 | 175 | 0.181 | 159 | 0.461 | 62 | 0.032 | 215 |
5 | AdSem | WB Knowledge | BTopc20_S | Number of topics, 200 topics extracted from WeeBit | 0.25 | 150 | 0.135 | 99 | 0.025 | 215 | 0.418 | 71 | 0.044 | 198 |
5 | LxSem | Variation Ratio | SimpVeV_S | Verb Variation-1 | 0.286 | 139 | 0.048 | 165 | 0.081 | 193 | 0.003 | 226 | 0.48 | 62 |
5 | Synta | Part-of-Speech | ra_VeCoT_C | ratio of Verb POS # to Coordinating Conjunction # | 0.192 | 172 | 0.172 | 64 | 0.134 | 171 | 0.022 | 218 | 0.054 | 194 |
5 | LxSem | Word Familiarity | at_SbFrL_C | SubtlexUS FREQlow value per Word | 0.176 | 178 | 0.171 | 65 | 0.061 | 203 | 0.001 | 228 | 0.09 | 165 |
4 | Synta | Phrasal | at_NoPhr_C | # Noun phrases per Word | 0.424 | 74 | 0.066 | 146 | 0.089 | 188 | 0.005 | 224 | 0.042 | 202 |
4 | LxSem | Type Token Ratio | SimpTTR_S | unique tokens/total tokens (TTR) | 0.375 | 92 | 0.025 | 200 | 0.367 | 113 | 0.163 | 147 | 0.344 | 94 |
4 | AdSem | Wiki Knowledge | WNois10_S | Semantic Noise, 100 topics extracted from Wiki | 0.34 | 117 | 0.021 | 207 | 0.376 | 109 | 0.426 | 69 | 0.03 | 216 |
4 | Synta | Phrasal | at_SuPhr_C | # Subordinate Clauses per Word | 0.204 | 165 | 0.157 | 78 | 0.246 | 140 | 0.314 | 106 | 0.073 | 182 |
4 | Synta | Phrasal | ra_SuNoP_C | ratio of Subordinate Clauses # to Noun phrases # | 0.081 | 203 | 0.163 | 72 | 0.224 | 146 | 0.307 | 109 | 0.086 | 170 |
4 | AdSem | WB Knowledge | BNois15_S | Semantic Noise, 150 topics extracted from WeeBit | 0.035 | 214 | 0.162 | 73 | 0.341 | 119 | 0.221 | 127 | 0.091 | 164 |
3 | Synta | Part-of-Speech | ra_AjVeT_C | ratio of Adjective POS # to Verb POS # | 0.411 | 77 | 0.034 | 185 | 0.133 | 172 | 0.156 | 150 | 0.005 | 227 |
3 | Synta | Phrasal | ra_NoVeP_C | ratio of Noun phrases # to Verb phrases # | 0.406 | 79 | 0.068 | 145 | 0.069 | 201 | 0.031 | 212 | 0.019 | 223 |
3 | AdSem | Wiki Knowledge | WRich05_S | Semantic Richness, 50 topics extracted from Wiki | 0.405 | 80 | 0.063 | 148 | 0.347 | 117 | 0.301 | 110 | 0.035 | 211 |
3 | Synta | Phrasal | ra_AvPrP_C | ratio of Adv phrases # to Prep phrases # | 0.4 | 82 | 0.014 | 217 | 0.222 | 147 | 0.196 | 135 | 0.115 | 152 |
3 | LxSem | Variation Ratio | SimpAjV_S | Adjective Variation-1 | 0.398 | 85 | 0.109 | 118 | 0.279 | 134 | 0.073 | 192 | 0.201 | 120 |
3 | Synta | Phrasal | ra_NoSuP_C | ratio of Noun phrases # to Subordinate Clauses # | 0.157 | 185 | 0.153 | 80 | 0.228 | 145 | 0.052 | 205 | 0.04 | 207 |
3 | Synta | Part-of-Speech | ra_NoAjT_C | ratio of Noun POS # to Adjective POS # | 0.121 | 199 | 0.152 | 81 | 0.125 | 173 | 0.114 | 169 | 0.004 | 228 |
3 | Synta | Part-of-Speech | ra_SuNoT_C | ratio of Subordinating Conjunction POS # to Noun POS # | 0.085 | 202 | 0.149 | 82 | 0.039 | 211 | 0.155 | 151 | 0.158 | 126 |
3 | AdSem | WB Knowledge | BNois20_S | Semantic Noise, 200 topics extracted from WeeBit | 0.129 | 196 | 0.148 | 85 | 0.202 | 153 | 0.167 | 144 | 0.032 | 214 |
2 | Synta | Phrasal | ra_VeSuP_C | ratio of Verb phrases # to Subordinate Clauses # | 0.349 | 106 | 0.137 | 98 | 0.307 | 126 | 0.127 | 167 | 0.043 | 200 |
2 | Synta | Phrasal | ra_SuVeP_C | ratio of Subordinate Clauses # to Verb phrases # | 0.345 | 113 | 0.052 | 160 | 0.343 | 118 | 0.376 | 93 | 0.083 | 172 |
2 | Synta | Part-of-Speech | ra_CoFuW_C | ratio of Content words to Function words | 0.284 | 141 | 0.023 | 203 | 0.2 | 154 | 0.376 | 94 | 0.042 | 201 |
2 | Disco | Entity Grid | ra_ONTo_C | ratio of on transitions to total | 0.333 | 123 | 0.04 | 178 | 0.288 | 133 | 0.06 | 199 | 0.383 | 88 |
2 | Disco | Entity Grid | ra_NOTo_C | ratio of no transitions to total | 0.348 | 108 | 0.022 | 204 | 0.383 | 107 | 0.056 | 200 | 0.378 | 89 |
2 | AdSem | WB Knowledge | BRich10_S | Semantic Richness, 100 topics extracted from WeeBit | 0.196 | 170 | 0.044 | 171 | 0.369 | 111 | 0.035 | 210 | 0.336 | 96 |
2 | Disco | Entity Density | to_UEnti_C | total number of unique Entities | 0.308 | 134 | 0.132 | 105 | 0.3 | 128 | 0.023 | 216 | 0.31 | 102 |
2 | Synta | Part-of-Speech | ra_AjCoT_C | ratio of Adjective POS # to Coordinating Conjunction # | 0.0 | 229 | 0.148 | 86 | 0.049 | 207 | 0.091 | 181 | 0.077 | 177 |
2 | Synta | Part-of-Speech | ra_AjNoT_C | ratio of Adjective POS # to Noun POS # | 0.074 | 205 | 0.146 | 89 | 0.031 | 213 | 0.068 | 195 | 0.041 | 205 |
1 | Synta | Part-of-Speech | ra_SuVeT_C | ratio of Subordinating Conjunction POS # to Verb POS # | 0.36 | 103 | 0.053 | 159 | 0.109 | 177 | 0.282 | 115 | 0.137 | 139 |
1 | Synta | Part-of-Speech | ra_AjAvT_C | ratio of Adjective POS # to Adverb POS # | 0.357 | 104 | 0.042 | 172 | 0.056 | 204 | 0.091 | 180 | 0.044 | 199 |
1 | LxSem | Psycholinguistic | at_AABrL_C | lemmas AoA of lemmas, Bristol norm per Word | 0.333 | 124 | 0.029 | 194 | 0.462 | 98 | 0.284 | 113 | 0.217 | 118 |
1 | LxSem | Psycholinguistic | at_AACoL_C | AoA of lemmas, Cortese and Khanna norm per Word | 0.333 | 125 | 0.029 | 193 | 0.462 | 99 | 0.284 | 114 | 0.217 | 119 |
1 | AdSem | WB Knowledge | BNois10_S | Semantic Noise, 100 topics extracted from WeeBit | 0.193 | 171 | 0.036 | 180 | 0.37 | 110 | 0.161 | 149 | 0.33 | 97 |
1 | AdSem | WB Knowledge | BNois05_S | Semantic Noise, 50 topics extracted from WeeBit | 0.158 | 184 | 0.011 | 219 | 0.351 | 116 | 0.15 | 153 | 0.325 | 98 |
1 | AdSem | WB Knowledge | BTopc10_S | Number of topics, 100 topics extracted from WeeBit | 0.197 | 169 | 0.038 | 179 | 0.364 | 114 | 0.166 | 145 | 0.323 | 99 |
1 | Disco | Entity Density | to_EntiM_C | total number of Entities Mentions #s | 0.139 | 191 | 0.02 | 209 | 0.335 | 122 | 0.0 | 230 | 0.312 | 101 |
1 | AdSem | WB Knowledge | BRich05_S | Semantic Richness, 50 topics extracted from WeeBit | 0.126 | 198 | 0.051 | 162 | 0.24 | 141 | 0.051 | 207 | 0.309 | 103 |
1 | LxSem | Psycholinguistic | at_AABiL_C | lemmas AoA of lemmas, Bird norm per Word | 0.203 | 166 | 0.11 | 117 | 0.266 | 138 | 0.053 | 202 | 0.302 | 105 |
1 | Synta | Tree Structure | at_FTree_C | length of flattened Trees per Word | 0.28 | 143 | 0.14 | 96 | 0.097 | 182 | 0.1 | 177 | 0.152 | 130 |
1 | Synta | Phrasal | at_VePhr_C | # Verb phrases per Word | 0.31 | 132 | 0.138 | 97 | 0.079 | 194 | 0.032 | 211 | 0.009 | 225 |
1 | Synta | Part-of-Speech | ra_NoAvT_C | ratio of Noun POS # to Adverb POS # | 0.261 | 147 | 0.133 | 102 | 0.101 | 180 | 0.052 | 204 | 0.034 | 212 |
1 | Synta | Part-of-Speech | ra_CoVeT_C | ratio of Coordinating Conjunction POS # to Verb POS # | 0.302 | 135 | 0.133 | 104 | 0.023 | 218 | 0.133 | 164 | 0.088 | 168 |
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | LxSem | Psycholinguistic | as_AAKuL_C | lemmas AoA of lemmas per Sent | 0.54 | 25 | 0.505 | 1 | 0.722 | 42 | 0.711 | 4 | 0.601 | 25 |
35 | LxSem | Psycholinguistic | as_AAKuW_C | AoA of words per Sent | 0.537 | 28 | 0.502 | 2 | 0.722 | 41 | 0.711 | 6 | 0.602 | 24 |
33 | ShaTr | Shallow | as_Chara_C | # characters per Sent | 0.539 | 27 | 0.487 | 4 | 0.696 | 46 | 0.711 | 5 | 0.613 | 20 |
33 | Synta | Tree Structure | as_FTree_C | length of flattened Trees per Sent | 0.505 | 37 | 0.485 | 5 | 0.677 | 54 | 0.719 | 2 | 0.622 | 16 |
32 | LxSem | Psycholinguistic | at_AAKuL_C | lemmas AoA of lemmas per Word | 0.723 | 2 | 0.323 | 35 | 0.785 | 19 | 0.65 | 20 | 0.453 | 67 |
32 | LxSem | Psycholinguistic | at_AAKuW_C | AoA of words per Word | 0.703 | 5 | 0.308 | 36 | 0.784 | 20 | 0.643 | 21 | 0.455 | 66 |
31 | Synta | Phrasal | as_NoPhr_C | # Noun phrases per Sent | 0.55 | 20 | 0.406 | 25 | 0.66 | 58 | 0.673 | 18 | 0.582 | 35 |
31 | ShaTr | Shallow | as_Sylla_C | # syllables per Sent | 0.541 | 24 | 0.461 | 10 | 0.686 | 50 | 0.697 | 11 | 0.59 | 31 |
31 | Synta | Part-of-Speech | as_ContW_C | # Content words per Sent | 0.534 | 29 | 0.453 | 13 | 0.667 | 56 | 0.688 | 14 | 0.544 | 43 |
31 | Synta | Phrasal | as_PrPhr_C | # prepositional phrases per Sent | 0.513 | 35 | 0.417 | 23 | 0.607 | 70 | 0.608 | 28 | 0.59 | 32 |
31 | ShaTr | Shallow | as_Token_C | # tokens per Sent | 0.494 | 40 | 0.464 | 9 | 0.65 | 60 | 0.709 | 7 | 0.58 | 36 |
31 | Synta | Part-of-Speech | as_FuncW_C | # Function words per Sent | 0.468 | 48 | 0.471 | 8 | 0.662 | 57 | 0.673 | 17 | 0.614 | 19 |
31 | LxSem | Psycholinguistic | to_AAKuL_C | total lemmas AoA of lemmas | 0.428 | 71 | 0.189 | 59 | 0.835 | 3 | 0.627 | 22 | 0.716 | 5 |
31 | LxSem | Psycholinguistic | to_AAKuW_C | total AoA (Age of Acquisition) of words | 0.427 | 72 | 0.189 | 60 | 0.835 | 4 | 0.625 | 23 | 0.715 | 6 |
30 | LxSem | Type Token Ratio | CorrTTR_S | Corrected TTR | 0.745 | 1 | 0.006 | 228 | 0.846 | 1 | 0.445 | 65 | 0.692 | 7 |
30 | LxSem | Variation Ratio | CorrNoV_S | Corrected Noun Variation-1 | 0.717 | 3 | 0.086 | 131 | 0.842 | 2 | 0.406 | 78 | 0.612 | 21 |
30 | Synta | Tree Structure | as_TreeH_C | Tree height per Sent | 0.55 | 21 | 0.341 | 30 | 0.686 | 51 | 0.699 | 9 | 0.541 | 44 |
30 | Synta | Phrasal | to_PrPhr_C | total # prepositional phrases | 0.47 | 47 | 0.189 | 58 | 0.808 | 11 | 0.58 | 36 | 0.729 | 3 |
30 | LxSem | Word Familiarity | as_SbL1C_C | SubtlexUS Lg10CD value per Sent | 0.467 | 49 | 0.43 | 20 | 0.612 | 69 | 0.699 | 10 | 0.533 | 45 |
30 | LxSem | Word Familiarity | as_SbL1W_C | SubtlexUS Lg10WF value per Sent | 0.462 | 52 | 0.437 | 19 | 0.605 | 71 | 0.693 | 12 | 0.523 | 48 |
29 | LxSem | Variation Ratio | SquaNoV_S | Squared Noun Variation-1 | 0.645 | 9 | 0.124 | 109 | 0.815 | 7 | 0.401 | 84 | 0.583 | 34 |
29 | LxSem | Variation Ratio | CorrVeV_S | Corrected Verb Variation-1 | 0.602 | 11 | 0.058 | 155 | 0.801 | 15 | 0.393 | 86 | 0.737 | 2 |
29 | Synta | Part-of-Speech | as_NoTag_C | # Noun POS tags per Sent | 0.551 | 19 | 0.304 | 38 | 0.624 | 65 | 0.608 | 29 | 0.48 | 61 |
29 | LxSem | Psycholinguistic | to_AABrL_C | total lemmas AoA of lemmas, Bristol norm | 0.451 | 56 | 0.134 | 100 | 0.808 | 10 | 0.561 | 38 | 0.637 | 12 |
29 | LxSem | Psycholinguistic | to_AACoL_C | total AoA of lemmas, Cortese and Khanna norm | 0.451 | 57 | 0.134 | 101 | 0.808 | 9 | 0.561 | 39 | 0.637 | 13 |
29 | Synta | Part-of-Speech | to_NoTag_C | total # Noun POS tags | 0.441 | 61 | 0.129 | 107 | 0.805 | 13 | 0.55 | 44 | 0.636 | 15 |
29 | Synta | Phrasal | to_NoPhr_C | total # Noun phrases | 0.416 | 76 | 0.148 | 84 | 0.809 | 8 | 0.527 | 52 | 0.659 | 9 |
29 | Synta | Part-of-Speech | to_ContW_C | total # Content words | 0.402 | 81 | 0.163 | 71 | 0.804 | 14 | 0.558 | 40 | 0.654 | 11 |
28 | LxSem | Psycholinguistic | as_AACoL_C | AoA of lemmas, Cortese and Khanna norm per Sent | 0.532 | 30 | 0.339 | 32 | 0.649 | 61 | 0.597 | 32 | 0.499 | 58 |
28 | LxSem | Psycholinguistic | as_AABrL_C | lemmas AoA of lemmas, Bristol norm per Sent | 0.532 | 31 | 0.339 | 31 | 0.649 | 62 | 0.597 | 31 | 0.499 | 57 |
28 | LxSem | Psycholinguistic | as_AABiL_C | lemmas AoA of lemmas, Bird norm per Sent | 0.459 | 55 | 0.458 | 11 | 0.582 | 73 | 0.653 | 19 | 0.443 | 69 |
28 | LxSem | Word Familiarity | as_SbCDL_C | SubtlexUS CDlow value per Sent | 0.432 | 65 | 0.441 | 14 | 0.527 | 82 | 0.623 | 26 | 0.401 | 85 |
28 | LxSem | Word Familiarity | as_SbCDC_C | SubtlexUS CD# value per Sent | 0.431 | 67 | 0.437 | 17 | 0.525 | 84 | 0.624 | 24 | 0.404 | 82 |
28 | LxSem | Word Familiarity | as_SbSBC_C | SubtlexUS SUBTLCD value per Sent | 0.431 | 68 | 0.437 | 18 | 0.525 | 85 | 0.624 | 25 | 0.404 | 83 |
28 | Synta | Part-of-Speech | as_VeTag_C | # Verb POS tags per Sent | 0.428 | 70 | 0.476 | 6 | 0.578 | 74 | 0.588 | 34 | 0.505 | 55 |
28 | Synta | Tree Structure | to_FTree_C | total length of flattened Trees | 0.396 | 87 | 0.166 | 69 | 0.805 | 12 | 0.538 | 49 | 0.676 | 8 |
27 | LxSem | Variation Ratio | SquaVeV_S | Squared Verb Variation-1 | 0.559 | 17 | 0.076 | 138 | 0.777 | 22 | 0.384 | 90 | 0.716 | 4 |
27 | LxSem | Variation Ratio | SquaAjV_S | Squared Adjective Variation-1 | 0.531 | 32 | 0.141 | 94 | 0.754 | 34 | 0.407 | 77 | 0.573 | 37 |
27 | Synta | Part-of-Speech | as_AjTag_C | # Adjective POS tags per Sent | 0.506 | 36 | 0.353 | 28 | 0.553 | 76 | 0.533 | 51 | 0.404 | 84 |
27 | LxSem | Word Familiarity | as_SbFrL_C | SubtlexUS FREQlow value per Sent | 0.443 | 60 | 0.426 | 21 | 0.52 | 86 | 0.552 | 42 | 0.425 | 77 |
27 | Synta | Part-of-Speech | to_AjTag_C | total # Adjective POS tags | 0.441 | 62 | 0.191 | 57 | 0.777 | 23 | 0.504 | 54 | 0.525 | 46 |
27 | LxSem | Word Familiarity | as_SbSBW_C | SubtlexUS SUBTLWF value per Sent | 0.44 | 63 | 0.441 | 15 | 0.509 | 91 | 0.542 | 48 | 0.425 | 76 |
27 | LxSem | Word Familiarity | as_SbFrQ_C | SubtlexUS FREQ# value per Sent | 0.44 | 64 | 0.441 | 16 | 0.509 | 90 | 0.542 | 47 | 0.425 | 75 |
27 | Synta | Phrasal | as_VePhr_C | # Verb phrases per Sent | 0.383 | 90 | 0.455 | 12 | 0.59 | 72 | 0.586 | 35 | 0.505 | 54 |
26 | ShaTr | Shallow | at_Sylla_C | # syllables per Word | 0.66 | 7 | 0.106 | 120 | 0.627 | 64 | 0.505 | 53 | 0.37 | 91 |
26 | LxSem | Variation Ratio | CorrAjV_S | Corrected Adjective Variation-1 | 0.591 | 12 | 0.078 | 134 | 0.779 | 21 | 0.422 | 70 | 0.584 | 33 |
26 | Disco | Entity Grid | ra_NNTo_C | ratio of nn transitions to total | 0.476 | 44 | 0.078 | 135 | 0.754 | 35 | 0.451 | 64 | 0.602 | 23 |
26 | Synta | Tree Structure | at_TreeH_C | Tree height per Word | 0.476 | 45 | 0.419 | 22 | 0.416 | 104 | 0.597 | 33 | 0.41 | 81 |
26 | LxSem | Word Familiarity | to_SbL1C_C | total SubtlexUS Lg10CD value | 0.37 | 93 | 0.14 | 95 | 0.797 | 16 | 0.491 | 56 | 0.621 | 17 |
26 | LxSem | Word Familiarity | to_SbL1W_C | total SubtlexUS Lg10WF value | 0.365 | 99 | 0.144 | 93 | 0.795 | 17 | 0.477 | 58 | 0.611 | 22 |
26 | LxSem | Word Familiarity | to_SbFrL_C | total SubtlexUS FREQlow value | 0.348 | 109 | 0.201 | 51 | 0.774 | 24 | 0.414 | 74 | 0.555 | 40 |
26 | LxSem | Word Familiarity | to_SbSBW_C | total SubtlexUS SUBTLWF value | 0.34 | 115 | 0.206 | 47 | 0.77 | 27 | 0.403 | 81 | 0.551 | 42 |
26 | LxSem | Word Familiarity | to_SbFrQ_C | total SubtlexUS FREQ# value | 0.34 | 116 | 0.206 | 48 | 0.77 | 26 | 0.403 | 82 | 0.551 | 41 |
25 | ShaTr | Shallow | at_Chara_C | # characters per Word | 0.443 | 59 | 0.2 | 52 | 0.619 | 67 | 0.402 | 83 | 0.443 | 68 |
25 | Synta | Phrasal | to_SuPhr_C | total # Subordinate Clauses | 0.367 | 96 | 0.202 | 50 | 0.721 | 43 | 0.462 | 61 | 0.419 | 78 |
25 | LxSem | Psycholinguistic | to_AABiL_C | total lemmas AoA of lemmas, Bird norm | 0.365 | 98 | 0.155 | 79 | 0.786 | 18 | 0.473 | 59 | 0.565 | 39 |
25 | Synta | Part-of-Speech | to_CoTag_C | total # Coordinating Conjunction POS tags | 0.364 | 101 | 0.268 | 43 | 0.728 | 39 | 0.406 | 80 | 0.434 | 72 |
25 | Synta | Part-of-Speech | to_FuncW_C | total # Function words | 0.33 | 126 | 0.159 | 77 | 0.773 | 25 | 0.385 | 89 | 0.636 | 14 |
25 | Synta | Phrasal | to_VePhr_C | total # Verb phrases | 0.324 | 127 | 0.169 | 68 | 0.76 | 31 | 0.416 | 72 | 0.57 | 38 |
24 | LxSem | Variation Ratio | CorrAvV_S | Corrected AdVerb Variation-1 | 0.542 | 23 | 0.059 | 154 | 0.71 | 44 | 0.333 | 99 | 0.474 | 63 |
24 | LxSem | Word Familiarity | to_SbCDL_C | total SubtlexUS CDlow value | 0.348 | 107 | 0.148 | 87 | 0.764 | 30 | 0.394 | 85 | 0.513 | 53 |
24 | LxSem | Word Familiarity | to_SbCDC_C | total SubtlexUS CD# value | 0.347 | 110 | 0.146 | 90 | 0.764 | 29 | 0.392 | 87 | 0.515 | 51 |
24 | LxSem | Word Familiarity | to_SbSBC_C | total SubtlexUS SUBTLCD value | 0.347 | 111 | 0.146 | 91 | 0.764 | 28 | 0.392 | 88 | 0.515 | 52 |
23 | AdSem | Wiki Knowledge | WTopc20_S | Number of topics, 200 topics extracted from Wikipedia | 0.584 | 14 | 0.015 | 214 | 0.616 | 68 | 0.617 | 27 | 0.137 | 138 |
23 | AdSem | Wiki Knowledge | WTopc15_S | Number of topics, 150 topics extracted from Wikipedia | 0.58 | 15 | 0.007 | 227 | 0.645 | 63 | 0.605 | 30 | 0.191 | 122 |
23 | LxSem | Variation Ratio | SquaAvV_S | Squared AdVerb Variation-1 | 0.515 | 34 | 0.093 | 128 | 0.686 | 52 | 0.326 | 102 | 0.46 | 65 |
23 | Synta | Part-of-Speech | to_AvTag_C | total # Adverb POS tags | 0.342 | 114 | 0.17 | 67 | 0.726 | 40 | 0.352 | 96 | 0.469 | 64 |
23 | Synta | Part-of-Speech | as_AvTag_C | # Adverb POS tags per Sent | 0.32 | 129 | 0.292 | 41 | 0.526 | 83 | 0.43 | 67 | 0.415 | 79 |
23 | Synta | Part-of-Speech | to_VeTag_C | total # Verb POS tags | 0.288 | 138 | 0.173 | 63 | 0.738 | 38 | 0.383 | 91 | 0.597 | 27 |
22 | Synta | Phrasal | as_SuPhr_C | # Subordinate Clauses per Sent | 0.387 | 89 | 0.357 | 26 | 0.532 | 80 | 0.495 | 55 | 0.265 | 112 |
22 | Synta | Part-of-Speech | as_CoTag_C | # Coordinating Conjunction POS tags per Sent | 0.38 | 91 | 0.411 | 24 | 0.463 | 97 | 0.442 | 66 | 0.293 | 107 |
22 | Synta | Phrasal | to_AvPhr_C | total # Adverb phrases | 0.356 | 105 | 0.17 | 66 | 0.705 | 45 | 0.298 | 111 | 0.432 | 73 |
22 | Synta | Tree Structure | to_TreeH_C | total Tree height of all sentences | 0.27 | 145 | 0.069 | 143 | 0.755 | 33 | 0.309 | 108 | 0.515 | 50 |
21 | Disco | Entity Grid | ra_NSTo_C | ratio of ns transitions to total | 0.426 | 73 | 0.033 | 187 | 0.516 | 87 | 0.266 | 117 | 0.505 | 56 |
21 | Synta | Part-of-Speech | to_SuTag_C | total # Subordinating Conjunction POS tags | 0.4 | 83 | 0.193 | 56 | 0.691 | 48 | 0.406 | 79 | 0.299 | 106 |
20 | LxSem | Type Token Ratio | UberTTR_S | Uber Index | 0.646 | 8 | 0.041 | 174 | 0.369 | 112 | 0.109 | 173 | 0.599 | 26 |
20 | Synta | Phrasal | at_PrPhr_C | # prepositional phrases per Word | 0.57 | 16 | 0.133 | 103 | 0.316 | 124 | 0.323 | 105 | 0.366 | 92 |
20 | AdSem | Wiki Knowledge | WTopc05_S | Number of topics, 50 topics extracted from Wiki | 0.549 | 22 | 0.033 | 186 | 0.514 | 89 | 0.533 | 50 | 0.042 | 203 |
20 | AdSem | Wiki Knowledge | WTopc10_S | Number of topics, 100 topics extracted from Wiki | 0.52 | 33 | 0.004 | 229 | 0.532 | 79 | 0.552 | 43 | 0.075 | 180 |
20 | Disco | Entity Grid | ra_SNTo_C | ratio of sn transitions to total | 0.448 | 58 | 0.019 | 210 | 0.514 | 88 | 0.196 | 133 | 0.518 | 49 |
20 | LxSem | Word Familiarity | at_SbL1C_C | SubtlexUS Lg10CD value per Word | 0.408 | 78 | 0.161 | 75 | 0.541 | 78 | 0.204 | 130 | 0.392 | 86 |
20 | Disco | Entity Grid | ra_XNTo_C | ratio of xn transitions to total | 0.339 | 119 | 0.103 | 124 | 0.658 | 59 | 0.327 | 100 | 0.29 | 108 |
20 | Synta | Phrasal | to_AjPhr_C | total # Adjective phrases | 0.339 | 120 | 0.182 | 62 | 0.682 | 53 | 0.327 | 101 | 0.271 | 111 |
20 | Synta | Phrasal | as_AvPhr_C | # Adverb phrases per Sent | 0.244 | 152 | 0.328 | 34 | 0.427 | 103 | 0.38 | 92 | 0.356 | 93 |
20 | ShaTr | Shallow | TokSenS_S | sqrt(total # tokens x total # sentence) | 0.241 | 154 | 0.064 | 147 | 0.758 | 32 | 0.249 | 121 | 0.498 | 59 |
19 | AdSem | Wiki Knowledge | WNois20_S | Semantic Noise, 200 topics extracted from Wiki | 0.492 | 41 | 0.032 | 190 | 0.566 | 75 | 0.572 | 37 | 0.025 | 221 |
19 | Synta | Phrasal | ra_NoPrP_C | ratio of Noun phrases # to Prep phrases # | 0.477 | 43 | 0.149 | 83 | 0.34 | 120 | 0.345 | 97 | 0.389 | 87 |
19 | LxSem | Word Familiarity | at_SbL1W_C | SubtlexUS Lg10WF value per Word | 0.399 | 84 | 0.089 | 129 | 0.531 | 81 | 0.24 | 123 | 0.412 | 80 |
19 | LxSem | Word Familiarity | at_SbSBC_C | SubtlexUS SUBTLCD value per Word | 0.37 | 94 | 0.032 | 192 | 0.492 | 93 | 0.324 | 103 | 0.435 | 71 |
19 | LxSem | Word Familiarity | at_SbCDC_C | SubtlexUS CD# value per Word | 0.37 | 95 | 0.032 | 191 | 0.492 | 94 | 0.324 | 104 | 0.435 | 70 |
19 | Synta | Part-of-Speech | as_SuTag_C | # Subordinating Conjunction POS tags per Sent | 0.366 | 97 | 0.295 | 39 | 0.407 | 105 | 0.427 | 68 | 0.151 | 131 |
19 | LxSem | Word Familiarity | at_SbCDL_C | SubtlexUS CDlow value per Word | 0.362 | 102 | 0.047 | 166 | 0.474 | 96 | 0.31 | 107 | 0.431 | 74 |
18 | AdSem | Wiki Knowledge | WRich15_S | Semantic Richness, 150 topics extracted from Wiki | 0.495 | 39 | 0.02 | 208 | 0.48 | 95 | 0.549 | 45 | 0.037 | 209 |
18 | AdSem | Wiki Knowledge | WRich20_S | Semantic Richness, 200 topics extracted from Wiki | 0.465 | 50 | 0.029 | 195 | 0.446 | 102 | 0.556 | 41 | 0.027 | 219 |
18 | AdSem | Wiki Knowledge | WNois05_S | Semantic Noise, 50 topics extracted from Wiki | 0.462 | 53 | 0.061 | 150 | 0.455 | 100 | 0.412 | 75 | 0.118 | 151 |
18 | Synta | Phrasal | ra_PrNoP_C | ratio of Prep phrases # to Noun phrases # | 0.421 | 75 | 0.162 | 74 | 0.276 | 135 | 0.344 | 98 | 0.37 | 90 |
18 | Disco | Entity Grid | ra_NXTo_C | ratio of nx transitions to total | 0.339 | 118 | 0.097 | 127 | 0.62 | 66 | 0.28 | 116 | 0.278 | 110 |
18 | ShaTr | Shallow | TokSenL_S | log(total # tokens)/log(total # sentence) | 0.293 | 137 | 0.352 | 29 | 0.297 | 130 | 0.544 | 46 | 0.198 | 121 |
18 | ShaTr | Shallow | TokSenM_S | total # tokens x total # sentence | 0.189 | 173 | 0.112 | 116 | 0.674 | 55 | 0.177 | 140 | 0.486 | 60 |
17 | Synta | Phrasal | as_AjPhr_C | # Adjective phrases per Sent | 0.323 | 128 | 0.239 | 46 | 0.387 | 106 | 0.357 | 95 | 0.157 | 127 |
17 | Disco | Entity Density | at_UEnti_C | number of unique Entities per Word | 0.127 | 197 | 0.307 | 37 | 0.548 | 77 | 0.253 | 119 | 0.124 | 149 |
16 | Synta | Phrasal | ra_VePrP_C | ratio of Verb phrases # to Prep phrases # | 0.475 | 46 | 0.018 | 211 | 0.301 | 127 | 0.255 | 118 | 0.249 | 114 |
16 | AdSem | Wiki Knowledge | WNois15_S | Semantic Noise, 150 topics extracted from Wiki | 0.388 | 88 | 0.033 | 188 | 0.454 | 101 | 0.454 | 63 | 0.006 | 226 |
16 | LxSem | Psycholinguistic | at_AABrL_C | lemmas AoA of lemmas, Bristol norm per Word | 0.333 | 124 | 0.029 | 194 | 0.462 | 98 | 0.284 | 113 | 0.217 | 118 |
16 | LxSem | Psycholinguistic | at_AACoL_C | AoA of lemmas, Cortese and Khanna norm per Word | 0.333 | 125 | 0.029 | 193 | 0.462 | 99 | 0.284 | 114 | 0.217 | 119 |
16 | Disco | Entity Density | at_EntiM_C | number of Entities Mentions #s per Word | 0.17 | 180 | 0.204 | 49 | 0.501 | 92 | 0.292 | 112 | 0.127 | 146 |
16 | AdSem | WB Knowledge | BClar15_S | Semantic Clarity, 150 topics extracted from WeeBit | 0.025 | 221 | 0.161 | 76 | 0.38 | 108 | 0.481 | 57 | 0.315 | 100 |
15 | LxSem | Type Token Ratio | BiLoTTR_S | Bi-Logarithmic TTR | 0.591 | 13 | 0.062 | 149 | 0.07 | 200 | 0.001 | 229 | 0.523 | 47 |
15 | AdSem | Wiki Knowledge | WRich05_S | Semantic Richness, 50 topics extracted from Wiki | 0.405 | 80 | 0.063 | 148 | 0.347 | 117 | 0.301 | 110 | 0.035 | 211 |
15 | LxSem | Type Token Ratio | SimpTTR_S | TTR | 0.375 | 92 | 0.025 | 200 | 0.367 | 113 | 0.163 | 147 | 0.344 | 94 |
15 | AdSem | Wiki Knowledge | WRich10_S | Semantic Richness, 100 topics extracted from Wiki | 0.364 | 100 | 0.002 | 232 | 0.33 | 123 | 0.411 | 76 | 0.041 | 206 |
15 | AdSem | Wiki Knowledge | WNois10_S | Semantic Noise, 100 topics extracted from Wiki | 0.34 | 117 | 0.021 | 207 | 0.376 | 109 | 0.426 | 69 | 0.03 | 216 |
15 | Disco | Entity Density | to_UEnti_C | total number of unique Entities | 0.308 | 134 | 0.132 | 105 | 0.3 | 128 | 0.023 | 216 | 0.31 | 102 |
15 | AdSem | WB Knowledge | BClar20_S | Semantic Clarity, 200 topics extracted from WeeBit | 0.004 | 227 | 0.147 | 88 | 0.3 | 129 | 0.462 | 60 | 0.308 | 104 |
14 | Disco | Entity Grid | ra_NOTo_C | ratio of no transitions to total | 0.348 | 108 | 0.022 | 204 | 0.383 | 107 | 0.056 | 200 | 0.378 | 89 |
14 | Synta | Phrasal | ra_SuVeP_C | ratio of Subordinate Clauses # to Verb phrases # | 0.345 | 113 | 0.052 | 160 | 0.343 | 118 | 0.376 | 93 | 0.083 | 172 |
13 | Synta | Phrasal | ra_PrVeP_C | ratio of Prep phrases # to Verb phrases # | 0.485 | 42 | 0.055 | 157 | 0.184 | 158 | 0.189 | 136 | 0.219 | 117 |
13 | LxSem | Variation Ratio | SimpAjV_S | Adjective Variation-1 | 0.398 | 85 | 0.109 | 118 | 0.279 | 134 | 0.073 | 192 | 0.201 | 120 |
13 | Synta | Phrasal | ra_VeSuP_C | ratio of Verb phrases # to Subordinate Clauses # | 0.349 | 106 | 0.137 | 98 | 0.307 | 126 | 0.127 | 167 | 0.043 | 200 |
13 | Synta | Part-of-Speech | at_NoTag_C | # Noun POS tags per Word | 0.347 | 112 | 0.104 | 122 | 0.295 | 131 | 0.148 | 154 | 0.107 | 159 |
13 | Disco | Entity Grid | ra_ONTo_C | ratio of on transitions to total | 0.333 | 123 | 0.04 | 178 | 0.288 | 133 | 0.06 | 199 | 0.383 | 88 |
13 | Synta | Phrasal | at_SuPhr_C | # Subordinate Clauses per Word | 0.204 | 165 | 0.157 | 78 | 0.246 | 140 | 0.314 | 106 | 0.073 | 182 |
13 | LxSem | Psycholinguistic | at_AABiL_C | lemmas AoA of lemmas, Bird norm per Word | 0.203 | 166 | 0.11 | 117 | 0.266 | 138 | 0.053 | 202 | 0.302 | 105 |
13 | AdSem | WB Knowledge | BTopc10_S | Number of topics, 100 topics extracted from WeeBit | 0.197 | 169 | 0.038 | 179 | 0.364 | 114 | 0.166 | 145 | 0.323 | 99 |
13 | AdSem | WB Knowledge | BNois10_S | Semantic Noise, 100 topics extracted from WeeBit | 0.193 | 171 | 0.036 | 180 | 0.37 | 110 | 0.161 | 149 | 0.33 | 97 |
13 | AdSem | WB Knowledge | BNois05_S | Semantic Noise, 50 topics extracted from WeeBit | 0.158 | 184 | 0.011 | 219 | 0.351 | 116 | 0.15 | 153 | 0.325 | 98 |
13 | AdSem | WB Knowledge | BTopc15_S | Number of topics, 150 topics extracted from WeeBit | 0.133 | 193 | 0.146 | 92 | 0.209 | 151 | 0.416 | 73 | 0.03 | 217 |
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | LxSem | Variation Ratio | SimpNoV_S | Noun Variation-1 | 0.499 | 38 | 0.087 | 130 | 0.038 | 212 | 0.031 | 213 | 0.337 | 95 |
12 | Synta | Part-of-Speech | ra_NoVeT_C | ratio of Noun POS # to Verb POS # | 0.432 | 66 | 0.118 | 111 | 0.149 | 168 | 0.112 | 170 | 0.051 | 197 |
12 | Synta | Phrasal | ra_AvPrP_C | ratio of Adv phrases # to Prep phrases # | 0.4 | 82 | 0.014 | 217 | 0.222 | 147 | 0.196 | 135 | 0.115 | 152 |
12 | Synta | Part-of-Speech | ra_VeNoT_C | ratio of Verb POS # to Noun POS # | 0.397 | 86 | 0.198 | 53 | 0.234 | 142 | 0.171 | 142 | 0.067 | 186 |
12 | Synta | Part-of-Speech | ra_SuVeT_C | ratio of Subordinating Conjunction POS # to Verb POS # | 0.36 | 103 | 0.053 | 159 | 0.109 | 177 | 0.282 | 115 | 0.137 | 139 |
12 | Disco | Entity Density | as_UEnti_C | number of unique Entities per Sent | 0.337 | 121 | 0.114 | 113 | 0.273 | 136 | 0.066 | 196 | 0.157 | 128 |
12 | Synta | Part-of-Speech | at_AjTag_C | # Adjective POS tags per Word | 0.334 | 122 | 0.117 | 112 | 0.216 | 149 | 0.197 | 132 | 0.037 | 210 |
12 | Synta | Phrasal | ra_SuAvP_C | ratio of Subordinate Clauses # to Adv phrases # | 0.309 | 133 | 0.008 | 226 | 0.141 | 170 | 0.241 | 122 | 0.111 | 153 |
12 | Synta | Part-of-Speech | at_FuncW_C | # Function words per Word | 0.28 | 142 | 0.04 | 175 | 0.181 | 159 | 0.461 | 62 | 0.032 | 215 |
12 | AdSem | WB Knowledge | BTopc20_S | Number of topics, 200 topics extracted from WeeBit | 0.25 | 150 | 0.135 | 99 | 0.025 | 215 | 0.418 | 71 | 0.044 | 198 |
12 | AdSem | Wiki Knowledge | WClar05_S | Semantic Clarity, 50 topics extracted from Wiki | 0.212 | 164 | 0.014 | 218 | 0.214 | 150 | 0.235 | 124 | 0.102 | 161 |
12 | AdSem | WB Knowledge | BRich10_S | Semantic Richness, 100 topics extracted from WeeBit | 0.196 | 170 | 0.044 | 171 | 0.369 | 111 | 0.035 | 210 | 0.336 | 96 |
12 | AdSem | WB Knowledge | BClar05_S | Semantic Clarity, 50 topics extracted from WeeBit | 0.14 | 190 | 0.041 | 173 | 0.339 | 121 | 0.164 | 146 | 0.289 | 109 |
12 | Disco | Entity Density | to_EntiM_C | total number of Entities Mentions | 0.139 | 191 | 0.02 | 209 | 0.335 | 122 | 0.0 | 230 | 0.312 | 101 |
11 | Synta | Phrasal | ra_VeNoP_C | ratio of Verb phrases # to Noun phrases # | 0.46 | 54 | 0.164 | 70 | 0.124 | 174 | 0.041 | 209 | 0.027 | 220 |
11 | Synta | Part-of-Speech | at_VeTag_C | # Verb POS tags per Word | 0.431 | 69 | 0.187 | 61 | 0.076 | 196 | 0.111 | 171 | 0.011 | 224 |
11 | Synta | Part-of-Speech | ra_AjVeT_C | ratio of Adjective POS # to Verb POS # | 0.411 | 77 | 0.034 | 185 | 0.133 | 172 | 0.156 | 150 | 0.005 | 227 |
11 | Synta | Part-of-Speech | ra_SuAvT_C | ratio of Subordinating Conjunction POS # to Adverb POS # | 0.314 | 131 | 0.021 | 206 | 0.106 | 178 | 0.148 | 156 | 0.18 | 124 |
11 | LxSem | Variation Ratio | SimpVeV_S | Verb Variation-1 | 0.286 | 139 | 0.048 | 165 | 0.081 | 193 | 0.003 | 226 | 0.48 | 62 |
11 | Synta | Part-of-Speech | ra_CoFuW_C | ratio of Content words to Function words | 0.284 | 141 | 0.023 | 203 | 0.2 | 154 | 0.376 | 94 | 0.042 | 201 |
11 | Synta | Part-of-Speech | at_SuTag_C | # Subordinating Conjunction POS tags per Word | 0.259 | 148 | 0.13 | 106 | 0.085 | 192 | 0.252 | 120 | 0.135 | 141 |
11 | AdSem | Wiki Knowledge | WClar20_S | Semantic Clarity, 200 topics extracted from Wikipedia | 0.144 | 187 | 0.016 | 212 | 0.308 | 125 | 0.23 | 125 | 0.034 | 213 |
11 | AdSem | WB Knowledge | BTopc05_S | Number of topics, 50 topics extracted from WeeBit | 0.139 | 192 | 0.009 | 224 | 0.291 | 132 | 0.144 | 160 | 0.222 | 116 |
11 | AdSem | WB Knowledge | BRich05_S | Semantic Richness, 50 topics extracted from WeeBit | 0.126 | 198 | 0.051 | 162 | 0.24 | 141 | 0.051 | 207 | 0.309 | 103 |
11 | Synta | Phrasal | ra_SuNoP_C | ratio of Subordinate Clauses # to Noun phrases # | 0.081 | 203 | 0.163 | 72 | 0.224 | 146 | 0.307 | 109 | 0.086 | 170 |
11 | AdSem | WB Knowledge | BNois15_S | Semantic Noise, 150 topics extracted from WeeBit | 0.035 | 214 | 0.162 | 73 | 0.341 | 119 | 0.221 | 127 | 0.091 | 164 |
10 | Synta | Part-of-Speech | ra_CoVeT_C | ratio of Coordinating Conjunction POS # to Verb POS # | 0.302 | 135 | 0.133 | 104 | 0.023 | 218 | 0.133 | 164 | 0.088 | 168 |
10 | Synta | Phrasal | ra_AvSuP_C | ratio of Adv phrases # to Subordinate Clauses # | 0.299 | 136 | 0.06 | 151 | 0.256 | 139 | 0.128 | 165 | 0.077 | 176 |
10 | Synta | Tree Structure | at_FTree_C | length of flattened Trees per Word | 0.28 | 143 | 0.14 | 96 | 0.097 | 182 | 0.1 | 177 | 0.152 | 130 |
10 | Disco | Entity Density | as_EntiM_C | number of Entities Mentions #s per Sent | 0.242 | 153 | 0.015 | 215 | 0.219 | 148 | 0.051 | 206 | 0.168 | 125 |
10 | Disco | Entity Grid | LoCoDPW_S | Local Coherence distance for PW score | 0.239 | 155 | 0.002 | 230 | 0.195 | 156 | 0.143 | 161 | 0.141 | 136 |
10 | Disco | Entity Grid | LoCoDPA_S | Local Coherence distance for PA score | 0.239 | 156 | 0.002 | 231 | 0.195 | 157 | 0.143 | 162 | 0.141 | 135 |
10 | Synta | Part-of-Speech | at_CoTag_C | # Coordinating Conjunction POS tags per Word | 0.218 | 161 | 0.267 | 44 | 0.02 | 220 | 0.111 | 172 | 0.087 | 169 |
10 | LxSem | Variation Ratio | SimpAvV_S | AdVerb Variation-1 | 0.214 | 163 | 0.098 | 126 | 0.353 | 115 | 0.021 | 221 | 0.089 | 166 |
10 | Synta | Phrasal | ra_AjPrP_C | ratio of Adj phrases # to Prep phrases # | 0.201 | 168 | 0.036 | 181 | 0.155 | 164 | 0.095 | 178 | 0.252 | 113 |
10 | AdSem | WB Knowledge | BNois20_S | Semantic Noise, 200 topics extracted from WeeBit | 0.129 | 196 | 0.148 | 85 | 0.202 | 153 | 0.167 | 144 | 0.032 | 214 |
10 | AdSem | WB Knowledge | BRich20_S | Semantic Richness, 200 topics extracted from WeeBit | 0.047 | 211 | 0.104 | 121 | 0.112 | 176 | 0.221 | 126 | 0.143 | 134 |
9 | Synta | Phrasal | at_NoPhr_C | # Noun phrases per Word | 0.424 | 74 | 0.066 | 146 | 0.089 | 188 | 0.005 | 224 | 0.042 | 202 |
9 | Synta | Phrasal | ra_NoVeP_C | ratio of Noun phrases # to Verb phrases # | 0.406 | 79 | 0.068 | 145 | 0.069 | 201 | 0.031 | 212 | 0.019 | 223 |
9 | Synta | Phrasal | ra_PrAvP_C | ratio of Prep phrases # to Adv phrases # | 0.32 | 130 | 0.027 | 196 | 0.021 | 219 | 0.176 | 141 | 0.071 | 183 |
9 | Synta | Phrasal | at_VePhr_C | # Verb phrases per Word | 0.31 | 132 | 0.138 | 97 | 0.079 | 194 | 0.032 | 211 | 0.009 | 225 |
9 | Synta | Part-of-Speech | ra_CoAvT_C | ratio of Coordinating Conjunction POS # to Adverb POS # | 0.284 | 140 | 0.04 | 176 | 0.16 | 162 | 0.079 | 189 | 0.119 | 150 |
9 | Synta | Part-of-Speech | ra_NoAvT_C | ratio of Noun POS # to Adverb POS # | 0.261 | 147 | 0.133 | 102 | 0.101 | 180 | 0.052 | 204 | 0.034 | 212 |
9 | Disco | Entity Grid | LoCohPW_S | Local Coherence for PW score | 0.229 | 159 | 0.034 | 183 | 0.012 | 227 | 0.146 | 157 | 0.148 | 133 |
9 | Disco | Entity Grid | LoCohPA_S | Local Coherence for PA score | 0.229 | 160 | 0.034 | 184 | 0.012 | 226 | 0.146 | 158 | 0.148 | 132 |
9 | Synta | Phrasal | ra_SuPrP_C | ratio of Subordinate Clauses # to Prep phrases # | 0.218 | 162 | 0.048 | 164 | 0.015 | 224 | 0.07 | 194 | 0.227 | 115 |
9 | Synta | Part-of-Speech | ra_VeAjT_C | ratio of Verb POS # to Adjective POS # | 0.177 | 177 | 0.059 | 153 | 0.203 | 152 | 0.162 | 148 | 0.042 | 204 |
9 | Synta | Part-of-Speech | at_ContW_C | # Content words per Word | 0.161 | 183 | 0.057 | 156 | 0.23 | 143 | 0.183 | 139 | 0.055 | 193 |
9 | Synta | Phrasal | ra_NoSuP_C | ratio of Noun phrases # to Subordinate Clauses # | 0.157 | 185 | 0.153 | 80 | 0.228 | 145 | 0.052 | 205 | 0.04 | 207 |
9 | Synta | Phrasal | ra_PrAjP_C | ratio of Prep phrases # to Adj phrases # | 0.142 | 189 | 0.035 | 182 | 0.017 | 223 | 0.207 | 128 | 0.136 | 140 |
9 | Synta | Part-of-Speech | ra_NoAjT_C | ratio of Noun POS # to Adjective POS # | 0.121 | 199 | 0.152 | 81 | 0.125 | 173 | 0.114 | 169 | 0.004 | 228 |
9 | AdSem | WB Knowledge | BClar10_S | Semantic Clarity, 100 topics extracted from WeeBit | 0.079 | 204 | 0.015 | 216 | 0.269 | 137 | 0.148 | 155 | 0.181 | 123 |
9 | Synta | Part-of-Speech | ra_CoNoT_C | ratio of Coordinating Conjunction POS # to Noun POS # | 0.02 | 224 | 0.277 | 42 | 0.159 | 163 | 0.013 | 222 | 0.132 | 142 |
8 | Synta | Part-of-Speech | ra_AjAvT_C | ratio of Adjective POS # to Adverb POS # | 0.357 | 104 | 0.042 | 172 | 0.056 | 204 | 0.091 | 180 | 0.044 | 199 |
8 | Synta | Part-of-Speech | ra_SuCoT_C | ratio of Subordinating Conj POS # to Coordinating Conj # | 0.274 | 144 | 0.054 | 158 | 0.019 | 222 | 0.143 | 163 | 0.077 | 179 |
8 | Synta | Part-of-Speech | ra_VeSuT_C | ratio of Verb POS # to Subordinating Conjunction # | 0.266 | 146 | 0.046 | 169 | 0.09 | 186 | 0.105 | 175 | 0.065 | 188 |
8 | Synta | Phrasal | ra_AvNoP_C | ratio of Adv phrases # to Noun phrases # | 0.257 | 149 | 0.128 | 108 | 0.072 | 199 | 0.044 | 208 | 0.051 | 196 |
8 | Synta | Part-of-Speech | ra_SuAjT_C | ratio of Subordinating Conjunction POS # to Adjective POS # | 0.244 | 151 | 0.008 | 225 | 0.074 | 197 | 0.082 | 187 | 0.138 | 137 |
8 | Synta | Phrasal | ra_NoAvP_C | ratio of Noun phrases # to Adv phrases # | 0.235 | 157 | 0.102 | 125 | 0.09 | 187 | 0.082 | 188 | 0.071 | 185 |
8 | Synta | Phrasal | ra_AjAvP_C | ratio of Adj phrases # to Adv phrases # | 0.232 | 158 | 0.016 | 213 | 0.046 | 209 | 0.094 | 179 | 0.156 | 129 |
8 | Synta | Part-of-Speech | ra_AvSuT_C | ratio of Adverb POS # to Subordinating Conjunction # | 0.202 | 167 | 0.024 | 201 | 0.003 | 230 | 0.114 | 168 | 0.067 | 187 |
8 | Synta | Part-of-Speech | ra_VeCoT_C | ratio of Verb POS # to Coordinating Conjunction # | 0.192 | 172 | 0.172 | 64 | 0.134 | 171 | 0.022 | 218 | 0.054 | 194 |
8 | LxSem | Word Familiarity | at_SbFrQ_C | SubtlexUS FREQ# value per Word | 0.181 | 174 | 0.196 | 55 | 0.095 | 183 | 0.021 | 219 | 0.109 | 155 |
8 | LxSem | Word Familiarity | at_SbSBW_C | SubtlexUS SUBTLWF value per Word | 0.181 | 175 | 0.196 | 54 | 0.095 | 184 | 0.021 | 220 | 0.109 | 156 |
8 | AdSem | Wiki Knowledge | WClar10_S | Semantic Clarity, 100 topics extracted from Wiki | 0.178 | 176 | 0.01 | 223 | 0.153 | 167 | 0.171 | 143 | 0.084 | 171 |
8 | AdSem | Wiki Knowledge | WClar15_S | Semantic Clarity, 150 topics extracted from Wiki | 0.165 | 182 | 0.011 | 221 | 0.161 | 161 | 0.185 | 138 | 0.074 | 181 |
8 | Disco | Entity Grid | LoCohPU_S | Local Coherence for PU score | 0.129 | 195 | 0.023 | 202 | 0.103 | 179 | 0.084 | 184 | 0.13 | 144 |
8 | Synta | Part-of-Speech | ra_VeAvT_C | ratio of Verb POS # to Adverb POS # | 0.108 | 200 | 0.078 | 136 | 0.229 | 144 | 0.025 | 215 | 0.079 | 174 |
8 | Synta | Part-of-Speech | ra_SuNoT_C | ratio of Subordinating Conjunction POS # to Noun POS # | 0.085 | 202 | 0.149 | 82 | 0.039 | 211 | 0.155 | 151 | 0.158 | 126 |
8 | AdSem | WB Knowledge | BRich15_S | Semantic Richness, 150 topics extracted from WeeBit | 0.025 | 220 | 0.059 | 152 | 0.154 | 166 | 0.145 | 159 | 0.1 | 162 |
8 | Synta | Part-of-Speech | ra_NoCoT_C | ratio of Noun POS # to Coordinating Conjunction # | 0.022 | 222 | 0.254 | 45 | 0.019 | 221 | 0.053 | 201 | 0.109 | 157 |
8 | LxSem | Type Token Ratio | MTLDTTR_S | Measure of Textual Lexical Diversity (default TTR = 0.72) | 0.0 | 230 | 0.103 | 123 | 0.119 | 175 | 0.151 | 152 | 0.0 | 231 |
Score | Branch | Subgroup | LingFeat Code | Brief Explanation | CCB r | CCB rk | WBT r | WBT rk | CAM r | CAM rk | CKC r | CKC rk | OSE r | OSE rk |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | LxSem | Word Familiarity | at_SbFrL_C | SubtlexUS FREQlow value per Word | 0.176 | 178 | 0.171 | 65 | 0.061 | 203 | 0.001 | 228 | 0.09 | 165 |
7 | Synta | Part-of-Speech | ra_AvNoT_C | ratio of Adverb POS # to Noun POS # | 0.171 | 179 | 0.108 | 119 | 0.076 | 195 | 0.084 | 185 | 0.023 | 222 |
7 | Disco | Entity Grid | LoCoDPU_S | Local Coherence distance for PU score | 0.154 | 186 | 0.032 | 189 | 0.086 | 191 | 0.087 | 182 | 0.111 | 154 |
7 | Synta | Phrasal | at_AvPhr_C | # Adverb phrases per Word | 0.144 | 188 | 0.113 | 115 | 0.047 | 208 | 0.029 | 214 | 0.058 | 191 |
7 | Synta | Phrasal | ra_AjSuP_C | ratio of Adj phrases # to Subordinate Clauses # | 0.133 | 194 | 0.04 | 177 | 0.195 | 155 | 0.001 | 227 | 0.079 | 173 |
7 | Synta | Phrasal | ra_AjVeP_C | ratio of Adj phrases # to Verb phrases # | 0.104 | 201 | 0.01 | 222 | 0.055 | 205 | 0.083 | 186 | 0.124 | 148 |
7 | Synta | Part-of-Speech | ra_CoAjT_C | ratio of Coordinating Conjunction POS # to Adjective POS # | 0.068 | 206 | 0.051 | 161 | 0.176 | 160 | 0.074 | 191 | 0.104 | 160 |
7 | Synta | Part-of-Speech | ra_AvCoT_C | ratio of Adverb POS # to Coordinating Conjunction # | 0.029 | 216 | 0.119 | 110 | 0.024 | 216 | 0.022 | 217 | 0.107 | 158 |
7 | Synta | Part-of-Speech | ra_AjSuT_C | ratio of Adjective POS # to Subordinating Conjunction # | 0.025 | 219 | 0.001 | 233 | 0.024 | 217 | 0.204 | 131 | 0.057 | 192 |
7 | Synta | Phrasal | ra_SuAjP_C | ratio of Subordinate Clauses # to Adj phrases # | 0.02 | 223 | 0.022 | 205 | 0.05 | 206 | 0.204 | 129 | 0.029 | 218 |
7 | Synta | Phrasal | ra_PrSuP_C | ratio of Prep phrases # to Subordinate Clauses # | 0.002 | 228 | 0.076 | 139 | 0.143 | 169 | 0.07 | 193 | 0.13 | 143 |
6 | Synta | Part-of-Speech | ra_AvVeT_C | ratio of Adverb POS # to Verb POS # | 0.168 | 181 | 0.011 | 220 | 0.097 | 181 | 0.053 | 203 | 0.053 | 195 |
6 | Synta | Part-of-Speech | ra_AjNoT_C | ratio of Adjective POS # to Noun POS # | 0.074 | 205 | 0.146 | 89 | 0.031 | 213 | 0.068 | 195 | 0.041 | 205 |
6 | Synta | Phrasal | ra_VeAjP_C | ratio of Verb phrases # to Adj phrases # | 0.067 | 207 | 0.072 | 142 | 0.087 | 190 | 0.104 | 176 | 0.064 | 189 |
6 | Synta | Part-of-Speech | ra_AvAjT_C | ratio of Adverb POS # to Adjective POS # | 0.061 | 208 | 0.049 | 163 | 0.088 | 189 | 0.107 | 174 | 0.039 | 208 |
6 | Synta | Phrasal | ra_NoAjP_C | ratio of Noun phrases # to Adj phrases # | 0.05 | 209 | 0.084 | 132 | 0.073 | 198 | 0.128 | 166 | 0.062 | 190 |
6 | Synta | Part-of-Speech | ra_NoSuT_C | ratio of Noun POS # to Subordinating Conjunction # | 0.049 | 210 | 0.075 | 140 | 0.004 | 229 | 0.186 | 137 | 0.077 | 178 |
6 | Synta | Phrasal | ra_VeAvP_C | ratio of Verb phrases # to Adv phrases # | 0.039 | 213 | 0.084 | 133 | 0.155 | 165 | 0.065 | 198 | 0.097 | 163 |
6 | Synta | Part-of-Speech | ra_CoSuT_C | ratio of Coordinating Conj POS # to Subordinating Conj # | 0.03 | 215 | 0.076 | 137 | 0.044 | 210 | 0.196 | 134 | 0.001 | 229 |
6 | Synta | Phrasal | at_AjPhr_C | # Adjective phrases per Word | 0.027 | 218 | 0.046 | 167 | 0.029 | 214 | 0.076 | 190 | 0.126 | 147 |
6 | Synta | Phrasal | ra_AjNoP_C | ratio of Adj phrases # to Noun phrases # | 0.01 | 226 | 0.046 | 168 | 0.013 | 225 | 0.066 | 197 | 0.127 | 145 |
6 | Synta | Part-of-Speech | ra_AjCoT_C | ratio of Adjective POS # to Coordinating Conjunction # | 0.0 | 229 | 0.148 | 86 | 0.049 | 207 | 0.091 | 181 | 0.077 | 177 |
5 | Synta | Phrasal | ra_AvAjP_C | ratio of Adv phrases # to Adj phrases # | 0.044 | 212 | 0.044 | 170 | 0.066 | 202 | 0.086 | 183 | 0.088 | 167 |
5 | Synta | Part-of-Speech | at_AvTag_C | # Adverb POS tags per Word | 0.029 | 217 | 0.072 | 141 | 0.095 | 185 | 0.011 | 223 | 0.078 | 175 |
5 | Synta | Phrasal | ra_AvVeP_C | ratio of Adv phrases # to Verb phrases # | 0.02 | 225 | 0.068 | 144 | 0.005 | 228 | 0.003 | 225 | 0.071 | 184 |
5 | Disco | Entity Grid | ra_XXTo_C | ratio of xx transitions to total | 0.0 | 231 | 0.025 | 198 | 0.0 | 231 | 0.0 | 231 | 0.0 | 230 |
5 | Disco | Entity Grid | ra_XSTo_C | ratio of xs transitions to total | 0.0 | 232 | 0.025 | 197 | 0.0 | 232 | 0.0 | 232 | 0.0 | 232 |
5 | Disco | Entity Grid | ra_SSTo_C | ratio of ss transitions to total | 0.0 | 233 | 0.025 | 199 | 0.0 | 233 | 0.0 | 233 | 0.0 | 233 |