
Consistency and Coherence from Points of Contextual Similarity

Oleg Vasilyev, John Bohannon
Primer Technologies Inc.
San Francisco, California
oleg,[email protected]
Abstract

Factual consistency is one of the most important summary evaluation dimensions, especially as summary generation becomes more fluent and coherent. The ESTIME measure, recently proposed specifically for factual consistency, achieves high correlations with human expert scores both for consistency and fluency, but it is in principle restricted to evaluating text-summary pairs that have high dictionary overlap. This is not a problem for current styles of summarization, but it may become an obstacle for future summarization systems, or for evaluating arbitrary claims against a text. In this work we generalize the method and make a variant of the measure applicable to any text-summary pair. Since ESTIME uses points of contextual similarity, it also provides insight into the usefulness of information taken from different BERT layers. We observe that useful information exists in almost all of the layers except the several lowest ones. For consistency and fluency - qualities focused on local text details - the most useful layers are close to the top (but not at the top); for coherence and relevance we found a more complicated and interesting picture.

1 Introduction

A summary is assessed by evaluating its qualities, which can be defined in different ways Fan et al. (2018); Xenouleas et al. (2019); Kryscinski et al. (2020); Vasilyev et al. (2020b); Fabbri et al. (2021a). The commonly considered qualities fall into two classes: summary-focused and summarization-focused. The summary-focused qualities are supposed to reflect the language of the summary itself, without any relation to the summarization. For example, grammar, fluency, structure and coherence of a summary should not require consideration of the text from which the summary is produced: grading a summary in this respect is no different from evaluating any other text, generated or not. The summarization-focused qualities are supposed to reflect the summarization, and require the summary and the text to be considered together - for example, relevance, informativeness and factual consistency.

Automated evaluation measures normally use both the summary and the text Louis and Nenkova (2009); Scialom et al. (2019); Gao et al. (2020); Vasilyev et al. (2020a); Vasilyev and Bohannon (2021a); Scialom et al. (2021); other measures use the summary and 'reference summaries' - summaries written by humans specifically for the text Papineni et al. (2002); Lin (2004); Zhang et al. (2020). Correlation of evaluation measures with human scores for all qualities is widely accepted as a criterion for judging evaluation measures Fabbri et al. (2021a), with a few caveats.

Since the summary qualities (e.g. relevance, consistency, coherence and fluency) are all different, improving a measure cannot be expected to keep improving its correlations with all the qualities indefinitely. It would be natural to have measures intentionally focused on certain qualities. If a measure correlates reasonably well (by current standards) with all the qualities, this may simply reflect the reality that better generation systems produce summaries with better qualities overall, and that easier texts allow summaries with better qualities overall. There is also a possibility of an implicit bias even in expert scores Vasilyev and Bohannon (2021b).

Factual consistency is arguably the most objective quality of a summary, and there have been consistent efforts to improve its evaluation Falke et al. (2019); Kryscinski et al. (2020); Wang et al. (2020); Maynez et al. (2020); Scialom et al. (2021); Gabriel et al. (2021). The recently introduced measure ESTIME is focused on factual consistency, and achieves superior scores both for consistency and fluency Vasilyev and Bohannon (2021a) (code: https://github.com/PrimerAI/blanc/tree/master/estime). The measure is simple, interpretable and easily reproducible: it is not tuned on a specific human-annotated dataset but simply uses a well-known pretrained language model, with almost no parameters to set or choose. It can use any other pretrained language model, including multilingual BERT, and is thus ready for other languages. However, ESTIME can be used only in situations where the summary and the text have high dictionary overlap. This is not a problem for evaluating current generation systems (which rarely introduce words not from the text), but it is an obstacle for progress toward estimating factual consistency of an arbitrary 'claim' (not necessarily a summary) with respect to a text.

Our contribution (the code for all the introduced measures will be added to https://github.com/PrimerAI/blanc/tree/master/estime):

  1. We generalize ESTIME, and present a variant applicable to more scenarios.

  2. We get insights on the usefulness of embeddings from different BERT layers for evaluating summary qualities.

  3. We provide an alternative measure for evaluating coherence. Curiously, its correlations with human scores are better when it uses the lower-middle part of a large BERT.

2 Evaluation at points of similarity

ESTIME as defined in Vasilyev and Bohannon (2021a) can be expressed by a count of all the instances of mismatched tokens at the points of similarity between the summary and the text:

\sum_{i:t_{i}\in T}\left(1-\delta_{t_{i}\,t^{\prime}_{\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha})}}\right)   (1)

The summary is represented by a sequence of tokens $t_{i}$, each having embedding $e_{i}$; the text $T$ is a sequence of tokens $t^{\prime}_{\alpha}$, having embeddings $e^{\prime}_{\alpha}$. The tokens are the words or word parts as defined by the pretrained language model that is used to obtain the embeddings. For each summary token $t_{i}$, located at a point $i$, there is a point of similarity $\beta=\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha})$, where the similarity is defined by the product of the embeddings $(e_{i}e^{\prime}_{\alpha})$. If the tokens $t_{i}$ and $t^{\prime}_{\beta}$ do not coincide, it adds to the total count of 'alarms'. The summation in Equation 1 is over all the summary tokens $t_{i}$ that exist in the text $T$.

The embeddings $e$ are contextual embeddings, obtained by processing the masked tokens. Loosely speaking, ESTIME imitates what a human does to find factual inconsistencies: for each point in a summary, find or recall the point of similar context in the text, and verify that the summary responds to this context in the same way as the text does. In this sense, any 'fact' is simplistically represented by a pair of a context and a response to that context, the response being a token. ESTIME verifies that all such summary facts are in agreement with their counterparts in the text.
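As an illustration, Eq. 1 reduces to a few lines of array code once the masked-token contextual embeddings are available. The sketch below is a minimal reading of the formula, not the released implementation; the array names and the helper estime_alarms are our own.

    import numpy as np

    def estime_alarms(summary_ids, text_ids, E, E_text):
        # E: [N, d] contextual embeddings of masked summary tokens (chosen layer)
        # E_text: [M, d] contextual embeddings of masked text tokens (same layer)
        # summary_ids, text_ids: the corresponding token ids
        sims = E @ E_text.T                  # dot-product context similarity
        best = sims.argmax(axis=1)           # point of similarity for each summary token
        text_vocab = set(text_ids)
        alarms = 0
        for i, tok in enumerate(summary_ids):
            # count an 'alarm' when a summary token that occurs in the text
            # differs from the token found at its point of similarity
            if tok in text_vocab and tok != text_ids[best[i]]:
                alarms += 1
        return alarms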

We replace the response: instead of the tokens $t_{i}$, we consider raw (input) token embeddings $\phi_{i}$. We define the 'ESTIME-soft' consistency score as:

\frac{1}{N}\sum_{i=1}^{N}\cos\left(\phi_{i},\,\phi^{\prime}_{\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha})}\right)   (2)

The agreement is now measured not between the tokens, but between the raw token embeddings at the points of similarity. The summation is over all the summary tokens, and the result is the averaged agreement: $N$ is the length of the summary in tokens.

The context similarity is measured using the dot-product of the contextualized embeddings $e$ of masked tokens, and the embeddings are taken from one of the layers of the model, the same way as it is done in ESTIME. The length of the embeddings is influenced by the context, and this influence is not lost by the dot-product. But the agreement between the 'responses' is measured by the cosine of the angle between the raw token embeddings $\phi$: the context is not needed here.
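Under the same assumptions as in the previous sketch, Eq. 2 only replaces the token comparison by a cosine between raw input embeddings; phi and phi_text are hypothetical names for the raw embeddings of the summary and text tokens.

    import numpy as np

    def estime_soft(E, E_text, phi, phi_text):
        # E, E_text: masked-token contextual embeddings (points of similarity)
        # phi, phi_text: raw (input) token embeddings of the same tokens
        best = (E @ E_text.T).argmax(axis=1)     # points of similarity
        matched = phi_text[best]                 # raw embeddings of the matched text tokens
        cos = (phi * matched).sum(axis=1) / (
            np.linalg.norm(phi, axis=1) * np.linalg.norm(matched, axis=1))
        return float(cos.mean())                 # averaged agreement, Eq. 2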

3 Correlations with expert scores

measure           consistency     relevance       coherence       fluency
                  ρ      τ        ρ      τ        ρ      τ        ρ      τ
BLANC             0.19   0.10     0.28   0.20     0.22   0.16     0.13   0.07
ESTIME            0.39   0.19     0.15   0.11     0.27   0.19     0.38*  0.22*
ESTIME-soft       0.40*  0.20*    0.25   0.18     0.28   0.20     0.36   0.21
Jensen-Shannon    0.18   0.09     0.39   0.28     0.29   0.21     0.11   0.06
SummaQA-F         0.17   0.08     0.14   0.10     0.08   0.06     0.12   0.07
SummaQA-P         0.19   0.09     0.17   0.12     0.10   0.08     0.12   0.07
SUPERT            0.28   0.14     0.26   0.19     0.20   0.15     0.17   0.10
--------------------------------------------------------------------------------
BERTScore-F       0.10   0.05     0.38   0.28     0.39*  0.28*    0.13   0.07
BERTScore-P       0.05   0.03     0.29   0.21     0.34   0.25     0.11   0.06
BERTScore-R       0.15   0.08     0.41*  0.30*    0.34   0.25     0.11   0.06
BLEU              0.09   0.04     0.23   0.17     0.19   0.14     0.12   0.07
ROUGE-L           0.12   0.06     0.23   0.16     0.16   0.11     0.08   0.04
ROUGE-1           0.13   0.07     0.28   0.20     0.17   0.12     0.07   0.04
ROUGE-2           0.12   0.06     0.23   0.16     0.14   0.10     0.06   0.04
ROUGE-3           0.15   0.07     0.23   0.17     0.15   0.11     0.06   0.04
Table 1: Summary-level correlations ρ (Spearman) and τ (Kendall Tau-c) of quality estimators with human expert scores. Truly automated evaluation measures are in the top rows; the measures below the separation line need human-written reference summaries. In each column the highest correlation is marked with *.

We compare ESTIME-soft with other measures using the SummEval dataset (https://github.com/Yale-LILY/SummEval) Fabbri et al. (2021a). The human-annotated part of the dataset consists of 100 texts, each accompanied by 17 summaries generated by 17 systems. All 1700 text-summary pairs are annotated (on a scale of 1 to 5) by 3 experts for 4 qualities: consistency, relevance, coherence and fluency. Each text is also accompanied by 11 human-written reference summaries, for the measures that need them.

Correlations of a few measures with average expert scores (each score is an average of the three individual expert scores) are shown in Table 1. The results for most measures and qualities differ slightly from Vasilyev and Bohannon (2021a), where the original version of SummEval was used, with 16 (not 17) generation systems. BLANC is taken in its default version (https://github.com/PrimerAI/blanc); ESTIME uses embeddings from layer 21. ESTIME-soft is obtained using the same model 'bert-large-uncased-whole-word-masking' and the same layer 21 for the points of contextual similarity, while using 'bert-base-uncased' for the raw embeddings. More details on calculating the correlations of Table 1 are in Appendix A.

In Table 1, ESTIME-soft has approximately the same correlations with expert scores as ESTIME. The distinct improvement for relevance is not important by itself: a summary-to-text verification measure should not be applied to judge relevance. But together with the small improvements in consistency and coherence, it suggests that context is used better in ESTIME-soft than in ESTIME, at the expense of correlations with fluency. Of course ESTIME and ESTIME-soft can be calculated together, because the main processing cost is in obtaining the masked embeddings, which are then used for both measures.

4 Embeddings from different layers

4.1 Layers for summary and text

We can expect that the measure given by Eq. 2 is not only applicable in more situations, but also more robust. In Figure 1 we show how the correlations with expert scores depend on the BERT layers from which the embeddings $e$ are taken. We allow the embeddings for the summary and the text to be taken from different layers. Our motivation for this is that (1) the summary and the text have different information density, and (2) we may get more insight into how robust the measures are. Figure 1 shows Kendall Tau-c correlations. Spearman correlations manifest similar patterns, see Appendix C.
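For concreteness, the per-layer embeddings of masked tokens can be collected with the Hugging Face transformers library roughly as sketched below. Masking one token at a time is an illustrative simplification (the released ESTIME code may batch this differently), and the function name is our own.

    import torch
    from transformers import AutoTokenizer, AutoModel

    name = 'bert-large-uncased-whole-word-masking'
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True).eval()

    def masked_embeddings(text, layer):
        # Returns token ids and, for each token position, the hidden state of
        # that position (from the requested layer) when the token is masked.
        ids = tokenizer(text, return_tensors='pt', truncation=True)['input_ids'][0]
        vectors = []
        with torch.no_grad():
            for i in range(1, len(ids) - 1):     # skip [CLS] and [SEP]
                masked = ids.clone()
                masked[i] = tokenizer.mask_token_id
                out = model(masked.unsqueeze(0))
                vectors.append(out.hidden_states[layer][0, i])
        return ids[1:-1], torch.stack(vectors)

Embeddings for the summary and for the text can then be taken from different layers simply by calling the function with different layer indices.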

Figure 1: Kendall Tau-c correlation between ESTIME variants and SummEval expert scores. The upper row is for ESTIME-soft, the lower row for ESTIME. Each column is for one of the qualities. Points of similarity are found using embeddings (of masked tokens) from the BERT layer on axis X for the text, and the layer on axis Y for the summary.

Indeed, we observe that for consistency ESTIME-soft is more tolerant to the choice of the layers, including a higher and flatter peak around layer 21 (compared to ESTIME). For all qualities the region of good layers is wider for ESTIME-soft. Compared to consistency, the fluency peak at high layers is more pronounced; this might be because fluency is closer to the task of token prediction, which is the actual goal of BERT training.

Coherence and relevance are qualities that, on the contrary, require long-range context for making a judgement. It is interesting, therefore, that ESTIME-soft (and ESTIME) correlates best with both of these qualities away from the diagonals in Figure 1, when the embeddings for the summary tokens and for the text tokens come from different layers. The shape of these heatmaps suggests that something is happening in the middle of the transformer, and that embeddings from different layers may carry complementary information, beneficial for finding the points of contextual similarity, and ultimately for the correlations.

The usefulness of ESTIME-soft (Eq. 2) depends on two factors: (1) successful finding of the points of contextual similarity $(e_{i}e^{\prime}_{\alpha})$, where we now allow $e_{i}$ and $e_{\alpha}^{\prime}$ to be picked from different layers, and (2) a reliable relation between the context $e_{i}$ and the 'response' $\phi_{i}$ (and likewise between $e_{\beta}^{\prime}$ and $\phi_{\beta}^{\prime}$). We speculate that satisfying both factors with a single transformer layer is easier for a 'local context' quality - consistency or fluency - but more difficult for coherence or relevance, for which longer-range context is important.

4.2 Changes across the layers

The coherence and relevance correlations with ESTIME in Figure 1 suggest that there may be a qualitative difference between the context-relevant information in the lower half and the upper half of the layers. Transformer layers were probed in Zhang et al. (2020); Vasilyev and Bohannon (2021a) by their usefulness for estimating the qualities of a summary; the layers were also inspected by other kinds of 'probes' in Jawahar et al. (2019); Voita et al. (2019); Liu et al. (2019); Kashyap et al. (2021); van Aken et al. (2019); Wallat et al. (2020); De Cao et al. (2020); Timkey and van Schijndel (2021); Turton et al. (2021); Geva et al. (2021); Yun et al. (2021); Fomicheva et al. (2021).

Since the points of contextual similarity are defined in Eqs. 1 and 2 by the dot-product of the embeddings of masked tokens, we are curious about the evolution of such embeddings through the layers, regardless of the summarization task. The embedding norm is one interesting characteristic, being solely responsible for the peak around layer 21 in the correlations with consistency Vasilyev and Bohannon (2021a). In Figure 2 we present the norm of the masked token embeddings across the layers, using the same annotated part of the SummEval dataset: 100 texts and 1700 generated summaries.

Figure 2: Norm of the embedding $e^{k}_{i}$ of a masked token $i$, for each layer $k$. The token is either CLS or a usual text token; the token is also either taken from a summary or from a text. Results are averaged over all tokens of each summary or text, and then averaged over all the summaries or texts (of the SummEval dataset). Model: 'bert-large-uncased-whole-word-masking'.

Notice that here we use these texts and summaries simply as separate texts, considering the embeddings of all masked tokens of each.

We observe that there is only a minor difference between considering the tokens of the text and the tokens of the summary. The embeddings of the text tokens have a slightly higher norm than the embeddings of the summary tokens, probably boosted by having more related context for each token. More importantly, the CLS token, which carries more long-range semantic information, gets a large peak of the norm exactly in the middle of the transformer, and a large dip and peak at higher layers.

In Figure 3 we consider how much an embedding changes from one layer to the next, on average.

Figure 3: Norm of the change of embedding. The data and the notations are the same as in Figure 2, except that here for layer $k$ we show $|e^{k+1}_{i}-e^{k}_{i}|$: the norm of the difference of the embeddings of the same (masked) token $i$ from the neighboring layers.

The figure shows the norm of the increment of a masked token embedding from one layer to the next. The change exactly in the middle of the transformer shows up even more strongly.

The cosine between the embeddings of the same token in two neighboring layers is shown in Figure 4.

Figure 4: Cosine between embeddings from neighboring layers. The data and the notations are the same as in Figure 2, except that here for layer $k$ we show $\cos(e^{k+1}_{i},e^{k}_{i})$: the cosine of the angle between embeddings of the same (masked) token $i$ from the neighboring layers.

There are almost no changes in the CLS token across the layers, except the strong change in the middle layers 10-13, and then at the high layers starting from layer 18. The common text tokens keep going through substantial transformation, remarkably at a stable pace - the cosine between the embeddings of neighboring layers is about 0.95 - with much larger changes at the few special layers at which the CLS embedding also changes.
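The per-layer statistics behind Figures 2-4 are straightforward to compute once the hidden states of a masked token have been collected from every layer; a sketch (with hypothetical variable names) follows:

    import torch

    def layer_stats(e):
        # e: tensor [L + 1, d] - the hidden state of one masked token at the
        # input (row 0) and after each of the L transformer layers
        norms = e.norm(dim=1)                              # Figure 2: |e_i^k|
        increments = (e[1:] - e[:-1]).norm(dim=1)          # Figure 3: |e_i^{k+1} - e_i^k|
        cosines = torch.nn.functional.cosine_similarity(   # Figure 4: cos(e_i^{k+1}, e_i^k)
            e[1:], e[:-1], dim=1)
        return norms, increments, cosines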

The evolution of the masked token embeddings across the layers, reflected in Figures 2, 3 and 4, is not related to the summarization: this is simply how the embeddings change for a text or a summary (see also Appendix B). Looking back at the coherence heatmap in Figure 1, it seems natural to split the layer-by-layer square into four quadrants.

5 Coherence and points of similarity

5.1 Coherence

As we have seen in Figure 1, coherence and relevance get higher correlations with expert scores when the embeddings are picked from different layers. Layers 18-21 for the summary and around 8-11 for the text would give better correlations for coherence than in Table 1. We may speculate that the reason for this is that coherence and relevance are not as 'local' as consistency and fluency, and need more varied information for a judgement. When forced to use the points of similarity between the summary and the text, they do better with different layers.

In principle there should be a way to estimate the coherence and fluency of a summary without using the text, but having the text as a helpful pattern makes it easier. For a better understanding of how the text helps, we can consider estimating coherence from the order of the points of similarity. Let us consider a correlation between the sequence of summary token locations $i=1,2,\ldots,N$ and the sequence of the corresponding similarity locations $\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha})$ of the tokens in the text:

\tau\left((i)_{i=1}^{N},\,(\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha}))_{i=1}^{N}\right)   (3)

Equation 3, with $\tau$ being the Kendall Tau-c correlation, estimates how well the summary keeps the order of the points of similar context. Figure 5 shows correlations of this measure with expert coherence scores.
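A minimal sketch of Eq. 3, assuming scipy's Kendall Tau implementation with its Tau-c variant, could look as follows (E and E_text are the masked-token embeddings of the summary and the text from the chosen layers, as before):

    import numpy as np
    from scipy.stats import kendalltau

    def order_coherence(E, E_text):
        # Kendall Tau-c between summary token positions and the positions of
        # their points of contextual similarity in the text (Eq. 3)
        best = (E @ E_text.T).argmax(axis=1)
        tau, _ = kendalltau(np.arange(len(best)), best, variant='c')
        return tau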

Figure 5: Kendall Tau-c correlations between the order of summary-text similarity points (Eq. 3) and expert coherence scores. Points of similarity are found using embeddings from the BERT layer on axis X for the text, and the layer on axis Y for the summary.

This figure is very different from the coherence part of Figure 1. For one, the diagonal - using the same layer for the summary and the text - is fine again. The plateau of high correlations is wide. Using the same layer for the summary and the text, we get a correlation with coherence of 0.20 or 0.21 in layers 7 to 15, which are quite far from the top layers. These layers, however, would be in the top part of a base BERT.

The coherence estimate by Eq.3 involves only the points of similar context, and (unlike Eqs.1 and 2) does not deal with comparing the ’responses’. Such context happens to be better represented for coherence by the layers in the lower-middle part of the large BERT.

5.2 Local order of similarities

For better insight into how consistency is a more 'local' quality than coherence, we modify the Kendall Tau correlation between two sequences by introducing a restriction on the distance between the elements of the first sequence. In applying such a modified Kendall correlation in Eq. 3, we consider only those pairs of summary token locations $i$ that are not further from each other than an (integer) interval $d$:

\tau_{d}\left((i)_{i=1}^{N},\,(\operatorname*{argmax}_{\alpha}(e_{i}e^{\prime}_{\alpha}))_{i=1}^{N}\right)   (4)

In Eq. 4 we use the modified 'local' Kendall Tau correlation $\tau_{d}$. We define $\tau_{d}$ as the usual Kendall Tau correlation between two sequences $X$ and $Y$, but with a restriction on the distance between the elements of $X$. As usual, $\tau_{d}=\frac{n_{c}-n_{d}}{n_{t}}$, where $n_{c}$ is the number of concordant pairs, $n_{d}$ is the number of discordant pairs, and $n_{t}$ is the total number of pairs, but the counts include only the pairs $\{(X_{i},Y_{i}),(X_{j},Y_{j})\}$ for which $|X_{i}-X_{j}|\leq d$. A concordant pair is a pair satisfying the condition $(X_{i}>X_{j}\;\textrm{and}\;Y_{i}>Y_{j})\;\textrm{or}\;(X_{i}<X_{j}\;\textrm{and}\;Y_{i}<Y_{j})$; otherwise the pair is discordant. Note that in our case (Eq. 4) the distance is the same whether it is defined between the $X$ elements or between the $X$ ranks.
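The definition above translates directly into a short quadratic-time routine; the sketch below is our own illustration of $\tau_{d}$, not the released implementation:

    def local_kendall(X, Y, d):
        # tau_d from Eq. 4: only pairs with |X_i - X_j| <= d are counted;
        # ties in either sequence count as discordant, as in the definition
        nc = nd = nt = 0
        for i in range(len(X)):
            for j in range(i + 1, len(X)):
                if abs(X[i] - X[j]) > d:
                    continue
                nt += 1
                if (X[i] - X[j]) * (Y[i] - Y[j]) > 0:
                    nc += 1
                else:
                    nd += 1
        return (nc - nd) / nt if nt else 0.0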

Figure 6 shows the correlations of the measure given by Eq.4 with SummEval expert scores of coherence and consistency.

Figure 6: Kendall Tau-c correlations between the 'local order' measure (Eq. 4) and SummEval expert scores of coherence and consistency. The correlations depend on the measure's distance restriction $d$ (axis X) and the layer from which the embeddings are taken (axis Y).

We can draw several simple conclusions from this figure and its underlying data:

  1. Consistency is a more 'local' quality than coherence: keeping order over shorter distances provides better correlations. For consistency, layer 21 provides the best correlations - exceeding 0.18 - at distances from 1 to 15. For coherence, layer 8 provides the best correlations - exceeding 0.22 - at distances from 19 to 33.

  2. There is a wide range of layers that are helpful both for coherence and for consistency. The best layers are in the lower half of BERT for coherence, and close to the top for consistency.

  3. The verification of facts (responses) by Eq. 2 or 1 is not important (and maybe not needed) for coherence: estimating the order of similarities by Eq. 4 or 3 is sufficient. Naturally, for consistency the measures of Eq. 2 or 1, based on a (simplistic) verification of facts, provide better correlations.

6 Conclusion

In this paper we presented ESTIME-soft (Eq. 2) - a variant of ESTIME applicable to more scenarios of estimating factual consistency and less sensitive to the choice of the BERT layers for the embeddings. We considered estimation of consistency as verification that the summary and the text respond the same way at all points of similar context. Each 'fact' is represented, accordingly, as a context and a response to the context.

The context similarity for consistency and fluency has better correlations with human scores when it uses embeddings from BERT layers close to the top (but not at the top). For coherence, the embeddings are more helpful when taken from different layers for the summary and for the text. A more robust estimation of coherence is obtained by inspecting the order of the points of contextual similarity (Eqs. 3, 4). Embeddings from the lower-middle part of BERT turned out to be the most helpful for this estimation.

A systematic comparison of evaluation measures would have to include many more measures (e.g. Fabbri et al. (2021b, a); Gabriel et al. (2021)). From our limited observations we suggest that there must be much better ways to evaluate summary qualities and achieve higher correlations with human scores. If currently superior correlations with human scores can be achieved by a simple measure that is based on an oversimplified representation of consistency or coherence and does not need specially tuned models or specially prepared data, then we probably should raise our expectations.

References

Appendix A Calculation of correlations of selected measures with expert scores

Table 1 in Section 3 shows two types of rank correlations: Kendall Tau-c and Spearman. We do not consider non-rank correlations like Pearson because they could be misleading: (1) there is subjectivity in the intervals between the '1-2-3-4-5' grades in human scoring, and (2) different automated measures may be differently dilated or contracted in different intervals of their ranges, which is irrelevant to their judgements when comparing summaries.

BLANC Vasilyev et al. (2020a) is calculated as the default BLANC-help (https://github.com/PrimerAI/blanc). ESTIME and Jensen-Shannon Louis and Nenkova (2009) values are negated. SummaQA Scialom et al. (2019) is represented by SummaQA-P (prob) and SummaQA-F (F1 score) (https://github.com/recitalAI/summa-qa). SUPERT Gao et al. (2020) is calculated as single-doc with 20 reference sentences 'top20' (https://github.com/yg211/acl20-ref-free-eval). BLEU Papineni et al. (2002) is calculated with NLTK. BERTScore Zhang et al. (2020) (by default using roberta-large; https://github.com/Tiiiger/bert_score) is represented by F1, precision (P) and recall (R). ROUGE Lin (2004) is calculated with the rouge-score package (https://github.com/google-research/google-research/tree/master/rouge).

Appendix B Alignment of embeddings across the layers

In Section 4.2 we have seen that embeddings of the masked tokens of texts (regardless of summarization) undergo sharp changes between a few layers - for example, the change of the cosine between neighbor-layer embeddings of the same token in Figure 4. If the cosine of the angle between embeddings within the same layer made a good characterization of the layer, we would expect to observe different distributions of the cosine at the layers before and after the sharp changes in Figure 4.

Figure 7: Distribution of the cosine between embeddings of two (masked) tokens, taken over all pairs of tokens in each layer, and normalized for each layer. The bin width of the cosine histogram at each layer is 0.01. This heatmap is obtained from all the SummEval texts and summaries used in this paper. There is no noticeable difference if only texts or only summaries are used.

However, the distribution shown in Figure 7 is more involved. Going from the input up, once the embeddings acquire context and become on average less aligned with each other, they pass through three full periods of becoming less aligned and then more aligned again. Only the upper two dips in alignment correspond to the sharp changes in Figure 4.
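The heatmap of Figure 7 can be reproduced layer by layer from the pairwise cosines of masked-token embeddings; a sketch with a hypothetical function name:

    import numpy as np

    def cosine_histogram(E, bin_width=0.01):
        # E: [n_tokens, d] masked-token embeddings from one layer
        unit = E / np.linalg.norm(E, axis=1, keepdims=True)
        cos = unit @ unit.T
        iu = np.triu_indices(len(E), k=1)        # all distinct token pairs
        bins = np.arange(-1.0, 1.0 + bin_width, bin_width)
        hist, edges = np.histogram(cos[iu], bins=bins)
        return hist / hist.sum(), edges          # normalized per layer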

Figure 8: Spearman correlation between ESTIME variants and SummEval expert scores. The upper row is for ESTIME-soft, the lower row for ESTIME. Each column is for one of the qualities. Points of similarity are found using embeddings (of masked tokens) from the BERT layer on axis X for the text, and the layer on axis Y for the summary.

Appendix C Spearman correlations: dependency on layers

In Section 4.1 we supported our discussion with figures showing Kendall Tau-c correlations. Figure 8 shows the corresponding Spearman correlations. A notable difference between Figures 8 and 1 is that in the Spearman version coherence and relevance lose more of their correlation level compared to consistency and fluency. We interpret this as a consequence of Kendall's $\tau$ (based on all pairs of observations) being more helpful than Spearman's $\rho$ in accounting for overall long-range correlations. Thus, Kendall's $\tau$ delivers more for coherence and relevance (as compared to the 'local' qualities, consistency and fluency).