Contextual Networks and Unsupervised Ranking of Sentences
Abstract
We construct a contextual network to represent a document with syntactic and semantic relations between word-sentence pairs, based on which we devise an unsupervised algorithm called CNATAR to score sentences and rank them through a bi-objective 0-1 knapsack maximization problem over topic analysis and sentence scores. We show that CNATAR outperforms the combined ranking of the three human judges provided with the SummBank dataset under both ROUGE and BLEU metrics, which in turn significantly outperforms each individual judge’s ranking. Moreover, CNATAR produces the highest ROUGE scores reported so far on DUC-02, and outperforms previous supervised algorithms on the CNN/DailyMail and NYT datasets. We also compare the performance of CNATAR with the latest supervised neural-network summarization models and compute oracle results.
Index Terms:
contextual network, topic analysis, T5 sentence similarity, bi-objective 0-1 knapsack

I Introduction
Ranking sentences (or segments of text) for a given article may be used, for example, as an oracle to build a hierarchical-reading tool that lets readers read the article one layer of sentences at a time in descending order of significance, as a selection criterion to construct a better search engine, or as a basis for constructing a summary.
We present an unsupervised algorithm called CNATAR (Contextual Network And Topic Analysis Rank) to rank sentences for a given article, which works as follows:
Step 1: Construct a contextual network (CN) to represent semantic and syntactic relations between sentences in the article by leveraging dependency trees and contextual embeddings of words to form weighted edges between word-sentence pairs.
Step 2: Devise an unsupervised algorithm called CNR (Contextual Network Rank) to score the nodes of the underlying CN using a PageRank algorithm biased toward the underlying article structure, and then score each sentence by summing the scores of the nodes belonging to that sentence, normalized by a BM25 normalizer.
Step 3: Carry out topic analysis using Affinity Propagation [1] based on T5 sentence similarity, and rank sentences by approximating a bi-objective 0-1 knapsack maximization problem to select sentences with the largest scores and topic diversity using the Within-Cluster Sum of Square metric and dynamic programming.
We show that CNATAR outperforms the combined ranking of all human judges over the SummBank dataset in all categories under both ROUGE and BLEU measures, and substantially outperforms each judge’s individual ranking. Moreover, CNATAR is efficient, with an average running time of about 0.7 seconds per document in SummBank on a commonplace CPU desktop computer. We also evaluate CNATAR on other datasets with abstractive summaries, including DUC-02, CNN/DailyMail (CNN/DM for short), and NYT. We show that CNATAR outperforms all previous algorithms on DUC-02, and outperforms all previous unsupervised algorithms as well as the supervised model REFRESH [2] trained on CNN/DM and NYT. We then compare the performance of CNATAR with the two latest supervised BERT-based models, BERTSum [3] and MatchSum [4].
II Related Work
Early sentence-ranking algorithms typically score sentences in connection with text summarization. Recent unsupervised methods include CP3 [5], Semantic SentenceRank (SSR) [6], BES (BERT Extractive Summarizer) [7], and PacSum [8].
CP3 models a document as a bipartite graph between words and sentences and uses Hyperlink-Induced Topic Search [9] to score sentences, maximizing sentence importance, non-redundancy, and coherence.
SSR introduces semantic relations overlooked by early unsupervised algorithms to construct word-level and sentence-level semantic graphs. It uses article-structure-biased (ASB) PageRank to score words and sentences separately, and then combines them to generate the final score for each sentence. SSR ranks sentences based on their final scores and topic diversity through semantic subtopic clustering. In so doing, SSR offers higher ROUGE scores on the DUC-02 dataset than CP3 and the previous unsupervised algorithms, and significantly outperforms each judge’s individual ranking on the SummBank dataset, but still falls short of the combined ranking of the three judges.
BES clusters sentence embeddings generated by BERT with K-means and ranks a sentence by the Euclidean distance between the sentence and the centroid of the underlying cluster. PacSum builds a complete graph based on dot products of sentence embeddings, attempting to capture the influence of any two sentences on each other’s importance through their relative positions in the document.
Recent supervised methods construct neural-network models to perform sequence scoring and labeling. REFRESH [2] encodes each sentence with a CNN encoder and scores sentences with an LSTM that globally optimizes the ROUGE metric through reinforcement learning. BERTSum [3] fine-tunes BERT to produce embeddings for the sentences of the input document and scores them with a summarization-specific layer trained on a labeled dataset such as CNN/DM. MatchSum [4], another BERT-based model, produces an embedding of the input document and selects the summary candidate whose embedding is most similar to the document embedding. A summary candidate is formed from a desired number of sentences chosen among sentences with high scores produced by other models such as BERTSum. These supervised models are trained on the CNN/DM and NYT datasets, and all impose a small upper bound on the size of an input sequence due to the difficulty of handling long sequences. BERTSum, for example, accepts an input sequence of up to 512 tokens (about 30 sentences on average) and drops the remaining text after the first 512 tokens. The need for a large labeled dataset to train a supervised neural-network model is also a major roadblock for languages without such datasets.
III Contextual Networks
Let $D$ denote a document of $n$ sentences and let $s_1, s_2, \ldots, s_n$ be the original sequence of sentences in $D$. Compute a dependency tree $T_j$ for each sentence $s_j$ and then replace each pronoun in a sentence with its original mention using a coreference resolution tool. A dependency tree for a sentence [10] is an undirected tree rooted at the main verb, connecting the other words according to grammatical relations (see Fig. 1 for an example).

Compute a contextual embedding for each word with respect to the sentence it appears in (we use BERT; see Section V). Next, mark non-content words (stop words) using a stopword filter. Stop words include determiners, prepositions, postpositions, coordinating conjunctions, copulas, and auxiliary verbs. For each sentence $s_j$, let $u$ and $v$ be two content words in $s_j$. If there are stop words $x_1, \ldots, x_t$ such that $u, x_1, \ldots, x_t, v$ forms a path on the dependency tree $T_j$, then add a new connection between $u$ and $v$ in $T_j$. Finally, remove stop words and replace every content word with its lemma using a lemmatizer.
In what follows, unless otherwise stated, “word” refers to its lemma. For each word $u$ in $s_j$, let $N(u)$ be the set of direct neighbors of $u$ on $T_j$. Two words $u$ and $v$ are said to be syntactically related if either they are neighbors (i.e., $u \in N(v)$) or they share a common third neighbor (i.e., there is a word $x$ with $x \in N(u)$ and $x \in N(v)$). This relation captures the subject-verb-object structure within a sentence, so that any two of these three words are syntactically related.
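The following is a minimal sketch (not the authors' code) of extracting syntactically related content-word pairs from a spaCy dependency parse; the model name and the single-pass stop-word contraction are illustrative assumptions.

```python
import itertools
import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy model with a parser

def syntactic_pairs(sentence: str):
    """Return pairs of content-word lemmas that are dependency-tree neighbors
    or share a common neighbor, treating stop words as pass-through nodes."""
    doc = nlp(sentence)
    # Undirected adjacency over the dependency tree.
    adj = {tok.i: set() for tok in doc}
    for tok in doc:
        if tok.head.i != tok.i:                      # skip the root's self-loop
            adj[tok.i].add(tok.head.i)
            adj[tok.head.i].add(tok.i)
    # Contract stop-word nodes: connect their content-word neighbors directly.
    # (A single pass; chains of several stop words would need repeated passes.)
    for tok in doc:
        if tok.is_stop or tok.is_punct:
            neigh = [j for j in adj[tok.i] if not (doc[j].is_stop or doc[j].is_punct)]
            for a, b in itertools.combinations(neigh, 2):
                adj[a].add(b)
                adj[b].add(a)
    content = [tok for tok in doc if not (tok.is_stop or tok.is_punct)]
    pairs = set()
    for u, v in itertools.combinations(content, 2):
        common = bool((adj[u.i] & adj[v.i]) - {u.i, v.i})
        if v.i in adj[u.i] or common:                # neighbors or common neighbor
            pairs.add((u.lemma_, v.lemma_))
    return pairs

print(syntactic_pairs("A dove with an olive branch is a common symbol of world peace."))
```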
Let $w_1 w_2 \cdots w_N$ be the original sequence of words in $D$. By comparing $w_a$ with $w_b$ we mean comparing the words at locations $a$ and $b$. Denote by $w_{jk}$ the $k$-th word in the $j$-th sentence. When $D$ is given, it is straightforward to determine, for each location $a$, the sentence index and within-sentence position of $w_a$, which can be expressed with a function $\pi$ such that $\pi(a) = (j, k)$ for all $a$. If $(j, k) \neq (j', k')$, then $w_{jk}$ and $w_{j'k'}$ are different entities even if they are occurrences of the same word.
Construct a weighted, undirected multi-edge graph $G = (V, E)$, where $V$ consists of the word-sentence pairs $w_{jk}$. The edge set $E$ is constructed as follows: (1) Semantic edges inside or across sentences. Connect $w_{jk}$ and $w_{j'k'}$ if the cosine similarity of their contextual embeddings is at least $\delta$ (a hyperparameter). (2) Syntactic edges inside the same sentence, namely, $j = j'$. Connect $w_{jk}$ and $w_{jk'}$ if they are syntactically related on the dependency tree $T_j$. (3) Syntactic edges across sentences, namely, $j \neq j'$. Connect $w_{jk}$ and $w_{j'k'}$ if there is a third node $w_{j'k''}$ with $j' \neq j$ and $k'' \neq k'$ such that $w_{jk}$ and $w_{j'k''}$ are semantically connected as in (1), and $w_{j'k''}$ and $w_{j'k'}$ are syntactically connected as in (2); or the mirror condition (i.e., swap the roles of $w_{jk}$ and $w_{j'k'}$ in the above condition) is true. Fig. 2 illustrates this construction.

In other words, we transform a syntactic relation between $w_{j'k''}$ and $w_{j'k'}$ inside $s_{j'}$ into a syntactic relation between $w_{jk}$ and $w_{j'k'}$ whenever $w_{j'k''}$ w.r.t. $s_{j'}$ is semantically close to $w_{jk}$ w.r.t. $s_j$. Note that requiring $j \neq j'$ is critical, for undesirable syntactic relations would result if $j = j'$ were allowed (see Remark 1 below).
To construct a contextual network for $D$, compute edge weights and merge multiple edges. If two nodes are connected by a syntactic edge, let its initial weight be 1 and normalize it by the total number of syntactic edges. If they are connected by a semantic edge, let its initial weight be the cosine similarity of their contextual embeddings and normalize it by the sum of the initial weights of all semantic edges. If two nodes are connected by both a syntactic edge and a semantic edge, merge the two edges into one edge whose weight is the sum of the corresponding syntactic and semantic weights.
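Below is a hedged sketch of this weighting-and-merging step using networkx; the function names and input formats are illustrative, not the authors' implementation.

```python
import networkx as nx

def _bump(G, u, v, w):
    """Add weight w to edge (u, v), creating the edge if it does not exist
    (this is how a syntactic and a semantic edge get merged into one)."""
    if G.has_edge(u, v):
        G[u][v]["weight"] += w
    else:
        G.add_edge(u, v, weight=w)

def build_contextual_network(syntactic_edges, semantic_edges):
    """syntactic_edges: list of (u, v) node pairs;
       semantic_edges : list of (u, v, cos) triples with cos >= delta;
       nodes are (sentence_index, word_position) pairs."""
    G = nx.Graph()
    syn_total = max(len(syntactic_edges), 1)
    sem_total = max(sum(c for _, _, c in semantic_edges), 1e-9)
    for u, v in syntactic_edges:
        _bump(G, u, v, 1.0 / syn_total)        # unit weight, normalized
    for u, v, c in semantic_edges:
        _bump(G, u, v, c / sem_total)          # cosine weight, normalized
    return G
```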
Remark 1. To see why we must require $j \neq j'$ when constructing syntactic edges across sentences, consider the following two sentences. $s_1$: A dove with an olive branch in its mouth is a common symbol of world peace. $s_2$: Doves, compared with pigeons, are smaller and slenderer, while pigeons are larger. Fig. 3 depicts the correct syntactic relations produced by our construction. If, however, we allowed $j = j'$, then because the cosine similarity of the contextual embeddings of “dove” and “pigeon” in $s_2$ is greater than the threshold $\delta$, these two nodes would be connected by a semantic edge, implying that “dove” in $s_2$ and “large” in $s_2$ would be syntactically connected, which is undesirable.

Remark 2. Co-occurrences of words have previously been used to capture syntactic relations between words, where two words are related if they co-occur in a small window of successive words. However, this method may falsely relate unrelated words and miss related words. For example, if two adjacent words in the same sentence fall in different sub-trees of its dependency tree, then they are unrelated from the syntactic point of view, but they would be made related because they co-occur. Co-occurrence also fails to capture related words that do not co-occur within a small window. Fig. 4 depicts the syntactic relations of words in the above sentences $s_1$ and $s_2$ with a window size of 3, which include undesirable syntactic edges between “pigeon” and “small”, “pigeon” and “slender”, and “mouth” and “symbol”, yet miss desirable syntactic edges between “dove” and “small”, “dove” and “slender”, and “dove” and “peace”, among others. Our construction of syntactically related words through dependency trees resolves these issues.

IV Sentence Ranking
Article structures also play a role in ranking sentences [6], which may be classified into four types based on locations where words tend to be more important: (1) Rectangle. Words are of the same importance in any part of the article. Narrative articles are typically of this type. (2) Inverted pyramid. Words toward the beginning of the article tend to be more important. News articles are typically of this type. (3) Pyramid. Words toward the end of the article tend to be more important. Argumentative articles are typically of this type. (4) Hourglass. Words toward the beginning and the end of the article tend to be more important. Research papers are typically of this type.
Let $\ell(v)$ denote the location weight of the word contained in node $v$ (to be constructed later), normalized so that $\sum_{v \in V} \ell(v) = 1$. CNR computes the score of node $v$ over the contextual network, denoted by $R(v)$, using the following article-structure-biased (ASB) PageRank:

$$R(v) = (1 - d)\,\ell(v) + d \sum_{u \in N(v)} \frac{w(u, v)}{\sum_{x \in N(u)} w(u, x)}\, R(u),$$

where $d$ is the damping factor, $N(v)$ is the set of neighbors of $v$ in the contextual network, and $w(u, v)$ is the edge weight of $(u, v)$. CNR then scores a sentence $s_j$ by summing up the scores of all the nodes contained in $s_j$ and normalizing the sum by a BM25 normalizer:

$$\mathrm{score}(s_j) = \frac{\sum_{v \in s_j} R(v)}{1 - b + b \cdot |s_j| / \mathrm{avgLen}},$$
where $|s_j|$ is the number of words contained in $s_j$, $\mathrm{avgLen}$ is the average sentence length of the document, and $b$ is a hyperparameter for penalizing sentences that are longer than average and rewarding sentences that are shorter than average. Since the ratio of a sentence’s length to the average sentence length of a document is often larger than 2 or smaller than 1/2, an appropriate value of $b$ should be modest, and we fix one value of $b$ for all experiments.
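The node and sentence scoring can be sketched as follows with networkx's personalized PageRank; the damping factor and the value of $b$ used here are illustrative choices, not the paper's.

```python
import networkx as nx

def score_sentences(G, location_weight, sentences, d=0.85, b=0.4):
    """G: contextual network with (sentence_idx, word_pos) nodes;
       location_weight: dict mapping each node to its location weight;
       sentences: list of token lists, one per sentence."""
    total = sum(location_weight.values())
    bias = {v: location_weight.get(v, 0.0) / total for v in G.nodes}
    # Personalized PageRank = PageRank biased toward the location weights.
    node_score = nx.pagerank(G, alpha=d, personalization=bias, weight="weight")
    avg_len = sum(len(s) for s in sentences) / len(sentences)
    scores = []
    for j, sent in enumerate(sentences):
        raw = sum(r for (i, _), r in node_score.items() if i == j)
        scores.append(raw / (1.0 - b + b * len(sent) / avg_len))  # BM25 normalizer
    return scores
```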
Next, we define the location weights so that they do not change abruptly from one location to the next. For the rectangle structure we simply use the uniform distribution over all word locations. For the inverted pyramid structure, we use a slowly decreasing quadratic curve to assign the location weight of the $i$-th word, controlled by a hyperparameter that determines how fast the weight decreases. The pyramid structure is a mirror image of the inverted pyramid: the location weight of the $i$-th word equals the weight of the $(N - i + 1)$-th word in the inverted pyramid structure. For the hourglass structure, we again use a quadratic curve, with its minimum value at the middle location between 1 and $N$.
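A hedged illustration of such location-weight curves is given below; the specific quadratic coefficients are assumptions chosen to match the qualitative description, not the paper's exact formulas.

```python
import numpy as np

def location_weights(n_words, structure="inverted_pyramid", alpha=0.5):
    """Return a normalized location-weight distribution over word positions."""
    i = np.arange(n_words) / max(n_words - 1, 1)        # positions scaled to [0, 1]
    if structure == "rectangle":
        w = np.ones(n_words)
    elif structure == "inverted_pyramid":
        w = 1.0 - alpha * i ** 2                        # slowly decreasing quadratic
    elif structure == "pyramid":
        w = 1.0 - alpha * (1.0 - i) ** 2                # mirror image
    elif structure == "hourglass":
        w = 1.0 - alpha + alpha * (2.0 * i - 1.0) ** 2  # minimum in the middle
    else:
        raise ValueError(f"unknown structure: {structure}")
    return w / w.sum()
```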
Sentence ranking should reflect the topics covered by the article. A topic clustering algorithm partitions sentences into topic clusters based on a sentence similarity measure. Let $\mathrm{div}(A)$ denote the topic diversity of a subset $A$ of sentences and let $k$ be the maximum number of sentences allowed. The sentence ranking problem can be modeled as the following bi-objective 0-1 knapsack maximization problem:

$$\max \sum_{j=1}^{n} x_j \cdot \mathrm{score}(s_j) \quad \text{and} \quad \max\ \mathrm{div}(\{s_j : x_j = 1\}) \quad \text{subject to} \quad \sum_{j=1}^{n} x_j \leq k,$$

where $x_j = 1$ if $s_j$ is selected, and 0 otherwise. A ranking of all sentences can be achieved by solving the problem starting from $k = 1$, incremented by 1 each time, until $k = n$.
CNATAR approximates the bi-objective 0-1 knapsack problem as follows. Suppose that the sentences are partitioned into $m$ topic clusters $C_1, \ldots, C_m$. Define a topic diversity function by dividing $k$ into $m$ numbers $k_1, \ldots, k_m$ in proportion to $\mathrm{WCSS}(C_1), \ldots, \mathrm{WCSS}(C_m)$, where $\mathrm{WCSS}(C_t)$ is the Within-Cluster Sum of Squares (WCSS) [11] of cluster $C_t$, i.e., the squared average distance of all the points within $C_t$ to the cluster centroid. Thus, $k_1 + \cdots + k_m = k$. Divide the bi-objective 0-1 knapsack into $m$ 0-1 knapsack problems, one over each $C_t$ with length bound $k_t$, where the $k_t$ sentences in $C_t$ with the highest scores form the maximum solution. Rank the sentences in all solutions according to their scores. Let $k'$ be the number of sentences selected so far. If $k' < k$, then select a remaining sentence with the highest score, rank it after the selected sentences, and increase $k'$ by 1. Repeat until $k' = k$.
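A minimal sketch of this cluster-wise approximation is shown below, assuming sentence scores, topic clusters, and sentence embeddings for the WCSS computation are already available; the names and rounding scheme are illustrative.

```python
import numpy as np

def rank_by_clusters(scores, clusters, embeddings, k):
    """scores: per-sentence scores; clusters: list of index lists (one per topic);
       embeddings: array of sentence vectors used for WCSS; k: sentences to rank."""
    wcss = []
    for idx in clusters:
        pts = embeddings[idx]
        wcss.append(((pts - pts.mean(axis=0)) ** 2).sum(axis=1).mean())
    share = np.array(wcss) / max(sum(wcss), 1e-9)         # diversity-based split of k
    budget = np.maximum(1, np.round(share * k).astype(int))
    ranked = []
    for idx, m in zip(clusters, budget):
        ranked.extend(sorted(idx, key=lambda i: scores[i], reverse=True)[:m])
    ranked.sort(key=lambda i: scores[i], reverse=True)
    # Top up with remaining highest-scoring sentences if rounding fell short of k.
    for i in sorted(range(len(scores)), key=lambda i: scores[i], reverse=True):
        if len(ranked) >= k:
            break
        if i not in ranked:
            ranked.append(i)
    return ranked[:k]
```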
Remark 3. Sometimes we may need to select sentences such that the total number of words contained in them does not exceed a certain limit $L$. The constraint of the 0-1 knapsack then becomes $\sum_j x_j \cdot |s_j| \leq L$. Using dynamic programming we can obtain a maximum solution to this version of the $t$-th 0-1 knapsack problem in $O(|C_t| \cdot L)$ time, which is feasible in practice since $|C_t|$ and $L$ are small. We will need this version of the 0-1 knapsack later when we deal with the DUC-02 dataset.
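The textbook dynamic program for this word-budget variant looks as follows; this is a sketch under the stated constraint, not the authors' code.

```python
def knapsack_sentences(scores, lengths, L):
    """Return the indices of sentences maximizing total score with at most L words.
       Runs in O(m * L) time for m candidate sentences."""
    m = len(scores)
    # best[w] = (best total score, chosen indices) using at most w words.
    best = [(0.0, [])] * (L + 1)
    for i in range(m):
        # Iterate capacities backwards so each sentence is used at most once.
        for w in range(L, lengths[i] - 1, -1):
            cand = best[w - lengths[i]][0] + scores[i]
            if cand > best[w][0]:
                best[w] = (cand, best[w - lengths[i]][1] + [i])
    return best[L][1]
```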
V Implementation and Evaluation
We preprocess documents with spaCy [12] to split the text into sentences, and resolve coreference within a sentence using the NeuralCoref pipeline [13]. We then generate, for each sentence, a dependency tree with spaCy, and generate a contextual embedding for each word in the sentence using BERT-Large [14]. Next, we identify stop words with spaCy’s stopword list and replace each content word with its lemma using spaCy’s lemmatizer. To generate a contextual embedding for each word w.r.t. a sentence, we sum the corresponding vector representations from the last 4 layers of BERT-Large, taking advantage of the fact that the lower layers carry more syntactic information and the higher layers more semantic information [15]. Finally, we use Affinity Propagation (AP) [1], an exemplar-based clustering algorithm, to cluster sentences, using a pretrained T5 similarity model [16] to compute sentence similarities. The T5 similarity model takes two sentences as input and returns a similarity score between 1 and 5. AP determines the number of topic clusters dynamically. The major components and dataflows of CNATAR are shown in Fig. 5.

![Fig. 5: Major components and dataflows of CNATAR](https://cdn.awesomepapers.org/papers/722e82a6-d50e-4229-937d-ed2449c9e51e/data.png)
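As one concrete illustration of the embedding step, the sketch below sums the last four hidden layers of BERT-Large for each word using the Hugging Face transformers library; the model name and the word-piece averaging are standard usage and our assumptions, not the authors' exact code.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

def word_embeddings(sentence: str):
    """Return (words, vectors): one contextual vector per whole word,
    formed by summing the last 4 hidden layers and averaging word pieces."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states          # embedding layer + 24 layers
    summed = torch.stack(hidden[-4:]).sum(dim=0)[0]  # (seq_len, 1024)
    words, vecs = [], []
    word_ids = enc.word_ids()
    for wid in sorted(set(word_ids) - {None}):       # skip special tokens
        idx = [i for i, w in enumerate(word_ids) if w == wid]
        words.append(tokenizer.decode(enc["input_ids"][0][idx].tolist()))
        vecs.append(summed[idx].mean(dim=0))
    return words, vecs
```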
Datasets. SummBank [17] is the most suitable dataset for evaluating sentence-ranking algorithms. Three human judges individually rank sentences for each of 200 news articles written in English, in categories of top 5% and top 10% to 90% with an increment of 10%. A combined sentence ranking of all judges, denoted by CMB-HR, is also provided for each article. DUC-02, CNN/DM, and NYT are other datasets for evaluating single-document summarization algorithms, each providing one or more human-written abstractive summaries per article as the gold standard. Each summary in DUC-02 consists of up to 100 words, while summaries average 3 sentences in CNN and 4 sentences in DM and NYT. These datasets, although not ideal for evaluating sentence ranking, are used to compare with summarization algorithms published in the last five years. We follow the standard training and evaluation split [18] of CNN/DM for supervised algorithms, and use scripts supplied by [19] to obtain the non-anonymized version of the data. The XSum [20] dataset provides a one-sentence abstractive summary for each article and is therefore inappropriate for evaluating sentence-ranking algorithms.
All of these datasets consist of news articles, so the location weight function for the inverted pyramid structure is applied.
Comparison on SummBank. We compare machine rankings and CMB-HR using each individual judge’s ranking as reference, and average the ROUGE [21] and BLEU [22] scores over all documents. Both CMB-HR and SSR outperform each individual judge’s ranking using the other two judges’ rankings as reference [6]. A full-range comparison against all judges is shown in Table I under the common measures of ROUGE-$n$ (R-$n$) and BLEU-$n$ (B-$n$). The highest score under each category is shown in boldface. It can be seen that under all categories, CNATAR outperforms CMB-HR and substantially outperforms SSR, PacSum, and BES, while SSR slightly outperforms CNR. The oracle results are computed by choosing, for each article and under each percentage category, the individual judge’s selection of sentences that has the highest R-1 score against all three judges’ selections. Because one judge’s selection is always chosen, the corresponding BLEU score is 100%. We carry out the same experiments on two 32GB-RAM computers, one with an Intel Core i7-8700K CPU and the other with an NVIDIA RTX 2080 Ti GPU. The average running time of CNATAR on each document is 0.73 seconds on the CPU machine and 0.6 seconds on the GPU machine.
Comparison on DUC-02. Table II shows the comparison results on the DUC-02 dataset against algorithms published in the last five years, where each algorithm extracts the highest-ranked sentences with a total length bounded by 100 words. Among these algorithms, CNN-W2V [23] is a supervised algorithm. In addition, we provide oracle results by selecting, for each document, a subset of sentences that maximizes the ROUGE score w.r.t. the benchmark summaries, except for the 6 articles with 78 sentences or more. For these 6 articles, to avoid combinatorial explosion, we use a greedy approximation: select the sentence with the highest R-1 score, then repeatedly add the sentence that yields the highest R-1 score together with the already-selected sentences, until the total number of words exceeds 100. To the best of our knowledge, no oracle results on DUC-02 have been published before.
Table II. ROUGE comparison on DUC-02.

Methods | R-1 | R-2 | R-SU4
---|---|---|---
Oracle | 52.0 | 29.1 | 29.2
CNATAR | 49.4 | 25.6 | 26.7
CNR | 49.2 | 24.8 | 26.1
SSR | 49.3 | 25.1 | 26.5
CP3 | 49.0 | 24.7 | 25.8
PacSum | 48.7 | 23.3 | 25.3
CNN-W2V | 48.6 | 22.0 | –
BES | 48.5 | 23.3 | 25.4
It can be seen that CNATAR outperforms all previous algorithms, supervised and unsupervised. The three latest supervised models trained on CNN/DM and NYT only produce 3 to 4 sentences for a given document, and so perform poorly on DUC-02, where each summary typically contains more than 4 sentences.
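For concreteness, the greedy oracle approximation used for the longest DUC-02 articles can be sketched as follows, assuming the rouge_score package; the function and variable names are illustrative.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def greedy_oracle(sentences, reference, word_budget=100):
    """Greedily add the sentence that most improves ROUGE-1 F1 against the
    reference summary until the word budget is exceeded."""
    selected, words = [], 0
    while words <= word_budget:
        best_i, best_f = None, -1.0
        for i in range(len(sentences)):
            if i in selected:
                continue
            hyp = " ".join(sentences[j] for j in selected + [i])
            f = _scorer.score(reference, hyp)["rouge1"].fmeasure
            if f > best_f:
                best_i, best_f = i, f
        if best_i is None:
            break
        selected.append(best_i)
        words += len(sentences[best_i].split())
    return selected
```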
Comparison on CNN/DM and NYT. Table III shows the comparison results of CNATAR, REFRESH [2], BERTSum [3], and MatchSum [4] on CNN/DM, NYT, and SummBank-4, a subset of 156 SummBank articles for which a top-4-sentence ranking is available (the remaining articles do not provide the top 4 sentences because SummBank only ranks sentences at certain percentages). All models output 4 sentences. Recall that MatchSum suffers from combinatorial blowup; to make training feasible, we form its summary candidates from the top 5 most important sentences on CNN/DM, the top 6 on NYT, and the top 9 on SummBank-4. It can be seen that CNATAR outperforms the unsupervised PacSum and the supervised REFRESH, while MatchSum achieves the highest ROUGE scores (R-L stands for ROUGE-L). On SummBank-4, CNATAR outperforms all the supervised models by a large margin, even though MatchSum tries all possible candidate outputs for each article in SummBank-4. The oracle results are computed by selecting the sentence with the highest R-1 score and then repeatedly adding the sentence that yields the highest R-1 score together with the already-selected sentences.
![Table III: Comparison on CNN/DM, NYT, and SummBank-4](https://cdn.awesomepapers.org/papers/722e82a6-d50e-4229-937d-ed2449c9e51e/cnndm.png)
Ablation study. We show that, over SummBank, each mechanism in CNATAR is necessary for achieving its overall performance. In particular, contextual networks, location weights, and topic-cluster-wise 0-1 knapsack are the most significant components. Table IV depicts the numerical results, where
![Table IV: Ablation results on SummBank](https://cdn.awesomepapers.org/papers/722e82a6-d50e-4229-937d-ed2449c9e51e/ablation.png)
V-NSR denotes a variant of CNATAR without contextual networks but using co-occurrences to capture weaker syntactic relations between words as in SSR [6], V-NLW denotes a variant without location weight functions, and V-NBM25 a variant that replaces the use of a BM25 normalizer with the standard normalizer of sentence length. Moreover, V-BERT and V-WMD denote two variants that replace the T5 similarities with, respectively, the cosine similarity of BERT embedding, and similarities based on Word Mover’s Distance [24] as in SSR. Finally, V-RR and V-CS denote two variants that replace the cluster-wise 0-1 knapsack with, respectively, round-robin selections from clusters as in SSR and proportional selections based on cluster size.
VI Conclusions and Final Remarks
CNATAR ranks sentences based on contextual networks and topic analysis, and achieves state-of-the-art results. Our construction of contextual networks, however, only takes advantage of a few recent NLP tools. More NLP tools may be leveraged, including part-of-speech tagging, role labeling, and sentiment analysis. With these extra language features, we expect that a more appropriate weight can be computed when merging two edges in the construction of a contextual network, instead of weighting both edge types equally as in the current construction. Topic diversity also plays an important role in ranking sentences, and so it would be interesting to investigate a better mathematical formulation of the diversity function and explore other topic clustering algorithms.
References
- [1] D. Dueck, “Affinity propagation: clustering data by passing messages,” Ph.D. dissertation, University of Toronto, 2009.
- [2] S. Narayan, S. B. Cohen, and M. Lapata, “Ranking sentences for extractive summarization with reinforcement learning,” in Proc. of NAACL 2018, Vol. 1, pp. 1747–1759.
- [3] Y. Liu, “Fine-tune BERT for extractive summarization,” arXiv preprint arXiv:1903.10318, 2019.
- [4] M. Zhong, P. Liu, Y. Chen, D. Wang, X. Qiu, and X. Huang, “Extractive summarization as text matching,” arXiv preprint arXiv:2004.08795, 2020.
- [5] D. Parveen, M. Mesgar, and M. Strube, “Generating coherent summaries of scientific articles using coherence patterns,” in Proc. of EMNLP 2016, pp. 772–783.
- [6] H. Zhang and J. Wang, “An unsupervised semantic sentence ranking scheme for text documents,” Integrated Computer-Aided Engineering, no. 28, pp. 17–33, 2021.
- [7] D. Miller, “Leveraging BERT for extractive text summarization on lectures,” arXiv preprint arXiv:1906.04165, 2019.
- [8] H. Zheng and M. Lapata, “Sentence centrality revisited for unsupervised summarization,” arXiv preprint arXiv:1906.03508, 2019.
- [9] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. of SODA 1998, pp. 668–677.
- [10] R. A. Hudson, Word grammar. Blackwell Oxford, 1984.
- [11] J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
- [12] M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing.”
- [13] T. Wolf, “State-of-the-art neural coreference resolution for chatbots,” Blog post, 2017.
- [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [15] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in Proc. of ACL 2019.
- [16] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019.
- [17] D. Radev et al., “SummBank 1.0 LDC2003T16,” Web download, Philadelphia: Linguistic Data Consortium, 2003.
- [18] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., “Abstractive text summarization using sequence-to-sequence rnns and beyond,” arXiv preprint arXiv:1602.06023, 2016.
- [19] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” arXiv preprint arXiv:1704.04368, 2017.
- [20] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” arXiv preprint arXiv:1808.08745, 2018.
- [21] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
- [22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. of ACL 2002, pp. 311–318.
- [23] Y. Zhang, M. J. Er, and M. Pratama, “Extractive document summarization based on convolutional neural networks,” in Proc. of IECON 2016, pp. 918–922.
- [24] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in Proc. of ICML 2015, pp. 957–966.