
Contextual Networks and Unsupervised Ranking of Sentences

Hao Zhang Dept. of Computer Science
University of Massachusetts
Lowell, USA
   You Zhou Dept. of Computer Science
University of Massachusetts
Lowell, USA
   Jie Wang Dept. of Computer Science
University of Massachusetts
Lowell, USA
Abstract

We construct a contextual network to represent a document with syntactic and semantic relations between word-sentence pairs, based on which we devise an unsupervised algorithm called CNATAR to score sentences and rank them through a bi-objective 0-1 knapsack maximization problem over topic analysis and sentence scores. We show that CNATAR outperforms the combined ranking of the three human judges provided on the SummBank dataset under both ROUGE and BLEU metrics, which in turn significantly outperforms each individual judge's ranking. Moreover, CNATAR produces the highest ROUGE scores to date on DUC-02, and outperforms previous supervised algorithms on the CNN/DailyMail and NYT datasets. We also compare the performance of CNATAR with the latest supervised neural-network summarization models and compute oracle results.

Index Terms:
contextual network, topic analysis, T5 sentence similarity, bi-objective 0-1 knapsack

I Introduction

Ranking sentences (or segments of text) of a given article may be used, for example, as an oracle for building a hierarchical-reading tool that lets readers read the article one layer of sentences at a time in descending order of significance, as a selection criterion for constructing a better search engine, or as a basis for constructing a summary.

We present an unsupervised algorithm called CNATAR (Contextual Network And Topic Analysis Rank) to rank sentences for a given article, which works as follows:

Step 1: Construct a contextual network (CN) to represent semantic and syntactic relations between sentences in the article by leveraging dependency trees and contextual embeddings of words to form weighted edges between word-sentence pairs.

Step 2: Devise an unsupervised algorithm called CNR (Contextual Network Rank) to score the nodes of the underlying CN using a PageRank algorithm biased toward the underlying article structure, and then score a sentence by summing the scores of the nodes belonging to that sentence and normalizing the sum with a BM25 normalizer.

Step 3: Carry out topic analysis using Affinity Propagation [1] based on T5 sentence similarity, and rank sentences by approximating a bi-objective 0-1 knapsack maximization problem to select sentences with the largest scores and topic diversity, using the Within-Cluster Sum of Squares metric and dynamic programming.

We show that CNATAR outperforms the combined ranking of all human judges over the SummBank dataset in all categories under both ROUGE and BLEU measures, and substantially outperforms each judge's individual ranking. Moreover, CNATAR is efficient, with an average running time of about 0.7 seconds per document in SummBank on a commonplace desktop CPU. We also evaluate CNATAR on other datasets with abstractive summaries, including DUC-02, CNN/DailyMail (CNN/DM for short), and NYT. We show that CNATAR outperforms all previous algorithms on DUC-02, and outperforms, on CNN/DM and NYT, all previous unsupervised algorithms as well as the supervised model REFRESH [2] trained on these datasets. We then compare the performance of CNATAR with the two latest supervised BERT-based models, BERTSum [3] and MatchSum [4].

II Related Work

Early sentence-ranking algorithms typically score sentences in connection with text summarization. Recent unsupervised methods include CP3 [5], Semantic SentenceRank (SSR) [6], BES (BERT Extractive Summarizer) [7], and PacSum [8].

CP3 models a document as a bipartite graph between words and sentences and uses Hyperlink-Induced Topic Search [9] to score sentences, maximizing sentence importance, non-redundancy, and coherence.

SSR introduces semantic relations overlooked by early unsupervised algorithms to construct word-level and sentence-level semantic graphs. It uses article-structure-biased (ASB) PageRank to score words and sentences separately, and then combines them to generate the final score for each sentence. SSR ranks sentences based on their final scores and topic diversity through semantic subtopic clustering. In so doing, SSR offers higher ROUGE scores on the DUC-02 dataset than CP3 and the previous unsupervised algorithms, and significantly outperforms each judge’s individual ranking on the SummBank dataset, but still falls short of the combined ranking of the three judges.

BES clusters sentence embeddings generated by BERT with K-means, and ranks a sentence by the Euclidean distance between the sentence and the centroid of the underlying cluster. PacSum builds a complete graph based on dot products of sentence embeddings, attempting to capture the influence of any two sentences on their respective importance through their relative positions in the document.

Recent supervised methods construct neural-network models to perform sequence scoring/labeling. REFRESH [2] encodes each sentence with a CNN encoder and scores sentences with an LSTM that globally optimizes the ROUGE metric through reinforcement learning. BERTSum [3] fine-tunes BERT to produce embeddings for the sentences of the input document and scores them with a summarization-specific layer trained on a labeled dataset such as CNN/DM. MatchSum [4], another BERT variant, produces an embedding for the input document and an embedding of the best summary candidate, namely the candidate most similar to the document embedding. A summary candidate is formed with a desired number of sentences selected from a pool of high-scoring sentences produced by other models such as BERTSum. These supervised models are trained on the CNN/DM and NYT datasets, and all impose a small upper bound on the size of an input sequence due to the difficulty of handling a long sequence. BERTSum, for example, accepts an input sequence of up to 512 tokens (about 30 sentences on average) and drops the remaining text after the first 512 tokens. The need for a large labeled dataset to train a supervised neural-network model is also a major roadblock for languages without such datasets.

III Contextual Networks

Let D denote a document of m sentences and let ⟨S_1, S_2, …, S_m⟩ be the original sequence of sentences in D. Compute a dependency tree for each sentence and then replace each pronoun in a sentence with its original mention using a coreference resolution tool. A dependency tree for a sentence [10] is an undirected tree rooted at the main verb, connecting the other words according to grammatical relations (see Fig. 1 for an example).

Figure 1: A dependency tree for “He bought her a beautiful dress last year.”

Compute a contextual embedding e(w, S) for each word w ∈ S. Next, mark non-content words (stop words) using a stopword filter. Stop words include determiners, prepositions, postpositions, coordinating conjunctions, copulas, and auxiliary verbs. For each sentence S, let w and w′ be two content words in S. If there are stop words σ_1, …, σ_r such that w, σ_1, …, σ_r, w′ forms a path on the dependency tree T_S, then add a new edge between w and w′ in T_S. Finally, remove stop words and replace every content word with its lemma using a lemmatizer.

In what follows, unless otherwise stated, by "word" we mean its lemma. For each word w in S, let N_w be the set of direct neighbors of w on T_S. Two words x and y are said to be syntactically related if either they are neighbors (i.e., x ∈ N_y) or they share a common third neighbor w (i.e., x ∈ N_w and y ∈ N_w). This relation captures, for example, the subject-verb-object structure within a sentence, in which any two of the three words are syntactically related.
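As a concrete illustration, the sketch below extracts the direct-neighbor sets from a spaCy dependency parse and tests the relatedness condition above. It is a minimal sketch rather than our full pipeline: the stop-word path contraction and lemmatization described above are omitted, and the function names and the en_core_web_sm model are illustrative assumptions.

```python
import spacy
from itertools import combinations

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with a dependency parser

def neighbor_sets(sentence):
    """Direct-neighbor sets N_w of each token on the undirected dependency tree T_S."""
    doc = nlp(sentence)
    N = {tok.i: set() for tok in doc}
    for tok in doc:
        if tok.head.i != tok.i:          # skip the root's self-loop
            N[tok.i].add(tok.head.i)
            N[tok.head.i].add(tok.i)
    return doc, N

def syntactically_related(i, j, N):
    """x and y are related if they are neighbors or share a common third neighbor."""
    return j in N[i] or len(N[i] & N[j]) > 0

doc, N = neighbor_sets("He bought her a beautiful dress last year.")
for i, j in combinations(range(len(doc)), 2):
    if syntactically_related(i, j, N):
        print(doc[i].text, "<->", doc[j].text)
```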

Let ⟨w_1, w_2, …, w_n⟩ be the original sequence of words in D. By comparing w_i with w_j we mean comparing the words at locations i and j. Denote by (w_i, S_k) the i-th word in the k-th sentence. When i is given, it is straightforward to determine k, which can be expressed with a function h(i); namely, w_i ∈ S_{h(i)} for all i. If i ≠ j, then (w_i, S_{h(i)}) and (w_j, S_{h(j)}) are different entities even if w_i = w_j and h(i) = h(j).

Construct a weighted, undirected multi-edge graph G_D = (V_D, E_D) with V_D = {v_i | v_i = (w_i, S_{h(i)}), 1 ≤ i ≤ n}. Let v_i, v_j ∈ V_D with i ≠ j. E_D is constructed as follows: (1) Semantic edges inside or across sentences. Connect v_i and v_j if the cosine similarity of e(v_i) and e(v_j) is at least δ (a hyperparameter; it is reasonable to set δ = 0.7). (2) Syntactic edges inside the same sentence, namely, h(i) = h(j). Connect v_i and v_j if w_i and w_j are syntactically related on the dependency tree T_{S_{h(i)}}. (3) Syntactic edges across sentences, namely, h(i) ≠ h(j). Connect v_i and v_j if there is a third node v_q = (w_q, S_{h(q)}) with h(q) = h(i) and q ≠ i such that w_q = w_j, v_q and v_j are semantically connected as in (1), and v_q and v_i are syntactically connected as in (2); or the mirror condition (i.e., swap i with j in the above condition) holds. Fig. 2 illustrates this construction.

Figure 2: Two syntactic edges between S_k and S_l.

In other words, we transform a syntactic relation between w_q and w_i inside S_{h(i)} into a syntactic relation between v_i and v_j if w_q w.r.t. S_{h(i)} is semantically close to w_j w.r.t. S_{h(j)}. Note that requiring w_q = w_j is critical, for allowing w_q ≠ w_j would result in undesirable syntactic relations (see Remark 1 below).

To construct a contextual network for D, compute edge weights and merge multiple edges. If v_i and v_j are connected by a syntactic edge, let its initial weight be 1 and normalize it by the total number of syntactic edges. If they are connected by a semantic edge, let its initial weight be the cosine similarity of e(v_i) and e(v_j) and normalize it by the sum of the initial weights of all semantic edges. If v_i and v_j are connected by both a syntactic edge and a semantic edge, then merge the two edges into one edge and let its new weight be the sum of the corresponding syntactic and semantic weights.
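The following sketch shows one way to assemble the weighted contextual network with networkx, assuming the contextual embeddings and the syntactically related node pairs (covering both within- and across-sentence cases) are given as inputs; the function and variable names are illustrative and not our exact implementation.

```python
import numpy as np
import networkx as nx

def build_contextual_network(embeddings, syntactic_pairs, delta=0.7):
    """Assemble the contextual network.
    embeddings: dict node -> contextual embedding e(v), node = (word position, sentence index)
    syntactic_pairs: set of node pairs already determined to be syntactically related
    """
    nodes = list(embeddings)
    sem, syn = {}, {}
    # (1) Semantic edges: cosine similarity of contextual embeddings >= delta.
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            u, v = nodes[a], nodes[b]
            cos = float(np.dot(embeddings[u], embeddings[v]) /
                        (np.linalg.norm(embeddings[u]) * np.linalg.norm(embeddings[v])))
            if cos >= delta:
                sem[(u, v)] = cos
    # (2)-(3) Syntactic edges (within and across sentences), initial weight 1.
    for (u, v) in syntactic_pairs:
        syn[(u, v)] = 1.0
    # Normalize each edge type separately, then merge multi-edges by summing weights.
    sem_total = sum(sem.values()) or 1.0
    syn_total = sum(syn.values()) or 1.0
    G = nx.Graph()
    G.add_nodes_from(nodes)
    for edges, total in ((sem, sem_total), (syn, syn_total)):
        for (u, v), w in edges.items():
            prev = G.get_edge_data(u, v, {}).get("weight", 0.0)
            G.add_edge(u, v, weight=prev + w / total)
    return G
```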

Remark 1. To see why we must require w_q = w_j when constructing syntactic edges across sentences, consider, for example, the following two sentences. S_1: A dove with an olive branch in its mouth is a common symbol of world peace. S_2: Doves, compared with pigeons, are smaller and slenderer, while pigeons are larger. Fig. 3 depicts the correct syntactic relations under our construction. If, however, we allowed w_q ≠ w_j, then because the cosine similarity of e("dove", S_1) and e("pigeon", S_2) is greater than the threshold value of δ = 0.7, these two nodes are connected by a semantic edge, and so "dove" in S_1 and "large" in S_2 would be syntactically connected, which is undesirable.

Figure 3: Syntactic relations via dependency trees on S_1 (red) and S_2 (green) mentioned in Remark 1.

Remark 2. Co-occurrences of words have previously been used to capture syntactic relations between words, where two words are related if they co-occur in a small window of successive words. However, this method may falsely relate unrelated words and miss related words. For example, if two adjacent words in the same sentence fall in different subtrees of its dependency tree, then they are unrelated from the syntactic point of view, but they would be made related because they co-occur. Co-occurrence also fails to capture related words that do not co-occur within a small window. Fig. 4 depicts the syntactic relations of words in the above sentences S_1 and S_2 with a window size of 3, which include undesirable syntactic edges between "pigeon" and "small", "pigeon" and "slender", and "mouth" and "symbol", yet miss desirable syntactic edges between "dove" and "small", "dove" and "slender", and "dove" and "peace", among others. Our construction of syntactically related words through dependency trees resolves these issues.

Figure 4: Syntactic relations through co-occurrences with a window size of 3 between words in S_1 and S_2 in Fig. 3.

IV Sentence Ranking

Article structure also plays a role in ranking sentences [6]. Articles may be classified into four types based on where words tend to be more important: (1) Rectangle. Words are of the same importance in any part of the article. Narrative articles are typically of this type. (2) Inverted pyramid. Words toward the beginning of the article tend to be more important. News articles are typically of this type. (3) Pyramid. Words toward the end of the article tend to be more important. Argumentative articles are typically of this type. (4) Hourglass. Words toward the beginning and the end of the article tend to be more important. Research papers are typically of this type.

Let LW(i) > 0 denote the location weight of the i-th word (to be constructed later) with \sum_{i=1}^{n} LW(i) = 1. CNR computes the score of node v_i over the contextual network, denoted by score(v_i), using the following article-structure-biased (ASB) PageRank:

score(v_i) = 0.85 M(v_i) + 0.15 LW(i), where
M(v_i) = \sum_{v_j \in Adj(v_i)} \frac{wt(v_i, v_j) \cdot score(v_j)}{\sum_{v_k \in Adj(v_j)} wt(v_j, v_k)},

and wt(u, v) is the edge weight of (u, v). It then scores a sentence S_k by summing the scores of all nodes v_i with h(i) = k and normalizing the sum with a BM25 normalizer:

score(S_k) = \frac{\sum_{i:\, h(i)=k} score(v_i)}{1 - \beta + \beta\left(\frac{|S_k|}{avsl}\right)},

where |S_k| is the number of words contained in S_k, avsl = \sum_{j=1}^{m} |S_j| / m is the average sentence length of the document, and β ∈ [0, 1] is a hyperparameter for penalizing sentences that are longer than average and rewarding sentences that are shorter than average. Since the ratio of a sentence's length to the average sentence length of a document is often larger than 2 or smaller than 1/2, an appropriate value of β should be near the first quartile of [0, 1], and we choose β = 0.2.
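A minimal sketch of this scoring step follows, assuming the contextual network G is a weighted graph (e.g., the networkx graph sketched above) whose nodes are (word position, sentence index) pairs and that the location weights LW are given as a dict; the power-iteration loop and helper names are illustrative.

```python
def asb_pagerank(G, loc_weight, d=0.85, iters=100, tol=1e-8):
    """Article-structure-biased PageRank over the contextual network G (a weighted graph).
    loc_weight: dict node -> LW(i), summing to 1 over all nodes."""
    nodes = list(G.nodes)
    score = {v: 1.0 / len(nodes) for v in nodes}
    out_wt = {v: sum(G[v][u]["weight"] for u in G[v]) or 1.0 for v in nodes}
    for _ in range(iters):
        new = {v: d * sum(G[v][u]["weight"] * score[u] / out_wt[u] for u in G[v])
                  + (1 - d) * loc_weight[v]
               for v in nodes}
        if max(abs(new[v] - score[v]) for v in nodes) < tol:
            return new
        score = new
    return score

def sentence_scores(node_scores, sentences, beta=0.2):
    """score(S_k): sum of node scores of S_k, divided by the BM25-style length normalizer."""
    avsl = sum(len(s) for s in sentences) / len(sentences)   # sentences: lists of words
    scores = []
    for k, s in enumerate(sentences):
        total = sum(sc for (w, h), sc in node_scores.items() if h == k)
        scores.append(total / (1 - beta + beta * len(s) / avsl))
    return scores
```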

Next, we define LW(i) so that it does not change abruptly from location i to location i+1. For the rectangle structure we simply use the uniform distribution LW(i) = 1/n. For the inverted pyramid structure, we use a slowly decreasing quadratic curve to assign the location weight of the i-th word by

LW(i) = \frac{6(\gamma-1)(i-n)^2}{(n-1)n(2n\gamma - n - \gamma)} + \frac{a(n-1)^2}{\gamma - 1},

where a = 6(\gamma-1)/\big((n-1)n(2n\gamma - n - \gamma)\big) is the coefficient of the quadratic term and γ = LW(1)/LW(n) is a hyperparameter (e.g., let γ = 5). The pyramid structure is a mirror image of the inverted pyramid, where the location weight of the i-th word equals the weight of the (n-i+1)-th word in the inverted pyramid structure. For the hourglass structure, we again use a quadratic curve, defined by LW(i) = ((i-n/2)^2 + 1) / \sum_{i=1}^{n}((i-n/2)^2 + 1), whose minimum is at the midpoint of 1 and n.

Sentence ranking should reflect the topics covered by the article. A topic clustering algorithm partitions sentences into topic clusters based on a sentence similarity measure. Let F(D′) denote the distribution of topics covered by a subset D′ ⊆ D, and L the maximum |D′| allowed. The sentence ranking problem can be modeled as the following bi-objective 0-1 knapsack maximization problem:

Maximize \sum_{k=1}^{m} score(S_k) \cdot x_k and F(\{S_k \mid x_k = 1\}),
subject to \sum_{k=1}^{m} x_k = L and x_k \in \{0, 1\},

where x_k = 1 if S_k is selected, and 0 otherwise. A ranking of sentences can be obtained by starting with L = 1 and incrementing L by 1 each time until L = |D| - 1.

CNATAR approximates the bi-objective 0-1 knapsack problem as follows. Suppose that D is partitioned into K topic clusters of sentences D_1, …, D_K. Define a topic diversity function F by dividing L into K numbers L_j = ⌊(W_j / \sum_{\ell=1}^{K} W_\ell) L⌋, where W_j is the Within-Cluster Sum of Squares (WCSS) [11] of cluster D_j, namely the sum of squared distances of all points in D_j to the cluster centroid. Thus, \sum_{j=1}^{K} L_j ≤ L. Divide the bi-objective 0-1 knapsack into K 0-1 knapsack problems over each D_j with length bound L_j for 1 ≤ j ≤ K: maximize \sum_{S_k \in D_j} score(S_k) \cdot x_k subject to \sum_{S_k \in D_j} x_k = L_j and x_k \in \{0, 1\}, where the L_j sentences in D_j with the highest scores form the maximum solution. Rank the sentences in all solutions according to their scores. Let L′ = \sum_{j=1}^{K} L_j. If L′ < L, then select a remaining sentence with the highest score, rank it after the selected sentences, and increase L′ by 1. Repeat until L′ = L.
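A sketch of this cluster-wise approximation follows, assuming the topic clusters, sentence scores, and WCSS values are already computed; the function and variable names are illustrative.

```python
def cnatar_select(clusters, scores, wcss, L):
    """Approximate the bi-objective 0-1 knapsack by cluster-wise selection.
    clusters: list of lists of sentence indices (topic clusters D_1..D_K)
    scores:   dict sentence index -> score(S_k)
    wcss:     list of within-cluster sums of squares W_1..W_K
    Returns L sentence indices in rank order."""
    total = sum(wcss)
    budgets = [int(w / total * L) for w in wcss]          # L_j proportional to W_j
    selected = []
    for c, Lj in zip(clusters, budgets):
        selected += sorted(c, key=lambda k: scores[k], reverse=True)[:Lj]
    selected.sort(key=lambda k: scores[k], reverse=True)  # rank selected sentences by score
    # If sum(L_j) < L, append the highest-scoring remaining sentences one at a time.
    rest = sorted((k for c in clusters for k in c if k not in selected),
                  key=lambda k: scores[k], reverse=True)
    return selected + rest[:L - len(selected)]
```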

Remark 3. Sometimes we may need to select sentences such that the total number of words contained in them does not exceed a certain limit L′_j. The constraint of the 0-1 knapsack then becomes \sum_{S_k \in D_j} x_k \cdot |S_k| ≤ L′_j. Using dynamic programming we can obtain a maximum solution to this version of the j-th 0-1 knapsack problem in O(|D_j| L′_j) time, which is feasible in practice since |D_j| and L′_j are small. We will need this version of the 0-1 knapsack later when we deal with the DUC-02 dataset.
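A sketch of this word-budget variant, solved with the standard 0-1 knapsack dynamic program in O(|D_j| L′_j) time as described above, follows; names are illustrative.

```python
def knapsack_select(cluster, scores, lengths, budget):
    """0-1 knapsack over one topic cluster D_j under a word budget L'_j (Remark 3).
    cluster: sentence indices; scores[k] = score(S_k); lengths[k] = |S_k| in words."""
    n = len(cluster)
    # dp[i][b]: best total score using the first i sentences within word budget b
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, k in enumerate(cluster, start=1):
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if lengths[k] <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - lengths[k]] + scores[k])
    chosen, b = [], budget                # trace back the chosen sentences
    for i in range(n, 0, -1):
        k = cluster[i - 1]
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(k)
            b -= lengths[k]
    return chosen[::-1]
```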

V Implementation and Evaluation

We preprocess documents with spaCy [12] to split the text into sentences, and resolve coreference within a sentence using the NeuralCoref pipeline [13]. We then generate, for each sentence, a dependency tree with spaCy, and generate a contextual embedding for each word in the sentence using BERT-Large [14]. Next, we identify stop words with spaCy's stopword list and replace each content word with its lemma using spaCy's lemmatizer. To generate a contextual embedding for each word w.r.t. a sentence, we sum the corresponding vector representations in the last 4 layers of BERT-Large, taking advantage of more syntactic information at the lower layers and more semantic information at the higher layers [15]. Finally, we use Affinity Propagation (AP) [1], an exemplar-based clustering algorithm, to cluster sentences, using a pretrained T5 model [16] to compute sentence similarities. The T5 similarity model takes two sentences as input and returns a similarity score between 1 and 5. AP dynamically determines the number of topic clusters. The major components and dataflows of CNATAR are shown in Fig. 5.
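For illustration, the sketch below computes per-token contextual embeddings by summing the last four hidden layers of BERT-Large. The use of the HuggingFace transformers interface is an assumption (the text only states that BERT-Large is used), and pooling WordPiece sub-tokens back into words is omitted for brevity.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

def contextual_embeddings(sentence):
    """Per-token contextual embeddings: sum of the last 4 hidden layers of BERT-Large."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: embedding layer + 24 encoder layers, each of shape (1, seq_len, 1024)
    last_four = torch.stack(outputs.hidden_states[-4:])
    token_vectors = last_four.sum(dim=0).squeeze(0)        # (seq_len, 1024)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return list(zip(tokens, token_vectors))
```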

Figure 5: CNATAR components and dataflows
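Topic clustering with Affinity Propagation over a precomputed similarity matrix can be sketched as follows; scikit-learn's AffinityPropagation with a precomputed affinity is an assumption about tooling (the text cites [1] but does not name a library), and t5_similarity stands in for the pretrained T5 scorer.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_sentences(sentences, t5_similarity):
    """Cluster sentences into topics with Affinity Propagation over a similarity matrix.
    t5_similarity(s1, s2) is assumed to return a similarity score in [1, 5]."""
    m = len(sentences)
    sim = np.array([[t5_similarity(sentences[i], sentences[j]) for j in range(m)]
                    for i in range(m)])
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(sim)                 # AP chooses the number of clusters itself
    return [[k for k in range(m) if labels[k] == c] for c in sorted(set(labels))]
```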
TABLE I: Comparison of CMB-HR, CNATAR, CNR, SSR, PacSum, and BES against all judges over SummBank

Datasets. SummBank [17] is the most suitable dataset for evaluating sentence-ranking algorithms. Three human judges individually rank sentences for each of its 200 English news articles in categories of top 5% and top 10% to 90% in increments of 10%. A combined sentence ranking of all judges, denoted by CMB-HR, is also provided for each article. DUC-02, CNN/DM, and NYT are other datasets for evaluating single-document summarization algorithms, consisting of one or more human-written abstractive summaries for each article as the gold standard. Each summary in DUC-02 consists of up to 100 words, while each summary consists of an average of 3 sentences in CNN and 4 sentences in DM and NYT. These datasets, although not ideal for evaluating sentence ranking, are used to compare with the summarization algorithms published in the last 5 years. We follow the standard training and evaluation split [18] of CNN/DM for supervised algorithms, and use the scripts supplied by [19] to obtain the non-anonymized version of the data. The XSum [20] dataset provides a one-sentence abstractive summary for each article, and so is inappropriate for evaluating sentence-ranking algorithms.

All of these datasets consist of news articles, and so the location weight function for the inverted pyramid structure is applied.

Comparison on SummBank. We compare machine rankings and CMB-HR against each individual judge's ranking as reference and average the ROUGE [21] and BLEU [22] scores over all documents. Both CMB-HR and SSR outperform each individual judge's ranking using the other two judges' rankings as references [6]. A full-range comparison against all judges is shown in Table I under the common measures of ROUGE-n (R-n) and BLEU-n (B-n), where n = 1, 2. The highest score under each category is shown in boldface. It can be seen that, under all categories, CNATAR outperforms CMB-HR and substantially outperforms SSR, PacSum, and BES; SSR slightly outperforms CNR. The oracle results are computed by choosing, for each article and under each percentage category, the individual judge's selection of sentences that has the highest R-1 score against all three judges' selections. Because one judge's selection is always selected, the corresponding BLEU score is 100%. We carry out the same experiments on two 32-GB-RAM computers, one with an Intel Core i7-8700K CPU and the other with an NVIDIA RTX 2080 Ti GPU. The average running time of CNATAR on each document is 0.73 seconds on the CPU machine and 0.6 seconds on the GPU machine.

Comparison on DUC-02. Table II depicts the comparison with the algorithms published in the last five years on the DUC-02 dataset, where each algorithm extracts the highest-ranked sentences with a total length bounded by 100 words. Among these algorithms, CNN-W2V [23] is a supervised algorithm. In addition, we provide oracle results by selecting, for each document, a subset of sentences that maximizes the ROUGE score w.r.t. the benchmark summaries, except for the 6 articles with 78 or more sentences. For these 6 articles we use a greedy approximation to avoid combinatorial explosion: select the sentence with the highest R-1 score first, then repeatedly add the sentence that, together with the already-selected sentences, yields the highest R-1 score, until the total number of words exceeds 100. To the best of our knowledge, no oracle results on DUC-02 have been published before.
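A sketch of this greedy oracle approximation follows; the rouge_score package is an assumption for computing R-1, and any R-1 scorer can be substituted.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def greedy_oracle(sentences, reference, word_limit=100):
    """Greedily add the sentence that maximizes R-1 of the selection against the reference."""
    selected, words = [], 0
    remaining = list(range(len(sentences)))
    while remaining and words <= word_limit:
        best = max(remaining, key=lambda k: scorer.score(
            reference, " ".join(sentences[i] for i in selected + [k]))["rouge1"].fmeasure)
        selected.append(best)
        remaining.remove(best)
        words += len(sentences[best].split())
    return [sentences[k] for k in selected]
```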

TABLE II: Comparison results (%) on DUC-02, where the italic numbers are extracted from the corresponding papers.
Methods  R-1  R-2  R-SU4
Oracle 52.0 29.1 29.2
CNATAR 49.4 25.6 26.7
CNR 49.2 24.8 26.1
SSR 49.3 25.1 26.5
CP3 49.0 24.7 25.8
PacSum 48.7 23.3 25.3
CNN-W2V 48.6 22.0 -
BES 48.5 23.3 25.4

It can be seen that CNATAR outperforms all previous algorithms, supervised and unsupervised. The three latest supervised models trained on CNN/DM and NYT only produce 3 to 4 sentences for a given document, and so perform poorly on DUC-02, where each summary typically contains more than 4 sentences.

Comparison on CNN/DM and NYT. Table III shows the comparison of CNATAR, REFRESH [2], BERTSum [3], and MatchSum [4] on CNN/DM, NYT, and SummBank-4, a subset of 156 articles in SummBank for which the top 4 sentences are provided (the remaining articles do not provide top 4 sentences because SummBank only ranks sentences at certain percentages). All models output 4 sentences. Recall that MatchSum suffers from a combinatorial blowup; to make it feasible to train, we select sentence candidates from the top 5 most important sentences on CNN/DM, the top 6 sentences on NYT, and the top 9 sentences on SummBank-4. It can be seen that CNATAR outperforms the unsupervised PacSum and the supervised REFRESH, while MatchSum achieves the highest ROUGE scores, where R-L stands for ROUGE-L. On SummBank-4, CNATAR outperforms all the supervised models by a large margin, even though MatchSum tried all possible candidate outputs for each article in SummBank-4. The oracle results are computed by selecting the sentence with the highest R-1 score first, then repeatedly adding the sentence that, together with the already-selected sentences, yields the highest R-1 score.

TABLE III: Comparison results (%), where the numbers in italic are taken from the corresponding papers.

Ablation study. We show that, over SummBank, each mechanism in CNATAR is necessary for achieving its overall performance. In particular, contextual networks, location weights, and topic-cluster-wise 0-1 knapsack are the most significant components. Table IV depicts the numerical results, where

TABLE IV: Results from ablation study.

V-NSR denotes a variant of CNATAR without contextual networks but using co-occurrences to capture weaker syntactic relations between words as in SSR [6], V-NLW denotes a variant without location weight functions, and V-NBM25 a variant that replaces the use of a BM25 normalizer with the standard normalizer of sentence length. Moreover, V-BERT and V-WMD denote two variants that replace the T5 similarities with, respectively, the cosine similarity of BERT embedding, and similarities based on Word Mover’s Distance [24] as in SSR. Finally, V-RR and V-CS denote two variants that replace the cluster-wise 0-1 knapsack with, respectively, round-robin selections from clusters as in SSR and proportional selections based on cluster size.

VI Conclusions and Final Remarks

CNATAR ranks sentences based on contextual networks and topic analysis, and achieves state-of-the-art results. Our construction of contextual networks, however, only takes advantage of a few recent NLP tools. More NLP tools may be leveraged, including part-of-speech tagging, semantic role labeling, and sentiment analysis. Using these extra language features, it is expected that a more appropriate weight can be computed when merging two edges in constructing a contextual network, instead of assigning equal weight as in the current construction. Topic diversity also plays an important role in ranking sentences, and so it would be interesting to investigate a better mathematical formulation of the diversity function and explore other topic clustering algorithms.

References

  • [1] D. Dueck, “Affinity propagation: clustering data by passing messages,” Ph.D. dissertation, University of Toronto, 2009.
  • [2] S. Narayan, S. B. Cohen, and M. Lapata, “Ranking sentences for extractive summarization with reinforcement learning,” in Proc. of NAACL 2018, Vol. 1, pp. 1747–1759.
  • [3] Y. Liu, “Fine-tune BERT for extractive summarization,” arXiv preprint arXiv:1903.10318, 2019.
  • [4] M. Zhong, P. Liu, Y. Chen, D. Wang, X. Qiu, and X. Huang, “Extractive summarization as text matching,” arXiv preprint arXiv:2004.08795, 2020.
  • [5] D. Parveen, M. Mesgar, and M. Strube, “Generating coherent summaries of scientific articles using coherence patterns,” in Proc. of EMNLP 2016, pp. 772–783.
  • [6] H. Zhang and J. Wang, “An unsupervised semantic sentence ranking scheme for text documents,” Integrated Computer-Aided Engineering, no. 28, pp. 17–33, 2021.
  • [7] D. Miller, “Leveraging BERT for extractive text summarization on lectures,” arXiv preprint arXiv:1906.04165, 2019.
  • [8] H. Zheng and M. Lapata, “Sentence centrality revisited for unsupervised summarization,” arXiv preprint arXiv:1906.03508, 2019.
  • [9] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” in Proc. of SODA 1998, pp. 668–677.
  • [10] R. A. Hudson, Word grammar.   Blackwell Oxford, 1984.
  • [11] J. A. Hartigan and M. A. Wong, “A k-means clustering algorithm,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
  • [12] M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing.”
  • [13] T. Wolf, “State-of-the-art neural coreference resolution for chatbots,” Blog post, 2017.
  • [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [15] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?” in Proc. of ACL 2019.
  • [16] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019.
  • [17] D. Radev et al., “Summbank 1.0 ldc2003t16. web download,” Philadelphia: Linguistic Data Consortium, 2003.
  • [18] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang et al., “Abstractive text summarization using sequence-to-sequence rnns and beyond,” arXiv preprint arXiv:1602.06023, 2016.
  • [19] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” arXiv preprint arXiv:1704.04368, 2017.
  • [20] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” arXiv preprint arXiv:1808.08745, 2018.
  • [21] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.
  • [22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proc. of ACL 2002, pp. 311–318.
  • [23] Y. Zhang, M. J. Er, and M. Pratama, “Extractive document summarization based on convolutional neural networks,” in Proc. of IECON 2016, pp. 918–922.
  • [24] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger, “From word embeddings to document distances,” in Proc. of ICML 2015, pp. 957–966.