
PoBRL: Optimizing Multi-Document Summarization by
Blending Reinforcement Learning Policies

Andy Su
Princeton University

Difei Su
University of British Columbia

John M. Mulvey
Princeton University

H. Vincent Poor
Princeton University
Abstract

We propose PoBRL, a novel reinforcement-learning-based framework for solving multi-document summarization. PoBRL jointly optimizes over the following three objectives necessary for a high-quality summary: importance, redundancy, and length. Our strategy decouples this multi-objective optimization into sub-problems that can be solved individually by reinforcement learning. Utilizing PoBRL, we then blend the learned policies together to produce a summary that is a concise and complete representation of the original input. Our empirical analysis shows state-of-the-art performance on several multi-document datasets. Human evaluation also shows that our method produces high-quality output.

1 Introduction

Summarization is the process of creating a concise and comprehensive representation of a large and complex information input. It has important applications in information retrieval (IR) and natural language processing (NLP). Traditionally, summarization is broadly categorized as extractive or abstractive Torres-Moreno (2014). In extractive text summarization, salient information and key sentences are extracted directly from the original text without modification. In abstractive text summarization, the summary is built by paraphrasing sentences or generating new words that are not in the original text.

In this paper, we tackle extractive summarization within a cluster of related texts (i.e., multi-document summarization; note that multi-document here means multiple documents about the same topic). Unlike in single-document summarization, redundancy is particularly important because sentences across related documents might convey overlapping information. Sentence extraction in such a setting is difficult because one needs to determine which pieces of information are relevant while avoiding unnecessary repetitiveness. In this sense, finding an optimal summary can be viewed as a combinatorial optimization problem. One way to solve this is the Maximal Marginal Relevance (MMR) algorithm Carbonell and Goldstein (1998). In the MMR approach, each sentence is extracted by maximizing its relevance while minimizing redundancy. Since MMR is an iterative and greedy algorithm, it only achieves local optimality. As another approach, McDonald (2007) developed a globally optimal MMR method by reformulating it as an inference problem and solving it with integer linear programming.

In this paper, we present a novel approach to multi-document summarization. We formulate it as a multi-objective optimization problem and seek to jointly optimize over the following three key metrics necessary for a high-quality output McDonald (2007):

  • Importance: only the relevant and critical information should be extracted;

  • Redundancy: the extracted information should not be self-repetitive;

  • Length: the final output should be as concise as possible.

Our approach optimizes these three objectives simultaneously and produces output that is a condensed and complete summary of the original content. At a high level, our method utilizes MMR to navigate the complicated and overlapping multi-document space. We present our PoBRL (Policy Blending with maximal marginal relevance and Reinforcement Learning) framework to decouple this multi-objective optimization problem into smaller problems that can be solved using reinforcement learning. Through the PoBRL framework, we then present a policy blending method that integrates the learned policies so that the combined policy is of high quality with respect to the three identified objectives. Through our newly proposed model and empirical evaluation on the Multi-News Fabbri et al. (2019) and DUC-04 datasets, our approach demonstrates state-of-the-art results. Our contributions are as follows:

  • We propose a novel PoBRL algorithmic framework that jointly optimizes over the three identified metrics via objective decoupling and independent policy learning. We combine the learned policies by "blending";

  • We propose a novel, simple, and effective multi-document summarization model that leverages the hierarchical structure of the multi-document input;

  • Our empirical performance on multi-document datasets shows state-of-the-art performance. Human evaluation also shows that our method produces fluent and high-quality output.

2 Related Work

Extractive multi-document summarization can be viewed as a discrete optimization problem. In this setting, each source sentence is treated as a binary variable. Under a pre-defined summary length constraint, the optimization problem must decide which sentences to include in the final summary. Several techniques have been introduced to solve this problem. For instance, Lin and Bilmes (2010) proposed maximizing sub-modular functions under a budget constraint. McDonald (2007) and Gillick and Favre (2009) reformulated the problem and solved it with integer linear programming.

With the recent advances in machine learning, deep learning techniques have gained traction in summarization. For instance, a more recent approach combines BERT with text matching for extractive summarization Zhong et al. (2020). In Lebanoff et al. (2018) and Zhang et al. (2018), the authors adapted networks trained on single-document corpora directly to the multi-document setting. In Jin et al. (2020), the authors proposed a multi-granularity network for multi-document summarization. In Yasunaga et al. (2017), the problem was solved with a graph convolutional network. Other related works include Conroy et al. (2013), Cao et al. (2017), Isonuma et al. (2017), and Nallapati et al. (2016).

Alternatively, reinforcement learning (RL) has been applied to summarization. In Narayan et al. (2018), the authors utilized REINFORCE Williams (1992) (an RL algorithm) to train a model for extractive single-document summarization. A similar approach was taken in Arumae and Liu (2018) by including a question-focused reward. In Chen and Bansal (2018), the authors performed abstractive summarization following sentence extraction. In Paulus et al. (2017), REINFORCE is applied to abstractive summarization. A more stable strategy (actor-critic Bahdanau et al. (2016)) has been considered in Li et al. (2018). Similar approaches have also been investigated in Dong et al. (2019) and Pasunuru and Bansal (2018). However, all these approaches focus on training a single RL policy for solving the summarization problem.

To the best of our knowledge, our PoBRL approach, which blends two RL policies (one optimizing for importance and the other for redundancy) to solve the multi-document summarization problem, has not been explored in the NLP literature. The closest related work is Czarnecki et al. (2018), in which the authors proposed a curriculum-learning-based RL method for compressing (mixing) multiple RL policies to solve a high-dimensional (non-NLP) task. Different from their curriculum learning approach, here we show that we can blend (mix) different policies by simply taking their MMR combination as in eqn. 1 (and eqn. 4), a much simpler and more intuitive approach for solving an NLP task (multi-document summarization) without introducing an extra layer of complexity.

3 Background

3.1 Maximal Marginal Relevance

We leverage MMR to navigate the complicated and overlapping sentence space of the multi-document input. MMR provides a balance between relevance and diversity. Formally, MMR is defined as:

\begin{split}
\text{MMR}=\max_{s_{i}\in R\backslash S}\big(\lambda\,\textit{Importance}(s_{i})\\
-(1-\lambda)\max_{s_{j}\in S}\textit{Redundancy}(s_{i},s_{j})\big)
\end{split} \qquad (1)

where $S$ is the produced summary, $s_i$ is the $i^{th}$ sentence from the documents $R$ excluding $S$, and $\lambda$ is a parameter balancing importance and diversity. In general, the importance and redundancy functions can take any form. In practice, eqn. 1 is implemented as an iterative and greedy algorithm that loops through each sentence in the documents up to a pre-defined summary length. The algorithm is shown in Appendix Section A.

3.2 Motivation for Reinforcement Learning

As noted by McDonald (2007) and Lin and Bilmes (2010), such an iterative and greedy approach is non-optimal because the decision made at each time step is local and does not consider its downstream consequences. The sequential nature of the algorithm naturally fits the reinforcement learning framework, in which each sentence extraction action is optimized for future (downstream) cumulative expected reward.

Figure 1: The multi-document search space: sentences across multiple related texts can represent the same ground-truth sentence. The search procedure for $\pi^{\text{PoBRL}}$ is to have $\pi^{\text{imp}}$ reach the relevant sentence region; from there, $\pi^{\text{red}}$ finds the most optimal candidate sentence.

Formally, we view sentence extraction as a Markov Decision Process (MDP). We define a state $X_t=\{R,S\}$, the combination of the input documents and the set of sentences extracted up to time $t$. At each time step, our policy makes a decision $a_t$ determining which sentence $s_t$ is to be extracted. We define a policy $\pi$ that searches over possible sentences across the space. Instead of fixing a pre-determined summary length, we let the policy $\pi$ optimize it directly. We define actions of two types:

a_{t}=\begin{cases}s_{t}\sim P_{\pi}(\cdot|X_{t}),&\text{if continue to extract}\\ \text{STOP extraction},&\text{otherwise}\end{cases}

With this action setup, our extractive agent can learn the optimal strategy for when to stop extraction. When it decides to continue extracting, each sentence in the input is extracted according to the probability $P_{\pi}(\cdot|X_t)$, where $X_t$ is the state that keeps track of what has been extracted so far ($S$) and what has not been extracted ($R$). Our goal is then to find an optimal extraction policy $\pi^{*}=\text{argmax}_{\pi\in\Pi}E(\sum_{t}r_{t}|X_{t},a_{t})$, where $r_t$ is the reward contribution of adding sentence $s_t$ to the system summary $S$.

If we define the reward contribution to be the ROUGE score Lin (2004) that we wish to optimize, then it can be shown (see Appendix B) that:

\text{ROUGE}(S)\approx\sum_{i}\text{imp}(s_{i})-\sum_{a,b\in S,\,a\neq b}\text{red}(a,b) \qquad (2)

Fig. 1 shows the sentence search process for the multi-document space. For each ground-truth sentence (labeled blue), there may exist more than one candidate sentence across multiple related texts. For the optimal summary $S^{*}$, we want to extract the single best sentence (colored dark orange) from a pool of similar-looking sentences (the region denoted as "relevant sentence space" in the figure).

For effective navigation in such a search space, we propose our PoBRL framework (Algorithm 1). Based on eqn. 2, the framework decomposes the importance and redundancy objectives into two independent sub-problems. Since each sub-problem has its own local objective, to maximize eqn. 2 we simply design policies $\pi^{\text{imp}}$ and $\pi^{\text{red}}$ that optimize each objective independently. At a high level, $\pi^{\text{imp}}$ narrows down the search by identifying the most relevant (or important) regions. From these regions, $\pi^{\text{red}}$ searches for the most relevant and diverse (non-redundant) sentence. This search procedure is illustrated in Fig. 1, where $\pi^{\text{imp}}$ leads the search toward the region containing candidates (1, 2, 3), which are sentences relevant to the ground-truth sentence; then, within this region, $\pi^{\text{red}}$ leads the search to discover the most relevant and non-redundant sentence (denoted as candidate 3*). Connecting $\pi^{\text{red}}$ and $\pi^{\text{imp}}$ together yields the optimal policy $\pi^{\text{PoBRL}}$, which selects the most optimal sentence.

4 PoBRL Algorithmic Framework

We present our PoBRL algorithmic framework in Algorithm 1. We start by independently training two policies $\pi^{\text{imp}}_{\theta}$ and $\pi^{\text{red}}_{\phi}$, where $\theta$ and $\phi$ represent two sets of neural network parameters. We train both policies with actor-critic (Section 4.1) but with different objectives: $\pi^{\text{imp}}_{\theta}$ is trained by optimizing over importance, while $\pi^{\text{red}}_{\phi}$ is trained with redundancy as its objective.

Input: λ, R
S ← {}
π_θ^imp ← train-RL-Policy(importance) as in Section 4.1.1
π_φ^red ← train-RL-Policy(redundancy) as in Section 4.1.2
while True do
    X_t = {R, S}
    P_{1:nm}^imp ← P_{π_θ^imp}(·|X_t) with π_θ^imp (as in Eqn. 3)
    P_{1:nm}^red ← P_{π_φ^red}(·|X_t) with π_φ^red (as in Eqn. 3)
    (optional) calculate a new value of λ (Eqn. 9)
    calculate π^PoBRL by blending π_θ^imp and π_φ^red (Eqn. 4)
    P_{1:nm}^PoBRL ← P_{π^PoBRL}(·|X_t) with π^PoBRL (as in Eqn. 5)
    s_select ∼ P_{1:nm}^PoBRL (extract a sentence by sampling from the extraction probabilities)
    S ← S ∪ {s_select}
    if STOP ∼ π^PoBRL(·|X_t) then
        break
    end if
end while
return system summary S
Algorithm 1: PoBRL Algorithmic Framework
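To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the PoBRL extraction loop. The policy objects and their `extraction_probs`/`stop_prob` interfaces, as well as the clipping and renormalization of the blended scores into a valid sampling distribution, are illustrative assumptions of this sketch rather than the implementation used in the paper.

```python
import numpy as np

def pobrl_extract(pi_imp, pi_red, sentences, lam=0.8, max_steps=50, seed=0):
    """Sketch of Algorithm 1: blend two trained policies with an MMR-style combination.

    Assumed (hypothetical) policy interface:
      - extraction_probs(remaining, summary) -> np.ndarray over `remaining`  (Eqn. 3)
      - stop_prob(remaining, summary)        -> float in [0, 1]
    """
    rng = np.random.default_rng(seed)
    remaining = list(sentences)   # R \ S
    summary = []                  # S

    for _ in range(max_steps):
        if not remaining:
            break
        p_imp = pi_imp.extraction_probs(remaining, summary)   # P^imp_{1:nm}
        p_red = pi_red.extraction_probs(remaining, summary)   # P^red_{1:nm}

        # Eqn. 5: MMR-style blend of the two extraction distributions.
        scores = lam * p_imp - (1.0 - lam) * p_red

        # Assumption: clip and renormalize so the blend can be sampled from.
        scores = np.clip(scores, 1e-12, None)
        probs = scores / scores.sum()

        idx = int(rng.choice(len(remaining), p=probs))        # s_select ~ P^PoBRL_{1:nm}
        summary.append(remaining.pop(idx))                    # S <- S U {s_select}

        # STOP action (taken from pi_imp here for simplicity; another assumption).
        if rng.random() < pi_imp.stop_prob(remaining, summary):
            break
    return summary
```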

Assume there are $n$ articles and $m$ sentences per article, for a total of $nm$ sentences in the input. Once we have trained these two policies, we calculate the extraction probability for each sentence in the input as:

P_{1:nm}\leftarrow P_{\pi}(s_{k}|X_{t})\text{ for all }k\in\{1,\dots,nm\} \qquad (3)

where $P_{1:nm}\in R^{nm\times 1}$ (with each entry representing the extraction probability of sentence $s_k$). We first calculate this sentence extraction probability from eqn. 3 with $\pi^{\text{imp}}_{\theta}$ as $P^{\text{imp}}_{1:nm}\leftarrow P_{\pi^{\text{imp}}_{\theta}}(\cdot|X_t)$. We repeat this calculation with $\pi^{\text{red}}_{\phi}$ as $P^{\text{red}}_{1:nm}\leftarrow P_{\pi^{\text{red}}_{\phi}}(\cdot|X_t)$.

Now, we are ready to blend these two policies together utilizing MMR, which is:

\pi^{\text{PoBRL}}(\cdot|X_{t})=\lambda\,\pi^{\text{imp}}_{\theta}(\cdot|X_{t})-(1-\lambda)\,\pi^{\text{red}}_{\phi}(a_{2,t}|X_{t},P^{\text{imp}}_{1:nm}) \qquad (4)

where the sentence extraction probabilities of this newly blended policy PoBRL are given by:

P^{\text{PoBRL}}_{1:nm}\leftarrow P_{\pi^{\text{PoBRL}}}(\cdot|X_{t})=\lambda P^{\text{imp}}_{1:nm}-(1-\lambda)P^{\text{red}}_{1:nm} \qquad (5)

Comparing Eqn. 5 and Eqn. 4 to the original MMR setup in Eqn. 1, we see that they are nearly identical in form. Finally, to determine which sentence to extract at $t=1$ (the first time step), we sample $s_{\text{select}}\sim\pi^{\text{PoBRL}}(\cdot|X_{t=1})$, with the sentence extraction probabilities given by Eqn. 5. We append the extracted sentence $s_{\text{select}}$ to the summary set $S$ and update the state $X_t=\{R,S\}$. We repeat the extraction process until the stop action is sampled.

4.1 Reinforcement Learning Training Algorithm: Actor-Critic

To generate labels, we utilize the following technique. For each sentence $\text{gold}_k$ in the gold summary, we find its most similar sentence from the input text, where similarity is defined using the ROUGE-L measure. That is, $\text{label}_k=\text{argmax}_{s_k}\text{ ROUGE-L}(\text{gold}_k,s_k)$, over each sentence $s_k$ in the example. We define rewards corresponding to the agent's actions. Formally,

r_{t}=\begin{cases}\text{ROUGE-L}(s_{k},\text{gold}_{k}),&\text{if extract}\\ \text{ROUGE-1}(\text{summ}_{\text{sys}},\text{summ}_{\text{gold}}),&\text{if stop}\end{cases} \qquad (6)

where $\text{summ}_{\text{sys}}$ is the system summary and $\text{summ}_{\text{gold}}$ is the gold summary. We utilize actor-critic Konda and Tsitsiklis (2003) to train our policy network. The gradient update for the weights of the policy network is:

\nabla J(\psi)=E_{\pi_{\psi}}[\nabla_{\psi}\log\pi_{\psi}(a_{t}|X_{t})A(X_{t},a_{t})] \qquad (7)

where $\pi_{\psi}(\cdot)$ is the policy function that takes an input state and outputs an action (in our case, which sentence to extract), and $\psi$ consists of the learnable parameters of the neural network. Next,

A(X_{t},a_{t}=s_{t})=Q(X_{t},a_{t}=s_{t})-V(X_{t}) \qquad (8)

is the advantage function, which measures the advantage of extracting a particular sentence $s_t$ over the rest of the sentences in the pool. Here, $V(X_t)=E_{\pi_{\theta}}\{\sum_t\gamma^{t}r_{t+1}|X_t\}$ measures the quality of the state, and $Q(X_t,a_t=s_t)=E_{\pi_{\theta}}\{\sum_t\gamma^{t}r_{t+1}|X_t,a_t=s_t\}$.

We instantiate two sentence extraction policy networks as in Algorithm 1, each with its own objective, and trained with actor-critic.
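As an illustration of the update in eqns. 7 and 8, below is a minimal PyTorch-style actor-critic step. The `policy_net`/`value_net` interfaces and the use of a one-step bootstrapped return as the estimate of $Q(X_t,a_t)$ are simplifying assumptions of this sketch, not the exact implementation in the paper.

```python
import torch
import torch.nn.functional as F

def actor_critic_step(policy_net, value_net, optimizer, state, action, reward,
                      next_state, done, gamma=0.99):
    """One advantage actor-critic update (eqns. 7-8), assuming:
    policy_net(state) -> logits over the candidate pool; value_net(state) -> scalar V(X_t)."""
    logits = policy_net(state)                       # unnormalized scores over the pool
    log_prob = F.log_softmax(logits, dim=-1)[action]

    value = value_net(state).squeeze()               # V(X_t)
    with torch.no_grad():
        next_value = torch.zeros(()) if done else value_net(next_state).squeeze()
        q_estimate = reward + gamma * next_value     # one-step estimate of Q(X_t, a_t)
        advantage = q_estimate - value               # A(X_t, a_t), eqn. 8

    policy_loss = -log_prob * advantage              # ascent on eqn. 7, written as descent
    value_loss = F.mse_loss(value, q_estimate)       # critic regression toward the target

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
    return float(advantage)
```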

4.1.1 train-RL-Policy(Importance)

We train our first policy network $\pi^{\text{imp}}_{\theta}$ to maximize importance at each step with actor-critic, as given by eqn. 7. To do this, we set the first part of the reward in eqn. 6 to ROUGE-L$_{F_1}$ to encourage this RL policy network to search for the most relevant sentence region in the multi-document space.

4.1.2 train-RL-Policy(Redundancy)

We train our second policy network $\pi^{\text{red}}_{\phi}$ with redundancy as its objective using actor-critic. To do this, we set the reward measure of eqn. 6 (first part) to ROUGE-L$_{\text{Precision}}$ to encourage this policy network to identify the similarity between each sentence in the pool and the ones already extracted. This in effect helps $\pi^{\text{red}}_{\phi}$ identify the best (the most relevant but non-redundant) sentence within the region identified by $\pi^{\text{imp}}_{\theta}$.
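To make the two reward variants concrete, the sketch below computes a sentence-level ROUGE-L from the longest common subsequence, with an F1 variant (the importance reward of Section 4.1.1) and a precision variant (the redundancy reward of Section 4.1.2). The whitespace tokenization and the simplified bookkeeping are assumptions of this sketch; a full setup would rely on a standard ROUGE package.

```python
def _lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference, mode="f1"):
    """Simplified sentence-level ROUGE-L; `mode` selects the reward variant."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = _lcs_len(cand, ref)
    precision, recall = lcs / len(cand), lcs / len(ref)
    if mode == "precision":                 # redundancy policy reward (Section 4.1.2)
        return precision
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # F1, importance policy (Section 4.1.1)

# Example: the same extracted sentence scored under the two variants.
r_imp = rouge_l("the storm hit the coast", "a storm hit the coast overnight", mode="f1")
r_red = rouge_l("the storm hit the coast", "the storm hit the coast early", mode="precision")
```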

4.2 Adaptive Tuning of $\lambda$ with the Advantage Function

The formulations in eqn. 4 and eqn. 5 work well as long as we have a reasonably good $\lambda$. In practice, the $\lambda$ parameter can be a fixed numerical value computed offline and determined beforehand. In reality, this is not an easy task because: 1) the optimal $\lambda$ might vary from document to document; and 2) the optimal value of $\lambda$ might change throughout the extraction process.

To solve this problem, we propose an approach that requires no separate offline training and can be computed online in an adaptive fashion by leveraging the RL framework. When we train the network with actor-critic, we get the advantage function automatically as a byproduct. The advantage function from eqn. 8 therefore provides a quality measure for comparing sentences from the pool. Utilizing it, we can construct a score function that rates each sentence according to the balance between importance and redundancy/diversity. Formally,

\lambda_{t}^{\text{adv}}=A^{\pi^{\text{imp}}_{\theta}}(X_{t},a_{t}=s_{t})-A^{\pi^{\text{red}}_{\phi}}(X_{t},a_{t}=s_{t}) \qquad (9)

where $A^{\pi^{\text{imp}}_{\theta}}$ and $A^{\pi^{\text{red}}_{\phi}}$ are the advantage functions of the importance and redundancy policy networks respectively, and $a_t$ is the action of extracting a particular sentence at time $t$. The advantage function defined in eqn. 8 scores sentences from the pool according to the importance or redundancy measure. Note that since the sentences already extracted are encoded by the state $X_t$, the advantage function is context aware.
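A minimal sketch of eqn. 9 follows. The advantage estimates are assumed to come from the two critics trained in Section 4.1; squashing the raw difference into [0, 1] so that it can be used directly as the MMR blend weight is an additional assumption of this sketch, not part of the paper's formulation.

```python
import numpy as np

def adaptive_lambda(adv_imp, adv_red, squash=True):
    """Eqn. 9: lambda_t^adv = A^imp(X_t, a_t) - A^red(X_t, a_t).

    `adv_imp` / `adv_red` are the advantage estimates of the candidate action under the
    importance and redundancy critics. The sigmoid squashing (keeping the blend weight
    in [0, 1]) is an assumption of this sketch.
    """
    lam = adv_imp - adv_red
    if squash:
        lam = 1.0 / (1.0 + np.exp(-lam))
    return lam

# Usage at extraction time t for a candidate sentence s_t:
lam_t = adaptive_lambda(adv_imp=0.6, adv_red=0.1)
```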

5 Hierarchical Sentence Extractor

Figure 2: The hierarchical RL sentence extractor model. Each multi-document input goes through the article-level path (dark blue) and the sentence-level path (light blue).

In this section, we describe our model structure. Our hierarchical sentence extractor consists of the following building blocks: a sentence encoder, a document-level encoder, attention, and a pointer network. The complete model structure is shown in Fig. 2. At a high level, the model operates like a pointer network, with the attention module adjusting the focus between the article level and the sentence level.

5.1 Sentence Encoder

Our model processes each input document by reading one sentence at a time. Each input sentence is first encoded by the sentence encoder. We use a temporal CNN to encode each sentence and denote each encoded sentence hidden state by $s^{art=j}_{enc_i}$, where $art=j$ indicates that the sentence came from the $j^{th}$ article and $i$ is the sentence index within that article. Each encoded sentence then goes through two paths: the article-level path and the sentence aggregation path.
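For concreteness, the following is a small PyTorch sketch of a temporal-CNN sentence encoder of the kind described here, parameterized as in Appendix D (filter widths 3, 4, and 5 over 128-dimensional word embeddings). The vocabulary size, number of filters, and max-over-time pooling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """Temporal CNN over word embeddings; produces one vector per sentence."""
    def __init__(self, vocab_size=30000, emb_dim=128, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, kernel_size=w) for w in widths])

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices, one padded sentence per row
        x = self.embed(token_ids).transpose(1, 2)           # (batch, emb_dim, seq_len)
        # max-over-time pooling for each filter width, then concatenate
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                     # (batch, n_filters * len(widths))

# Example: encode a batch of two padded sentences of length 12 -> (2, 300).
sent_vecs = SentenceEncoder()(torch.randint(1, 30000, (2, 12)))
```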

5.2 Article-Level Path

In our multi-document setting, each input contains one or more articles. Here, we formulate a representation for each article of the multi-document input. The article-level path corresponds to the dark blue region in Fig. 2. We implement the article-level encoder with a bidirectional LSTM. For each article of the multi-document input, this encoder takes the article's encoded sentences (e.g., $s^{art=1}_{enc_1}, s^{art=1}_{enc_2}, \dots, s^{art=1}_{enc_n}$). The article is then represented by the corresponding article-level sentence-encoded hidden states $h^{art=j}_{i}=\text{LSTM}(s^{art=j}_{enc_i},h^{art=j}_{i-1})$ for $i\in[1,n]$. We repeat this process for every article of the multi-document input text.
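Below is a matching sketch of the article-level path, assuming each article's sentences have already been encoded into fixed-size vectors by a sentence encoder such as the one above. The hidden size and layer count follow Appendix D (a 2-layer bidirectional LSTM with 256 units); the batch handling is simplified for illustration.

```python
import torch
import torch.nn as nn

class ArticleEncoder(nn.Module):
    """Bidirectional LSTM over the encoded sentences of a single article."""
    def __init__(self, sent_dim=300, hidden=256, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(sent_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, sent_vecs):
        # sent_vecs: (1, n_sentences, sent_dim) for one article j
        h, _ = self.lstm(sent_vecs)
        return h                    # (1, n_sentences, 2 * hidden): the states h_i^{art=j}

# One article with 7 sentence vectors from the sentence encoder -> (1, 7, 512).
article_states = ArticleEncoder()(torch.randn(1, 7, 300))
```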

5.3 Sentence Aggregate Path

In this path, we feed each sentence $s^{art=j}_{enc_i}$, for $i\in[1,n]$ and $j\in[1,m]$, one by one, concatenating them together as if they came from a single piece of text. We utilize a pointer network Vinyals et al. (2015) that captures local contextual and semantic information. This pointer network uses the information provided by the article-level encoder and forms the basis for sentence extraction. We implement the pointer network with an encoder-decoder LSTM. Formally, we can write:

u^{t}_{k}=v^{T}\tanh(W_{sent}e_{k}+W_{art}c_{t}) \qquad (10)

where $c_t$ is the output of the attention mechanism at output time $t$, and $e_k$ is the hidden state of the encoder pointer-network LSTM. In other words, $(e_1, e_2, e_3, \dots, e_k, \dots, e_{n\times m})$ are the encoder hidden states for the input sentences $s^{art=j}_{enc_i}$ for $i\in[1,n], j\in[1,m]$. Lastly, $v$, $W_{sent}$, and $W_{art}$ are learnable weight matrices.

5.4 Attention

The attention module integrates article-level information into the sentence-aggregate-level extraction network. We implement this module with dot-product attention Luong et al. (2015).

Let $d_t$ be the hidden state of the decoder pointer-network LSTM at output time index $t$, and let $h_k$ be the corresponding article-level sentence-encoded hidden state; then:

c_{t}=\sum_{k}\alpha_{t,k}h_{k},\qquad\alpha_{t,k}=\frac{e^{\text{score}(d_{t},h_{k})}}{\sum_{k^{\prime}}e^{\text{score}(d_{t},h_{k^{\prime}})}} \qquad (11)

where $\text{score}(d_t,h_k)=d_t^{T}h_k$.
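The two equations above can be combined in a few lines, as in the sketch below, which computes the dot-product attention of eqn. 11 and then the pointer score of eqn. 10 for every candidate sentence. The tensor shapes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointerAttentionScorer(nn.Module):
    """u_k^t = v^T tanh(W_sent e_k + W_art c_t), with c_t from dot-product attention."""
    def __init__(self, enc_dim=512, art_dim=512, attn_dim=256):
        super().__init__()
        self.w_sent = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_art = nn.Linear(art_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, d_t, enc_states, art_states):
        # d_t: (art_dim,) decoder state; enc_states: (nm, enc_dim); art_states: (K, art_dim)
        alpha = torch.softmax(art_states @ d_t, dim=0)    # eqn. 11: score(d_t, h_k) = d_t^T h_k
        c_t = alpha @ art_states                          # context vector c_t
        # eqn. 10: pointer score u_k^t for every candidate sentence k
        u = self.v(torch.tanh(self.w_sent(enc_states) + self.w_art(c_t))).squeeze(-1)
        return u                                          # (nm,)

scores = PointerAttentionScorer()(torch.randn(512), torch.randn(20, 512), torch.randn(6, 512))
```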

5.5 Actor Critic Sentence Extraction Policy Network

Now we can connect the components above to build our sentence extractor. During the actual sentence extraction process, we rank each candidate in the pool set by $P(s_k|s_{k-1},\dots,s_0)=\text{softmax}(u_k)$ for $k\in\text{pool set}$. The pool set is initialized to contain all of the sentences in the input text. When a particular sentence has been extracted, we remove it from the pool set; that is, $\text{Pool Set}\leftarrow\text{Pool Set}\setminus\{s_k\}$.
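The pool-set bookkeeping can be implemented by masking already-extracted sentences before the softmax, as in this sketch; representing the pool by an index mask (setting extracted positions to negative infinity) is our own implementation choice, not necessarily the paper's.

```python
import torch

def extraction_distribution(pointer_scores, extracted_idx):
    """Softmax over u_k restricted to the pool set (sentences not yet extracted)."""
    scores = pointer_scores.clone()
    if extracted_idx:
        scores[list(extracted_idx)] = float("-inf")   # Pool Set <- Pool Set \ {s_k}
    return torch.softmax(scores, dim=-1)

# Example: 5 candidate sentences, sentences 1 and 3 already extracted.
probs = extraction_distribution(torch.randn(5), extracted_idx={1, 3})
next_sentence = int(torch.multinomial(probs, 1))
```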

5.6 Training Details

Training a reinforcement learning neural network from scratch is difficult and time-consuming because the initial policy network tends to behave randomly. To accelerate the learning process, we warm-start it with supervised learning.

Warm Start: Supervised Learning. We set the training objective to the negative log-likelihood and train our network by minimizing this objective. After the supervised pre-training, we then train with actor-critic.
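A sketch of this warm-start phase is given below: the extractor is first trained to point at the heuristic labels from Section 4.1 with a cross-entropy (negative log-likelihood) loss before switching to actor-critic. The optimizer setting (Adam, learning rate 1e-3) follows Appendix D; the data format and the `extractor(state)` interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

def warm_start(extractor, batches, epochs=3, lr=1e-3):
    """Supervised pre-training: minimize the NLL of the labeled sentence at each step.

    Assumes extractor(state) returns unnormalized pointer scores over the pool and each
    batch item is (state, label_idx), with label_idx from the ROUGE-L matching heuristic.
    """
    opt = torch.optim.Adam(extractor.parameters(), lr=lr)
    nll = nn.CrossEntropyLoss()              # cross-entropy = NLL of the softmax
    for _ in range(epochs):
        for state, label_idx in batches:
            scores = extractor(state).unsqueeze(0)           # (1, nm)
            loss = nll(scores, torch.tensor([label_idx]))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return extractor
```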

6 Experimentation

We showcase the performance of our methods on the following two multi-document datasets: Multi-News and DUC 2004. We begin by defining the following baselines:

Lead-n. We concatenate the first $n$ sentences from each article to produce a system summary.

MGSum (ext/abs), an extractive/abstractive summarization method with a Multi-Granularity Interaction Network Jin et al. (2020).

MatchSum, an extractive summarization method with text-matching and BERT Zhong et al. (2020).

GRU+GCN: a model that uses a graph convolution network combined with a recurrent neural network to learn sentence saliency Yasunaga et al. (2017).

OCCAMS-V a topic modeling method Conroy et al. (2013).

CopyTransformer This is a transformer model from Gehrmann et al. (2018) that implements the PG network.

Hi-Map This is a hierarchical LSTM model with MMR attention Fabbri et al. (2019).

FastRL-Ext This is a reinforcement learning approach to extractive summarization focused on single-document summarization Chen and Bansal (2018). Here, we adapt it and apply it to the multi-document case.

RL w/o Blend We train the multi-document Hierarchical Sentence Extractor (this paper) with actor-critic without policy blending (i.e., a single policy instead of a blend of two policies). This can be regarded as setting $\lambda=1.0$ in the PoBRL method (eqn. 5 and eqn. 4), which optimizes purely for importance.

6.1 Experimental Setup

We report the ROUGE $F_1$ scores for ROUGE-1, ROUGE-2, and ROUGE-SU (skip bigrams with a maximum distance of four). For our PoBRL model, we instantiate two actor-critic policy networks representing $\pi^{\text{imp}}_{\theta}$ and $\pi^{\text{red}}_{\phi}$ respectively. We then blend these two policies by forming $\pi^{\text{PoBRL}}$ as in eqn. 4, with extraction probabilities as in eqn. 5. For complete network hyper-parameters and training details, please refer to Appendix D.

Besides using a fixed $\lambda$ value, we also evaluate a $\lambda$ calculated using the advantage function as in eqn. 9, denoted PoBRL($\lambda=\lambda_t^{\text{adv}}$). At each extraction step, the optimal $\lambda$ is recalculated by balancing the importance and redundancy considerations.

Method R-1 R-2 R-L
Lead-3 39.41 11.77 18.03
CopyTransformer 43.57 14.03 20.50
Hi-Map 43.47 14.87 21.38
FastRL-Ext 43.56 16.05 39.69
MGSum-ext 44.75 15.75 40.84
MGSum-abs 46.00 16.81 41.36
MatchSum 46.20 16.51 41.89
RL w/o Blend 44.01 17.63 33.96
PoBRL (λ=0.8) (Ours) 45.13 16.69 41.12
PoBRL (λ=λ_t^adv) (Ours) 46.51 17.33 42.42
Table 1: Multi-News dataset

6.2 Empirical Performance

We empirically evaluate the performance of the models on the following two multi-document datasets.

Multi-News The Multi-News dataset Fabbri et al. (2019) is a large dataset containing news articles scraped from over 1,500 sites. The average numbers of words per document and per ground-truth summary are much larger than in DUC-04. The dataset has 44,972 training examples and 5,622 examples each for validation and test. We trained our models on this dataset and report their performance in Table 1. Our best model, PoBRL($\lambda=\lambda_t^{\text{adv}}$), achieves a score of 46.51/17.33/42.42, outperforming the other baselines on all metrics (the strongest being MatchSum and MGSum). Here, we also show the performance of our model for different values of $\lambda$. When $\lambda=1.0$, the model reduces to a single policy without blending (denoted RL w/o Blend). When $\lambda=\lambda_t^{\text{adv}}$, the value of $\lambda$ is adaptively determined by the advantage function as in eqn. 9. For a fixed value of $\lambda=0.8$, we achieve a good score (45.13/16.69/41.12). However, intuitively the optimal value should vary from document to document, and it might change during the extraction process. This is reflected by the PoBRL($\lambda=\lambda_t^{\text{adv}}$) advantage method, which provides a significant performance boost. On the other hand, RL w/o Blend represents the model without the policy-blend strategy. Comparing RL w/o Blend with PoBRL demonstrates the effect of our policy-blending algorithm: incorporating policy blending on top of this hierarchical sentence extraction model provides a remarkable performance boost (going from RL w/o Blend's 44.01/15.65/33.96 to PoBRL's 46.51/17.33/42.42).

DUC-04 This is a test-only dataset that contains 50 clusters, each with 10 documents. For each cluster, the dataset provides 4 human-crafted summaries as ground truth. Following Fabbri et al. (2019) and Lebanoff et al. (2018), we trained our models on the CNN-DM dataset Chen et al. (2016) and report the performance of the different models in Table 3. Our best model, PoBRL($\lambda=\lambda_t^{\text{adv}}$), achieves 38.67/10.23/13.19 (ROUGE-1/2/SU), a substantial improvement over all other baselines (the closest being OCCAMS-V and GRU+GCN). On the other hand, PoBRL with a fixed value of $\lambda=0.9$ still achieves good performance (38.13/9.99/12.85). Comparing PoBRL with a fixed $\lambda$ to $\lambda=\lambda_t^{\text{adv}}$ shows the power of calculating $\lambda$ with the advantage function. The effect of our policy-blending algorithm is demonstrated by comparing RL w/o Blend with PoBRL; here, we also observe a major performance boost (36.95/9.12/11.66 versus 38.67/10.23/13.19).

6.3 Human Evaluation

In this experiment, we randomly sampled 100 summaries of the Multi-News dataset from the following models: Hi-Map, CopyTransformer, MatchSum, and PoBRL. Following Chu and Liu (2019), we asked judges (10 for each sample) to rate the quality of the system-generated summaries on a scale of 1 to 5, with the lowest score being the worst and the highest the best. (All judges are native English speakers with at least a bachelor's degree and experience in scientific research; we compensated the judges at an hourly rate of $28.) We evaluated the quality of the system-generated summaries on the following metrics, as in Chu and Liu (2019) and Dang (2006): Grammaticality, Non-redundancy, Referential clarity, Focus, and Structure and Coherence. Our results are shown in Table 2.

Method Grammar Non-redundancy Referential clarity Focus Structure and Coherence
CopyTransformer 3.83 3.26 3.89 3.94 4.04
Hi-Map 4.06 3.46 3.74 4.01 3.89
MatchSum 4.26 3.32 4.12 4.00 3.96
PoBRL (Ours) 4.28 4.68 4.19 4.23 4.08
Table 2: Human evaluation on system generated summaries of Multi-News dataset
Method R-1 R-2 R-SU
Lead-1 30.77 8.27 7.35
CopyTransformer 28.54 6.38 7.22
Hi-Map 35.78 8.90 11.43
GRU+GCN: 38.23 8.48 11.56
OCCAMS-V 38.50 9.76 12.86
RL w/o Blend 36.95 9.12 11.66
PoBRL (λ=0.9) (Ours) 38.13 9.99 12.85
PoBRL (λ=λ_t^adv) (Ours) 38.67 10.23 13.19
Table 3: DUC-04 dataset
Redundancy Measures Cosine (Higher the better) ROUGE-L (Lower the better)
RL w/o Blend 12.49 2.28
PoBRL 12.73 0.19
Table 4: Redundancy analysis. The first column reports the cosine similarity measure (higher is better); the second column reports the ROUGE-L measure (lower is better).

7 Analysis

Since the Multi-News dataset provides a larger number of more varied and diversified input examples, it is the dataset we analyze below, using 1000 randomly sampled examples.

7.1 Varying the Number of Input Documents

Unlike DUC-04, each example in the Multi-News dataset has a different number of source documents (ranging from 2 to 9). In general, the more source documents an example has, the longer the input. In Fig. 3, we show how our proposed method handles different numbers of source documents per multi-document input. As can be seen in the figure, the performance is almost uniform, showing that our method handles larger or smaller numbers of input documents (which also translates to longer or shorter input text) equally well.

Figure 3: Model ROUGE-1 performance on multi-document inputs with different numbers of source documents.

7.2 Redundancy

We measure the redundancy of our system summaries by comparing the policy-blending formulation of our model (i.e., PoBRL) against our non-policy-blend model (i.e., RL w/o Blend). This comparison allows us to understand the effectiveness of the policy-blending strategy. We measure redundancy using the ROUGE-L score and BERT Devlin et al. (2019) embeddings, separately. Our evaluation criterion is the following: for each sentence in the generated summary, we compute the redundancy metric between that sentence and the rest of the sentences in the summary. In the ROUGE-L experiment, we report the average sentence-level ROUGE-L score within the input example. PoBRL attains the lowest score of 0.19 versus 2.28 for RL w/o Blend, signifying a substantial reduction in overlap (redundancy) in the system summary. In the BERT experiment, we report the average cosine similarity score. As shown in Table 4, the PoBRL model achieves a higher cosine measure (signifying more diversity and less repetition).
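For reference, the sketch below shows one way the sentence-level redundancy numbers in Table 4 could be computed: average pairwise ROUGE-L and average pairwise cosine similarity over precomputed sentence embeddings. The averaging scheme and the use of precomputed embeddings (rather than a specific BERT pipeline) are assumptions of this sketch.

```python
import itertools
import numpy as np

def avg_pairwise_redundancy(summary_sentences, embeddings, rouge_l_fn):
    """Average pairwise ROUGE-L and cosine similarity over one summary's sentences.

    `embeddings` is an (n_sentences, dim) array of precomputed sentence embeddings
    (e.g., from BERT); `rouge_l_fn(a, b)` returns a sentence-level ROUGE-L score.
    """
    rouge_scores, cos_scores = [], []
    for i, j in itertools.combinations(range(len(summary_sentences)), 2):
        rouge_scores.append(rouge_l_fn(summary_sentences[i], summary_sentences[j]))
        a, b = embeddings[i], embeddings[j]
        cos_scores.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)))
    if not rouge_scores:
        return 0.0, 0.0
    return float(np.mean(rouge_scores)), float(np.mean(cos_scores))
```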

8 Conclusion

In this paper, we have proposed a novel PoBRL algorithmic framework that allows independent sub-modular reinforced policy learning and blends the learned policies by leveraging maximal marginal relevance and reinforcement learning. We have verified the proposed framework with our proposed model. Our empirical evaluation shows state-of-the-art performance on several multi-document summarization datasets.

References

Appendix A An Iterative and Greedy MMR Algorithm

Result: System summary
S ← {}
while len(S) < maxLen do
    for s_i in R \ S do
        score_imp = Importance(s_i)
        for s_j in S do
            score_red = max(score_red, Redundancy(s_i, s_j))
        end for
        if MMR-Score < λ score_imp − (1 − λ) score_red then
            s_pick = s_i
            update MMR-Score
        end if
    end for
    S ← S ∪ {s_pick}
end while
Algorithm 2: Greedy Iterative MMR Algorithm

The MMR from equation 1 is implemented as an iterative and greedy algorithm. The complete procedure is shown in Algorithm 2.
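A runnable Python sketch of Algorithm 2 is given below. The importance and pairwise redundancy scoring functions passed in are placeholders for whatever measures are plugged into eqn. 1; the toy word-overlap scores in the usage example are purely illustrative.

```python
def greedy_mmr(sentences, importance, redundancy, lam=0.7, max_len=5):
    """Greedy iterative MMR (Algorithm 2): repeatedly pick the sentence maximizing
    lambda * importance(s) - (1 - lambda) * max_{s_j in S} redundancy(s, s_j)."""
    summary, remaining = [], list(sentences)
    while len(summary) < max_len and remaining:
        best_score, best_sent = float("-inf"), None
        for s in remaining:
            score_imp = importance(s)
            score_red = max((redundancy(s, t) for t in summary), default=0.0)
            mmr = lam * score_imp - (1.0 - lam) * score_red
            if mmr > best_score:
                best_score, best_sent = mmr, s
        summary.append(best_sent)
        remaining.remove(best_sent)
    return summary

# Toy usage with word-overlap based scoring functions.
docs = ["a storm hit the coast", "the coast was hit by a storm", "schools closed early"]
imp = lambda s: len(set(s.split()))
red = lambda a, b: len(set(a.split()) & set(b.split())) / max(len(set(a.split())), 1)
print(greedy_mmr(docs, imp, red, lam=0.7, max_len=2))
```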

Appendix B Optimization Objective Decoupling

In this section, we derive the result for decoupling the multi-objective optimization problem into small sub-problems. We borrow the notation and formulation of ROUGE score from Peyrard and Eckle-Kohler (2016).

Let $S=\{s_i\,|\,i\leq m\}$ be a set of $m$ sentences that constitutes a system summary, and let $\rho(S)$ be the ROUGE-N score of $S$, where ROUGE-N evaluates the n-gram overlap between the gold summary and the system summary $S$. Then,

\rho(S)=\frac{1}{R_{N}}\sum_{g\in S^{*}}\min(F_{S}(g),F_{S^{*}}(g)) \qquad (12)

where $S^{*}$ is the gold summary, $F_{S}(g)$ denotes the number of times the n-gram $g$ occurs in $S$, and $R_N$ denotes the number of n-gram tokens.

Let $C_{Y,S^{*}}(g)=\min(F_{Y}(g),F_{S^{*}}(g))$ be the contribution of the n-gram $g$, and let $\epsilon(a\land b)=\sum_{g\in S^{*}}\max(C_{a,S^{*}}(g)+C_{b,S^{*}}(g)-F_{S^{*}}(g),0)$ be the redundancy between $a$ and $b$ in the summary. It can then be shown Peyrard and Eckle-Kohler (2016) that we can write

\begin{split}
\rho(S) &= \sum_{i=1}^{m}\rho(s_{i}) + \sum_{k=2}^{m}(-1)^{k+1}\Big(\sum_{1\leq i_{1}\leq\dots\leq i_{k}\leq m}\epsilon^{(k)}(s_{i_{1}}\land\dots\land s_{i_{k}})\Big) \qquad (13)\\
&\approx \sum_{i}^{m}\rho(s_{i})-\sum_{a,b\in S,\,a\neq b}\tilde{\epsilon}(a\land b) \qquad (14)\\
&\approx \sum_{i}^{m}\text{imp}(s_{i})-\sum_{a,b\in S,\,a\neq b}\text{red}(a,b) \qquad (15)
\end{split}

where $\tilde{\epsilon}(\cdot)$ approximates the redundancy, $\text{imp}(\cdot)$ denotes the importance function, and $\text{red}(\cdot)$ denotes the redundancy function. In other words, to maximize the ROUGE score we can maximize the importance of each sentence while subtracting (minimizing) the redundancy.

Now, we define a policy $\pi$ with the goal of searching over the sentence space and picking the most optimal set of sentences $\hat{S}^{*}$ such that $\pi^{*}=\text{argmax}_{\pi}E(\sum_{t}r_{t}|s_{t},a_{t})$, where $a_t$ corresponds to searching over the sentence space to determine which sentence to add to the summary $S$, $r_t=\rho(s_i)$ is interpreted as the importance (or the ROUGE-score contribution) of adding sentence $s_i$ to the summary set $S$, and the policy $\pi(a_t|X_t)$ outputs the probability of selecting a particular sentence and adding it to $S$. Now, instead of having a pre-defined length $m$, our policy $\pi$ also decides the optimal summary length by determining when to stop the sentence extraction.

In the multi-document setting, however, related documents might contain a large number of overlapping sentences. As shown in Fig. 1, for each sentence in the ground-truth summary there exist multiple candidate sentences across related texts. The optimal sentence selection strategy is to have two policies: the first search policy $\pi^{\text{imp}}$ learns to discover the most relevant sentence space, and from this smaller space the second policy $\pi^{\text{red}}$ learns to search for the most relevant but non-redundant sentence (denoted the most optimal sentence).

At a high level, the two policies $\pi^{\text{imp}}$ and $\pi^{\text{red}}$ are learned independently with the two decomposed objectives, as in equation 9. When we combine these two learned policies (as in Algorithm 1 and eqn. 4), they complement each other and formulate the optimal search strategy $\pi^{\text{PoBRL}}$.

Appendix C Details on the training and testing dataset

We evaluate the performance of our proposed strategy on the following two datasets.

In the first experiment, we train and test on the Multi-News dataset Fabbri et al. (2019). This dataset contains 44,972 training examples, 5,622 testing examples, and 5,622 validation examples. Each summary is written by a professional editor. We follow the testing procedure of Fabbri et al. (2019); the dataset is available at https://github.com/Alex-Fabbri/Multi-News.

In the second experiment, we test on the DUC-04 Dang (2006) dataset. This dataset contains 50 clusters, each with 10 documents. For each cluster, the dataset provides 4 human-generated ground-truth summaries. Because it is an evaluation-only dataset, we follow the same procedure as Fabbri et al. (2019) and Lebanoff et al. (2018): we train our network on CNN-DM Chen et al. (2016) and then test on DUC-04. The CNN-DM dataset contains 287,113 training, 13,368 validation, and 11,490 testing examples. It can be downloaded from https://github.com/abisee/cnn-dailymail, and the DUC-04 dataset from https://duc.nist.gov/duc2004/.

Appendix D Training details

Our Model. For our hierarchical sentence extractor, we instantiate the temporal-CNN module with single-layer 1-D convolutional filters of sizes 3, 4, and 5, respectively. Each input sentence is first converted to a vector representation by a word2vec matrix with an output dimension of 128. Then, each sentence is encoded by concatenating the outputs of each window of the temporal-CNN model.

Next, we instantiate the article-level encoder with a 2-layer bidirectional LSTM with 256 hidden units. We use this same configuration for the encoder and decoder of the pointer network as well. For the supervised learning (warm start), we train the network with the Adam optimizer with a learning rate of 1e-3 and a batch size of 64. For the reinforcement learning part, we train the network with actor-critic with a learning rate of 1e-7, $\gamma=0.99$, and a batch size of 32.

The entire training takes about 24.05 hours on a T4 GPU. All hyper-parameters are tuned on the validation set. For the batch size, we search over the values [32, 128]; for the learning rate in the supervised learning part, we search over [1e-3, 1e-4]; and for the learning rate in the reinforcement learning part, we search over [1e-3, 1e-7]. For the hyper-parameter search, we ran one trial per parameter set. After selecting the learning rates and batch size, we finally tune $\lambda$. Here, we only tried $\lambda$ values of [0.8, 0.9], and we selected the $\lambda$ value based on its performance on the validation set. For $\lambda=\lambda_t^{\text{adv}}$, no tuning is required.

Besides the learning rates and batch size, our overall model has one additional parameter to tune (if using a fixed value of $\lambda$). On the other hand, if using the adaptive $\lambda=\lambda_t^{\text{adv}}$, no tuning is required.

The overall runtime for summarizing the Multi-News dataset is about 8.25 minutes (for $\lambda=0.8, 0.9$) and 10.5 minutes for $\lambda=\lambda_t^{\text{adv}}$. For the DUC-04 dataset, the runtime is about 1.80 minutes (for $\lambda=0.8, 0.9$) and 2.12 minutes for $\lambda=\lambda_t^{\text{adv}}$.

For the baselines, we used the exact setup as in their original papers.