
May the Force Be with Your Copy Mechanism: Enhanced Supervised-Copy Method for Natural Language Generation

Sanghyuk Choi,  Jeong-in Hwang,  Hyungjong Noh,  Yeonsoo Lee
NCSOFT NLP Center
{sanghyuk, jihwang, nohhj0209, yeonsoo}@ncsoft.com
Abstract

Recent neural sequence-to-sequence models with a copy mechanism have achieved remarkable progress in various text generation tasks. These models address out-of-vocabulary problems and facilitate the generation of rare words. However, identifying which words need to be copied is difficult, and prior copy models consequently suffer from incorrect generation and a lack of abstractness. In this paper, we propose a novel supervised approach to the copy network that helps the model decide which words need to be copied and which need to be generated. Specifically, we re-define the objective function to leverage source sequences and target vocabularies as guidance for copying. Experimental results on data-to-text generation and abstractive summarization tasks verify that our approach enhances copying quality and improves the degree of abstractness.

1 Introduction

Natural language generation encompasses any setting in which new text is generated, such as machine translation, summarization, and dialogue systems (Gatt and Krahmer, 2018). Approaches using neural networks known as sequence-to-sequence (seq2seq) models have achieved reliable results on such tasks (Sutskever et al., 2014). However, basic seq2seq models exhibit weaknesses in dealing with rare and out-of-vocabulary (OOV) words. Vinyals et al. (2015) addressed this problem by copying words from the source sequence and directly inserting them into the generated output. This idea has been successfully applied to abstractive summarization tasks (Gulcehre et al., 2016; Gu et al., 2016; See et al., 2017; Xu et al., 2020) and has been further adopted by Wiseman et al. (2017) and Puduppully et al. (2019a) to help the model copy correct values in data-to-text generation tasks.

TEAM          | WIN | LOSS | PTS | FG_PCT | RB
Hawks         | 28  | 20   | 142 | 44     | 64
Knicks        | 21  | 28   | 139 | 40     | 63

PLAYER        | AS | RB | PT | FG | FGA
Paul Millsap  | 7  | 19 | 37 | 13 | 29
Kent Bazemore | 0  | 10 | 37 | 13 | 29
Pointer-Generator The Atlanta Hawks defeated the host New York Knicks, 142-139, at Philips Arena on Wednesday. The Hawks came into this game as a sizable favorite and they didn’t disappoint. In fact, Atlanta led for over 40 minutes of this game, as they led by double - digits for the entirety of the second half.
Our Force-copy-unk The Atlanta Hawks (28-20) defeated the New York Knicks (21-28) 142-139 at Phillips Arena in Atlanta. The Hawks were led by Paul Millsap, who scored 37 points (13-29 FG, 3-8 3Pt, 8-10 FT) to go with 19 rebounds, seven assists and one steal in 60 minutes
Figure 1: Comparison of output of data-to-text models on a ROTOWIRE dataset. Text that accurately reflects a record is highlighted in blue, and erroneous text is highlighted in red.

Unfortunately, the generation results obtained from copy mechanisms can be suboptimal. Zhou et al. (2018) observed that unrelated words sometimes appear unexpectedly in the middle of a phrase, or a phrase is not copied completely and some words are missing. See et al. (2017) and Gehrmann et al. (2018b) reported a lack of abstractness caused by excessive copying of source words, resulting in few novel expressions in the generated summaries. For example, the CNN/DailyMail dataset (Nallapati et al., 2016) contains approximately 17% novel words in its gold summaries, but most copy models produce less than 1%. Moreover, most data-to-text models based on copy mechanisms still suffer from factual inconsistencies. Accurately conveying the facts is especially important in informational communication, such as news, and low veracity makes these models unreliable and useless in practice (Kryscinski et al., 2020). Wiseman et al. (2017) indicated that there is a significant gap between neural models and template systems in terms of generating text containing factual (i.e., correct) records. Despite the efforts of the dedicated research community (Puduppully et al., 2019a; Rebuffel et al., 2020, 2021), many challenges remain in narrowing this gap.

In our experiments, we found that the copy probability of a word crucially affects the decoding process, and inaccurate estimates result in erroneous generation. Therefore, we argue that the decision between copying and generating can benefit from explicit guidance. Accordingly, we propose a force-copy method that, to increase copy precision, forces the model to copy every word that appears in both the source and target sequences. Furthermore, to alleviate the lack of abstractness in the generated text, we present the force-copy-unk method, which forces the model to copy a word only if it does not exist in the target vocabulary, even when it appears in both the source and target sequences.

Our contributions are three-fold:

  • We analyzed and compared the characteristics of prior copy models. On the basis of this analysis, we present the force-copy and force-copy-unk methods, which promote accurate copying of the source sequence.

  • The force-copy and force-copy-unk methods improve the RG precision score in data-to-text generation, indicating that text generation is grounded more correctly in the source data.

  • The force-copy-unk method improves copy precision, which not only increases the ROUGE score but also yields better abstraction on the abstractive summarization task.

2 Background

As our work builds on the prior copy mechanism, we introduce the Pointer-Generator Network (See et al., 2017) as a baseline. In their model, the source text $x$ is fed into a bidirectional LSTM (BiLSTM) encoder. However, Klein et al. (2020) reported no significant difference in copying performance between a Transformer (Vaswani et al., 2017) encoder and a BiLSTM; we therefore replace the BiLSTM with a Transformer encoder, which produces a sequence of encoded hidden states $h_{i}$. At each timestep $t$, the decoder receives the representation of the previously generated word to produce the decoder hidden state $s_{t}$. From the hidden states, a context vector $c_{t}$ is calculated based on the attention distribution (Bahdanau et al., 2015) as follows:

e_{t,i} = v^{T}\mathrm{tanh}(W_{h}h_{i} + W_{s}s_{t})   (1)
\alpha_{t} = \mathrm{softmax}(e_{t})   (2)
c_{t} = \sum_{i}\alpha_{t,i}h_{i}   (3)

where $v$, $W_{h}$ and $W_{s}$ are learnable parameters.

The vocabulary distribution $P_{vocab}$ over all words in the target vocabulary is computed from $s_{t}$, $c_{t}$, and the learnable parameters $W^{\prime}_{v}$, $W_{v}$, $b$ and $b^{\prime}$:

P_{vocab} = \mathrm{softmax}(W^{\prime}_{v}(W_{v}[s_{t};c_{t}] + b) + b^{\prime})   (4)

The generation probability $p_{gen}$ is used as a soft switch to either generate from the vocabulary or copy from the source words by sampling from the attention distribution $\alpha_{t}$. Additionally, $p_{gen}$ for timestep $t$ is calculated from the context vector $c_{t}$, decoder state $s_{t}$, and decoder input $y_{t}$:

p_{gen} = \mathrm{sigmoid}(w_{h}^{T}c_{t} + w_{s}^{T}s_{t} + w_{y}^{T}y_{t} + b_{ptr})   (5)

where the vectors $w_{h}^{T}$, $w_{s}^{T}$, $w_{y}^{T}$ and the scalar bias $b_{ptr}$ are learnable parameters. Let a word $w$ be an element of the extended vocabulary, which refers to the union of the target vocabulary and all words appearing in the input sequence; then the final probability distribution $P(w)$ is computed as:

P(w) = p_{gen}P_{vocab}(w) + (1 - p_{gen})\sum_{i:w_{i}=w}\alpha_{t,i}   (6)

If $w$ is an OOV word, then $P_{vocab}(w)$ is zero. Similarly, if $w$ does not exist in the input sequence, then $\sum_{i:w_{i}=w}\alpha_{t,i}$ is zero. During training, the loss at timestep $t$ is the negative log-likelihood of the target word $w_{t}^{*}$ for that timestep:

loss^{t} = -\mathrm{log}\,P(w_{t}^{*})   (7)
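
For concreteness, the following is a minimal PyTorch-style sketch of a single Pointer-Generator decoding step implementing Eqs. (1)-(7). The tensor shapes, the parameter dictionary, and the src_ext_ids/extended_size arguments are our own illustration, not the authors' released code.

import torch
import torch.nn.functional as F

def pointer_generator_step(h, s_t, y_t, src_ext_ids, params, extended_size):
    """One decoding step of the Pointer-Generator, Eqs. (1)-(6).

    h            : (src_len, hid)  encoder hidden states h_i
    s_t          : (hid,)          decoder state at timestep t
    y_t          : (emb,)          decoder input embedding at timestep t
    src_ext_ids  : (src_len,) long extended-vocabulary id of each source token
    extended_size: |V| plus the number of source-only OOV types
    """
    # Eqs. (1)-(3): attention scores, attention distribution, context vector.
    e_t = torch.tanh(h @ params["W_h"].T + s_t @ params["W_s"].T) @ params["v"]
    alpha_t = F.softmax(e_t, dim=0)
    c_t = (alpha_t.unsqueeze(1) * h).sum(dim=0)

    # Eq. (4): vocabulary distribution over the target vocabulary V.
    hidden = params["W_v"] @ torch.cat([s_t, c_t]) + params["b"]
    P_vocab = F.softmax(params["W_v_prime"] @ hidden + params["b_prime"], dim=0)

    # Eq. (5): soft switch between generating and copying.
    p_gen = torch.sigmoid(params["w_h"] @ c_t + params["w_s"] @ s_t
                          + params["w_y"] @ y_t + params["b_ptr"])

    # Eq. (6): mixture over the extended vocabulary; attention mass of
    # duplicated source words is summed by index_add_.
    P_w = torch.zeros(extended_size)
    P_w[: P_vocab.size(0)] = p_gen * P_vocab
    P_w.index_add_(0, src_ext_ids, (1.0 - p_gen) * alpha_t)

    # Eq. (7): the training loss at this step is -log P_w[target_ext_id].
    return P_w, alpha_t, p_gen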

3 Model

Figure 2: Architecture of our models. The generation probability $p_{gen}$ at each decoder timestep is trained through an explicit loss function. (a) The force-copy model is trained to copy if the target word exists in the source context. On the other hand, (b) the force-copy-unk model is trained to generate from the vocabulary distribution unless the target word is an OOV. Note that COVID-19 is out-of-vocabulary in this example. Best viewed in color.

We next consider techniques for incorporating supervised learning of the copying decision into the model. To identify when copying and generation each occur, Eq. (6) is split into three cases: (i) no word is copied from the source sequence, i.e., the word is sampled from the vocabulary probability distribution; in this case, $p_{gen}$ is guided to take a higher value during training. (ii) The word copied from the source sequence does not exist in the target vocabulary; in this case, $p_{gen}$ is guided to take a lower value during training. (iii) The word copied from the source sequence exists in the target vocabulary; in this case, $p_{gen}$ can take any value in $[0,1]$ and is learned implicitly.

We augment the decoder by supervising the soft switch $p_{gen}$ that determines whether the model generates or copies. First, we re-define the loss function as

loss^{t} = loss_{vocab}^{t} + loss_{attn}^{t} + loss_{p_{gen}}^{t}   (8)

where $loss_{vocab}^{t}$ is the maximum likelihood estimation (MLE) loss, which is used as the standard training criterion in sequence-to-sequence models:

loss_{vocab}^{t} = -\mathrm{log}(P_{vocab}(w_{t}^{*}))   (9)

It should be noted that $P_{vocab}(w_{t}^{*})$ indicates the probability of generating the unknown token if $w_{t}^{*}$ is OOV. Similarly to the pointer network (Vinyals et al., 2015), we use the attention distribution as a guide for copying and train it to minimize the negative log-likelihood:

loss_{attn}^{t} = -\mathrm{log}\left(\sum_{i:w_{i}=w}\alpha_{t,i}\right)   (10)

Finally, we adopt $loss_{p_{gen}}^{t}$ to train $p_{gen}$ explicitly. The optimal approach would be to feed a gold copy label for every target word; however, such supervised data rarely exist for this task. Hence, we suggest two methods that leverage the source sequences and target vocabularies.

Models                    | Use explicit switch | Recycle attention | Conditional activate | Mix copy, gen prob | Sum src duplicated word | Force train switch | Tradeoff between copy/gen
Miao and Blunsom (2016)   | Yes | Yes      | No          | Yes | No  | No  | Yes
COPYNET (Gu et al., 2016) | No  | No       | No          | No  | Yes | -   | Yes
Merity et al. (2016)      | No  | Yes      | No          | Yes | Yes | -   | Yes
Gulcehre et al. (2016)    | Yes | Yes      | Yes         | No  | No  | Yes | Yes
Nallapati et al. (2016)   | Yes | Yes      | Yes         | No  | No  | Yes | Yes
PGNet (See et al., 2017)  | Yes | Yes      | No          | Yes | Yes | No  | Yes
Chen et al. (2020)        | Yes | Yes      | Yes         | Yes | Yes | Yes | Yes
Wu et al. (2020)          | Yes | Yes      | Yes         | Yes | Yes | Yes | Yes
OpenNMT + force copy attn | Yes | Optional | No          | Yes | Yes | Yes | Yes
Force-copy (ours)         | Yes | Yes      | No          | Yes | Yes | Yes | No
Force-copy-unk (ours)     | Yes | Yes      | No (partly) | Yes | Yes | Yes | No
Table 1: Brief comparison of prior copy models. Use explicit switch denotes that the model calculates an explicit switch probability. Recycle attention denotes that the model recycles the attention distribution as the copy distribution. Conditional activate denotes that the model activates the copy mechanism only for unknown or named-entity words or keywords. Mix copy, gen prob denotes that the model combines the probabilities from the vocabulary distribution with probabilities from the copy distribution. Sum src duplicated word denotes that the model adds all probabilities of the same word from the attention distribution when the word appears multiple times in the source sequence. Force train switch denotes that the model trains the switch probability explicitly. Tradeoff between copy/gen denotes that the model has a loss trade-off between copy and generation during training.

Force-copy Given a source sequence $X=(x_{1},x_{2},...,x_{T_{x}})$, if a target word $w_{t}^{*}$ appears in $X$, then $w_{t}^{*}$ is a copy-candidate. For example, the words 'the' and 'COVID-19' are both copy-candidates in Figure 2. In the force-copy model, we assume that every copy-candidate word is copied from the source sequence. Therefore, $p_{gen}$ is supervised through the loss function:

loss_{p_{gen}}^{t} = \begin{cases} -\mathrm{log}(1-p_{gen}) & \text{if } w_{t}^{*} \in X \\ -\mathrm{log}(p_{gen}) & \text{otherwise} \end{cases}   (11)

This forces the copy switch $p_{gen}$ toward copying for all copy-candidate words. However, we do not penalize the generation ability even when copying occurs: we always train with $loss_{vocab}^{t}$ in the loss function, regardless of the copy process.
Force-copy-unk Whereas the force-copy model tries to copy every copy-candidate word, it is also possible to generate a word from the vocabulary distribution instead of copying it if the word exists in the target vocabulary. For example, the copy-candidate word 'the' could be generated instead of copied in Figure 2. Accordingly, in the force-copy-unk model we restrict the scope of copying to unknown words. Therefore, $p_{gen}$ is supervised through the loss function:

loss_{p_{gen}}^{t} = \begin{cases} -\mathrm{log}(1-p_{gen}) & \text{if } w_{t}^{*} \in X \text{ and } w_{t}^{*} \notin V \\ -\mathrm{log}(p_{gen}) & \text{otherwise} \end{cases}   (12)

where $V$ is the target vocabulary. This loss function forces the copy switch $p_{gen}$ to copy a word only if it is an unknown token. For the remaining copy-candidate words that are not OOV, we found it effective to utilize the copy information as well (see Section 6.1). Inspired by work on guided alignment training for machine translation (Chen et al., 2016), we retain $loss_{attn}^{t}$ in the loss function to inform the decoder that the word is a copy-candidate, thereby inducing copy-like generation.
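
To make the combined objective concrete, the following is a minimal sketch of the per-timestep loss of Eq. (8) for both variants; the function signature, epsilon smoothing, and helper variables are ours, while the three terms follow Eqs. (9)-(12).

import torch

def supervised_copy_loss(P_vocab, alpha_t, p_gen, target_id, target_word,
                         src_words, vocab, mode="force_copy", eps=1e-12):
    """Per-timestep loss of Eq. (8): loss_vocab + loss_attn + loss_pgen.

    P_vocab     : (|V|,)     vocabulary distribution, Eq. (4)
    alpha_t     : (src_len,) attention distribution, Eq. (2)
    p_gen       : scalar     generation probability, Eq. (5)
    target_id   : index of w_t* in V (the UNK id if w_t* is OOV)
    target_word : surface form of the target word w_t*
    src_words   : list of source tokens x_1 .. x_Tx
    vocab       : set of target-vocabulary words V
    """
    # Eq. (9): the vocabulary loss is always kept, even when copying.
    loss_vocab = -torch.log(P_vocab[target_id] + eps)

    is_candidate = target_word in src_words       # w_t* appears in X
    in_vocab = target_word in vocab               # w_t* is in V

    # Eq. (10): guide attention toward every source position holding w_t*.
    loss_attn = torch.tensor(0.0)
    if is_candidate:
        mask = torch.tensor([w == target_word for w in src_words])
        loss_attn = -torch.log(alpha_t[mask].sum() + eps)

    # Eq. (11) force-copy: copy every copy-candidate.
    # Eq. (12) force-copy-unk: copy only copy-candidates outside V.
    if mode == "force_copy":
        copy_here = is_candidate
    else:
        copy_here = is_candidate and not in_vocab
    loss_pgen = -torch.log(1.0 - p_gen + eps) if copy_here else -torch.log(p_gen + eps)

    return loss_vocab + loss_attn + loss_pgen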

4 Related Work

Abstractive Summarization
Abstractive summarization aims to generate accurate and concise summaries that contain novel words, in contrast to extractive summarization, which extracts almost entire sentences from the input. Most prior works on abstractive summarization that employed neural networks achieved inspiring results (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017). In particular, after Vinyals et al. (2015) and Gulcehre et al. (2016) showed dramatic improvements by applying copy mechanisms, copying has become a primary module for abstractive summarization. Recent work leveraging pre-trained models (Dong et al., 2019; Song et al., 2019; Lewis et al., 2020; Zhang et al., 2020) has presented impressive advancements in performance when fine-tuned for text generation tasks. Applying the copy mechanism to these pre-trained models has extended this success (Xu et al., 2020; Bi et al., 2020).
Data-to-text Generation
Data-to-text generation tasks aim to produce texts from non-linguistic input data (Reiter, 2007), including the generation of weather forecasts (Liang et al., 2009) and biographical content from Wikipedia (Lebret et al., 2016). Specifically, Wiseman et al. (2017) showed that a neural encoder-decoder model powered with a copy mechanism (Gu et al., 2016; Gulcehre et al., 2016) can generate fluent multi-sentence summaries from game statistics without explicit templates or rules. Based on their approach, various promising neural approaches have been proposed for the ROTOWIRE task (Puduppully et al., 2019a, b; Rebuffel et al., 2020; Iso et al., 2019). Regarding the copy mechanism, Puduppully et al. (2019a, b) and Iso et al. (2019) adopted the same method as Wiseman et al. (2017), and Rebuffel et al. (2020) utilized the Pointer-Generator Network (PGNet) (See et al., 2017).[1]
[1] We analyzed the released code (if available) to identify which copy mechanism was employed when it was not mentioned in the paper.
Pointer / Copy Mechanism
The fundamental structure of the copy mechanism was first introduced in the form of the pointer network (Vinyals et al., 2015). Expanding on this work, various approaches combining attention probability and generation probability have been proposed (Miao and Blunsom, 2016; Gu et al., 2016; Gulcehre et al., 2016; Nallapati et al., 2016). In particular, See et al. (2017) proposed PGNet, where the two distributions are weighted and merged into a single mixture distribution with the aid of a soft switch mechanism. PGNet has become the de facto standard for copy mechanisms in various tasks, including summarization (Paulus et al., 2018; Gehrmann et al., 2018b), data-to-text (Gehrmann et al., 2018a; Rebuffel et al., 2020), and question answering (McCann et al., 2018).

Table 1 shows the characteristics of prior copy models. While our models bear a resemblance to models that train the switch probability, ours differ considerably in two aspects. (i) Gulcehre et al. (2016), Nallapati et al. (2016), Chen et al. (2020) and Wu et al. (2020) train their copy components to activate only for certain words (i.e., OOV words, named entities, keywords, or values of structured data), whereas we activate the switch for all copy-candidates. Our force-copy-unk model also appears to be trained to copy only unknown words, but it works for all copy-candidates by maintaining the attention loss (see Eq. 10). (ii) We retain the vocabulary loss (see Eq. 9) even when the model copies a word, whereas the other models induce a trade-off between copy and generation. Models with such a trade-off necessarily weaken the effect of the vocabulary loss when they activate the copy process. However, we believe it is necessary to keep the vocabulary loss to prevent mistaken predictions from the vocabulary distribution, because the switch probability $p_{gen}$ takes a scalar value in $[0,1]$, not a binary value.

5 Experiments

5.1 Data-to-text Generation

Experimental Settings
We perform experiments using two datasets, ROTOWIRE (Wiseman et al., 2017) and MLB (Puduppully et al., 2019b), for the data-to-text generation task. ROTOWIRE contains professionally written articles summarizing NBA basketball games paired with the corresponding game statistics, whereas MLB contains summaries paired with MLB baseball games. Both datasets consist of table-structured data records and relatively long multi-sentence documents (337 and 542 words on average for ROTOWIRE and MLB, respectively). Note that the MLB dataset we used to train and test our model differs slightly from that of Puduppully et al. (2019b) in two aspects.[2] First, part of the official data has been modified since Puduppully et al. (2019b). Second, we modified the preprocessing script for the summaries to suit our model. Hence, we only compare models that were trained on our version of the dataset. We compare our model against (i) template-based generators from Wiseman et al. (2017) for ROTOWIRE and from Puduppully et al. (2019b) for MLB; (ii) WS-2017, a standard encoder-decoder system with a copy mechanism (Wiseman et al., 2017); (iii) RBF-2020, a Transformer architecture with a hierarchical attention mechanism over entities and the records within entities (Rebuffel et al., 2020), based on OpenNMT + force copy attn (Klein et al., 2020) (see Table 1); (iv) ENT, the entity-based model of Puduppully et al. (2019b) that creates dynamically updated entity-specific representations; and (v) PGNet, a Pointer-Generator Network (See et al., 2017), which we re-implemented based on the hierarchical-attention architecture of RBF-2020. Note that our models (FC: force-copy, FCU: force-copy-unk) are also based on RBF-2020, with the copy module modified as described above. More details can be found in Appendix A.1.
[2] In Table 2, the CS and CO scores of the MLB GOLD test set are below 100, which reflects the difference between our dataset and that of Puduppully et al. (2019b).
Results

ROTOWIRE   | RG P% | RG #  | CS P% | CS R% | CO DLD% | BLEU
GOLD       | 96.11 | 17.31 | 100   | 100   | 100     | 100
Templ      | 99.95 | 54.15 | 23.74 | 72.36 | 11.68   | 8.9
WS-2017    | 75.62 | 16.83 | 32.80 | 39.93 | 15.62   | 14.2
RBF-2020   | 89.46 | 21.17 | 39.47 | 51.64 | 18.90   | 17.5
ENT        | 92.69 | 30.11 | 38.64 | 48.50 | 20.17   | 16.2
PGNet      | 87.14 | 20.98 | 40.53 | 48.06 | 19.80   | 16.3
RBF-2020†  | 89.31 | 22.07 | 36.88 | 49.37 | 17.87   | 16.6
Ours (FC)  | 93.27 | 24.28 | 34.34 | 48.85 | 17.26   | 15.8
Ours (FCU) | 95.40 | 27.37 | 30.65 | 48.39 | 15.14   | 14.2

MLB        | RG P% | RG #  | CS P% | CS R% | CO DLD% | BLEU
GOLD       | 92.07 | 21.02 | 98.84 | 99.91 | 98.76   | 100
Templ      | 97.96 | 59.93 | 68.46 | 22.82 | 10.64   | 3.8
PGNet      | 79.12 | 20.94 | 47.06 | 47.73 | 20.07   | 9.9
RBF-2020   | 81.71 | 19.53 | 47.75 | 47.11 | 19.37   | 9.7
Ours (FC)  | 82.50 | 21.71 | 46.91 | 49.01 | 19.28   | 10.4
Ours (FCU) | 84.50 | 21.05 | 49.39 | 50.89 | 21.16   | 10.5
Table 2: Evaluation on the ROTOWIRE and MLB test sets using RG count (#) and precision (P%), CS precision (P%) and recall (R%), CO in normalized Damerau-Levenshtein distance (DLD%), and BLEU. The bottom section of each block (from PGNet onward) corresponds to our implementation of the corresponding methods. † denotes the duplicate (re-implemented) model.

Following prior work (Wiseman et al., 2017; Puduppully et al., 2019a,b; Rebuffel et al., 2020), we evaluate our models using both extractive metrics (relation generation (RG), content selection (CS), and content ordering (CO)) and an $n$-gram metric (BLEU). For the extractive metrics, we use the Information Extraction (IE) system suggested by Wiseman et al. (2017). Given a data record $r$, gold summary $y$ and generated text $\hat{y}$, the IE system identifies entity (e.g., Knicks) and value (e.g., 28) pairs from $\hat{y}$ and then predicts the pair relation (e.g., WIN). RG measures the precision and number of unique relations extracted from $\hat{y}$ that also appear in $r$. CS estimates the precision and recall of the relations extracted from $\hat{y}$ with respect to those extracted from $y$. Finally, CO computes the normalized Damerau-Levenshtein distance (DLD) between the sequences of relations extracted from $\hat{y}$ and $y$. We used the pretrained IE models developed by Wiseman et al. (2017) and Puduppully et al. (2019b) for ROTOWIRE and MLB, respectively.
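
For reference, below is a minimal sketch of the normalized Damerau-Levenshtein similarity underlying CO, using the restricted (optimal string alignment) variant and normalizing by the longer sequence length; the official IE evaluation scripts may differ in these details.

def normalized_dld_similarity(pred_relations, gold_relations):
    """CO-style score between relation sequences: 1 - DLD / max length."""
    a, b = pred_relations, gold_relations
    n, m = len(a), len(b)
    if max(n, m) == 0:
        return 1.0
    # d[i][j] = restricted edit distance between a[:i] and b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return 1.0 - d[n][m] / max(n, m)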

Table 2 shows our main results on the ROTOWIRE and MLB datasets. Our force-copy and force-copy-unk models achieve higher RG precision than any other neural model on both corpora. Specifically, the RG precision of the force-copy-unk model is comparable to that of the GOLD summaries on the ROTOWIRE dataset. With respect to the loss function, the intervention of the copy mechanism increases in the order of PGNet, RBF-2020 and our models, and we observe that RG precision increases in the same order. This effect is observed on both datasets.

The ENT system yields the highest CO score on ROTOWIRE. Considering the model architectures, we conjecture that this is because ENT leverages entity-specific representations specialized for data-to-text tasks, whereas the other neural models feature vanilla sequence-to-sequence structures.

5.2 Abstractive Summarization

Method               | R-1   | R-2   | R-L
Abstractive Model*   | 35.46 | 13.30 | 32.65
ML + Intra-Attention | 38.30 | 14.81 | 35.49
Pointer-Generator    | 36.44 | 15.66 | 33.42
Pointer-Generator†   | 38.66 | 16.97 | 35.61
Force-copy           | 38.76 | 16.84 | 35.42
Force-copy-unk       | 39.31 | 17.13 | 36.25
Table 3: Evaluation on the CNN/DailyMail test set using the full-length ROUGE-F1 metric. The bottom section corresponds to our implementation, and † denotes the duplicated model. The model marked with * used the anonymized dataset, so it is not strictly comparable to our results.
Method            | Copy Precision
Pointer-Generator | 47.80%
Force-copy        | 47.81%
Force-copy-unk    | 48.84%
Table 4: Evaluation on the CNN/DailyMail test set using Copy Precision.

Experimental Settings
We use the non-anonymized version of the CNN/DailyMail dataset (Hermann et al., 2015; Nallapati et al., 2016) provided by Harvard NLP and conform to all experimental conditions presented by See et al. (2017). More details can be found in Appendix A.2.

Since the purpose of the experiments is not to record the highest ROUGE score but to validate the performance of the copy mechanism itself, we confine the experiments to encoder-decoder abstractive baselines trained with cross-entropy and do not adopt any additional techniques such as coverage or selector mechanisms. For the same reason, we re-implemented the Pointer-Generator network to precisely compare the effect of the copy mechanism. We therefore compare our models against (i) the Abstractive model (Nallapati et al., 2016), a pointer-based encoder-decoder model using two softmax layers; (ii) ML+Intra-Attention (Paulus et al., 2018), an intra-attention model based on the encoder-decoder network; and (iii) the Pointer-Generator (See et al., 2017) and our re-implemented version, in which the original LSTM encoder is replaced with a Transformer encoder.
Automatic Evaluation Results
We adopt ROUGE (Lin, 2004) and Copy Precision (CP) as evaluation metrics. Given a source article $x$, gold summary $y$, and generated text $\hat{y}$, CP estimates how well the words of $\hat{y}$ that appear in $x$ match $y$.
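
Since CP is defined only informally, the sketch below shows the instantiation we assume: among generated words that also occur in the source article (i.e., potentially copied words), the fraction that also occur in the gold summary. The exact tokenization and matching rules may differ from ours.

def copy_precision(source_tokens, gold_tokens, generated_tokens):
    """Assumed Copy Precision: correctness of generated words that appear in the source."""
    src, gold = set(source_tokens), set(gold_tokens)
    copied = [w for w in generated_tokens if w in src]
    if not copied:
        return 0.0
    return sum(w in gold for w in copied) / len(copied)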

Tables 3 and 4 show our ROUGE and CP evaluation results on the CNN/DailyMail corpus. The results of the baseline and the force-copy model are close: force-copy is better on R-1, while the baseline is better on R-2 and R-L. The force-copy-unk model outperforms the baseline on all ROUGE criteria, which is consistent with force-copy-unk also outperforming the baseline on CP.

To investigate the copy and generation ratio, we report the novel $n$-gram scores for the ground-truth and model-generated summaries in Table 5, which represent the level of abstractness (Kryściński et al., 2018). Our force-copy model produces fewer novel expressions than the baseline. Because the final generation probability mixes the probabilities from the vocabulary and copy distributions, it is difficult to determine exactly whether a word is copied or generated. However, we conjecture that the force-copy model tends to favor copying, since 83% of the target words in the dataset are copy-candidates that are trained to be copied. At inference time, this becomes even more skewed owing to the lack of word-by-word supervision. In practice, the average $p_{copy}$ (equivalent to $1-p_{gen}$) of the force-copy model over all words of the test set is 0.848, whereas that of the Pointer-Generator is 0.773. Conversely, the force-copy-unk model, with an average $p_{copy}$ of 0.018, tends to favor generation. This makes the force-copy-unk model produce significantly more novel expressions than the baseline. In fact, Kryściński et al. (2018) reported an inverse correlation between ROUGE and novelty scores across all model types. Hence, the main benefit of our force-copy-unk model is the increase in novel expressions, since the model enhances the ROUGE score at the same time.
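
For completeness, the following sketch shows the novel $n$-gram ratio as we assume it is computed: the fraction of summary $n$-grams that never occur in the source article.

def ngrams(tokens, n):
    """All n-grams of a token sequence (duplicates kept)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_ratio(source_tokens, summary_tokens, n):
    """Assumed NN-n score: share of summary n-grams absent from the source."""
    summary_grams = ngrams(summary_tokens, n)
    if not summary_grams:
        return 0.0
    source_grams = set(ngrams(source_tokens, n))
    return sum(g not in source_grams for g in summary_grams) / len(summary_grams)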

Method               | NN-1  | NN-2  | NN-3  | NN-4
Ground-truth Summary | 16.99 | 56.19 | 73.98 | 82.38
Pointer-Generator    | 0.25  | 5.37  | 11.45 | 16.85
Force-copy           | 0.03  | 3.54  | 8.10  | 12.36
Force-copy-unk       | 0.28  | 6.62  | 13.60 | 19.54
Table 5: Comparison of novel $n$-gram (NN-$n$) test results for our models and the baselines on the CNN/DailyMail test set.
Aspect       | Comparison (A vs. B) | A (%) | B (%) | Tie (%)
Readability  | PGNet vs. FC         | 6.0   | 12.0  | 82.0
             | PGNet vs. FCU        | 10.0  | 8.7   | 81.3
             | FC vs. FCU           | 12.6  | 6.7   | 80.7
Factuality   | PGNet vs. FC         | 4.7   | 16.0  | 79.3
             | PGNet vs. FCU        | 20.7  | 11.3  | 68.0
             | FC vs. FCU           | 28.7  | 2.0   | 69.3
Abstractness | PGNet vs. FC         | 27.3  | 9.3   | 63.3
             | PGNet vs. FCU        | 12.7  | 20.7  | 66.7
             | FC vs. FCU           | 4.0   | 31.3  | 64.7
Table 6: Human evaluation results on 50 randomly sampled articles from the CNN/DailyMail test set. Each summary pair is reviewed by 3 human evaluators. Agreement scores by Fleiss' kappa (Fleiss et al., 1971) are 0.19 for readability, 0.44 for factuality and 0.29 for abstractness.

Human Evaluation Results
We conduct a human evaluation to measure the quality of the summaries of each model. The instructions cover three aspects: (i) Readability: How well-written (fluent and grammatical) is the summary? (ii) Factuality: Is the summary factually consistent with the source document? (iii) Abstractness: How well does the summary generate novel expressions (rather than entirely copying the source sentences)?

As shown in Table 6, our force-copy model outperforms the others on readability and factuality, and the force-copy-unk model outperforms the others on abstractness. It appears that more copy operations make the summary better in terms of readability and factuality but result in poorer abstractness. We discuss this further in Section 6.3.

Method            | RG(P), $p_{copy}>0.5$ | RG(P), $p_{gen}>0.5$
Pointer-Generator | 0.8043 (57.1%)        | 0.7757 (42.9%)
Force-copy        | 0.8250 (62.3%)        | 0.7797 (37.7%)
Force-copy-unk    | - (0%)                | 0.8365 (100%)
Table 7: Comparison of RG precision (for only the unique and numeric values of ROTOWIRE) for our models and the baseline, depending on the copy probability. Percentages represent the proportion of words with $p_{copy}>0.5$ and $p_{gen}>0.5$, respectively.

6 Analysis and Discussion

6.1 Effects of Copy Probability on Precision

To examine the effect of the copy probability on generation precision, we investigate the word-level $p_{copy}$ on the ROTOWIRE test set. Based on the IE system, we report RG precision as a function of the copy probability in Table 7. To couple the IE system with the generated summaries, we limit this study to unique and numeric words; hence, this RG precision differs from that shown in Table 2.

We first observe that a high copy probability results in high precision for both the force-copy model and the baseline, which adheres to the goal of the copy mechanism. Although the precision gap between the two models for $p_{gen}>0.5$ is only 0.4%p, it increases to 2.1%p for $p_{copy}>0.5$. These findings confirm that our force-copy model places higher confidence on the copy probability than the baseline. Furthermore, it generates 62.3% of its words with $p_{copy}>0.5$, compared to 57.1% for the baseline, further widening the precision gap. Meanwhile, the force-copy-unk model does not generate any words with $p_{copy}>0.5$ for the values of the ROTOWIRE test set, as the numeric words exist in the target vocabulary. Although the force-copy-unk model yields low $p_{copy}$, the attention loss (Eq. 10) prompts the model to look up the values to copy during training. This copy-like generation corresponds to a 6.1%p precision increase over the baseline for words with $p_{gen}>0.5$.

6.2 Impact of Vocabulary Size on the Force-copy-unk Model

As the copy mechanism of our force-copy-unk model relies on a target vocabulary, we study the effect of vocabulary size on the force-copy-unk model. In Section 5.2, we experimented on the abstractive summarization task on the CNN/DailyMail corpus with a 50k vocabulary. The vocabulary coverage is 72.1%, and 1.9% of the words in the gold test set summaries are OOV. When we reduce the vocabulary size to 25k, these percentages become 53.3% and 3.5%, respectively; when we increase it to 100k, they become 83.6% and 1.1%, respectively. Table 8 compares performance in terms of vocabulary size. As reported by See et al. (2017), a large vocabulary does not enhance the performance of attention-based sequence-to-sequence models. This is consistent with our observation that the large (100k) vocabulary model performs poorly in terms of ROUGE; even the 50k model falls behind the 25k model on ROUGE in practice. In contrast, we observe that novel $n$-gram expressions increase as the vocabulary grows. This disparity is likely associated with the smaller vocabulary during training, which prompts the model to copy more (see Eq. 12). According to these results, the force-copy-unk model can control this trade-off by adjusting the vocabulary size.

6.3 Tradeoff between Copy and Abstractness

$|V|$ | R-1   | R-2   | R-L   | NN-1 | NN-2  | NN-3  | NN-4
25k   | 39.31 | 17.13 | 36.25 | 0.28 | 6.62  | 13.60 | 19.54
50k   | 38.52 | 16.62 | 35.58 | 0.46 | 7.96  | 15.85 | 22.42
100k  | 38.11 | 16.21 | 35.28 | 0.73 | 10.47 | 20.14 | 27.84
Table 8: ROUGE and novel $n$-gram comparison in terms of vocabulary size for our force-copy-unk model.
Article no other scoreline could have emphasised how seriously this new arsenal team are taking their prospects of winning major trophies again (…) olivier giroud wants arsenal to continue their premier league winning streak and challenge for the title. (…) arsenal to earn a 1-0 victory over burnley on saturday. (…)
PGNet arsenal beat burnley 1-0 in the premier league on saturday. olivier giroud wants arsenal to continue their premier league winning streak and challenge for the title.
Force-copy no other scoreline could have emphasised how seriously this new arsenal team are taking their prospects of winning major trophies again. olivier giroud wants arsenal to continue their premier league winning streak and challenge for the title.
Force-copy-unk arsenal beat burnley 1-0 in their premier league clash on saturday. olivier giroud feels the new arsenal team (…)
Figure 3: Examples of summaries generated by each model on the CNN/DailyMail test set (in the original figure, red denotes sentences copied verbatim from the source article).

The human evaluation results presented in Section 5.2 indicate that our force-copy model outperforms the other models in terms of readability and factuality. However, the force-copy-unk model exhibits low readability and factuality on the abstractive summarization task. This is a conflicting outcome given that the force-copy-unk model enhanced the RG precision of the data-to-text task, and it is likely attributable to differences between the two tasks and datasets. As shown in Figure 4, only the values of structured data (i.e., names and records) can be copied in the data-to-text task, whereas almost every word can be copied in the summarization task. This is consistent with Song et al. (2020), who found that a high proportion of copy-candidate words during training leads the model to depend on copying, yielding near-extractive outputs with less abstraction. The examples in Figure 3 show that the more the model activates copying during training, the more it behaves like an extractive summarization model. Abstraction introduces more freedom of phrasing, which can hurt factuality and readability compared to extracted source sentences. In any case, encouraging the model to write more abstractly while retaining factuality and readability is an interesting subject for future work.

7 Conclusion

In this work, we analyzed prior copy models in detail. Based on this analysis, we presented novel copy mechanisms that build on the Pointer-Generator network. We showed via data-to-text experiments that our models copy more accurately than the baselines. In addition, our force-copy-unk model outperformed the baselines on both the ROUGE score and the novel $n$-gram score on the abstractive summarization task.

References

  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
  • Bi et al. (2020) Bin Bi, Chenliang Li, Chen Wu, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2020. PALM: Pre-training an autoencoding&autoregressive language model for context-conditioned generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8681–8691, Online. Association for Computational Linguistics.
  • Chen et al. (2016) Wenhu Chen, Evgeny Matusov, Shahram Khadivi, and Jan-Thorsten Peter. 2016. Guided alignment training for topic-aware neural machine translation. In Proceedings of AMTA 2016, page 121.
  • Chen et al. (2020) Zhiyu Chen, Harini Eavani, Wenhu Chen, Yinyin Liu, and William Yang Wang. 2020. Few-shot NLG with pre-trained language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 183–190, Online. Association for Computational Linguistics.
  • Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Fleiss et al. (1971) J.L. Fleiss et al. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.
  • Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Gehrmann et al. (2018a) Sebastian Gehrmann, Falcon Dai, Henry Elder, and Alexander Rush. 2018a. End-to-end content and plan selection for data-to-text generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 46–56, Tilburg University, The Netherlands. Association for Computational Linguistics.
  • Gehrmann et al. (2018b) Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018b. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium. Association for Computational Linguistics.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
  • Gulcehre et al. (2016) Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149, Berlin, Germany. Association for Computational Linguistics.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Iso et al. (2019) Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji Aramaki, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, and Hiroya Takamura. 2019. Learning to select, track, and generate for data-to-text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2102–2113, Florence, Italy. Association for Computational Linguistics.
  • Klein et al. (2020) Guillaume Klein, François Hernandez, Vincent Nguyen, and Jean Senellart. 2020. The OpenNMT neural machine translation toolkit: 2020 edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 102–109, Virtual. Association for Machine Translation in the Americas.
  • Kryscinski et al. (2020) Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.
  • Kryściński et al. (2018) Wojciech Kryściński, Romain Paulus, Caiming Xiong, and Richard Socher. 2018. Improving abstraction in text summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1808–1817, Brussels, Belgium. Association for Computational Linguistics.
  • Lebret et al. (2016) Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, Texas. Association for Computational Linguistics.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
  • Liang et al. (2009) Percy Liang, Michael Jordan, and Dan Klein. 2009. Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 91–99, Suntec, Singapore. Association for Computational Linguistics.
  • Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
  • Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
  • Miao and Blunsom (2016) Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 319–328, Austin, Texas. Association for Computational Linguistics.
  • Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
  • Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
  • Puduppully et al. (2019a) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019a. Data-to-text generation with content selection and planning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6908–6915.
  • Puduppully et al. (2019b) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019b. Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2023–2035, Florence, Italy. Association for Computational Linguistics.
  • Rebuffel et al. (2021) Clément Rebuffel, Marco Roberti, Laure Soulier, Geoffrey Scoutheeten, Rossella Cancelliere, and Patrick Gallinari. 2021. Controlling hallucinations at word level in data-to-text generation. arXiv preprint arXiv:2102.02810.
  • Rebuffel et al. (2020) Clément Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2020. A hierarchical model for data-to-text generation. In European Conference on Information Retrieval, pages 65–80. Springer.
  • Reiter (2007) Ehud Reiter. 2007. An architecture for data-to-text systems. In Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07), pages 97–104, Saarbrücken, Germany. DFKI GmbH.
  • Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
  • Song et al. (2020) Kaiqiang Song, Bingqing Wang, Zhe Feng, Ren Liu, and Fei Liu. 2020. Controlling the amount of verbatim copying in abstractive summarization. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 8902–8909. AAAI Press.
  • Song et al. (2019) Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936. PMLR.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2, pages 3104–3112.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008.
  • Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, pages 2692–2700.
  • Wiseman et al. (2017) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics.
  • Wu et al. (2020) Xiuyu Wu, Nan Jiang, and Yunfang Wu. 2020. A question type driven and copy loss enhanced frameworkfor answer-agnostic neural question generation. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 69–78, Online. Association for Computational Linguistics.
  • Xu et al. (2020) Song Xu, Haoran Li, Peng Yuan, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020. Self-attention guided copy mechanism for abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1355–1362, Online. Association for Computational Linguistics.
  • Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
  • Zhou et al. (2018) Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2018. Sequential copying networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Appendix A Experimental Details

A.1 Data-to-text Generation

We used the official train/valid/test splits of both datasets for our experiments (3,398/727/728 for ROTOWIRE and 22,821/1,739/1,744 for MLB).[3]
[3] The complete ROTOWIRE dataset can be obtained from https://github.com/harvardnlp/boxscore-data, and the MLB dataset from https://github.com/ratishsp/mlb-data-scripts.

For ROTOWIRE, we adopt a word embedding size of 300 and a feature embedding size of 27. We employ a Transformer encoder based on the hierarchical architecture of RBF-2020. The encoder consists of 3 low-level (unit) layers and 3 high-level (chunk) layers, where each layer comprises a feed-forward layer with 1024-dimensional hidden states. For the decoder, we adopt a 2-layer LSTM network. The model in this setting has 14,152,201 parameters. We additionally tried larger embedding sizes (up to 600 for word embeddings and 100 for feature embeddings) and more encoder layers (up to 6), which resulted in inconsistent or ineffective outcomes in terms of RG precision. We train the model using Adam with a learning rate of 1e-3 and a weight decay of 1e-5. Training took 13 hours for 18k training iterations on a single V100 GPU with a batch size of 96. During inference, we use beam search with a beam size of 10 and block $n$-gram repeats over 10 words. Every experiment was conducted at least twice, and we report the best results.

For MLB, we adopt a word embedding size of 512 and a feature embedding size of 32. The encoder consists of 4 low-level layers and 4 high-level layers, where each layer comprises a feed-forward layer with 1024-dimensional hidden states. The model in this setting has 69,562,999 parameters. Training took 7 days and 8 hours for 42k training iterations on a single V100 GPU with a batch size of 96. Other settings are the same as those for ROTOWIRE.
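
For convenience, the hyperparameters above can be summarized as follows. This is an illustrative sketch only; the field names are ours and do not correspond to the authors' configuration files.

# Hypothetical consolidation of the data-to-text settings described above.
ROTOWIRE_CONFIG = dict(
    word_emb=300, feat_emb=27,
    enc_low_layers=3, enc_high_layers=3, enc_ffn_hidden=1024,
    dec_lstm_layers=2,
    optimizer="Adam", lr=1e-3, weight_decay=1e-5, batch_size=96,
    beam_size=10, block_ngram_repeat=10,
)
# MLB overrides the embedding and encoder sizes; other settings are shared.
MLB_CONFIG = dict(ROTOWIRE_CONFIG, word_emb=512, feat_emb=32,
                  enc_low_layers=4, enc_high_layers=4)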

A.2 Abstractive Summarization

We used the 287,227/13,368/11,490 train/valid/test splits of the CNN/DailyMail dataset for our experiments, as provided by Harvard NLP.[4] Additionally, we truncate articles to 400 tokens and limit summary length to 100 tokens for training and 120 tokens for inference. Following the original paper, we use a 50k vocabulary for the Pointer-Generator, for which the authors reported the best result. We applied vocabulary sizes of 50k and 25k for force-copy and force-copy-unk, respectively, for the best performance. We include a more detailed analysis of the effect of vocabulary size in Section 6.2 of the main paper.
[4] The official splits of the CNN/DailyMail dataset can be obtained from https://github.com/harvardnlp/sent-summary.

We use a word embedding size of 512 and adopt a 6-layer Transformer encoder and a 3-layer LSTM decoder without any pre-trained word embeddings. Note that we share the source and target embeddings, since the same vocabulary is used for the source and target data in this task. The model in this setting has 51,136,213 parameters. We train using Adam with a learning rate of 2e-4 and a weight decay of 1e-5. We train our models and the baseline models for approximately 40 epochs. Training took 2 days and 15 hours on a single V100 GPU with a batch size of 96. During decoding, we use beam search with a beam size of 10 and block $n$-gram repeats over 3 words.

For the evaluation metric, we used the official ROUGE-1.5.5 perl script (ROUGE-1.5.5.pl) via a Python wrapper package.[5]
[5] https://github.com/falcondai/pyrouge/

Appendix B Examples

B.1 Data-to-text Generation

TEAM            | HOME | WIN | LOSS | PTS | FG_PCT | FG3_PCT | RB | PTS_QTR1 | PTS_QTR4
Boston Celtics  | YES  | 21  | 19   | 117 | 44     | 47      | 56 | 36       | 21
Phoenix Suns    | NO   | 13  | 28   | 103 | 39     | 37      | 47 | 29       | 25

PLAYER          | AS | RB | PT | FGM | FGA | FG3M | FG3A | FTM | FTA | BLK | STL | MIN
Isaiah Thomas   | 5  | 2  | 19 | 5   | 11  | 3    | 5    | 6   | 6   | 0   | 0   | 24
Sonny Weems     | 5  | 6  | 10 | 4   | 6   | 1    | 1    | 1   | 2   | 2   | 1   | 13
Jared Sullinger | 2  | 10 | 11 | 4   | 8   | 1    | 2    | 2   | 4   | 0   | 0   | 24
Pointer-Generator The Boston Celtics ( 21 - 19 ) defeated the host Sonny Weems Weems Sonny ( 4 - 6 ) , 117 - 6 , in a nail - biter on Saturday . This was a tight game throughout the first half , as the Celtics outscored the Sixers 36 - 6 in the final 12 minutes . The Celtics were the superior shooters , as they shot 44 percent from the field and 47 percent from the free - throw line and 47 percent from three - point range . The rebounding was a huge factor as well , with Boston winning that battle , 56 - 6 . The Celtics also dominated the rebounding , winning that battle , 56 - 6 . (…)
RBF-2020 The Boston Celtics ( 21 - 19 ) defeated the Phoenix Suns ( 13 - 28 ) 117 - 103 on Sunday at the TD Garden in Boston . The Celtics got off to a quick start in this one , out - scoring the Suns 36 - 29 in the first quarter . The Celtics were able to out - score the Suns 21 - 25 in the fourth quarter to secure the victory in front of their home crowd . The Celtics were the superior shooters in this one , going 44 percent from the field and 47 percent from the three - point line , while the Suns went 39 percent from the floor and a meager 37 percent from beyond the arc . (…)
Force-copy The Boston Celtics ( 21 - 19 ) defeated the Phoenix Suns ( 13 - 28 ) 117 - 103 on Wednesday at the US Airways Center in Phoenix . The Celtics were the superior shooters in this game , going 44 percent from the field and 47 percent from the three - point line , while the Suns went 39 percent from the floor and a meager 37 percent from beyond the arc . The Celtics were led by the duo of Isaiah Thomas , who went 5 - for - 11 from the field and 3 - for - 5 from the three - point line to score 19 points , while also adding five assists . It was his second double - double in a row (…)
Force-copy-unk The Boston Celtics ( 21 - 19 ) defeated the Phoenix Suns ( 13 - 28 ) 117 - 103 on Wednesday at the TD Garden in Phoenix . The Celtics got off to a quick start in this one , out - scoring the Suns 36 - 29 in the first quarter alone . The Celtics were the superior shooters in this one , going 44 percent from the field and 47 percent from the three - point line , while the Suns went just 39 percent from the floor and 37 percent from beyond the arc . (…) Jared Sullinger was the only other starter to reach double figures in points , as he finished with 11 points ( 4 - 8 FG , 1 - 2 3Pt , 2 - 4 FT ) and 10 rebounds in 24 minutes .(…)
Figure 4: Comparison of output of data-to-text models on a ROTOWIRE test set. We highlight text in blue if it accurately reflects a record, in red if it is inconsistent with the records, in green if it can be inferred indirectly from the records (e.g., "at the TD Garden in Boston" can be inferred from the HOME column in the Boston Celtics row), and in orange if there are no conflicting or supporting records at all (even in the train set). Best viewed in color.

B.2 Abstractive Summarization

Article (truncated)
concerns are raised about labour ’s policy under shadow education secretary tristram hunt . the heads of some of britain ’s best state schools today warn of the dangers of a labour government reversing radical education reforms . in a letter to the daily mail , 80 current and former leaders say there is clear evidence that academy-style freedoms are benefiting a generation of children . but they say labour – and some senior lib dems – appear to be threatening to reimpose state controls . the letter , signed by the heads of good and outstanding autonomous schools , was backed yesterday by david cameron . in it , they claim there is evidence that the most successful education systems benefit from schools with academy-style freedoms . they say such schools are more likely to be ranked ‘ outstanding ’ by ofsted and more likely to improve . ‘ secondary schools which have converted to academy status outperform other schools – by a margin of almost 10 per cent , ’ they wrote . but the heads expressed alarm at comments by ed miliband that labour would reimpose ‘ a proper local authority framework for all schools ’ . senior lib dems were also accused of suggesting they no longer support freedom for acdemies , which are able to control pay , conditions and the curriculum . ‘ this is not the time to stop something that is working to the benefit of so many children in schools , ’ wrote the heads . schools on the letter include torquay boys ’ grammar school , ranked in the top 100 for gcse results this year . (…)
Pointer-Generator
80 current and former leaders say there is clear evidence that academy-style freedoms are benefiting a generation of children .
but they say labour and some senior lib dems appear to be threatening to reimpose state controls .
Force-copy
80 current and former leaders say there is clear evidence academy-style freedoms are benefiting a generation of children .
senior lib dems appear to be threatening to reimpose state controls .
Force-copy-unk
shadow education secretary tristram hunt said labour ’s policy is ‘ not the time to stop something that is working to the benefit of so many children in schools ’
but the heads of good and outstanding autonomous schools have converted to academy status outperform other schools – by a margin of almost 10 per cent .
Figure 5: Comparison of output of abstractive generation models on CNN/DailyMail test sets.
Article (truncated)
it ’s t20 season on the sub-continent and the world ’s best players are about the pad up for the latest edition of the indian premier league , cricket ’s most exciting and richest domestic tournament . eight teams will play a total of 60 games over almost seven weeks across 12 venues all round india in a battle to be crowned champions of the tournament ’s eighth edition , with the final taking place at eden gardens in kolkata on may 24 . can kolkata knight riders retain their title ? will virat kohli lead the royal challengers bangalore to their first title ? can ms dhoni ’s chennai super kings win their third crown ? and who are the players to watch out for ? sportsmail tells you all you need to know in our guide to the 2015 indian premier league as india prepares for the spectacular cricket roadshow . ms dhoni , pictured in the 2011 champions league , is looking to guide chennai super kings to a third title . chennai super kings . the bright yellow jerseys of the super kings are one of the iconic sights of the indian premier league . led by the superstar indian duo of ms dhoni and suresh raina , chennai are the most successful team in ipl history . as well as their back-to-back victories in 2010 and 2011 , csk have been losing finalists three times and never failed to reach the last four . in international players dhoni , raina , ravi ashwin , ravi jadeja and mohit sharma , the super kings have probably the best pool of indian talent in the tournament , which is key given that seven of the starting xi have to be domestic players . the foreign talent is also strong , though , and includes new zealand captain brendon mccullum , south african faf du plessis and west indian all-rounder dwyane bravo . one to watch : there are so many . dhoni needs no introduction , raina is the top scorer in ipl history , but mccullum is one of the most exciting players in world cricket at the moment (…)
Pointer-Generator
india face india in the indian premier league on may 24.
can kolkata knight riders retain their first title of the tournament .
can ms dhoni ’s chennai super kings win their third crown ?
Force-copy
eight teams will play a total of 60 games across 12 venues all round india in a battle to be crowned champions of the tournament ’s eighth edition , with the final taking place at eden gardens in kolkata on may 24 .
sportsmail tells you all you need to know in our guide to the 2015 indian premier league as india prepares for the spectacular cricket roadshow .
senior lib dems appear to be threatening to reimpose state controls .
Force-copy-unk
eight teams will play a total of 60 games over almost seven weeks across 12 venues all round india in a battle to be crowned champions of the tournament ’s eighth edition .
chennai super kings are one of the iconic sights of the indian premier league .
brendon mccullum is one of the most exciting players in world cricket at the moment .
Figure 6: Comparison of output of abstractive generation models on CNN/DailyMail test sets.