SASS: Data and Methods for Subject Aware Sentence Simplification

Brad Windsor
[email protected]
Luke Martin
[email protected]

Anand Tyagi
[email protected]
All authors are equal contributors and NYU-affiliated.
Abstract

Sentence simplification tends to focus on the generic simplification of sentences by making them more readable and easier to understand. This paper provides a dataset aimed at training models that perform subject aware sentence simplification rather than simplifying sentences as a whole. We also test, on that dataset, models inspired by architectures used in abstractive summarization. We hand-generated portions of the data and augmented the dataset by further manipulating those hand-written simplifications. Our results show that data augmentation, data masking, and model architecture choices used in summarization provide a solid baseline for comparison on subject aware simplification.

1 Introduction

Sentence simplification aims to transform long, dense sentences into more accessible ones that are easier to understand. Current work focuses on this readability goal. As a result, the output sentences tend to retain as much of the information in the original sentence as possible, altering only the parts that contain challenging vocabulary or excess words.

An alternative worth exploring is simplification by topic. While making sentences easier to read helps people understand the contents of various documents, simplifying by topic allows different people to extract the information directly useful to them at the sentence level.

Simplification by topic has so far been explored in the area of summarization, where models exist that create summaries tailored to a specific topic, as shown in Wang et al. (2009) and Hennig (2009). However, the models currently available for topic-specific summarization typically require input documents longer than one sentence; topic-specific summarization is usually associated with multi-document summarization.

Current datasets used for evaluating sentence simplification models focus on the aforementioned goal of simplifying for ease of understanding. In this paper, we present the SASS (Subject Aware Sentence Simplification) dataset, which consists of sentences from a Yelp dataset yel (2015), simplified by one or more specified topics. This dataset can be used to test models that aim specifically to simplify sentences by topic, rather than focusing only on readability.

To find good candidates for the SASS dataset, we used the spaCy Honnibal et al. (2020) NER tagger to identify sentences which discussed our subjects of interest, and hand-wrote simplifications of those sentences. We augmented this dataset with additional entities mined from the corpus.

The original and augmented datasets are the first multi-topic sentence simplification datasets. After creating them, we used these datasets to study how well techniques for multi-topic summarization generalize to simplification. Our tests include encoder-decoder models following Liu and Lapata (2019) and the use of artificial tokens following Scarton and Specia (2018).

2 Related Work

Our work draws on earlier approaches to simplification, including the control of the degree of simplification, and on Subject-Aware problems in abstractive summarization.

2.1 Simplification process

Simplification is often a multi-step process with more than one model involved Zhu et al. (2010). Common steps in sentence simplification include:

  • Splitting long sentences into shorter ones

  • Dropping irrelevant information

  • Replacing words or phrases

Sentence simplification models tend to either process information in several separate pipeline stages Xu et al. (2015), or solve the full problem in an end-to-end neural model Zhang and Lapata (2017).

2.2 Controllable Sentence Simplification

Controllable sentence simplification is a new paradigm which aims to better control the degree of information omitted. Some examples:

  • Separating the Newsela corpus by grade level, and using an initial token in a seq2seq model to signify the target grade level Scarton and Specia (2018)

  • Training to produce a given compression ratio, degree of paraphrasing, or lexical complexity Martin et al. (2019)

Source sentence: Given the growing popularity of Indian cuisine, I am surprised that the Bombay Grill conglomerate (Green Street location, First Street location, Bombay Bazaar) have such a monopoly on Indian food in this town.
Cuisine simplified: Bombay Grill and Bombay Bazaar are Indian restaurants.
Location simplified: Bombay Grill is on Green Street and Bombay Bazaar is on First Street.
Table 1: Subject-aware sentence simplification

Source sentence: Out of all the Vietnamese spots in North Texas that I’ve tried, my absolute favorite is Pho Paseur in Arlington
Data augmentation: Out of all the Islamic spots in the Gold Coast that I’ve tried, my absolute favorite is Taj Mahal in Arlington
Data masking: Out of all the NORP0 spots in LOC0 that I’ve tried, my absolute favorite is ORG0 in LOC1
Table 2: Data augmentation and masking

2.3 Subject-Aware Summarization

Subject-aware summarization aims to condense a document while tailoring it for a specific purpose. One such problem is topic-aware abstractive text summarization Zheng et al. (2020), which attempts to leverage the underlying semantic structure of documents represented by their latent topics. In Fan et al. (2018), the authors propose methods that allow a user to restrict the length of the summary, ask for entity-specific or source-specific summaries, and summarize only specific portions of the text. Wang et al. (2020) mine topic-specific words by applying topic modelling to a large corpus and use these words as an input to an attention mechanism used in summary generation. Finally, there have been other attempts to incorporate more information into the summarization process in order to tailor the output to specific requests Baumel et al. (2018).

3 Methods

Code and data are available at https://github.com/bwindsor22/sentence-simp-target

3.1 Setup

3.1.1 Data Preparation

For ease of understanding, we chose the Yelp Reviews Dataset yel (2015) as our base. We used spaCy Honnibal et al. (2020) to pre-identify roughly 1500 candidate sentences which had entities marked Organization (ORG), Nationality (NORP), and Location (LOC). We call a simplification built around ORG and NORP entities a "Culinary" simplification and one built around ORG and LOC entities a "Location" simplification.

From these candidates, we hand-annotated 599 example sentences. Based on the entity tags in each sentence, we identified sentences that included two or more entities of any individual type. We then simplified each such sentence into two or more sentences, with each sentence containing one or more of the entities relevant to the topic.

Because sentences where both simplifications are possible are rare, our corpus also includes sentences where only one of the two simplifications is possible. See Table 3 for a summary of the annotation process.
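The mapping from entity labels to simplification types described above can be sketched as a small filter. This is a minimal illustration, not the authors' code: the name `eligible_subjects` and the `SUBJECT_LABELS` table are hypothetical, and the labels are assumed to have already been produced by an NER tagger.

```python
from typing import Dict, Iterable, List, Set

# Label sets that define each simplification type, following the text:
# ORG/NORP -> "Culinary", ORG/LOC -> "Location".
# The dictionary itself is an illustrative assumption.
SUBJECT_LABELS: Dict[str, Set[str]] = {
    "Culinary": {"ORG", "NORP"},
    "Location": {"ORG", "LOC"},
}


def eligible_subjects(entity_labels: Iterable[str]) -> List[str]:
    """Return the simplification types a sentence could support,
    given the NER labels found in it."""
    present = set(entity_labels)
    return [
        subject
        for subject, needed in SUBJECT_LABELS.items()
        if needed <= present  # all required labels occur in the sentence
    ]


# Labels as a tagger might report them for the Table 2 source sentence.
print(eligible_subjects(["NORP", "LOC", "ORG", "LOC"]))
print(eligible_subjects(["ORG", "NORP"]))
```

A sentence that yields an empty list would be discarded as a candidate under this sketch; the actual eligibility decisions in the dataset were made by human annotators.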

Augmenting data
To increase the volume of data, we augmented it by mining entity names from the remaining Yelp dataset and substituting them into the annotated sentences. First, a spaCy tagger was run over the dataset, extracting all entities and their corresponding tags. Next, for each row in the dataset, where a row consists of a source sentence and its simplifications, the entities in those sentences were replaced with ones sampled from elsewhere in the dataset. This produced the same sentence structure with different entities.

In the example presented in Table 2, an ORG, NORP, LOC tuple mined from one sentence is inserted into another. This approach is inspired by Wang et al. (2020)’s keyword mining. We used this strategy to grow the dataset to between 2 and 6 times the size of the original hand-annotated data.
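A minimal sketch of this substitution step follows, under stated assumptions: the function `augment_row`, the `entities` mapping, and the `pool` of mined entities are hypothetical names, the entity annotations are supplied by hand rather than by a tagger, and the target sentence is illustrative rather than taken from the dataset. The same replacement is applied to the source and to every simplification so that a row stays internally consistent.

```python
import random
import re
from typing import Dict, List, Tuple


def augment_row(
    source: str,
    targets: List[str],
    entities: Dict[str, str],
    pool: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, List[str]]:
    """Swap each entity for one with the same label sampled from a mined pool.

    entities maps entity text -> label for the entities in this row.
    pool maps label -> candidate replacement strings mined elsewhere.
    """
    if not entities:
        return source, list(targets)
    mapping = {text: rng.choice(pool[label]) for text, label in entities.items()}
    # One regex pass, longest entity first, so a replacement is never
    # itself rewritten by a later substitution.
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(mapping, key=len, reverse=True))
    )

    def swap(sentence: str) -> str:
        return pattern.sub(lambda m: mapping[m.group(0)], sentence)

    return swap(source), [swap(t) for t in targets]


source = (
    "Out of all the Vietnamese spots in North Texas that I've tried, "
    "my absolute favorite is Pho Paseur in Arlington"
)
targets = ["Pho Paseur serves Vietnamese food."]  # illustrative, not from the dataset
entities = {"Vietnamese": "NORP", "North Texas": "LOC", "Pho Paseur": "ORG"}
pool = {"NORP": ["Islamic"], "LOC": ["the Gold Coast"], "ORG": ["Taj Mahal"]}

new_source, new_targets = augment_row(source, targets, entities, pool, random.Random(0))
print(new_source)
print(new_targets)
```

With larger pools, calling the function several times per row with different random draws would give the 2X to 6X expansions, although the exact sampling scheme used for the dataset is not specified here.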

Masking data
We use the spaCy model to replace specific entities ("Islamic") with tags ("NORP0"). The spaCy model serves as a task-specific NER system for our dataset; we selected organizations, nationalities, and locations in part because spaCy recognizes these well.
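The masking step can be sketched as follows. This is an illustration under assumptions, not the released implementation: `mask_entities` is a hypothetical helper, and the character spans are written by hand to stand in for tagger output (with spaCy they could be read from each entity's `start_char`, `end_char`, and `label_`). Spans are assumed not to overlap, and repeated mentions of the same entity are assumed to share one tag.

```python
from typing import Dict, List, Tuple


def mask_entities(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) character span with a tag such as
    NORP0 or LOC1, numbering entities per label in order of appearance."""
    tags: Dict[Tuple[str, str], str] = {}  # (label, surface text) -> tag
    counts: Dict[str, int] = {}            # label -> next free index
    pieces: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        key = (label, text[start:end])
        if key not in tags:
            index = counts.get(label, 0)
            tags[key] = f"{label}{index}"
            counts[label] = index + 1
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(tags[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sentence = (
    "Out of all the Vietnamese spots in North Texas that I've tried, "
    "my absolute favorite is Pho Paseur in Arlington"
)
# Hand-specified annotations standing in for NER output.
annotations = [
    ("Vietnamese", "NORP"),
    ("North Texas", "LOC"),
    ("Pho Paseur", "ORG"),
    ("Arlington", "LOC"),
]
spans = [
    (sentence.index(surface), sentence.index(surface) + len(surface), label)
    for surface, label in annotations
]
print(mask_entities(sentence, spans))
```

Keeping the `tags` dictionary alongside the masked text would also allow the original entities to be restored in a model's output, should that be needed.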

Total Annotated                          599
  Culinary & Location Simplifications    261
  Culinary Simplifications Only          249
  Location Simplifications Only           49
Table 3: Raw data counts from ~1500 initial candidate sentences (~900 were marked ineligible/off topic)

3.1.2 Model

Figure 1: Pipeline for masking. A model replaces key terms with generic tags.
Figure 2: Two models, trained separately, each task-specific
Figure 3: One model using an artificial keyword to specify the style of output

For training, we use an encoder-decoder architecture for sequence generation, inspired by Liu and Lapata (2019)’s work in summarization. We use RoBERTa Liu et al. (2019) as our base model and the HuggingFace Transformers Wolf et al. (2020) library for the encoder-decoder implementation. Our hyperparameters are: batch size 8, 200 training epochs, and learning rate 0.001.

As simplification is often a multi-model process Zhu et al. (2010), one of our data preparation techniques included using a task-specific model to mask, as in Fig. 1.

Dataset                                Culinary  Location
Original                               0.28      0.26
Masked                                 0.48      0.43
Data Augmentation - 2X (100 epochs)    0.29      0.29
Data Augmentation - 6X (59 epochs)     0.30      0.30
Table 4: Model BLEU scores on datasets. Two scores are reported, one for each task (as in Fig. 2).

Model                                                              Avg BLEU
Two models, trained for separate tasks, original dataset (Fig. 2)  0.28
One model, with artificial token to specify task (Fig. 3)          0.43
Table 5: Results of model architecture analysis. BLEU score is weighted average of both tasks.

For training, we explore two model architectures, shown in Fig. 2 and Fig. 3. In the first, two models are trained for two different tasks, with no knowledge shared between them. In the second, we explore the use of task-specific tokens to specify the simplification style, following Scarton and Specia (2018).
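The second architecture amounts to a data-formatting step: one annotated row becomes one training pair per task, with an artificial token prepended to the input. The sketch below is only an illustration; the token strings `<culinary>` and `<location>` and the helper `build_examples` are assumptions, not the tokens used in the released code.

```python
from typing import Dict, List, Tuple

# Hypothetical artificial tokens, one per simplification task.
TASK_TOKENS: Dict[str, str] = {
    "Culinary": "<culinary>",
    "Location": "<location>",
}


def build_examples(source: str, simplifications: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn one annotated row into (input, target) pairs for a single shared
    model, marking the requested task with a leading artificial token."""
    return [
        (f"{TASK_TOKENS[task]} {source}", target)
        for task, target in simplifications.items()
    ]


# The row from Table 1.
row_source = (
    "Given the growing popularity of Indian cuisine, I am surprised that the "
    "Bombay Grill conglomerate (Green Street location, First Street location, "
    "Bombay Bazaar) have such a monopoly on Indian food in this town."
)
row_targets = {
    "Culinary": "Bombay Grill and Bombay Bazaar are Indian restaurants.",
    "Location": "Bombay Grill is on Green Street and Bombay Bazaar is on First Street.",
}
for model_input, target in build_examples(row_source, row_targets):
    print(model_input[:40], "->", target)
```

Under the first architecture, the same rows would instead be split by task and passed, without any token, to two separately trained models.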

3.1.3 Evaluation

Results are reported as average BLEU scores Papineni et al. (2002), computed against the target sentences using the NLTK implementation of BLEU.
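To make the metric concrete, the following is a simplified, pure-Python sketch of sentence-level BLEU averaged over a set of pairs. It is a stand-in for illustration only: the reported scores come from NLTK's implementation, whose smoothing options and defaults may give different numbers. The names `simple_bleu` and `average_bleu`, the whitespace tokenization, the single reference per sentence, and the absence of smoothing are all assumptions of this sketch.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: List[str], hypothesis: List[str], max_n: int = 4) -> float:
    """Unsmoothed BLEU with uniform n-gram weights against one reference."""
    if not hypothesis:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: an n-gram is credited at most as often as it
        # appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precision_sum += math.log(overlap / total)
    # Brevity penalty for hypotheses no longer than the reference.
    if len(hypothesis) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(hypothesis))
    return brevity * math.exp(log_precision_sum / max_n)


def average_bleu(pairs: List[Tuple[str, str]]) -> float:
    """Mean sentence BLEU over (target, prediction) string pairs."""
    scores = [simple_bleu(t.split(), p.split()) for t, p in pairs]
    return sum(scores) / len(scores)


print(simple_bleu("ORG0 is a NORP0 restaurant".split(), "ORG0 is a NORP0 place".split()))
```

Because short simplifications often share no 4-gram with the target, an unsmoothed score such as this one can drop to zero on individual sentences, which is one reason smoothing choices matter when comparing reported numbers.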

3.2 Results

Results are shown in Tables 4 and 5.

3.3 Analysis

We note the following observations from the training:

  • Models which have subject-specific expertise can significantly improve performance on the topic-specific simplification task. This is seen in the data masking performance, where simplified sentences frequently took forms such as "ORG0 is a NORP0 restaurant". By marking the information the model should attend to with ORG0/ORG1/etc. tags, we allowed the model to focus on the work of restructuring the sentence rather than on finding the correct entities.

  • Data masking works better for culinary simplifications than for location simplifications. We believe this is due to disagreement between the annotators and the spaCy model: "North America" and "2nd Floor Atrium" are examples of phrases that annotators labelled as locations and spaCy did not.

  • There are similar BLEU scores for both the Culinary and Location datasets. The degree of effectiveness is determined by the technique, rather than the subject. This raises hope that our methods would generalize well to other subjects.

  • Mining subject-specific entities from a larger corpus can also be used to improve the sentence simplification task. Substituting different restaurant names and cuisines during data augmentation proved useful in helping the model learn.

  • Data augmentation is a less useful preparation than data masking. The model does not learn that "Iranian" and "Greek" play the same role as easily as it does in a dataset where both are labelled "NORP0". Data masking helps the model generalize better across examples. However, we note that sufficient data augmentation approaches the performance of masking.

  • The paradigm of task-specific tokens, which was successful in tailoring simplifications to a reading level Scarton and Specia (2018), also works well for subject-specific simplification. On the original dataset, one model that learns both tasks outperforms two separate models. This is unsurprising given that our data is very scarce (well under 1K examples in each category). Allowing the model to see examples from the other task is similar to pre-training on domain-specific text.

4 Conclusion

In this paper, we introduced the SASS dataset, which allows models focused on topic-specific sentence simplification to be evaluated. The evaluation of our baseline encoder-decoder model showed that even a simple model can perform simplification based on a specified topic.

We presented a new dataset for this challenge, found evidence that augmentation techniques help in this domain, and showed that existing strategies in summarization also apply here. Our results help clarify a new approach to simplification, and shed light on how well techniques from other problems generalize.

Further work can be done to expand the dataset we introduced, both by manually tagging more sentences with a wider range of tags and by developing methods to automatically augment the dataset. We expect that paraphrasers such as Quillbot qui (2020) could further augment sentences, or that entity lists such as YAGO Suchanek et al. (2007) could be used to further fill out sentences.

Performing subject aware sentence simplification allows texts to be simplified not only in a manner that is more easily understood, but also in a way that is relevant to the individual reader. We hope that the SASS dataset can be used to evaluate new models created specifically for this purpose and to further improve the overall area of sentence simplification.

References

  • yel (2015) 2015. Yelp dataset challenge.
  • qui (2020) 2020. Quillbot.
  • Baumel et al. (2018) Tal Baumel, Matan Eyal, and Michael Elhadad. 2018. Query focused abstractive summarization: Incorporating query relevance, multi-document coverage, and summary length constraints into seq2seq models.
  • Fan et al. (2018) Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization.
  • Hennig (2009) Leonhard Hennig. 2009. Topic-based multi-document summarization with probabilistic latent semantic analysis. International Conference Recent Advances in Natural Language Processing, RANLP.
  • Honnibal et al. (2020) Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
  • Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.
  • Martin et al. (2019) Louis Martin, Benoît Sagot, Eric de la Clergerie, and Antoine Bordes. 2019. Controllable sentence simplification. arXiv preprint arXiv:1910.02677.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Scarton and Specia (2018) Carolina Scarton and Lucia Specia. 2018. Learning simplifications for specific target audiences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 712–718, Melbourne, Australia. Association for Computational Linguistics.
  • Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In 16th International Conference on the World Wide Web, pages 697–706.
  • Wang et al. (2009) Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong. 2009. Multi-document summarization using sentence-based topic models. pages 297–300.
  • Wang et al. (2020) Li Wang, Junlin Yao, Yunzhe Tao, Li Zhong, Wei Liu, and Qiang Du. 2020. A reinforced topic-aware convolutional sequence-to-sequence model for abstractive text summarization.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Xu et al. (2015) Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
  • Zhang and Lapata (2017) Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. arXiv preprint arXiv:1703.10931.
  • Zheng et al. (2020) Chujie Zheng, Kunpeng Zhang, Harry Jiannan Wang, and Ling Fan. 2020. Topic-aware abstractive text summarization.
  • Zhu et al. (2010) Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1353–1361.