Product Title Generation for Conversational Systems using BERT
Abstract
Through recent advancements in speech technology and the introduction of smart devices, such as Amazon Alexa and Google Home, increasing numbers of users are interacting with applications through voice. E-commerce companies typically display short product titles on their webpages, either human-curated or algorithmically generated, when brevity is required; however, these titles are dissimilar from natural spoken language. For example, “Lucky Charms Gluten Free Breakfast Cereal, 20.5 oz a box Lucky Charms Gluten Free” is acceptable to display on a webpage, but “a 20.5 ounce box of lucky charms gluten free cereal” is easier to comprehend over a conversational system. Compared to display devices, where images and detailed product information can be presented to users, short product titles are necessary when interfacing with voice assistants. We propose a sequence-to-sequence approach using BERT to generate short, natural, spoken-language titles from input web titles. Our extensive experiments on a real-world industry dataset and human evaluation of model outputs demonstrate that BERT summarization outperforms comparable baseline models.
1 Introduction
Smartphones and voice-activated smart speakers, such as Amazon Alexa, Google Home, and Apple Siri, have led to increased adoption of voice-enabled shopping experiences. In such experiences, reducing user friction and saving time is key, especially for low-consideration purchases and repeat grocery purchases. Display-based experiences typically present a product with a short product title, but these short titles do not naturally fit into a typical conversational flow. For example, showing “Jergens Natural Glow Daily Moisturizer, Medium to Tan, 7.5 oz” as a short product title may be acceptable in a display-based experience, but it is not suitable for voice-based applications because it is not a naturally spoken title. At the same time, display-based experiences have the added benefit of being able to show a large amount of additional metadata, which is not possible in conversational systems. The product title for a conversational system needs to encapsulate the important information in a succinct, grammatically correct natural language sentence, such that it fits naturally into the conversational flow of the dialogue rounds between the user and the system. E-commerce companies can have millions to billions of products in their ever-changing catalogs, so it is typically prohibitively expensive to manually annotate naturally spoken titles for all products. In such industry settings, it is ideal to have a model that can generate naturally spoken titles for an evolving catalog. The primary goal of this work is to examine the methods and challenges involved in converting short product titles into naturally spoken, grammatically correct language.
This problem is similar to the text summarization task, which is well studied in natural language processing (NLP). However, applying it to e-commerce is not trivial. Text summarization approaches can be classified into two broad sub-categories: extractive text summarization and abstractive text summarization. Extractive approaches usually try to extract a few sentences (or keywords) from lengthy documents [Dorr et al., 2003, Neto et al., 2002, Nallapati et al., 2016b]. To generate natural language titles from web titles, the model must be able to generate conjunctions, articles, etc., at the appropriate positions.
Abstractive text summarization attempts to understand the content of the document and produce summaries which may contain novel words or phrases. Recurrent Neural Network (RNN) [Hochreiter and Schmidhuber, 1997, Cho et al., 2014] based sequence-to-sequence (seq2seq) models [Sutskever et al., 2014] and the more recently developed attention models [Vaswani et al., 2017] have been shown to perform well on abstractive summarization tasks [Nallapati et al., 2016a]. However, these models tend to generate repetitive words, which can cause a negative customer experience in voice e-commerce applications. For traditional text summarization tasks we can introduce novel words which are not part of the input sequence, but in an e-commerce setting, paraphrasing factual details such as brand or quantity may imply a different product. In addition, new products are added to e-commerce product catalogs continuously, which can introduce out-of-vocabulary words; the summarization model should be able to generalize to these words.
In this paper, we investigate the application of text summarization techniques to voice-based e-commerce applications. Our major contributions are:
1. We adapt different state-of-the-art NLP models to a real-world e-commerce dataset with limited labels.
2. We perform extensive evaluation of these models on established evaluation metrics, as well as on metrics relevant to our application. We also perform a human judgement evaluation.
The following sections provide a summary of related work, followed by a description of the methods applied to convert web-based short product titles (sequences of English words) into more naturally spoken titles (sequences of English words) for voice-based applications. In our problem setting, we are primarily interested in building an abstractive text summarization model that can generate novel words in the decoded summary. Section 4 provides the salient features of the dataset and the implementation details of the methods described earlier. This is followed by a discussion of the results and the conclusion.
2 Related Work
Text summarization is a long-studied problem in natural language processing. With the advent of deep learning based approaches, seq2seq models have proven highly successful in abstractive text summarization. Some of these models and relevant developments in the field are mentioned below.
Ptr-Net [See et al., 2017] builds on the seq2seq model with attention for the summarization task. It uses the pointer network concept introduced by Vinyals et al. [Vinyals et al., 2015] to decide which words from the source text should be copied directly into the summary. This helps preserve important factual information from the input text and also assists in handling out-of-vocabulary words. The Ptr-Net model also adds a coverage loss, which penalizes the model for repeatedly attending to the same source positions across decoding steps, in an attempt to fix word repetition, a persistent issue in seq2seq models. Gehrmann et al. [Gehrmann et al., 2018] try to improve the fluency of the generated text through various constraints applied during model training: soft constraints on text size are used to limit the length of generated descriptions, while constraints on the output probability distribution of words ameliorate word repetition.
Developments in language models have subsequently led to increased use of pre-trained models such as BERT [Devlin et al., 2018] and GPT-2 [Radford et al., 2019], which are trained on huge text corpora and are used to generate the word embeddings for input texts. While Khandelwal et al. [Khandelwal et al., 2019] examine the feasibility of pre-trained language models in a low-data setting and move away from a seq2seq framework, Liu and Lapata [Liu and Lapata, 2019] use BERT in a seq2seq model to summarize text. Details of this model are discussed further in Section 3. In our study, this is the primary model adapted for our use case.
Text summarization finds a natural application in e-commerce, where products may have long descriptions but the end user is interested only in the salient features of a particular product. Increased interaction with mobile devices and voice-based interfaces such as Amazon Alexa and Google Home presents new challenges, as the product title now needs to be succinct. Rule-based methods have been developed, as in Camargo de Souza et al. [Camargo de Souza et al., 2018]. Deep learning based methods find natural application here and attempt to use multi-modal information (images and text) to generate product titles. Chen et al. [Chen et al., 2019] generate personalized product titles utilizing user personas and an external knowledge base. Mathur et al. [Mathur et al., 2018] generate titles in different languages for the same product. Sun et al. [Sun et al., 2018] develop further on Ptr-Net, using a separate encoder network for important attributes such as quantity and brand, and then using two pointers to decide where to copy data from. Zhang et al. [Zhang et al., 2019] propose a novel method for generating short descriptions for the e-commerce use case using multimodal information: a word sequence describing the product and the product image from the catalog are used as descriptors to generate short titles. An adversarial network is then used to decide whether a generated title is machine- or human-generated, improving the quality of generated titles and making them more human-like.
3 Methods
In this section, we formulate the problem of automatic natural language title generation and discuss various approaches that can be used to solve this problem.
Problem Definition: The goal of this task is to build a system that can automatically generate natural language product titles which are easily interpretable in a voice-enabled shopping experience. Given a short web title represented as a sequence of words $x = (x_1, x_2, \ldots, x_n)$, the goal of the system is to generate the corresponding natural language title $y = (y_1, y_2, \ldots, y_m)$.
In the following sub-sections, we discuss various models which we apply for the automatic natural language title generation task.
3.1 seq2seq + Attention
Consider a sequence of input tokens $x = (x_1, \ldots, x_n)$ fed into an encoder (LSTM), producing a sequence of encoder hidden states $h_1, \ldots, h_n$. The decoder receives the word embedding of the previous word and has a decoder state $s_t$ at time step $t$. The attention distribution $a^t$ [Bahdanau et al., 2014] is computed as shown in the following equations:

$$e_i^t = v^{\top} \tanh(W_h h_i + W_s s_t + b_{attn}) \qquad (1)$$

$$a^t = \mathrm{softmax}(e^t) \qquad (2)$$

where $v$, $W_h$, $W_s$, and $b_{attn}$ are learnable parameters. The attention distribution is used to produce a weighted sum of the encoder hidden states, known as the context vector $h_t^*$, computed by:

$$h_t^* = \sum_i a_i^t h_i \qquad (3)$$

The context vector $h_t^*$ and decoder state $s_t$ are fed through linear layers to obtain the vocabulary distribution $P_{vocab}$. The network is trained end-to-end using the negative log-likelihood of the target word $w_t^*$ at each timestep:

$$\mathrm{loss} = \frac{1}{T} \sum_{t=0}^{T} -\log P(w_t^*) \qquad (4)$$
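As an illustrative, non-authoritative sketch of Equations 1-3, the following PyTorch-style function (with hypothetical tensor names, a single decoder step, and no batching) shows how the attention distribution and context vector could be computed:

```python
import torch
import torch.nn.functional as F

def attention_step(h, s_t, W_h, W_s, b_attn, v):
    """Compute the attention distribution a^t and context vector h*_t
    for a single decoder time step (Equations 1-3).

    h:     encoder hidden states, shape (n, d_enc)
    s_t:   decoder state at step t, shape (d_dec,)
    W_h:   (d_attn, d_enc); W_s: (d_attn, d_dec); b_attn, v: (d_attn,)
    """
    # e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)           (Eq. 1)
    e_t = torch.tanh(h @ W_h.T + s_t @ W_s.T + b_attn) @ v   # shape (n,)
    # a^t = softmax(e^t)                                      (Eq. 2)
    a_t = F.softmax(e_t, dim=0)
    # context vector h*_t = sum_i a_i^t h_i                   (Eq. 3)
    context = a_t @ h                                         # shape (d_enc,)
    return a_t, context
```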
3.2 Ptr-Net
Ptr-Net [See et al., 2017] is a hybrid between the baseline seq2seq with attention and a pointer network. In this model, the generation probability $p_{gen}$ (the probability of generating a new word rather than copying one from the input) depends on the context vector $h_t^*$, the decoder state $s_t$, and the decoder input $x_t$, and is computed as follows:

$$p_{gen} = \sigma\left(w_{h^*}^{\top} h_t^* + w_s^{\top} s_t + w_x^{\top} x_t + b_{ptr}\right) \qquad (5)$$

where $w_{h^*}$, $w_s$, $w_x$, and $b_{ptr}$ are learnable parameters and $\sigma$ is the sigmoid function. $p_{gen}$ is used to choose between generating a word from the vocabulary by sampling from $P_{vocab}$, or copying a word from the input sequence by sampling from the attention distribution $a^t$. The final probability distribution over the vocabulary is computed by:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t \qquad (6)$$

The model is trained end-to-end, as in seq2seq + attention, using the negative log-likelihood of the target word $w_t^*$ as the loss function.
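A minimal sketch of how the final distribution of Equation 6 could be formed from the generation and copy distributions is given below; the variable names are hypothetical and this is not the implementation of See et al.:

```python
import torch

def final_distribution(p_vocab, a_t, p_gen, src_ids):
    """Combine the generation and copy distributions into P(w) (Equation 6).

    p_vocab: (vocab_size,) distribution over the vocabulary
    a_t:     (n,) attention weights over the n source tokens
    p_gen:   scalar generation probability from Equation 5
    src_ids: (n,) long tensor with the vocabulary id of each source token
    """
    p_final = p_gen * p_vocab
    # Add (1 - p_gen) * a_i^t onto the vocabulary entry of each source word,
    # accumulating weight when a word occurs more than once in the input.
    p_final = p_final.scatter_add(0, src_ids, (1.0 - p_gen) * a_t)
    return p_final
```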
3.3 Transformer
Transformers [Vaswani et al., 2017] are attention-based models in which the relationship between a given word and its context is modeled through multi-head attention. Each transformer layer consists of multi-head attention ($\mathrm{MHAtt}$) followed by layer normalization and a feed-forward network ($\mathrm{FFN}$), as shown in Equations 7 and 8:

$$\tilde{h}^l = \mathrm{LayerNorm}\left(h^{l-1} + \mathrm{MHAtt}(h^{l-1})\right) \qquad (7)$$

$$h^l = \mathrm{LayerNorm}\left(\tilde{h}^l + \mathrm{FFN}(\tilde{h}^l)\right) \qquad (8)$$

The final-layer representation from the encoder is given to the decoder, and the decoder is trained using the negative log-likelihood of the target word $w_t^*$.
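The following PyTorch sketch shows one encoder layer following Equations 7 and 8; the hyper-parameter values are illustrative defaults rather than the exact configuration used in our experiments:

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer following Equations 7 and 8: multi-head attention,
    then a feed-forward network, each wrapped with a residual connection
    and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h):                      # h: (batch, seq_len, d_model)
        attn_out, _ = self.attn(h, h, h)       # MHAtt(h^{l-1})
        h_tilde = self.norm1(h + attn_out)     # Eq. 7
        return self.norm2(h_tilde + self.ffn(h_tilde))  # Eq. 8
```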
3.4 BERT

We use BERT (Bidirectional Encoder Representations from Transformers) [Devlin et al., 2018] to encode the web titles. BERT is trained on large text corpora using unsupervised tasks (masked language modelling and next sentence prediction). BERT uses token, segment, and position embeddings to represent input tokens. Segment embeddings are used in pairwise tasks to differentiate between segments (e.g., question and answer in SQuAD tasks). This input representation is then fed into multiple transformer layers.
Liu and Lapata [Liu and Lapata, 2019] modify the BERT formulation for the text summarization task. A [CLS] token is used at the start of each sentence, and the representation of this token is used as the sentence representation. $E_A$ and $E_B$ are used as segment embeddings for odd and even sentences, respectively. Position embeddings for sequences longer than 500 words are learnt as model parameters. We adopt this format to represent our data. For our use case, web product titles are one sentence long, hence the segment embedding is not used.
We insert [CLS] and [SEP] tokens into the input web title as shown in Figure 1. The input representation for the transformer is then prepared by adding the token ($TE$), position ($PE$), and segment ($SE$) embeddings of the corresponding words:

$$h_i^0 = TE(w_i) + PE(i) + SE(w_i) \qquad (9)$$
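To make Equation 9 concrete, the following minimal PyTorch sketch (with hypothetical module and parameter names, not our exact implementation) assembles such an input representation:

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Minimal sketch of Equation 9: the representation of each token is the
    sum of its token, position, and segment embeddings."""

    def __init__(self, vocab_size, max_pos=512, d_model=768):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # TE
        self.pos = nn.Embedding(max_pos, d_model)      # PE
        self.seg = nn.Embedding(2, d_model)            # SE (E_A / E_B)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len), already including the
        # [CLS] and [SEP] tokens shown in Figure 1; single-sentence web
        # titles use a single segment id throughout.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions) + self.seg(segment_ids)
```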
The encoder transforms this input representation through transformer layers, applying the transformations of Equations 7 and 8, with the representation from Equation 9 serving as $h^0$. The continuous representation from the final layer of BERT is then given to the decoder. While pre-trained BERT is used as the encoder, an 8-layer, randomly initialized transformer is used as the decoder. We train the decoder to generate summaries from the true labels in the ground truth data, following the abstractive summarization framework of [See et al., 2017] without the coverage and copy mechanisms. We use the Adam optimizer with the following learning rate schedules for the encoder ($\mathcal{E}$) and decoder ($\mathcal{D}$) [Liu and Lapata, 2019]:

$$lr_{\mathcal{E}} = \tilde{lr}_{\mathcal{E}} \cdot \min\left(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_{\mathcal{E}}^{-1.5}\right) \qquad (10)$$

$$lr_{\mathcal{D}} = \tilde{lr}_{\mathcal{D}} \cdot \min\left(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}_{\mathcal{D}}^{-1.5}\right) \qquad (11)$$

We set the base learning rates ($\tilde{lr}_{\mathcal{E}}$, $\tilde{lr}_{\mathcal{D}}$) and warmup steps ($\mathrm{warmup}_{\mathcal{E}}$, $\mathrm{warmup}_{\mathcal{D}}$) separately for the encoder and decoder.
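As an illustration of Equations 10 and 11, a minimal sketch of the warmup-then-decay schedule is shown below; the base learning rates and warmup values in the usage example are placeholders, not the exact values used in our experiments.

```python
def noam_lr(step, base_lr, warmup):
    """Warmup-then-decay schedule of Equations 10 and 11, applied separately
    to the encoder and decoder with their own base_lr and warmup steps."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)

# Hypothetical usage with placeholder values: the pre-trained encoder and
# the randomly initialized decoder get separate schedules.
encoder_lr = noam_lr(step=1000, base_lr=2e-3, warmup=20000)
decoder_lr = noam_lr(step=1000, base_lr=0.1, warmup=10000)
```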
4 Experimental Setup
4.1 Dataset
We use a proprietary dataset from one of the largest e-commerce retailers in the world. Our dataset consists of only 19,269 pairs of web product titles and their corresponding voice titles. Details of the dataset are provided in Table 2. The web product titles are either entered by merchants or algorithmically generated for certain categories when products are published on the e-commerce website. The voice titles for the corresponding web titles are manually created by human annotators through a crowdsourcing platform. Some examples of web titles and corresponding voice titles are shown in Table 1.
Table 1: Examples of web titles and their corresponding voice titles.

Web titles (Input Sequence of Words) | Voice titles (Output Sequence of Words)
---|---
El Monterey Beef & Cheese Burritos 8 ct bag a family size pack El Monterey Beef and cheese | a family size pack of 8 El Monterey Frozen Beef And Cheese Burrito |
Paas Magical Color Cup Egg Decorating Kit a pack Paas Magical Color Cup | a pack of Paas Magical Color Cup Egg Decorating Kit |
Wonderful Roasted & Salted Pistachios 8 oz. Bag a bag Wonderful Roasted and salted | an 8 ounce bag of Wonderful Roasted And Salted Pistachios |
There are certain key differences in the characteristics of web titles and voice titles. Important distinctions are listed below:
- Web titles often contain abbreviations for units of measurement for succinctness, e.g., Row 3 in Table 1 mentions “8 oz. Bag”. However, the voice title should contain the corresponding natural language word “ounce”.
- Web titles may or may not contain articles, but voice titles need to have grammatically correct articles, conjunctions, etc. For example, refer to Row 1 in Table 1.
- Web titles sometimes contain specific product attributes, such as brand or quantity. These attributes may appear at different positions in the voice title, but the attribute phrase needs to be retained exactly in its entirety.
- As shown in Table 2, the average voice title length is about 11.4 tokens. Voice titles need to be short and succinct, as this information is spoken through a voice device to the end-user.
Table 2: Dataset statistics.

Statistic | Value
---|---
avg. web title length | 15.3352
avg. voice title length | 11.3886
avg. # unique words in web title | 13.0805
avg. # unique words in voice title | 11.3009
avg. # novel 1-grams in voice title | 4.0138
# train examples | 13874
# val examples | 1926
# test examples | 3469
We observed that for a few products the web titles lack specific details such as brand; hence we append additional product metadata, when available, to the web title. This metadata contains attributes such as brand, container type, and size.
The dataset is randomly partitioned into 13874 train examples, 1926 validation examples, and 3469 test examples.
4.2 Implementation Details
We use the PyTorch ‘bert-base-uncased’ version of BERT for the encoder along with its subword tokenizer (https://github.com/nlpyang/PreSumm/). In the decoder, the transformer has 768 hidden units and 8 layers, while the feed-forward layers have size 2,048. The learning rate schedule is as described in Section 3, the batch size is 256, and the model was trained for 35,000 steps. Beam search is used for decoding, which continues until the end-of-sequence token is emitted; we also block repeated trigrams [Liu and Lapata, 2019]. We use a minimum decoding length of 4 and a maximum decoding length of 50. A checkpoint was saved every 2,000 steps, and the checkpoint performing best on the validation data is used to report performance on the test data.
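As a concrete illustration of the trigram-blocking heuristic applied during beam search, the simplified stand-alone sketch below (not the PreSumm implementation) flags any hypothesis that contains a repeated trigram so it can be discarded:

```python
def has_repeated_trigram(tokens):
    """Return True if the candidate token sequence contains any trigram more
    than once; such beam hypotheses are discarded during decoding."""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return len(trigrams) != len(set(trigrams))

# Example: the second candidate repeats "pack of 12" and would be blocked.
print(has_repeated_trigram("a pack of 12 yoo hoo chocolate".split()))   # False
print(has_repeated_trigram("a pack of 12 pack of 12 yoo hoo".split()))  # True
```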
We compare the BERT model with seq2seq, Ptr-Net, Ptr-Net + Coverage, and the Transformer model as baselines. For seq2seq, Ptr-Net, and Ptr-Net + Coverage, the implementation of [See et al., 2017] was used to generate the results (https://github.com/abisee/pointer-generator).
The implementation details of the various baselines are provided below:
- seq2seq: The Stanford CoreNLP PTBTokenizer is used to tokenize the data, which is converted into the story format of the popular CNN/Daily Mail dataset [Nallapati et al., 2016a] for text summarization. We use the authors' default parameters and set the maximum encoder length to 50 and the maximum decoder length to 35, as these are the corresponding maximum lengths of the augmented web titles and voice titles, respectively. When decoding test data, beam search with a beam size of 4 is used to generate the predicted title, with a minimum prediction length of 5.
- Ptr-Net: The pointer network uses the same parameters as the seq2seq model, and the validation set is used to identify the optimal training checkpoint.
- Ptr-Net + Coverage: We use the authors' default implementation and train the model in a two-step process. First, the pointer network is trained without any coverage loss, and the best model is selected using the validation loss. We then add the coverage loss term and continue training from this best model.
- Transformer: We use a 6-layer transformer encoder with a hidden size of 512 and 2,048-dimensional feed-forward layers. For the decoder, we use the same configuration as for BERT. The learning rate and other hyper-parameters are taken from [Liu and Lapata, 2019].
4.3 Evaluation Metrics
We use the following metrics to evaluate the above proposed model and baselines:
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [Lin, 2004]: measures overlap between the candidate summary and the ground truth using precision, recall, and F-1 scores. However, ROUGE does not give a clear picture of repetitions or duplicates in the generated summary. We report F-1 ROUGE scores at 1, 2, and L (Longest Common Subsequence):
  - ROUGE-1 refers to the overlap of unigrams.
  - ROUGE-2 refers to the overlap of bigrams.
  - ROUGE-L naturally takes into account sentence-level structural similarity and automatically identifies the longest co-occurring in-sequence n-grams.
- Avg. # duplicate 1-grams: the number of duplicate one-grams between the ground truth summary and the candidate summary (see the sketch after this list).
- Human evaluation: 3 human annotators were provided 100 random samples of model-output titles and asked to rate each title on a scale of 1-5 based on product relevance (i.e., whether the product is the same), grammatical correctness, and correct preservation of important attributes (such as brand name, quantity, and unit of measurement). The average score of the annotators is taken as the judgement score for the model. This evaluation was performed only for the Transformer and BERT abstractive model outputs.
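The sketch below shows one way the duplicate 1-gram statistic could be computed; the helper names are ours, and counting repeated unigrams within each generated title is our interpretation of the metric rather than a definitive specification.

```python
from collections import Counter

def num_duplicate_unigrams(title_tokens):
    """Count repeated occurrences of unigrams in a generated title, e.g.
    ['a', 'pack', 'of', 'produce', 'produce'] contributes 1 duplicate for
    the second 'produce'."""
    counts = Counter(title_tokens)
    return sum(c - 1 for c in counts.values())

def avg_duplicate_unigrams(predicted_titles):
    """Average number of duplicate unigrams over a set of model outputs."""
    return sum(num_duplicate_unigrams(t.split())
               for t in predicted_titles) / len(predicted_titles)
```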
5 Results and Observations
Table 3: Summary of model performance on the evaluation metrics (X indicates that human evaluation was not performed for that model).

Method | R-1 | R-2 | R-L | avg. # of duplicates | Human Evaluation
---|---|---|---|---|---
seq2seq + attention | 0.7951 | 0.6607 | 0.7883 | 0.257135 | X |
Ptr-Net | 0.8965 | 0.8053 | 0.8956 | 0.239839 | X |
Ptr-Net with Coverage | 0.8917 | 0.8042 | 0.8800 | 0.3041 | X |
Transformer | 0.92715 | 0.85812 | 0.92012 | 0.187950 | 4.19 |
BERT Abstractive | 0.92298 | 0.84092 | 0.91383 | 0.17262 | 4.35
Table 3 provides a summary of model performance on the different evaluation metrics. We observe that the transformer and BERT models, with ROUGE-1 F1 scores of 0.927 and 0.923 respectively, outperform the seq2seq-based approaches in terms of both the ROUGE and avg. # duplicates metrics. The Ptr-Net model outperforms the basic seq2seq + attention approach; paradoxically, however, Ptr-Net with coverage underperforms the plain Ptr-Net model. The coverage loss is intended specifically to address word repetition, yet instead of fixing it, it increases the avg. # of duplicates from 0.240 to 0.304. While the transformer model has a slightly better ROUGE score than the BERT model, BERT produces fewer repeated words in its output (0.173 compared to 0.188), which has a greater qualitative impact on the titles. Since BERT is pre-trained on a large corpus, it should also generalize better, especially in low-data scenarios. Human evaluation of the generated titles reinforces this, showing that BERT outperforms the transformer-based model on output quality.
Table 4: Examples where the BERT abstractive model performs well.

# | Web title | Ground truth | Model prediction
---|---|---|---
1 | White Onions2 lbs a bag premium | a 2 pound bag of white onions | a 2 pound bag of white onions |
2 | Great Value Large Grade AA, 6 Eggs a carton Great Value Large Grade AA | a half dozen carton of Great Value Large Grade AA Eggs | a 6 count carton of great value large grade aa eggs |
3 | Lucky Charms Gluten Free Breakfast Cereal, 20.5 oz a box Lucky Charms Gluten Free | a 19.3 ounce box of Lucky Charms Gluten Free Cereal | a 20.5 ounce box of lucky charms gluten free cereal |
4 | Hostess Donettes Frosted Mini Donuts, 6 ct, 3 oz a pack hostess donettes | a pack of 6 hostess donettes mini donuts | a pack of 6 hostess donettes frosted mini donuts |
5 | Pork Butt Steaks Large, Tray, 3.1 - 5.1 lbs a tray | a 34.4 ounce tray of pork butt | a 3.1 to 5.1 pound tray of pork butt steaks |
6 | Pork Cube Steaks, Tray, 0.45 - 1.35 lbs a tray | a 12 ounce tray of ground pork | a 1.35 pound tray of pork cubes |
Table 5: Examples where the BERT abstractive model performs poorly.

# | Web title | Ground truth | Model prediction
---|---|---|---
1 | Yoo-Hoo Chocolate Milk Fridge Pack, 12 pk a pack Yoo Hoo Chocolate Fridge Pack | a 12 count pack of yoo hoo chocolate milk fridge pack | a pack of 12 yoo hoo chocolate bar |
2 | Produce Unbranded Eat Your Vegetables Blend 7 Oz a pack | a 7 ounce pack of veggie blend snack | a 7 ounce pack of produce produce produce baby food blend |
3 | Diet 7UP, 0.5 L, 6 pack a pack | a 6 pack of .5 liter diet 7up | a 6 pack of 0.5 fluid ounce diet 7 up |
4 | Garlic, each (1 bulb) a garlic | garlic sold individually | a pound of garlic sold individually |
5 | Harvestland Chicken Breast, 1.5-2 lbs. a tray perdue harvestl and free range | a 1.5 to 2 pound tray of Perdue Harvestland Free Range Chicken Breasts | a 1.5 to 1.5 pound tray of perdue harvestland free range chicken breast |
Table 4 and Table 5 list examples of BERT abstractive model predictions on the test dataset where the model performs well and poorly, respectively. The corresponding web titles and ground truth are also provided for comparison. The model repeats words in certain cases, for example Row 2 of Table 5. Given that most of the training data covers products with ounce and pound as the units of measurement, liter is incorrectly converted to fluid ounce in Row 3 of Table 5, and pound is added to a single bulb of garlic in Row 4 of Table 5. However, in some cases the model clearly does better than the ground truth and even fixes incorrect quantities in the ground truth (Rows 3-6 in Table 4). The model is able to add important attributes, such as frosted in Row 4 of Table 4. From Row 2 of Table 4 we see that the model retains the brand name great value verbatim and provides the correct unit of measurement for different products, such as 6 count or pack of 6. Thus the model fulfills the requirement of preserving quantity and brand between web and voice titles.
We observe that, overall, the BERT-based model performs better both quantitatively and qualitatively, maintaining factual details in the output title while reducing repeated words.
6 Conclusion
In this paper, we studied the problem of generating succinct, grammatically correct voice titles for products in a large e-commerce catalog with limited labels. We evaluated four baselines and demonstrated, through ROUGE metrics and human evaluation, that BERT summarization can generate good titles even with extremely limited data. Generating personalized titles for different user segments based on rich user metadata, and incorporating additional product attributes from web data, are directions for extending this work.
References
- [Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate.
- [Camargo de Souza et al., 2018] José G. Camargo de Souza, Michael Kozielski, Prashant Mathur, Ernie Chang, Marco Guerini, Matteo Negri, Marco Turchi, and Evgeny Matusov. 2018. Generating e-commerce product titles and predicting their quality. In Proceedings of the 11th International Conference on Natural Language Generation, pages 233–243, Tilburg University, The Netherlands, November. Association for Computational Linguistics.
- [Chen et al., 2019] Qibin Chen, Junyang Lin, Yichang Zhang, Hongxia Yang, Jingren Zhou, and Jie Tang. 2019. Towards knowledge-based personalized product description generation in e-commerce. arXiv preprint arXiv:1903.12457.
- [Cho et al., 2014] KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.
- [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [Dorr et al., 2003] Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. Technical report, University of Maryland, College Park, Institute for Advanced Computer Studies.
- [Gehrmann et al., 2018] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780, November.
- [Khandelwal et al., 2019] Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, and Lukasz Kaiser. 2019. Sample efficient text summarization using a single pre-trained transformer. arXiv preprint arXiv:1905.08836.
- [Lin, 2004] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- [Liu and Lapata, 2019] Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.
- [Mathur et al., 2018] Prashant Mathur, Nicola Ueffing, and Gregor Leusch. 2018. Multi-lingual neural title generation for e-commerce browse pages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 162–169, New Orleans - Louisiana, June. Association for Computational Linguistics.
- [Nallapati et al., 2016a] Ramesh Nallapati, Bing Xiang, and Bowen Zhou. 2016a. Sequence-to-sequence rnns for text summarization. CoRR, abs/1602.06023.
- [Nallapati et al., 2016b] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016b. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. CoRR, abs/1611.04230.
- [Neto et al., 2002] Joel Larocca Neto, Alex A. Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Guilherme Bittencourt and Geber L. Ramalho, editors, Advances in Artificial Intelligence, pages 205–215, Berlin, Heidelberg. Springer Berlin Heidelberg.
- [Radford et al., 2019] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
- [See et al., 2017] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- [Sun et al., 2018] Fei Sun, Peng Jiang, Hanxiao Sun, Changhua Pei, Wenwu Ou, and Xiaobo Wang. 2018. Multi-source pointer network for product title summarization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 7–16.
- [Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks.
- [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.
- [Vinyals et al., 2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks.
- [Zhang et al., 2019] Jian-Guo Zhang, Pengcheng Zou, Zhao Li, Yao Wan, Xiuming Pan, Yu Gong, and Philip S Yu. 2019. Multi-modal generative adversarial network for short product title generation in mobile e-commerce. arXiv preprint arXiv:1904.01735.