A Non-monotonic Self-terminating Language Model
Abstract
Recent large-scale neural autoregressive sequence models have shown impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination, when generated with decoding algorithms such as greedy search, beam search, top-k sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define the class of incomplete probable decoding algorithms, which includes greedy search, top-k sampling, and nucleus sampling, going beyond the incomplete decoding algorithms originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the self-terminating language model originally proposed by Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.
1 Introduction
Autoregressive neural sequence models (Bengio et al., 2000) have been widely used for various natural language generation tasks such as language modeling (Brown et al., 2020; Chowdhery et al., 2022), machine translation (Bahdanau et al., 2014), and conversational dialogue modeling (Vinyals & Le, 2015). Furthermore, large-scale autoregressive neural sequence models have shown unprecedented ability to generate fluent, human-like texts (Vaswani et al., 2017; Brown et al., 2020). Despite their success, the autoregressive neural sequence models have shown undesirable behaviors: non-termination (Welleck et al., 2020), degenerate repetition (Welleck et al., 2019; Holtzman et al., 2020), and premature termination (Koehn & Knowles, 2017; Stahlberg & Byrne, 2019). In this paper, we focus on how to prevent non-termination when using a given decoding algorithm.
Non-termination is the problem of a language model generating infinitely long sequences with positive probability under a given decoding algorithm. Welleck et al. (2020) pointed out that this issue comes from a discrepancy between the distribution of our language model and the distribution induced by an incomplete decoding algorithm. They formalized this disparity by the notion of inconsistency, where our language model generates non-terminating sequences with a positive probability under the decoding algorithm. To avoid this inconsistency, they proposed a self-terminating (ST) language model that uses a new parametrization for its classifier rather than the usual softmax parametrization. They proved that the ST language model is consistent with respect to greedy search, beam search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020).
The ST language model increases the termination probability of each sequence monotonically to 1, but this parametrization is not well suited to modeling natural language. As an illustrative example, suppose there are two sequences in our dataset: "I am a boy" vs. "I am a boy, and you are a girl.". Our language model trained on this dataset may or may not terminate after the former. Once our model decides not to end, it should dramatically reduce the termination probability to continue. The ST language model, which monotonically increases the termination probability, cannot capture such a case, where one sequence is a prefix of another. We thus propose a non-monotonic self-terminating (NMST) language model which guarantees consistency with respect to greedy search, beam search, top-k sampling, and nucleus sampling without monotonically increasing the termination probability.
The NMST language model encourages the termination probability of each sequence to converge to 1 through the NMST parametrization, but without requiring monotonicity. Even under this relaxation, the proposed NMST language model provably prevents any non-terminating sequence resulting from greedy search, beam search, top-k sampling, and nucleus sampling, which we refer to as incomplete probable decoding algorithms.
We conduct experiments validating the effectiveness of our NMST language models on sequence completion tasks, as was done in earlier studies. We test NMST parametrization with various architectures. Specifically, we train RNN (Elman, 1990) and LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2 (Merity et al., 2016). We additionally finetune GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). Across all these setups, NMST parametrization effectively prevents non-terminating sequences, especially when compared to softmax parametrization. Furthermore, we see that our NMST parametrization has better (lower) perplexities than those of ST parametrization, confirming the importance of relaxing the monotonic termination probability.
2 Notations and Background
2.1 Notations for Autoregressive Neural Sequence Models
Sequences, vocabulary, and ⟨eos⟩ We view an instance (e.g., a sentence or a paragraph) as a sequence y = (y_1, …, y_T), where each y_t is an element from a pre-defined finite set of discrete tokens, referred to as a vocabulary V. V includes a special symbol ⟨eos⟩ that only appears at the end of a sequence. Every sequence must end with ⟨eos⟩. We write the length of y as |y|, so that y_{|y|} = ⟨eos⟩. We call y a non-terminating sequence, |y| = ∞, if y_t ≠ ⟨eos⟩ for all t.
Embedding vectors Each token v ∈ V is a discrete symbol rather than a numerical vector. To capture the notion of similarity between discrete tokens efficiently, we use an embedding vector u_v to project v into a continuous embedding space (Bengio et al., 2000; Mikolov et al., 2013b;a; Levy & Goldberg, 2014).
Autoregressive neural sequence models Bengio et al. (2000) proposed an autoregressive neural sequence model p_θ parametrized by θ. They factorized p_θ(y | x) into a product of the conditional probabilities of each token given all the previous tokens and an input x, in a predefined order, as follows:
p_θ(y | x) = ∏_{t=1}^{|y|} p_θ(y_t | y_{<t}, x),  (1)
where y_{<t} = (y_1, …, y_{t−1}) is a t-prefix of y and x is an input referred to as a context. For example, x represents either a prompt in sequence completion or a source-side sequence in machine translation.
There are several popular architectures for p_θ, such as RNN (Elman, 1990), LSTM (Hochreiter & Schmidhuber, 1997), GRU (Cho et al., 2014), and Transformer (Vaswani et al., 2017). As shown in equation 2, all these models utilize softmax classifiers. In this paper, we modify the parametrization of their softmax classifiers to prevent non-terminating sequences. We thus write a vanilla language model, regardless of its choice of architecture, as one that uses the original softmax parametrization, as defined in Definition 1.
Definition 1.
A vanilla language model p_θ^{VA} computes the conditional probability of each token v ∈ V given a t-prefix y_{<t} and a context x at each time step t as follows:
p_θ(y_t = v | y_{<t}, x) = exp(u_v^⊤ h_t) / Σ_{v' ∈ V} exp(u_{v'}^⊤ h_t),  (2)
where h_t = f_θ(y_{t−1}, h_{t−1}) is the hidden state with h_0 = g_θ(x), and u_v is the output embedding of token v. (Footnote 1: This definition stands for RNN, LSTM, and GRU. For Transformer, h_t = f_θ(y_{<t}, x).)
Training For a given dataset D = {(x^{(n)}, y^{(n)})}_{n=1}^{N}, we maximize the joint probability assigned to the sequences in our training dataset to find an optimal parameter configuration θ*, as follows:
θ* = arg max_θ Σ_{n=1}^{N} log p_θ(y^{(n)} | x^{(n)}).  (3)
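As a concrete reference, the sketch below shows, in PyTorch, a vanilla language model with the softmax parametrization of Definition 1 and the maximum-likelihood objective of equation 3. It is a minimal illustration, not the authors' implementation: the class name, layer sizes, and helper function are arbitrary choices made here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VanillaLM(nn.Module):
    """Minimal sketch of Definition 1: an LSTM encodes the prefix and a tied softmax
    layer produces p(y_t | y_<t, x). Hyperparameters are illustrative only."""
    def __init__(self, vocab_size: int, hidden_size: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)
        self.out.weight = self.embed.weight  # weight tying, as in the WikiText-2 setup

    def forward(self, tokens):                  # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))     # h: (batch, time, hidden)
        return self.out(h)                      # unnormalized logits u_v^T h_t

def nll_loss(model, tokens, pad_id):
    """Negative log-likelihood corresponding to equation 3 (to be minimized)."""
    logits = model(tokens[:, :-1])              # predict token t from the prefix < t
    log_probs = F.log_softmax(logits, dim=-1)   # softmax parametrization (equation 2)
    return F.nll_loss(log_probs.transpose(1, 2), tokens[:, 1:], ignore_index=pad_id)
```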
2.2 Incomplete Probable Decoding Algorithms
An autoregressive language model p_θ predicts the likelihood of a sequence y given a context x. Its autoregressive factorization in equation 1 requires a recursive computation over every time step t for inference. Hence, at inference time, we use a decoding algorithm, defined below, to generate sequences from p_θ.
Definition 2.
Let Y be the collection of sequences y = (y_1, y_2, …) such that y_t ∈ V for every t, where a finite sequence ends with ⟨eos⟩ and |y| = ∞ otherwise. A decoding algorithm S is a function that maps a language model p_θ to q_S, which is a probability distribution over Y given a context x. A decoded sequence ŷ given x from p_θ is a random sample from q_S(· | x).
To generate a high-quality sequence from p_θ, a decoding algorithm assumes that a higher-quality sequence has a higher probability under p_θ than others. For instance, the maximum a posteriori (MAP) decoding algorithm S_{MAP} returns the most probable sequence given a context x from p_θ:
ŷ_{MAP} = arg max_y p_θ(y | x),  (4)
by setting q_{S_{MAP}}(ŷ_{MAP} | x) = 1 and q_{S_{MAP}}(y | x) = 0 for y ≠ ŷ_{MAP}. Unfortunately, S_{MAP} is intractable since equation 4 requires an exhaustive search over the sequence space. Hence, in practice, we utilize incomplete probable decoding algorithms, defined as follows:
Definition 3.
A decoding algorithm S is incomplete and probable if, at each step t, there exists a subset V'_t of the vocabulary V such that

V'_t ⊊ V,  (5)

and

Σ_{v ∈ V'_t} p_θ(v | ŷ_{<t}, x) ≥ max_{v' ∈ V} p_θ(v' | ŷ_{<t}, x),  (6)

for each t. Furthermore, for every v ∈ V'_t, q_S satisfies

q_S(v | ŷ_{<t}, x) ≥ p_θ(v | ŷ_{<t}, x),  (7)

while q_S(v | ŷ_{<t}, x) = 0 for every v ∉ V'_t.
At each t, an incomplete probable decoding algorithm S considers only a set of highly probable tokens, V'_t. S generates ŷ given x by recursively sampling from q_S(· | ŷ_{<t}, x), which is supported on V'_t. This reduces the exponential complexity of S_{MAP}, O(|V|^T) for sequences of length up to T, down to a linear level, O(T · |V|).
Greedy search, top-k sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020) are incomplete and probable. For example, greedy search generates the t-th item of a sequence by
ŷ_t = arg max_{v ∈ V} p_θ(v | ŷ_{<t}, x).  (8)
In other words, greedy search sets V'_t to {v_t*} where v_t* = arg max_{v ∈ V} p_θ(v | ŷ_{<t}, x). Moreover, we have Σ_{v ∈ V'_t} p_θ(v | ŷ_{<t}, x) = max_{v' ∈ V} p_θ(v' | ŷ_{<t}, x), and q_{greedy}(v_t* | ŷ_{<t}, x) = 1 ≥ p_θ(v_t* | ŷ_{<t}, x) holds. Thus, greedy search is incomplete and probable. Unlike greedy search, top-k sampling considers the k most probable tokens in V as V'_t, while nucleus sampling sets V'_t to the smallest subset of V, containing the most probable tokens, whose total probability is higher than a given threshold μ. In §A.1 and §A.2, we show that top-k sampling and nucleus sampling are also incomplete and probable.
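The following sketch illustrates how the candidate set V'_t and the induced step distribution q_S can be constructed for greedy search, top-k sampling, and nucleus sampling. The function names and the tensor-based formulation are ours, not the authors' code, and serve only to make the "incomplete probable" property concrete.

```python
import torch

def candidate_set(probs: torch.Tensor, mode: str = "greedy", k: int = 4, mu: float = 0.4):
    """Return the index set V'_t used by an incomplete probable decoding algorithm.

    probs: 1-D tensor p_theta(. | y_<t, x) over the vocabulary (sums to 1).
    Greedy keeps only the argmax, top-k keeps the k most probable tokens, and
    nucleus keeps the smallest set whose cumulative probability reaches mu.
    """
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    if mode == "greedy":
        return sorted_idx[:1]
    if mode == "top-k":
        return sorted_idx[:k]
    if mode == "nucleus":
        cutoff = int((torch.cumsum(sorted_p, dim=0) < mu).sum().item()) + 1
        return sorted_idx[:cutoff]
    raise ValueError(mode)

def step_distribution(probs: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Renormalized distribution q_S(. | y_<t, x) supported on V'_t: for every kept
    token, q_S(v) = p(v) / sum over V'_t >= p(v), matching equation 7."""
    q = torch.zeros_like(probs)
    q[keep] = probs[keep] / probs[keep].sum()
    return q
```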
Beam search is a heuristic algorithm that operates on the level of prefixes. We describe it further in §A.3. Although beam search is not an incomplete probable decoding algorithm, it also selects a proper subset of V to expand each prefix at each step t. Due to this, our main theoretical finding for the incomplete probable decoding algorithms in §3 is applicable to beam search as well.
2.3 Consistency with respect to Incomplete Probable Decoding Algorithms and Self-terminating (ST) Language Models
Incomplete probable decoding algorithms greatly reduce computational overhead for generating sequences from our model. However, Welleck et al. (2020) observed that they can generate non-terminating sequences even if every training sequence has a finite length. To study this, Welleck et al. (2020) defined consistency with respect to decoding algorithms as shown in Definition 4.
Definition 4.
A language model p_θ is consistent with respect to a decoding algorithm S if q_S(|y| = ∞) = 0 for any parameter configuration θ.
Welleck et al. (2020) also proved that a vanilla language model defined in Definition 1 is inconsistent with respect to incomplete probable decoding algorithms and beam search as follows:
Theorem 1.
(Welleck et al., 2020) A vanilla language model defined in Definition 1 is inconsistent with respect to any incomplete probable decoding algorithm and beam search; that is, there exists a parameter configuration θ for which decoded sequences are non-terminating with positive probability.

For each t, an incomplete probable decoding algorithm S selects V'_t as a set of candidate tokens for decoding, but this does not guarantee that ⟨eos⟩ ∈ V'_t. Specifically, if ⟨eos⟩ ∉ V'_t for all t, then S cannot decode any ŷ_t to ⟨eos⟩ for any t (i.e., the decoded sequence is non-terminating). Based on this result, Welleck et al. (2020) proposed a self-terminating (ST) language model, defined below:
Definition 5.
For the vanilla language model p_θ^{VA} defined in Definition 1, the conditional probability of ⟨eos⟩ given a t-prefix y_{<t} and a context x at each time step t in an ST language model p_θ^{ST} is given by

p_θ^{ST}(⟨eos⟩ | y_{<t}, x) = 1 − (1 − p_θ^{ST}(⟨eos⟩ | y_{<t−1}, x)) (1 − ε) σ(−a_t^{⟨eos⟩}),  (9)

with p_θ^{ST}(⟨eos⟩ | y_{<1}, x) = 1 − (1 − ε) σ(−a_1^{⟨eos⟩}),

and the remaining probability mass 1 − p_θ^{ST}(⟨eos⟩ | y_{<t}, x) is distributed over the non-⟨eos⟩ tokens in proportion to their softmax probabilities from equation 2,

where ε ∈ (0, 1) is a hyperparameter, a_t^{⟨eos⟩} = u_{⟨eos⟩}^⊤ h_t is the ⟨eos⟩ logit from Definition 1, and σ is a sigmoid function.
They proved that the ST language model is consistent with respect to any incomplete probable decoding algorithm and beam search, as follows:
Theorem 2.
(Welleck et al., 2020) The ST language model defined in Definition 5 is consistent with respect to any incomplete probable decoding algorithm and beam search.

In equation 9, p_θ^{ST}(⟨eos⟩ | y_{<t}, x) monotonically increases to 1 as t increases. Consequently, V'_t ends up including ⟨eos⟩ for all t > t_c with some t_c, and q_S(⟨eos⟩ | y_{<t}, x) ≥ p_θ^{ST}(⟨eos⟩ | y_{<t}, x) by equation 7. This guarantees that every decoded sequence terminates in a finite number of steps. Despite the consistency of p_θ^{ST}, its validation perplexity degrades compared to p_θ^{VA} in sequence completion (Welleck et al., 2020). We suspect that such degradation comes from the core property of p_θ^{ST}: its termination probability monotonically increases to 1 as t increases. In Remark 1 below, we provide an example where the optimal termination probability is not monotonic.
Remark 1.
Let D = {y^{(1)}, y^{(2)}} be a two-instance training dataset. Assume that there exists T such that y^{(1)}_t = y^{(2)}_t for all t < T, i.e., the two sequences share their first T − 1 tokens. Suppose further that |y^{(1)}| = T and |y^{(2)}| > T. If θ* is an optimal parameter configuration of equation 3 over D, then p_{θ*}(⟨eos⟩ | y^{(2)}_{<t}) is non-monotonic with respect to t. (See §B.)
3 Non-monotonic Self-terminating (NMST) Language Models
The consistency of p_θ^{ST} comes from the fact that p_θ^{ST}(⟨eos⟩ | y_{<t}, x) converges to 1, not from its monotonic increase as a function of t. This motivates us to propose a non-monotonic self-terminating (NMST) language model that permits p_θ(⟨eos⟩ | y_{<t}, x) to be a non-monotonic function of t while still satisfying lim_{t→∞} p_θ(⟨eos⟩ | y_{<t}, x) = 1, as follows:
Definition 6.
For the vanilla language model p_θ^{VA} defined in Definition 1, the conditional probability of ⟨eos⟩ given a t-prefix y_{<t} and a context x at the t-th step in an NMST language model p_θ^{NMST} is defined by

p_θ^{NMST}(⟨eos⟩ | y_{<t}, x) = σ(a_t^{⟨eos⟩}) + (1 − σ(a_t^{⟨eos⟩})) (1 − (1 − ε)^{t+1}),  (10)

and the remaining probability mass 1 − p_θ^{NMST}(⟨eos⟩ | y_{<t}, x) is distributed over the non-⟨eos⟩ tokens in proportion to their softmax probabilities from equation 2,

where ε ∈ (0, 1) is a hyperparameter, a_t^{⟨eos⟩} = u_{⟨eos⟩}^⊤ h_t is the ⟨eos⟩ logit from Definition 1, and σ is a sigmoid function.
Figure 1 shows that p_θ^{NMST} uses a convex combination of two curves to model p_θ^{NMST}(⟨eos⟩ | y_{<t}, x). We can write any curve lying between a lower-bound curve ℓ_t = 1 − (1 − ε)^{t+1} and the constant upper-bound curve 1 as α_t · 1 + (1 − α_t) ℓ_t, with an appropriate α_t ∈ [0, 1] for all t. p_θ^{NMST} sets α_t to σ(a_t^{⟨eos⟩}), and then regards p_θ^{NMST}(⟨eos⟩ | y_{<t}, x) as a convex combination of ℓ_t and 1 with the coefficient α_t. This enables a non-monotonic termination probability. Moreover, in Theorem 3 below, we show that the proposed NMST parametrization in equation 10 still guarantees consistency with respect to any incomplete probable decoding algorithm and beam search.
Theorem 3.
A non-monotonic self-terminating (NMST) language model defined in Definition 6 is consistent with respect to any incomplete probable decoding algorithm and beam search.
Theorem 3 guarantees that every sequence decoded from p_θ^{NMST} terminates when using incomplete probable decoding algorithms or beam search. Hence, neither p_θ^{ST} nor p_θ^{NMST} produces non-terminating sequences under incomplete probable decoding algorithms or beam search. Unlike the ST parametrization, however, our NMST parametrization in equation 10 can capture a wider range of termination behaviors, since it does not assume that p_θ(⟨eos⟩ | y_{<t}, x) is a monotonic function of t. We empirically demonstrate this by comparing p_θ^{VA}, p_θ^{ST}, and p_θ^{NMST} in Figure 3.
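To make the contrast between the two parametrizations concrete, the sketch below computes the termination probabilities following the forms given in equations 9 and 10 above. Since those expressions are reconstructions, the exact algebra here should be treated as an illustration of monotone versus non-monotone behavior rather than a verbatim reimplementation; the function names and default ε are ours.

```python
import torch

def nmst_eos_probability(logit_eos: torch.Tensor, t: int, eps: float = 1e-4) -> torch.Tensor:
    """Illustrative NMST termination probability: a convex combination of the
    lower-bound curve l_t = 1 - (1 - eps)^(t+1) and the constant 1, with a
    data-dependent (hence possibly non-monotonic) coefficient sigma(a_t)."""
    alpha = torch.sigmoid(logit_eos)          # non-monotonic, data-dependent coefficient
    lower = 1.0 - (1.0 - eps) ** (t + 1)      # deterministic lower bound tending to 1
    return alpha + (1.0 - alpha) * lower      # always in [lower, 1], hence -> 1 as t grows

def st_eos_probability(logit_eos: torch.Tensor, prev_eos_prob: torch.Tensor,
                       eps: float = 1e-4) -> torch.Tensor:
    """Illustrative ST termination probability (monotone): the remaining survival
    mass shrinks by a factor smaller than (1 - eps) at every step, so the
    termination probability can only increase toward 1."""
    survival = (1.0 - prev_eos_prob) * (1.0 - eps) * torch.sigmoid(-logit_eos)
    return 1.0 - survival
```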
4 Experiments
We empirically validate the effectiveness of the proposed non-monotonic self-terminating (NMST) language model by evaluating it on sequence completion tasks. We test three variants of a given architecture: (i) a vanilla (VA+) language model using the common softmax parametrization in equation 2, (ii) a self-terminating (ST+) language model using the ST parametrization proposed by Welleck et al. (2020), and (iii) our non-monotonic self-terminating (NMST+) language model using the NMST parametrization in equation 10. We use the following evaluation metrics for comparison:
• Perplexity: Given an autoregressive language model p_θ, the perplexity of p_θ over a dataset D = {(x^{(n)}, y^{(n)})}_{n=1}^{N} is exp(−(1/T_D) Σ_{n=1}^{N} log p_θ(y^{(n)} | x^{(n)})), where T_D = Σ_{n=1}^{N} |y^{(n)}| is the total number of tokens (including ⟨eos⟩) in D.

• Non-termination ratio (r_L): To establish the consistency of p_θ with respect to a given decoding algorithm S, we would need to compute q_S(|y| = ∞), which is intractable. Instead, based on

q_S(|y| = ∞ | x) = lim_{L→∞} q_S(|y| > L | x),  (11)

we use r_L(S), the fraction of decoded sequences longer than a sufficiently large threshold L, to estimate q_S(|y| = ∞). (A computational sketch of both metrics is given after this list.)
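Below is a minimal sketch of how the two metrics can be computed, assuming decoded continuations are capped at L steps; the helper names are ours.

```python
import math
import torch

def perplexity(total_log_prob: float, total_tokens: int) -> float:
    """Corpus perplexity: exponential of the average negative log-likelihood per
    token (including <eos>), matching the definition above."""
    return math.exp(-total_log_prob / total_tokens)

def non_termination_ratio(generated_lengths, limit: int = 1000) -> float:
    """Estimate of r_L from equation 11: the fraction of decoded continuations that
    have not produced <eos> within L = `limit` steps (generation is assumed to be
    capped at `limit`); a consistent model should drive this toward zero as L grows."""
    lengths = torch.as_tensor(generated_lengths, dtype=torch.float)
    return (lengths >= limit).float().mean().item()
```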
Sequence completion is the task of predicting a continuation ŷ given a length-k context x by using a decoding algorithm S with a language model p_θ (i.e., ŷ ∼ q_S(· | x)). In this section, we use greedy search, defined in equation 8, to generate ŷ given x. Our main theoretical finding in Theorem 3 is that the proposed NMST language model is consistent with respect to not only greedy search but also top-k sampling, nucleus sampling, and beam search. We thus present results when using decoding algorithms other than greedy search at the end, in §5 and §F.
4.1 WikiText-2
WikiText-2 (Merity et al., 2016) consists of 2 million words from 600 Wikipedia articles. With word tokenization, we regard the first 10 tokens of each sequence as a context x and its remaining part as the ground truth y. We train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both the RNN and the LSTM have 2 layers, with 256 and 512 hidden units at each layer, respectively. We perform 10 random runs with a batch size of 32 for 70 epochs. We use AdamW (Loshchilov & Hutter, 2017) with weight decay, learning rate decay, and early stopping. We further describe our models and training strategies for the WikiText-2 experiments in §D. Unlike VA+{RNN, LSTM}, ST+{RNN, LSTM} and NMST+{RNN, LSTM} need an additional hyperparameter ε, which we explore over a range of values.
We present the average (st.dev.) non-termination ratios, r_L(greedy)'s, across 10 random runs as a function of L for all considered WikiText-2 setups in Figure 2, using greedy search. From equation 11, a language model is consistent with respect to greedy search if r_L(greedy) → 0 as L → ∞. As L increases, we observe that the r_L's of VA+{RNN, LSTM} fail to converge toward 0, while the r_L's of ST+{RNN, LSTM} and NMST+{RNN, LSTM} all reach 0. In other words, RNN and LSTM become consistent with respect to greedy search after replacing the original softmax parametrization with either the proposed NMST parametrization or the ST parametrization.
Figure 2: Average (st.dev.) non-termination ratios r_L(greedy) of {VA, ST, NMST}+{RNN, LSTM} trained on WikiText-2, as a function of the threshold L.
Table 1: Average (st.dev.) validation perplexities across 10 random runs of the ST+ and NMST+ variants of RNN and LSTM trained on WikiText-2 for each ε, with VA+ as the baseline.
Table 1 shows the average (st.dev.) validation perplexities across 10 random experiments for all variants of RNN and LSTM trained on WikiText-2. We observe that NMST+{RNN, LSTM} have better validation perplexities than ST+{RNN, LSTM} for every ε. We demonstrate this more clearly in §E.1 by plotting the evolution of the mean validation perplexities as we vary ε. Although our NMST+ guarantees the consistency of RNN and LSTM with respect to greedy search with a better validation perplexity than ST+, we need to select ε of NMST+ carefully. As ε increases, the lower bound of p_θ^{NMST}(⟨eos⟩ | y_{<t}, x) grows faster, yielding prematurely terminated sequences when ε is too large. Indeed, with a large ε, the average validation perplexities of NMST+RNN and NMST+LSTM are 184.2 and 105.6, which degrade by 5.6 and 4.0 from those of VA+RNN and VA+LSTM, 178.6 and 101.6, respectively. We emphasize, however, that there is an optimal ε that makes the validation perplexities of NMST+{RNN, LSTM} similar to those of VA+{RNN, LSTM}. In short, both NMST+ and ST+ prevent non-termination when using greedy search, but only NMST+ has a competitive validation perplexity against VA+. In §G, we further observe that the length distribution of predicted sequences from NMST+LSTM is closer to the length distribution of ground-truth sequences than those of predicted sequences from {VA, ST}+LSTM.
4.2 WikiText-103
WikiText-103 (Merity et al., 2016) consists of 103 million words constructed from 28,000 articles. We use BPE tokenization (Sennrich et al., 2015) and consider the first 10 tokens of each sequence as a context. Since WikiText-103 is substantially larger than WikiText-2, we finetune a pretrained GPT-2, a transformer language model with 124 million parameters (Radford et al., 2019), for 300,000 steps. For computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. We use AdamW (Loshchilov & Hutter, 2017) with weight decay of 0.01, linear learning rate decay, and early stopping. We present a more detailed description in §D. We select ε from a range of values for ST+GPT-2 and NMST+GPT-2.
Table 2: Mean (st.dev.) validation perplexities and non-termination ratios r_{1000}(greedy) across 10 random runs of {VA, ST, NMST}+GPT-2 finetuned on WikiText-103 for each ε.
We report the mean (st.dev.) validation perplexities and non-termination ratios, r_{1000}(greedy)'s, resulting from greedy search across 10 random runs for all GPT-2 setups finetuned on WikiText-103 in Table 2. Since GPT-2 can handle up to 1,024 tokens, we use L = 1,000. As shown in Figure 2, we need a sufficiently large L to determine whether a language model is consistent with respect to greedy search. Although 1,000 is not sufficiently large, we observe that r_{1000} of NMST+GPT-2 decreases compared to that of VA+GPT-2 as ε increases. That is, NMST+ reduces the number of non-terminating continuations within 1,000 steps. Fewer non-terminating sequences do not necessarily imply better quality, however. We thus show sample continuations from NMST+GPT-2 in Table 3, given a context that leads to non-termination with VA+GPT-2, using greedy search. We observe that the quality of the generated sequence tends to improve with NMST+ by avoiding repetitions of similar phrases and ending with ⟨eos⟩. We present more example continuations in §E.3.
Similar to the results in §4.1, Table 2 shows that the validation perplexities of both ST+GPT-2, proposed by Welleck et al. (2020), and our NMST+GPT-2 degrade compared to VA+GPT-2 as ε increases. NMST+GPT-2 with the optimal ε has a competitive validation perplexity of 20.69 against that of VA+GPT-2, 20.72. In contrast, we cannot find an ε that makes the validation perplexity of ST+GPT-2 competitive with that of VA+GPT-2. Moreover, for large ε, the validation perplexities of ST+GPT-2 blow up, unlike those of VA+GPT-2. §E.2 demonstrates the inevitable perplexity degradation and blow-up of ST+GPT-2. We suspect that it is due to its termination probability monotonically increasing with respect to t.
Context | Made of concrete, steel, and wood, the |
---|---|
VA+ | building was built in the mid @-@ 19th century. It was the first building in the United States to be built in concrete, and the first to be built in wood. It was also the first building in the United States to be built in steel. It was the first building in … |
ST+ | building is constructed of steel and concrete. The building’s exterior is made of steel and concrete. The building’s interior is made of wood, and the building’s exterior is made of concrete. The building’s exterior is made of concrete, and the building’s … |
NMST+ | building was designed by the architectural firm of Bowers & Wainwright, and was completed in 1892. The building is the largest of its kind in the United States. |
We investigate the behavior of p_θ(⟨eos⟩ | y_{<t}, x), where p_θ is one of {VA, ST, NMST}+GPT-2, in Figure 3. Based on Table 2, we select the ε that is optimal in terms of validation perplexity for {ST, NMST}+GPT-2. In Figure 3, {VA, NMST}+GPT-2 capture well whether a sequence might end (e.g., after periods) by showing non-monotonic behavior at those seemingly-terminating steps, but ST+GPT-2 cannot model such non-monotonic behavior because it assumes that p_θ(⟨eos⟩ | y_{<t}, x) is a monotonic function of t. This constraint makes ST+GPT-2 generate often finite but unnecessarily long sequences with greedy search (i.e., its termination probability is lower than that of VA+GPT-2 for small t, but approaches 1 for sufficiently large t). We show more such behaviors in §E.4.
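A Figure-3-style analysis can be approximated with the sketch below, which tracks p(⟨eos⟩ | ŷ_{<t}, x) along a greedy decode of an off-the-shelf HuggingFace GPT-2; that pretrained checkpoint is only a stand-in for the finetuned VA/ST/NMST variants, which are not reimplemented here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def eos_trajectory(context: str, steps: int = 50):
    """Record the model's <eos> probability at every step of a greedy decode."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    eos_probs = []
    with torch.no_grad():
        for _ in range(steps):
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
            eos_probs.append(probs[tokenizer.eos_token_id].item())
            ids = torch.cat([ids, probs.argmax().view(1, 1)], dim=1)  # greedy search
    return eos_probs  # non-monotonic bumps are expected, e.g. right after a period
```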
5 Consistency with respect to Other Decoding Algorithms
We explore the effectiveness of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search, such as top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. According to Theorem 3, the NMST language model is consistent with respect to any incomplete probable decoding algorithm (e.g., greedy search, top-k sampling, and nucleus sampling) and beam search for any ε ∈ (0, 1). To validate this, we use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+GPT-2 finetuned on WikiText-103, with ε chosen based on the validation perplexities in Table 2. Since the validation perplexity does not depend on the decoding algorithm, we focus on the average (st.dev.) non-termination ratios, r_L's, across 10 random runs with L = 1,000 for each decoding algorithm in Table 4. We also present the r_L's of VA+GPT-2 and ST+GPT-2 as baselines.
Table 4: Average (st.dev.) non-termination ratios r_{1000} of {VA, ST, NMST}+GPT-2 finetuned on WikiText-103 when using top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with widths {2, 4}.
Table 4 shows that our NMST+GPT-2 has the lowest r_{1000} for all decoding algorithms, compared to VA+GPT-2 and ST+GPT-2 proposed by Welleck et al. (2020). In other words, NMST+ effectively prevents non-terminating sequences within 1,000 time steps regardless of the decoding algorithm. Compared with greedy search in Table 2, we observe that the r_{1000}'s decrease for all setups. As we discussed in §2.3, non-terminating sequences originate from ⟨eos⟩ ∉ V'_t for all t, where V is the vocabulary and V'_t is the proper subset of V considered by a decoding algorithm at the t-th step. Decoding algorithms other than greedy search are more likely to include ⟨eos⟩ in V'_t and hence have lower r_{1000}, since their V'_t are supersets of (or equal to) the V'_t of greedy search for all t. In the case of top-{2, 4} sampling, we obtain near-zero r_{1000} even for VA+GPT-2. Even without NMST+, VA+ can thus avoid non-terminating sequences if we choose a proper decoding algorithm. We emphasize, however, that NMST+GPT-2 has a competitive validation perplexity against VA+GPT-2 in Table 2 and that it is guaranteed to terminate regardless of the choice of decoding algorithm. We also empirically demonstrate the consistency of NMST+{RNN, LSTM} trained on WikiText-2 with respect to other decoding algorithms in §F.
6 Conclusion
Non-termination is a degenerate behavior we often observe when generating text from a well-trained language model. To prevent this, Welleck et al. (2020) proposed a self-terminating language model that encourages the termination probability of each sequence, which is the conditional probability of ⟨eos⟩ given a t-prefix and a context, to increase monotonically toward 1 as t increases. In this paper, we theoretically demonstrate that a monotonically increasing termination probability is not a necessary condition for avoiding non-terminating sequences. We then propose a non-monotonic self-terminating language model in which the termination probability of each sequence converges to 1, but not monotonically. Our non-monotonic self-terminating language models successfully address the issue of non-termination and achieve perplexities that are comparable to those of vanilla language models and better than those of the original self-terminating language models.
Reproducibility Statement
To ensure the reproducibility of our paper, we provide our code available at https://github.com/nyu-dl/non-monotonic-self-terminating-lm.
Acknowledgments
This work was supported by 42dot, Hyundai Motor Company (under the project Uncertainty in Neural Sequence Modeling), Samsung Advanced Institute of Technology (under the project Next Generation Deep Learning: From Pattern Recognition to AI), and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
References
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. In J. Mach. Learn. Res., 2000.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
- Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Holtzman et al. (2020) Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2020.
- Koehn & Knowles (2017) Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017.
- Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27, 2014.
- Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013b.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
- Stahlberg & Byrne (2019) Felix Stahlberg and Bill Byrne. On nmt search errors and model errors: Cat got your tongue? arXiv preprint arXiv:1908.10090, 2019.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Vinyals & Le (2015) Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
- Welleck et al. (2019) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
- Welleck et al. (2020) Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, and Kyunghyun Cho. Consistency of a recurrent language model with respect to incomplete decoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5553–5568, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.448. URL https://aclanthology.org/2020.emnlp-main.448.
Appendix
Appendix A Definitions of Common Decoding Algorithms and their Characteristics
In this section, we present the mathematical definitions of top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), greedy search, and beam search. We then show whether each of them is an incomplete probable decoding algorithm.
A.1 Top-k sampling
At each step t, top-k sampling selects V_t^k, the subset of the k most probable tokens in the vocabulary V. Top-k sampling generates decoded sequences from a language model p_θ as follows:
Definition A.1 (Top-k sampling (Fan et al., 2018)).
Top-k sampling generates a sequence ŷ from a language model p_θ given a context x by recursively sampling from

q_{top-k}(v | ŷ_{<t}, x) = p_θ(v | ŷ_{<t}, x) / Σ_{v' ∈ V_t^k} p_θ(v' | ŷ_{<t}, x) if v ∈ V_t^k, and 0 otherwise,  (12)

where

V_t^k = arg max_{V' ⊆ V, |V'| = k} Σ_{v ∈ V'} p_θ(v | ŷ_{<t}, x).  (13)

Since V_t^k is a proper subset of V whenever k < |V|, contains the k most probable tokens (hence the most probable one), and satisfies q_{top-k}(v | ŷ_{<t}, x) ≥ p_θ(v | ŷ_{<t}, x) for every v ∈ V_t^k, top-k sampling is incomplete and probable.
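A minimal generation loop implementing this definition might look as follows; the model interface (a callable returning next-token logits for a prefix), the ⟨eos⟩ id, and the step limit are assumptions made for illustration, not part of the definition.

```python
import torch

def topk_sample(model, context_ids: torch.Tensor, k: int = 4,
                eos_id: int = 0, max_len: int = 1000) -> torch.Tensor:
    """Sketch of Definition A.1: keep the k most probable tokens (equation 13),
    renormalize over that set (equation 12), and sample until <eos> or a step limit.
    `model` is assumed to map a (1, t) prefix to logits of shape (1, t, |V|)."""
    ids = context_ids
    with torch.no_grad():
        for _ in range(max_len):
            probs = torch.softmax(model(ids)[0, -1], dim=-1)
            top_p, top_idx = torch.topk(probs, k)          # V_t^k
            next_id = top_idx[torch.multinomial(top_p / top_p.sum(), 1)]
            ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
            if next_id.item() == eos_id:
                break
    return ids
```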
A.2 Nucleus sampling
At each step t, nucleus sampling selects V_t^μ, the smallest subset of the most probable tokens in the vocabulary V whose total probability is higher than a given threshold μ. Nucleus sampling generates decoded sequences from a language model p_θ as follows:
Definition A.2 (Nucleus sampling (Holtzman et al., 2020)).
Nucleus sampling generates a sequence ŷ from a language model p_θ given a context x by recursively sampling from

q_{nuc-μ}(v | ŷ_{<t}, x) = p_θ(v | ŷ_{<t}, x) / Σ_{v' ∈ V_t^μ} p_θ(v' | ŷ_{<t}, x) if v ∈ V_t^μ, and 0 otherwise,  (14)

where V_t^μ is the smallest subset of the most probable tokens of V such that

Σ_{v ∈ V_t^μ} p_θ(v | ŷ_{<t}, x) ≥ μ.  (15)
If min_{v ∈ V} p_θ(v | ŷ_{<t}, x) ≤ 1 − μ for any context x and any t-prefix ŷ_{<t}, then we have V_t^μ ⊊ V for all t, so equation 5 holds. Suppose that equation 6 does not hold for nucleus sampling. Then this contradicts the fact that V_t^μ is the smallest subset of V satisfying equation 15, since V_t^μ is constructed from the most probable tokens. From equation 14, we see that nucleus sampling also satisfies equation 7. Therefore, nucleus sampling is incomplete and probable.
A.3 Beam search
Beam search is a heuristic algorithm that operates on the level of prefixes. We use the definition of beam search in Welleck et al. (2020).
Definition A.3 (Beam search, Definition A.2 in Welleck et al. (2020)).
Beam search with a width (beam size) w generates a sequence from a language model p_θ by maintaining a set of prefixes, P_{t−1} = {ρ_{t−1}^1, …, ρ_{t−1}^w}, at each time step t, where ρ_0^i is an empty prefix for all i. At each step t, beam search forms a set of candidate prefixes,

P̃_t = ⋃_{i=1}^{w} { ρ_{t−1}^i ∘ v : v ∈ V_t^i },  (16)

where ∘ is concatenation and

V_t^i = arg max_{V' ⊆ V, |V'| = w} Σ_{v ∈ V'} p_θ(v | ρ_{t−1}^i, x).  (17)

After forming P̃_t, beam search selects a set of the w highest-scoring prefixes in P̃_t,

P_t = arg max_{P ⊆ P̃_t, |P| = w} Σ_{ρ ∈ P} s(ρ),  (18)

where s(ρ) = log p_θ(ρ | x). If ρ ∈ P_t ends with ⟨eos⟩, then it does not expand further and is added to the final set F. Beam search continues until F contains w sequences ending with ⟨eos⟩. After that, it returns the highest-scoring sequence

ŷ = arg max_{ρ ∈ F} s(ρ).  (19)
Unlike greedy search, top-k sampling, and nucleus sampling, beam search recursively expands sequences with at most w different prefixes. Therefore, we cannot formalize beam search at the token level by using a single V'_t. However, in equation 17, the number of possible next tokens for each prefix at step t is at most w. It means that V_t^i may exclude ⟨eos⟩ at time t if w < |V|. By using this, Welleck et al. (2020) proved that a vanilla language model is inconsistent with respect to beam search, as shown in Theorem 1.
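For concreteness, a simplified sketch of this procedure is given below. It omits practical refinements such as length normalization, handles finished hypotheses in a slightly streamlined way, and assumes the same model interface as the earlier sketches, so it should be read as an illustration rather than the exact definition above.

```python
import torch

def beam_search(model, context_ids, width: int = 4, eos_id: int = 0, max_len: int = 1000):
    """Simplified beam search: keep `width` prefixes, expand each with its `width`
    most probable next tokens, retain the highest-scoring candidates, and move
    hypotheses ending in <eos> to a final set. `model` maps a (1, t) prefix to logits."""
    beams = [(0.0, context_ids)]               # (log-probability score, prefix)
    finished = []
    with torch.no_grad():
        for _ in range(max_len):
            candidates = []
            for score, ids in beams:
                log_p = torch.log_softmax(model(ids)[0, -1], dim=-1)
                top_lp, top_idx = torch.topk(log_p, width)
                for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                    new_ids = torch.cat([ids, torch.tensor([[idx]])], dim=1)
                    candidates.append((score + lp, new_ids))
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = []
            for score, ids in candidates[:width]:
                (finished if ids[0, -1].item() == eos_id else beams).append((score, ids))
            if len(finished) >= width or not beams:
                break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])[1]    # highest-scoring hypothesis
```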
Appendix B Proofs for §2.3
Remark 1.
Let D = {y^{(1)}, y^{(2)}} be a two-instance training dataset. Assume that there exists T such that y^{(1)}_t = y^{(2)}_t for all t < T. Suppose further that |y^{(1)}| = T and |y^{(2)}| > T. If θ* is an optimal parameter configuration of equation 3 over D, then p_{θ*}(⟨eos⟩ | y^{(2)}_{<t}) is non-monotonic with respect to t.
Proof.
Appendix C Proofs for §3
Theorem 3.
A non-monotonic self-terminating (NMST) language model defined in Definition 6 is consistent with respect to any incomplete probable decoding algorithm and beam search.
Proof.
From equation 10, for any θ, we have

p_θ^{NMST}(⟨eos⟩ | ŷ_{<t}, x) ≥ 1 − (1 − ε)^{t+1} → 1 as t → ∞,

since σ(a_t^{⟨eos⟩}) ∈ (0, 1) for any a_t^{⟨eos⟩} and (1 − ε)^{t+1} → 0 for ε ∈ (0, 1). Hence, there exists t_ε such that

p_θ^{NMST}(⟨eos⟩ | ŷ_{<t}, x) > 1/2 for all t > t_ε.  (23)

Let S be any incomplete probable decoding algorithm. From equation 6 and equation 7, ⟨eos⟩ ∈ V'_t and q_S(⟨eos⟩ | ŷ_{<t}, x) ≥ p_θ^{NMST}(⟨eos⟩ | ŷ_{<t}, x) > 1/2 hold for any t > t_ε. Therefore, we obtain

q_S(|ŷ| = ∞ | x) ≤ lim_{L→∞} ∏_{t = t_ε + 1}^{L} (1 − q_S(⟨eos⟩ | ŷ_{<t}, x)) ≤ lim_{L→∞} (1/2)^{L − t_ε} = 0.  (24)

Taking the expectation of equation 24 over contexts x, we finally have q_S(|ŷ| = ∞) = 0 for any θ. In other words, p_θ^{NMST} is consistent with respect to any incomplete probable decoding algorithm.
In the case of beam search defined in §A.3, without loss of generality, there exists a prefix ρ ∈ P_{t_ε} that does not end with ⟨eos⟩. (Footnote 3: If there is no such ρ, all sequences in P_{t_ε} end with ⟨eos⟩. It means that beam search returns a finite sequence, so that p_θ^{NMST} is consistent with respect to beam search.) Let ρ̄ be the highest-scoring sequence continued from ρ by beam search. From equation 23, we have

p_θ^{NMST}(⟨eos⟩ | ρ, x) > 1/2 > p_θ^{NMST}(v | ρ, x)

for all v ≠ ⟨eos⟩. Hence, the candidate set in equation 17 includes ⟨eos⟩. Let ρ' be any continuation appended after a token v ≠ ⟨eos⟩. Then, we have

s(ρ ∘ ⟨eos⟩) = s(ρ) + log p_θ^{NMST}(⟨eos⟩ | ρ, x) > s(ρ) + log p_θ^{NMST}(v | ρ, x) ≥ s(ρ ∘ v ∘ ρ'),  (25)

where ∘ is concatenation. Therefore, s(ρ ∘ ⟨eos⟩) > s(ρ ∘ v ∘ ρ') holds for any v ≠ ⟨eos⟩ and any ρ'. That is, ρ ∘ ⟨eos⟩ is the highest-scoring sequence starting with ρ, and we have ρ̄ = ρ ∘ ⟨eos⟩.
For each t > t_ε, every prefix in P_t starts with some prefix kept at step t_ε. By the same argument, we add at least one sequence ending with ⟨eos⟩ to the final set F at each step after t_ε. It means that F has w sequences ending with ⟨eos⟩ within w steps after t_ε. Note that the final set F satisfies

F ⊂ { ρ : |ρ| ≤ t_ε + w + 1 }.  (26)

Equation 26 implies that every sequence in F has a length of at most t_ε + w + 1. We thus obtain

q_{beam-w}(|ŷ| = ∞ | x) = 0.  (27)

Taking the expectation of equation 27 over contexts x, we see that q_{beam-w}(|ŷ| = ∞) = 0. That is, p_θ^{NMST} is consistent with respect to beam search. ∎
Appendix D Experimental Details
In this section, we describe our models and optimization processes used in §4.
RNN and LSTM on WikiText-2
We use word tokenization for WikiText-2. We train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both the RNN and the LSTM have 2 layers. Each layer has 256 hidden units for the RNN and 512 hidden units for the LSTM. The input and output embedding sizes match the corresponding hidden sizes. We use weight tying to share the weights between the input and output embedding layers for both models. We apply dropout (Srivastava et al., 2014) to both the RNN and the LSTM. For each model, we perform 10 random runs with a batch size of 32 for 70 epochs. To maximize the log-likelihood presented in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with weight decay and a learning rate decay schedule which halves the learning rate if the validation perplexity does not improve over a training epoch. To avoid overfitting, we additionally use early stopping, which terminates training if the validation perplexity does not improve upon the best score attained so far for several consecutive epochs. In most cases, training ends within 50 epochs.
GPT-2 on WikiText-103
We use BPE tokenization (Sennrich et al., 2015) and the pretrained GPT-2 with 124 million parameters (Radford et al., 2019), both provided by HuggingFace (https://github.com/huggingface/tokenizers and https://github.com/huggingface/transformers). GPT-2 can handle up to 1,024 tokens. We apply dropout (Srivastava et al., 2014) to GPT-2. We finetune GPT-2 for 300,000 steps while ensuring that all runs continue for at least 250,000 steps.
To minimize the number of padding tokens in every batch for computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens.
To maximize the log-likelihood function in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with weight decay of 0.01 and linear learning rate decay over the finetuning steps.
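A hedged sketch of this optimizer setup with PyTorch and HuggingFace follows. The learning rate and warmup values are placeholders chosen here, not the paper's settings; only the weight decay of 0.01 and the assumption that the linear decay spans the full 300,000 finetuning steps come from the text above.

```python
import torch
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

model = GPT2LMHeadModel.from_pretrained("gpt2")

# Placeholder learning rate; the paper's exact value is not reproduced here.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=300_000
)

# Inside the training loop, each step maximizes the log-likelihood of equation 3:
#   loss = -log p_theta(y | x); loss.backward(); optimizer.step(); scheduler.step()
```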
Appendix E Additional Plots and Tables for §4
In this section, we demonstrate additional plots and tables for §4.
E.1 Additional plots for §4.1
E.2 Additional plots for §4.2
E.3 Additional tables for Table 3
Context | The single made its Irish Singles Chart debut at |
---|---|
VA+ | number three on the issue date of November 18, 2005. It debuted at number one on the chart for the issue date of December 7, 2005. The single debuted at number two on the chart for the issue date of January 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date of March 4, 2006. The single debuted at number two on the Irish Singles Chart for the issue date of June 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date … |
ST+ | number three on the chart issue dated March 16, 2010. It debuted at number two on the UK Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the British Phonographic Industry ( BPI ) for shipments of over 15 @,@ 000 copies. The single debuted at number two on the Irish Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the Irish Recorded Music Association ( IRMA ) for shipments of over 15 @,@ 000 copies. The single was … |
NMST+ | number twenty @-@ seven on the week ending March 26, 2010, and peaked at number three on the week ending March 27, 2010. It was certified gold by the Recording Industry Association of Ireland ( RIAA ) for shipment of 500 @,@ 000 copies of the single. The single was certified gold by the Recording Industry Association of Ireland ( RIANZ ) for shipment of 500 @,@ 000 copies of the single. |
Context | Despite the expensive reconstructions, both vessels were considered
---|---|
VA+ | to be of sufficient quality to be considered for use in the Grand Fleet. The first, the British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built … |
ST+ | to be of sufficient quality to be considered a part of the Royal Navy, and were assigned to the Channel Fleet. The two ships were transferred to the Mediterranean Fleet in early 1915, and were transferred to the North America and West Indies Station in early 1916. They were transferred to the Mediterranean Fleet in early 1917, and were transferred to the North America and West Indies Station in early 1918. They were transferred to the North America and West Indies Station in early 1918, and were transferred to the Mediterranean Fleet in early 1918. They were transferred to the North America and West Indies Station in late 1918, and … |
NMST+ | obsolescent by the time of the Second World War, and were eventually scrapped in favour of the much more powerful British battleship HMS Dreadnought. |
Context | On the July 21, 2014 episode of Raw, |
---|---|
VA+ | Triple H and Randy Orton were involved in a tag team match, in which Triple H and Orton were accompanied by Triple H and Randy Orton’s manager, Mr. Kennedy. Orton was accompanied by Orton’s manager, Mr. Kennedy, and Orton’s tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H … |
ST+ | Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 27 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 30 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. On the August 2 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated … |
NMST+ | Triple H defeated Randy Orton to win the WWE Championship for the third time. |
Context | On the west side of the Naktong,
---|---|
VA+ | the NK 6th Division was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK … |
ST+ | the 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been … |
NMST+ | the Naktong River rises to a height of 1 @,@ 000 metres ( 3 @,@ 300 ft ) above the surrounding terrain. The Naktong River is the longest river in North and South Korea, and the longest in the whole of North Korea. The Naktong is the longest river in North Korea, and the longest in the whole of North Korea. The river is the longest in the entire country, and the longest in the whole of North Korea. |
Context | During spring training in a game against the Boston Red |
---|---|
VA+ | Sox on April 4, 2009, Kinsler was placed on the 15 @-@ day disabled list with a strained right hamstring. He returned to the active roster on April 7, and was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 4, and was placed on the 15 @-@ day DL on May 5. On May 6, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 9. On May 10, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 11. He was placed on the 15 @-@ day DL on May 13. He was placed on the 15 @-@ day DL on May 20. He was placed on the 15 @-@ day DL on May 24. He was placed on the 15 @-@ day DL on May 28. He was placed on the 15 @-@ day DL on June 2. He was placed on … |
ST+ | Sox on April 20, 2011, he was placed on the disabled list with a back injury. He returned to the disabled list on May 10, 2011, and was activated on May 17. He was activated on May 20 and made his first appearance on May 21. He was activated on June 2 and made his first appearance on June 4. He was activated on June 8 and made his first appearance on June 10. He was activated on June 15 and made his first appearance on June 17. He was activated on June 20 and made his first appearance on June 23. He was activated on June 29 and made his first appearance on July 1. He was activated on July 1 and made his first appearance on July 4. He was activated on July 6 and made his first appearance on July 10. He was activated on July 14 and made his first appearance on July 16. He was activated on July 20 and made his first appearance on July 23. He was … |
NMST+ | Sox on April 16, 2010, the Yankees signed Rivera to a one @-@ year, $ 2 @.@ 5 million contract. He made his major league debut on April 21, 2010, against the Boston Red Sox. He pitched a scoreless inning in the first inning of the first game of the 2010 World Series against the New York Mets. On May 1, 2010, Rivera was traded to the Pittsburgh Pirates in exchange for J. J. Hardy. |
E.4 Additional plots for Figure 3
Appendix F Consistency with respect to Other Decoding Algorithms for RNN and LSTM
We validate the consistency of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search, such as top-k sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. We use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+{RNN, LSTM} trained on WikiText-2, with ε chosen based on the validation perplexities in Table 1. Since the validation perplexity does not change with the decoding algorithm, we focus on the average (st.dev.) non-termination ratios, r_L's, across 10 random runs as a function of L for each decoding algorithm in Figure 7. We also plot the evolution of the r_L's of VA+{RNN, LSTM} and ST+{RNN, LSTM} as we vary L.
Appendix G Analysis of Predicted Sequence Length Distributions in §4.1
We investigate whether our proposed non-monotonic self-terminating (NMST+) language model matches the data length distribution better than the baselines: i) a vanilla (VA+) language model and ii) a self-terminating (ST+) language model. For this, we compare the length distributions of predicted sequences from {VA, ST, NMST}+LSTM trained on WikiText-2 with the length distribution of the ground-truth sequences in the WikiText-2 validation dataset, when using greedy search. All experimental setups and notations are the same as in §4.1.
Figure 8 shows the length distributions of {VA, ST, NMST}+LSTM and of the ground-truth validation sequences. For {ST, NMST}+LSTM, we use the ε that is optimal in terms of validation perplexity based on Table 1. We observe that the length distribution of predicted sequences from NMST+LSTM is closer to the data length distribution than those of predicted sequences from VA+LSTM and ST+LSTM.
Furthermore, we can tune ε to make the predicted length distribution of NMST+LSTM agree more closely with the ground-truth length distribution. In Figure 9, we compare the predicted length distributions of NMST+LSTM for two values of ε. We see that a larger ε better models the data length distribution. However, in this case, the average validation perplexity of NMST+LSTM degrades from 101.5 to 105.6, as shown in Table 1.