
A Non-monotonic Self-terminating Language Model

Eugene Choi
[email protected]
Kyunghyun Cho
[email protected]
Cheolhyoung Lee
[email protected]
New York University; Prescient Design, Genentech; CIFAR Fellow. Corresponding author.
Abstract

Recent large-scale neural autoregressive sequence models have shown impressive performance on a variety of natural language generation tasks. However, their generated sequences often exhibit degenerate properties such as non-termination, undesirable repetition, and premature termination, when generated with decoding algorithms such as greedy search, beam search, top-$k$ sampling, and nucleus sampling. In this paper, we focus on the problem of non-terminating sequences resulting from an incomplete decoding algorithm. We first define an incomplete probable decoding algorithm, which includes greedy search, top-$k$ sampling, and nucleus sampling, beyond the incomplete decoding algorithm originally put forward by Welleck et al. (2020). We then propose a non-monotonic self-terminating language model, which significantly relaxes the constraint of monotonically increasing termination probability in the self-terminating language model of Welleck et al. (2020), to address the issue of non-terminating sequences when using incomplete probable decoding algorithms. We prove that our proposed model prevents non-terminating sequences when using not only incomplete probable decoding algorithms but also beam search. We empirically validate our model on sequence completion tasks with various architectures.

1 Introduction

Autoregressive neural sequence models (Bengio et al., 2000) have been widely used for various natural language generation tasks such as language modeling (Brown et al., 2020; Chowdhery et al., 2022), machine translation (Bahdanau et al., 2014), and conversational dialogue modeling (Vinyals & Le, 2015). Furthermore, large-scale autoregressive neural sequence models have shown an unprecedented ability to generate fluent, human-like text (Vaswani et al., 2017; Brown et al., 2020). Despite their success, these models exhibit undesirable behaviors: non-termination (Welleck et al., 2020), degenerate repetition (Welleck et al., 2019; Holtzman et al., 2020), and premature termination (Koehn & Knowles, 2017; Stahlberg & Byrne, 2019). In this paper, we focus on how to prevent non-termination when using a given decoding algorithm.

Non-termination is the problem of a language model generating infinitely long sequences with positive probability under a given decoding algorithm. Welleck et al. (2020) pointed out that this issue arises from a discrepancy between the distribution of the language model and the distribution induced by an incomplete decoding algorithm. They formalized this disparity with the notion of inconsistency, in which the decoding algorithm generates non-terminating sequences from the language model with positive probability. To avoid this inconsistency, they proposed a self-terminating (ST) language model that replaces the usual softmax parametrization of the classifier with a new parametrization. They proved that the ST language model is consistent with respect to greedy search, beam search, top-$k$ sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020).

The ST language model increases the termination probability of each sequence monotonically to 1, but this parametrization is not appropriate for modeling natural language. As an illustrative example, suppose there are two sequences in our dataset: “I am a boy” and “I am a boy, and you are a girl.”. A language model trained on this dataset may or may not terminate after the former. Once the model decides not to end, it should dramatically reduce the termination probability in order to continue. The ST language model, which monotonically increases the termination probability, cannot capture such a case, where one sequence is a prefix of another. We thus propose a non-monotonic self-terminating (NMST) language model which guarantees consistency with respect to greedy search, beam search, top-$k$ sampling, and nucleus sampling without a monotonically increasing termination probability.

The NMST language model encourages the termination probability of each sequence to converge to 1 through its NMST parametrization, but without requiring monotonicity. Even under this relaxation, the proposed NMST language model provably prevents any non-terminating sequence resulting from greedy search, beam search, top-$k$ sampling, and nucleus sampling, which we collectively refer to as incomplete probable decoding algorithms.

We conduct experiments validating the effectiveness of our NMST language models on sequence completion tasks, as was done in earlier studies. We test NMST parametrization with various architectures. Specifically, we train an RNN (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2 (Merity et al., 2016), and we additionally finetune GPT-2 (Radford et al., 2019) on WikiText-103 (Merity et al., 2016). Across all these setups, NMST parametrization effectively prevents non-terminating sequences, especially when compared to softmax parametrization. Furthermore, our NMST parametrization achieves better (lower) perplexities than ST parametrization, confirming the importance of relaxing the monotonicity of the termination probability.

2 Notations and Background

2.1 Notations for Autoregressive Neural Sequence Models

Sequences, vocabulary, and $\langle eos\rangle$. We view an instance (e.g., a sentence or a paragraph) as a sequence ${\bm{y}}=(y_{1},y_{2},\dots,y_{T})$, where each $y_{t}$ is an element of a pre-defined finite set of discrete tokens, referred to as a vocabulary $\mathcal{V}$. $\mathcal{V}$ includes a special symbol $\langle eos\rangle$ that appears only at the end of a sequence. Every sequence ${\bm{y}}$ must end with $\langle eos\rangle$. We write the length of ${\bm{y}}$ as $|{\bm{y}}|$, so that $y_{|{\bm{y}}|}=\langle eos\rangle$. We call ${\bm{y}}$ a non-terminating sequence, $|{\bm{y}}|=\infty$, if $y_{t}\neq\langle eos\rangle$ for all $t$.

Embedding vectors. Since each token $v\in\mathcal{V}$ is not itself a numerical vector, we use an embedding vector ${\bm{u}}_{v}\in\mathbb{R}^{m}$ to project $v$ into a continuous embedding space, which captures the notion of similarity between discrete tokens efficiently (Bengio et al., 2000; Mikolov et al., 2013b; a; Levy & Goldberg, 2014).

Autoregressive neural sequence models. Bengio et al. (2000) proposed an autoregressive neural sequence model parametrized by ${\bm{\theta}}\in\mathbb{R}^{k}$. They factorized $p_{\bm{\theta}}({\bm{y}}|{\bm{x}})$ into a product of the conditional probabilities of each token given all previous tokens and an input, in a predefined order, as follows:

$p_{\bm{\theta}}({\bm{y}}|{\bm{x}})=\prod_{t=1}^{T}p_{\bm{\theta}}(y_{t}|{\bm{y}}_{<t},{\bm{x}})$, (1)

where ${\bm{y}}_{<t}$ is the $t$-prefix of ${\bm{y}}$ and ${\bm{x}}$ is an input referred to as a context. For example, ${\bm{x}}$ represents a prompt in sequence completion or a source-side sequence in machine translation.

There are several popular architectures for $p_{\bm{\theta}}$ such as RNN (Elman, 1990), LSTM (Hochreiter & Schmidhuber, 1997), GRU (Cho et al., 2014), and Transformer (Vaswani et al., 2017). As shown in equation 2, all these models use softmax classifiers. In this paper, we modify the parametrization of their softmax classifiers to prevent non-terminating sequences. We thus write a vanilla language model, regardless of its choice of architecture, that uses the original softmax parametrization as $p_{\bm{\theta}}^{va}$, defined in Definition 1.

Definition 1.

A vanilla language model $p^{va}_{\bm{\theta}}$ computes the conditional probability of each token given a $t$-prefix ${\bm{y}}_{<t}$ and a context ${\bm{x}}$ at each time step $t$ as follows:

$p^{va}_{\bm{\theta}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})=\exp({\bm{u}}_{v}^{\top}{\bm{h}}_{t})/\sum_{v^{\prime}\in\mathcal{V}}\exp({\bm{u}}_{v^{\prime}}^{\top}{\bm{h}}_{t})$, (2)

where ${\bm{h}}_{t}=f_{\bm{\theta}}(y_{t},{\bm{h}}_{t-1})$ with ${\bm{h}}_{0}=\bm{0}$. (Footnote 1: This definition stands for RNN, LSTM, and GRU. For Transformer, ${\bm{h}}_{t}=f_{{\bm{\theta}}}(y_{t},{\bm{h}}_{1:(t-1)})$.)
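To make this parametrization concrete, here is a minimal PyTorch-style sketch of the softmax classifier in equation 2; the function and argument names are our own illustrative choices, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def vanilla_next_token_probs(U: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
    """Equation 2: softmax over the logits u_v^T h_t for every token v in the vocabulary.

    U:   (|V|, m) matrix whose rows are the output embeddings u_v.
    h_t: (m,) hidden state produced by f_theta at step t.
    Returns a (|V|,) vector of probabilities p_theta^va(y_t = v | y_<t, x).
    """
    logits = U @ h_t                  # (|V|,) inner products u_v^T h_t
    return F.softmax(logits, dim=-1)  # normalize over the full vocabulary
```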

Training. For a given dataset $\mathcal{D}=\{({\bm{x}}^{(n)},{\bm{y}}^{(n)})\}_{n=1}^{N}$, we maximize the joint probability assigned to the sequences in our training dataset to find an optimal parameter configuration ${\bm{\theta}}^{\star}$ as follows:

${\bm{\theta}}^{\star}=\arg\max_{\bm{\theta}}\sum_{n=1}^{N}\sum_{t=1}^{T^{(n)}}\log p_{\bm{\theta}}\big(y_{t}^{(n)}\big|{\bm{y}}_{<t}^{(n)},{\bm{x}}^{(n)}\big)$. (3)
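In practice, equation 3 is optimized by minimizing the per-token negative log-likelihood with a stochastic optimizer such as AdamW (used in §4). A minimal sketch of the inner sum, assuming per-step probability vectors produced as in the sketch above:

```python
import torch

def sequence_nll(step_probs: list, target_ids: list) -> torch.Tensor:
    """Negative log-likelihood of one training sequence (the inner sum of equation 3).

    step_probs: list of (|V|,) probability tensors, one per time step t.
    target_ids: ground-truth token indices y_t^(n), ending with the <eos> index.
    """
    nll = torch.zeros(())
    for probs, target in zip(step_probs, target_ids):
        nll = nll - torch.log(probs[target])
    return nll  # minimizing this over the dataset maximizes equation 3
```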

2.2 Incomplete Probable Decoding Algorithms

An autoregressive language model $p_{\bm{\theta}}$ predicts the likelihood of a sequence ${\bm{y}}$ given a context ${\bm{x}}$. Its autoregressive factorization in equation 1 requires a recursive inference process for every $t$. Hence, at inference time, we use a decoding algorithm, defined below, to generate sequences from $p_{\bm{\theta}}$.

Definition 2.

Let $\mathcal{Y}$ be the collection of all ${\bm{y}}=(y_{1},y_{2},\cdots,y_{T})$ where $T\in\{1,2,\cdots\}$ and $y_{t}\in\mathcal{V}$. A decoding algorithm $\mathcal{S}$ is a function that maps $p_{\bm{\theta}}$ to $q_{\mathcal{S}(p_{\bm{\theta}})}$, a probability distribution over $\mathcal{Y}$. A decoded sentence $\hat{{\bm{y}}}$ given ${\bm{x}}$ by $\mathcal{S}$ from $p_{\bm{\theta}}$ is a random sample from $q_{\mathcal{S}(p_{\bm{\theta}})}({\bm{y}}|{\bm{x}})$.

To generate a high-quality sequence from $p_{\bm{\theta}}$, a decoding algorithm assumes that a higher-quality sequence has a higher probability under $p_{\bm{\theta}}$ than others. For instance, the maximum a posteriori (MAP) decoding algorithm $\mathcal{S}_{map}$ returns the most probable sequence ${\bm{y}}^{\star}$ given a context ${\bm{x}}$ from $p_{\bm{\theta}}$:

${\bm{y}}^{\star}=\arg\max_{{\bm{y}}\in\mathcal{Y}}p_{\bm{\theta}}({\bm{y}}|{\bm{x}})$, (4)

by setting $q_{\mathcal{S}_{map}(p_{\bm{\theta}})}({\bm{y}}={\bm{y}}^{\star}|{\bm{x}})=1$ and $q_{\mathcal{S}_{map}(p_{\bm{\theta}})}({\bm{y}}={\bm{y}}^{\prime}|{\bm{x}})=0$ for ${\bm{y}}^{\prime}\in\mathcal{Y}\setminus\{{\bm{y}}^{\star}\}$. Unfortunately, $\mathcal{S}_{map}$ is intractable since equation 4 requires an exhaustive search over the sequence space $\mathcal{Y}$. Hence, in practice, we use incomplete probable decoding algorithms, defined as follows:

Definition 3.

A decoding algorithm $\mathcal{S}$ is incomplete and probable if there exists $\mathcal{V}_{t}\subsetneq\mathcal{V}$ such that

$\sum_{v\in\mathcal{V}_{t}}q_{\mathcal{S}(p_{\bm{\theta}})}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})=1$ (5)

and

$\min_{v\in\mathcal{V}_{t}}p_{\bm{\theta}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})\geq\max_{v\in\mathcal{V}\setminus\mathcal{V}_{t}}p_{\bm{\theta}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})$ (6)

for each $t$. Furthermore, for every $v\in\mathcal{V}_{t}$, $\mathcal{S}$ satisfies

$q_{\mathcal{S}(p_{\bm{\theta}})}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})\geq p_{\bm{\theta}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})$. (7)

At each $t$, an incomplete probable decoding algorithm $\mathcal{S}$ considers only a set of highly probable tokens, $\mathcal{V}_{t}$. $\mathcal{S}$ generates $\hat{{\bm{y}}}$ given ${\bm{x}}$ by recursively sampling $\hat{y}_{t}$ from $q_{\mathcal{S}(p_{\bm{\theta}})}(y_{t}|\hat{{\bm{y}}}_{<t},{\bm{x}})$ supported on $\mathcal{V}_{t}$. This reduces the exponential complexity of $\mathcal{S}_{map}$, $\mathcal{O}(|\mathcal{V}|^{|\hat{{\bm{y}}}|})$, down to a linear level, $\mathcal{O}(|\hat{{\bm{y}}}|\cdot|\mathcal{V}|)$.

Greedy search, top-$k$ sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020) are incomplete and probable. For example, greedy search $\mathcal{S}_{gr}$ generates the $t$-th item of a sequence by

$\hat{y}_{t}=\arg\max_{v\in\mathcal{V}}p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})$. (8)

In other words, $\mathcal{S}_{gr}$ sets $\mathcal{V}_{t}$ to $\{v_{t}^{(1)}\}$ where $v_{t}^{(1)}=\arg\max_{v\in\mathcal{V}}p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})$. Moreover, we have $p_{\bm{\theta}}(y_{t}=v_{t}^{(1)}|\hat{{\bm{y}}}_{<t},{\bm{x}})\leq q_{\mathcal{S}_{gr}(p_{\bm{\theta}})}(y_{t}=v_{t}^{(1)}|\hat{{\bm{y}}}_{<t},{\bm{x}})=1$, and $q_{\mathcal{S}_{gr}(p_{\bm{\theta}})}(y_{t}=v^{\prime}|\hat{{\bm{y}}}_{<t},{\bm{x}})=0$ holds for $v^{\prime}\in\mathcal{V}\setminus\mathcal{V}_{t}$. Thus, $\mathcal{S}_{gr}$ is incomplete and probable. Unlike $\mathcal{S}_{gr}$, top-$k$ sampling takes the $k$ most probable tokens in $\mathcal{V}$ as $\mathcal{V}_{t}$, while nucleus sampling sets $\mathcal{V}_{t}$ to the smallest subset of $\mathcal{V}$ containing the most probable tokens whose total probability exceeds a given threshold $\mu$. In §A.1 and A.2, we show that top-$k$ sampling and nucleus sampling are also incomplete and probable.
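As a concrete summary of how these algorithms induce $q_{\mathcal{S}(p_{\bm{\theta}})}$, the sketch below selects $\mathcal{V}_{t}$ for one step and renormalizes $p_{\bm{\theta}}$ on it (equations 5–7); the function names and the default values of $k$ and $\mu$ are our own illustrative choices.

```python
import torch

def candidate_set(probs: torch.Tensor, method: str, k: int = 4, mu: float = 0.9) -> torch.Tensor:
    """Return the indices of V_t for one decoding step, given probs = p_theta(y_t = . | y_<t, x)."""
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    if method == "greedy":               # V_t = {argmax}
        return sorted_idx[:1]
    if method == "top-k":                # V_t = k most probable tokens
        return sorted_idx[:k]
    if method == "nucleus":              # smallest prefix of the sorted tokens whose mass reaches mu
        cum = torch.cumsum(sorted_p, dim=0)
        cutoff = int((cum < mu).sum().item()) + 1
        return sorted_idx[:cutoff]
    raise ValueError(method)

def step_distribution(probs: torch.Tensor, cand: torch.Tensor) -> torch.Tensor:
    """q_S(p_theta)(y_t = . | y_<t, x): renormalize p_theta on V_t, zero elsewhere."""
    q = torch.zeros_like(probs)
    q[cand] = probs[cand] / probs[cand].sum()
    return q
```

For example, sampling $\hat{y}_{t}$ from `step_distribution(probs, candidate_set(probs, "nucleus", mu=0.4))` corresponds to the nucleus-0.4 configuration used in §5, and greedy search reduces to taking the single retained index.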

Beam search is a heuristic algorithm that operates on the level of prefixes. We describe it further in §A.3. Although beam search is not an incomplete probable decoding algorithm, it also selects a proper subset $\mathcal{V}_{t}\subsetneq\mathcal{V}$ to expand each prefix at each step $t$. Due to this, our main theoretical finding for incomplete probable decoding algorithms in §3 applies to beam search as well.

2.3 Consistency with respect to Incomplete Probable Decoding Algorithms and Self-terminating (ST) Language Models

Incomplete probable decoding algorithms greatly reduce computational overhead for generating sequences from our model. However, Welleck et al. (2020) observed that they can generate non-terminating sequences even if every training sequence has a finite length. To study this, Welleck et al. (2020) defined consistency with respect to decoding algorithms as shown in Definition 4.

Definition 4.

A language model $p_{\bm{\theta}}$ is consistent with respect to a decoding algorithm $\mathcal{S}$ if $q_{\mathcal{S}(p_{\bm{\theta}})}(|{\bm{y}}|=\infty)=0$ for any parameter configuration ${\bm{\theta}}\in\mathbb{R}^{k}$.

Welleck et al. (2020) also proved that a vanilla language model $p_{\bm{\theta}}^{va}$ defined in Definition 1 is inconsistent with respect to incomplete probable decoding algorithms and beam search:

Theorem 1.

A vanilla language model $p^{va}_{\bm{\theta}}$ defined in Definition 1 is inconsistent with respect to any incomplete probable decoding algorithm and beam search (Theorem 3.4 in Welleck et al. (2020)).

For each $t$, an incomplete probable decoding algorithm $\mathcal{S}$ selects $\mathcal{V}_{t}\subsetneq\mathcal{V}$ as the set of candidates for decoding, but $p_{\bm{\theta}}^{va}$ does not guarantee that $\langle eos\rangle\in\mathcal{V}_{t}$. Specifically, if $\langle eos\rangle\notin\mathcal{V}_{t}$ for all $t$, then $\mathcal{S}$ can never decode $\langle eos\rangle$, so the generated sequence is non-terminating. Based on this result, Welleck et al. (2020) proposed a self-terminating (ST) language model, defined below:

Definition 5.

For ${\bm{h}}_{t}$ defined in Definition 1, the conditional probability of each token $v\in\mathcal{V}$ given a $t$-prefix ${\bm{y}}_{<t}$ and a context ${\bm{x}}$ at each time step $t$ in an ST language model is given by

$\alpha_{t}=p_{\bm{\theta}}^{st}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1-\prod_{t^{\prime}=1}^{t}(1-\epsilon)\cdot\sigma({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t^{\prime}})$, (9)

and

$p^{st}_{{\bm{\theta}}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})=(1-\alpha_{t})\cdot\exp({\bm{u}}_{v}^{\top}{\bm{h}}_{t})/\sum_{v^{\prime}\in{\mathcal{V}\setminus\{\langle eos\rangle\}}}\exp({\bm{u}}_{v^{\prime}}^{\top}{\bm{h}}_{t})$,

where $v\in\mathcal{V}\setminus\{\langle eos\rangle\}$, $\epsilon\in(0,1)$, and $\sigma(x)=(1+\exp(-x))^{-1}$ is the sigmoid function.
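For reference, a minimal sketch of equation 9 given the $\langle eos\rangle$ logits ${\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t}$ at every step (the function and tensor names are our own):

```python
import torch

def st_eos_probability(eos_logits: torch.Tensor, eps: float) -> torch.Tensor:
    """ST parametrization (equation 9): alpha_t = 1 - prod_{t'<=t} (1 - eps) * sigmoid(u_eos^T h_t').

    eos_logits: (T,) tensor of u_eos^T h_t for t = 1..T.
    Returns a (T,) tensor of alpha_t; it is monotonically non-decreasing in t by construction.
    """
    per_step = (1.0 - eps) * torch.sigmoid(eos_logits)  # each factor lies in (0, 1 - eps)
    return 1.0 - torch.cumprod(per_step, dim=0)         # running product forces alpha_t to be monotone
```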

They proved that the ST language model is consistent with respect to any incomplete probable decoding algorithm and beam search, as follows:

Theorem 2.

An ST language model $p_{\bm{\theta}}^{st}$ defined in Definition 5 is consistent with respect to any incomplete probable decoding algorithm and beam search (Theorems 4.1-4.3 in Welleck et al. (2020)).

In equation 9, $p_{\bm{\theta}}^{st}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ monotonically increases to 1 as $t$ increases. Hence, $\mathcal{S}$ eventually always includes $\langle eos\rangle$ in $\mathcal{V}_{t}$ for all $t\geq t^{\prime}$ with some $t^{\prime}$, and $\lim_{t\rightarrow\infty}q_{\mathcal{S}(p_{\bm{\theta}})}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1$ by equation 7. This guarantees that $\mathcal{S}$ terminates in a finite number of steps. Despite the consistency of $p_{\bm{\theta}}^{st}$, its validation perplexity degrades compared to $p_{\bm{\theta}}^{va}$ in sequence completion (Welleck et al., 2020). We suspect that this degradation comes from the core property of $p_{\bm{\theta}}^{st}$ that $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ monotonically increases to 1 as $t$ increases. In Remark 1 below, we provide an example where the optimal $p_{{\bm{\theta}}^{\star}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ is not monotonic.

Remark 1.

Let $\mathcal{D}=\{({\bm{x}}^{(1)},{\bm{y}}^{(1)}),({\bm{x}}^{(2)},{\bm{y}}^{(2)})\}$ be a two-instance training dataset. Assume that there exists $t_{0}$ such that ${\bm{y}}_{<t_{0}}={\bm{y}}_{<t_{0}}^{(1)}={\bm{y}}_{<t_{0}}^{(2)}$. Suppose further that $t_{0}=|{\bm{y}}^{(1)}|<|{\bm{y}}^{(2)}|-1$ and ${\bm{x}}={\bm{x}}^{(1)}={\bm{x}}^{(2)}$. If ${\bm{\theta}}^{\star}$ is an optimal parameter configuration in equation 3 over $\mathcal{D}$, then $p_{{\bm{\theta}}^{\star}}(y^{(2)}_{t}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t},{\bm{x}})$ is not monotonic with respect to $t$ (proved in §B).

We can easily find such cases in natural language satisfying the assumptions in Remark 1, for example by concatenating two sequences that share a prefix. We empirically demonstrate the existence of such cases in §4.2.
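To make Remark 1 concrete, the sketch below estimates the optimal (maximum-likelihood) termination probability on the two-sentence example from §1 by simple counting; the whitespace tokenization and helper name are ours, for illustration only.

```python
# Toy dataset from the introduction: the first sequence is a prefix of the second.
seq1 = ["I", "am", "a", "boy", "<eos>"]
seq2 = ["I", "am", "a", "boy", ",", "and", "you", "are", "a", "girl", ".", "<eos>"]
dataset = [seq1, seq2]

def empirical_eos_probability(dataset, prefix):
    """The optimal p(y_t = <eos> | prefix) is the empirical frequency of <eos>
    among the continuations of `prefix` in the dataset."""
    continuations = [seq[len(prefix)] for seq in dataset
                     if len(seq) > len(prefix) and seq[:len(prefix)] == prefix]
    return continuations.count("<eos>") / len(continuations)

for t in range(1, 7):
    print(t, empirical_eos_probability(dataset, seq2[:t - 1]))
# Prints 0, 0, 0, 0, 0.5, 0: the termination probability rises to 1/2 at the step where
# seq1 ends (t_0 = 5) and drops back to 0 afterwards, i.e., it is non-monotonic in t.
```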

3 Non-monotonic Self-terminating (NMST) Language Models

The consistency of $p_{\bm{\theta}}^{st}$ comes from $\lim_{t\rightarrow\infty}p_{\bm{\theta}}^{st}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1$, not from the monotonic increase of $p_{\bm{\theta}}^{st}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ as a function of $t$. This motivates us to propose a non-monotonic self-terminating (NMST) language model $p_{\bm{\theta}}^{nmst}$ that permits $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ to be a non-monotonic function of $t$ while still satisfying $\lim_{t\rightarrow\infty}p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1$, as follows:

Definition 6.

For ${\bm{h}}_{t}$ defined in Definition 1, the conditional probability of each token given a $t$-prefix ${\bm{y}}_{<t}$ and a context ${\bm{x}}$ at the $t$-th step in an NMST language model is defined by

$\alpha_{t}=p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=\big(1-\sigma({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t})\big)\big(1-(1-\epsilon)^{t}\big)+\sigma({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t})$, (10)

and

$p^{nmst}_{{\bm{\theta}}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})=(1-\alpha_{t})\cdot\exp({\bm{u}}_{v}^{\top}{\bm{h}}_{t})/\sum_{v^{\prime}\in{\mathcal{V}\setminus\{\langle eos\rangle\}}}\exp({\bm{u}}_{v^{\prime}}^{\top}{\bm{h}}_{t})$,

where $v\in\mathcal{V}\setminus\{\langle eos\rangle\}$, $\epsilon\in(0,1)$, and $\sigma(x)=(1+\exp(-x))^{-1}$ is the sigmoid function.
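A minimal sketch of equation 10, parallel to the ST sketch above (again, the function and tensor names are ours):

```python
import torch

def nmst_eos_probability(eos_logits: torch.Tensor, eps: float) -> torch.Tensor:
    """NMST parametrization (equation 10): a convex combination of the lower bound
    f_lb(t) = 1 - (1 - eps)^t and the upper bound f_ub(t) = 1 with coefficient sigmoid(u_eos^T h_t).

    eos_logits: (T,) tensor of u_eos^T h_t for t = 1..T.
    Returns a (T,) tensor of alpha_t, which may fluctuate with t but converges to 1.
    """
    t = torch.arange(1, eos_logits.numel() + 1, dtype=eos_logits.dtype)
    f_lb = 1.0 - (1.0 - eps) ** t          # tends to 1 as t grows, for any eps in (0, 1)
    lam = torch.sigmoid(eos_logits)        # step-dependent mixing coefficient in (0, 1)
    return (1.0 - lam) * f_lb + lam * 1.0  # alpha_t >= f_lb(t), so alpha_t -> 1
```

Unlike the cumulative product in the ST sketch, each `alpha_t` here depends on $t$ only through the lower bound, so consecutive values are free to decrease as long as they stay above $f_{lb}(t)$.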



Figure 1: An illustration of NMST parametrization in equation 10, where $f_{lb}(t)=1-(1-\epsilon)^{t}$, $f_{ub}(t)=1$, $\lambda(t^{\prime})=\sigma({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t^{\prime}})$, and $g(t)=p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$. If $g(t)$ lies between $f_{lb}(t)$ and $f_{ub}(t)$, we can find $\lambda(t^{\prime})$ such that $g(t^{\prime})=(1-\lambda(t^{\prime}))f_{lb}(t^{\prime})+\lambda(t^{\prime})f_{ub}(t^{\prime})$ for any $t^{\prime}$, regardless of whether $g(t)$ is monotonic with respect to $t$. This allows $p^{nmst}_{\bm{\theta}}$ to learn a non-monotonic behavior of $p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$. $p^{nmst}_{\bm{\theta}}$ is consistent with respect to any incomplete probable decoding algorithm and beam search since $\lim_{t\rightarrow\infty}f_{lb}(t)=1$ implies $\lim_{t\rightarrow\infty}p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1$.

Figure 1 shows that $p_{\bm{\theta}}^{nmst}$ uses a convex combination of two curves to model $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$. We can write any curve $g(t)$ between a lower-bound curve $f_{lb}(t)$ and an upper-bound curve $f_{ub}(t)$ as $g(t)=(1-\lambda(t))f_{lb}(t)+\lambda(t)f_{ub}(t)$ with an appropriate $\lambda(t)\in(0,1)$ for all $t$. $p_{\bm{\theta}}^{nmst}$ sets $g(t)$ to $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ and regards it as a convex combination of $f_{lb}(t)=1-(1-\epsilon)^{t}$ and $f_{ub}(t)=1$ with coefficient $\lambda(t)=\sigma({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t})$. This enables a non-monotonic $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$. Moreover, in Theorem 3 below, we show that the proposed NMST parametrization in equation 10 still guarantees consistency with respect to any incomplete probable decoding algorithm and beam search.

Theorem 3.

An NMST language model defined in Definition 6 is consistent with respect to any incomplete probable decoding algorithm and beam search. (Footnote 2: We provide the proof in §C.)

Theorem 3 guarantees that every sequence decoded from $p_{\bm{\theta}}^{nmst}$ by an incomplete probable decoding algorithm or beam search terminates. Thus, like $p_{\bm{\theta}}^{st}$, $p_{\bm{\theta}}^{nmst}$ does not produce non-terminating sequences under these decoding algorithms. Unlike ST parametrization, however, our NMST parametrization in equation 10 can capture a wider range of $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$, since $p_{\bm{\theta}}^{nmst}$ does not assume that $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ is a monotonic function of $t$. We empirically demonstrate this by comparing $p_{\bm{\theta}}^{va}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$, $p_{\bm{\theta}}^{st}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$, and $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ in Figure 3.

4 Experiments

We empirically validate the effectiveness of the proposed non-monotonic self-terminating (NMST) language model by evaluating it on sequence completion tasks. We test three variants of a given architecture: (i) a vanilla (VA+) language model using the common softmax parametrization in equation 2, (ii) a self-terminating (ST+) language model using the ST parametrization proposed by Welleck et al. (2020), and (iii) our non-monotonic self-terminating (NMST+) language model using the NMST parametrization in equation 10. We use the following evaluation metrics for comparison:

  • Perplexity: Given an autoregressive language model $p_{\bm{\theta}}$ and a dataset $\mathcal{D}=\{({\bm{x}}^{(n)},{\bm{y}}^{(n)})\}_{n=1}^{N}$, the perplexity of $p_{\bm{\theta}}$ over $\mathcal{D}$ is $\exp\big(-\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T^{(n)}}\log p_{\bm{\theta}}(y_{t}^{(n)}|{\bm{y}}_{<t}^{(n)},{\bm{x}}^{(n)})\big)$.

  • Non-termination ratio ($r_{nt}$): To assess the consistency of $p_{\bm{\theta}}$ with respect to a given decoding algorithm $\mathcal{S}$, we would need to compute $r_{nt}=q_{\mathcal{S}(p_{\bm{\theta}})}(|{\bm{y}}|=\infty)$. Instead, based on

    $r_{nt}=q_{\mathcal{S}(p_{\bm{\theta}})}(|{\bm{y}}|=\infty)=\lim_{L\rightarrow\infty}q_{\mathcal{S}(p_{\bm{\theta}})}(|{\bm{y}}|>L)$, (11)

    we use $r_{nt}(L)=q_{\mathcal{S}(p_{\bm{\theta}})}(|{\bm{y}}|>L)$ with a sufficiently large threshold $L$ to estimate $r_{nt}$ (see the sketch below).
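A Monte Carlo estimate of $r_{nt}(L)$ over a set of validation contexts can be computed as in the sketch below; `generate` stands for decoding with $\mathcal{S}$ from $p_{\bm{\theta}}$ and is an assumption of this sketch, not an interface from the paper's code.

```python
def non_termination_ratio(contexts, generate, L=1000):
    """Monte Carlo estimate of r_nt(L) = q_S(p_theta)(|y| > L).

    contexts: iterable of validation prompts x.
    generate: callable (x, max_len) -> list of decoded tokens (strings), assumed to stop
              early at "<eos>"; it stands in for decoding with S (greedy, top-k, nucleus, beam).
    """
    non_terminated = 0
    total = 0
    for x in contexts:
        y_hat = generate(x, max_len=L + 1)   # decode at most L+1 tokens
        if "<eos>" not in y_hat:             # no <eos> within L steps => counted as non-terminating
            non_terminated += 1
        total += 1
    return non_terminated / total
```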

Sequence completion is the task of predicting a continuation $\hat{{\bm{y}}}$ given a $c$-length context ${\bm{x}}=(x_{1},x_{2},\cdots,x_{c})$ by using a decoding algorithm $\mathcal{S}$ on a language model $p_{\bm{\theta}}$ (i.e., $\hat{{\bm{y}}}\sim q_{\mathcal{S}(p_{\bm{\theta}})}({\bm{y}}|{\bm{x}})$). In this section, we use greedy search defined in equation 8 to generate $\hat{{\bm{y}}}$ given ${\bm{x}}$. Our main theoretical finding in Theorem 3 is that the proposed NMST language model is consistent with respect to not only greedy search but also top-$k$ sampling, nucleus sampling, and beam search. We therefore present results using decoding algorithms other than greedy search in §5 and §F.

4.1 WikiText-2

WikiText-2 (Merity et al., 2016) consists of 2 million words from 600 Wikipedia articles. With word tokenization, we regard the first 10 tokens of each sequence as a context ${\bm{x}}$ and its remaining part as a ground truth ${\bm{y}}$. We train an RNN with $\tanh$ activations (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. The RNN and LSTM have 2 layers, with 256 and 512 hidden units per layer, respectively. We perform 10 random runs with a batch size of 32 for 70 epochs. We use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, $\beta_{1}=0.9$, $\beta_{2}=0.99$, weight decay of 0.01, learning rate decay, and early stopping. We further describe our models and training strategies for the WikiText-2 experiments in §D. Unlike VA+{RNN, LSTM}, ST+{RNN, LSTM} and NMST+{RNN, LSTM} need an additional hyperparameter $\epsilon$. We explore $\epsilon\in\{1.0\times 10^{-5},5.0\times 10^{-5},1.0\times 10^{-4},5.0\times 10^{-4}\}$.

We present the average ($\pm$ st.dev.) non-termination ratios $r_{nt}(L)$ across 10 random runs as a function of $L$ for all considered WikiText-2 setups in Figure 2, using greedy search. From equation 11, a language model is consistent with respect to greedy search if $\lim_{L\rightarrow\infty}r_{nt}(L)=0$. As $L$ increases, we observe that $r_{nt}(L)$ of VA+{RNN, LSTM} fails to converge toward 0, while $r_{nt}(L)$ of ST+{RNN, LSTM} and NMST+{RNN, LSTM} all reach 0. In other words, RNN and LSTM become consistent with respect to greedy search after replacing the original softmax parametrization with either the proposed NMST parametrization or ST parametrization.

Figure 2: Non-termination ratios $r_{nt}(L)$ as a function of $L$ in log-log scale for (a) RNN and (b) LSTM trained on WikiText-2 when using greedy search. We report the mean (curve) $\pm$ st.dev. (shaded area) across 10 random experiments. For all configurations, both ST+ (non-red dashed) proposed by Welleck et al. (2020) and our NMST+ (non-red solid) are consistent with respect to greedy search since $r_{nt}(L)$ goes to 0 as $L$ increases. However, softmax parametrization (VA+, red dotted) is inconsistent with respect to greedy search since its $r_{nt}(L)$ does not converge to 0 as $L\rightarrow\infty$.
Table 1: Mean ($\pm$ st.dev.) validation perplexities across 10 random runs on WikiText-2 for various model configurations. Lower is better. Bold marks the best of each architecture. For all $\epsilon$, the validation perplexities of our NMST+{RNN, LSTM} are better than those of ST+{RNN, LSTM} proposed by Welleck et al. (2020). Moreover, with a proper choice of $\epsilon=1.0\times 10^{-5}$, NMST+{RNN, LSTM} have competitive validation perplexities against those of VA+{RNN, LSTM}.
                      RNN                                  LSTM
$\epsilon$            ST+             NMST+                ST+             NMST+
$5.0\times 10^{-4}$   186.1 ± 6.2     184.2 ± 6.5          106.1 ± 1.0     105.6 ± 1.2
$1.0\times 10^{-4}$   181.0 ± 3.8     177.4 ± 7.0          104.6 ± 1.4     102.5 ± 1.0
$5.0\times 10^{-5}$   182.6 ± 8.0     179.6 ± 5.7          104.7 ± 1.6     102.1 ± 1.0
$1.0\times 10^{-5}$   180.4 ± 3.3     $\mathbf{177.4 \pm 4.5}$   104.5 ± 1.4     $\mathbf{101.5 \pm 0.8}$
VA+                   178.6 ± 6.3                          101.6 ± 1.0

Table 1 shows the average ($\pm$ st.dev.) validation perplexities across 10 random experiments for all variants of RNN and LSTM trained on WikiText-2. We observe that NMST+{RNN, LSTM} have better validation perplexities than ST+{RNN, LSTM} for every $\epsilon$. We demonstrate this more clearly in §E.1 by plotting the evolution of the mean validation perplexities as we vary $\epsilon$. Although our NMST+ guarantees the consistency of RNN and LSTM with respect to greedy search with a better validation perplexity than ST+, we need to select $\epsilon$ carefully. As $\epsilon$ increases, the lower bound of $p_{\bm{\theta}}^{nmst}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ grows faster, yielding prematurely terminated sequences when $\epsilon$ is too large. Indeed, the average validation perplexities of NMST+RNN and NMST+LSTM with $\epsilon=5.0\times 10^{-4}$ are 184.2 and 105.6, which degrade by 5.6 and 4.0 from those of VA+RNN and VA+LSTM, 178.6 and 101.6, respectively. We emphasize, however, that the optimal $\epsilon=1.0\times 10^{-5}$ gives NMST+{RNN, LSTM} validation perplexities similar to those of VA+{RNN, LSTM}. In short, both NMST+ and ST+ prevent non-termination when using greedy search, but only NMST+ has a competitive validation perplexity against VA+. In §G, we further observe that the length distribution of sequences predicted by NMST+LSTM is closer to the length distribution of the ground truth sequences than those of {VA, ST}+LSTM.

4.2 WikiText-103

WikiText-103 (Merity et al., 2016) consists of 103 million words constructed from 28,000 articles. We use BPE tokenization (Sennrich et al., 2015) and consider the first 10 tokens of each sequence as a context. Since WikiText-103 is substantially larger than WikiText-2, we finetune a pretrained GPT-2, a transformer language model with 124 million parameters (Radford et al., 2019), for 500,000 steps. For computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. We use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of $5.0\times 10^{-5}$, $\beta_{1}=0.9$, $\beta_{2}=0.99$, weight decay of 0.01, linear learning rate decay, and early stopping. We present a more detailed description in §D. We select $\epsilon$ from $\{1.0\times 10^{-5},5.0\times 10^{-5},1.0\times 10^{-4},5.0\times 10^{-4}\}$ for ST+GPT-2 and NMST+GPT-2.

Table 2: Average ($\pm$ st.dev.) validation perplexities across 10 random runs for all variants of GPT-2 finetuned on WikiText-103, along with their non-termination ratios (mean $\pm$ st.dev.) $r_{nt}(L)$ when using greedy search. We set $L$ to 1,000 since the maximum length of sequences generated from GPT-2 is 1,024. For perplexity, lower is better. Bold marks the best validation perplexity across all setups. For every $\epsilon$, NMST+GPT-2 outperforms ST+GPT-2 in terms of average validation perplexity. In terms of $r_{nt}(L)$, NMST+GPT-2 effectively prevents non-terminating sequences compared to VA+GPT-2 for every $\epsilon$, while ST+GPT-2 with small $\epsilon$ fails to avoid them. With a proper choice of $\epsilon$ (e.g., $\epsilon=1.0\times 10^{-5}$), NMST+GPT-2 also improves the validation perplexity.
                      Perplexity                           $r_{nt}(L)$
$\epsilon$            ST+             NMST+                ST+             NMST+
$5.0\times 10^{-4}$   21.80 ± 0.02    21.63 ± 0.02         0.05 ± 0.03     0.07 ± 0.03
$1.0\times 10^{-4}$   21.21 ± 0.02    20.86 ± 0.02         0.72 ± 0.11     0.22 ± 0.10
$5.0\times 10^{-5}$   21.19 ± 0.03    20.76 ± 0.02         0.72 ± 0.11     0.24 ± 0.10
$1.0\times 10^{-5}$   21.16 ± 0.03    $\mathbf{20.69 \pm 0.03}$   0.75 ± 0.10     0.23 ± 0.10
VA+                   20.72 ± 0.03                         0.27 ± 0.08

We report the mean ($\pm$ st.dev.) validation perplexities and non-termination ratios $r_{nt}(L)$ resulting from greedy search across 10 random runs for all GPT-2 setups finetuned on WikiText-103 in Table 2. Since GPT-2 can handle up to 1,024 tokens, we use $L=1{,}000$. As shown in Figure 2, we need a sufficiently large $L$, such as $L=10^{5}$, to determine whether a language model is consistent with respect to greedy search. Although $L=1{,}000$ is not sufficiently large, we observe that $r_{nt}(L)$ of NMST+GPT-2 decreases relative to $r_{nt}(L)$ of VA+GPT-2 as $\epsilon$ increases. That is, NMST+ reduces the number of non-terminating continuations within 1,000 steps. Avoiding non-termination does not by itself imply better quality, so in Table 3 we show sample continuations from NMST+GPT-2, using greedy search, given a context that leads VA+GPT-2 to non-terminate. We observe that the quality of the generated sequence tends to improve with NMST+ by avoiding repetitions of similar phrases and ending with $\langle eos\rangle$. We present more example continuations in §E.3.

Similar to the results in §4.1, Table 2 shows that the validation perplexities of both ST+GPT-2 proposed by Welleck et al. (2020) and our NMST+GPT-2 degrade relative to VA+GPT-2 as $\epsilon$ increases. NMST+GPT-2 with the optimal $\epsilon=1.0\times 10^{-5}$ has a competitive validation perplexity of 20.69 against that of VA+GPT-2, 20.72. On the other hand, we cannot find an $\epsilon$ that makes the validation perplexity of ST+GPT-2 competitive against that of VA+GPT-2. Moreover, if $\epsilon\neq 5.0\times 10^{-4}$, then $r_{nt}(L)$ of ST+GPT-2 blows up, unlike $r_{nt}(L)$ of VA+GPT-2. §E.2 demonstrates this inevitable perplexity degradation and the exploding $r_{nt}(L)$ of ST+GPT-2. We suspect that it is due to $p_{{\bm{\theta}}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ increasing monotonically with respect to $t$.

Table 3: Given a context from a validation instance of WikiText-103, we present example continuations of {VA, ST, NMST}+GPT-2 when using greedy search. We select $\epsilon=1.0\times 10^{-5}$ for {ST, NMST}+GPT-2 because it is optimal in terms of validation perplexities in Table 2. Unlike {VA, ST}+GPT-2, NMST+GPT-2 improves the quality of the sequence by avoiding repetitive tokens and ending with $\langle eos\rangle$ when the given context leads VA+GPT-2 to non-terminate within 1,000 steps.
Context Made of concrete, steel, and wood, the
VA+ building was built in the mid @-@ 19th century. It was the first building in the United States to be built in concrete, and the first to be built in wood. It was also the first building in the United States to be built in steel. It was the first building in …
ST+ building is constructed of steel and concrete. The building’s exterior is made of steel and concrete. The building’s interior is made of wood, and the building’s exterior is made of concrete. The building’s exterior is made of concrete, and the building’s …
NMST+ building was designed by the architectural firm of Bowers & Wainwright, and was completed in 1892. The building is the largest of its kind in the United States. $\langle eos\rangle$

We investigate the behavior of $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ for {VA, ST, NMST}+GPT-2 in Figure 3. Based on Table 2, we select the optimal $\epsilon=1.0\times 10^{-5}$ in terms of validation perplexity for {ST, NMST}+GPT-2. In Figure 3, {VA, NMST}+GPT-2 capture well whether a sequence might end (e.g., after periods), showing non-monotonic behavior at those seemingly-terminating steps, whereas ST+GPT-2 cannot model such non-monotonic behavior because it assumes that $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ is a monotonic function of $t$. This constraint makes ST+GPT-2 generate often finite but unnecessarily long sequences with greedy search (i.e., higher $r_{nt}(L)$ than VA+GPT-2 for small $L$, but $r_{nt}(L)=0$ for sufficiently large $L$). We demonstrate more behaviors in §E.4.



Figure 3: We present $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ as a function of $t$ for validation instances of WikiText-103, where $p_{\bm{\theta}}$ is each of {VA, ST, NMST}+GPT-2. For {ST, NMST}+GPT-2, we choose $\epsilon=1.0\times 10^{-5}$ because it is optimal in terms of validation perplexities in Table 2. Instead of $t$, we label the horizontal axis with the $t$-th ground truth token. We report the mean (curve) $\pm$ st.dev. (shaded area) across 10 random runs. Unlike ST+GPT-2, NMST+GPT-2 can model non-monotonic behaviors of $p_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})$ with respect to $t$. Both plots show that the non-monotonic behaviors occur where the sequences could end (e.g., after red-marked tokens such as periods).

5 Consistency with respect to Other Decoding Algorithms

We explore the effectiveness of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search: top-$k$ sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. According to Theorem 3, the NMST language model is consistent with respect to any incomplete probable decoding algorithm (e.g., greedy search, top-$k$ sampling, and nucleus sampling) and beam search for all $\epsilon>0$. To validate this, we use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+GPT-2 finetuned on WikiText-103 with $\epsilon=1.0\times 10^{-5}$, chosen based on the validation perplexities in Table 2. Since the validation perplexity does not depend on the decoding algorithm, we focus on the average ($\pm$ st.dev.) non-termination ratios $r_{nt}(L)$ across 10 random runs with $L=1{,}000$ for each decoding algorithm in Table 4. We also present $r_{nt}(L)$ of VA+GPT-2 and of ST+GPT-2 with $\epsilon=1.0\times 10^{-5}$ as baselines.

Table 4: Mean ($\pm$ st.dev.) non-termination ratios $r_{nt}(L)$ across 10 random runs for the variants of GPT-2 finetuned on WikiText-103 with various decoding algorithms. We set $L$ to 1,000 due to GPT-2's context window size of 1,024. We use the optimal $\epsilon=1.0\times 10^{-5}$ in terms of average validation perplexities in Table 2 for both NMST+GPT-2 and ST+GPT-2. Bold marks the lowest $r_{nt}(L)$ within each decoding algorithm (column). As with greedy search in Table 2, for all decoding algorithms the $r_{nt}(L)$ of NMST+GPT-2 is lower than those of ST+GPT-2 and VA+GPT-2. In other words, NMST+ reduces the number of non-terminating sequences within 1,000 decoding steps.
        top-2        top-4        nucleus-0.2                nucleus-0.4                beam-2                     beam-4
VA+     0.0 ± 0.0    0.0 ± 0.0    0.25 ± 0.08                0.14 ± 0.05                0.05 ± 0.02                0.03 ± 0.01
ST+     0.0 ± 0.0    0.0 ± 0.0    0.73 ± 0.11                0.55 ± 0.15                0.29 ± 0.10                0.15 ± 0.07
NMST+   0.0 ± 0.0    0.0 ± 0.0    $\mathbf{0.21 \pm 0.10}$   $\mathbf{0.10 \pm 0.06}$   $\mathbf{0.03 \pm 0.02}$   $\mathbf{0.01 \pm 0.01}$

Table 4 shows that our NMST+GPT-2 has the lowest $r_{nt}(L)$ with $L=1{,}000$ for all decoding algorithms, compared to VA+GPT-2 and ST+GPT-2 proposed by Welleck et al. (2020). In other words, NMST+ effectively prevents non-terminating sequences within 1,000 time steps regardless of the decoding algorithm. Compared with greedy search in Table 2 ($r_{nt}(L)$ at $\epsilon=1.0\times 10^{-5}$), we observe that $r_{nt}(L)$ decreases for all setups. As we discussed in §2.3, non-terminating sequences originate from choosing $\mathcal{V}_{t}\subsetneq\mathcal{V}$ with $\langle eos\rangle\notin\mathcal{V}_{t}$ for all $t$, where $\mathcal{V}$ is the vocabulary and $\mathcal{V}_{t}$ is the proper subset of $\mathcal{V}$ considered by a decoding algorithm at step $t$. Decoding algorithms other than greedy search are more likely to include $\langle eos\rangle$ in $\mathcal{V}_{t}$ and hence have lower $r_{nt}(L)$, since their $|\mathcal{V}_{t}|$ is greater than or equal to the $|\mathcal{V}_{t}|=1$ of greedy search for all $t$. In the case of top-{2, 4} sampling, we obtain $r_{nt}(L)=0.0$ even for VA+GPT-2: even without NMST+, VA+ can avoid non-terminating sequences if we choose a suitable decoding algorithm. We emphasize, however, that NMST+GPT-2 with $\epsilon=1.0\times 10^{-5}$ has a competitive validation perplexity against VA+GPT-2 in Table 2 and is guaranteed to terminate regardless of the choice of decoding algorithm. We also empirically demonstrate the consistency of NMST+{RNN, LSTM} trained on WikiText-2 with respect to other decoding algorithms in §F.

6 Conclusion

Non-termination is a degenerate behavior we often observe when generating text from a well-trained language model. To prevent this, Welleck et al. (2020) proposed a self-terminating language model that encourages the termination probability of each sequence, which is the conditional probability of $\langle eos\rangle$ given a $t$-prefix and a context, to increase monotonically toward 1 as $t$ increases. In this paper, we theoretically demonstrate that a monotonically increasing termination probability is not a necessary condition for avoiding non-terminating sequences. We then propose a non-monotonic self-terminating language model whose termination probability for each sequence converges to 1, but not monotonically. Our non-monotonic self-terminating language models successfully address the issue of non-termination and achieve perplexities that are comparable to vanilla language models and better than the original self-terminating language models.

Reproducibility Statement

To ensure the reproducibility of our paper, we provide our code available at https://github.com/nyu-dl/non-monotonic-self-terminating-lm.

Acknowledgments

This work was supported by 42dot, Hyundai Motor Company (under the project Uncertainty in Neural Sequence Modeling), Samsung Advanced Institute of Technology (under the project Next Generation Deep Learning: From Pattern Recognition to AI), and NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

References

  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. In J. Mach. Learn. Res., 2000.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • Elman (1990) Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1082. URL https://aclanthology.org/P18-1082.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2020.
  • Koehn & Knowles (2017) Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017.
  • Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27, 2014.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26, 2013b.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Stahlberg & Byrne (2019) Felix Stahlberg and Bill Byrne. On nmt search errors and model errors: Cat got your tongue? arXiv preprint arXiv:1908.10090, 2019.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vinyals & Le (2015) Oriol Vinyals and Quoc Le. A neural conversational model. arXiv preprint arXiv:1506.05869, 2015.
  • Welleck et al. (2019) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
  • Welleck et al. (2020) Sean Welleck, Ilia Kulikov, Jaedeok Kim, Richard Yuanzhe Pang, and Kyunghyun Cho. Consistency of a recurrent language model with respect to incomplete decoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  5553–5568, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.448. URL https://aclanthology.org/2020.emnlp-main.448.

Appendix

Appendix A Definitions of Common Decoding Algorithms and their Characteristics

In this section, we present mathematical definitions of top-$k$ sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), greedy search, and beam search. We then discuss whether each of them is an incomplete probable decoding algorithm.

A.1 Top-k sampling

At each step $t$, top-$k$ sampling selects a subset of the $k$ most probable tokens in a vocabulary $\mathcal{V}$. Top-$k$ sampling generates decoded sequences from a language model $p_{\bm{\theta}}$ as follows:

Definition A.1 (Top-$k$ sampling (Fan et al., 2018)).

Top-$k$ sampling $\mathcal{S}_{\textrm{top-}k}$ generates a sequence from a language model $p_{\bm{\theta}}$ given a context ${\bm{x}}$ by recursively sampling $\hat{y}_{t}$ from

$q_{\mathcal{S}_{\textrm{top-}k}(p_{\bm{\theta}})}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})=\begin{cases}\dfrac{p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})}{\sum_{v^{\prime}\in\mathcal{V}_{t}}p_{\bm{\theta}}(y_{t}=v^{\prime}|\hat{{\bm{y}}}_{<t},{\bm{x}})},&\text{if }v\in\mathcal{V}_{t},\\ 0,&\text{otherwise,}\end{cases}$ (12)

where

$\mathcal{V}_{t}=\underset{v\in\mathcal{V}}{\arg\textrm{top-}k}~p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})$. (13)

Except for the trivial case $k=|\mathcal{V}|$, we have $\emptyset\subsetneq\mathcal{V}_{t}\subsetneq\mathcal{V}$ for all $t$. By equation 13, equation 6 holds. From equation 12, we see that top-$k$ sampling satisfies equation 5 and equation 7. Therefore, top-$k$ sampling is an incomplete probable decoding algorithm.

A.2 Nucleus sampling

At each step $t$, nucleus sampling selects the smallest subset of most probable tokens in a vocabulary $\mathcal{V}$ whose total probability exceeds a given threshold $\mu$. Nucleus sampling generates decoded sequences from a language model $p_{\bm{\theta}}$ as follows:

Definition A.2 (Nucleus sampling (Holtzman et al., 2020)).

Nucleus sampling $\mathcal{S}_{\textrm{nuc-}\mu}$ generates a sequence from a language model $p_{\bm{\theta}}$ given a context ${\bm{x}}$ by recursively sampling $\hat{y}_{t}$ from

$q_{\mathcal{S}_{\textrm{nuc-}\mu}(p_{\bm{\theta}})}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})=\begin{cases}\dfrac{p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})}{\sum_{v^{\prime}\in\mathcal{V}_{t}}p_{\bm{\theta}}(y_{t}=v^{\prime}|\hat{{\bm{y}}}_{<t},{\bm{x}})},&\text{if }v\in\mathcal{V}_{t},\\ 0,&\text{otherwise,}\end{cases}$ (14)

where $\mathcal{V}_{t}$ is the smallest subset of $\mathcal{V}$ such that

$\sum_{v\in\mathcal{V}_{t}}p_{\bm{\theta}}(y_{t}=v|\hat{{\bm{y}}}_{<t},{\bm{x}})\geq\mu$. (15)

If $\min_{v\in\mathcal{V}}p_{\bm{\theta}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})\leq 1-\mu$ for any context ${\bm{x}}$ and any $t$-prefix ${\bm{y}}_{<t}$, then we have $\emptyset\subsetneq\mathcal{V}_{t}\subsetneq\mathcal{V}$ for all $t$. Suppose that equation 6 does not hold for nucleus sampling; this contradicts the fact that $\mathcal{V}_{t}$ is the smallest subset of $\mathcal{V}$ satisfying equation 15. From equation 14, we see that nucleus sampling satisfies equation 5 and equation 7. Therefore, nucleus sampling is incomplete and probable.

A.3 Beam search

Beam search is a heuristic algorithm that operates on the level of prefixes. We use the definition of beam search in Welleck et al. (2020).

Definition A.3 (Beam search, Definition A.2 in Welleck et al. (2020)).

Beam search with a width (beam size) $k$, $\mathcal{S}_{\textrm{beam-}k}$, generates a sequence from a language model $p_{\bm{\theta}}$ by maintaining a set of $k$ prefixes, $\mathcal{P}_{t}=\{\bm{\rho}^{(1)}(t),\bm{\rho}^{(2)}(t),\cdots,\bm{\rho}^{(k)}(t)\}$, at each time step $t$, where $\bm{\rho}^{(i)}(0)$ is an empty prefix for all $i$. At each step $t\in\{1,2,\cdots\}$, beam search forms a set of $k\times k$ prefixes,

$\tilde{\mathcal{P}}_{t}=\bigcup_{\bm{\rho}\in\mathcal{P}_{t-1}}\{\bm{\rho}\circ v\,|\,v\in\mathcal{V}_{t}(\bm{\rho})\}$, (16)

where $\bm{\rho}\circ v$ denotes concatenation and

$\mathcal{V}_{t}(\bm{\rho})=\underset{v\in\mathcal{V}}{\arg\textrm{top-}k}~p_{\bm{\theta}}(y_{t}=v|\bm{\rho},{\bm{x}})$. (17)

After forming $\tilde{\mathcal{P}}_{t}$, beam search selects the set of the $k$ highest scoring prefixes in $\tilde{\mathcal{P}}_{t}$,

$\mathcal{P}_{t}=\underset{\bm{\rho}\in\tilde{\mathcal{P}}_{t}}{\arg\textrm{top-}k}~s(\bm{\rho})$, (18)

where $s(\bm{\rho})=\sum_{\tau=1}^{t}\log p_{\bm{\theta}}(y_{\tau}=\bm{\rho}_{\tau}|\bm{\rho}_{<\tau},{\bm{x}})$. If $\bm{\rho}\in\mathcal{P}_{t}$ ends with $\langle eos\rangle$, then it is not expanded further and is added to the final set $\mathcal{P}$. Beam search continues until $\mathcal{P}$ contains $k$ sequences ending with $\langle eos\rangle$. After that, it returns the highest scoring sequence

$\hat{{\bm{y}}}=\underset{\bm{\rho}\in\mathcal{P}}{\arg\max}~s(\bm{\rho})$. (19)

Unlike greedy search, top-$k$ sampling, and nucleus sampling, beam search recursively expands $k$ sequences with at most $k$ different prefixes. Therefore, we cannot formalize beam search at the token level using $q_{\mathcal{S}_{\textrm{beam-}k}}(y_{t}=v|{\bm{y}}_{<t},{\bm{x}})$. However, in equation 17, the number of possible tokens at step $t$ is at most $k\times k$. This means that $\mathcal{S}_{\textrm{beam-}k}$ may exclude $\langle eos\rangle$ at time $t$ if $k\leq\sqrt{|\mathcal{V}|-1}$. Using this, Welleck et al. (2020) proved that a vanilla language model $p_{\bm{\theta}}^{va}$ is inconsistent with respect to beam search, as shown in Theorem 1.
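The following is a minimal sketch of Definition A.3 for a generic scoring model; `log_prob_next` is a hypothetical callable returning $\log p_{\bm{\theta}}(y_{t}=v\,|\,\bm{\rho},{\bm{x}})$ for every token $v$, and we cap the number of steps for simplicity.

```python
import heapq

def beam_search(log_prob_next, vocab, eos, k=4, max_steps=100):
    """Beam search as in Definition A.3: keep the k best prefixes, expand each with its
    top-k next tokens (equations 16-17), and rescore by summed log-probability (equation 18)."""
    beams = [((), 0.0)]            # (prefix, score); starts from the empty prefix
    finished = []
    for _ in range(max_steps):
        candidates = []
        for prefix, score in beams:
            log_p = log_prob_next(prefix)        # dict: token -> log-probability at this step
            top_v = heapq.nlargest(k, vocab, key=lambda v: log_p[v])
            for v in top_v:                      # at most k x k candidate prefixes in total
                candidates.append((prefix + (v,), score + log_p[v]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[1])
        still_open = []
        for prefix, score in beams:              # prefixes ending with <eos> stop expanding
            (finished if prefix[-1] == eos else still_open).append((prefix, score))
        beams = still_open
        if len(finished) >= k or not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])[0]      # equation 19: highest scoring sequence
```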

Appendix B Proofs for §2.3

Remark 1.

Let $\mathcal{D}=\{({\bm{x}}^{(1)},{\bm{y}}^{(1)}),({\bm{x}}^{(2)},{\bm{y}}^{(2)})\}$ be a two-instance training dataset. Assume that there exists $t_{0}$ such that ${\bm{y}}_{<t_{0}}={\bm{y}}_{<t_{0}}^{(1)}={\bm{y}}_{<t_{0}}^{(2)}$. Suppose further that $t_{0}=|{\bm{y}}^{(1)}|<|{\bm{y}}^{(2)}|-1$ and ${\bm{x}}={\bm{x}}^{(1)}={\bm{x}}^{(2)}$. If ${\bm{\theta}}^{\star}$ is an optimal parameter configuration in equation 3 over $\mathcal{D}$, then $p_{{\bm{\theta}}^{\star}}(y^{(2)}_{t}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t},{\bm{x}})$ is non-monotonic with respect to $t$.

Proof.

Since ${\bm{\theta}}^{\star}$ is an optimal parameter configuration that perfectly maximizes equation 3 and $t_{0}<|{\bm{y}}^{(2)}|-1$, we have

p𝜽(yt(2)=eos|𝒚<t(2),𝒙(2))=0,p_{{\bm{\theta}}^{\star}}\big{(}y_{t}^{(2)}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t},{\bm{x}}^{(2)}\big{)}=0, (20)

for all t<t0t<t_{0}. Note that t0=|𝒚(1)|𝒚t0(1)=eost_{0}=|{\bm{y}}^{(1)}|\Rightarrow{\bm{y}}_{t_{0}}^{(1)}=\langle eos\rangle and t0<|𝒚(2)|1𝒚t0(2)eost_{0}<|{\bm{y}}^{(2)}|-1\Rightarrow{\bm{y}}_{t_{0}}^{(2)}\neq\langle eos\rangle. From 𝒙=𝒙(1)=𝒙(2){\bm{x}}={\bm{x}}^{(1)}={\bm{x}}^{(2)} and 𝒚<t0=𝒚<t0(1)=𝒚<t0(2){\bm{y}}_{<t_{0}}={\bm{y}}_{<t_{0}}^{(1)}={\bm{y}}_{<t_{0}}^{(2)}, the shared prefix is followed by eos\langle eos\rangle in 𝒚(1){\bm{y}}^{(1)} and by a token other than eos\langle eos\rangle in 𝒚(2){\bm{y}}^{(2)}, so an optimal model assigns equal probability to the two observed continuations, and we obtain

p𝜽(yt0(2)=eos|𝒚<t0(2),𝒙(2))=12.p_{{\bm{\theta}}^{\star}}\big{(}y_{t_{0}}^{(2)}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t_{0}},{\bm{x}}^{(2)}\big{)}=\frac{1}{2}. (21)

Moreover, t0<|𝒚(2)|1t_{0}<|{\bm{y}}^{(2)}|-1 implies that 𝒚t0+1(2)eos{\bm{y}}^{(2)}_{t_{0}+1}\neq\langle eos\rangle which is equivalent to

p𝜽(yt0+1(2)=eos|𝒚<t0+1(2),𝒙(2))=0.p_{{\bm{\theta}}^{\star}}\big{(}y_{t_{0}+1}^{(2)}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t_{0}+1},{\bm{x}}^{(2)}\big{)}=0. (22)

From equation 20, equation 21, and equation 22, we see that p𝜽(yt(2)=eos|𝒚<t(2),𝒙)p_{{\bm{\theta}}^{\star}}\big{(}y^{(2)}_{t}=\langle eos\rangle|{\bm{y}}^{(2)}_{<t},{\bm{x}}\big{)} is non-monotonic with respect to tt. ∎
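For concreteness, the following small computation (not from the paper) reproduces the counting argument behind Remark 1 on an illustrative two-instance dataset; the whitespace tokenization and the sentences themselves are placeholders.

```python
from collections import Counter, defaultdict

# Two training sequences sharing the prefix "I am a boy", mirroring Remark 1.
y1 = "I am a boy <eos>".split()
y2 = "I am a boy , and you are a girl . <eos>".split()

# Empirical conditional probabilities p(y_t = <eos> | y_<t) obtained by counting,
# which is what a model perfectly fit to this two-instance dataset must reproduce.
continuations = defaultdict(Counter)
for seq in (y1, y2):
    for t in range(len(seq)):
        continuations[tuple(seq[:t])][seq[t]] += 1

for t in range(len(y2)):
    counts = continuations[tuple(y2[:t])]
    print(t + 1, counts["<eos>"] / sum(counts.values()))
# Output: 0 before "boy", 1/2 right after the shared prefix "I am a boy",
# 0 while the longer sequence continues, and 1 at its end, i.e., non-monotonic in t.
```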

Appendix C Proofs for §3

Theorem 3.

A non-monotonic self-terminating (NMST) language model defined in Definition 6 is consistent with respect to any incomplete probable decoding algorithm and beam search.

Proof.

From equation 10, for any 𝜽k{\bm{\theta}}\in\mathbb{R}^{k}, we have

limtp𝜽nmst(yt=eos|𝒚<t,𝒙)=1,\lim_{t\rightarrow\infty}p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1,

since (1ϵ)t0(1-\epsilon)^{t}\rightarrow 0 as tt\rightarrow\infty for ϵ(0,1)\epsilon\in(0,1) and σ(𝒖eos𝒉t)(0,1)\sigma\left({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t}\right)\in(0,1) for any tt. Hence, there exists t1/2t_{1/2} such that

tt1/2p𝜽nmst(yt=eos|𝒚<t,𝒙)>12.t\geq t_{1/2}\Rightarrow p^{nmst}_{\bm{\theta}}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})>\frac{1}{2}. (23)

Let 𝒮\mathcal{S} be any incomplete probable decoding algorithm. From equation 6 and equation 7, eos𝒱t\langle eos\rangle\in\mathcal{V}_{t} and q𝒮(p𝜽nmst)(yteos|𝒚<t,𝒙)<12q_{\mathcal{S}(p_{\bm{\theta}}^{nmst})}(y_{t}\neq\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})<\frac{1}{2} hold for any tt1/2t\geq t_{1/2}. Therefore, we obtain

q𝒮(p𝜽nmst)(|𝒚|=|𝒙)\displaystyle q_{\mathcal{S}(p_{\bm{\theta}}^{nmst})}(|{\bm{y}}|=\infty|{\bm{x}}) =t=1q𝒮(p𝜽nmst)(yteos|𝒚<t,𝒙)\displaystyle=\prod_{t=1}^{\infty}q_{\mathcal{S}(p_{\bm{\theta}}^{nmst})}(y_{t}\neq\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})
t=t1/2q𝒮(p𝜽nmst)(yteos|𝒚<t,𝒙)\displaystyle\leq\prod_{t=t_{1/2}}^{\infty}q_{\mathcal{S}(p_{\bm{\theta}}^{nmst})}(y_{t}\neq\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})
<t=t1/2120.\displaystyle<\prod_{t=t_{1/2}}^{\infty}\frac{1}{2}\rightarrow 0. (24)

Taking expectation of equation 24 over 𝒙{\bm{x}}, we finally have q𝒮(p𝜽nmst)(|𝒚|=)=0q_{\mathcal{S}(p_{\bm{\theta}}^{nmst})}(|{\bm{y}}|=\infty)=0 for any 𝒮\mathcal{S}. In other words, p𝜽nmstp_{\bm{\theta}}^{nmst} is consistent with respect to any incomplete probable decoding algorithm.

In the case of beam search 𝒮beam-k\mathcal{S}_{\textrm{beam-}k} defined in §A.3, we may assume without loss of generality that there exists 𝝆𝒫t1/2\bm{\rho}\in\mathcal{P}_{t_{1/2}} such that 𝝆\bm{\rho} does not end with eos\langle eos\rangle. (If there is no such 𝝆\bm{\rho}, then all kk sequences in 𝒫t1/2\mathcal{P}_{t_{1/2}} end with eos\langle eos\rangle, so 𝒮beam-k\mathcal{S}_{\textrm{beam-}k} returns a finite sequence and p𝜽nmstp_{\bm{\theta}}^{nmst} is trivially consistent with respect to beam search.) Let 𝒫>t1/2(𝝆)\mathcal{P}_{>t_{1/2}}(\bm{\rho}) be the set of kk highest scoring sequences continued from 𝝆\bm{\rho} by 𝒮beam-k\mathcal{S}_{\textrm{beam-}k}. From equation 23, we have

p𝜽nmst(eos|𝝆,𝒙)>p𝜽nmst(v|𝝆,𝒙)p^{nmst}_{\bm{\theta}}(\langle eos\rangle|\bm{\rho},{\bm{x}})>p^{nmst}_{\bm{\theta}}(v|\bm{\rho},{\bm{x}})

for all v𝒱{eos}v\in\mathcal{V}\setminus\{\langle eos\rangle\}. Hence, 𝒱t1/2(𝝆)\mathcal{V}_{t_{1/2}}(\bm{\rho}) in equation 17 includes eos\langle eos\rangle. Let 𝒛=(z1,z2,,zl){\bm{z}}=(z_{1},z_{2},\cdots,z_{l}) be any subsequence with z1eosz_{1}\neq\langle eos\rangle. Then, we have

p𝜽nmst(𝝆𝒛|𝝆,𝒙)\displaystyle p_{\bm{\theta}}^{nmst}(\bm{\rho}\circ{\bm{z}}|\bm{\rho},{\bm{x}}) =i=1lp𝜽nmst(zi|𝝆𝒛<i,𝒙)\displaystyle=\prod_{i=1}^{l}p_{\bm{\theta}}^{nmst}(z_{i}|\bm{\rho}\circ{\bm{z}}_{<i},{\bm{x}})
p𝜽nmst(z1|𝝆,𝒙)\displaystyle\leq p_{\bm{\theta}}^{nmst}(z_{1}|\bm{\rho},{\bm{x}})
<p𝜽nmst(eos|𝝆,𝒙)=p𝜽nmst(𝝆eos|𝝆,𝒙),\displaystyle<p_{\bm{\theta}}^{nmst}(\langle eos\rangle|\bm{\rho},{\bm{x}})=p_{\bm{\theta}}^{nmst}(\bm{\rho}\circ\langle eos\rangle|\bm{\rho},{\bm{x}}), (25)

where \circ is concatenation. Therefore, 𝝆eos\bm{\rho}\circ\langle eos\rangle maximizes s(𝝆)=τ=1|𝝆|logp𝜽nmst(𝝆τ|𝝆<τ,𝒙)s(\bm{\rho}^{\prime})=\sum_{\tau=1}^{|\bm{\rho}^{\prime}|}\log p_{\bm{\theta}}^{nmst}(\bm{\rho}^{\prime}_{\tau}|\bm{\rho}^{\prime}_{<\tau},{\bm{x}}) over all sequences 𝝆\bm{\rho}^{\prime} starting with 𝝆\bm{\rho}. That is, 𝝆eos\bm{\rho}\circ\langle eos\rangle is the highest scoring sequence starting with 𝝆\bm{\rho}, and we have 𝝆eos𝒫>t1/2(𝝆)\bm{\rho}\circ\langle eos\rangle\in\mathcal{P}_{>t_{1/2}}(\bm{\rho}).

For each 𝝆𝒫>t1/2(𝝆){𝝆eos}\bm{\rho}^{\prime}\in\mathcal{P}_{>t_{1/2}}(\bm{\rho})\setminus\{\bm{\rho}\circ\langle eos\rangle\}, 𝝆\bm{\rho}^{\prime} starts with 𝝆v\bm{\rho}\circ v for some v𝒱{eos}v\in\mathcal{V}\setminus\{\langle eos\rangle\}. Applying the same argument to these continuations, at least one sequence ending with eos\langle eos\rangle is added to 𝒫>t1/2(𝝆)\mathcal{P}_{>t_{1/2}}(\bm{\rho}) at each subsequent step. This means that 𝒫>t1/2(𝝆)\mathcal{P}_{>t_{1/2}}(\bm{\rho}) contains kk sequences ending with eos\langle eos\rangle within t1/2+kt_{1/2}+k steps. Note that the final set 𝒫\mathcal{P} satisfies

𝒫𝝆𝒫t1/2𝒫>t1/2(𝝆).\mathcal{P}\subset\bigcup_{\bm{\rho}\in\mathcal{P}_{t_{1/2}}}\mathcal{P}_{>t_{1/2}}(\bm{\rho}). (26)

Equation 26 implies that every sequence in 𝒫\mathcal{P} has length at most t1/2+kt_{1/2}+k. We thus obtain

q𝒮beam-k(p𝜽nmst)(|𝒚|=|𝒙)q𝒮beam-k(p𝜽nmst)(|𝒚|>t1/2+k|𝒙)=0.q_{\mathcal{S}_{\textrm{beam-}k}(p_{\bm{\theta}}^{nmst})}(|{\bm{y}}|=\infty|{\bm{x}})\leq q_{\mathcal{S}_{\textrm{beam-}k}(p_{\bm{\theta}}^{nmst})}(|{\bm{y}}|>t_{1/2}+k|{\bm{x}})=0. (27)

Taking expectation of equation 27 over 𝒙{\bm{x}}, we see that q𝒮beam-k(p𝜽nmst)(|𝒚|=)=0q_{\mathcal{S}_{\textrm{beam-}k}(p_{\bm{\theta}}^{nmst})}(|{\bm{y}}|=\infty)=0. That is, p𝜽nmstp_{\bm{\theta}}^{nmst} is consistent with respect to beam search. ∎
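To visualize the lower bound that drives this proof, the following numerical sketch assumes that equation 10 can be written as p𝜽nmst(eos|𝒚<t,𝒙)=1(1ϵ)t(1σ(𝒖eos𝒉t))p^{nmst}_{\bm{\theta}}(\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}})=1-(1-\epsilon)^{t}\left(1-\sigma\left({\bm{u}}_{\langle eos\rangle}^{\top}{\bm{h}}_{t}\right)\right); this form is our reading of the argument above (the limit only needs (1ϵ)t0(1-\epsilon)^{t}\rightarrow 0 and σ()(0,1)\sigma(\cdot)\in(0,1)), and the logit values below are synthetic placeholders, not model outputs.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def nmst_eos_prob(logit_eos: float, t: int, eps: float) -> float:
    # Assumed form of equation 10: the <eos> probability never falls below the
    # curve 1 - (1 - eps)^t, which increases to 1, while sigma(u_eos^T h_t)
    # lets it move non-monotonically above that lower bound.
    return 1.0 - (1.0 - eps) ** t * (1.0 - sigmoid(logit_eos))

eps = 1e-3
synthetic_logits = [-4.0, -1.0, 0.5, -3.0, 2.0]  # placeholders for u_eos^T h_t
for t, logit in enumerate(synthetic_logits, start=1):
    print(t, round(nmst_eos_prob(logit, t, eps), 4), round(1.0 - (1.0 - eps) ** t, 4))
# The second column rises and falls with the logits (non-monotonic), while the
# third column, the lower bound 1 - (1 - eps)^t, increases monotonically toward 1.
```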

Appendix D Experimental Details

In this section, we describe our models and optimization processes used in §4.

RNN and LSTM on WikiText-2

We use word tokenization for WikiText-2. We train an RNN with tanh\tanh activations (Elman, 1990) and an LSTM (Hochreiter & Schmidhuber, 1997) on WikiText-2. Both the RNN and the LSTM have 2 layers, with 256 hidden units per layer for the RNN and 512 for the LSTM. The sizes of the input and output embedding layers are 256256 and 512512 for the RNN and LSTM, respectively. We use weight tying to share the weights between the input and output embedding layers for both models. We apply dropout (Srivastava et al., 2014) with drop probabilities of 0.30.3 and 0.50.5 to the RNN and LSTM, respectively. For each model, we perform 10 random runs with a batch size of 32 for up to 70 epochs. To maximize the log-likelihood presented in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 0.0010.001, β1=0.9\beta_{1}=0.9, β2=0.99\beta_{2}=0.99, weight decay of 0.010.01, and a learning rate decay scheme that halves the learning rate if the validation perplexity does not improve for a training epoch. To avoid overfitting, we additionally use early stopping, which terminates training if the validation perplexity does not improve upon the best score attained so far for 1010 consecutive epochs. In most cases, training ends within 50 epochs.
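The optimization recipe above can be sketched as follows. This is a simplified outline under the stated hyperparameters, where model, train_loader, and evaluate_perplexity are hypothetical placeholders for the RNN/LSTM, the WikiText-2 batches, and validation evaluation.

```python
import torch

# Hypothetical placeholders: `model`, `train_loader`, and `evaluate_perplexity`.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              betas=(0.9, 0.99), weight_decay=0.01)
# Halve the learning rate whenever the validation perplexity does not improve.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=0)
best_ppl, epochs_without_improvement = float("inf"), 0
for epoch in range(70):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)  # negative log-likelihood; minimizing it maximizes equation 3
        loss.backward()
        optimizer.step()
    val_ppl = evaluate_perplexity(model)
    scheduler.step(val_ppl)
    # Early stopping: halt if the best validation perplexity has not improved
    # for 10 consecutive epochs.
    if val_ppl < best_ppl:
        best_ppl, epochs_without_improvement = val_ppl, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:
            break
```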

GPT-2 on WikiText-103

We use BPE tokenization (Sennrich et al., 2015; https://github.com/huggingface/tokenizers) and the pretrained GPT-2 (Radford et al., 2019; https://github.com/huggingface/transformers) with 124 million parameters, provided by HuggingFace. GPT-2 can handle up to 1,024 tokens. We apply dropout (Srivastava et al., 2014) with a drop probability of 0.10.1 to GPT-2. We finetune GPT-2 for 300,000 steps while ensuring that all runs continue for at least 250,000 steps. To minimize the number of padding tokens in every batch for computational efficiency, we bucket the dataset into sequences of similar lengths, and each batch contains a maximum of 1,024 total tokens. To maximize the log-likelihood function in equation 3, we use AdamW (Loshchilov & Hutter, 2017) with an initial learning rate of 5.0×1055.0\times 10^{-5}, β1=0.9\beta_{1}=0.9, β2=0.99\beta_{2}=0.99, weight decay of 0.01, and linear learning rate decay over 500,000500,000 steps.
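The corresponding optimizer and schedule can be sketched with the HuggingFace libraries as follows; train_loader is a hypothetical placeholder for the length-bucketed batches of at most 1,024 tokens.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, get_linear_schedule_with_warmup

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # BPE tokenizer
model = GPT2LMHeadModel.from_pretrained(               # 124M-parameter pretrained GPT-2
    "gpt2", resid_pdrop=0.1, embd_pdrop=0.1, attn_pdrop=0.1)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5,
                              betas=(0.9, 0.99), weight_decay=0.01)
# Linear learning-rate decay over 500,000 steps.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=500_000)

for step, batch in enumerate(train_loader):  # `train_loader`: hypothetical bucketed batches
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()  # negative log-likelihood; minimizing it maximizes equation 3
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    if step >= 300_000:
        break
```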

Appendix E Additional Plots and Tables for §4

In this section, we present additional plots and tables for §4.

E.1 Additional plots for §4.1


Figure 4: Validation perplexities as a function of ϵ\epsilon in log-linear scale for all configurations of RNN (left) and LSTM (right), which are trained on WikiText-2. We present their average (curve) ±\pm st.dev. (shaded area) across 10 random experiments. For all ϵ\epsilon and architectures, NMST+ has better validation perplexities than ST+. As ϵ\epsilon increases, the validation perplexities of both NMST+RNN and NMST+LSTM degrade compared to those of VA+RNN and VA+LSTM. We thus need to search for an optimal ϵ\epsilon to avoid degradation of validation perplexity when applying NMST+ to our language model.

E.2 Additional plots for §4.2


Figure 5: We present the average (curve) ±\pm st.dev. (shaded area) of validation perplexities (left) and non-termination ratios rnt(L)r_{nt}(L) (right) with greedy search across 10 random runs for all considered setups of GPT-2 finetuned on WikiText-103 in log-linear scale. For rnt(L)r_{nt}(L), we use L=1,000L=1,000 because GPT-2 has a context window size of 1,0241,024. For all ϵ\epsilon, NMST+GPT-2 outperforms ST+GPT-2 in terms of the average validation perplexity. When ϵ\epsilon is small, rnt(L)r_{nt}(L) of ST+GPT-2 explodes, which means that ST+GPT-2 with small ϵ\epsilon cannot prevent non-terminating sequences. In contrast, our NMST+GPT-2 effectively reduces rnt(L)r_{nt}(L) compared to VA+GPT-2 for every ϵ\epsilon, and its validation perplexity degradation is smaller than that of ST+GPT-2 proposed by Welleck et al. (2020).

E.3 Additional tables for Table 3

Table 5: Given a context in a validation instance of WikiText-103, we present example continuations of {VA, ST, NMST}+GPT-2 when using greedy search. We select ϵ=1.0×105\epsilon=1.0\times 10^{-5} for {ST, NMST}+GPT-2 because it is optimal in terms of validation perplexities in Table 2. Unlike {VA, ST}+GPT-2, NMST+GPT-2 improves the quality of the continuation by avoiding repetitive tokens and by ending with eos\langle eos\rangle, even for contexts on which VA+GPT-2 fails to terminate within 1,0001,000 steps.
Context The single made its Irish Singles Chart debut at
VA+ number three on the issue date of November 18, 2005. It debuted at number one on the chart for the issue date of December 7, 2005. The single debuted at number two on the chart for the issue date of January 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date of March 4, 2006. The single debuted at number two on the Irish Singles Chart for the issue date of June 4, 2006. The single debuted at number one on the Irish Singles Chart for the issue date …
ST+ number three on the chart issue dated March 16, 2010. It debuted at number two on the UK Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the British Phonographic Industry ( BPI ) for shipments of over 15 @,@ 000 copies. The single debuted at number two on the Irish Singles Chart on the chart issue dated March 16, 2010, and peaked at number one on the chart issue dated March 16, 2010. The single was certified gold by the Irish Recorded Music Association ( IRMA ) for shipments of over 15 @,@ 000 copies. The single was …
NMST+ number twenty @-@ seven on the week ending March 26, 2010, and peaked at number three on the week ending March 27, 2010. It was certified gold by the Recording Industry Association of Ireland ( RIAA ) for shipment of 500 @,@ 000 copies of the single. The single was certified gold by the Recording Industry Association of Ireland ( RIANZ ) for shipment of 500 @,@ 000 copies of the single.eos\langle eos\rangle
Context Despite the expensive reconstructions, both vessels were considered
VA+ to be of sufficient quality to be considered for use in the Grand Fleet. The first, the British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built, British @-@ built …
ST+ to be of sufficient quality to be considered a part of the Royal Navy, and were assigned to the Channel Fleet. The two ships were transferred to the Mediterranean Fleet in early 1915, and were transferred to the North America and West Indies Station in early 1916. They were transferred to the Mediterranean Fleet in early 1917, and were transferred to the North America and West Indies Station in early 1918. They were transferred to the North America and West Indies Station in early 1918, and were transferred to the Mediterranean Fleet in early 1918. They were transferred to the North America and West Indies Station in late 1918, and …
NMST+ obsolescent by the time of the Second World War, and were eventually scrapped in favour of the much more powerful British battleship HMS Dreadnought.eos\langle eos\rangle
Context On the July 21, 2014 episode of Raw,
VA+ Triple H and Randy Orton were involved in a tag team match, in which Triple H and Orton were accompanied by Triple H and Randy Orton’s manager, Mr. Kennedy. Orton was accompanied by Orton’s manager, Mr. Kennedy, and Orton’s tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s tag team partner, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H and Orton were accompanied by Orton’s manager, Mr. Kennedy, and Orton’s manager, Mr. Kennedy. Triple H …
ST+ Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 27 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. On the July 30 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. On the August 2 episode of Raw, Triple H defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Randy Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated Orton to win the World Heavyweight Championship. Triple H then defeated …
NMST+ Triple H defeated Randy Orton to win the WWE Championship for the third time.eos\langle eos\rangle
Context On the west side of the Naktong,
VA+ the NK 6th Division was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 7th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK 6th Division, which was reinforced by the NK …
ST+ the 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been holding up the North Koreans in the hills east of the Naktong, was ordered to withdraw to the Naktong itself. The 2nd Battalion, 27th Infantry Regiment, which had been …
NMST+ the Naktong River rises to a height of 1 @,@ 000 metres ( 3 @,@ 300 ft ) above the surrounding terrain. The Naktong River is the longest river in North and South Korea, and the longest in the whole of North Korea. The Naktong is the longest river in North Korea, and the longest in the whole of North Korea. The river is the longest in the entire country, and the longest in the whole of North Korea.eos\langle eos\rangle
Context During spring training in a game against the Boston Red
VA+ Sox on April 4, 2009, Kinsler was placed on the 15 @-@ day disabled list with a strained right hamstring. He returned to the active roster on April 7, and was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 4, and was placed on the 15 @-@ day DL on May 5. On May 6, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 9. On May 10, he was placed on the 15 @-@ day DL with a strained right hamstring. He was placed on the 15 @-@ day DL on May 11. He was placed on the 15 @-@ day DL on May 13. He was placed on the 15 @-@ day DL on May 20. He was placed on the 15 @-@ day DL on May 24. He was placed on the 15 @-@ day DL on May 28. He was placed on the 15 @-@ day DL on June 2. He was placed on …
ST+ Sox on April 20, 2011, he was placed on the disabled list with a back injury. He returned to the disabled list on May 10, 2011, and was activated on May 17. He was activated on May 20 and made his first appearance on May 21. He was activated on June 2 and made his first appearance on June 4. He was activated on June 8 and made his first appearance on June 10. He was activated on June 15 and made his first appearance on June 17. He was activated on June 20 and made his first appearance on June 23. He was activated on June 29 and made his first appearance on July 1. He was activated on July 1 and made his first appearance on July 4. He was activated on July 6 and made his first appearance on July 10. He was activated on July 14 and made his first appearance on July 16. He was activated on July 20 and made his first appearance on July 23. He was …
NMST+ Sox on April 16, 2010, the Yankees signed Rivera to a one @-@ year, $ 2 @.@ 5 million contract. He made his major league debut on April 21, 2010, against the Boston Red Sox. He pitched a scoreless inning in the first inning of the first game of the 2010 World Series against the New York Mets. On May 1, 2010, Rivera was traded to the Pittsburgh Pirates in exchange for J. J. Hardy.eos\langle eos\rangle

E.4 Additional plots for Figure 3



Figure 6: Additional plots of pθ(yt=eos|𝒚<t,𝒙)p_{\theta}(y_{t}=\langle eos\rangle|{\bm{y}}_{<t},{\bm{x}}) as a function of tt for validation instances of WikiText-103, where p𝜽p_{\bm{\theta}}'s are {VA, ST, NMST}+GPT-2. For {ST, NMST}+GPT-2, we choose ϵ=1.0×105\epsilon=1.0\times 10^{-5} because it is optimal in terms of validation perplexities in Table 2. On the horizontal axis, we label each step with the tt-th ground truth token instead of tt itself. We report the mean (curve) ±\pm st.dev. (shaded area) across 10 random runs. Unlike ST+GPT-2, NMST+GPT-2 exhibits non-monotonic behavior at plausible termination points (e.g., after red-marked tokens such as periods).

Appendix F Consistency with respect to Other Decoding Algorithms for RNN and LSTM

We validate the consistency of our proposed non-monotonic self-terminating (NMST) language model when using decoding algorithms other than greedy search: top-kk sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020), and beam search. All experimental setups and notations are the same as in §4. We use top-{2, 4} sampling, nucleus-{0.2, 0.4} sampling, and beam search with a width of {2, 4} (beam-{2, 4}) to generate sequences from NMST+{RNN, LSTM} trained on WikiText-2 with ϵ=1.0×105\epsilon=1.0\times 10^{-5}. The choice of ϵ=1.0×105\epsilon=1.0\times 10^{-5} is based on the validation perplexities in Table 1. Since the validation perplexity does not change with the decoding algorithm, we focus on the average (±\pmst.dev.) non-termination ratios, rnt(L)r_{nt}(L)'s, across 10 random runs as a function of LL for each decoding algorithm in Figure 7. We also plot the evolution of rnt(L)r_{nt}(L)'s for VA+{RNN, LSTM} and ST+{RNN, LSTM} with ϵ=1.0×105\epsilon=1.0\times 10^{-5} as we vary LL.



Figure 7: Non-termination ratios, rnt(L)r_{nt}(L)'s, of sequences generated from all variants of RNN (top) and LSTM (bottom), trained on WikiText-2, when using top-kk sampling (left), nucleus sampling (middle), and beam search (right), as a function of LL in log-log scale. We use the first 1010 tokens of every WikiText-2 validation instance as a context. We present the average (curve) with the min-max range (shaded area) across 10 random experiments. VA+ (orange) displays inconsistency (limLrnt(L)0\lim_{L\rightarrow\infty}r_{nt}(L)\nrightarrow 0) for all combinations of model architectures and decoding algorithms, except VA+RNN using top-4 (orange dashed in top left) and VA+LSTM using top-{2, 4} (orange solid and dashed in bottom left, respectively). On the other hand, NMST+ (blue) and ST+ (green) show consistency (limLrnt(L)0\lim_{L\rightarrow\infty}r_{nt}(L)\rightarrow 0) across all configurations. With some decoding algorithms other than greedy search (e.g., top-{2, 4} sampling), VA+LSTM can avoid non-terminating sequences. However, as shown in Table 1, NMST+{RNN, LSTM} not only attain better validation perplexities than VA+{RNN, LSTM} and ST+{RNN, LSTM} but also are consistent with respect to all decoding algorithms.
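For reference, the non-termination ratio rnt(L)r_{nt}(L) reported above can be computed as in the sketch below, under the assumption that rnt(L)r_{nt}(L) is the proportion of decoded continuations that fail to emit eos\langle eos\rangle within LL steps; generate and contexts are hypothetical placeholders for a decoding routine and the 10-token validation contexts.

```python
def non_termination_ratio(decoded_lengths, L):
    """Fraction of continuations that have not produced <eos> within L steps.

    `decoded_lengths` holds, for each context, the number of tokens decoded before
    <eos>; when decoding is cut off without <eos>, the cutoff length is recorded.
    """
    return sum(1 for n in decoded_lengths if n >= L) / len(decoded_lengths)

# Hypothetical usage: decode one continuation per validation context.
# lengths = [len(generate(model, ctx, max_len=1000)) for ctx in contexts]
# print(non_termination_ratio(lengths, L=500))
```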

Appendix G Analysis of Predicted Sequence Length Distributions in §4.1

We investigate whether our proposed non-monotonic self-terminating (NMST+) language model matches the data length distribution better than the baselines: i) a vanilla (VA+) language model and ii) a self-terminating (ST+) language model. For this, we compare the length distributions of predicted sequences from {VA, ST, NMST}+LSTM trained on WikiText-2 with the data length distribution of ground truth sequences in the WikiText-2 validation dataset, 𝒟val\mathcal{D}_{val}, when using greedy search. All experimental setups and notations are the same as in §4.1.

Figure 8 shows the length distributions of {VA, ST, NMST}+LSTM and 𝒟val\mathcal{D}_{val}. For {ST, NMST}+LSTM, we use ϵ=1×105\epsilon=1\times 10^{-5} because this choice is optimal in terms of validation perplexities in Table 1. We observe that the length distribution of predicted sequences from NMST+LSTM is closer to the data length distribution of 𝒟val\mathcal{D}_{val} than those of VA+LSTM and ST+LSTM.



Figure 8: Length distributions of generated sequences from {VA, ST, NMST}+LSTM trained on WikiText-2 and the data length distribution of ground truth sequences in the WikiText-2 validation dataset, 𝒟val\mathcal{D}_{val}. For {ST, NMST}+LSTM, we select ϵ=1.0×105\epsilon=1.0\times 10^{-5} since it is optimal in terms of validation perplexities in Table 1. NMST+LSTM better models the length distribution of 𝒟val\mathcal{D}_{val} than both VA+LSTM and ST+LSTM.
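For reference, the length distributions compared in Figures 8 and 9 can be obtained with a simple normalized histogram over sequence lengths, as in the sketch below; generated and ground_truth are hypothetical placeholders for decoded continuations and validation sequences.

```python
from collections import Counter

def length_distribution(sequences):
    """Normalized histogram of sequence lengths (including <eos> if present)."""
    counts = Counter(len(seq) for seq in sequences)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}

# Hypothetical usage: compare predicted and ground truth length distributions.
# pred_dist = length_distribution(generated)     # continuations decoded from NMST+LSTM
# data_dist = length_distribution(ground_truth)  # validation sequences in D_val
```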

Furthermore, we can tune ϵ\epsilon to make the predicted length distribution of NMST+LSTM agree more closely with the ground truth length distribution of 𝒟val\mathcal{D}_{val}. In Figure 9, we compare NMST+LSTM's predicted length distribution with ϵ=5×104\epsilon=5\times 10^{-4} to that with ϵ=1×105\epsilon=1\times 10^{-5}. We see that ϵ=5×104\epsilon=5\times 10^{-4} better models the data length distribution than ϵ=1×105\epsilon=1\times 10^{-5}. However, in this case, the average validation perplexity of NMST+LSTM degrades from 101.5 (ϵ=1×105\epsilon=1\times 10^{-5}) to 105.6 (ϵ=5×104\epsilon=5\times 10^{-4}), as shown in Table 1.



Figure 9: Length distributions of predicted sequences from NMST+LSTM trained on WikiText-2 for various ϵ\epsilon's and the data length distribution of ground truth sequences in the WikiText-2 validation dataset, 𝒟val\mathcal{D}_{val}. The length distribution of NMST+LSTM using ϵ=5.0×104\epsilon=5.0\times 10^{-4} matches the data length distribution of 𝒟val\mathcal{D}_{val} better than that of NMST+LSTM using ϵ=1.0×105\epsilon=1.0\times 10^{-5}. We can thus choose ϵ\epsilon to make the predicted length distribution of NMST+LSTM agree with the ground truth length distribution, at the cost of some validation perplexity.