
Understanding and Mitigating Tokenization Bias in Language Models

Buu Phan    Marton Havasi    Matthew Muckley    Karen Ullrich
Abstract

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing it to the language model for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, we propose, for each encoding scheme above, a novel algorithm that obtains unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and their complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.


1 Introduction

Tokenization is a preprocessing procedure used in many state-of-the-art (SOTA) language models (LMs) such as GPTs (Brown et al., 2020), Llama (Touvron et al., 2023) and Gemini (Gemini, 2023). It divides the input text into smaller subword units while retaining linguistic importance, helping to address vocabulary limitations such as unknown words. Tokenization also shortens (compresses) the input context length (Sennrich et al., 2015; Kudo & Richardson, 2018). Since effective compression allows transformer-based LMs to handle longer context strings, many works (Zouhar et al., 2023; Gallé, 2019; Goldman et al., 2024) have focused on enhancing vocabulary design and encoding algorithms for better performance in downstream tasks. However, the relationship between compression and model performance remains unclear. Some research suggests the impact of compression is not always positive (Schmidt et al., 2024; Dagan et al., 2024; Goyal et al., 2023). Consequently, understanding tokenization’s effect on model performance continues to be an open question.

Tokenization has been criticized for introducing many shortcomings in LMs. These include sensitivity to spelling and morphological structure (Xue et al., 2022), language-based biases (Petrov et al., 2024), and subpar performance on specific tasks such as arithmetic (Singh & Strouse, 2024) or in new domains (Liu et al., 2023a). One approach to address these issues is to fine-tune the model with new vocabularies; however, this often complicates the training process and requires domain-specific expertise (Chen et al., 2023; Liu et al., 2023b). Furthermore, the performance gains do not provide a theoretical understanding of whether these limitations truly arise from the tokenization process or result from suboptimal model training. Another direction is to develop token-free LMs (Yu et al., 2024; Nawrot et al., 2022; Tay et al., 2021). While this approach has potential as it eliminates tokenization-related issues, it significantly increases the context length, resulting in performance that still lags behind SOTA tokenized LMs (Yu et al., 2024). (We refer to language models that process tokenized text as tokenized language models, or tokenized LMs.)

In this work we offer new theoretical insights on the behavior of tokenized LMs. We show that they are statistically equivalent to their token-free counterparts. Specifically, we examine the maximum prefix encoding (MPE) scheme employed in the WordPiece tokenization method (Devlin et al., 2018; Song et al., 2020) and find that this process not only results in biased estimates of next token probabilities, but also leads to overall skewed estimates of subsequent character probabilities. In general, this bias persists despite an increase in training data, even within the simple setting of a 1st-order Markov chain. Such bias occurs due to the implicit disparity between the domain of the conditioning context, namely, characters versus tokens. Nevertheless, we will show that it is possible to correct this bias without resorting to finetuning. Once adjusted, it becomes possible to simulate the token-free behavior learned implicitly by the tokenized LM and even (theoretically) mimic the behavior of another tokenized model employing a distinct vocabulary set, all without requiring finetuning. Our specific contributions are as follows:

  • We show the presence of a bias in the next-token distribution that arises as a result of the tokenization process.

  • We present two novel algorithms to correct this bias for MPE and byte-pair encoding (BPE), respectively. Due to space limits, the analysis and algorithm for BPE are presented in Appendix H.

  • We verify the correctness of our algorithms on learning the transition matrix of a $k$-th order Markov chain.

2 Problem Setup

[Figure 1 diagram: Before tokenization, a two-state Markov chain over $\{$``A'', ``B''$\}$ with transition probabilities parameterized by $\alpha$ and $\beta$ generates the input string ``AABABAAAABAAB''. The token vocabulary is: ID 1 = ``A'', ID 2 = ``B'', ID 3 = ``AA''. WordPiece encoding produces the output tokens ``AA|B|A|B|AA|AA|B|AA|B'', which follow a new Markov chain over the tokens with transition probabilities expressed in terms of $\alpha$ and $\beta$; in particular, the token ``A'' transitions to the token ``B'' with probability $1.0$.]
Figure 1: Next-character sampling bias introduced by the WordPiece encoding algorithm. In this example, given the context token ``A'', the model will always predict the next token as ``B'' with probability 1.0. We present a technique that, given a language model trained on the tokenized domain, eliminates this bias and recovers the unbiased sampling distribution.

We begin by establishing the tokenization and language model setup used in our paper. We then describe the next-character sampling bias problem caused by tokenization.

2.1 Notations and Setup.

String Notations. For any string $s$, we denote its substring from position $i$ to $j$ as $x_i^j := x_i x_{i+1} \dots x_j$, where each $x$ is a character of the alphabet $\mathcal{A}$. For a given string $x_1^N$, we define the prefix function that generates the set of all prefixes of $x_1^N$, represented as $\mathrm{prefix}(x_1^N) = \{x_1^1, x_1^2, x_1^3, \dots, x_1^N\}$. We also define a concatenation function $\mathrm{concat}(\cdot)$ that concatenates a given list of strings, e.g., given $s_1 = x_1^{N_1}$ and $s_2 = y_1^{N_2}$, we obtain $\mathrm{concat}(s_1, s_2) = \mathrm{concat}(x_1^{N_1}, y_1^{N_2}) = x_1 \dots x_{N_1} y_1 \dots y_{N_2}$. Finally, we denote the set of all strings that start with the prefix $x_1^n$ as $\mathcal{S}(x_1^n) = \{s \mid x_1^n \in \mathrm{prefix}(s)\}$.

Tokenization Setting. We assume a predefined vocabulary $\mathcal{V}$ constructed by any tokenization algorithm such as BPE, with the condition that $\mathcal{A} \subseteq \mathcal{V}$. We use $t$ to denote a token in $\mathcal{V}$, i.e., $t \in \mathcal{V}$. Importantly, we use the longest-prefix-matching strategy for tokenization (encoding), denoted $\mathrm{encode}(\cdot)$, similar to the approach used in the WordPiece algorithm (Devlin et al., 2018; Song et al., 2020). Given a sequence of tokens $t_1^k$, the function $\mathrm{decode}(\cdot)$ returns the concatenated string obtained by processing each token in the sequence. Finally, the set of all strings whose first $k$ tokens are $t_1^k$ is defined as $\mathcal{S}(t_1^k) = \{s \mid t_1^k = \mathrm{encode}(s)_1^k\}$.
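To make the encoding concrete, the following is a minimal Python sketch (ours, not from any particular library) of maximum prefix encoding and decoding under this setting; it reproduces the tokenization of Figure 1.

def mpe_encode(text, vocab):
    """Greedy longest-prefix-match (MPE) tokenization of `text` with `vocab`."""
    tokens, i = [], 0
    while i < len(text):
        # Longest vocabulary entry matching at position i; assumes every
        # single character of the alphabet is itself in the vocabulary.
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def mpe_decode(tokens):
    """Inverse operation: concatenate the tokens back into a string."""
    return "".join(tokens)

# Example from Figure 1 with vocabulary {"A", "B", "AA"}:
assert mpe_encode("AABABAAAABAAB", {"A", "B", "AA"}) == \
    ["AA", "B", "A", "B", "AA", "AA", "B", "AA", "B"]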

Tokenized LMs. We assume access to a tokenized autoregressive LM with parameters $\theta$ that is trained on tokens from $\mathcal{V}$ under maximum prefix matching. The target distribution on the character domain is denoted $P_{\mathrm{gt}}(x_{n+1}^N \mid x_1^n)$ and on the token domain $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i)$. For simplicity, unless otherwise stated, we implicitly assume each probability term involves $\theta$. Using the model, we assume that one can compute $P(t_{i+1} \mid t_1^i)$ for any integer $i > 0$. In this work, we consider LMs trained under the standard setup, where each string $s$ in the dataset is first tokenized with the encoding function $\mathrm{encode}(\cdot)$ and vocabulary $\mathcal{V}$, and the parameters $\theta$ are optimized to maximize the predictive likelihood of the next token in the tokenized dataset.

2.2 Next-Character Sampling Bias

We first define the (next-character) sampling bias problem, which describes the discrepancy between character-level and token-level predictions of tokenized LMs.

Definition 2.1.

(Next-Character Sampling Bias) Let the input prompt string $x_1^n$ have $t_1^i = \mathrm{encode}(x_1^n)$ as its corresponding encoding. Next-character sampling bias occurs for this prompt when $P_{\mathrm{gt}}(x_{n+1} \mid x_1^n) \neq P_{\mathrm{gt}}(x_{n+1} \mid t_1^i)$, where $P_{\mathrm{gt}}(x_{n+1} \mid t_1^i) = \sum_{t \in \mathcal{E}} P_{\mathrm{gt}}(t_{i+1} = t \mid t_1^i)$ and $\mathcal{E} = \{t \in \mathcal{V} \mid \mathrm{decode}(t) \in \mathcal{S}(x_{n+1})\}$.

In other words, the probability of the next character being “c” may be different from the sum of the probabilities of all tokens that start with “c”. Note that this character-level probability offers a broader perspective compared to the probability of the subsequent token being exactly “c”.

Example. Consider a first-order Markov chain with two states $\{$``A'', ``B''$\}$ as shown in Figure 1 (left). Each string is tokenized with $\mathcal{V} = \{$``AA'', ``A'', ``B''$\}$, which leads to a new Markov chain whose states and transition matrix are shown in Figure 1 (right). Details on computing the transition matrix of the new Markov chain are given in Appendix F. We first observe that for the prompts $s_1 = $ ``AA'' and $s_2 = $ ``B'', there is no bias after marginalization (for example, $P_{\mathrm{gt}}(t_{i+1} = \text{``AA''} \mid t_i = \text{``AA''}) + P_{\mathrm{gt}}(t_{i+1} = \text{``A''} \mid t_i = \text{``AA''}) = 1{-}\alpha = P_{\mathrm{gt}}(x_{n+1} = \text{``A''} \mid x_n = \text{``A''})$). However, for the prompt $s_3 = $ ``A'', sampling bias occurs since $P_{\mathrm{gt}}(x_2 = \text{``B''} \mid t_1 = \text{``A''}) = 1.0$, which is not equal to $P_{\mathrm{gt}}(x_2 = \text{``B''} \mid x_1 = \text{``A''}) = \alpha$, i.e., the optimally trained LM will always output ``B''. In fact, for any context string that ends with the token ``A'', e.g., ``AA|A'' and ``B|A'' (tokens are separated by ``|''), such an LM will always output ``B''.
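As a sanity check, the following simulation sketch (ours; the values of $\alpha$ and $\beta$ are arbitrary) samples from the character-level chain, tokenizes with MPE, and confirms that a token ``A'' is always followed by the token ``B'', even though the character-level probability of ``B'' after ``A'' is only $\alpha$.

import random

def mpe_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def sample_chain(n, alpha=0.3, beta=0.6, seed=0):
    # Two-state Markov chain of Figure 1: P(B|A) = alpha, P(A|B) = beta.
    rng = random.Random(seed)
    chars, state = ["A"], "A"
    for _ in range(n - 1):
        if state == "A":
            state = "B" if rng.random() < alpha else "A"
        else:
            state = "A" if rng.random() < beta else "B"
        chars.append(state)
    return "".join(chars)

tokens = mpe_encode(sample_chain(100_000), {"A", "B", "AA"})
pairs = [(t1, t2) for t1, t2 in zip(tokens, tokens[1:]) if t1 == "A"]
# Every token "A" (except possibly the final token) is followed by "B".
print(sum(t2 == "B" for _, t2 in pairs) / len(pairs))  # -> 1.0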

Since this applies to any optimally trained LM, increasing the training set size does not mitigate the problem. The reason for this sampling bias is that, during tokenization with longest prefix matching, the token ``A'' must be followed by the token ``B''; otherwise, MPE would merge it with the following character to create the longer token ``AA''. We generalize this phenomenon with the definition of invalid encodings.

Definition 2.2.

(Invalid Encodings) A list of tokens (an encoding) $t_1^i$ is invalid if $\mathrm{encode}(\mathrm{decode}(t_1^i)) \neq t_1^i$. Otherwise, it is a valid encoding.

For example, let $\mathcal{V} = \{$``c'', ``a'', ``t'', ``at'', ``cat''$\}$; then $[$``c'', ``at'', ``t''$]$ and $[$``c'', ``a'', ``t'', ``t''$]$ are invalid encodings of ``catt'', whose only valid encoding under MPE is $[$``cat'', ``t''$]$. We now show in Proposition 2.3 that the existence of invalid encodings introduces sampling bias, generalizing the phenomenon observed in the Markov chain example to any autoregressive distribution.
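Before stating the proposition, here is a small sketch (ours) of the validity check implied by Definition 2.2, using the ``catt'' example above.

def mpe_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def is_valid(tokens, vocab):
    # An encoding is valid iff re-encoding its decoded string recovers it.
    return mpe_encode("".join(tokens), vocab) == list(tokens)

vocab = {"c", "a", "t", "at", "cat"}
print(is_valid(["c", "at", "t"], vocab))       # False
print(is_valid(["c", "a", "t", "t"], vocab))   # False
print(is_valid(["cat", "t"], vocab))           # True: encode("catt") = ["cat", "t"]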

Proposition 2.3.

(Token-Induced Zero Probability) Let $t_1^i$ be a sequence of input tokens. For any invalid encoding $t_1^i$, we have $P_{\mathrm{gt}}(t_1^i) = 0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i)$ is undefined. In the case $t_1^i$ is valid, then $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i) = 0.0$ if $t_1^{i+1}$ is invalid. Furthermore, let $x_1^n = \mathrm{decode}(t_1^i)$; then for any string $x_{n+1}^N$ such that $\mathrm{encode}(\mathrm{concat}(\mathrm{decode}(t_1^i), x_{n+1}^N))_1^i \neq t_1^i$, we have $P_{\mathrm{gt}}(x_{n+1}^N \mid t_1^i) = 0.0$.

Proof.

See Appendix C. ∎

Remark 1.

Proposition 2.3 implies that LMs may not function as expected when presented with invalid encodings, because these models are never exposed to such inputs during training. This directly implies that the practice of evaluating LMs under different (non-canonical) encodings (Cao & Rimell, 2021; Chirkova et al., 2023) is suboptimal.

3 Alleviating Sampling Bias

We propose a method to remove the described bias and recover the original token-free autoregressive model, i.e., to express the implicitly learned $P(x_{n+1}^N \mid x_1^n)$ using the tokenized LM, which outputs the conditional probability $P(t_{i+1} \mid t_1^i)$. For $N = n+1$, this captures the behavior of a token-free model, i.e., sampling the next character instead of a whole token. We assume our LM follows Proposition 2.3 regarding zero-probability events and undefined conditional probabilities for invalid encodings. Appendix G justifies this assumption and provides its practical implementation.

Our method consists of two stages. In the first stage, the idea is to identify the condition under which $P(x_{n+1}^N \mid t_1^i) = P(x_{n+1}^N \mid x_1^n)$, where $t_1^i = \mathrm{encode}(x_1^n)$. Once identified, we can refactor the conditional probability to match the conditioning events. In the second stage, we compute $P(x_{n+1}^N \mid t_1^i)$ from the LM output probabilities $P(t_{i+1} \mid t_1^i)$ through the novel Maximum Prefix Correction (MPC) algorithm.

3.1 Refactoring

Our method removes the bias by connecting the character and token domains through a special subset of tokens $\mathcal{V}^* \subset \mathcal{V}$, whose elements $t^* \in \mathcal{V}^*$ are not a substring of any other token in $\mathcal{V}$ but themselves. For example, given $\mathcal{V} = \{$``AAA'', ``AA'', ``CB'', ``A'', ``B'', ``C''$\}$, then $\mathcal{V}^* = \{$``AAA'', ``CB''$\}$. In the Markov chain example in Section 2, this corresponds to the tokens ``AA'' and ``B''. We also assume that any string $x_1^N$ has its first token $t_1 \in \mathcal{V}^*$ (many current language models begin with a start token $<\mathrm{start}>$ that is in $\mathcal{V}^*$, e.g., in SentencePiece (Kudo & Richardson, 2018)). Consider an input string $x_1^n$ and its corresponding encoding $t_1^i = \mathrm{encode}(x_1^n)$; Proposition 3.1 gives a sufficient condition for $\mathcal{S}(t_1^i) = \mathcal{S}(x_1^n)$.
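A minimal sketch (ours) of how $\mathcal{V}^*$ can be computed directly from this definition:

def special_subset(vocab):
    # V*: tokens that are not a substring of any other token in the vocabulary.
    return {t for t in vocab
            if not any(t in other for other in vocab if other != t)}

print(special_subset({"AAA", "AA", "CB", "A", "B", "C"}))  # {'AAA', 'CB'}
print(special_subset({"AA", "A", "B"}))                    # {'AA', 'B'}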

Proposition 3.1.

Let $s^* = x_1^n$, where $t_1^i = \mathrm{encode}(s^*) = \mathrm{encode}(x_1^n)$. Then we have $\mathcal{S}(t_1^i) \subset \mathcal{S}(x_1^n)$, i.e., for any string $s$ where $t_1^i = \mathrm{encode}(s)_1^i$, we have $P(x_1^n \mid t_1^i) = 1.0$. In the case $t_i \in \mathcal{V}^*$, we also have $\mathcal{S}(t_1^i) = \mathcal{S}(x_1^n)$, i.e., any string $s$ with $x_1^n \in \mathrm{prefix}(s)$ must have $t_1^i$ as its first $i$ tokens, and $P(t_1^i \mid x_1^n) = 1.0$.

Proof.

See Appendix D. ∎

The intuition for Proposition 3.1 is that the string following $t_i \in \mathcal{V}^*$ cannot change the tokenization of $x_1^n$. We now establish one of our main results in Corollary 3.2.

Corollary 3.2.

Following Proposition 3.1, suppose $t_i \in \mathcal{V}^*$. Then we have $P(x_{n+1}^N \mid x_1^n) = P(x_{n+1}^N \mid t_1^i)$. Similarly, we also have $P(t_{i+1}^j \mid x_1^n) = P(t_{i+1}^j \mid t_1^i)$.

Proof.

See Appendix D. ∎

We note that Proposition 3.1 and Corollary 3.2 always hold, regardless of the value of $\theta$. In general, when the last token of $\mathrm{encode}(x_1^n)$ is not in $\mathcal{V}^*$, we can refactor $P(x_{n+1}^N \mid x_1^n)$ as follows:

$$P(x_{n+1}^N \mid x_1^n) = \frac{P(x_{n_k+1}^N \mid t_1^k)}{P(x_{n_k+1}^n \mid t_1^k)}, \qquad (1)$$

where $k$ is the index of the last token in $\mathrm{encode}(x_1^n)$ such that $t_k \in \mathcal{V}^*$, and $x_1^{n_k} = \mathrm{decode}(t_1^k)$ with $n_k \leq n$. Proof details for this step can be found in Appendix E. We then use the MPC algorithm to compute each term on the RHS individually.
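A small sketch (ours) of locating the split point used in Equation (1), i.e., the index $k$ of the last token of $\mathrm{encode}(x_1^n)$ that lies in $\mathcal{V}^*$ and the corresponding character position $n_k$; the two probabilities in the ratio are then computed with the MPC algorithm of the next subsection.

def find_split(tokens, vocab_star):
    """Return (k, n_k): the 1-based index of the last token in V*, and the
    number of characters covered by t_1^k. Assumes t_1 is in V*."""
    k = max(i + 1 for i, t in enumerate(tokens) if t in vocab_star)
    n_k = sum(len(t) for t in tokens[:k])
    return k, n_k

# With V = {"AA", "A", "B"} and V* = {"AA", "B"}: encode("AABA") = ["AA", "B", "A"].
print(find_split(["AA", "B", "A"], {"AA", "B"}))  # (2, 3)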

3.2 Maximum Prefix Correction Algorithm

Algorithm 1 Maximum Prefix Correction (MPC) Algorithm. This algorithm recursively computes $P(x_{n_k+1}^N \mid t_1^k)$.
1: procedure COMPUTE($x_{n_k+1}^N$, $t_1^k$)
2:    // Branching Step:
3:    $\mathcal{B} = \{t \in \mathcal{V} \mid x_{n_k+1}^N \in \mathrm{prefix}(\mathrm{decode}(t))\}$
4:    $\mathrm{b_{val}} = \sum_{t \in \mathcal{B}} P(t_{k+1} = t \mid t_1^k)$
5:    // Base Case:
6:    if $\mathrm{encode}(x_{n_k+1}^N) \in \mathcal{V}$ then
7:        return $\mathrm{b_{val}}$
8:    end if
9:    // Extract the Next Token:
10:    $t_{k+1} = \mathrm{encode}(x_{n_k+1}^N)_1$
11:    // Passing Step:
12:    $\mathrm{p_{val}} = P(t_{k+1} \mid t_1^k)$
13:    $\mathrm{p_{val}} = \mathrm{p_{val}} \times \mathrm{COMPUTE}(x_{n_{k+1}+1}^N, t_1^{k+1})$
14:    return $\mathrm{b_{val}} + \mathrm{p_{val}}$
15: end procedure

We present the MPC algorithm in Algorithm 1, which allows us to compute the probabilities $P(x_{n_k+1}^N \mid t_1^k)$ and $P(x_{n_k+1}^n \mid t_1^k)$ in Equation (1). Note that this algorithm does not require $t_k \in \mathcal{V}^*$. Details on the algorithmic correctness are given in Appendix E.

The idea is to marginalize $P(x_{n_k+1}^N \mid t_1^k)$ over two complementary events: either the next token $t_{k+1}$ has $x_{n_k+1}^N$ as a prefix ($\mathrm{b_{val}}$ in the Branch step), or the next token $t_{k+1}$ is contained within $x_{n_k+1}^N$ ($\mathrm{p_{val}}$ in the Pass step). Formally, MPC computes the following probabilities:

$$\mathrm{b_{val}} = P(x_{n_k+1}^N,\; t_{k+1} \in \mathcal{B}(x_{n_k+1}^N) \mid t_1^k), \qquad (2)$$
$$\mathrm{p_{val}} = P(x_{n_k+1}^N,\; t_{k+1} \notin \mathcal{B}(x_{n_k+1}^N) \mid t_1^k), \qquad (3)$$

where $\mathcal{B}(x_{n_k+1}^N) = \{t \in \mathcal{V} \mid x_{n_k+1}^N \in \mathrm{prefix}(\mathrm{decode}(t))\}$, and we immediately see that $P(x_{n_k+1}^N \mid t_1^k) = \mathrm{b_{val}} + \mathrm{p_{val}}$.

We provide an intuitive explanation of the algorithm following the example in Figure 2. Here, we would like to compute the probability $P(x_{n_k+1}^{n_k+3} = \text{``bee''} \mid t_1^k)$. The first possibility is that ``bee'' is a prefix of the next token, so we search for all such tokens (line 3 of the algorithm) and sum up their probabilities (line 4); in this example, $\mathrm{b_{val}} = P(t_{k+1} = \text{``beer''} \mid t_1^k)$. Figure 2 visualizes this step as branching out the tree by finding all tokens that complete the string. Since ``beer'' is not the only string that contains ``bee'' (e.g., ``beep'', ``been'', etc.), we must also compute the probability of these other scenarios, each of which has $t_{k+1} = \text{``b''}$ (the first token of ``bee'', lines 10 and 12) due to maximum prefix encoding. We then compute the probability that the subsequent string is ``ee'' (line 13), given the previous tokens $t_1^k$ and $t_{k+1} = \text{``b''}$; this is exactly the output of the MPC algorithm applied to $x_{n_k+2}^{n_k+3} = \text{``ee''}$ and $t_1^{k+1}$. Formally, the Passing step gives $\mathrm{p_{val}} = P(t_{k+1} = \text{``b''} \mid t_1^k)\, P(x_{n_k+2}^{n_k+3} = \text{``ee''} \mid t_1^k, t_{k+1} = \text{``b''})$. We continue this procedure until we reach the base case, where the remaining string must be a prefix of the next token (usually when only a single character is left). Finally, summing the Branch and Pass contributions yields the desired conditional probability $\mathrm{b_{val}} + \mathrm{p_{val}} = P(x_{n_k+1}^{n_k+3} = \text{``bee''} \mid t_1^k)$.
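Below is a minimal Python sketch (ours) of Algorithm 1. The callable next_token_prob(prev_tokens, token) stands in for the tokenized LM's conditional probability $P(t_{k+1} = t \mid t_1^k)$; in practice it would wrap a softmax over the model's logits (with the treatment of invalid encodings discussed in Appendix G).

def mpe_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def mpc(query, prev_tokens, vocab, next_token_prob):
    """Recursively compute P(the text continues with `query` | prev_tokens)."""
    # Branching step: the next token has the whole remaining query as a prefix.
    b_val = sum(next_token_prob(prev_tokens, t)
                for t in vocab if t.startswith(query))

    # Base case: the remaining query is itself a single token
    # (equivalently, encode(query) consists of a single vocabulary element).
    if query in vocab:
        return b_val

    # Passing step: otherwise MPE forces the next token to be the first token
    # of encode(query); recurse on the leftover characters.
    t_next = mpe_encode(query, vocab)[0]
    p_val = next_token_prob(prev_tokens, t_next)
    p_val *= mpc(query[len(t_next):], prev_tokens + [t_next],
                 vocab, next_token_prob)
    return b_val + p_val

For instance, in the Markov chain example of Figure 1, mpc("A", ["AA"], {"A", "B", "AA"}, next_token_prob) hits the base case immediately and returns $P(t = \text{``A''} \mid \text{``AA''}) + P(t = \text{``AA''} \mid \text{``AA''})$, the unbiased probability that the next character is ``A'' given the context token ``AA''.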

4 Experiments


Figure 2: MPC Visualization. At each recursive call, the Branch step finds tokens that start with the query string, while the Pass step extracts the next token and passes the leftover string to the next recursive call until the base case is reached.

We validate our method on a 3rd-order Markov chain experiment with $\mathcal{A} = \{$``A'', ``B''$\}$, where we randomly construct the transition matrix and use the vocabulary $\mathcal{V} = \{$``A'', ``B'', ``AA'', ``BAAB'', ``BBAA'', ``BBBA'', ``BA'', ``BBA''$\}$. We train an LM using the GPT-2 architecture with 6 hidden layers. Since the model is agnostic to the Markov chain order, we average the probabilities from 100 runs with different context lengths while fixing the last 3 characters. We compare our method against the baseline estimator $P(x_{n+1} \mid t_1^i)$, which is equivalent to a single Branch step of the MPC algorithm. Figure 3 shows the results: the baseline method exhibits significant sampling bias due to tokenization. Following Proposition 2.3, one can explain the zero-probability events output by the baseline estimator. Our method, in contrast, accurately estimates the ground-truth probabilities used to generate the data, showing that it is possible to recover the implicitly learned character-level information from tokenized LMs.
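For reference, a sketch (ours) of the data-generation step for this experiment; the transition table values, string length, and corpus size here are illustrative, and the GPT-2 training loop is standard and omitted.

import itertools
import random

def random_transition_table(seed=0):
    # P(next char = "A" | last three characters), drawn at random per history.
    rng = random.Random(seed)
    histories = ["".join(h) for h in itertools.product("AB", repeat=3)]
    return {h: rng.random() for h in histories}, rng

def sample_string(table, rng, length=256):
    chars = list(rng.choice(sorted(table)))  # random length-3 initial history
    while len(chars) < length:
        p_a = table["".join(chars[-3:])]
        chars.append("A" if rng.random() < p_a else "B")
    return "".join(chars)

table, rng = random_transition_table()
corpus = [sample_string(table, rng) for _ in range(10_000)]
# Each string is then tokenized with MPE over
# V = {"A", "B", "AA", "BAAB", "BBAA", "BBBA", "BA", "BBA"} before training.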


Figure 3: Our method accurately estimates the transition probabilities of a 3rd-order Markov chain, while the baseline method fails to do so.

5 Conclusion

This work identifies the next-character sampling gap between a tokenized model and a token-free one, which persists even for optimally trained models. We present a probabilistic approach that effectively eliminates this bias without requiring additional training. This closes the sampling gap between tokenized and token-free models, suggesting that language models implicitly absorb character-level information despite being trained solely on tokenized text. This result implies that it is theoretically possible to simulate the behavior of another language model trained with a different vocabulary without any fine-tuning, since it is possible to transfer from token-free models to tokenized counterparts.

References

  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cao & Rimell (2021) Cao, K. and Rimell, L. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  2104–2114, 2021.
  • Chen et al. (2023) Chen, Y., Marchisio, K., Raileanu, R., Adelani, D., Saito Stenetorp, P. L. E., Riedel, S., and Artetxe, M. Improving language plasticity via pretraining with active forgetting. Advances in Neural Information Processing Systems, 36:31543–31557, 2023.
  • Chirkova et al. (2023) Chirkova, N., Kruszewski, G., Rozen, J., and Dymetman, M. Should you marginalize over possible tokenizations? In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
  • Cleary & Witten (1984) Cleary, J. and Witten, I. Data compression using adaptive coding and partial string matching. IEEE transactions on Communications, 32(4):396–402, 1984.
  • Cognetta et al. (2024) Cognetta, M., Zouhar, V., Moon, S., and Okazaki, N. Two counterexamples to Tokenization and the Noiseless Channel. arXiv preprint arXiv:2402.14614, 2024.
  • Dagan et al. (2024) Dagan, G., Synnaeve, G., and Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv preprint arXiv:2402.01035, 2024.
  • Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Gallé (2019) Gallé, M. Investigating the effectiveness of bpe: The power of shorter sequences. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp.  1375–1381, 2019.
  • Gemini (2023) Gemini, T. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Goldman et al. (2024) Goldman, O., Caciularu, A., Eyal, M., Cao, K., Szpektor, I., and Tsarfaty, R. Unpacking tokenization: Evaluating text compression and its correlation with model performance. arXiv preprint arXiv:2403.06265, 2024.
  • Goyal et al. (2023) Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2023.
  • guidance ai (2023) guidance ai. Guidance ai, 2023. URL https://github.com/guidance-ai/guidance. GitHub repository.
  • Gutierrez-Vasques et al. (2023) Gutierrez-Vasques, X., Bentz, C., and Samardžić, T. Languages through the looking glass of bpe compression. Computational Linguistics, 49(4):943–1001, 2023.
  • Kudo & Richardson (2018) Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
  • Liu et al. (2023a) Liu, S., Deng, N., Sabour, S., Jia, Y., Huang, M., and Mihalcea, R. Task-adaptive tokenization: Enhancing long-form text generation efficacy in mental health and beyond. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023a.
  • Liu et al. (2023b) Liu, Y., Lin, P., Wang, M., and Schütze, H. Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. arXiv preprint arXiv:2311.08849, 2023b.
  • Makkuva et al. (2024) Makkuva, A. V., Bondaschi, M., Girish, A., Nagle, A., Jaggi, M., Kim, H., and Gastpar, M. Attention with markov: A framework for principled analysis of transformers via markov chains. arXiv preprint arXiv:2402.04161, 2024.
  • Minixhofer et al. (2024) Minixhofer, B., Ponti, E. M., and Vulić, I. Zero-shot tokenizer transfer. arXiv preprint arXiv:2405.07883, 2024.
  • Nawrot et al. (2022) Nawrot, P., Chorowski, J., Łańcucki, A., and Ponti, E. M. Efficient transformers with dynamic token pooling. arXiv preprint arXiv:2211.09761, 2022.
  • Petrov et al. (2024) Petrov, A., La Malfa, E., Torr, P., and Bibi, A. Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36, 2024.
  • Provilkov et al. (2019) Provilkov, I., Emelianenko, D., and Voita, E. Bpe-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019.
  • Rajaraman et al. (2024) Rajaraman, N., Jiao, J., and Ramchandran, K. Toward a theory of tokenization in llms. arXiv preprint arXiv:2404.08335, 2024.
  • Schmidt et al. (2024) Schmidt, C. W., Reddy, V., Zhang, H., Alameddine, A., Uzan, O., Pinter, Y., and Tanner, C. Tokenization is more than compression. arXiv preprint arXiv:2402.18376, 2024.
  • Sennrich et al. (2015) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  • Singh & Strouse (2024) Singh, A. K. and Strouse, D. Tokenization counts: the impact of tokenization on arithmetic in frontier llms. arXiv preprint arXiv:2402.14903, 2024.
  • Song et al. (2020) Song, X., Salcianu, A., Song, Y., Dopson, D., and Zhou, D. Fast wordpiece tokenization. arXiv preprint arXiv:2012.15524, 2020.
  • Tay et al. (2021) Tay, Y., Tran, V. Q., Ruder, S., Gupta, J., Chung, H. W., Bahri, D., Qin, Z., Baumgartner, S., Yu, C., and Metzler, D. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations, 2021.
  • Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Willems et al. (1995) Willems, F. M., Shtarkov, Y. M., and Tjalkens, T. J. The context-tree weighting method: Basic properties. IEEE transactions on information theory, 41(3):653–664, 1995.
  • Xue et al. (2022) Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. Byt5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022.
  • Yu et al. (2024) Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M. Megabyte: Predicting million-byte sequences with multiscale transformers. Advances in Neural Information Processing Systems, 36, 2024.
  • Zouhar et al. (2023) Zouhar, V., Meister, C., Gastaldi, J., Du, L., Sachan, M., and Cotterell, R. Tokenization and the noiseless channel. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5184–5207, 2023.

Appendix A Related Work

Theory of Tokenization. Existing works on tokenization generally support the idea that compressing tokens enhances model performance (Gallé, 2019; Gutierrez-Vasques et al., 2023; Zouhar et al., 2023). However, these empirical findings are in conflict with other, later studies (Cognetta et al., 2024; Schmidt et al., 2024). On the theoretical side, Rajaraman et al. (2024) examined tokenization through the lens of unigram models, motivated by the observation made by Makkuva et al. (2024) that transformers struggle to learn 2nd-order Markov chains. We, however, do not observe this phenomenon in our experiment. As such, our work on bias due to tokenization is not affected by their observation.

Tokenization and Perplexity. Our work relates to the statistical evaluation of LMs: we provide an algorithm to directly evaluate the character-level perplexity $P(x_{n+1}^N \mid x_1^n)$ using a tokenized LM. In terms of token-level perplexity evaluation, some recent studies (Cao & Rimell, 2021; Chirkova et al., 2023) have suggested using stochastic tokenization (Provilkov et al., 2019) at test time to evaluate perplexity scores of LMs ($P(t_1^i)$). However, these evaluations were done on LMs trained with deterministic tokenization, which can be suboptimal as demonstrated by our examination of undefined states in Section 2. As such, by utilizing our approach, one can obtain much more accurate insights for LM evaluation.

Related Algorithms. Our algorithm is inspired by the literature on universal compression, such as prediction by partial matching (Cleary & Witten, 1984) and context-tree weighting (Willems et al., 1995), which have been applied to text prediction but in much simpler settings without any tokenization involved. Recently, Minixhofer et al. (2024); Liu et al. (2023a) proposed tokenization adaptation methods, which still require a heuristic optimization that complicates the training pipeline. Some recent studies have proposed methods to address the problem of language models encountering difficulties generating text near prompt boundaries (Dagan et al., 2024; guidance ai, 2023), which bear some resemblance to our proposed algorithm. These methods, however, are heuristic and only applicable to certain scenarios. In contrast, our bias removal algorithm is theoretically correct, versatile across situations, and enables conversion between token-free and tokenized LMs due to its accurate representation of conditional sampling distributions.

Appendix B Supporting Theorems on Maximum Prefix Encoding

This section provides supporting theorems for the proofs of the main results. We first remind the reader that the set $\mathcal{S}(x_1^n)$ is the set of all strings that contain $x_1^n$ as a prefix. Similarly, the set $\mathcal{S}(t_1^i)$ is the set of all strings whose first $i$ tokens are $t_1^i$. When $t_1^i = \mathrm{encode}(x_1^n)$, it should be noted that the two sets $\mathcal{S}(t_1^i)$ and $\mathcal{S}(x_1^n)$ are not guaranteed to be equivalent, because the characters following $x_1^n$ can affect the tokenization within the first $n$ characters. We illustrate this in more detail in the following example.

Example. Consider the Markov chain example in Section 2, where $\mathcal{V} = \{$``AA'', ``A'', ``B''$\}$, and the string $s_1 = $ ``AABAABAB''. Then $s_1 \in \mathcal{S}(x_1 = \text{``A''})$ and $s_1 \in \mathcal{S}(t_1 = \text{``AA''})$, since the first character of $s_1$ is ``A'' and the first token of $s_1$ is ``AA''. On the other hand, $s_1 \notin \mathcal{S}(t_1 = \mathrm{encode}(x_1) = \text{``A''})$, since its first token is ``AA'', not ``A''.
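The following sketch (ours) checks this example mechanically with the same greedy MPE encoder used in Section 2.1.

def mpe_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"AA", "A", "B"}
s1 = "AABAABAB"
print(s1.startswith("A"))        # True:  s1 is in S(x_1 = "A")
print(mpe_encode(s1, vocab)[0])  # "AA":  s1 is in S(t_1 = "AA")
print(mpe_encode("A", vocab))    # ["A"]: but s1 is not in S(t_1 = encode("A"))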



Figure 4: Interpretation of Proposition B.1, which shows that for any string $s$ with prefix $x_1^n$, the tokens within $x_1^n$ must start at certain designated positions. For each encoding, the same color denotes membership in the same token. This later allows us to construct an efficient algorithm to correct the bias. See Corollary B.3 for details on invalid encodings.

We introduce Proposition B.1, which contains two facts regarding the MPE process, visually presented in Figure 4.

Proposition B.1.

Let $s$ be a string with the prefix $x_1^n$ ($x_1^n \in \mathrm{prefix}(s)$). Define the minimal superstring $r$ to be the prefix of $s$ with the fewest tokens that contains $x_1^n$ as a prefix: $r = \mathrm{argmin}_{r}\,(k \mid t_1^k = \mathrm{encode}(r) \wedge x_1^n \in \mathrm{prefix}(r) \wedge r \in \mathrm{prefix}(s))$. Then, we have the following:

  1. For $1 \leq i < k$, $\mathrm{encode}(s)_i = \mathrm{encode}(x_1^n)_i$. Furthermore, when $r = x_1^n$, we also have $\mathrm{encode}(s)_k = \mathrm{encode}(x_1^n)_k$.

  2. Let $\ell$ be the number of tokens in $\mathrm{encode}(x_1^n)$; then we have $\mathrm{decode}(\mathrm{encode}(x_1^n)_k^{\ell}) \in \mathrm{prefix}(\mathrm{decode}(\mathrm{encode}(s)_k))$.

Proof.

(Result 1.) Proof by contradiction. Let $s$ be a counterexample with the fewest number of tokens, i.e., assume that $\mathrm{encode}(s)_i \neq \mathrm{encode}(x_1^n)_i$ for some $1 \leq i < k$, and let $j$ be the smallest such $i$.

Consider $\mathrm{encode}(s)_j$ and $\mathrm{encode}(x_1^n)_j$.

  • Case 1: $|\mathrm{decode}(\mathrm{encode}(s)_1^j)| < |x_1^n|$.

    • Case 1.a: $|\mathrm{decode}(\mathrm{encode}(s)_j)| < |\mathrm{decode}(\mathrm{encode}(x_1^n)_j)|$. This leads to a contradiction: since $x_1^n$ is a prefix of $s$, a longest-prefix-matching algorithm would always generate the longer token ($\mathrm{encode}(x_1^n)_j$) over the shorter one ($\mathrm{encode}(s)_j$) when it is available.

    • Case 1.b: $|\mathrm{decode}(\mathrm{encode}(s)_j)| > |\mathrm{decode}(\mathrm{encode}(x_1^n)_j)|$. This leads to a contradiction: since $\mathrm{concat}(\mathrm{encode}(s)_1^j)$ is a prefix of $x_1^n$ (Case 1 assumption), a longest-prefix-matching algorithm would always generate the longer token ($\mathrm{encode}(s)_j$) over the shorter one ($\mathrm{encode}(x_1^n)_j$) when it is available.

    • Case 1.c: $|\mathrm{decode}(\mathrm{encode}(s)_j)| = |\mathrm{decode}(\mathrm{encode}(x_1^n)_j)|$. This means that the two tokens are the same, contradicting our initial assumption.

  • Case 2: $|\mathrm{decode}(\mathrm{encode}(s)_1^j)| \geq |x_1^n|$. In this case, $r = \mathrm{decode}(\mathrm{encode}(s)_1^j)$ is a superstring of $x_1^n$, implying that $k$ is at most $j$, which contradicts our initial assumption that $1 \leq j < k$.

Finally, in the case $r = x_1^n$, $\mathrm{decode}(\mathrm{encode}(s)_k)$ is a suffix of $x_1^n$. Since all tokens before position $k$ within $x_1^n$ have been matched, i.e., $\mathrm{encode}(s)_i = \mathrm{encode}(x_1^n)_i$ for $1 \leq i < k$, the last token must also match (otherwise $|r| \neq |x_1^n|$, a contradiction), and we have $\mathrm{encode}(s)_k = \mathrm{encode}(x_1^n)_k$.

(Result 2.) The proof idea is that, since $r$ contains $x_1^n$ and the tokens of $r$ and $x_1^n$ match up to position $k-1$, what is left of $x_1^n$ must lie inside the last ($k$-th) token of $r$. Formally, following Result 1, we have $\mathrm{decode}(\mathrm{encode}(x_1^n)_1^{k-1}) = \mathrm{decode}(\mathrm{encode}(s)_1^{k-1})$. Since $r$ has $k$ tokens in total and $x_1^n \in \mathrm{prefix}(r)$, the token $\mathrm{decode}(\mathrm{encode}(s)_k)$ must cover the rest of $x_1^n$, namely $\mathrm{decode}(\mathrm{encode}(x_1^n)_k^{\ell})$. As a result, we must have $\mathrm{decode}(\mathrm{encode}(x_1^n)_k^{\ell}) \in \mathrm{prefix}(\mathrm{decode}(\mathrm{encode}(s)_k))$. ∎

We remind the reader of the definition of invalid encodings below.

Definition B.2.

(Invalid Encodings) A list of tokens (an encoding) $t_1^k$ is invalid if $\mathrm{encode}(\mathrm{decode}(t_1^k)) \neq t_1^k$. Otherwise, it is a valid encoding.

Corollary B.3.

$\mathcal{S}(t_1^k) = \emptyset$ if and only if $t_1^k$ is invalid.

Proof.

We prove each direction as follows.

  • If $\mathcal{S}(t_1^k) = \emptyset$ then $t_1^k$ is invalid: Since $\mathcal{S}(t_1^k) = \emptyset$, there exists no string $s$ such that $\mathrm{encode}(s)_1^k = t_1^k$. In particular, for $s = \mathrm{decode}(t_1^k)$ we do not have $\mathrm{encode}(\mathrm{decode}(t_1^k)) = t_1^k$, which proves the result.

  • If $t_1^k$ is invalid then $\mathcal{S}(t_1^k) = \emptyset$: Let $x_1^n = \mathrm{decode}(t_1^k)$ and $t_1^i = \mathrm{encode}(x_1^n)$. Let $s_1 \in \mathcal{S}(t_1^i)$ and suppose there exists a string $s_2 \in \mathcal{S}(t_1^k)$. Re-running the MPE procedure on $s_1$ and $s_2$ in parallel, every time a token is selected within $x_1^n$ in $s_1$, it must also be selected at the same position in $s_2$. Thus, we cannot have $t_1^i \neq t_1^k$, which proves the result. ∎

Appendix C Proof of Proposition 2.3 in the Main Paper

Proposition 2.3 (Token-Induced Zero Probability) Let $t_1^i$ be a sequence of input tokens. For any invalid encoding $t_1^i$, we have $P_{\mathrm{gt}}(t_1^i) = 0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i)$ is undefined. In the case $t_1^i$ is valid, then $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i) = 0.0$ if $t_1^{i+1}$ is invalid. Furthermore, let $x_1^n = \mathrm{decode}(t_1^i)$; then for any string $x_{n+1}^N$ such that $\mathrm{encode}(\mathrm{concat}(\mathrm{decode}(t_1^i), x_{n+1}^N))_1^i \neq t_1^i$, we have $P_{\mathrm{gt}}(x_{n+1}^N \mid t_1^i) = 0.0$.

Proof.

For the first two statements, we have:

  • For an invalid $t_1^i$, where $t_1^i \neq \mathrm{encode}(\mathrm{decode}(t_1^i))$, we have $\mathcal{S}(t_1^i) = \emptyset$, as implied by Corollary B.3. As such, we have $P_{\mathrm{gt}}(t_1^i) = 0.0$, which makes $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i)$ an undefined conditional probability.

  • For a valid $t_1^i$ but invalid $t_1^{i+1}$, we know that $P_{\mathrm{gt}}(t_1^{i+1}) = 0.0$, which results in $P_{\mathrm{gt}}(t_{i+1} \mid t_1^i) = 0.0$.

For the last statement, we first note the following:

  1. Note that $P_{\mathrm{gt}}(x_{n+1}^N, t_1^i) = P_{\mathrm{gt}}(x_1^N, t_1^i)$, where $\mathrm{concat}(x_1^n, x_{n+1}^N) = x_1^N$.

  2. Consider $P_{\mathrm{gt}}(x_1^N, t_1^i) = P_{\mathrm{gt}}(x_1^N) P_{\mathrm{gt}}(t_1^i \mid x_1^N)$; we will prove that $P_{\mathrm{gt}}(t_1^i \mid x_1^N) = 0.0$ if $\mathrm{encode}(x_1^N)_1^i \neq t_1^i$.

The proof idea for this is shown in Figure 4 (Example 2, Right). Formally:

  • Let $j$ be the first position such that $\mathrm{encode}(x_1^N)_j \neq t_j$; then we know that $|\mathrm{decode}(\mathrm{encode}(x_1^N)_j)| > |\mathrm{decode}(t_j)|$ (Proposition B.1, Result 2).

  • Following Proposition B.1 (Result 2), let $s \in \mathcal{S}(x_1^N)$; then $\mathrm{decode}(\mathrm{encode}(x_1^N)_j)$ must be contained within a single (possibly longer) token of $s$, i.e., it cannot be broken apart. Hence, no string $s$ has $t_j$ as its $j$-th token, so $P_{\mathrm{gt}}(t_1^i \mid x_1^N) = 0.0$. This completes the proof.

Finally, we note that $P_{\mathrm{gt}}(t_1^i) = 0.0$ does not imply that $t_1^i$ is an invalid encoding, since the zero probability can also arise from the original distribution on the character domain. A classic example is a Markov model with an absorbing state. ∎

Appendix D Proof of Proposition 3.1 and Corollary 3.2 in the Main Paper

Proposition 3.1 Let $s^* = x_1^n$, where $t_1^i = \mathrm{encode}(s^*) = \mathrm{encode}(x_1^n)$. Then we have $\mathcal{S}(t_1^i) \subset \mathcal{S}(x_1^n)$, i.e., for any string $s$ where $t_1^i = \mathrm{encode}(s)_1^i$, we have $P(x_1^n \mid t_1^i) = 1.0$. In the case $t_i \in \mathcal{V}^*$, we also have $\mathcal{S}(t_1^i) = \mathcal{S}(x_1^n)$, i.e., any string $s$ with $x_1^n \in \mathrm{prefix}(s)$ must have $t_1^i$ as its first $i$ tokens, and $P(t_1^i \mid x_1^n) = 1.0$.

Proof.

We prove each case as follows.

1) General case: There can exist a string $s \in \mathcal{S}(x_1^n)$ with $\mathrm{encode}(s)_1^i \neq t_1^i$, as shown by the first-order Markov chain example in the main paper: the string $s = $ ``AA'' has ``A'' as a prefix but has first token $t_1 = $ ``AA'' $\neq$ ``A''. On the other hand, any string $s$ whose first $i$ tokens are $t_1^i$ must have $x_1^n$ as its first $n$ characters; hence $\mathcal{S}(t_1^i) \subset \mathcal{S}(x_1^n)$ and $P(x_1^n \mid t_1^i) = 1.0$.

2) ti𝒱subscript𝑡𝑖superscript𝒱t_{i}\in\mathcal{V}^{*}: The proof idea is that, since tisubscript𝑡𝑖t_{i} cannot be a part of any token in 𝒱𝒱\mathcal{V}, it is impossible to merge it by appending additional characters after tisubscript𝑡𝑖t_{i}. Formally, similar to Proposition B.1:

  • For any string s𝒮(decode(t1i))𝑠𝒮decodesubscriptsuperscript𝑡𝑖1s\in\mathcal{S}(\mathrm{decode}(t^{i}_{1})), let \ell be the number of tokens in the minimal superstring r𝑟r of s𝑠s that contains x1nsubscriptsuperscript𝑥𝑛1x^{n}_{1} as a prefix.

  • Following Proposition B.1 (Result 2), we know that tisubscript𝑡𝑖t_{i} must be a substring of decode(encode(s))decodeencodesubscript𝑠\mathrm{decode}(\mathrm{encode}(s)_{\ell}).

  • Due to ti𝒱subscript𝑡𝑖superscript𝒱t_{i}\in\mathcal{V}^{*}, then ti=encode(s)subscript𝑡𝑖encodesubscript𝑠t_{i}=\mathrm{encode}(s)_{\ell}. We also know from Proposition B.1 (Result 1) that encode(s)i=tiencodesubscript𝑠𝑖subscript𝑡𝑖\mathrm{encode}(s)_{i}=t_{i} for 1i<1𝑖1\leq i<\ell, this means that =i𝑖\ell=i. This gives us t1i=encode(s)1isubscriptsuperscript𝑡𝑖1encodesubscriptsuperscript𝑠𝑖1t^{i}_{1}=\mathrm{encode}(s)^{i}_{1} and P(t1i|x1n)=1.0𝑃conditionalsubscriptsuperscript𝑡𝑖1subscriptsuperscript𝑥𝑛11.0P(t^{i}_{1}|x^{n}_{1})=1.0.

This completes the proof. ∎

Remarks. We briefly note that the condition $t_i \in \mathcal{V}^*$ is only a sufficient condition. In general, any token sequence $t^i_1$ that satisfies $\mathcal{S}(t^i_1) = \mathcal{S}(x^n_1)$ will have $P(t^i_1 | x^n_1) = 1.0$. One potential strategy is to find the first index $i = 0, 1, \ldots, k-1$ such that $t^k_{k-i}$ cannot be merged into another token in $\mathcal{V}$.

Corollary 3.2 Following Proposition 3.1, suppose $t_i \in \mathcal{V}^*$; then we have $P(x^N_{n+1} | x^n_1) = P(x^N_{n+1} | t^i_1)$. Similarly, we also have $P(t^j_{i+1} | x^n_1) = P(t^j_{i+1} | t^i_1)$.

Proof.

For the first case, we have the following chain of equalities:

$P(x^N_{n+1} | t^i_1) = P(x^N_{n+1} | t^i_1, x^n_1)$  (4)
$= \frac{P(x^N_{n+1}, t^i_1 | x^n_1)}{P(t^i_1 | x^n_1)}$  (5)
$= \frac{P(x^N_{n+1} | x^n_1)\, P(t^i_1 | x^n_1, x^N_{n+1})}{P(t^i_1 | x^n_1)}$  (6)
$= P(x^N_{n+1} | x^n_1)$  (7)

where the first equality is due to $P(x^n_1 | t^i_1) = 1.0$, and the last equality is due to $P(t^i_1 | x^n_1) = P(t^i_1 | x^n_1, x^N_{n+1}) = 1.0$ for $t_i \in \mathcal{V}^*$.

Similarly, for the second case, we have:

$P(t^j_{i+1} | t^i_1) = P(t^j_{i+1} | x^n_1, t^i_1)$  (8)
$= \frac{P(t^j_{i+1} | x^n_1)\, P(t^i_1 | x^n_1, t^j_{i+1})}{P(t^i_1 | x^n_1)}$  (9)
$= P(t^j_{i+1} | x^n_1),$  (10)

which completes the proof. ∎

Appendix E Proof for The Bias Removal Method

E.1 Refactoring

Our goal is to express the quantity $P(x^N_{n+1} | x^n_1)$ using the tokenized LM, which outputs the conditional probability $P(t_i | t^{i-1}_1)$. Let $x^{n_k}_1 \in \mathrm{prefix}(x^n_1)$ be such that $t^k_1 = \mathrm{encode}(x^{n_k}_1)$ and $t_k \in \mathcal{V}^*$. Following Proposition 3.1, any string $s$ with prefix $x^{n_k}_1$ must have $t^k_1$ as its first $k$ tokens. We now perform the following factorization:

$P(x^N_{n+1} | x^n_1) = P(x^N_{n+1} | x^{n_k}_1, x^n_{n_k+1})$  (11)
$= \frac{P(x^N_{n+1}, x^n_{n_k+1} | x^{n_k}_1)}{P(x^n_{n_k+1} | x^{n_k}_1)}$  (12)
$= \frac{P(x^N_{n_k+1} | x^{n_k}_1)}{P(x^n_{n_k+1} | x^{n_k}_1)}$  (13)
$= \frac{P(x^N_{n_k+1} | t^k_1)}{P(x^n_{n_k+1} | t^k_1)},$  (14)

where the last equality is due to Corollary 3.2. Finally, we use the Maximum Prefix Correction (MPC) algorithm to compute each term in (14) individually. Note that the algorithm itself does not require $t_k \in \mathcal{V}^*$. Here, we explicitly highlight the importance of having $t_k \in \mathcal{V}^*$: it is what bridges the character and token domains through Equation (14).

E.2 Maximum Prefix Correction Algorithm

Overview. The MPC algorithm computes $P(x^N_{n_k+1} | t^k_1)$. Note that we do not require $t_k \in \mathcal{V}^*$ in the MPC algorithm. Using marginalization, we have the following:

$P(x^N_{n_k+1} | t^k_1) = \sum_{t \in \mathcal{V}} P(x^N_{n_k+1}, t_{k+1} = t | t^k_1)$  (15)
$= \underbrace{\sum_{t \in \mathcal{T}_{\mathrm{b_{val}}}} P(x^N_{n_k+1}, t_{k+1} = t | t^k_1)}_{\mathrm{b_{val}}} + \underbrace{\sum_{t \in \mathcal{T}_{\mathrm{p_{val}}}} P(x^N_{n_k+1}, t_{k+1} = t | t^k_1)}_{\mathrm{p_{val}}}$  (16)

where:

  • $\mathcal{T}_{\mathrm{b_{val}}} = \{t \in \mathcal{V} \,|\, x^N_{n_k+1} \in \mathrm{prefix}(\mathrm{decode}(t))\}$ is the set of tokens that have $x^N_{n_k+1}$ as a prefix.

  • $\mathcal{T}_{\mathrm{p_{val}}} = \{t \in \mathcal{V} \,|\, x^N_{n_k+1} \notin \mathrm{prefix}(\mathrm{decode}(t))\}$ is the set of tokens that do not.

and $\mathcal{T}_{\mathrm{b_{val}}} \cap \mathcal{T}_{\mathrm{p_{val}}} = \emptyset$; a small sketch of this partition is given below.
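To make the partition in Equation (16) concrete, the following minimal Python sketch splits a toy vocabulary into $\mathcal{T}_{\mathrm{b_{val}}}$ and $\mathcal{T}_{\mathrm{p_{val}}}$ for a given query string $x^N_{n_k+1}$. The vocabulary and query are illustrative assumptions only, and tokens are represented directly as strings so that decode is the identity.

# Partition of the vocabulary used in Eq. (16); toy values only.
VOCAB = ["AA", "A", "B", "ABB"]   # decode(t) = t in this toy example
query = "AB"                      # the remaining characters x_{n_k+1}^N

# Tokens whose decoded string has the query as a prefix (branch set).
T_bval = [t for t in VOCAB if t.startswith(query)]
# All remaining tokens (pass set); the two sets are disjoint by construction.
T_pval = [t for t in VOCAB if not t.startswith(query)]

print(T_bval)  # ['ABB']
print(T_pval)  # ['AA', 'A', 'B']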

Branch Step. Here, $\mathrm{b_{val}}$ is the probability that, given the list of previous tokens $t^k_1$, the next token of the string $s$ has $x^N_{n_k+1}$ as a prefix. To compute this term, we obtain $P(t_{k+1} = t | t^k_1)$ for all $t \in \mathcal{V}$ using one model run, then sum the probabilities corresponding to all tokens that have $x^N_{n_k+1}$ as a prefix:

$\mathrm{b_{val}} = \sum_{t \in \mathcal{T}_{\mathrm{b_{val}}}} P(t_{k+1} = t | t^k_1),$  (17)

Proof.

To see this, for each summand of $\mathrm{b_{val}}$ in Eq. (16), we have:

$P(x^N_{n_k+1}, t_{k+1} = t | t^k_1) = P(t_{k+1} = t | t^k_1) \times P(x^N_{n_k+1} | t^k_1, t_{k+1} = t)$  (18)
$= P(t_{k+1} = t | t^k_1),$  (19)

where $P(x^N_{n_k+1} | t^k_1, t_{k+1} = t) = 1.0$ is due to $x^N_{n_k+1} \in \mathrm{prefix}(\mathrm{decode}(t))$. This concludes the proof. ∎

Pass Step. Here, $\mathrm{p_{val}}$ is the probability that, given the list of previous tokens $t^k_1$, the continuation $x^N_{n_k+1}$ is not a prefix of the next token. Under MPE, we compute the value $\mathrm{p_{val}}$ as follows:

$\mathrm{p_{val}} = P(t_{k+1} = t | t^k_1) \times P(x^N_{n_{k+1}+1} | t^k_1, t_{k+1} = t),$  (20)

where $t = \mathrm{encode}(x^N_{n_k+1})_1$ and $x^{n_{k+1}}_{n_k+1} = \mathrm{decode}(t)$. That is, the pass step consists of two subroutines:

  1. Extract the next token $t$ within $x^N_{n_k+1}$ and compute $P(t_{k+1} = t | t^k_1)$. If $x^N_{n_k+1} = \mathrm{decode}(t)$, return $0.0$, since this case is excluded by the condition defining $\mathcal{T}_{\mathrm{p_{val}}}$.

  2. Recursively compute $P(x^N_{n_{k+1}+1} | t^k_1, t_{k+1} = t)$.

Proof.

Following Proposition 2.3 for invalid encodings, we only need to consider $t$ such that $t^{k+1}_1$ is valid. Under Proposition B.1 for MPE on $x^N_1$, only the first token of $\mathrm{encode}(x^N_{n_k+1})$ is allowed (see also Example 2 in Figure 4, Right). Finally, applying the chain rule of probability, we obtain Equation (20). For the case of a non-optimal LM, see Section G.2. This completes the proof. ∎

Base Case. We note that the base case of our algorithm corresponds to the situation where $x^N_{n_k+1} = \mathrm{decode}(t)$. In this scenario, we only need to compute $\mathrm{b_{val}}$ (branch step), while $\mathrm{p_{val}} = 0.0$.

Complexity Analysis. The complexity of our algorithm (the number of inferences on the language model) scales linearly with the length of the query string, i.e. $N - n_k$. Note that the cost of the summation in the branch step is negligible compared to the runtime of the language model. A minimal sketch of the full recursion is given below.
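The following Python sketch puts the branch, pass, and base-case steps together. It uses a toy MPE vocabulary in which tokens are plain strings (so decode is the identity) and a placeholder lm_next function standing in for one forward pass of a trained tokenized LM; all names and the uniform probabilities are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the MPC recursion; toy tokenizer and placeholder LM only.
VOCAB = ["AA", "A", "B"]          # tokens as strings; decode(t) = t

def encode_mpe(s):
    """Greedy maximum-prefix encoding over the toy vocabulary."""
    tokens = []
    while s:
        tok = max((v for v in VOCAB if s.startswith(v)), key=len)
        tokens.append(tok)
        s = s[len(tok):]
    return tokens

def lm_next(prefix_tokens):
    """Placeholder for P(t_{k+1} = t | t_1^k): one model run in practice.
    Here it returns a uniform distribution purely for illustration."""
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def mpc(query, prefix_tokens):
    """Estimate P(x_{n_k+1}^N = query | t_1^k = prefix_tokens), as in Eq. (16)."""
    probs = lm_next(prefix_tokens)
    # Branch step (Eq. 17): tokens whose decoding has the whole query as a prefix.
    b_val = sum(p for t, p in probs.items() if t.startswith(query))
    first = encode_mpe(query)[0]
    if first == query:                      # Base case: p_val = 0.0.
        return b_val
    # Pass step (Eq. 20): consume the first MPE token of the query and recurse.
    p_val = probs[first] * mpc(query[len(first):], prefix_tokens + [first])
    return b_val + p_val

# Example call: probability that the text continues with "AB" after the token "AA".
print(mpc("AB", ["AA"]))

With a real model, lm_next would be replaced by a forward pass returning the softmax over the vocabulary; the number of recursive calls is at most the number of characters in the query, matching the linear complexity noted above.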

Appendix F Converting Token-Free Language Model to Tokenized Language Model for MPE.

We introduce an algorithm to compute $P(t_{k+1} | t^k_1)$ using a token-free language model $P(x^N_{n+1} | x^n_1)$, despite having no access to any tokenized LM. This approach enables a theoretical conversion of a token-free model into a tokenized one. The method involves two stages. First, we refactor the conditional probability similarly to the technique presented in Section E. Next, we aggregate the probabilities of all possible strings leading to the desired tokenization. It is important to note that a Markov chain is a special type of autoregressive model, meaning this method can be employed to effortlessly calculate Markov chain transition matrices in the tokenized domain.

F.1 Refactoring

Consider the probability $P(t_{i+1} | t^i_1)$ that we would like to express using $P(x^N_{n+1} | x^n_1)$. Let $t_k$ be the last token within $t^i_1$ such that $t_k \in \mathcal{V}^*$. We now perform the following factorization:

$P(t_{i+1} | t^i_1) = \frac{P(t^{i+1}_{k+1} | t^k_1)}{P(t^i_{k+1} | t^k_1)}$  (21)
$= \frac{P(t^{i+1}_{k+1} | x^{n_k}_1)}{P(t^i_{k+1} | x^{n_k}_1)},$  (22)

where $x^{n_k}_1 = \mathrm{decode}(t^k_1)$. The second equality is due to Corollary 3.2. Each term can then be computed using the aggregation procedure shown next.

F.2 Aggregation.

In this step, we would like to compute $P(t^i_{k+1} | x^{n_k}_1)$, where $\mathrm{encode}(x^{n_k}_1) = t^k_1$ and $t_k \in \mathcal{V}^*$, using the token-free representation $P(x_{n+1} | x^n_1)$. Here, we denote $\mathrm{decode}(t^i_{k+1}) = x^{n_i}_{n_k+1}$, let $M = \max_{t \in \mathcal{V}} |\mathrm{decode}(t)|$ be the length of the longest token in $\mathcal{V}$, and let $\Omega = \mathcal{A}^M$ be the set of all strings of length $M$.

Computing $P(t^i_{k+1} | x^{n_k}_1)$ involves considering all possible strings $s$ with prefix $x^{n_i}_1$ and $t^i_{k+1} = \mathrm{encode}(s)^i_{k+1}$. Although iterating through every possible string is infeasible, we can restrict our search to strings of length $|s| = n_i + M$, as any additional characters beyond this point cannot affect the tokenization of the prefix $x^{n_i}_1$, $M$ being the maximum token length. Formally, we will show that one can express $P(t^i_{k+1} | x^{n_k}_1)$ as follows:

$P(t^i_{k+1} | x^{n_k}_1) = \sum_{s' \in \mathcal{A}^M} P(x^{n_i+M}_{n_k+1} = c_1(s') | x^{n_k}_1)\, \mathds{1}(t^i_{k+1} = \mathrm{encode}(c_2(s'))^i_{k+1}),$  (23)

where $c_1(s') := \mathrm{concat}(x^{n_i}_{n_k+1}, s')$ and $c_2(s') := \mathrm{concat}(x^{n_i}_1, s')$. The first term can be computed using the given token-free LM, i.e. $P(x^{n_i+M}_{n_k+1} | x^{n_k}_1)$. The second term is an indicator function that checks whether $t^i_{k+1} = \mathrm{encode}(c_2(s'))^i_{k+1}$ and can be computed deterministically.

Proof.

We have:

$P(t^i_{k+1} | x^{n_k}_1) = P(t^i_{k+1}, x^{n_i}_{n_k+1} | x^{n_k}_1)$  (24)
$= \sum_{s' \in \mathcal{A}^M} P(t^i_{k+1}, x^{n_i+M}_{n_k+1} = c_1(s') | x^{n_k}_1)$  (25)
$= \sum_{s' \in \mathcal{A}^M} P(x^{n_i+M}_{n_k+1} = c_1(s') | x^{n_k}_1)\, P(t^i_{k+1} | x^{n_i+M}_1 = c_2(s'))$  (26)
$= \sum_{s' \in \mathcal{A}^M} P(x^{n_i+M}_{n_k+1} = c_1(s') | x^{n_k}_1)\, \mathds{1}(t^i_{k+1} = \mathrm{encode}(c_2(s'))^i_{k+1})$  (27)

It remains to prove the following equality:

$P(t^i_{k+1} | x^{n_i+M}_1 = c_2(s')) = \mathds{1}(t^i_{k+1} = \mathrm{encode}(c_2(s'))^i_{k+1})$  (28)

We first note that the first $k$ tokens must be $t^k_1 = \mathrm{encode}(x^{n_k}_1)$, due to our condition that $t_k \in \mathcal{V}^*$. Since $M$ is the length of the longest token in $\mathcal{V}$, appending extra characters cannot change the tokenization of $x^{n_i}_1$. In other words, any string $s$ with prefix $c_2(s')$ must have the same minimal superstring $r$ containing $x^{n_i}_1$ (see Proposition B.1). We then apply this principle to the two cases:

  • $t^i_{k+1} = \mathrm{encode}(c_2(s'))^i_{k+1}$: In this case, we know that the string must have $t^i_1$ as its first $i$ tokens, hence $P(t^i_{k+1} | x^{n_i+M}_1 = c_2(s')) = 1.0$.

  • $t^i_{k+1} \neq \mathrm{encode}(c_2(s'))^i_{k+1}$: In contrast, this case gives $P(t^i_{k+1} | x^{n_i+M}_1 = c_2(s')) = 0.0$, since the string cannot have $t^i_{k+1}$ as its $(k{+}1)$-th through $i$-th tokens.

This concludes the proof. ∎
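The following Python sketch illustrates the aggregation step of Equation (23) on a toy setup: an MPE vocabulary over the alphabet {A, B} and a first-order Markov chain standing in for the token-free model. The vocabulary, chain parameters, and all function names are illustrative assumptions; with a real token-free LM, p_chars would be replaced by the model's conditional probabilities.

from itertools import product

ALPHABET = ["A", "B"]
VOCAB = ["AA", "A", "B"]                     # toy MPE vocabulary; decode(t) = t
M = max(len(v) for v in VOCAB)               # length of the longest token

def encode_mpe(s):
    """Greedy maximum-prefix encoding."""
    tokens = []
    while s:
        tok = max((v for v in VOCAB if s.startswith(v)), key=len)
        tokens.append(tok)
        s = s[len(tok):]
    return tokens

# Toy character-level model: first-order Markov chain with parameters
# alpha = P(A|A), beta = P(A|B), gamma = P(x_1 = A).
alpha, beta, gamma = 0.5, 0.7, 0.6

def p_chars(cont, prefix=""):
    """P(the next len(cont) characters equal cont | prefix), under the toy chain."""
    p, prev = 1.0, (prefix[-1] if prefix else None)
    for c in cont:
        if prev is None:
            p *= gamma if c == "A" else 1.0 - gamma
        elif prev == "A":
            p *= alpha if c == "A" else 1.0 - alpha
        else:
            p *= beta if c == "A" else 1.0 - beta
        prev = c
    return p

def p_tokens_given_chars(target_tokens, x_prefix):
    """P(t_{k+1}^i = target_tokens | x_1^{n_k} = x_prefix), via Eq. (23)."""
    k = len(encode_mpe(x_prefix))
    x_mid = "".join(target_tokens)           # decode(t_{k+1}^i)
    total = 0.0
    for tail in product(ALPHABET, repeat=M): # enumerate all M-character continuations s'
        cont = x_mid + "".join(tail)         # c_1(s')
        full = x_prefix + cont               # c_2(s')
        if encode_mpe(full)[k:k + len(target_tokens)] == list(target_tokens):
            total += p_chars(cont, prefix=x_prefix)
    return total

# Matches the worked example in Section F.3: P(t_1 = "A") = gamma * (1 - alpha).
print(p_tokens_given_chars(["A"], ""))       # 0.6 * 0.5 = 0.3

The same routine called with arguments (["A"], "AA") reproduces Equation (35) below, since $t_1 = \text{``AA''} \in \mathcal{V}^*$ lets us condition directly on the characters ``AA''.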

F.3 The Markov Chain Example.

We provide a detailed computation of the Markov chain example in the main paper. Recall that in the original chain (in the character domain), we have the following:

$P(x_2 = \text{``A''} | x_1 = \text{``A''}) = \alpha$  (29)
$P(x_2 = \text{``B''} | x_1 = \text{``A''}) = 1 - \alpha$  (30)
$P(x_2 = \text{``A''} | x_1 = \text{``B''}) = \beta$  (31)
$P(x_2 = \text{``B''} | x_1 = \text{``B''}) = 1 - \beta$  (32)

We also assume the initial probabilities $\pi = \{\gamma, 1 - \gamma\}$ for ``A'' and ``B'' respectively. In the token domain, let us first compute $P(t_2 = \text{``A''} | t_1 = \text{``AA''})$, where we do not have to perform the refactoring step since we know that $t_1 \in \mathcal{V}^*$. Following the aggregation step, we have:

$P(t_2 = \text{``A''} | t_1 = \text{``AA''}) = P(x^5_3 = \text{``ABA''} | x^2_1 = \text{``AA''}) + P(x^5_3 = \text{``ABB''} | x^2_1 = \text{``AA''})$  (33)
$= P(x^4_3 = \text{``AB''} | x^2_1 = \text{``AA''})$  (34)
$= \alpha(1 - \alpha),$  (35)

where in the first equality we do not include the cases $x^5_3 = \text{``AAA''}$ and $x^5_3 = \text{``AAB''}$, since $\mathrm{encode}(\text{``AAA''})_1 = \text{``AA''}$ and $\mathrm{encode}(\text{``AAB''})_1 = \text{``AA''}$, which are not the token ``A'' that we are interested in. For other tokens and for $t_1 = \text{``B''}$, the computation follows the same arguments.

We now consider the case $P(t_2 = \text{``B''} | t_1 = \text{``A''})$, which we can refactor as:

$P(t_2 = \text{``B''} | t_1 = \text{``A''}) = \frac{P(t_2 = \text{``B''}, t_1 = \text{``A''})}{P(t_1 = \text{``A''})}$  (36)

We first compute $P(t_1 = \text{``A''})$ using the aggregation step:

$P(t_1 = \text{``A''}) = P(x^3_1 = \text{``ABB''}) + P(x^3_1 = \text{``ABA''})$  (37)
$= P(x^2_1 = \text{``AB''})$  (38)
$= \gamma(1 - \alpha),$  (39)

where we again do not include the cases $x^3_1 = \text{``AAA''}$ and $x^3_1 = \text{``AAB''}$, for the same reason as above. For $P(t_2 = \text{``B''}, t_1 = \text{``A''})$ we have:

$P(t_2 = \text{``B''}, t_1 = \text{``A''}) = P(x^4_1 = \text{``ABAA''}) + P(x^4_1 = \text{``ABAB''}) + P(x^4_1 = \text{``ABBA''}) + P(x^4_1 = \text{``ABBB''})$  (40)
$= P(x^2_1 = \text{``AB''})$  (41)
$= \gamma(1 - \alpha),$  (42)

which gives us $P(t_2 = \text{``B''} | t_1 = \text{``A''}) = 1.0$. Finally, in this specific case, since the order of the Markov chain in the character domain is $1$, we do not need to consider higher-order transitions of the Markov chain in the token domain.

Appendix G On Predictive Distribution of Language Models

In practice, LMs often do not follow Proposition 2.3 due to their softmax activations. As such, in our MPC algorithm, when $t \in \mathcal{T}_{\mathrm{p_{val}}}$ and $t \neq \mathrm{encode}(x^N_{n_k+1})_1$, the probability $P_\theta(x^N_{n_k+1}, t_{k+1} = t | t^k_1)$ may not be $0.0$ (where $\theta$ denotes the model weights). This can potentially increase the complexity of our MPC algorithm during the pass step.

In this section, we show that given any tokenized LM, we can force its output probabilities to obey Proposition 2.3 without any loss in token-level perplexity. This means that a tokenized LM satisfying Proposition 2.3 guarantees the correctness of the pass step in our MPC algorithm.

Finally, before presenting the method, we remind the reader that Proposition 3.1 and Corollary 3.2 are exact statements that hold for all $\theta$. As such, the refactoring step holds regardless.

G.1 Truncate-Renormalization Process

We justify the assumption that our tokenized language model $P_\theta(t_{i+1} | t^i_1)$ follows Proposition 2.3. The idea is that we can turn a language model that does not follow Proposition 2.3 into one that does, while guaranteeing that the new model never has a higher token-level perplexity.

We first introduce Proposition G.1. In this proposition, we are given a target discrete probability distribution $p$ for which we know that some outcomes cannot happen, say those indexed by $\Phi^*$. Assume that we have another distribution $q$ that approximates $p$; then we can produce a distribution $q^*$ that is closer to $p$ in terms of KL divergence by setting the probabilities of $q$ on $\Phi^*$ to $0.0$ and renormalizing (similar in spirit to rejection sampling).

Proposition G.1.

Given a discrete distribution $p = \{p_1, p_2, \ldots, p_m\}$ and a distribution $q = \{q_1, q_2, \ldots, q_m\}$ with $q_i > 0.0$ for all $i$, let $\Phi = \{i \,|\, p_i = 0.0\}$ and $\Phi^* \subseteq \Phi$. We define $q^* = \{q^*_1, q^*_2, \ldots, q^*_m\}$, where $q^*_i = 0.0$ for $i \in \Phi^*$ and $q^*_j = q_j / (\sum_{l \notin \Phi^*} q_l)$ for $j \notin \Phi^*$. Then we have:

$D_{\mathrm{KL}}(p \| q^*) \leq D_{\mathrm{KL}}(p \| q),$  (43)

which implies that $q^*$ is at least as close to $p$ as $q$ is. We refer to the process of producing $q^*$ as truncate-renormalization (TR).

Proof.

Let $Z = \sum_{l \notin \Phi^*} q_l$ be the normalizing factor in $q^*$. Note that $Z \leq 1$ and hence $\log(Z) \leq 0$. Then:

$D_{\mathrm{KL}}(p \| q^*) = \sum_i p_i \log\left(\frac{p_i}{q^*_i}\right)$  (44)
$= \sum_{i \notin \Phi^*} p_i \log\left(\frac{p_i}{q^*_i}\right)$, using the convention $0 \log 0 = 0.0$  (45)
$= \sum_{i \notin \Phi^*} p_i \log\left(\frac{p_i}{q_i / Z}\right)$  (46)
$= \left[\sum_{i \notin \Phi^*} p_i \log\left(\frac{p_i}{q_i}\right)\right] + \log(Z)$  (47)
$\leq \sum_{i \notin \Phi^*} p_i \log\left(\frac{p_i}{q_i}\right) = D_{\mathrm{KL}}(p \| q),$  (48)

which completes the proof. ∎

Applying this to our scenario, for any autoregressive language model $\hat{P}_\theta(t_{i+1} | t^i_1)$ that does not follow Proposition 2.3 (due to the softmax activations), we can perform the TR process (since we know which encodings are invalid) to obtain a new LM $P_\theta(t_{i+1} | t^i_1)$, which is guaranteed to approximate the ground-truth model $P_{\mathrm{gt}}(t_{i+1} | t^i_1)$ at least as well. Thus, we are guaranteed that the token-level perplexity of $P_\theta(t_{i+1} | t^i_1)$ is always lower than or equal to that of $\hat{P}_\theta(t_{i+1} | t^i_1)$.
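As a minimal sketch of the TR process, the snippet below zeroes out the probabilities of next tokens that would form an invalid encoding and renormalizes the rest. The probability vector and the invalid-token mask are illustrative placeholders; in practice, the mask would be derived from the current token prefix and the encoding rule.

import numpy as np

def truncate_renormalize(q, invalid_mask):
    """Proposition G.1: set q_i = 0 on the invalid set and renormalize."""
    q = np.asarray(q, dtype=float)
    q_star = np.where(invalid_mask, 0.0, q)
    return q_star / q_star.sum()

# Toy next-token distribution over a 4-token vocabulary; tokens 1 and 3 would
# create invalid encodings given the current prefix (placeholder values).
q = [0.4, 0.3, 0.2, 0.1]
invalid = np.array([False, True, False, True])
print(truncate_renormalize(q, invalid))   # approximately [0.667, 0.0, 0.333, 0.0]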

G.2 On the Passing Step in the Maximum Prefix Correction Algorithm.

Once our tokenized LM follows Proposition 2.3 (after the TR process), the correctness of the Pass step is preserved. In other words, under Proposition 2.3, the LM always outputs zero probability for invalid encodings $t^k_1$. As a result, the Pass step in the MPC algorithm remains the same in this case.

Appendix H Algorithms for Byte Pair Encoding

H.1 Overview

We begin by introducing the Byte-Pair Correction (BPC) algorithm for bias correction with Byte-Pair Encoding, which is more general than the MPC algorithm and also works for the case of MPE. We then follow with a detailed analysis showing the correctness of the algorithm.

Here, we introduce the definitions of invalid encodings (for BPE) and cover encodings.

Definition H.1.

(Invalid Encodings) A list of tokens (an encoding) $t^k_1$ is invalid if $\mathrm{encode}(\mathrm{decode}(t^k_1)) \neq t^k_1$. Otherwise, it is a valid encoding. We denote a valid $t^k_1$ by $\mathrm{valid}(t^k_1)$.

Definition H.2.

(Cover Encodings) Given a string $x^n_1$, an encoding $t^k_1$ is said to cover $x^n_1$ when all of the following conditions are satisfied:

  1. $t^k_1$ is valid.

  2. $x^n_1 \in \mathrm{prefix}(\mathrm{decode}(t^k_1))$.

  3. $x^n_i \in \mathrm{prefix}(\mathrm{decode}(t_k))$ for some $1 \leq i \leq n$, i.e. the last token $t_k$ covers a part of the string $x^n_1$.

We denote by $\mathrm{cover}(x^n_1)$ the set of all cover encodings of $x^n_1$, and $\vec{t} \in \mathrm{cover}(x^n_1)$ is an encoding in $\mathrm{cover}(x^n_1)$.

Having established these two definitions, we will later show that for BPE (and MPE), the probability $P(x^n_1)$ can be represented using a tokenized LM $P(t_{i+1} | t^i_1)$ as follows:

$P(x^n_1) = \sum_{\vec{t} \in \mathrm{cover}(x^n_1)} P(\vec{t}),$  (49)

and the main goal of the BPC algorithm is to search through all cover encodings of $x^n_1$. We can then apply this algorithm to compute any conditional probability $P(x^N_{n+1} | x^n_1)$ through factorization. Figure 5 (Left) illustrates this with example cover encodings and valid/invalid encodings.

H.2 Byte-Pair Correction Algorithm


Figure 5: (Left) Representation of $P(x^n_1)$ using a tokenized LM. We also show examples of cover encodings and valid/invalid encodings. (Right) Illustration of the Byte-Pair Correction algorithm for BPE. Green ticks and red crosses denote valid and invalid encodings respectively, which can be checked using Definition H.1. The termination step does not appear in the algorithm as written (for simplicity) but can easily be implemented.

For MPE, the MPC algorithm computes $P(x^n_1)$ by searching over all valid encodings that cover $x^n_1$, where the probability of each encoding is computed using the LM $P(t_{i+1} | t^i_1)$ through a greedy search. However, this does not work for the case of BPE. For example, under the BPE encoding rule of Llama 2, the string ``hello'' is tokenized as a single token, while the string ``hellow'' is tokenized into the two tokens ``h'' and ``ellow''. Note that a naive left-to-right search over all tokens in $\mathcal{V}$ that cover $x^n_1$ is computationally expensive.

Algorithm 2 Byte-Pair Correction Algorithm. This algorithm computes $P(x^n_1)$ by gradually reducing the search space.
1: procedure compute($x^n_1$)
2:    // Initialize $P(x^n_1)$:
3:    $P_{out} = 0.0$
4:    // Probability aggregation
5:    for $i = n-1, \ldots, 0$ do
6:        // The last token that partially covers $x^n_1$ begins with $x^n_{i+1}$
7:        $\mathcal{B} = \{t \in \mathcal{V} \,|\, x^n_{i+1} \in \mathrm{prefix}(\mathrm{decode}(t))\}$
8:        // Once the last token's starting position is known, the preceding tokens are determined
9:        $t^{k-1}_1 = \mathrm{encode}(x^i_1)$
10:       // Similar to the branch step in MPC
11:       $\mathrm{b_{val}} = \sum_{t \in \mathcal{B}} P(t_k = t \,|\, t^{k-1}_1)$
12:       $P_{out} = P_{out} + P(t^{k-1}_1) \times \mathrm{b_{val}}$
13:    end for
14:    return $P_{out}$
15: end procedure

The Byte-Pair Correction (BPC) algorithm, shown in Algorithm 2 and visualized in Figure 5 (Right), is an efficient algorithm that searches over all valid encodings covering $x^n_1$. The idea is that, for each cover encoding $\vec{t}$, once the starting position of the last token is determined (say $x_{i+1}$), the preceding tokens are unique and must be $\mathrm{encode}(x^i_1)$ (see Proposition H.6). One then accepts the extracted $\vec{t}$ if it is valid and rejects it otherwise; Corollary H.3 provides the justification for this procedure. Here, we assume $P(\vec{t}) = 0.0$ for invalid $\vec{t}$; see Proposition H.4, with justification and implementation details in Appendix G.

Remark. The BPC algorithm can also be applied to the case of MPE. In fact, it is more general than the original MPC algorithm, as it only relies on the property of invalid encodings. A minimal sketch of the procedure is given below.
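The following Python sketch mirrors Algorithm 2 on a toy tokenizer. For simplicity it reuses the toy MPE vocabulary (the Remark above notes that BPC also applies to MPE), checks validity explicitly via Definition H.1, and uses a placeholder uniform LM; all names and probabilities are illustrative assumptions rather than the actual implementation.

VOCAB = ["AA", "A", "B"]                      # toy vocabulary; decode(t) = t

def encode(s):
    """Toy maximum-prefix encoder (stands in for the tokenizer's encode)."""
    toks = []
    while s:
        t = max((v for v in VOCAB if s.startswith(v)), key=len)
        toks.append(t)
        s = s[len(t):]
    return toks

def lm_next(prefix_tokens):
    """Placeholder P(t_k = t | t_1^{k-1}); one model run in practice."""
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def lm_seq(tokens):
    """Placeholder P(t_1^{k-1}) via the chain rule over lm_next."""
    p, prefix = 1.0, []
    for t in tokens:
        p *= lm_next(prefix)[t]
        prefix.append(t)
    return p

def bpc(x):
    """P(x_1^n) as a sum over cover encodings (Eq. 49 / Algorithm 2)."""
    p_out = 0.0
    for i in range(len(x) - 1, -1, -1):
        suffix = x[i:]                        # the last token must start with this suffix
        prev = encode(x[:i])                  # t_1^{k-1} = encode(x_1^i)
        probs = lm_next(prev)
        b_val = 0.0
        for t in VOCAB:
            # keep only valid cover encodings, checked via Definition H.1
            if t.startswith(suffix) and encode("".join(prev) + t) == prev + [t]:
                b_val += probs[t]
        p_out += lm_seq(prev) * b_val
    return p_out

# Example: P("AB"); the conditional in Eq. (51) below is then bpc(x_full) / bpc(x_prefix).
print(bpc("AB"))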

H.3 Analysis

Notations. We extend the notation of the vocabulary $\mathcal{V}$ to the case of BPE. Here, $\mathcal{V}$ is an ordered list that determines the merging order in the BPE algorithm. Each $v \in \mathcal{V}$ is a triplet $(t_{\mathrm{left}}, t_{\mathrm{right}}, t_{\mathrm{new}})$, corresponding to the two merged tokens (left and right) and the resulting new token. For simplicity, when we write $t \in \mathcal{V}$, we refer to the merged token, i.e. the $t_{\mathrm{new}}$ entry of some $v$. Finally, the first $|\mathcal{A}|$ entries in $\mathcal{V}$ correspond to the alphabet $\mathcal{A}$, for which no merging happens.

Byte-Pair Encoding. We review the encoding rule for BPE, shown in Algorithm 3. In practice, pre-tokenization is often used, where tokens are separated by whitespace or special characters. In this case, we can adjust our vocabulary $\mathcal{V}$ by removing tokens that contain special characters in the middle of the string.

Algorithm 3 Byte-Pair Encoding Algorithm.
1: procedure Encode_BPE($x^N_1$, $\mathcal{V}$)
2:    // Set the initial encoding:
3:    c_tokens $= x^N_1$
4:    // Iterate over the merging order in $\mathcal{V}$; the first $|\mathcal{A}|$ entries correspond to the alphabet (no merge happens):
5:    for $i = |\mathcal{A}|+1, \ldots, |\mathcal{V}|$ do
6:        c_tokens $\leftarrow$ find_merge(c_tokens, $\mathcal{V}[i]$)
7:    end for
8:    return c_tokens
9: end procedure
10:
11: procedure find_merge(c_tokens, $v$)
12:    // Left and right tokens for merging
13:    $t_{\mathrm{left}}, t_{\mathrm{right}}, t_{\mathrm{new}} = v[1], v[2], v[3]$
14:    old_tokens = c_tokens
15:    new_tokens = []
16:    // Find and merge tokens from left to right
17:    $j = 1$
18:    while $j \leq |\mathrm{old\_tokens}|$ do
19:        if $j < |\mathrm{old\_tokens}|$ and old_tokens$[j, j+1] = t_{\mathrm{left}}, t_{\mathrm{right}}$ then
20:            new_tokens.append($t_{\mathrm{new}}$)
21:            $j = j + 2$
22:        else
23:            new_tokens.append(old_tokens$[j]$)
24:            $j = j + 1$
25:        end if
26:    end while
27:    return new_tokens
28: end procedure
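Below is a minimal runnable Python transcription of Algorithm 3. The merge list shown is an illustrative placeholder, not the vocabulary of any particular trained tokenizer.

ALPHABET = ["A", "B"]
# Each entry is a triplet (t_left, t_right, t_new), listed in merging order.
MERGES = [("B", "A", "BA"), ("BA", "A", "BAA"), ("A", "A", "AA")]

def find_merge(tokens, merge):
    """Apply one merge rule, scanning left to right."""
    t_left, t_right, t_new = merge
    out, j = [], 0
    while j < len(tokens):
        if j + 1 < len(tokens) and (tokens[j], tokens[j + 1]) == (t_left, t_right):
            out.append(t_new)
            j += 2
        else:
            out.append(tokens[j])
            j += 1
    return out

def encode_bpe(s):
    """Start from single characters and apply the merges in vocabulary order."""
    tokens = list(s)
    for merge in MERGES:
        tokens = find_merge(tokens, merge)
    return tokens

print(encode_bpe("BAAB"))   # -> ['BAA', 'B']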

Overview. We begin our analysis with theoretical results on invalid encodings (Corollary H.3 and Proposition H.4), which characterize the zero-probability events for an optimal tokenized LM. This allows us to prove the representation of $P(x^n_1)$, i.e. Proposition H.5, previously shown in Equation (49). Finally, we conclude this section with the proof of correctness of the BPC algorithm, using Propositions H.5 and H.6.

We begin with theoretical results on invalid encodings in the case of BPE.

Corollary H.3.

$\mathcal{S}(t^k_1) = \emptyset$ if and only if $t^k_1$ is invalid.

Proof.

We prove each direction as follows.

  • If $\mathcal{S}(t^k_1) = \emptyset$ then $t^k_1$ is invalid: Since $\mathcal{S}(t^k_1) = \emptyset$, there exists no string $s$ such that $\mathrm{encode}(s)^k_1 = t^k_1$. In particular, for $s = \mathrm{decode}(t^k_1)$ we do not have $\mathrm{encode}(\mathrm{decode}(t^k_1)) = t^k_1$, which proves the result.

  • If $t^k_1$ is invalid then $\mathcal{S}(t^k_1) = \emptyset$: Let $x^n_1 = \mathrm{decode}(t^k_1)$, and consider two strings $s_1$ and $s_2$ that both have $x^n_1$ as a prefix. Furthermore, assume the first $i$ tokens of $s_1$ cover exactly $x^n_1$, i.e. $x^n_1 = \mathrm{decode}(\mathrm{encode}(s_1)^i_1)$, and similarly the first $j$ tokens of $s_2$ cover exactly $x^n_1$, i.e. $x^n_1 = \mathrm{decode}(\mathrm{encode}(s_2)^j_1)$. Then:

    1. Proving that an invalid $t^k_1$ leads to $\mathcal{S}(t^k_1) = \emptyset$ is equivalent to proving that $\mathrm{encode}(s_1)^i_1 = \mathrm{encode}(s_2)^j_1$ for any such $s_1, s_2$.

    2. Re-running the BPE algorithm for $s_1$ and $s_2$ in parallel, we know that there will be no merge between any suffix of $x^n_1$ and the rest of the strings, i.e. $s_1 \backslash x^n_1$ and $s_2 \backslash x^n_1$, due to the condition above (see Algorithm 3, line 6).

    3. Furthermore, for $s_1$, any time a merge happens within $x^n_1$, the same merge must also happen within $x^n_1$ for $s_2$, and vice versa.

    As a result, we have $\mathrm{encode}(s_1)^i_1 = \mathrm{encode}(s_2)^j_1$, and both must equal $\mathrm{encode}(x^n_1)$. Since $t^k_1$ is invalid, $\mathrm{encode}(x^n_1) \neq t^k_1$, so no string can have $t^k_1$ as the encoding covering $x^n_1$, i.e. $\mathcal{S}(t^k_1) = \emptyset$. ∎

Proposition H.4.

(Token-Induced Zero Probability, BPE) Let $t^i_1$ be a sequence of input tokens. For any invalid encoding $t^i_1$, we have $P_{\mathrm{gt}}(t^i_1) = 0.0$, and the conditional probability $P_{\mathrm{gt}}(t_{i+1} | t^i_1)$ is undefined. In the case where $t^i_1$ is valid, we have $P_{\mathrm{gt}}(t_{i+1} | t^i_1) = 0.0$ if $t^{i+1}_1$ is invalid.

Proof.

The proof is the same as the MPE version (Proposition 2.3). ∎

Correctness of the BPC Algorithm. We show in Proposition H.5 that computing the string probability $P(x^n_1)$ is equivalent to marginalizing over the probabilities of all cover encodings of $x^n_1$. As such, the main task in computing $P(x^n_1)$ is to iterate over all valid encodings that cover $x^n_1$.

Proposition H.5.

(Prefix Probability Representation) Given a prefix $x^n_1$, we have the following:

  1. For any distinct $\vec{t}_i, \vec{t}_j \in \mathrm{cover}(x^n_1)$, we have $\mathcal{S}(\vec{t}_i) \cap \mathcal{S}(\vec{t}_j) = \emptyset$.

  2. $\mathcal{S}(x^n_1) = \bigcup_{\vec{t} \in \mathrm{cover}(x^n_1)} \mathcal{S}(\vec{t})$.

As a result, $P(x^n_1)$ can be expressed as the marginal probability over all cover encodings of $x^n_1$:

$P(x^n_1) = \sum_{\vec{t} \in \mathrm{cover}(x^n_1)} P(\vec{t})$  (50)
Proof.

We prove each point as follows:

  1. By contradiction: suppose that $\mathcal{S}(\vec{t}_i) \cap \mathcal{S}(\vec{t}_j) \neq \emptyset$. Then there exists a string $s$ that has two different cover encodings $\vec{t}_i$ and $\vec{t}_j$. This is impossible, since each string $s$ has a unique encoding.

  2. This follows from the definition of cover encodings.

Since the sets $\mathcal{S}(\vec{t})$ are pairwise disjoint, we arrive at the final equation. We illustrate this proposition in Figure 6. ∎

Figure 6: Illustration of Proposition H.5 when $|\mathrm{cover}(x^n_1)| = 3$.
Figure 7: Our method accurately estimates the transition probability of a 3rd order Markov chain while the baseline method fails to.

Finally, we prove that BPC extracts all the cover encodings of $x^n_1$. Proposition H.6 shows the correctness of line 9 in the algorithm: if the last token starts at $x_{i+1}$, then $\mathrm{encode}(x^i_1)$ must be the encoding preceding that last token. Since each cover encoding $\vec{t}$ must have a last token that covers a suffix of $x^n_1$, iterating over all starting positions from $1$ to $n$ guarantees that we extract all possible cover encodings.

Proposition H.6.

Let $\vec{t} \in \mathrm{cover}(x^N_1)$ and let $k = |\vec{t}|$ be the number of tokens in $\vec{t}$. Suppose that $x^N_{j+1}$ is a prefix of $\mathrm{decode}(t_k)$ for some $0 \leq j < N$, i.e. $x^N_{j+1} \in \mathrm{prefix}(\mathrm{decode}(t_k))$; then $t^{k-1}_1 = \mathrm{encode}(x^j_1)$.

Proof.

Since $\vec{t} = t^k_1 \in \mathrm{cover}(x^N_1)$, the prefix $t^{k-1}_1$ must itself be a valid encoding, and $\mathrm{decode}(t^{k-1}_1) = x^j_1$. By Definition H.1, we therefore have $t^{k-1}_1 = \mathrm{encode}(\mathrm{decode}(t^{k-1}_1)) = \mathrm{encode}(x^j_1)$. ∎

Refactoring. Unlike the case of MPE, identifying when $P(x^N_{n+1} | t^i_1) = P(x^N_{n+1} | x^n_1)$ holds is nontrivial in general for BPE. Nevertheless, the refactoring step only aims to reduce the computational complexity, and in general we can still compute $P(x^N_{n+1} | x^n_1)$ by the factorization

$P(x^N_{n+1} | x^n_1) = \frac{P(x^N_1)}{P(x^n_1)},$  (51)

where we use the BPC algorithm to compute $P(x^N_1)$ and $P(x^n_1)$ respectively. Note that this is equivalent to assuming that $t_1$ is a $\langle\mathrm{start}\rangle$ token within $\mathcal{V}^*$ (and considering $P(x^N_2 | t_1)$ and $P(x^n_2 | t_1)$ instead of $P(x^N_1)$ and $P(x^n_1)$). When pre-tokenization is used, e.g. splitting on whitespace, we can identify when $P(x^N_{n+1} | t^i_1) = P(x^N_{n+1} | x^n_1)$ by using the pre-tokenization pattern.

H.4 Experiments

The experimental setup for BPE is the same as in Section 4, except that we use the vocabulary $\mathcal{V} = \{$``A'', ``B'', ``B$\cdot$A'', ``BA$\cdot$A'', ``B$\cdot$BAA'', ``A$\cdot$A'', ``BA$\cdot$BA'', ``B$\cdot$B''$\}$, where the order within $\mathcal{V}$ is the merging order for the BPE encoding process and ``$\cdot$'' separates the two merged tokens. The results are shown in Figure 7, where our method accurately recovers the ground-truth probability $P(x_{n+1} | x^n_1)$ while the baseline fails to. Notice that for the state ``BAA'', the baseline approach outputs the correct probability; this is because the merge producing the token ``BAA'' happens before any merge in which ``A'' is the left token. This experiment again demonstrates the existence of bias in the BPE process and that our method recovers the exact ground-truth probabilities.