Understanding and Mitigating Tokenization Bias in Language Models

Buu Phan Marton Havasi Matthew Muckley Karen Ullrich

Abstract

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.

Machine Learning, ICML

\AtBeginEnvironment

align*\useshortskip

1 Introduction

Tokenization is a preprocessing procedure used in many state-of-the-art (SOTA) language models (LMs) such as GPTs (Brown et al., 2020), Llama (Touvron et al., 2023) and Gemini (Gemini, 2023). It divides the input text into smaller subword units while retaining linguistic importance, helping to address vocabulary limitations such as unknown words. Tokenization also shortens (compresses) the input context length (Sennrich et al., 2015; Kudo & Richardson, 2018). Since effective compression allows transformer-based LMs to handle longer context strings, many works (Zouhar et al., 2023; Gallé, 2019; Goldman et al., 2024) have focused on enhancing vocabulary design and encoding algorithms for better performance in downstream tasks. However, the relationship between compression and model performance remains unclear. Some research suggests the impact of compression is not always positive (Schmidt et al., 2024; Dagan et al., 2024; Goyal et al., 2023). Consequently, understanding tokenization’s effect on model performance continues to be an open question.

Tokenization has been criticized for introducing many shortcomings in LMs. These include sensitivity to spelling and morphological structure (Xue et al., 2022), language-based biases (Petrov et al., 2024), subpar performance in specific tasks such as arithmetic (Singh & Strouse, 2024), or new domains (Liu et al., 2023a). One approach to address these issues is through fine-tuning the model with new vocabularies; however, this often complicates the training process and requires domain-specific expertise (Chen et al., 2023; Liu et al., 2023b). Furthermore, the performance gains do not provide a theoretical understanding of whether these limitations truly arise from the tokenization process or result from suboptimal model training. Another direction is to develop token-free LMs (Yu et al., 2024; Nawrot et al., 2022; Tay et al., 2021). While this approach has potential as it eliminates tokenization-related issues, it significantly increases the context length, resulting in performance that still lags behind the SOTA tokenized LMs ¹¹1We refer language models that process tokenized texts as tokenized language models (tokenized LMs).(Yu et al., 2024).

In this work we offer new theoretical insights on the behavior of tokenized LMs. We show that they are statistically equivalent to their token-free counterparts. Specifically, we examine the maximum prefix encoding (MPE) scheme employed in the WordPiece tokenization method (Devlin et al., 2018; Song et al., 2020) and find that this process not only results in biased estimates of next token probabilities, but also leads to overall skewed estimates of subsequent character probabilities. In general, this bias persists despite an increase in training data, even within the simple setting of a 1st-order Markov chain. Such bias occurs due to the implicit disparity between the domain of the conditioning context, namely, characters versus tokens. Nevertheless, we will show that it is possible to correct this bias without resorting to finetuning. Once adjusted, it becomes possible to simulate the token-free behavior learned implicitly by the tokenized LM and even (theoretically) mimic the behavior of another tokenized model employing a distinct vocabulary set, all without requiring finetuning. Our specific contributions are as follows:

•

We show the presence of a bias in the next-token distribution that arises as a result of the tokenization process.
•

We present two novel algorithms to correct this bias for MPE and Byte-Pair-Encoding (BPE) respectively. Due to space limit, the analysis and algorithm for BPE are presented in Appendix H.
•

We verify the correctness of our algorithms on learning the transition matrix of a $k$ -th order Markov chain.

2 Problem Setup

ID	Token
1	$A$
2	$B$
3	$AA$

Figure 1: Next-Character sampling bias introduced by the WordPiece encoding algorithm. In this example, given the context token

``A"

, the model will always predict the next token as

``B"

with probability

1.0

. We present a technique that, given a language model trained on tokenized domain, eliminate this bias and recover the accurate unbiased sampling distribution.

We begin by establishing the tokenization and language models setup in our paper. We then describe the next-character sampling bias problem due to tokenization.

2.1 Notations and Setup.

String Notations. For any string $s$ , we denote its substring from $i$ to $j$ as $x^{j}_{i}{\vcentcolon=}{x_{i}x_{i+1}..x_{j}}$ , where each $x$ is a character of the alphabet $\mathcal{A}$ . For a given string $x^{N}_{1}$ , we define the prefix function that generates a set containing all possible prefix strings of $x^{N}_{1}$ , represented as $\mathrm{prefix}(x^{N}_{1}){=}\{x^{1}_{1},x^{2}_{1},x^{3}_{1},...,x^{N}_{1}\}$ . Also, we define a concatenation function $\mathrm{concat}(.)$ that concatenates the given list of strings, e.g given $s_{1}{=}x^{N_{1}}_{1}$ and $s_{2}{=}y^{N_{2}}_{1}$ , we obtain $\mathrm{concat}(s_{1},s_{2}){=}\mathrm{concat}(x^{N_{1}}_{1},y^{N_{2}}_{1}){=}x_{1}...x_{N_{1}}y_{1}...y_{N_{2}}$ . Finally, we denote the set of all strings that start with a prefix $x^{n}_{1}$ as $\mathcal{S}(x^{n}_{1}){=}\{s|x^{n}_{1}{\in}\mathrm{prefix}(s)\}$ .

Tokenization Setting. We assume having a predefined vocabulary $\mathcal{V}$ constructed using any tokenization algorithm such as BPE, with the condition that $\mathcal{A}{\subseteq}\mathcal{V}$ . We use $t$ to denote a token in $\mathcal{V}$ , i.e. $t{\in}\mathcal{V}$ . Importantly, we use the longest prefix matching strategy for tokenization (encoding), denoted as $\mathrm{encode}(.)$ , similar to the approach used in the Wordpiece algorithm (Devlin et al., 2018; Song et al., 2020). Given a sequence of tokens $t^{k}_{1}$ , the function $\mathrm{decode}(.)$ returns the concatenated string resulting from processing each token in the sequence. Finally, the set of all strings that starts with the tokens $t^{k}_{1}$ is defined as $\mathcal{S}(t^{k}_{1}){=}\{s|t^{k}_{1}{=}\mathrm{encode}(s)^{k}_{1}\}$ .

Tokenized LMs. We assume having access to a tokenized autoregressive LM with parameters $\theta$ that is trained with tokens from $\mathcal{V}$ and maximum prefix matching. The target distributions on the character domain is denoted as $P_{\mathrm{gt}}(x^{N}_{n+1}|x^{n}_{1})$ and on the token domain is $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})$ . For simplicity, unless otherwise stated, we implicitly assume each probability term involves $\theta$ . Using the model, we assume that one can compute $P(t_{i+1}|t^{i}_{1})$ for any integer $i>0$ . In this work, we consider LMs trained under the standard setup, where each string $s$ in the dataset is first tokenized with the encoding function $\mathrm{encode}(.)$ and vocabulary $\mathcal{V}$ , and the parameters $\theta$ are optimized to maximize the predictive likelihood of the next token in the tokenized dataset.

2.2 Next-Character Sampling Bias

We first define the (next-character) sampling bias problem that describes the discrepancy between the character level and token level predictions for tokenized LMs.

Definition 2.1.

(Next-Character Sampling Bias) Let the input prompt string $x^{n}_{1}$ has $t^{i}_{1}{=}\mathrm{encode}(x^{n}_{1})$ as the corresponding encoding. The next-character sampling bias occurs for this prompt when $P_{\mathrm{gt}}(x_{n+1}|x^{n}_{1}){\neq}P_{\mathrm{gt}}(x_{n+1}|t^{i}_{1})$ where $P_{\mathrm{gt}}(x_{n+1}|t^{i}_{1}){=}\sum_{t\in\mathcal{E}}P_{\mathrm{gt}}(t_{i+1}{=}t|t^{i}_{1})$ where $\mathcal{E}=\{t\in\mathcal{V}|\mathrm{decode}(t)\in\mathcal{S}(x_{n+1})\}$ .

In other words, the probability of the next character being “c” may be different from the sum of the probabilities of all tokens that start with “c”. Note that this character-level probability offers a broader perspective compared to the probability of the subsequent token being exactly “c”.

Example. Consider a first order Markov chain with two states $\{``A",``B"\}$ as shown in Figure 1 (left). Each string is tokenized with $\mathcal{V}{=}\{``AA",``A",``B"\}$ , which leads to a new Markov chain whose states and transition matrix is shown in Figure 1 (right). Details on computing the transition matrix of the new Markov chain is in Appendix F. We first observe that for the prompt $s_{1}{=}``AA"$ and $s_{2}{=}"B"$ , there is no bias problem after marginalization²²2For example, we have $P_{\mathrm{gt}}(t_{i{+}1}{=}``AA"|t_{i}{=}``AA")+P_{\mathrm{gt}}(t_{i{+}1}{=}``A"|t_{i}{=}``AA")=\alpha=P_{\mathrm{gt}}(x_{n{+}1}{=}``A"|x_{n}{=}``A")$ . However, for the prompt $s_{3}{=}``A"$ , the sampling bias occurs as $P_{\mathrm{gt}}(x_{2}{=}``B"|t_{1}{=}``A"){=}1.0$ , which is not equal to $P_{\mathrm{gt}}(x_{2}{=}``B"|x_{1}{=}``A"){=}\alpha$ , i.e. the optimally trained LM will always output $``B"$ . In fact, for any context string that ends with token $``A"$ , e.g $``AA|A"$ and $``B|A"$ (tokens are separated by $``|"$ ), such LM will always output $``B"$ .

Since this applies to any optimally trained LM, increasing the training set size does not mitigate this problem. The reason for this sampling bias is that, during the tokenization process with longest prefix matching, the token $``A"$ must be followed by the token $``B"$ . Else, MPE encoding will merge to create a longer token $``AA"$ . We generalize this phenomenon with the definition of invalid encodings.

Definition 2.2.

(Invalid Encodings) The list of tokens (an encoding) $t^{i}_{1}$ is invalid if $\mathrm{encode}(\mathrm{decode}(t^{i}_{1})){\neq}t^{i}_{1}$ . Otherwise, it is a valid encoding.

For example, let $\mathcal{V}{=}\{``c",``a",``t",``at",``cat"\}$ then $[``c"{,}``at"{,}``t"]$ and $[``c"{,}``a"{,}``t"{,}``t"]$ are invalid encodings of $``catt"$ . We now show in Proposition 2.3 that the existence of invalid encodings introduces sampling bias, generalizing the observed phenomenon in the Markov chain example to any autoregressive distribution.

Proposition 2.3.

(Token-Induced Zero Probability) Let $t^{i}_{1}$ be a sequence of input tokens. For any invalid encoding $t^{i}_{1}$ , we have $P_{\mathrm{gt}}(t^{i}_{1}){=}0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})$ is undefined. In the case $t^{i}_{1}$ is valid, then $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1}){=}0.0$ if $t^{i+1}_{1}$ is invalid. Furthermore, let $x^{n}_{1}{=}\mathrm{decode}(t^{i}_{1})$ , then for any string $x^{N}_{n+1}$ such that $\mathrm{encode}(\mathrm{concat}(\mathrm{decode}(t^{i}_{1}),x^{N}_{n+1})){\neq}t^{i}_{1}$ , we have $P_{\mathrm{gt}}(x^{N}_{n+1}|t^{i}_{1}){=}0.0.$

Proof.

See Appendix C. ∎

Remark 1.

Proposition 2.3 implies that LMs may not function as expected when presented with invalid encodings, because these models will never be exposed to such inputs within the dataset. This directly implies that the practice of evaluating LMs under different encodings (Cao & Rimell, 2021; Chirkova et al., 2023) is suboptimal.

3 Alleviating Sampling Bias

We propose a method to remove the described bias and recover the original token-free autoregressive model, i.e. expressing the implicitly learned $P(x^{N}_{n+1}|x^{n}_{1})$ using the tokenized LM that outputs the conditional probability $P(t_{i+1}|t^{i}_{1})$ . For $N{=}n{+}1$ , this captures the behavior of a token-free model, i.e. sampling the next character instead of a whole token. We assume our LM follows Proposition 2.3 on zero probability events and undefined conditional probability for invalid encodings. Appendix G justifies this assumption and provides its practical implementation.

Our method consists of two stages. In the first stage, the idea is to identify the condition when $P(x^{N}_{n+1}|t^{i}_{1})=P(x^{N}_{n+1}|x^{n}_{1})$ where $t^{i}_{1}=\mathrm{encode}(x^{n}_{1})$ . Once identified, we can refactor the conditional probability to match the conditioning events. In the second stage, we compute $P(x^{N}_{n+1}|t^{i}_{1})$ using the LM output probability, i.e. $P(t_{i+1}|t^{i}_{1})$ , through the novel Maximum Prefix Correction (MPC) Algorithm.

3.1 Refactoring

Our method removes the bias by connecting character and token domains through a special subset of tokens $\mathcal{V}^{*}{\subset}\mathcal{V}$ , whose elements $t^{*}{\in}\mathcal{V}^{*}$ are not a substring of any other tokens in $\mathcal{V}$ but itself. For example, given $\mathcal{V}{=}\{``AAA",``AA",``CB",``A",``B",``C"\}$ , then $\mathcal{V}^{*}=\{``AAA",``CB"\}$ . In the Markov example in Section 2, this corresponds to the tokens $``AA"$ and $``B"$ . Also, we assume that any string $x^{N}_{1}$ has the first token $t_{1}\in\mathcal{V}^{*}$ ³³3Many current language models begins with a start token ${<}\mathrm{start}{>}$ in $\mathcal{V}^{*}$ , e.g. in SentencePiece (Kudo & Richardson, 2018).. Consider the input string $x^{n}_{1}$ and its corresponding encoding $t^{i}_{1}{=}\mathrm{encode}(x^{n}_{1})$ , Proposition 3.1 shows the sufficient condition for $\mathcal{S}(t^{i}_{1}){=}\mathcal{S}(x^{n}_{1})$ .

Proposition 3.1.

Let $s^{*}=x^{n}_{1}$ , where $t^{i}_{1}=\mathrm{encode}(s^{*})=\mathrm{encode}(x^{n}_{1})$ . Then we have $\mathcal{S}(t^{i}_{1})\subset\mathcal{S}(x^{n}_{1})$ , i.e. for any string $s$ where $t^{i}_{1}=\mathrm{encode}(s)^{i}_{1}$ , we have $P(x^{n}_{1}|t^{i}_{1})=1.0$ . In the case $t_{i}\in\mathcal{V}^{*}$ , then we also have that $\mathcal{S}(t^{i}_{1})=\mathcal{S}(x^{n}_{1})$ , i.e. any string $s$ where $x^{n}_{1}\in\mathrm{prefix}(s)$ must have the first $i$ tokens as $t^{i}_{1}$ and $P(t^{i}_{1}|x^{n}_{1})=1.0$ .

Proof.

See Appendix D. ∎

The intuition for Proposition 3.1 is that the subsequent string after $t_{i}{\in}\mathcal{V}^{*}$ cannot change the tokenization for $x^{n}_{1}$ . We now establish one of the main results in Corollary 3.2.

Corollary 3.2.

Following Proposition 3.1, suppose $t_{i}\in\mathcal{V}^{*}$ then we have $P(x^{N}_{n+1}|x^{n}_{1}{)}{=}P(x^{N}_{n{+}1}|t^{i}_{1})$ . Similarly, we also have $P(t^{j}_{i+1}|x^{n}_{1}{)}{=}P(t^{j}_{i+1}|t^{i}_{1})$ .

Proof.

See Appendix D. ∎

We note that Proposition 3.1 and Corollary 3.2 always hold, regardless of the value of $\theta$ . In general, consider when the last token of $\mathrm{encode}(x^{n}_{1})$ is not in $\mathcal{V}^{*}$ , we can refactor $P(x^{N}_{n+1}|x^{n}_{1})$ as follow:

\displaystyle P(x^{N}_{n+1}|x^{n}_{1}){=}\frac{P(x^{N}_{n_{k}+1}|t^{k}_{1})}{P(x^{n}_{n_{k}+1}|t^{k}_{1})},

(1)

where $k$ is the last token in $\mathrm{encode}(x^{n}_{1})$ such that $t_{k}{\in}\mathcal{V}^{*}$ and $x^{n_{k}}_{1}{=}\mathrm{decode}(t^{k}_{1})$ , where $n_{k}\leq n$ . Proof details of this step can be found in the Appendix E. We then use the MPC algorithm to compute each term in the RHS individually.

3.2 Maximum Prefix Correction Algorithm

Algorithm 1 Maximum Prefix Correction Algorithm. This algorithm recursively computes

P(x^{N}_{n_{k}+1}|t^{k}_{1})

1:procedure compute(

x^{N}_{n_{k}+1},t^{k}_{1}

)

2: // Branching Step:

\mathcal{B}=\{t\in\mathcal{V}|x^{N}_{n_{k}{+}1}{\in}\mathrm{prefix}(\mathrm{decode}({t}))\}

\mathrm{b_{val}}={\sum\limits_{t{\in}\mathcal{B}}}P(t_{k{+}1}=t\big{|}t^{k}_{1})

5: // Base Case:

6: if

\mathrm{encode}(x^{N}_{i})\in\mathcal{V}

then

7: return

\mathrm{b_{val}}

8: end if

9: //Extract the Next Token:

10:

t_{k+1}=\mathrm{encode}(x^{N}_{n_{k}+1})_{1}

11: // Passing Step:

12:

\mathrm{p_{val}}=P(t_{k+1}\big{|}t^{k}_{1})

13:

\mathrm{p_{val}}=\mathrm{p_{val}}\times\text{COMPUTE}(x^{N}_{n_{k+1}+1},t^{k+1}_{1})

14: return

\mathrm{b_{val}}+\mathrm{p_{val}}

15:end procedure

We present the MPC algorithm in Algorithm 1, that allows us to compute the probabilities $P(x^{N}_{n_{k}+1}|t^{k}_{1})$ and $P(x^{n}_{n_{k}+1}|t^{k}_{1})$ in Equation (1). Note that this algorithm does not require $t_{k}{\in}\mathcal{V}^{*}$ . Details on the algorithmic correctness are shown in Appendix E.

The idea is to marginalize out $P(x^{N}_{n_{k}+1}|t^{k}_{1})$ by considering two complementary events: when the next token $t_{k+1}$ has a prefix $x^{N}_{n_{k}+1}$ ( $\mathrm{b_{val}}$ in the Branch Step) versus when the next token $t_{k+1}$ is contained within $x^{N}_{n_{k}+1}$ ( $\mathrm{p_{val}}$ in the Pass Step). Formally, MPC computes the following probabilities:

	$\displaystyle\vspace*{-20pt}\mathrm{b_{val}}$	$\displaystyle=P(x^{N}_{n_{k}{+}1},t_{k+1}\in\mathcal{B}(x^{N}_{n_{k}+1}))\big{\|}t^{k}_{1}),$		(2)
	$\displaystyle\mathrm{p_{val}}$	$\displaystyle=P(x^{N}_{n_{k}{+}1},t_{k+1}\notin\mathcal{B}(x^{N}_{n_{k}+1}))\big{\|}t^{k}_{1}),\vspace{-20pt}$		(3)

where $\mathcal{B}(x^{N}_{n_{k}+1}){=}\{t{\in}\mathcal{V}|x^{N}_{n_{k}{+}1}{\in}\mathrm{prefix}(\mathrm{decode}({t}))\}$ and we immediately see that $P(x^{N}_{n_{k}+1}|t^{k}_{1}){=}\mathrm{b_{val}{+}p_{val}}$ .

We provide an intuitive explanation for the algorithm following the example in Figure 2. Here, we would like to compute the probability $P(x^{n_{k}+3}_{n_{k}+1}{=}``bee"|t^{k}_{1})$ . The first possibility is that $``bee"$ is a prefix of the next token, so we search for all such tokens (line 3 in the algorithm) and sum up their probability (line 4), i.e. $\mathrm{b_{val}}{=}P(t_{k+1}{=}``beer"|t^{k}_{1})$ . Figure 2 visualizes this step as branching out the tree by finding all tokens completing the string. Since $``beer"$ is not the only string that contains $``bee"$ , e.g. $``beep",``been"$ , etc. we need to compute the probability for these other scenarios, each of which has $t_{k+1}{=}``b"$ (the first token in $``bee"$ , line 10 and 12) due to maximum prefix encoding. Then, we want to compute the probability that the subsequent string is $``ee"$ (line 13), given the previous $t^{k}_{1}$ and $t_{k+1}{=}``b"$ , which is the output of the MPC algorithm but for $x^{n_{k}+3}_{n_{k}+2}{=}``ee"$ and $t^{k+1}_{1}$ . Formally, in the Passing step: $\mathrm{p_{val}}{=}P(t_{k+1}{=}``b"|t^{k}_{1})P(x^{n_{k}+3}_{n_{k}+2}{=}``ee"|t^{k}_{1},t_{k+1}{=}``b")$ . We continue the procedure until meeting the base case, where the string must be a prefix of the next token (usually, when there is only a single character left). Finally, by computing the sum of the branch and pass steps, we obtain the desired conditional probability $\mathrm{b_{val}}{+}\mathrm{p_{val}}{=}P(x^{n_{k}+3}_{n_{k}+1}{=}``bee"|t^{k}_{1})$ .

4 Experiments

Refer to caption — Figure 2: MPC Visualization. At each recursive call, the Branch step finds tokens that starts with the query string while the Pass step extracts and employs the next token and leftover string for the next recursive call until meeting the base case.

We validate our method on a 3rd order Markov chain experiment with $\mathcal{A}{=}\{``A"{,}``B"\}$ , where we randomly construct the transition matrix and the vocabulary $\mathcal{V}{=}\{``A",``B",``AA",``BAAB",``BBAA",``BBBA",$ $``BA",``BBA"\}$ . We train a LM model using GPT-2 architecture with 6 hidden layers. Since the model is agnostic to the Markov chain order, we average the probability from 100 runs on different context length while fixing the last 3 characters. We compare our method with the baseline estimator $P(x_{n+1}|t^{i}_{1})$ , equivalent to one Branch step in the MPC algorithm. Figure 3 shows the results where the baseline method exhibits significant sampling bias due to tokenization. Following Proposition 2.3, one can clarify the zero probability events output from the baseline estimator. Our method, in contrast, accurately estimates the ground truth probability used to generate the data, showing that it is possible to recover the implicitly learned character information from the tokenized LMs.

5 Conclusion

This work identifies the next-character sampling gap between a tokenized model and a token-free one, which persists even for optimally trained models. We present a probabilistic approach to effectively eliminate this bias without requiring additional training. This closes the sampling gap between tokenized and token-free models, suggesting that language models implicitly absorb character-level information despite being trained solely on tokenized text. This result implies that it is theoretically possible to simulate the behavior of another language model trained using different vocabulary without any fine-tuning, since it is possible to transfer from token-free models to tokenized counterparts.

References

Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Cao & Rimell (2021) Cao, K. and Rimell, L. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2104–2114, 2021.
Chen et al. (2023) Chen, Y., Marchisio, K., Raileanu, R., Adelani, D., Saito Stenetorp, P. L. E., Riedel, S., and Artetxe, M. Improving language plasticity via pretraining with active forgetting. Advances in Neural Information Processing Systems, 36:31543–31557, 2023.
Chirkova et al. (2023) Chirkova, N., Kruszewski, G., Rozen, J., and Dymetman, M. Should you marginalize over possible tokenizations? In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023.
Cleary & Witten (1984) Cleary, J. and Witten, I. Data compression using adaptive coding and partial string matching. IEEE transactions on Communications, 32(4):396–402, 1984.
Cognetta et al. (2024) Cognetta, M., Zouhar, V., Moon, S., and Okazaki, N. Two counterexamples to $\backslash$ textit $\{$ Tokenization and the Noiseless Channel $\}$ . arXiv preprint arXiv:2402.14614, 2024.
Dagan et al. (2024) Dagan, G., Synnaeve, G., and Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv preprint arXiv:2402.01035, 2024.
Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Gallé (2019) Gallé, M. Investigating the effectiveness of bpe: The power of shorter sequences. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp. 1375–1381, 2019.
Gemini (2023) Gemini, T. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Goldman et al. (2024) Goldman, O., Caciularu, A., Eyal, M., Cao, K., Szpektor, I., and Tsarfaty, R. Unpacking tokenization: Evaluating text compression and its correlation with model performance. arXiv preprint arXiv:2403.06265, 2024.
Goyal et al. (2023) Goyal, S., Ji, Z., Rawat, A. S., Menon, A. K., Kumar, S., and Nagarajan, V. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2023.
guidance ai (2023) guidance ai. Guidance ai, 2023. URL https://github.com/guidance-ai/guidance. GitHub repository.
Gutierrez-Vasques et al. (2023) Gutierrez-Vasques, X., Bentz, C., and Samardžić, T. Languages through the looking glass of bpe compression. Computational Linguistics, 49(4):943–1001, 2023.
Kudo & Richardson (2018) Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
Liu et al. (2023a) Liu, S., Deng, N., Sabour, S., Jia, Y., Huang, M., and Mihalcea, R. Task-adaptive tokenization: Enhancing long-form text generation efficacy in mental health and beyond. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023a.
Liu et al. (2023b) Liu, Y., Lin, P., Wang, M., and Schütze, H. Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. arXiv preprint arXiv:2311.08849, 2023b.
Makkuva et al. (2024) Makkuva, A. V., Bondaschi, M., Girish, A., Nagle, A., Jaggi, M., Kim, H., and Gastpar, M. Attention with markov: A framework for principled analysis of transformers via markov chains. arXiv preprint arXiv:2402.04161, 2024.
Minixhofer et al. (2024) Minixhofer, B., Ponti, E. M., and Vulić, I. Zero-shot tokenizer transfer. arXiv preprint arXiv:2405.07883, 2024.
Nawrot et al. (2022) Nawrot, P., Chorowski, J., Łańcucki, A., and Ponti, E. M. Efficient transformers with dynamic token pooling. arXiv preprint arXiv:2211.09761, 2022.
Petrov et al. (2024) Petrov, A., La Malfa, E., Torr, P., and Bibi, A. Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36, 2024.
Provilkov et al. (2019) Provilkov, I., Emelianenko, D., and Voita, E. Bpe-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267, 2019.
Rajaraman et al. (2024) Rajaraman, N., Jiao, J., and Ramchandran, K. Toward a theory of tokenization in llms. arXiv preprint arXiv:2404.08335, 2024.
Schmidt et al. (2024) Schmidt, C. W., Reddy, V., Zhang, H., Alameddine, A., Uzan, O., Pinter, Y., and Tanner, C. Tokenization is more than compression. arXiv preprint arXiv:2402.18376, 2024.
Sennrich et al. (2015) Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
Singh & Strouse (2024) Singh, A. K. and Strouse, D. Tokenization counts: the impact of tokenization on arithmetic in frontier llms. arXiv preprint arXiv:2402.14903, 2024.
Song et al. (2020) Song, X., Salcianu, A., Song, Y., Dopson, D., and Zhou, D. Fast wordpiece tokenization. arXiv preprint arXiv:2012.15524, 2020.
Tay et al. (2021) Tay, Y., Tran, V. Q., Ruder, S., Gupta, J., Chung, H. W., Bahri, D., Qin, Z., Baumgartner, S., Yu, C., and Metzler, D. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations, 2021.
Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Willems et al. (1995) Willems, F. M., Shtarkov, Y. M., and Tjalkens, T. J. The context-tree weighting method: Basic properties. IEEE transactions on information theory, 41(3):653–664, 1995.
Xue et al. (2022) Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. Byt5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022.
Yu et al. (2024) Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M. Megabyte: Predicting million-byte sequences with multiscale transformers. Advances in Neural Information Processing Systems, 36, 2024.
Zouhar et al. (2023) Zouhar, V., Meister, C., Gastaldi, J., Du, L., Sachan, M., and Cotterell, R. Tokenization and the noiseless channel. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5184–5207, 2023.

Appendix A Related Work

Theory of Tokenization. Existing works on tokenization generally support the idea that compressing tokens enhances model performance (Gallé, 2019; Gutierrez-Vasques et al., 2023; Zouhar et al., 2023). However, these emperically findings are in conflicted with other later studies Cognetta et al. (2024); Schmidt et al. (2024). On the theoretical side, Rajaraman et al. (2024) examined tokenization through the lens of unigram models, motivated by the observation made by Makkuva et al. (2024) that transformers struggles to learn 2nd-order Markov chains. We, however, do not observe this phenomenon in our experiment. As such, our work on bias due to tokenization is not affected by their observation.

Tokenization and Perplexity. Our work relates to the statistical evaluation of LMs, where we provide an algorithm to directly evaluate the character-level perplexity $p(x^{N}_{n+1}|x^{n}_{1})$ , using a tokenized LM. In terms of token-level perplexity evaluation, some recent studies (Cao & Rimell, 2021; Chirkova et al., 2023) have suggested using stochastic tokenization (Provilkov et al., 2019) at test time to evaluate perplexity scores of LMs ( $p(t^{i}_{1})$ ). However, these evaluations were done on LMs trained with deterministic tokenization which could be suboptimal as demonstrated by our examination of undefined states in Section 2. As such, by utilizing our approach, one can obtain a much more accurate insights on LMs evaluation.

Related Algorithms. Our algorithm is inspired from the literature of universal compression such as prediction by partial matching (Cleary & Witten, 1984) and context-tree weighting (Willems et al., 1995), which have been applied for text prediction but for much simpler settings without any tokenization involved. Recently, Minixhofer et al. (2024); Liu et al. (2023a) propose tokenization adaptation methods, which still requires a heuristic optimization that complicates the training pipeline. Some recent studies have proposed method to target the problem of language models encountering difficulties generating text near prompt boundaries (Dagan et al., 2024; guidance ai, 2023), which bears some resemblance to our proposed algorithm. These methods, however, are heuristic and only applicable to certain scenarios. On the other hand, our bias removal algorithm is theoretically correct, versatile for various situations, and enables conversion between token-free and tokenized LMs due to its accurate representation of conditional sampling distributions.

Appendix B Supporting Theorems on Maximum Prefix Encoding

This section provides supporting theorems for the proof of the main results. We first remind the readers that the set $\mathcal{S}(x^{n}_{1})$ corresponds to the set of all strings that contain $x^{n}_{1}$ as a prefix. Similarly, the event set $\mathcal{S}(t^{i}_{1})$ corresponds to the set of all strings whose first $i$ tokens are $t^{i}_{1}$ . Consider when $t^{i}_{1}=\mathrm{encode}(x^{n}_{1})$ , it should be noted that the two sets $S(t^{i}_{1})$ and $S(x^{n}_{1})$ are not guaranteed to be equivalent. That is because the subsequent characters after $x^{n}_{1}$ can affect the tokenization within the first $n$ character. We illustrate this in more detail in the following example.

Example. Consider the Markov chain example in Section 2, where $\mathcal{V}=\{``AA",``A",``B"\}$ . Then, the string $s_{1}=``AABAABAB"$ , then $s_{1}\in\mathcal{S}(x_{1}=``A")$ and $s_{1}\in\mathcal{S}(t_{1}=``AA")$ since the first character of $s_{1}$ is $``A"$ and the first token of $s_{1}$ is $``AA"$ . On the other hand, $s_{1}\notin\mathcal{S}(t_{1}=\mathrm{encode}(x_{1})=``A")$ since its first token is $``AA"$ , not $``A"$ .

We introduce the Proposition B.1 that contains two facts regarding the MPE process, visually presented in Figure 4.

Proposition B.1.

Let $s$ be a string with the prefix $x^{n}_{1}$ ( $x^{n}_{1}\in\mathrm{prefix}(s)$ ). Define the minimal superstring $r$ to be the prefix of $s$ with the fewest tokens that contains $x^{n}_{1}$ as a prefix: $r=\mathrm{argmin}_{r}(k|t^{k}_{1}=\mathrm{encode}(r)\wedge x^{n}_{1}\in\mathrm{prefix}(r)\wedge r\in\mathrm{prefix}(s))$ . Then, we have the followings:

1.

For $1\leq i<k$ , $\mathrm{encode}(s)_{i}=\mathrm{encode}(x^{n}_{1})_{i}$ . Furthermore, when $r=x^{n}_{1}$ , we also have $\mathrm{encode}(s)_{k}=\mathrm{encode}(x^{n}_{1})_{k}$ .
2.

Let $\ell$ be the number of tokens in $\mathrm{encode}(x^{n}_{1})$ , then we have $\mathrm{decode}(\mathrm{encode}(x^{n}_{1})^{\ell}_{k})\in\mathrm{prefix}(\mathrm{decode}(\mathrm{encode}(s)_{k}))$ .

Proof.

(Result 1.) Proof by contradiction. Let $s$ be the counter-example with the fewest number of tokens. Assume that for $1\leq i<k$ , $\mathrm{encode}(s)_{i}\neq\mathrm{encode}(x^{n}_{1})_{i}$ . Let $j$ be the smallest of such $i$ .

Consider $\mathrm{encode}(s)_{j}$ and $\mathrm{encode}(x^{n}_{1})_{j}$ .

•
Case 1: $|\mathrm{decode}(\mathrm{encode}(s)^{j}_{1})|<|x^{n}_{1}|$ .
- –
  
  Case 1.a: $|\mathrm{decode}(\mathrm{encode}(s)_{j})|<|\mathrm{decode}(\mathrm{encode}(x^{n}_{1})_{j})|$ . This leads to a contradiction, since $x^{n}_{1}$ is a prefix of $s$ , therefore a longest prefix matching algorithm would always generate the longer token ( $\mathrm{encode}(x^{n}_{1})_{j}$ ) over the shorter one ( $\mathrm{encode}(s)_{j}$ ) when it is available.
- –
  
  Case 1.b: $|\mathrm{decode}(\mathrm{encode}(s)_{j})|>|\mathrm{decode}(\mathrm{encode}(x^{n}_{1})_{j})|$ . This leads to a contradiction, since $\mathrm{concat}(\mathrm{encode}(s)^{j}_{1})$ is a prefix of $x^{n}_{1}$ (Case 1 assumption), therefore a longest prefix matching algorithm would always generate the longer token ( $\mathrm{encode}(s)_{j}$ ) over the shorter one ( $\mathrm{encode}(x^{n}_{1})_{j}$ ) when it is available.
- –
  
  Case 1.c: $|\mathrm{decode}(\mathrm{encode}(s)_{j})|=|\mathrm{decode}(\mathrm{encode}(x^{n}_{1})_{j})|$ . This means that the two tokens are the same, contradicting our initial assumption.
•

Case 2: $|\mathrm{decode}(\mathrm{encode}(s)^{j}_{1})|\geq|x^{n}_{1}|$ . In this case, $r=\mathrm{decode}(\mathrm{encode}(s)^{j}_{1})$ is a superstring of $x^{n}_{1}$ implying that $k$ is at most $j$ , which contradicts our initial assumption that $1\leq j<k$ .

Finally, in the case $r=x^{n}_{1}$ , this means $\mathrm{decode}(\mathrm{encode}(s)_{k})$ is a suffix of $x^{n}_{1}$ . Since all the tokens before $k$ within $x^{n}_{1}$ has been matched, i.e. $\mathrm{encode}(s)_{i}=\mathrm{encode}(x^{n}_{1})_{i}$ for $1\leq i<k$ , the last token must also match as the result (else, $|r|\neq|x^{n}_{1}|$ , leads to contradiction), we have $\mathrm{encode}(s)_{k}=\mathrm{encode}(x^{n}_{1})_{k}$ .

(Result 2.) The proof idea is that since $r$ contains $x^{n}_{1}$ and any tokens within $r$ and $x^{n}_{1}$ has been matched up to $k{-}1$ , then what is left in $x^{n}_{1}$ must be in the last token in $r$ (which is the $k$ th token of $r$ ). Formally, following Result 1, we have $\mathrm{decode}(\mathrm{encode}(x^{n}_{1})^{k-1}_{1})=\mathrm{decode}(\mathrm{encode}(s)^{k-1}_{1})$ . Since $r$ has $k$ tokens in total and $x^{n}_{1}\in\mathrm{prefix}(r)$ , this means that $\mathrm{decode}(\mathrm{encode}(s)_{k})$ must cover the rest of $x^{n}_{1}$ , i.e. $\mathrm{decode}(\mathrm{encode}(x^{n}_{1})^{\ell}_{k})$ . As the result, we must have $\mathrm{decode}(\mathrm{encode}(x^{n}_{1})^{\ell}_{k})\in\mathrm{prefix}(\mathrm{decode}(\mathrm{encode}(s)_{k}))$ . ∎

We remind the reader the definition of invalid encoding below.

Definition B.2.

(Invalid Encodings) The list of tokens (an encoding) $t^{k}_{1}$ is invalid if $\mathrm{encode}(\mathrm{decode}(t^{k}_{1})){\neq}t^{k}_{1}$ . Otherwise, it is a valid encoding.

Corollary B.3.

$\mathcal{S}(t^{k}_{1})=\emptyset$ if and only if $t^{k}_{1}$ is invalid.

Proof.

We prove each direction as follow.

•

If $\mathcal{S}(t^{k}_{1})=\emptyset$ then $t^{k}_{1}$ is invalid: Since $\mathcal{S}(t^{k}_{1})=\emptyset$ , we know that there exist no string $s$ such that $\mathrm{encode}(s)^{k}_{1}=t^{k}_{1}$ . As such, for $s=\mathrm{decode}(t^{k}_{1})$ , we do not have $\mathrm{encode}(\mathrm{decode}(t^{k}_{1}))=t^{k}_{1}$ , which proves the result.
•

If $t^{k}_{1}$ is invalid then $\mathcal{S}(t^{k}_{1})=\emptyset$ : Let $x^{n}_{1}=\mathrm{decode}(t^{k}_{1})$ and $t^{i}_{1}=\mathrm{encode}(x^{n}_{1})$ . Let $s_{1}\in\mathcal{S}(t^{i}_{1})$ and suppose there exist a string $s_{2}\in\mathcal{S}(t^{k}_{1})$ . Re-running the MPE procedure on $s_{1}$ and $s_{2}$ in parallel, then every time a token is selected within $x^{n}_{1}$ in $s_{1}$ , it must also be selected at the same position in $s_{2}$ as well. Thus, we cannot have $t^{i}_{1}\neq t^{k}_{1}$ , which proves the result.

∎

Appendix C Proof of Proposition 2.3 in the Main Paper

Proposition 2.3 (Token-Induced Zero Probability) Let $t^{i}_{1}$ be a sequence of input tokens. For any invalid encoding $t^{i}_{1}$ , we have $P_{\mathrm{gt}}(t^{i}_{1}){=}0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})$ is undefined. In the case $t^{i}_{1}$ is valid, then $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1}){=}0.0$ if $t^{i+1}_{1}$ is invalid. Furthermore, let $x^{n}_{1}{=}\mathrm{decode}(t^{i}_{1})$ , then for any string $x^{N}_{n+1}$ such that $\mathrm{encode}(\mathrm{concat}(\mathrm{decode}(t^{i}_{1}),x^{N}_{n+1})){\neq}t^{i}_{1}$ , we have $P_{\mathrm{gt}}(x^{N}_{n+1}|t^{i}_{1}){=}0.0.$

Proof.

For the first two statements, we have:

•

For an invalid $t^{i}_{1}$ where $t^{i}_{1}\neq\mathrm{encode}(\mathrm{decode}(t^{i}_{1}))$ , we have $\mathcal{S}(t^{i}_{1})=\emptyset$ , as implied by Corollary B.3. As such, we have $P_{\mathrm{gt}}(t^{i}_{1}){=}0.0$ which leads $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})$ to be an undefined conditional probability .
•

For a valid $t^{i}_{1}$ but invalid $t^{i+1}_{1}$ , we know that $P_{\mathrm{gt}}(t^{i+1}_{1}){=}0.0$ , which results in $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})=0.0$ .

For the last statement, we first note the following:

1.

Note that $P_{\mathrm{gt}}(x^{N}_{n+1},t^{i}_{1})=P_{\mathrm{gt}}(x^{N}_{1},t^{i}_{1})$ where $\mathrm{concat}(x^{n}_{1},x^{N}_{n+1})=x^{N}_{1}$ .
2.

Consider $P_{\mathrm{gt}}(x^{N}_{1},t^{i}_{1})=P_{\mathrm{gt}}(x^{N}_{1})P_{\mathrm{gt}}(t^{i}_{1}|x^{N}_{1})$ , we will prove that $P_{\mathrm{gt}}(t^{i}_{1}|x^{N}_{1})=0.0$ if $\mathrm{encode}(x^{N}_{1})^{i}_{1}\neq t^{i}_{1}$ .

The proof idea for this is shown in Figure 4 (Example 2, Right). Formally:

•

Let $j$ be the first position such that $\mathrm{encode}(x^{N}_{1})_{j}\neq t_{j}$ then we know that $|\mathrm{decode}(\mathrm{encode}(x^{N}_{1})_{j}))|>|\mathrm{decode}(t_{j})|$ (Proposition B.1 (Result 2)).
•

Following Proposition B.1 (Result 2), let $s\in\mathcal{S}(x^{N}_{1})$ , then we know that $\mathrm{decode}(\mathrm{encode}(x^{N}_{1})_{j})$ must be a substring of within another longer token (it cannot be broken down) in $s$ . Hence, no string $s$ will have a $j$ -th token as $t_{j}$ , so $P_{\mathrm{gt}}(t^{i}_{1}|x^{N}_{1})=0.0$ . This completes the proof.

Finally, we note that $P_{\mathrm{gt}}(t^{i}_{1})=0.0$ does not implies $\mathrm{encode}(\mathrm{decode}(t^{i}_{1}))\in\mathcal{V}$ , since it can be due to the original distribution on the character domain. A classic example for this is a Markov model with an absorption state. ∎

Appendix D Proof of Proposition 3.1 and Corollary 3.2 in the Main Paper

Proposition 3.1 Let $s^{*}=x^{n}_{1}$ , where $t^{i}_{1}=\mathrm{encode}(s^{*})=\mathrm{encode}(x^{n}_{1})$ . Then we have $\mathcal{S}(t^{i}_{1})\subset\mathcal{S}(x^{n}_{1})$ , i.e. for any string $s$ where $t^{i}_{1}=\mathrm{encode}(s)^{i}_{1}$ , we have $P(x^{n}_{1}|t^{i}_{1})=1.0$ . In the case $t_{i}\in\mathcal{V}^{*}$ , then we also have that $\mathcal{S}(t^{i}_{1})=\mathcal{S}(x^{n}_{1})$ , i.e. any string $s$ where $x^{n}_{1}\in\mathrm{prefix}(s)$ must have the first $i$ tokens as $t^{i}_{1}$ and $P(t^{i}_{1}|x^{n}_{1})=1.0$ .

Proof.

We prove each case as follow.

1) General Case: There exists a string $s\in\mathcal{S}(x^{n}_{1})$ where $\mathrm{encode}(s)^{i}_{1}\neq t^{i}_{1}$ , following directly from our 1st order Markov chain example in the main paper, i.e. the string $s=``AA"$ has $``A"$ as prefix but have the $t_{1}=``AA"\neq``A"$ . Also, any string $s$ that has the first $i$ tokens as $t^{i}_{1}$ must have the first $n$ characters as $x^{n}_{1}$ , hence $\mathcal{S}(t^{i}_{1})\subset\mathcal{S}(x^{n}_{1})$ and $P(x^{n}_{1}|t^{i}_{1})=1.0$ .

2) $t_{i}\in\mathcal{V}^{*}$ : The proof idea is that, since $t_{i}$ cannot be a part of any token in $\mathcal{V}$ , it is impossible to merge it by appending additional characters after $t_{i}$ . Formally, similar to Proposition B.1:

•

For any string $s\in\mathcal{S}(\mathrm{decode}(t^{i}_{1}))$ , let $\ell$ be the number of tokens in the minimal superstring $r$ of $s$ that contains $x^{n}_{1}$ as a prefix.
•

Following Proposition B.1 (Result 2), we know that $t_{i}$ must be a substring of $\mathrm{decode}(\mathrm{encode}(s)_{\ell})$ .
•

Due to $t_{i}\in\mathcal{V}^{*}$ , then $t_{i}=\mathrm{encode}(s)_{\ell}$ . We also know from Proposition B.1 (Result 1) that $\mathrm{encode}(s)_{i}=t_{i}$ for $1\leq i<\ell$ , this means that $\ell=i$ . This gives us $t^{i}_{1}=\mathrm{encode}(s)^{i}_{1}$ and $P(t^{i}_{1}|x^{n}_{1})=1.0$ .

This completes the proof. ∎

Remarks. We briefly note that the condition $t_{i}\in\mathcal{V}^{*}$ is the sufficient condition. In general, any token sequence $t^{i}_{1}$ that satisfies $\mathcal{S}(t^{i}_{1})=\mathcal{S}(x^{n}_{1})$ will have $P(t^{i}_{1}|x^{n}_{1})=1.0$ . One potential strategy is to find the first index $i=0,1,...k-1$ such that $t^{k}_{k-i}$ cannot be merged into another token in $\mathcal{V}$ .

Corollary 3.2 Following Proposition 3.1, suppose $t_{i}\in\mathcal{V}^{*}$ then we have $P(x^{N}_{n+1}|x^{n}_{1}{)}{=}P(x^{N}_{n{+}1}|t^{i}_{1})$ . Similarly, we also have $P(t^{j}_{i+1}|x^{n}_{1}{)}{=}P(t^{j}_{i+1}|t^{i}_{1})$ .

Proof.

For the first case, we prove through the following equations:

$\displaystyle P(x^{N}_{n{+}1}\|t^{i}_{1})$	$\displaystyle=P(x^{N}_{n{+}1}\|t^{i}_{1},x^{n}_{1})$	(4)
	$\displaystyle=\frac{P(x^{N}_{n+1},t^{i}_{1}\|x^{n}_{1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(5)
	$\displaystyle=\frac{P(x^{N}_{n+1}\|x^{n}_{1})P(t^{i}_{1}\|x^{n}_{1},x^{N}_{n+1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(6)
	$\displaystyle=P(x^{N}_{n+1}\|x^{n}_{1})$	(7)

where the first equality is due to $P(x^{n}_{1}|t^{i}_{1})=1.0$ and the last equality is due to $P(t^{i}_{1}|x^{n}_{1})=1.0$ for $t_{i}\in\mathcal{V}^{*}$ .

Similarly, for the second case, we have:

$\displaystyle P(t^{j}_{i+1}\|t^{i}_{1}{)}$	$\displaystyle=P(t^{j}_{i+1}\|x^{n}_{1},t^{i}_{1})$	(8)
	$\displaystyle=\frac{P(t^{j}_{i+1}\|x^{n}_{1})P(t^{i}_{1}\|x^{n}_{1},t^{j}_{i+1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(9)
	$\displaystyle=P(t^{j}_{i+1}\|x^{n}_{1}),$	(10)

which completes the proof. ∎

Appendix E Proof for The Bias Removal Method

E.1 Refactoring

Our goal is to express the quantity $P(x^{N}_{n+1}|x^{n}_{1})$ using the tokenized LM that outputs the conditional probability $P(t_{i}|t^{i-1}_{1})$ . Let $x^{n_{k}}_{1}\in\mathrm{prefix}(x^{n}_{1})$ where $t^{k}_{1}=\mathrm{encode}(x^{n_{k}}_{1})$ and $t_{k}\in\mathcal{V}^{*}$ . Following Proposition 3.1, any string $s$ with prefix $x^{n_{k}}_{1}$ must have the first $k$ tokens as $t^{k}_{1}$ . We now perform the following factorization:

$\displaystyle P(x^{N}_{n+1}\|x^{n}_{1})$	$\displaystyle=P(x^{N}_{n+1}\|x^{n_{k}}_{1},x^{n}_{n_{k}+1})$	(11)
	$\displaystyle=\frac{P(x^{N}_{n+1},x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}{P(x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}$	(12)
	$\displaystyle=\frac{P(x^{N}_{n_{k}+1}\|x^{n_{k}}_{1})}{P(x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}$	(13)
	$\displaystyle=\frac{P(x^{N}_{n_{k}+1}\|t^{k}_{1})}{P(x^{n}_{n_{k}+1}\|t^{k}_{1})},$	(14)

where the last inequality is due to Corollary 3.2. Finally, we will use the Maximum Prefix Correction (MPE) Algorithm to compute each term in (14) individually. Note that the algorithm does not require $t_{k}\in\mathcal{V}^{*}$ . Here, we explicitly highlight the importance of having $t_{k}\in\mathcal{V}^{*}$ , as it bridges between the character and token domain through Equation (14).

E.2 Maximum Prefix Correction Algorithm

Overview. The MPC algorithm computes $P(x^{N}_{n_{k}+1}|t^{k}_{1})$ . Note that we do not require $t_{k}\in\mathcal{V}^{*}$ in the MPC algorithm. Using marginalization, we have the following:

	$\displaystyle P(x^{N}_{n_{k}+1}\|t^{k}_{1})$	$\displaystyle=\sum_{t\in\mathcal{V}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})$		(15)
		$\displaystyle=\underbrace{\sum_{t{\in}\mathcal{T}_{\mathrm{b_{val}}}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})}_{\mathrm{b_{val}}}+\underbrace{\sum_{t{\in}\mathcal{T}_{\mathrm{p_{val}}}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})}_{\mathrm{p_{val}}}$		(16)

where:

•

$\mathcal{T}_{\mathrm{b_{val}}}=\{t\in\mathcal{V}|x^{N}_{n_{k}+1}\in\mathrm{prefix}(\mathrm{decode}(t))\}$ is the set of tokens that have a prefix $x^{N}_{n_{k}+1}$ .
•

$\mathcal{T}_{\mathrm{p_{val}}}=\{t\in\mathcal{V}|x^{N}_{n_{k}+1}\notin\mathrm{prefix}(\mathrm{decode}(t))\}$ is the ones that do not.

and $\mathcal{T}_{\mathrm{b_{val}}}\cap\mathcal{T}_{\mathrm{p_{val}}}=\emptyset$ .

Branch Step. Here, $\mathrm{b_{val}}$ is the probability that, given the list of previous tokens $t^{k}_{1}$ , the next token of the string $s$ has $x^{N}_{n_{k}+1}$ as a prefix. To compute this term, we obtain $P(t_{k+1}=t|t^{k}_{1})$ for all $t\in\mathcal{V}$ using one model run, then sum the probabilities corresponds to all tokens whose prefix is $x^{N}_{n_{k}+1}$ .

\displaystyle\mathrm{b_{val}}=\sum_{t\in\mathcal{T}_{\mathrm{b_{val}}}}P(t_{k+1}=t|t^{k}_{1}),

(17)

Proof.

To see this, for each summand of $\mathrm{b_{val}}$ in Eq.(16), we have:

	$\displaystyle P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})$	$\displaystyle=P(t_{k+1}=t\|t^{k}_{1})\times P(x^{N}_{n_{k}+1}\|t^{k}_{1},t_{k+1}{=}t)$		(18)
		$\displaystyle=P(t_{k+1}=t\|t^{k}_{1}),$		(19)

where $P(x^{N}_{n_{k}+1}|t^{k}_{1},t_{k+1}{=}t)=1.0$ is due to $x^{N}_{n_{k}+1}\in\mathrm{prefix}(t)$ . This concludes the proof. ∎

Pass Step. Here, $\mathrm{p_{val}}$ is the probability that, given the list of previous tokens $t^{k}_{1}$ , the subsequent string $x^{N}_{n_{k}+1}$ is not a prefix of the next token. Under the MPE, we compute the value $\mathrm{p_{val}}$ as follow:

\displaystyle\mathrm{p_{val}}=P(t_{k+1}=t|t^{k}_{1})\times P(x^{N}_{n_{k+1}+1}|t^{k}_{1},t_{k+1}=t),

(20)

where $t=\mathrm{encode}(x^{N}_{n_{k}+1})_{1}$ and $x^{n_{k+1}}_{n_{k}+1}=\mathrm{decode}(t)$ . That is, during the passing step, there are two subroutines:

1.

Extract the next token $t$ within $x^{N}_{n_{k}+1}$ and compute $P(t_{k+1}=t|t^{k}_{1})$ . If $x^{N}_{n_{k}+1}=\mathrm{decode}(t)$ , then returns $0.0$ since this is not allowed according to the condition required in $\mathcal{T}_{\mathrm{p_{val}}}$ .
2.

Recursively compute $P(x^{N}_{n_{k+1}+1}|t^{k}_{1},t_{k+1}=t)$ .

Proof.

Following Proposition 2.3 for invalid encodings, we only need to consider $t$ such that $t^{k+1}_{1}$ is valid. Under Proposition B.1 for MPE on $x^{N}_{1}$ , only first token of $\mathrm{encode}(x^{N}_{n_{k}+1})$ is allowed (also see Example 2 in Figure 4(Right)). Finally, applying the chain rule of probability, we obtain Equation 20. For the case of non-optimal LM, see Section G.2 for non-optimal LM. This completes the proof. ∎

Base Case. We note that the base case of our algorithm corresponds to the situation where $x^{N}_{n_{k}+1}=\mathrm{decode}(t)$ . In this scenario, we only needs to compute $\mathrm{b_{val}}$ (branching step) while $\mathrm{p_{val}}=0.0$ .

Complexity Analysis. The complexity of our algorithm (number of inferences on the language model) scales with the length of the the query string, i.e. $N-n_{k}$ . Note that the complexity of the summation at the Branching step is relatively cheap compared to the runtime of the language model.

Appendix F Converting Token-Free Language Model to Tokenized Language Model for MPE.

We introduce an algorithm to compute $P(t_{k+1}|t^{k}_{1})$ using a token-free language model $P(x^{N}_{n+1}|x^{n}_{1})$ , despite having no access to any tokenized LM. This approach enables theoretical conversion of a token-free model to a tokenized one. The method involves two stages. First, we refactor the conditional probability similar to the technique presented in Section E. Next, we aggregate the probabilities of all possible strings leading to the desired tokenization. It is important to note that a Markov chain is a special type of autoregressive model, meaning this method can be employed to effortlessly calculate Markov chain transition matrices within the tokenized domain.

F.1 Refactoring

Consider the probability $P(t_{i+1}|t^{i}_{1})$ that we would like to expressed using $P(x^{N}_{n+1}|x^{n}_{1})$ . Let $t_{k}$ be the last token within $t^{i}_{1}$ such that $t_{k}\in\mathcal{V}^{*}$ . We now perform the following factorization:

	$\displaystyle P(t_{i+1}\|t^{i}_{1})$	$\displaystyle=\frac{P(t^{i+1}_{k+1}\|t^{k}_{1})}{P(t^{i}_{k+1}\|t^{k}_{1})}$		(21)
		$\displaystyle=\frac{P(t^{i+1}_{k+1}\|x^{n_{k}}_{1})}{P(t^{i}_{k+1}\|x^{n_{k}}_{1})},$		(22)

where $x^{n_{k}}_{1}=\mathrm{decode}(t^{k}_{1})$ . The second equality is due to Corollary 3.2. Each term can then be computed using the aggregation procedure shown next.

F.2 Aggregation.

In this step, we would like to compute $P(t^{i}_{k+1}|x^{n_{k}}_{1})$ where $\mathrm{encode}(x^{n_{k}}_{1})=t^{k}_{1}$ and $t_{k}\in\mathcal{V}^{*}$ , using the token-free representation $P(x_{n+1}|x^{n}_{1})$ . Here, we denote $\mathrm{decode}(t^{i}_{k+1})=x^{n_{i}}_{n_{k}+1}$ and $M=\max_{t\in\mathcal{V}}|\mathrm{decode}(t)|$ be the length of the longest token in $V$ and $\Omega=\mathcal{A}^{M}$ is the enumeration of all string of length $M$ .

Computing $P(t^{i}_{k+1}|x^{n_{k}}_{1})$ involves considering all possible strings $s$ with prefix $x^{n_{i}}_{1}$ and $t^{i}_{k+1}=\mathrm{encode}(s)^{i}_{k+1}$ . Although iterating through every possible string is infeasible, we can restrict our search by only examining strings with length $|s|=n_{i}+M$ , as any additional string beyond this point will not impact the tokenization of prefix $x^{n_{i}}_{1}$ due to $M$ being the maximum token length. Formally, we will show that one can express $P(t^{i}_{k+1}|x^{n_{k}}_{1})$ as follows:

\displaystyle P(t^{i}_{k+1}|x^{n_{k}}_{1}){=}\sum_{s^{\prime}\in\mathcal{A}^{M}}P(x^{n_{i}+M}_{n_{k}+1}{=}c_{1}(s^{\prime})|x^{n_{k}}_{1})\mathds{1}(t^{i}_{k+1}{=}\mathrm{encode}(c_{2}(s^{\prime}))^{i}_{k+1}),

(23)

where $c_{1}(s^{\prime}):=\mathrm{concat}(x^{n_{i}}_{n_{k}+1},s^{\prime})$ and $c_{2}(s^{\prime}):=\mathrm{concat}(x^{n_{i}}_{1},s^{\prime})$ . The first term can be computed using the given token-free LM, i.e. $P(x^{n_{i}+M}_{n_{k}+1}|x^{n_{k}}_{1})$ . The second term is an indicator function that checks whether $t^{i}_{k+1}=\mathrm{encode}(s)^{i}_{k+1}$ and can be computed deterministically.

Proof.

We have:

$\displaystyle P(t^{i}_{k+1}\|x^{n_{k}}_{1})$	$\displaystyle=P(t^{i}_{k+1},x^{n_{i}}_{n_{k}+1}\|x^{n_{k}}_{1})$	(24)
	$\displaystyle=\sum_{s^{\prime}\in\mathcal{A}^{M}}P(t^{i}_{k+1},x^{n_{i}+M}_{n_{k}+1}{=}c_{1}(s^{\prime})\|x^{n_{k}}_{1})$	(25)
	$\displaystyle=\sum_{s^{\prime}\in\mathcal{A}^{M}}P(x^{n_{i}+M}_{n_{k}+1}{=}c_{1}(s^{\prime})\|x^{n_{k}}_{1})P(t^{i}_{k+1}\|x^{n_{i}+M}_{1}=c_{2}(s^{\prime}))$	(26)
	$\displaystyle=\sum_{s^{\prime}\in\mathcal{A}^{M}}P(x^{n_{i}+M}_{n_{k}+1}{=}c_{1}(s^{\prime})\|x^{n_{k}}_{1})\mathds{1}(t^{i}_{k+1}{=}\mathrm{encode}(c_{2}(s^{\prime}))^{i}_{k+1})$	(27)

The rest is to prove the following equality:

P(t^{i}_{k+1}|x^{n_{i}+M}_{1}=c_{2}(s^{\prime}))=\mathds{1}(t^{i}_{k+1}{=}\mathrm{encode}(c_{2}(s^{\prime}))^{i}_{k+1})

(28)

We first note that the first $k$ tokens must be $t^{k}_{1}=\mathrm{encode}(x^{n_{k}}_{1})$ due to our condition that $t_{k}\in\mathcal{V}^{*}$ . Since $M$ is the length of the longest token in $\mathcal{V}$ , appending extra characters cannot change the tokenization happened for $x^{n_{i}}_{1}$ . In other words, any string $s$ with prefix $c_{2}(s^{\prime})$ must have the same minimal superstring $r$ containing $x^{n_{i}}_{1}$ (see Proposition B.1). We then apply this principle to the two cases:

•

$t^{i}_{k+1}{=}\mathrm{encode}(c_{2}(s^{\prime}))^{i}_{k+1}$ : In this case, we know that the string must contains the first $i$ tokens as $t^{i}_{1}$ , hence $P(t^{i}_{k+1}|x^{n_{i}+M}_{1}=c_{2}(s^{\prime}))=1.0$
•

$t^{i}_{k+1}{\neq}\mathrm{encode}(c_{2}(s^{\prime}))^{i}_{k+1}$ : In contrast, this case is equivalent to $P(t^{i}_{k+1}|x^{n_{i}+M}_{1}=c_{2}(s^{\prime}))=0.0$ since we are sure that the string do not contains the tokens $t^{i}_{k+1}$ .

This concludes the proof. ∎

F.3 The Markov Chain Example.

We provide a detail computation of the Markov chain example in the main paper. Recall that in the original chain (in the character domain), we have the following:

$\displaystyle P(x_{2}=``A"\|x_{1}=``A")$	$\displaystyle=\alpha$	(29)
$\displaystyle P(x_{2}=``B"\|x_{1}=``A")$	$\displaystyle=1-\alpha$	(30)
$\displaystyle P(x_{2}=``A"\|x_{1}=``B")$	$\displaystyle=\beta$	(31)
$\displaystyle P(x_{2}=``B"\|x_{1}=``B")$	$\displaystyle=1-\beta$	(32)

We also assume the initial probability $\pi=\{\gamma,1-\gamma\}$ for $``A"$ and $``B"$ respectively. In the token domain, let first compute $P(t_{2}=``A"|t_{1}=``AA")$ , where we do not have to do the refactoring step since we know that $t_{1}\in\mathcal{V}^{*}$ . Following the Aggregation step, we have:

$\displaystyle P(t_{2}=``A"\|t_{1}=``AA")$	$\displaystyle=P(x^{6}_{3}=``ABA"\|x^{2}_{1}=``AA")+P(x^{6}_{3}=``ABB"\|x^{2}_{1}=``AA")$	(33)
	$\displaystyle=P(x^{5}_{3}=``AB"\|x^{2}_{1}=``AA")$	(34)
	$\displaystyle=\alpha(1-\alpha),$	(35)

where in the first equality, we do not include the case $x^{6}_{3}="AAA"$ and $x^{6}_{3}="AAB"$ since $\mathrm{encode}("AAA")_{1}=``AA"$ and $\mathrm{encode}("AAB")_{1}=``AA"$ , which are not the token $``A"$ that we are interested in. For other tokens and when $t_{1}=``B"$ , the computation follows the same arguments.

We now consider the case $P(t_{2}=``B"|t_{1}=``A")$ , we can refactor it as:

\displaystyle P(t_{2}=``B"|t_{1}=``A")=\frac{P(t_{2}=``B",t_{1}=``A")}{P(t_{1}=``A")}

(36)

We first compute $P(t_{1}=``A")$ using the aggregation step:

$\displaystyle P(t_{1}=``A")$	$\displaystyle=P(x^{3}_{1}=``ABB")+P(x^{3}_{1}=``ABA")$	(37)
	$\displaystyle=P(x^{2}_{1}=``AB")$	(38)
	$\displaystyle=\gamma(1-\alpha),$	(39)

where we do again include the case $x^{6}_{3}="AAA"$ and $x^{6}_{3}="AAB"$ for the same reason above. For $P(t_{2}=``A",t_{1}=``A")$ we have:

$\displaystyle P(t_{2}{=}``B",t_{1}{=}``A")$	$\displaystyle{=}P(x^{4}_{1}{=}``ABAA")+P(x^{4}_{1}{=}``ABAB")+P(x^{4}_{1}{=}``ABBA")+P(x^{4}_{1}{=}``ABBB")$	(40)
	$\displaystyle{=}P(x^{2}_{1}=``AB")$	(41)
	$\displaystyle{=}\gamma(1-\alpha)$	(42)

which gives us $P(t_{2}=``B"|t_{1}=``A")=1.0$ . Finally, in this specific case, since order of the Markov chain in the character domain is $1$ , we do not need to consider the higher order of the Markov chain in the token domain.

Appendix G On Predictive Distribution of Language Models

In practice, LMs often do not follow Proposition 2.3 due to softmax activations. As such, in our MPC algorithm, when $t\in\mathcal{T}_{\mathrm{p_{val}}}$ and $t\neq\mathrm{encode}(x^{N}_{n_{k}+1})_{1}$ , then $P_{\theta}(x^{N}_{n_{k}+1},t_{k+1}=t|t^{k}_{1})$ may not be $0.0$ (where $\theta$ is the model weights). Eventually, this can potentially increase the complexity of our MPC algorithm during the Passing step.

In this section, we show that given any tokenized LM, we can force its output probabilities to obey Proposition 2.3, without any loss in terms of perplexity score on the token domain. This means that a tokenized LM satisfying Proposition 2.3 will guarantee the correctness of the Passing step in our MPC algorithm.

Finally, before going to the method, we remind the readers that Proposition 3.1 and Corollary 3.2 are factually correct and hold for all $\theta$ . As such, the refactoring step holds regardless.

G.1 Truncate-Renormalization Process

We justify the assumption that our tokenized language model $P_{\theta}(t_{i+1}|t^{i}_{1})$ follows Proposition 2.3. The idea is that we can turn a language model that does not follow Proposition 2.3 to the one that does while guaranteeing that the new model will always result in a lower token-level perplexity score.

We first introduce Proposition G.1. In this proposition, we are given a target discrete probability distribution $p$ where we know some of the values will not happen, says $\Phi^{*}$ . Assume that we have another distribution $q$ that approximates $p$ , then we can produce another distribution $q^{*}$ that is closer to $p$ in terms of KL divergence by setting corresponding probabilities of $q$ in $\Phi^{*}$ to $0.0$ and renormalize it (similar to rejection sampling).

Proposition G.1.

Given a discrete distribution $p=\{p_{1},p_{2},...,p_{m}\}$ and $q=\{q_{1},q_{2},...,q_{m}\}$ with $q_{i}>0.0$ for all $i$ . Let $\Phi=\{i\in\mathbb{Z}|p_{i}=0.0\}$ and $\Phi^{*}\subseteq\Phi$ , we define $q^{*}=\{q^{*}_{1},q^{*}_{2},...,q^{*}_{m}\}$ where $q^{*}_{i}=0.0$ for $i\in\Phi^{*}$ , and $q^{*}_{j}=q_{j}/(\sum_{l\notin\Phi^{*}}q_{l})$ . Then we have:

D_{\mathrm{KL}}(p||q^{*})\leq D_{\mathrm{KL}}(p||q),

(43)

which implies that $q^{*}$ is closer to $p$ than $q$ . We refer to the process of producing $q^{*}$ as truncate-renormalization (TR).

Proof.

Let $Z=(\sum_{l\notin\Phi}q_{l})$ is the normalizing factor in $q^{*}$ . Note that $Z\leq 1$ and as such $\log(Z)\leq 0$ . Then:

$\displaystyle D_{\mathrm{KL}}(p\|\|q^{*})$	$\displaystyle=\sum_{i}p_{i}\log\left(\frac{p_{i}}{q^{*}_{i}}\right)$	(44)
	$\displaystyle=\sum_{i\notin\Phi^{}}p_{i}\log\left(\frac{p_{i}}{q^{}_{i}}\right)\quad\text{, use }0\log 0=0.0$	(45)
	$\displaystyle=\sum_{i\notin\Phi^{*}}p_{i}\log\left(\frac{p_{i}}{q_{i}/Z}\right)$	(46)
	$\displaystyle=\left[\sum_{i\notin\Phi^{*}}p_{i}\log\left(\frac{p_{i}}{q_{i}}\right)\right]+\log(Z)$	(47)
	$\displaystyle\leq\sum_{i\notin\Phi^{*}}p_{i}\log\left(\frac{p_{i}}{q_{i}}\right)=D_{\mathrm{KL}}(p\|\|q),$	(48)

which completes the proof. ∎

Applying to our scenario, for any autoregressive language models $\hat{P}_{\theta}(t_{i+1}|t^{i}_{1})$ that does not follow Proposition 2.3 (due to the softmax activations), we can perform the TR process (since we know which encoding is invalid) to obtain a new LM ${P}_{\theta}(t_{i+1}|t^{i}_{1})$ , which is guaranteed to better approximate the ground-truth model ${P_{\mathrm{gt}}}(t_{i+1}|t^{i}_{1})$ . Thus, we are guaranteed that the token-level perplexity score of ${P}_{\theta}(t_{i+1}|t^{i}_{1})$ is always lower than or equal to $\hat{P}_{\theta}(t_{i+1}|t^{i}_{1})$ .

G.2 On Passing Step in Maximum Prefix Correction Algorithm.

Once our tokenized LM follows Proposition 2.3, it does not alternate the correctness of the Passing step. In other words, under Proposition 2.3, the LM will always output zero probability for invalid encodings $t^{k}_{1}$ . As a result, the Passing step in the MPC algorithm remains the same in this case.

Appendix H Algorithms for Byte Pair Encoding

H.1 Overview

We begin by introducing the Byte-Pair Correction (BPC) Algorithm for bias correction in Byte-Pair Encoding, which is more general than the MPC algorithm and also works for case of MPE. We then follow with a detail analysis to show the correctness of the algorithm.

Here, we introduce the definitions of invalid encodings (for BPE) and cover encodings.

Definition H.1.

(Invalid Encodings) The list of tokens (an encoding) $t^{k}_{1}$ is invalid if $\mathrm{encode}(\mathrm{decode}(t^{k}_{1})){\neq}t^{k}_{1}$ . Otherwise, it is a valid encoding. We denote a valid $t^{k}_{1}$ as $\mathrm{valid}(t^{k}_{1})$ .

Definition H.2.

(Cover Encodings) Given a string $x^{n}_{1}$ , an encoding $t^{k}_{1}$ is said to be covering $x^{n}_{1}$ when all the following conditions satisfied:

1.

$t^{k}_{1}$ is valid.
2.

$x^{n}_{1}\in\mathrm{prefix}(\mathrm{decode}(t^{k}_{1}))$ .
3.

$x^{n}_{i}\in t_{k}$ for some $1\leq i\leq n$ , i.e. the last token $t_{i}$ covers a part of the string $x^{n}_{1}$ .

We denote $\mathrm{cover}(x^{n}_{1})$ to be the set of all cover encodings of $x^{n}_{1}$ and $\vec{t}\in\mathrm{cover}(x^{n}_{1})$ is an encoding in $\mathrm{cover}(x^{n}_{1})$ .

Having established these two definitions, we will later show that for BPE (and MPE), the probability $P(x^{n}_{1})$ can be represented using a tokenized LM $P(t_{i+1}|t^{i}_{1})$ as follows:

P(x^{n}_{1})=\sum_{\vec{t}\in\mathrm{cover}(x^{n}_{1})}P(\vec{t}),

(49)

and the main goal of the BPC algorithm is to search through all cover encodings of $x^{n}_{1}$ . We can then apply this algorithm and compute any conditional probability $P(x^{N}_{n+1}|x^{n}_{1})$ through factorization. Figure 5 (Left) illustrates this with examples cover encodings and invalid/valid encodings.

H.2 Byte-Pair Correction Algorithm

For MPE, the MPC algorithm computes $P(x^{n}_{1})$ by searching all possible valid encodings that cover $x^{n}_{1}$ , where the probability of each encoding are computed using the LMs $P(t_{i+1}|t^{i}_{1})$ through greedy search. However, this does not work for the case of BPE. For example, under the BPE encoding rule of Llama 2, the string $\mathrm{``hello"}$ is tokenized as an individual token while the string $\mathrm{``hellow}"$ is tokenized into 2 tokens $\mathrm{``h}$ and $\mathrm{``ellow}"$ . Note that naive search for all tokens within $\mathcal{V}$ from left to right that cover $x^{n}_{1}$ is computationally expensive.

Algorithm 2 Byte-Pair Correction Algorithm. This algorithm computes

P(x^{n}_{1})

by gradually reducing the search space.

1:procedure compute(

x^{N}_{1}

)

2: //Initialize

P(x^{n}_{1})

P_{out}=0.0

4: //Probability Aggregation

5: for

i=n-1...0

6: //The last token that partially covers

x^{n}_{1}

begins with

x^{n}_{i+1}

\mathcal{B}=\{t\in\mathcal{V}|x^{n}_{i{+}1}{\in}\mathrm{prefix}(\mathrm{decode}({t}))\}

8: //Once the last token position is known, the remaining previous tokens can be determined

t^{k-1}_{1}=\mathrm{encode}(x^{i}_{1})

10: //Similar to the branching step in MPC

11:

\mathrm{b_{val}}={\sum\limits_{t{\in}\mathcal{B}}}P(t_{k}=t\big{|}t^{k-1}_{1})

12:

P_{out}=P_{out}+P(t^{k-1}_{1})\times\mathrm{b_{val}}

13: end for

14:return

P_{out}

15:end procedure

The Byte-Pair Correction (BPC) algorithm, shown in Algorithm 2 and visualized in Figure 5 (right), which is an efficient algorithm that can search all valid encodings covering $x^{n}_{1}$ . The idea is that, for each cover encoding $\vec{t}$ , once the starting position of the last token is determined (say $x_{i+1}$ ), we are guaranteed the prior tokens is unique and must be $\mathrm{encode}(x^{i}_{1})$ . Then one will accept the extracted $\vec{t}$ if it is valid, otherwise reject it. Corollary H.3 provides justification for this procedure. Here, we assume $P(\vec{t})=0.0$ for invalid $\vec{t}$ , see Proposition H.4 and justifications as well as implementation in Appendix G.

Remark. The BPC algorithm can also be applied for the case of MPE. In fact, it is more general than the original MPC algorithm as it only relies on the property of invalid encodings.

H.3 Analysis

Notations. We extend the notation of the vocabulary $\mathcal{V}$ for the case of BPE. Here, $\mathcal{V}$ is an ordered list that determines the merging order in the BPE algorithm. Each $v\in\mathcal{V}$ is a triplet of $(t_{\mathrm{left}},t_{\mathrm{right}},t_{\mathrm{new}})$ which corresponds to the merging tokens (left and right) and the new token. For simplicity, when we write $t\in\mathcal{V}$ , it corresponds to the merged token, i.e. $t_{\mathrm{new}}\in v$ . Finally, the first $|\mathcal{A}|$ entries in $\mathcal{V}$ correspond to the alphabet $\mathcal{A}$ , where no merging will happen.

Byte-Pair Encoding. We revise the encoding rule for BPE, shown in Algorithm 3. In practice, pre-tokenization is often used, where tokens are separated by whitespace or special characters. In this case, we can adjust our vocabulary $\mathcal{V}$ by removing tokens with special characters in the middle of the string.

Algorithm 3 Byte Pair Encoding Algorithm .

1:procedure Encode_BPE(

x^{N}_{1}

\mathcal{V}

)

2: //Set initial encodings:

\mathrm{c\_tokens}=x^{N}_{1}

4: //Iterate over merging order in

\mathcal{V}

, the first

|\mathcal{A}|

entries correspond the the alphabet (no merge happens):

5: for

i=|\mathcal{A}|+1,...|\mathcal{V}|

\mathrm{c\_tokens}\xleftarrow{}\mathrm{find\_merge}(\mathrm{c\_tokens},\mathcal{V}[i])

7: end for

8: return

\mathrm{c\_tokens}

9:end procedure

10:

11:procedure

\mathrm{find\_merge}

(

\mathrm{c\_tokens},v

)

12: // Left and right tokens for merging

13:

t_{\mathrm{left}},t_{\mathrm{right}},t_{\mathrm{new}}=v[1],v[2],v[3]

14:

\mathrm{old\_tokens}=\mathrm{c\_tokens}

15:

\mathrm{new\_tokens}=[]

16: // Find and merge tokens from left to right

17:

j=1

18: while

j<|\mathrm{old\_tokens}|

19: if

\mathrm{old\_tokens}[i,i+1]=t_{\mathrm{left}},t_{\mathrm{right}}

then

20:

\mathrm{new\_tokens.append}(t_{\mathrm{new}})

21:

j=j+2

22: else

23:

\mathrm{new\_tokens.append}(\mathrm{old\_tokens}[i])

24:

j=j+1

25: end if

26: end while

27: return

\mathrm{new\_tokens}

28:end procedure

Overview. We begin our analysis with theoretical results on invalid encodings (Corollary H.3 and Proposition H.4), which characterizes the zero probability events for an optimal tokenized LM. This will allow us to prove the representation of $P(x^{n}_{1})$ , i.e. Proposition H.5, previously shown in Equation (49). Finally, we conclude this section with the proof of correctness of the BPC algorithm, using Proposition H.5 and H.6.

We begin with theoretical results on invalid encodings in the case of BPE.

Corollary H.3.

$\mathcal{S}(t^{k}_{1})=\emptyset$ if and only if $t^{k}_{1}$ is invalid.

Proof.

We prove each direction as follows.

•

If $\mathcal{S}(t^{k}_{1})=\emptyset$ then $t^{k}_{1}$ is invalid: Since $\mathcal{S}(t^{k}_{1})=\emptyset$ , we know that there exist no string $s$ such that $\mathrm{encode}(s)^{k}_{1}=t^{k}_{1}$ . As such, for $s=\mathrm{decode}(t^{k}_{1})$ , we do not have $\mathrm{encode}(\mathrm{decode}(t^{k}_{1}))=t^{k}_{1}$ , which proves the result.
•

If $t^{k}_{1}$ is invalid then $\mathcal{S}(t^{k}_{1})=\emptyset$ : Let $x^{n}_{1}=\mathrm{decode}(t^{k}_{1})$ , we consider two string $s_{1}$ and $s_{2}$ that both have prefix $x^{n}_{1}$ . Furthermore, we assume the first $i$ tokens of $s_{1}$ covers exactly $x^{n}_{1}$ , i.e. $x^{n}_{1}=\mathrm{decode}(t^{i}_{1})$ and similarly, the first $j$ tokens of $s_{2}$ covers exactly $x^{n}_{1}$ , i.e. $x^{n}_{1}=\mathrm{decode}(t^{j}_{1})$ . Then:
1. 1.
  
  Proving invalid $t^{k}_{1}$ leads to $\mathcal{S}(t^{k}_{1})=\emptyset$ is equivalently to proving $t^{i}_{1}=t^{j}_{1}$ for any $s_{1},s_{2}$ .
2. 2.
  
  Re-running the BPE algorithm for $s_{1}$ and $s_{2}$ in parallel, we know that there will be no merge between any suffix of $x^{n}_{1}$ and the rest of strings, i.e. $s_{1}\backslash x^{n}_{1}$ and $s_{2}\backslash x^{n}_{1}$ due to the condition above (See Algorithm 3, line 6).
3. 3.
  
  Furthermore, for $s_{1}$ , any time a merge happens within $x^{n}_{1}$ then the same merge must also happen within $x^{n}_{1}$ for $s_{2}$ and vice versa.
As the result, we have $t^{i}_{1}=t^{j}_{1}$ and they must be equal to $\mathrm{encode}(x^{n}_{1})$ .

∎

Proposition H.4.

(Token-Induced Zero Probability-BPE) Let $t^{i}_{1}$ be a sequence of input tokens. For any invalid encoding $t^{i}_{1}$ , we have $P_{\mathrm{gt}}(t^{i}_{1}){=}0.0$ and the conditional probability $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1})$ is undefined. In the case $t^{i}_{1}$ is valid, then $P_{\mathrm{gt}}(t_{i+1}|t^{i}_{1}){=}0.0$ if $t^{i+1}_{1}$ is invalid.

Proof.

The proof is the same as the MPE version (Proposition 2.3). ∎

Correctness of BPC Algorithm. We show in Proposition H.5 that computing the string probability $P(x^{n}_{1})$ is equivalent to marginalizing the probability of all covering tokens of $x^{n}_{1}$ . As such, the main task of computing $P(x^{n}_{1})$ is to iterate all the valid encodings that cover $x^{n}_{1}$ .

Proposition H.5.

(Prefix Probability Representation) Given a prefix $x^{n}_{1}$ , we have the followings:

1.

For any distinct $\vec{t}_{i},\vec{t}_{j}\in\mathrm{cover}(x^{n}_{1})$ , then $\mathcal{S}(\vec{t}_{i})\cap\mathcal{S}(\vec{t}_{j}){=}\emptyset$ .
2.

$\mathcal{S}(x^{n}_{1})=\smashoperator[]{\bigcup_{\vec{t}\in\mathrm{cover}(x^{n}_{1})}^{}}\mathcal{S}(\vec{t})$ .

As a result, $P(x^{n}_{1})$ can be expressed as the marginal probability of all covering tokens of $x^{n}_{1}$

P(x^{n}_{1})=\sum_{\vec{t}\in\mathrm{cover}(x^{n}_{1})}P(\vec{t})

(50)

Proof.

We prove each point as follows:

1.

Proof by contradiction, suppose that $\mathcal{S}(\vec{t}_{i})\cap\mathcal{S}(\vec{t}_{j})\neq\emptyset$ , then there exists a string $s$ that has two different cover encodings $\vec{t}_{1}$ and $\vec{t}_{2}$ . This is impossible since each string $s$ has only one unique encoding.
2.

This follows the definition of cover encodings.

Since each $\mathcal{S}(\vec{t})$ is pair-wise disjoint, we arrive at the final equation. We illustrate this Proposition in Figure 7. ∎

Finally, we prove that BPC extracts all the cover encodings of $x^{n}_{1}$ . Proposition H.6 shows the correctness of line 9 in the algorithm, where suppose that the last token starts from $x_{i+1}$ , then $\mathrm{encode}(x^{i}_{1})$ must be the encoding before that last token. Since each cover encoding $\vec{t}$ must have a last token cover a suffix within $x^{n}_{1}$ , iterating over all positions from $1$ to $n$ guarantees that we extract all possible encodings.

Proposition H.6.

Let $\vec{t}\in\mathrm{cover}(x^{N}_{1})$ and $k=|\vec{t}|$ is the number of tokens in $\vec{t}$ . Suppose that $x^{N}_{j+1}$ is a prefix of the $t_{k}$ for some $0\leq j\leq N$ , i.e. $x^{N}_{j}\in\mathrm{prefix}(\vec{t}_{k})$ , then $t^{k-1}_{1}=\mathrm{encode}(x^{j}_{1})$ .

Proof.

Since $\vec{t}=t^{k}_{1}\in\mathrm{cover}(x^{N}_{1})$ , then $t^{k-1}_{1}$ must be a valid encoding. As a result, we must have $t^{k-1}_{1}=\mathrm{encode}(x^{j}_{1})$ . ∎

Refactoring. Unlike the case for MPE, identifying the case when $P(x^{N}_{n+1}|t^{i}_{1})=P(x^{N}_{n+1}|x^{n}_{1})$ is nontrivial in general for BPE. Nevertheless, the refactoring step aims to reduce the computational complexity, and in general, we can still compute $P(x^{N}_{n+1}|x^{n}_{1})$ by refactoring.

\displaystyle P(x^{N}_{n+1}|x^{n}_{1})=\frac{P(x^{N}_{1})}{P(x^{n}_{1})},

(51)

and we use the BPC algorithm to compute $P(x^{N}_{1})$ and $P(x^{n}_{1})$ respectively. Note that this is equivalent to assuming $t_{1}$ is a ${<}\mathrm{start}{>}$ token within $\mathcal{V}^{*}$ (and consider $P(x^{N}_{2}|t_{1}),P(x^{n}_{2}|t_{1})$ instead of $P(x^{N}_{1}),P(x^{n}_{1})$ ). When pretokenization is used, e.g. split by white spaces, we can identify when $P(x^{N}_{n+1}|t^{i}_{1})=P(x^{N}_{n+1}|x^{n}_{1})$ by using the pretokenization pattern.

H.4 Experiments

The experiment setup for BPE is the same as the one in Section 4, except we use the vocabulary $\mathcal{V}=\{``A",``B",``B{\cdot}A",``BA{\cdot}A",``B{\cdot}BAA",``A{\cdot}A",``BA{\cdot}BA",``B{\cdot}B"\}$ , where the order within $\mathcal{V}$ is the merging order for the BPE encoding process and the “ ${\cdot}$ ” separates the merging tokens. The result is shown in Figure 7, where our method can accurately recover the ground truth probability $P(x_{n+1}|x^{n}_{1})$ while the baseline fails to. Notice that for the state $``BAA"$ , the baseline approach can output the correct probability, which is because the merging for the token $``BAA"$ happens before any mergs where $``A"$ is the left token happens. This experiment also shows the existence of bias within the BPE process and our method can recover the exact ground truth probability.

$\displaystyle P(x^{N}_{n{+}1}\|t^{i}_{1})$	$\displaystyle=P(x^{N}_{n{+}1}\|t^{i}_{1},x^{n}_{1})$	(4)
	$\displaystyle=\frac{P(x^{N}_{n+1},t^{i}_{1}\|x^{n}_{1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(5)
	$\displaystyle=\frac{P(x^{N}_{n+1}\|x^{n}_{1})P(t^{i}_{1}\|x^{n}_{1},x^{N}_{n+1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(6)
	$\displaystyle=P(x^{N}_{n+1}\|x^{n}_{1})$	(7)

$\displaystyle P(t^{j}_{i+1}\|t^{i}_{1}{)}$	$\displaystyle=P(t^{j}_{i+1}\|x^{n}_{1},t^{i}_{1})$	(8)
	$\displaystyle=\frac{P(t^{j}_{i+1}\|x^{n}_{1})P(t^{i}_{1}\|x^{n}_{1},t^{j}_{i+1})}{P(t^{i}_{1}\|x^{n}_{1})}$	(9)
	$\displaystyle=P(t^{j}_{i+1}\|x^{n}_{1}),$	(10)

$\displaystyle P(x^{N}_{n+1}\|x^{n}_{1})$	$\displaystyle=P(x^{N}_{n+1}\|x^{n_{k}}_{1},x^{n}_{n_{k}+1})$	(11)
	$\displaystyle=\frac{P(x^{N}_{n+1},x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}{P(x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}$	(12)
	$\displaystyle=\frac{P(x^{N}_{n_{k}+1}\|x^{n_{k}}_{1})}{P(x^{n}_{n_{k}+1}\|x^{n_{k}}_{1})}$	(13)
	$\displaystyle=\frac{P(x^{N}_{n_{k}+1}\|t^{k}_{1})}{P(x^{n}_{n_{k}+1}\|t^{k}_{1})},$	(14)

	$\displaystyle P(x^{N}_{n_{k}+1}\|t^{k}_{1})$	$\displaystyle=\sum_{t\in\mathcal{V}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})$		(15)
		$\displaystyle=\underbrace{\sum_{t{\in}\mathcal{T}_{\mathrm{b_{val}}}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})}_{\mathrm{b_{val}}}+\underbrace{\sum_{t{\in}\mathcal{T}_{\mathrm{p_{val}}}}P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})}_{\mathrm{p_{val}}}$		(16)

	$\displaystyle P(x^{N}_{n_{k}+1},t_{k+1}=t\|t^{k}_{1})$	$\displaystyle=P(t_{k+1}=t\|t^{k}_{1})\times P(x^{N}_{n_{k}+1}\|t^{k}_{1},t_{k+1}{=}t)$		(18)
		$\displaystyle=P(t_{k+1}=t\|t^{k}_{1}),$		(19)