An improved Bayesian TRIE based model for SMS text normalization
Abstract
Normalization of SMS text, commonly known as texting language, has been pursued for more than a decade. A probabilistic approach based on the Trie data structure was proposed in the literature and was found to outperform earlier HMM based approaches in predicting the correct alternative for an out-of-lexicon word. However, the success of the Trie-based approach depends largely on how correctly the underlying probabilities of word occurrences are estimated. In this work we propose a structural modification to the existing Trie-based model along with a novel training algorithm and probability generation scheme. We prove two theorems on statistical properties of the proposed Trie and use them to claim that it is an unbiased and consistent estimator of the occurrence probabilities of the words. We further fuse our model into the paradigm of noisy channel based error correction and provide a heuristic to go beyond a Damerau–Levenshtein distance of one.
Keywords: SMS text normalization, Noisy channel, Trie, Theory of estimation
1 Introduction
SMS text normalization focuses on translating texting language, which is often filled with abbreviations and marred by typing errors, into plain English text. With the smartphone revolution and the massive popularisation of social media platforms, users often transmit messages adding up to thousands of words in a single day. However, these text messages contain numerous abbreviations and errors. These often arise due to a lack of formality between users, human error, and, in more severe cases, disabilities. With the increase in screen sizes, this is becoming more of a concern, especially when the user resorts to one-handed typing. Thus, shorter message length and semantic unambiguity end up working antagonistically, which gives shape to a compressed, non-standard form of language called NetSpeak [1], commonly known as the texting language. Unfortunately, traditional NLP methods perform rather poorly when dealing with these kinds of texts [2]. As described in [3], texting language contains many non-standard abbreviations, has unique characteristics and behaves quite differently, which may be the reason for the poor performance of traditional NLP methods.
In [4] the problem was approached by leveraging phonetic information, but this required a dictionary for phonetic transcription. [5] used a dictionary for erroneous words but relied on an ambiguous ranking mechanism. [6] used a completely heuristic rule-based model and employed a dictionary for typos as well; in all three cases the time complexity grows with the size of the dictionary. [7] compared machine translation models such as SMT and NMT and showed their critical dependence on high quality training data. In [8], a Hidden Markov Model based approach was used to model, for each word present in the corpus, all possible normalized or corrected texts and their probabilities of occurrence. This was further used to rank all the possible normalised words, and the highest ranking word was selected as the semantically correct output. In [9], a Trie based probability generation model was proposed and shown to outperform HMM-based models whenever the incorrect word was within an edit distance of one after preprocessing for phonetic errors. However, in some cases the target word did not end up having the highest rank; for example, for the input ‘m8’, the intended target word ‘mate’ was not ranked first in the suggestions list.
Through this work we address the limitations of this Trie based probability generation model. We make a set of improvements over the scheme proposed in [9], which we consider to be the baseline system for the present work. Firstly, unlike [9], where a new word is added to the Trie only once, when it is encountered for the first time during the training phase, a dynamic training environment is used in the proposed model. The Trie is initially constructed with a corpus of the most commonly used words and is then deployed. After deployment, the Trie learns dynamically from the texting habits of the user over time, i.e. it keeps adding the same words repeatedly as and when they are encountered. We show that the new model in the described environment is able to estimate the real probability of occurrence of a word through its own new probability generating algorithm. Hence it facilitates 0-1 loss minimisation in a decision theoretic setup. This in turn minimizes the chances of the target word not being ranked as the highest one.
Additionally, as explained in [10], it is computationally infeasible to generate all the words which are at an edit distance of two from a given word. This limited the spelling correction abilities of the baseline model. In this work we propose some novel heuristics that help in overcoming the above shortcoming significantly.
The contributions of the present work are:
1. We suggest a structural modification to the Trie-based reasoning scheme to improve model accuracy and performance.
2. We prove mathematically that in the described dynamic environment, the expectation of the Trie probability thus generated equals the occurrence probability for each word.
3. We further prove that the Trie probability of each word in the corpus converges almost surely (a.s.) to its occurrence probability.
4. We develop a set of heuristics for error correction.
5. We provide empirical results through simulations to support the presented theorems and highlight the superiority of the new model.
2 Background
From existing literature on the behaviour of users while using texting language [11], one can classify most of the linguistic errors as:
1. Insertion Error: insertion of an extra alphabet in the word, e.g. ‘hoarding’ → ‘hoardingh’.
2. Deletion Error: deletion of an alphabet from the word, e.g. ‘mobile’ → ‘moble’.
3. Replacement Error: replacement of an alphabet by another in the word, e.g. ‘amazing’ → ‘anazing’.
4. Swap Error: swapping of two adjacent characters in the word, e.g. ‘care’ → ‘caer’.
5. Phonetic Error: a group of alphabets is replaced by another group of alphabets or numbers having the same sound, e.g. ‘tomorrow’ → ‘2morrow’.
To deal with these errors, an elaborate error correction algorithm has already been proposed in the baseline model wherein given an erroneous/non-lexical word, a suggestion list of possible target words was prepared. The probability of occurrence for each word in the list was generated using a Trie based algorithm as described in Section 2.2 which was used for ranking the words in the suggestion list. The highest ranking word was given out as the likely intended word.
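To make the suggestion-list construction concrete, the sketch below shows one way of generating all candidates within a Damerau-Levenshtein distance of one of an input word; the function name and implementation details are our own and not taken from [9], and phonetic substitutions would be handled in a separate preprocessing step.

```python
import string

def edit_distance_one(word: str) -> set:
    """All strings within Damerau-Levenshtein distance one of `word`,
    covering the insertion, deletion, replacement and swap errors above."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = {L + R[1:] for L, R in splits if R}
    swaps    = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts  = {L + c + R for L, R in splits for c in letters}
    return deletes | swaps | replaces | inserts

# e.g. 'mobile' is recovered from the deletion error 'moble' via a single
# insertion: 'mobile' in edit_distance_one('moble') -> True
```

The candidates generated this way would then be filtered against the lexicon stored in the Trie and ranked as described below.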
2.1 Design of the TRIE
A Trie is a memory efficient data structure primarily used for storing words. The search complexity in the data structure is $O(l)$, where $l$ represents the number of characters in the longest word stored in the Trie, which makes it a computationally efficient data structure for the problem.
Each node contains an integer variable Count and a Boolean variable End-Of-Word. The Trie is set up by initializing it with a number of English words. Count for each node is initialized to zero and incremented by one each time a word is inserted through that node. End-Of-Word indicates whether a word ends at the given node and is set to True at the ending node of each word. Each time a new word is inserted, the Count variables of the nodes on its path are updated, and if at any point the next node is not already present for the insertion of a character, a new node is created and added as a child. At the end of the new word, End-Of-Word is switched from False to True.

2.2 TRIE probability of a Word
Assigning probability of occurrence to a word through the Trie is a task of prime importance in the model. In the baseline model this was done in a simplistic manner as described below.
The probability of choosing the $i$-th child node was estimated by dividing its Count variable, referred to as $\mathrm{Count}(i)$, by $\mathrm{Total}(i)$, which was defined as:
(1)  $\mathrm{Total}(i) = \mathrm{Count}(i) + \sum_{k \in \mathrm{Sibling}(i)} \mathrm{Count}(k)$
where $\mathrm{Sibling}(i)$ refers to the set of all the other child nodes of the common parent node. Hence the probability of the $i$-th child node was set to be:
(2)  $P(i) = \dfrac{\mathrm{Count}(i)}{\mathrm{Total}(i)}$
Further, to get the probability of a string from the Trie, the conditional rule of probability was used. For example, consider the string ‘bird’; then
$P(\text{bird}) = P(b) \cdot P(i \mid b) \cdot P(r \mid bi) \cdot P(d \mid bir).$
Hence, from the Count values shown in Fig.1, the Trie probability of ‘bird’ can be computed directly.
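For concreteness, the following is a minimal Python sketch of the baseline Trie of Section 2.1 together with the probability rule of Eqns.(1) and (2); the class and method names are our own.

```python
class Node:
    def __init__(self):
        self.children = {}        # character -> Node
        self.count = 0            # Count variable of Section 2.1
        self.end_of_word = False  # End-Of-Word flag

class BaselineTrie:
    def __init__(self):
        self.root = Node()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.count += 1       # increment Count of every node on the path
        node.end_of_word = True

    def probability(self, word: str) -> float:
        """Trie probability via the conditional rule and Eqns.(1)-(2)."""
        node, prob = self.root, 1.0
        for ch in word:
            total = sum(c.count for c in node.children.values())  # Eqn.(1)
            child = node.children.get(ch)
            if child is None or total == 0:
                return 0.0
            prob *= child.count / total                           # Eqn.(2)
            node = child
        return prob

# After inserting a set of words, probability('bird') evaluates
# P(b) * P(i|b) * P(r|bi) * P(d|bir), exactly as in the example above.
```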
3 The Improved Trie
We consider the user to be modelled as a corpus of words denoted by $W = \{w_1, w_2, \ldots, w_n\}$, where each word $w_i$ has an occurrence probability $p_i$. Note that the training and probability calculations as proposed in the baseline Trie do not account for an important case: when the corpus contains two words, say $w_1$ and $w_2$, with occurrence probabilities $p_1$ and $p_2$, such that $w_1$ is a prefix of $w_2$ or vice versa. For example, consider the two words ’bill’ and ’bills’ in Fig.1. No matter what the occurrence probabilities of those two words are, they will always end up with the same Trie probability, even after training the Trie a large number of times on random samples from $W$. (More generally, in more branched Tries, whenever $w_1$ is a prefix of $w_2$, the Trie probability of $w_1$ is never less than the Trie probability of $w_2$.) While the deviation from the occurrence probabilities might not matter a lot in the example considered, since one word happens to be the plural of the other, it matters much more for pairs such as ’lips’ & ’lipstick’ and ‘dear’ & ‘dearth’.
To overcome this, we propose the introduction of a dummy node. A dummy node is added as a child of every node which has End-Of-Word set to True but has at least one non-dummy node as its child. The Count variable of this dummy node is set to the Count variable of the parent reduced by the Count variables of all the siblings of the new dummy node:
(3)  $\mathrm{Count}(d) = \mathrm{Count}(q) - \sum_{s \in \mathrm{Sibling}(d)} \mathrm{Count}(s)$
where $d$ denotes the dummy node and $q$ denotes the parent of the dummy node. Algorithm I in this section outlines the procedure for training the new Trie as described above.
In addition to the training algorithm, we also propose Algorithm II which is used to generate the Trie probability of a given string.
Further, to mathematically justify the use of the new Trie, the proposed training algorithm and the Trie probability generation algorithm, we present Theorem 3.1, Theorem 3.2 and Corollary 3.3.
Algorithm I: The Training
Input: str, root
Initialization: j ← 0, l ← str.length, temp ← root
For j = 0 to l − 1
  1. If temp.children does not contain a node for str.charAt[j]
     1.1 Insert a new node corresponding to str.charAt[j]
     1.2 Initialize its Counter to 0 and End-Of-Word to False
  2. temp ← child of temp corresponding to str.charAt[j]
  3. temp.Counter ← temp.Counter + 1
temp.End-Of-Word ← True
If a dummy node d is already a child of temp
  1. Increment the Counter of d by 1
Else if temp has no dummy child and has at least 1 non-dummy child
  1. Insert a new dummy node d as a child of temp
  2. Set the Counter of d as in Eqn.(3)
End
Algorithm II: The Trie Probability Generation
Input: str, root
Output: Trie probability of str
Initialization: j ← 0, l ← str.length, temp ← root, prob ← 1
For j = 0 to l − 1
  1. If temp.children does not contain a node for str.charAt[j]
     1.1 return 0
  2. temp ← child of temp corresponding to str.charAt[j]
  3. prob ← prob × Count(temp)/Total(temp)
If temp.End-Of-Word is True
  1. If temp has a dummy child d, prob ← prob × Count(d)/Total(d)
  2. return prob
return 0
End
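For readers who prefer executable code, a compact Python rendering of Algorithms I and II is sketched below; the `DUMMY` sentinel and the class and method names are our own choices, and the dummy-node bookkeeping is performed at the node where the inserted word ends, following Algorithm I and Eqn.(3).

```python
DUMMY = None  # sentinel key for dummy children; any non-character key works

class Node:
    def __init__(self):
        self.children = {}        # char (or DUMMY) -> Node
        self.count = 0
        self.end_of_word = False

class ImprovedTrie:
    def __init__(self):
        self.root = Node()

    def train(self, word: str) -> None:
        """Algorithm I: insert `word`, updating Counts and the dummy child."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.end_of_word = True
        non_dummy = [c for k, c in node.children.items() if k is not DUMMY]
        if DUMMY in node.children:
            node.children[DUMMY].count += 1
        elif non_dummy:
            # Eqn.(3): Count(dummy) = Count(parent) - sum of sibling Counts
            d = Node()
            d.count = node.count - sum(c.count for c in non_dummy)
            node.children[DUMMY] = d

    def probability(self, word: str) -> float:
        """Algorithm II: Trie probability of `word`, including the dummy factor."""
        node, prob = self.root, 1.0
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                return 0.0
            total = sum(c.count for c in node.children.values())
            prob *= child.count / total
            node = child
        if not node.end_of_word:
            return 0.0
        if DUMMY in node.children:
            total = sum(c.count for c in node.children.values())
            prob *= node.children[DUMMY].count / total
        return prob

# With 'bill' and 'bills' trained in proportion to their occurrence
# probabilities, probability('bill') and probability('bills') now differ,
# unlike in the baseline Trie of Section 2.
```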
Theorem 3.1: Let $W = \{w_1, w_2, \ldots, w_n\}$ denote a corpus of finitely many words. Let $w_i$ be a word of the corpus with an occurrence probability $p_i$ and let $P_T(w_i)$ denote the word probability generated by the Trie. Let the Trie be trained $N$ times after randomly sampling (with replacement) from $W$ such that the Trie is trained with each $w_i$ at least once. Then, $E[P_T(w_i)] = p_i$.
Proof: We use strong induction over the number of words present in the corpus to prove the theorem.
Base Case: Consider a corpus consisting of two words $w_1$ and $w_2$ with occurrence probabilities $p_1$ and $p_2$. Then three cases are possible, as shown in Fig.2.
Case I: The starting characters of $w_1$ and $w_2$ are different. Then, after using the random samples from $W$ to train the Trie $N$ times, the Trie along with all the expected values of the Count variables is as shown in Fig.2(a). Clearly,
(4)  $E[P_T(w_1)] = p_1$
and
(5)  $E[P_T(w_2)] = p_2.$
Case II: Both $w_1$ and $w_2$ have a common string of characters at the front. For illustration, ‘more’ and ‘most’ have ‘mo’ in common at the front. Hence, similar to Case I, it can be argued from Fig.2(b) that Eqn.(4) and Eqn.(5) hold true.
Case III: $w_2$ can be obtained by appending a string of characters at the end of $w_1$ or vice versa. Here, as described earlier, we introduce a dummy node, seen as the blue box in Fig.2(c). Similar to the previous arguments, it can be seen that Eqn.(4) and Eqn.(5) hold.
Hence, for the base case when $n = 2$, the theorem is true.

Induction: Assume that the theorem holds whenever the corpus contains at most $m$ words, for some positive integer $m \ge 2$. We now show that the theorem holds true for any corpus of size $m + 1$. Hence assume that we are given the words $w_1, w_2, \ldots, w_{m+1}$ with occurrence probabilities $p_1, p_2, \ldots, p_{m+1}$ respectively.
Case I: Similar to Case I and Case II of the base case, all the $m+1$ words branch out at the same node, as in Fig.3. The figure also shows the expected values of the Count variables after training the Trie $N$ times.

Hence we can easily conclude that, for each $i \in \{1, \ldots, m+1\}$,
(6)  $E[P_T(w_i)] = p_i,$
since $\sum_{i=1}^{m+1} p_i = 1$.
Case II: In this case, consider all possible Tries which are not covered by Case I. These Tries are in a way “more branched” than the ones considered in Case I. Consider $T$ to be any such Trie and denote its root by $R$. Notice that there must exist a subtree $T_1$ of $T$ which contains strictly fewer than $m+1$, and at least two, nodes which have their End-Of-Word set as True. It is easy to see that if this were not so, then $T$ would lie in Case I, which is a contradiction. Let us denote the root of $T_1$ by $R_1$. Let us assume that the nodes in $T_1$ for which End-Of-Word is set to True represent the words $w_1, w_2, \ldots, w_r$ with probabilities $p_1, p_2, \ldots, p_r$ respectively, where $2 \le r < m+1$. Now suppose that, instead of having the words $w_1, \ldots, w_r$, $T$ has a word whose characters are defined by travelling from $R$ to $R_1$ in the Trie as shown in Fig.4. Let this prefix word be $w_0$ with a probability of $p_0$. Hence $T$ now effectively contains strictly fewer than $m+1$ words. Hence, by the induction hypothesis, when we train $T$ a total of $N$ times such that each one of the words is used for training at least once, the expectation of the Trie probabilities matches the occurrence probabilities. This means that for each $w_i$ whose last node doesn’t lie in $T_1$, Eqn.(6) holds true.
Also, at the end of the training,
(7)  $E[\mathrm{Count}(R_1)] = N p_0,$
where $p_0 = p_1 + p_2 + \cdots + p_r$. We can now consider $T_1$ to be a standalone Trie by itself, consisting of $r$ words but truncated from the front so as to remove the characters of $w_0$. Let these truncated words in $T_1$ be denoted by $w'_1, w'_2, \ldots, w'_r$ with occurrence probabilities $p'_1, p'_2, \ldots, p'_r$. Now notice that $T_1$ has been trained $\mathrm{Count}(R_1)$ times and, since $r < m+1$, the expectation of the probabilities generated by $T_1$ will be the same as the occurrence probabilities of the truncated words, which by the conditional rule of probability are given by:
(8)  $p'_i = \dfrac{p_i}{p_0}, \qquad i = 1, \ldots, r.$

Hence, as described earlier, using induction, for each $i \in \{1, \ldots, r\}$,
(9)  $E[P_{T_1}(w'_i)] = p'_i.$
Let us consider the entire tree $T$ with its original words $w_1, \ldots, w_r$ and not $w_0$. Note that for each $i \in \{1, \ldots, r\}$,
(10)  $P_T(w_i) = P_T(w_0) \cdot P_{T_1}(w'_i).$
One may further notice that $E[P_T(w_i)]$ is the same as $E[P_T(w_0) \cdot P_{T_1}(w'_i)]$, which equals $E[P_T(w_0)] \cdot E[P_{T_1}(w'_i)]$ since the two random variables on the RHS are independent by definition. Also, since $E[P_T(w_0)] = p_0$, we can use Eqn.(9) and Eqn.(10) to get that, for each $i \in \{1, \ldots, r\}$,
$E[P_T(w_i)] = p_0 \cdot \dfrac{p_i}{p_0} = p_i.$
Hence $E[P_T(w_i)] = p_i$ for each $w_i \in W$ when $|W| = m + 1$.
\qed
While Theorem 3.1 justifies the use of the Trie through the equality of the occurrence probability and the expected Trie probability at any stage of training, the next theorem provides an even stronger result.
Theorem 3.2: For each $w_i \in W$, $P_T(w_i) \to p_i$ almost surely as $N \to \infty$.
Proof: Let $A$ denote a node of the Trie and let $W_A \subseteq W$ be the set of words such that each time any one of the words from $W_A$ is sampled and used for training, the Count variable of $A$ is incremented by 1. Let $p_A = \sum_{w_i \in W_A} p_i$.
Define a random variable $X_k$ such that $X_k = 1$ if the $k$-th sampled word belongs to $W_A$, and $X_k = 0$ otherwise. For a fixed node $A$, the $X_k$ are i.i.d. random variables for $k = 1, 2, \ldots, N$. Also, for each $k$, $E[X_k] = p_A$. Note that the Count variable of $A$ is actually $\sum_{k=1}^{N} X_k$. Hence, by using the Strong Law of Large Numbers,
(11)  $\dfrac{1}{N} \sum_{k=1}^{N} X_k \to p_A \quad \text{almost surely as } N \to \infty.$
With the above in mind, we proceed with induction over the number of words in $W$.

Base Case: Let $W$ contain two words $w_1$ and $w_2$ with occurrence probabilities $p_1$ and $p_2$. Then, as in the previous theorem, we get three cases, which can for our purposes be depicted by Fig.5: the two words part ways at a common parent node with two children, say $A_1$ and $A_2$. It is possible for this parent node to be the root node as well. It is also possible for exactly one of $A_1$ or $A_2$ to be a dummy node. This covers all three cases.
Now, for $w_1$,
$P_T(w_1) = \dfrac{\mathrm{Count}(A_1)}{\mathrm{Count}(A_1) + \mathrm{Count}(A_2)} = \dfrac{\mathrm{Count}(A_1)/N}{\left(\mathrm{Count}(A_1) + \mathrm{Count}(A_2)\right)/N}.$
Note that $\mathrm{Count}(A_1)/N \to p_1$ a.s. and that $\mathrm{Count}(A_2)/N \to p_2$ a.s. by Eqn.(11). Hence, by using the Continuous Mapping Theorem [12], we get that
$P_T(w_1) \to \dfrac{p_1}{p_1 + p_2} = p_1 \quad \text{a.s.}$
We can similarly show that $P_T(w_2) \to p_2$ a.s. Hence our base case, i.e. $n = 2$, holds true.
The proof of the induction step is similar to that of Theorem 3.1 and is skipped due to space constraints.
\qed
From Theorem 3.1 and Theorem 3.2, using the fact that almost sure convergence is stronger than convergence in probability, we get the following corollary from the perspective of the theory of estimation.
Corollary 3.3: For each $w_i \in W$, the Trie probability $P_T(w_i)$ is an unbiased and consistent estimator of the occurrence probability $p_i$.
Through the results above, we can infer that the new Trie can learn the occurrence probabilities of any set of words through sufficient training, which in turn implies that the new Trie can adapt to the texting and typing style of any user when deployed on a mobile phone, a laptop, or any other such device.

4 Error Checking Algorithm
In this section, we build upon the error correction schemes used in the baseline model.
4.1 The Bayesian approach
As mentioned earlier in Section 2, in [9] the words in the suggestions list were ranked solely based on the Trie probabilities. However, following [10], we use a Bayesian noisy channel approach to rank the words based on the type of error committed by the user. This requires assignment of probabilities to the types of error for which confusion matrices in [13] can be used. Hence
$P(c \mid e) \propto P(e \mid c) \cdot P(c),$
where $c$ denotes a possible correction for the erroneous word $e$, $P(e \mid c)$ depends on the type of error committed to get $e$ from $c$, and $P(c)$ is the occurrence probability of $c$ estimated using the Trie.
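The ranking step can be sketched as follows; the numbers in `ERROR_LIKELIHOOD` are placeholders standing in for the confusion-matrix estimates of [13], and `trie_probability` stands for the estimator developed in Section 3.

```python
# Placeholder channel probabilities P(e | c) per error type; in practice these
# would come from the confusion matrices of Kernighan et al. [13].
ERROR_LIKELIHOOD = {
    "insertion": 0.20,
    "deletion": 0.25,
    "replacement": 0.30,
    "swap": 0.10,
    "phonetic": 0.15,
}

def rank_corrections(candidates, trie_probability):
    """Rank candidate corrections c for an erroneous word e by
    P(c | e) proportional to P(e | c) * P(c)  (noisy channel).

    `candidates` is an iterable of (correction, error_type) pairs and
    `trie_probability` maps a word to its Trie probability P(c)."""
    scored = [
        (c, ERROR_LIKELIHOOD.get(err, 1e-6) * trie_probability(c))
        for c, err in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Example usage with made-up probabilities:
# rank_corrections([("mate", "phonetic"), ("mare", "replacement")],
#                  lambda w: {"mate": 0.003, "mare": 0.001}.get(w, 0.0))
```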
4.2 Character Bigram
It was observed in [9] that the Trie could generate suggestions only up to a Damerau-Levenshtein (DL) distance of one. This is because generating an exhaustive list of words at a DL distance of two is computationally very expensive, the number of candidate strings growing rapidly with the length of the erroneous word. However, 80% of human errors lie within a DL distance of one and almost all within a DL distance of two [14]. To partially overcome this, we propose a heuristic along the lines of the beam search algorithm [15], wherein we maintain a character probability bigram $B$, where $B[i][j]$ represents the probability of the next letter being ’j’ given that the current letter is ’i’. At each stage of error correction, we provide a score to each candidate word, say $s = c_1 c_2 \ldots c_k$, as:
(12)  $\mathrm{score}(s) = \prod_{t=1}^{k-1} B[c_t][c_{t+1}]$
and set a threshold, say $\lambda$. If the score exceeds $\lambda$, the word is passed on to the next stage of error correction. This ensures that most of the intermediate low-probability words are discarded, while only the ones with a sufficiently high score are passed on further.
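A small sketch of this pruning heuristic is given below; the toy bigram table and the threshold value are illustrative assumptions, and the score follows the product form of Eqn.(12).

```python
from functools import reduce

def bigram_score(word: str, B: dict) -> float:
    """Score of Eqn.(12): product of character-bigram probabilities B[i][j]
    over consecutive letters of `word`."""
    return reduce(lambda acc, pair: acc * B.get(pair[0], {}).get(pair[1], 0.0),
                  zip(word, word[1:]), 1.0)

def prune(candidates, B: dict, threshold: float):
    """Beam-style filter: keep only candidates whose score exceeds the threshold,
    so low-probability intermediate words are discarded early."""
    return [w for w in candidates if bigram_score(w, B) > threshold]

# Example with a toy bigram table; the threshold value is an assumption.
B = {"c": {"a": 0.4}, "a": {"r": 0.3, "e": 0.2}, "r": {"e": 0.5}}
print(prune(["care", "caer"], B, threshold=0.01))   # -> ['care']
```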


5 Simulations
5.1 Almost sure convergence
In light of Theorem 3.2, a simulation was performed where we trained the new Trie using the training and probability generation algorithms defined in Section 3 with the eight words used in Section 2. The Trie probabilities denoted by the bold lines evidently converge to the occurrence probabilities denoted by the dotted lines in Fig.6.
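As a self-contained illustration of the kind of convergence seen in Fig.6, the sketch below simulates the simplest two-word case of Theorem 3.2; the word pair and the occurrence probabilities (0.3 and 0.7) are our own choices and not the eight-word corpus of the figure.

```python
import random

random.seed(0)
p = {"lips": 0.3, "lipstick": 0.7}       # illustrative occurrence probabilities
words, weights = list(p), list(p.values())
count_dummy, count_t = 0, 0              # Counts at the node where 'lips' ends

for n in range(1, 100001):
    w = random.choices(words, weights=weights)[0]
    if w == "lips":
        count_dummy += 1                 # dummy child of the 'lips' end node
    else:
        count_t += 1                     # child for the next character 't' of 'lipstick'
    if n % 25000 == 0:
        # Both words share the full 'lips' path, so the Trie probability of 'lips'
        # reduces to Count(dummy) / (Count(dummy) + Count(t)) at this node.
        print(n, round(count_dummy / (count_dummy + count_t), 4))
# The printed ratios approach 0.3, as Theorem 3.2 predicts.
```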
5.2 Equality of expectation
Similar to the simulation done in Section 5.1, we train 30 Tries and take the average of their probabilities. The bold lines can be seen to converge much faster to the dotted lines in Fig.7 than in Fig.6. This supports Theorem 3.1, which states the equality of the expectation of the Trie probabilities and the occurrence probabilities.
5.3 Comparison
In this simulation, we consecutively train both the new Trie model proposed by us and the baseline Trie model using a corpus of twenty words. The occurrence probabilities assigned to these words are depicted in Fig.8. The new Trie clearly outperforms the baseline Trie. An important observation is that the baseline Trie probabilities sum up to more than one, and hence do not form a valid probability measure.
5.4 Error correction
We use a corpus of the 3000 highest-frequency English words (https://gist.github.com/h3xx/1976236) and use Zipf’s Law, with the exponent characterising the distribution set to 0.25, to fit a probability distribution over the ranked data; this fitting is not needed once deployed, since the model learns the probabilities directly from the user (a sketch of this assignment appears after Table 1). Comparative analysis with the baseline shown in Table 1 clearly showcases the superiority of the new model.
Impure | Target | Top 5 Suggestions | Rank | Rank in [9] |
---|---|---|---|---|
tran | train | than, train, ran, trap, tan | 2 | 5
lng | long | long, lang, log, leg, an, gang | 1 | 4 |
aple | apple | pale, alle, able, apple, ample | 4 | 5 |
beleive | believe | believe, believed, believes | 1 | 1 |
gost | ghost | host, lost, most, past, ghost | 5 | 1 |
moble | noble | noble, nobler, nobles | 1 | 2 |
cuz | cause | use, case, cause, cut, cup | 3 | 8 |
cin | seen | in, can, sin, son, skin | 13 | 17 |
dem | them | them, then, den, chem, the | 1 | 11 |
m8 | mate | mate, might, eight, ate, mare | 1 | 6 |
thx | thanks | the, thy, tax, thanks, them | 4 | 1 |
h8 | hate | hate, height, hare, ate, haste | 1 | - |
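For reference, the Zipf-based assignment of occurrence probabilities described in Section 5.4 can be sketched as follows; the exponent 0.25 is as stated above, while the three-word list stands in for the 3000-word corpus.

```python
def zipf_probabilities(ranked_words, s: float = 0.25):
    """Assign p_r proportional to 1 / r**s to the r-th ranked word and normalize,
    giving a probability distribution over the ranked corpus."""
    weights = [1.0 / (r ** s) for r in range(1, len(ranked_words) + 1)]
    z = sum(weights)
    return {w: wt / z for w, wt in zip(ranked_words, weights)}

# e.g. zipf_probabilities(["the", "of", "and"]) gives the top-ranked word the
# largest probability; with the full 3000-word list this seeds the Trie.
```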
6 Conclusions
In this paper, we first pointed out a limitation in the Trie based probability generating model proposed in the existing literature. To overcome it, we proposed a structural modification, a training algorithm and a probability generating scheme. We further proved rigorously that the new Trie generated probabilities are unbiased and consistent estimators of the occurrence probabilities. These occurrence probabilities vary from user to user, which the new Trie is capable of adapting to. We performed simulations, the results of which strongly support both the presented theorems and demonstrate the superiority of the model in error correction.
References
- Crystal [2006] David Crystal. Language and the Internet. Cambridge University Press, 2 edition, 2006.
- Jurafsky and Martin [2009] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., USA, 2009.
- Kobus et al. [2008] Catherine Kobus, François Yvon, and Géraldine Damnati. Normalizing sms: are two metaphors better than one ? In COLING, 2008.
- Li and Liu [2012] Chen Li and Yang Liu. Normalization of text messages using character- and phone-based machine translation approaches. In INTERSPEECH, 2012.
- Desai and Narvekar [2015] Neelmay Desai and Meera Narvekar. Normalization of noisy text data. Procedia Computer Science, 45, 12 2015.
- Khan and Karim [2012] O. A. Khan and A. Karim. A rule-based model for normalization of sms text. In 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, volume 1, pages 634–641, 2012.
- Veliz et al. [2019] Claudia Matos Veliz, Orphée De Clercq, and Véronique Hoste. Comparing mt approaches for text normalization. In RANLP, 2019.
- Choudhury et al. [2007] Monojit Choudhury, Rahul Saraf, Vijit Jain, Animesh Mukherjee, Sudeshna Sarkar, and Basu Anupam. Investigation and modeling of the structure of texting language. IJDAR, 10:157–174, 12 2007.
- Chatterjee [2019] Niladri Chatterjee. A Trie Based Model for SMS Text Normalization, pages 846–859. 06 2019.
- Kirk [2019] Amanda Kirk. Improving the accuracy of mobile touchscreen qwerty keyboards. 2019.
- Grinter and Eldridge [2001] Rebecca E. Grinter and Margery A. Eldridge. y do tngrs luv 2 txt msg?, pages 219–238. Springer Netherlands, Dordrecht, 2001.
- van der Vaart [2000] A.W. van der Vaart. Asymptotic Statistics. Asymptotic Statistics. Cambridge University Press, 2000.
- Kernighan et al. [1990] Mark Kernighan, Kenneth Church, and William Gale. A spelling correction program based on a noisy channel model. pages 205–210, 01 1990.
- Damerau [1964] Fred J. Damerau. A technique for computer detection and correction of spelling errors. Commun. ACM, 7(3):171–176, March 1964.
- [1977] Speech understanding systems. IEEE Transactions on Professional Communication, PC-20(4):221–225, 1977.