NLP Based Anomaly Detection for Categorical Time Series
Abstract
Identifying anomalies in large multi-dimensional time series is a crucial and difficult task across multiple domains. Few methods exist in the literature that address this task when some of the variables are categorical in nature. We formalize an analogy between categorical time series and classical Natural Language Processing and demonstrate the strength of this analogy for anomaly detection and root cause investigation by implementing and testing three different machine learning anomaly detection and root cause investigation models based upon it.
Keywords— anomaly detection, categorical data, tabular data, time series
1 Introduction
Current time series anomaly detection techniques excel at dealing with continuous data. As more and more mixed-type data have been collected, the need to more thoroughly address the categorical components of time series has increased as well. However, the tools available for anomaly detection and root cause investigation in large, multivariate time series containing categorical variables have not kept pace with the need. Additionally, in most anomaly detection contexts with time series involving many variables, knowing which variables are behaving most anomalously when a model indicates an anomaly is a great benefit to the user, but such explainability is not present in most current models dealing with categorical time series.
1.1 Related work
Recently, a wide variety of machine learning methods have been brought to bear on forecasting and anomaly detection in numerical time series, including ARIMA models [7], neural networks [11, 6] and statistical methods [12, 8]. These methods can often be made to accommodate categorical data through one-hot encoding, but in the authors’ experience the quality of models in this family degrades rapidly as the fraction of the series’ variables that are categorical increases. The fields in which anomaly detection in categorical time series is most developed are intrusion detection in network security and fraud detection [9, 4]. The authors of [9] utilize a transformer architecture for fraud detection inspired by an analogy between finite sequences of discrete variables and words in the domain of Natural Language Processing (NLP).
1.2 Contributions of this Paper
In this paper, we formalize and extend the analogy of multivariate categorical time series to classical concepts in NLP in a way well-suited to anomaly detection in multivariate time series arising from telemetry streams. We design, implement and test three anomaly detection and root cause investigation models based on this analogy. One benefit of our formalization over methods in the current literature is its explainability in terms of an easy ranking of the variables in order of their apparent contribution to a suspected anomaly. A second benefit is that current categorical time series anomaly detection algorithms in the literature are focused on settings with relatively few sensors, and our framework enables construction of models managing large sensor sets, as demonstrated by our results in Section 8.
The remainder of the paper up to the conclusion is organized as follows. In Section 2 we develop the NLP analogy for categorical time series. In Section 3 we describe the sentence-based anomaly scores and the root cause investigation method shared by all of our models. In Sections 4, 5 and 6 we present three anomaly detection and root cause identification models based on this analogy. Sections 7 and 8 respectively contain a description of our test data and the models’ performance on it.
2 Analogy to NLP
In this section we go over the required modifications to concepts and terms of classical NLP and highlight each in italics when it appears the first time. Concepts are illustrated with the time series segment shown in Figure 1. We refer to variables as sensors, the values reported by a sensor as letters and the set of letters said by a single sensor as that sensor’s alphabet.

We regard the same value reported by two different sensors as different letters. Thus, we regard all “2’s” in the Sensor0 column of Figure 1 as the same letter but different from the “2’s” in the Sensor2 column. When distinguishing two different sensors’ letters is necessary, we write, for example, “Sensor0_2” and “Sensor2_2”.
Words are created by fixing a word length, say 4, and defining a word as a sequence of that many consecutive letters that a sensor reports. We index words by their last letter’s position. For example, the word that Sensor0 says at time 5402 consists of the sequence of the letters the sensor said during times 5399 through 5402, outlined in Figure 1.
A sensor’s vocabulary is the set of words it says and the union of the vocabularies of all sensors of a time series is the series’s vocabulary. Since the alphabet of each sensor is unique to that sensor, every two different sensors’ vocabularies are disjoint. Therefore, we may shorten notation by removing the sensor name from each letter and placing it at the start of the word. Thus, we write the word that Sensor0 says at time 5402 as “Sensor0_0_2_1_2”, outlined in Figure 2.
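The word construction can be illustrated with a short sketch. The function name `make_words`, the sample readings and the use of list positions as time stamps are assumptions made for illustration; a word length of 4 is assumed so that the output matches the four-letter example word above.

```python
def make_words(sensor_name, letters, word_length):
    """Return {last_index: word} for every full window of consecutive letters."""
    words = {}
    for end in range(word_length - 1, len(letters)):
        window = letters[end - word_length + 1 : end + 1]
        # The sensor name is written once, at the start of the word.
        words[end] = sensor_name + "_" + "_".join(str(x) for x in window)
    return words

# Hypothetical readings from one sensor; list positions stand in for time stamps.
sensor0 = [0, 2, 1, 2, 2]
words = make_words("Sensor0", sensor0, word_length=4)
print(words[3])  # Sensor0_0_2_1_2
```

Because words are indexed by the position of their last letter, the first `word_length - 1` positions carry no word.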

In order to capture inter-sensor relationships, we regard the set of words that all sensors in a time series say at a given time as a sentence. Figure 2 shows the sentences constructed from the categorical time series in Figure 1. The sentence that the time series says at a given time is read across that time’s row of the dataframe. Sensors of a time series generally have no naturally preferred order, so the words within a given sentence have no natural order unless we artificially order the sensors. Therefore, the analogy to a natural language is not perfect, but our results show that we are able to retain enough temporal and inter-sensor information elsewhere to build highly effective models.
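A minimal sketch of sentence construction follows, assuming the series is held as a dictionary from sensor name to a list of letters of equal length. The name `make_sentences` and the sample data are hypothetical. Sentences are stored as sets to reflect that their words have no natural order.

```python
def make_sentences(series, word_length):
    """series: dict mapping sensor name -> list of letters, all the same length.
    Returns {time_index: set of words}, one word per sensor."""
    n_steps = len(next(iter(series.values())))
    sentences = {}
    for t in range(word_length - 1, n_steps):
        sentence = set()
        for name, letters in series.items():
            window = letters[t - word_length + 1 : t + 1]
            sentence.add(name + "_" + "_".join(map(str, window)))
        sentences[t] = sentence
    return sentences

# Hypothetical three-sensor series with four time steps.
series = {
    "Sensor0": [0, 2, 1, 2],
    "Sensor1": [1, 1, 1, 1],
    "Sensor2": [2, 2, 0, 2],
}
sentences = make_sentences(series, word_length=3)
print(sorted(sentences[3]))
```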
Finally, most applications in natural language processing require an “out of vocabulary” token for words that occur during inference but were never seen during training. In this context, we find it best to include an unknown word token in every sensor’s vocabulary, which we write as “Sensor_unknown_word.” It is also possible for a previously unseen letter to occur during inference, so we introduce a second unknown word token, “Sensor_unknown_letter”, to each sensor’s vocabulary and use Sensor_unknown_word when the unknown word involves only letters seen in training and Sensor_unknown_letter when the word involves a letter never seen in training.
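The choice between the two unknown tokens might be sketched as follows. The function `encode_word` is hypothetical, and we assume that the “Sensor” prefix in the token names stands for the name of the sensor concerned.

```python
def encode_word(sensor, letters, known_words, known_letters):
    """Map an inference-time word to itself or to one of the two unknown tokens."""
    word = sensor + "_" + "_".join(map(str, letters))
    if word in known_words:
        return word
    # A letter never seen in training takes precedence over an unseen word.
    if any(letter not in known_letters for letter in letters):
        return sensor + "_unknown_letter"
    return sensor + "_unknown_word"

# Hypothetical training vocabulary and alphabet for one sensor.
known_words = {"Sensor0_0_2", "Sensor0_2_1"}
known_letters = {0, 1, 2}

seen = encode_word("Sensor0", [0, 2], known_words, known_letters)
new_word = encode_word("Sensor0", [1, 0], known_words, known_letters)
new_letter = encode_word("Sensor0", [2, 7], known_words, known_letters)
print(seen, new_word, new_letter)
```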
3 Sentence-based Anomaly Scores
In the remainder of this paper, we describe and report the performance of three anomaly detection and root cause identification models built with the corpus of sentences resulting from processing a categorical time series into a sequence of sentences. Given a time series $X_{\text{train}}$ of training data and a time series $X$ (using the same sensors) in which we aim to detect anomalies, all three methods rely on computing an anomaly score $a(S_t)$ for the sentence $S_t$ that the inference series says at time $t$, with a higher score indicating more anomalous behavior. To simplify the analysis here, we classify times as nominal or anomalous by setting a threshold factor $\tau$ and flagging as anomalous any time $t$ at which $a(S_t)$ exceeds $\tau$ times the value of the percentile of the anomaly scores observed when inference is performed on the training data or a set of validation data.
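The thresholding rule can be sketched as below. The function names, the nearest-rank percentile definition and all numerical values (a factor of 1.5 and the 75th percentile) are assumptions chosen only to make the example concrete; they are not the settings used in our experiments.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def flag_anomalies(inference_scores, reference_scores, factor, q):
    """Flag each time whose score exceeds factor * (q-th percentile of reference scores)."""
    threshold = factor * percentile(reference_scores, q)
    return [score > threshold for score in inference_scores]

# Hypothetical scores from training (or validation) data and from inference.
reference = [1.0, 2.0, 3.0, 4.0]
flags = flag_anomalies([2.0, 4.5, 5.0], reference, factor=1.5, q=75)
print(flags)  # [False, False, True]
```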
3.1 Root Cause Investigation
In the models we demonstrate, the score $a(S_t)$ is always a sum of word-based anomaly scores for each word in the sentence $S_t$. Since every two sensors’ vocabularies are disjoint, each word is associated with a unique sensor. Also, every sentence contains exactly one word from every sensor’s vocabulary. Therefore, at any given time, sensors can be ranked according to the magnitude of the contribution of the words associated with them. Additionally, for a set $T$ of (possibly non-consecutive) times during which an anomaly is suspected, we define the suspect score of sensor $s$ over those times as the sum of its words’ contributions to $a(S_t)$ over all times $t \in T$. We denote the sensor-specific score over these times as $a_s(T)$. This score ranks sensors by their apparent contribution to a suspected anomaly over a time period. This highly interpretable ranking offers a valuable degree of explainability for users interested in root cause analysis.
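The sensor ranking might be computed as in the following sketch. The function `suspect_scores`, the per-word contributions and the assumption that sensor names contain no underscore (so that the name is the text before the first underscore of a word) are all illustrative.

```python
from collections import defaultdict

def suspect_scores(word_scores_by_time, times):
    """word_scores_by_time: {t: {word: contribution to the score at time t}}.
    Returns (sensor, total contribution) pairs, most suspect first."""
    totals = defaultdict(float)
    for t in times:
        for word, contribution in word_scores_by_time[t].items():
            sensor = word.split("_", 1)[0]  # assumes no underscore in sensor names
            totals[sensor] += contribution
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical word-level contributions at three times.
scores = {
    10: {"Sensor0_0_2": 0.5, "Sensor1_1_1": 2.0},
    11: {"Sensor0_2_1": 0.25, "Sensor1_1_5": 1.5},
    15: {"Sensor0_1_1": 4.0, "Sensor1_5_5": 0.0},
}
ranking = suspect_scores(scores, times=[10, 11])
print(ranking)  # [('Sensor1', 3.5), ('Sensor0', 0.75)]
```

The set of times need not be consecutive; only the times passed in contribute to the ranking.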
3.2 Next Sentence Forecast Models
For the model of Section 4, $a(S_t)$ is calculated only from $S_t$, but the models of Sections 5 and 6 rely on a “next sentence forecast” framework. For both of these models, we fix a lookback length, $L$. At time $t$, the input is a sequence of $L$ consecutive sentences from the inference series $X$, $S_{t-L}, \dots, S_{t-1}$, and the output is a forecast, $\hat{S}_t$, for sentence $S_t$.
The models are trained on nominal data, so $\hat{S}_t$ is a forecast for $S_t$ under nominal conditions as characterized by the training sentences from $X_{\text{train}}$. Therefore, in each model, a numerical measure of the difference between $\hat{S}_t$ and $S_t$ serves as the anomaly score $a(S_t)$.
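The pairing of lookback windows with forecast targets can be sketched as follows. The function `forecast_pairs` is hypothetical, and the indexing convention (the window ends immediately before the target sentence) is an assumption.

```python
def forecast_pairs(sentences, lookback):
    """Return (input_window, target) pairs: `lookback` consecutive
    sentences followed by the sentence to be forecast."""
    pairs = []
    for t in range(lookback, len(sentences)):
        pairs.append((sentences[t - lookback : t], sentences[t]))
    return pairs

# Placeholder labels stand in for sentences, ordered by time.
pairs = forecast_pairs(["S0", "S1", "S2", "S3"], lookback=2)
print(pairs)  # [(['S0', 'S1'], 'S2'), (['S1', 'S2'], 'S3')]
```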
4 Singular Value Decomposition Model
In this section, we develop an anomaly detection model based on the Singular Value Decomposition (SVD) of the term-document matrix of the training corpus. This decomposition has previously been used to uncover latent semantic structure in documents [1] and perform other NLP tasks [14].
4.1 Term Document Matrix
Arbitrarily order the vocabulary and the sentences of the training corpus. The term frequency of word $w_i$ in sentence $S_j$, $\mathrm{tf}_{ij}$, is the number of times that $w_i$ occurs in $S_j$. For the inverse document frequency $\mathrm{idf}_i$ of word $w_i$, we use the formula , where $N$ is the total number of sentences, and $n_i$ is the number of sentences containing word $w_i$. The term frequency matrix for the corpus is the matrix $M$ whose entry in position $(i, j)$ is $\mathrm{tf}_{ij} \cdot \mathrm{idf}_i$. This formula gives all unknown_word and unknown_letter tokens IDF weight , which is larger than that of all other “true” words but can be modified as described in Section 8. Column $j$ of $M$ is a kind of $n$-dimensional representation of the sentence $S_j$, where $n$ is the size of the vocabulary.
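A sketch of the matrix construction is given below. The smoothed inverse document frequency used in the code, $\log\frac{1+N}{1+n_i} + 1$, is an assumption chosen because it stays finite for tokens that never occur in training and gives them the largest weight; it should not be read as the exact formula used for our results. The function name and the toy corpus are likewise hypothetical.

```python
import math

def tfidf_matrix(sentences, vocabulary):
    """Rows follow `vocabulary`, columns follow `sentences` (each a list of words).
    Returns (matrix as a list of rows, idf weight per word)."""
    n_sentences = len(sentences)
    doc_freq = {w: sum(1 for s in sentences if w in s) for w in vocabulary}
    # Assumed smoothed IDF: finite even when a token has document frequency 0.
    idf = {w: math.log((1 + n_sentences) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    matrix = [[s.count(w) * idf[w] for s in sentences] for w in vocabulary]
    return matrix, idf

# Hypothetical two-sensor corpus with one unknown-word token in the vocabulary.
vocabulary = ["A_0_1", "A_1_0", "B_2_2", "A_unknown_word"]
corpus = [["A_0_1", "B_2_2"], ["A_1_0", "B_2_2"], ["A_0_1", "B_2_2"]]
M, idf = tfidf_matrix(corpus, vocabulary)
print(M[2])  # [1.0, 1.0, 1.0]
```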
4.2 Projection via SVD
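Next, fix a dimension $k$ for a subspace onto which the sentences are to be projected and construct the full singular value decomposition of the term-document matrix $M$,
$$M = U \Sigma V^{T}.$$
Let $U_k$ be the matrix consisting of the first $k$ columns of $U$, $\Sigma_k$ be the matrix consisting of the first $k$ rows and columns of $\Sigma$ and $V_k^{T}$ be the matrix consisting of the first $k$ rows of $V^{T}$. There are two key observations.
Observation 1.
If $M_k = U_k \Sigma_k V_k^{T}$ then $M_k$ has rank $k$ and $M_k \approx M$. The approximation is optimal over all matrices of rank $k$ in the sense that if $B$ is any matrix with $\operatorname{rank}(B) = k$ then
$$\lVert M - M_k \rVert_F \le \lVert M - B \rVert_F.$$
Observation 2.
The matrix $U_k$ defines an orthogonal projection $P$ of $\mathbb{R}^n$ to itself, where $n$ is the size of the vocabulary, by the formula
$$P(v) = U_k U_k^{T} v. \tag{1}$$
The image of $P$ is the column space of $U_k$.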
4.3 Anomaly Score and Root Cause
Observations 1 and 2 provide the optimal $k$-dimensional subspace to project onto in the sense of minimizing the average distance between $v$ and its projection $P(v)$ for all columns $v$ of the term-document matrix $M$. The average is minimized not just over projections onto the image of $P$ but over all possible projections onto all possible $k$-dimensional subspaces of $\mathbb{R}^n$. This justifies the interpretation of $P(v)$ as a kind of optimally in-family $k$-dimensional reconstruction of the $n$-dimensional sentence represented by $v$. Therefore, for an arbitrary sentence, $S$, we define its anomaly score by,
(2) |
The decomposition of the anomaly score $a(S)$ in terms of individual word contributions on the right of Equation 2 lets us define the individual sensor scores for root cause investigation with the method described in Section 3.1.
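The projection-based score can be sketched with NumPy as follows. The squared Euclidean residual used here, which splits into one non-negative term per word, is an assumed choice of distance made for illustration; the function names and the toy matrix are also hypothetical.

```python
import numpy as np

def fit_projection(M, k):
    """Return U_k, the first k left singular vectors of the term-document matrix M."""
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    return U[:, :k]

def word_contributions(U_k, v):
    """Per-word squared residual after projecting v onto the column space of U_k."""
    residual = v - U_k @ (U_k.T @ v)
    return residual ** 2

# Hypothetical corpus: 3 words (rows), 3 sentences (columns), word 2 never used.
M = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
U_k = fit_projection(M, k=1)

in_family = word_contributions(U_k, np.array([1.0, 1.0, 0.0]))
unusual = word_contributions(U_k, np.array([0.0, 0.0, 1.0]))
print(in_family.sum(), unusual.sum())
```

Summing the contributions gives the sentence score, and the largest entries point to the words, and hence the sensors, that the subspace reconstructs worst.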
5 End-to-end Transformer Model
5.1 Transformer Architecture
The transformer model was first introduced by Vaswani et al. [13] for the purpose of language translation using an encoder-decoder architecture. We repurpose the encoder portion of this model to perform both encoding and decoding in the same layer. For this conversion, the attention mechanism was modified to perform two types of attention: self-attention and causal attention.
The right side of Figure 3 describes the architecture of each transformer block. Starting from the bottom, we transform each word in the sequence of sentences into an embedding vector. A second embedding vector, called the positional embedding, is then added to each word embedding. This embedding is the same size as the input embedding and is designed to carry information about the position of each word in the sentence. Once the data passes through the embedding layer, the transformer block splits the embedded data based on the number of heads used for each block. The number of heads and the embedding dimension are tunable hyperparameters, constant for each transformer block. The number of transformer blocks in the full model is also a tunable hyperparameter.

5.2 Attention Mechanism
The attention mechanism shown in Figure 3 is composed of two different mechanisms: a self-attention mechanism and a causal attention mechanism. The self-attention is used only on the unmasked words to build a rich representation of the words by looking at future words as well as previous words. On the other hand, the structure of the causal attention used on the masked words ensures that no valuable information is transmitted from future masked words. This causal attention forces the model to build a representation of the masked words based only on the previous words. He et al. [5] have shown that this type of dual attention mechanism outperforms the baseline transformer architecture in machine translation tasks. This architecture also reduces the number of hyperparameters by more than half by implementing both tasks in a single block.
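One possible way to encode the two attention patterns in a single boolean mask is sketched below. This is an illustrative reading of the description above, not the exact mask of our implementation: we assume that unmasked positions attend to every unmasked position, and that masked positions attend only to positions at or before their own. The function name is hypothetical.

```python
def dual_attention_mask(is_masked):
    """allowed[i][j] is True when position i may attend to position j."""
    n = len(is_masked)
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_masked[i]:
                allowed[i][j] = j <= i            # causal: no look-ahead
            else:
                allowed[i][j] = not is_masked[j]  # bidirectional over unmasked words
    return allowed

# Two unmasked context positions followed by two masked positions.
mask = dual_attention_mask([False, False, True, True])
for row in mask:
    print(row)
```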
5.3 Sentence prediction
As shown on the left side of Figure 3, the output dimension of each transformer block is the same as that of the input embedding. This output embedding is then linearly transformed to a vector whose dimension is the size of the vocabulary. The layer has a softmax activation because the target is the one-hot encoding of each word in the input. The output is finally filtered so that only the positions of the masked sentence are passed on to the cross-entropy loss, although this is not shown in the figure. In practice, we compute the loss on each sensor’s vocabulary independently instead of computing the overall loss. We then average the sensor-specific losses to obtain the total loss. This technique comes from Padhi et al. [9], where the researchers implement a global vocabulary and a local vocabulary to compute the cross-entropy loss for each categorical column in their dataset.
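The sensor-wise averaging of the loss can be sketched as follows. The function name and the toy probabilities are hypothetical, and we assume the predicted probabilities have already been normalized over each sensor’s own vocabulary.

```python
import math

def sensor_averaged_loss(probabilities, targets):
    """probabilities: {sensor: {word: predicted probability within that sensor's vocabulary}}
    targets: {sensor: true word}. Returns the mean of the per-sensor cross-entropies."""
    losses = [-math.log(probabilities[s][targets[s]]) for s in targets]
    return sum(losses) / len(losses)

# Hypothetical predictions for one masked sentence with two sensors.
probabilities = {
    "Sensor0": {"Sensor0_0_1": 0.5, "Sensor0_1_1": 0.5},
    "Sensor1": {"Sensor1_2_2": 1.0},
}
targets = {"Sensor0": "Sensor0_0_1", "Sensor1": "Sensor1_2_2"}
loss = sensor_averaged_loss(probabilities, targets)
print(loss)
```

Averaging per sensor keeps a sensor with a large vocabulary from dominating the total loss.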
5.4 Anomaly Score and Root Cause
The trained model generates next sentences based on nominal characteristics exhibited by the training data, so a measure of the difference between the forecast $\hat{S}_t$ and the actual $S_t$ is used as the anomaly score at time $t$. For this, we used the Levenshtein distance, which is the number of changes required to convert the predicted word to the true word. The Levenshtein scores are then added together across the words of the sentence to arrive at $a(S_t)$, as shown in Equation 3, where $w$ is the actual word, $\hat{w}$ is the model’s predicted word and $\operatorname{lev}(w, \hat{w})$ is the Levenshtein distance between $w$ and $\hat{w}$.
$$a(S_t) = \sum_{w \in S_t} \operatorname{lev}(w, \hat{w}) \tag{3}$$
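The Levenshtein-based score can be sketched as below. We assume here that the distance is computed over each word’s sequence of letters, with insertions, deletions and substitutions each counting as one change; the function names and sample words are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (x != y)))  # substitution
        previous = current
    return previous[-1]

def sentence_score(actual, forecast):
    """actual, forecast: {sensor: tuple of letters}. Returns (total, per-sensor distances)."""
    per_sensor = {s: levenshtein(actual[s], forecast[s]) for s in actual}
    return sum(per_sensor.values()), per_sensor

# Hypothetical actual and forecast words for two sensors.
actual = {"Sensor0": (0, 2, 1, 2), "Sensor1": (1, 1, 1, 1)}
forecast = {"Sensor0": (0, 2, 2, 2), "Sensor1": (1, 1, 1, 1)}
total, per_sensor = sentence_score(actual, forecast)
print(total, per_sensor)  # 1 {'Sensor0': 1, 'Sensor1': 0}
```

The per-sensor distances are exactly the word-level contributions needed for the sensor ranking of Section 3.1.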
6 Stand-alone Dense Embeddings
In this section, we describe a method of creating dense vector representations, or embeddings, of the words in a time series’s vocabulary, akin to Word2vec or GloVe representations in classical NLP. We also show their use in a downstream task by demonstrating an anomaly detection model based on these representations.
6.1 Entity Embedding
While existing NLP algorithms could be used to obtain vector representations of words arising from a categorical time series in an unsupervised manner, our situation is different from classical NLP because words in a sentence here have no naturally preferred ordering. We work around this difference by using a trainable entity embedding layer placed before a deep neural network as introduced in [3]. The full model is trained on the task of masked word modeling, and the activations of the trained embedding layer are used as the vector representations.
6.2 Implementation
While more sophisticated architectures could certainly provide performance gains, for demonstration purposes, we used a simple three-layer feed-forward network after the embedding layer. Each training sample consists of an input sentence with a randomly selected word masked to the value 0, and the target is the one-hot encoding of the masked word. As shown in Figure 4, the words in each sentence are first integer encoded with strictly positive integers so that the embedding can accommodate the 0-value mask. This has the further benefit that in downstream tasks, the vector representation corresponding to the value 0 can be used for unknown word tokens’ representations. Once the model is fully trained, the learned parameter matrix of the embedding layer is used as a lookup table for the dense embeddings of the words.
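The preparation of a masked training sample might look like the following sketch. It assumes the mask value is 0, which the strictly positive word codes leave free; the function names, the sorted ordering of the codes and the sample vocabulary are hypothetical.

```python
import random

def build_encoder(vocabulary):
    """Assign strictly positive integer codes so that 0 stays free for the mask."""
    return {word: code for code, word in enumerate(sorted(vocabulary), start=1)}

def masked_sample(sentence, encoder, rng):
    """Return (input codes with one position set to 0, code of the masked word)."""
    codes = [encoder[w] for w in sentence]
    position = rng.randrange(len(codes))
    target = codes[position]
    codes[position] = 0
    return codes, target

encoder = build_encoder({"Sensor0_0_2", "Sensor0_2_1", "Sensor1_1_1"})
codes, target = masked_sample(["Sensor0_0_2", "Sensor1_1_1"], encoder, random.Random(0))
print(codes, target)
```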

6.3 LSTM
The vector representations of the previous section can be used to fuel almost any machine learning model, but we again demonstrate their use on anomaly detection by the next sentence forecasting described in Section 3.2. To demonstrate their “out-of-the-box” applicability to downstream tasks, we used a simple LSTM-based model with two LSTM layers followed by a dense layer with softmax activation to make the final forecast.
The words in each sentence of an input sequence are each embedded according to the embeddings learned above. The resulting 2-dimensional representation of each sentence is flattened to a 1-dimensional representation and the sequence of flattened sentences is the model’s input. Each sentence has exactly one word for each sensor, so the target sentence can be encoded in a kind of “multi-hot” fashion, consisting of a vector of length equal to the total number of words in the full vocabulary with a “1” in the position of every word occurring in the target sentence. The model is trained on the cross-entropy loss.
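The multi-hot target encoding can be sketched as follows, with a hypothetical function name and a hypothetical index over a four-word vocabulary.

```python
def multi_hot(sentence, vocabulary_index):
    """Vector over the full vocabulary with a 1 at each word of the sentence."""
    vector = [0] * len(vocabulary_index)
    for word in sentence:
        vector[vocabulary_index[word]] = 1
    return vector

# Hypothetical vocabulary of two sensors with two words each.
vocabulary_index = {
    "Sensor0_0_2": 0,
    "Sensor0_2_1": 1,
    "Sensor1_1_1": 2,
    "Sensor1_1_5": 3,
}
target = multi_hot({"Sensor0_2_1", "Sensor1_1_1"}, vocabulary_index)
print(target)  # [0, 1, 1, 0]
```

Since each sensor contributes exactly one word, the number of ones in the target always equals the number of sensors.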
6.4 Anomaly Score and Root Cause
For anomaly scoring during inference, we first translate the model’s forecast at time $t$ back into sentence form. Words in each sentence correspond exactly with the sensors of the time series, so as in Section 5.4 we use the sum of the Levenshtein distances between the forecasted word for each sensor and the actual word for that sensor observed in sentence $S_t$ as the anomaly score $a(S_t)$. Since $a(S_t)$ is the sum of word-based contributions, we are able to define the individual sensor scores for root cause investigation with the method described in Section 3.1.
7 Testing Data Sets
The authors of [6] evaluate several anomaly detection algorithms on training/testing time series pairs originating from two assets managed by the NASA Jet Propulsion Laboratory (JPL). They contain a single response sensor and binary command indicators. We evaluate our models on the pairs in which the response sensor reports or fewer values, whose ordinal positions we take as the categories. The train/test pairs used for this analysis are and . For testing series, we use the first time steps to establish the ordinal ranking. Aggregate scores over all data sets are reported in [6], with scores for the several models they evaluated ranging from a low of to a high of . We follow [6] in focusing on models forecasting only the single response variable.
We also use a second collection of four multivariate categorical training/testing time series pairs originating from data streams internal to Lockheed Martin. Each of the testing series has a single anomaly identified by the operator of the device generating the series. These time series contain between and categorical sensors each, which report up to distinct values apiece. The sensors of each time series are also grouped into “engineering subsystems” of sensors associated with different aspects of the device originating the data.
Counting false positives, false negatives and true positives is somewhat challenging due to the fact that we would like to regard each multi-time-step anomaly period as a single anomaly event. We made these calls by hand on a case by case basis, using criteria established in work with subject matter experts (SMEs) who indicated that if a tool flags an anomaly within about of its start (relative to the length of the series) it should be classified as a “successful” identification, hence true positive. If the tool did not flag any time period within about of the start, it is classified as a false negative. Generally the times when models flag anomalies tended to cluster well, so we counted every cluster not overlapping an anomaly as a false positive. Using these conventions, we report and for each model.
8 Model Performance
We found that a good word length for all models except the SVD models on the JPL data sets was . For the SVD model on JPL data sets, worked best, probably because the increased number of words helped compensate for the fairly small dimension of the raw data. For both data sets our threshold factor for flagging anomalies as described in Section 3 is . Hyperparameter choices specific to each model are as follows.
For the SVD model, setting the TFIDF value of the unknown word tokens to twice the maximal value of all “true” words seemed to balance well the importance of unknown letters with general sensitivity to anomalies involving known letters.
For the transformer model, the embedding dimension was , the number of heads was , and transformer blocks were used in the full model.
For the stand alone dense embedding and LSTM model, the embedding dimension for JPL data was and for the Lockheed Martin data it was . The output sizes of the first two layers in the masked word forecasting model were the number of sensors and the average of the number of words and the number of sensors. The LSTM model had two LSTM layers, the first of dimension where is the number of sensors and the second of dimension equal to the average of and the total number of words. The lookback window for both LSTM layers in both cases was .
For the SVD and stand-alone dense embedding models, a separate model was built for each of the seven subsystems in the Lockheed Martin data, and an ensemble voting method was used to flag anomalous periods. For the inline (end-to-end transformer) model, a single model was made on all of the binary sensors of each Lockheed Martin data set. Performance of these models is shown in Table 1.
| Model and data set | | | | | |
|---|---|---|---|---|---|
| SVD on JPL | | | | | |
| SVD on LM | | | | | |
| Inline on JPL | | | | | |
| Inline on LM | | | | | |
| Stand Alone on JPL | | | | | |
| Stand Alone on LM | | | | | |
The JPL data involved only one response sensor, so the suspicious sensor ranking aspects of the models did not apply to that data. For every anomaly in the Lockheed Martin data for which the SMEs identified categorical sensors involved in the root cause, all but one of our models placed more than of the SME-identified anomalous sensors in the top of suspect sensors.
9 Conclusion
This paper has three main contributions. First, we formalize the analogy between categorical time series and texts in classical Natural Language Processing into a full framework in which NLP tools can be applied to general categorical time series with minimal modification. Second, we show how the formalization of words and sentences in this framework enables model explainability useful in root cause investigations into anomalies in categorical telemetry streams. Finally, we demonstrate the applicability of this framework to series with hundreds of categorical sensors by developing and testing three anomaly detection and root cause models within this framework.
Acknowledgment
This work was performed in a cross-company collaboration with a team of researchers at NEC Corporation, who have a forthcoming publication [10].
References
- [1] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman, Indexing by latent semantic analysis, Journal of the American society for information science 41 (1990), no. 6, 391–407.
- [2] Carl Eckart and Gale Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936), no. 3, 211–218.
- [3] Cheng Guo and Felix Berkhahn, Entity embeddings of categorical variables, arXiv preprint arXiv:1604.06737 (2016).
- [4] Robert Gwadera, Mikhail J Atallah, and Wojciech Szpankowski, Reliable detection of episodes in event sequences, Knowledge and Information Systems 7 (2005), no. 4, 415–437.
- [5] Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, and Tie-Yan Liu, Layer-wise coordination between encoder and decoder for neural machine translation, Advances in Neural Information Processing Systems 31 (2018).
- [6] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom, Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding, Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 387–395.
- [7] Guofei Jiang, Haifeng Chen, and Kenji Yoshihira, Modeling and tracking of transaction flow dynamics for fault detection in complex systems, IEEE Transactions on Dependable and Secure Computing 3 (2006), no. 4, 312–326.
- [8] Eamonn Keogh, Jessica Lin, and Ada Fu, HOT SAX: Efficiently finding the most unusual time series subsequence, Fifth IEEE International Conference on Data Mining (ICDM’05), IEEE, 2005, 8 pp.
- [9] Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman, Tabular transformers for modeling multivariate time series, ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 3565–3569.
- [10] Yuan Peng, Chen Haifeng, and Moto Sato, 3D histogram based anomaly detection for categorical sensor data in internet of things, To appear.
- [11] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei, Robust anomaly detection for multivariate time series through stochastic recurrent neural network, Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2828–2837.
- [12] Swee Chuan Tan, Kai Ming Ting, and Tony Fei Liu, Fast anomaly detection for streaming data, Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
- [13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
- [14] Geoffrey Zweig, John C Platt, Christopher Meek, Christopher JC Burges, Ainur Yessenalina, and Qiang Liu, Computational approaches to sentence completion, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2012, pp. 601–610.