Outlier Detection and Spatial Analysis Algorithms

Jacob John School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology
Vellore, India
[email protected]

Abstract

Outlier detection is a significant area in data mining. It can be either used to pre-process the data prior to an analysis or post the processing phase (before visualization) depending on the effectiveness of the outlier and its importance. [1] Outlier detection extends to several fields such as detection of credit card fraud, network intrusions, machine failure prediction, potential terrorist attacks and so on. Outliers are those data points with characteristics considerably different. They deviate from the data set causing inconsistencies, noise and anomalies during analysis and result in modification of the original points. [2] However, a common misconception is that outliers have to be immediately eliminated or replaced from the data set. Such points could be considered useful if analyzed separately as they could be obtained from a separate mechanism entirely making it important to the research question. This study surveys the different methods of outlier detection for spatial analysis. Spatial data or geospatial data are those that exhibit geographic properties or attributes such as position or areas. An example would be weather data such as precipitation, temperature, wind velocity and so on collected for a defined region.

Index Terms:

Spatial Analysis, Spatial Neighborhood, Outlier Detection, Anomaly Identification

I Literature Survey

Success of anomaly detection for spatial mining techniques relies on the definition of the neighborhood which sets the frame of reference. Spatial autocorrelation says that objects that are in proximity to each other tend to behave similarly. Spatial heterogeneity, on the other hand, focuses on the undergoing processes and causes the data within zones to be distinct or uncorrelated.

This paper [3] highlights the limitations of spatial autocorrelation and identifies the importance of spatial heterogeneity coupled with the creation of data sets. It also collates data from water sensors. Furthermore, by identifying abnormally behaving water monitoring sensors, it shows the importance of spatial heterogeneity by evaluating the impact after applying outlier detection and weighing the importance of spatial heterogeneity. Thus, concluding that anomalous behavior of objects can be recorded only when both spatial autocorrelation and heterogeneity are considered, in order to generate neighborhoods (both micro and macro). In addition to this, it aids in retrieving or alleviating (i) spatio-temporal outliers, (ii) spatial outliers and (iii) spatio-temporally coalesced outliers.

Take an example of sensor readings collected. These measure the toxicity levels of the water body. The goal is to identify any anomalous data while obtaining water toxicity levels.

Given in Fig. 1 (a), the neighborhood formation is on the basis of spatial autocorrelation. However, note with the value 80 is also a part due to its proximity to the sensors. Therefore, considering relationships obtaining using adjacency leads to frivolous anomalies such as node with the value 80 being included in the set. However, Fig. 1 (b) is generated using spatial heterogeneity and leads to the anomaly being segregated in to a new neighborhood or cluster. Thereby, emphasizing the importance of spatial and inferential relationships.

Refer to caption — Figure 1: (a): Neighborhood formation only on the basis of spatial autocorrelation (b) Formation taking spatial heterogeneity into consideration.[3]

This study [4] underlines the different approaches for unsupervised outlier detection in a heurisitic manner using data from the US Census. The approaches have been categorized as global – based on the complete data set and local – based on selected data objects. Furthermore, the paper recognizes the need for mentioning which “degree and notion of locality” with clarity when proposing a new method. This is because of its numerous perspectives on locality.

Spatial neighborhood is a special case of locality. The study mentions in order to analyze spatial attributes, $k$ nearest neighbor algorithm are typically due to the data’s uniform size. Fig. 2 (a) and (b) are visualizations curated using statistical approaches. Figure (a) uses Z-statistic for identifying outliers and (b) uses the median score.

This study [5] elucidates and illustrates the application of data cleaners using an exemplified approach. Data cleaners are used to annihilate ‘patchy’ and ‘isolated’ outliers using a robust algorithm described below. Its purpose further extends to the ‘cleaning’ of spatial lattice data with method that uses a find-and-replace form of approach.

The class of data cleaners identified by this paper works as given in (1).

\psi(y_{u,v};M,g^{0},g^{1})=\begin{cases}y_{u,v}\ \text{if}\ |y_{u,v}-g^{0}(y_{u,v})|\leq M\\ g^{1}(y_{u,v})\ \text{otherwise}\\ \end{cases}

(1)

Where, $y_{u,v}$ is an observation. $y_{u,v}$ is considered an outlier if $|y_{u,v}-g^{0}(y_{u,v})|\leq M$ . The data cleaner algorithm replaces the outlier with $g^{1}(y_{u,v})$ when $y_{u,v}$ is an outlier while the rest of the data is unaffected. “A sensible choice for $g^{0}$ is the median of the observed values.” Thus $|y_{u,v}-g^{0}(y_{u,v})|$ can also be referred to as the deviation from the median in this case.

A value for $g^{1}$ in this case, can be used as the median to smoothen the data after removing the outliers. However, the median in this case isn’t calculated for the entire lattice (or data set) but rather the nearest four elements to the outlier that is being replaced. This process starts at $y_{1,1}$ and proceeds row-wise until the entire lattice is covered. Furthermore, since the original data is used for calculating averages rather than the lattice after replacing the elements, the ordering of this process doesn’t impact the detection. This is method is also referred to as the neighboring median cleaner (NMC).

The diagrams below are the application of NMC with respect to botanic data for the Ceutorhynchus assimilis.

This paper [6] considers data available from 22 different monitoring stations located in Basse-Normandie and Haute-Normandie regions in France. The data collected is for the pollutant PM10 and are hourly measurements conducted by an official association called Air Normand. The data set available is from January 1, 2013 to May 31, 2013 which means its relatively new (roughly 6 years from the date of writing this exploratory study). Furthermore, the data is used to illustrate and compare two methods namely a non-stationary spatial approach and a stationary approach based on kriging methodology for detection of univariate and multivariate outliers in the data and then extenuating them.

Given in figure 6 are boxplots of validated PM10 concentrations from Air Normand of the learning set. After observing a relatively homogenous pattern can be inferred as all the medians for the data sets line up within a margin of error.

The first method in identifying outliers and for spatial predictions follows a non-stationary approach as discussed earlier. This uses the k nearest neighborhood algorithm and compares concentrations of a given site with the weighted median of neighboring sites. This method is used in a technique referred to as jackknife as according to the paper, which is leaving out observations which are anomalous to the data set to calculate more accurate estimates

The second method on the other hand, follows a stationary spatial approach based on “kriging the differences between current and reference observations” to give the best linear unbiased prediction of the intermediate observations.

This paper [14] addresses the issues related to multivariate outlier detection using unsupervised learning. Introduced by Kohonen, the self-organized map (SOM) is an artificial neural network (ANN) for dimensionality reduction and clustering of high-dimensionality data. Tools such as median interneuron distance matrix further develops estimates in conjunction with Sammon’s mapping algorithms. These also help in identifying the “first type of outlyingness”. The second level is formed using SOM quantization errors (QEs). In addition, this paper infers after analyzing a number of techniques, based on other trained SOMs, that they work well in collaboration with each other.

Outliers degrade classical principles and cluster analysis by causing discrepancies in cluster center and inflating or deflating variances and hence deteriorating the quality and accuracy of the results obtained. Furthermore, this paper suggests that multivariate outliers are relatively hard to detect and cause the masking and swamping effect and could possibility increase the dimensionality of the data. SOMs, in this case, are relatively fast and inexpensive when the dimensionality of the data is large. Furthermore, this method isn’t model-dependent.

An example [10] provided is the Local Density-Based Spatial Clustering of Applications with Noise (LDBSCAN) and merges both Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Local Outlier Factor (LOF). DBSCAN is meant for clustering algorithms. LOF provides outliers with a numerical value or as the papers describes it, “gives a quantitative measure of outlierness to each object”. [10] It can be inferred that a high LOF leads to a potential discrepant outlier. This method takes into consideration that the multi-granularity deviation factor (MDEF) is a result of the fact that potential outliers are isolated from its neighborhood rather than only the whole dataset. The MDEF copes up with these outlying clusters and helps smoothen the data.

This paper proposes a new method for outlier detection and extends upon the importance of spatial or temporal continuity. The technique proposed is the Spatio-Temporal Behavioral Outlier Factor (ST-BOF). This again is used to measure potential outliers. As compared to LOF that uses all attributes of the data points, ST-BOF only uses the spatial-temporal attributes and behavioral attributes. These are then used to describe the outlierness of the data points and context to define the neighborhood. The definition for ST-BOF is illustrated as in Fig 7.

This paper [13] associates spatial outliers in wind erosion with severe weather events. In this case, high rate of erosion and sedimentation in this case would be considered as outliers in a spatial context. This paper uses pins established in nested grids to obtain data for soil erosion and sedimentation for geospatial localities. Furthermore, the variability in the pin height is a response variable for the explanatory variables sedimentation and erosion. Lastly, the maximum outlier for soil sediments is underlined as 22 cm.

Data exploration was carried out using the geoR, gstat and Sp packages. This was done in order to normalize data into distributions and eliminate outliers at an early stage. Other visualization techniques used were Normal Q-Q plot, histograms, variogram cloud and variogram modeling.

In conclusion, after considering 47 data points, 13 of these were omitted as anomalous outliers. Furthermore, the paper mentions outliers should still be considered despite its abnormal behavior as they still contain vital and effective information in a spatial analysis. They could also act as special cases. Therefore, they should be assessed as a different set rather than eliminating them completely. Figure 8 in this case is one such example where the outliers can be paired and then studied separately.

This paper [7] discusses about wireless sensor networks (WSNs) and how they’ve gained significant attention over the past decade. However, if the density of the network is too high, WSNs are exposed to faults. Henceforth making them vulnerable and allow for malicious attacks to occur. These sensor readings can be inaccurate and unreliable at times and lead to erroneous results. These data points are referred to as outliers, anomalies or abnormal data and cause inconsistency and deviation of the data from the empirical value leading to a lower precision. Figure 9 shows the different sources of such outliers in WSNs. This paper outweighs the different methods of outlier detection and evaluates them on characteristics of “detection mode, architectural structure and correlation extraction.” The aim of the paper is to ensure data quality and reliability of data by securely monitoring the data for suspicious activities.

Statistical based approaches can be successfully implemented provided the right probability distribution model is obtained. They also could encompass temporal attributed to identify outliers. However, these fail when using parametric or non-parametric techniques in real time applications as are only efficient for univariate data and the computational costs are high for multivariate data and real time applications such as WSNs. Artificial Intelligence based approaches are able to “generalize from limited, noisy and incomplete data.” [7] But are hard to develop and fine tune. Spatial and temporal semantics make the rules and fuzzy logic more complex and harder to implement and could poses a load on the memory.

This paper [16] presents a new adaptive approach for spatial point events outlier detection (SPEOD) using the multilevel constrained Delaunay triangulation method. The first step is establishing the spatial proximity relationships among spatial points events using the Delaunay triangle as shown in Figure 10. This forms the spatial point event data set (SPED). This triangulation method has a time complexity of $\mathcal{O}(N\log N)$ .

The next step considers statistical characteristics and is used to establish three-level constraints which are described in order to fine tune spatial proximity relationships. Finally, the spatial points create subgraphs by joining the remaining edges amongst events. Those subgraphs with a relatively fewer number of points are referred to as spatial outliers. By gathering those subgraphs and calculating the average volume and corresponding standard deviation can aid in finding spatial outliers. The time complexity for this process is $\mathcal{O}(N)$ , where N is the number of subgraphs. These steps can be illustrated in Figure 11.

This paper [19] utilizes the Moran’s “I index” which can be generally divided into Global Moran’s I index and Local Moran’s I index for spatial autocorrelation. These methods encompass spatial relationships and characteristics of variables and their adjacency between data points. This paper implements these methodologies on the Geoda 10.0 software platform for analyzing data for cultivated land.

Global Moran’s I index as the name suggests, analyzes overall spatial features over a global region and index. The forumal used in this case is as follows:

I=\frac{n\times\sum_{i}^{n}\sum_{j}^{n}w_{ij}(x_{i}-\bar{x})(x_{j}-\bar{x})}{(\sum_{i}^{n}\sum_{j}^{n}w_{ij})\times(\sum_{i}^{n}(x_{i}-\bar{x})^{2})}

(2)

Where in 2, n are the number of spatial features, $x_{i}$ and $x_{j}$ are the attribute values of feature i and j respectively, $\bar{x}$ is the mean of all the features and, $w_{ij}$ is the weight of the spatial matrix and can only take values 0 and 1.

The range of I is between 0 and 1 and represents the spatial correlation of the dataset. A Z-test could further be used on the test matrix to obtain the acceptance level of significance.

Local Moran’s I analysis can only establish the degree of spatial correlation for an overall data set but cannot identify whether the correlation is positive or negative for a local region. The formula given is as follows:

I_{i}=\frac{x_{i}-\bar{x}}{S^{2}_{i}}\sum_{j=1,j\neq i}^{n}w_{ij}(x_{j}-\bar{x})

(3)

Where,

S^{2}_{i}=\frac{\sum^{n}_{j=1,j\neq 1}(x_{i}-\bar{x})^{2}}{n-1}-\bar{x}^{2}

(4)

In conclusion, using the K nearest neighbors, the corresponding K values were obtained to be 5, 6 and 5 given that the spatial autocorrelation coefficient are of natural index and the economic index is largest for cultivated land. It shows a global Moran’s index of 0.85 which is spatially positive and has a high aggregation degree. This was done at a 5% level of significance for statistical hypothesis testing.

Tested upon data available from Sentinel-1, Multi-temporal InSAR (MT) results show a new approach for detecting multivariate outliers with respect to post-processing of data. This approach [8] demonstrated an expanding point density by a half of the aggregate sum of standard PS points. The MTI technique overcomes the challenge exhibited by the presence of spatial dependencies among observations. Low coherence areas show enhanced details of deformations, noise and imperfections in this model. MTI is particularly useful when the coherence $\in[0,1]$ and is exceeds the threshold of 0.7. Furthermore, parameters such as velocity and height where used to perform these estimates. Figure 12 shows the impact of MTI over the given threshold while maintaining coherence over 0.7.

This paper [9] studies and presents Self-organizing feature maps (SOFM) as a novel approach for outlier detection for non-spatial attributes. This overcomes the inefficiencies of clustering algorithms, which are being sensitive to input data and unreliable outcomes while processing noise. Furthermore, SOFMs also allows the analysis of multi-dimensional data as well.

This paper analyzes the features of the SOFMs and the inadequacies are overcome. This is done by implementing the SOFM in an experiment by organizing it in a hierarchical structure using the standard Iris data set. “The data features include a 3-dimensional data with 500 points and 5 clusters, 4 rational clusters in projection on 2-dimension surface.” An extension of clustering theory and SOFM is proposed by this paper called topological similarity. This is complemented by a mathematical model to accompany this similarity theory. This is called the topological similarity matrix which uses adjacency as parameter.

This paper [20] defines a spatial-temporal outlier as an “object whose non-spatial attribute value is significantly different from those of other objects in its spatial and temporal neighbors.” This paper uses a Hadoop based approach to evaluate and detect outliers with a proposed algorithm claiming that Hadoop could improve performance of the running algorithm. In order to experiment this, the Ningbo sea tide data set is used to test the validity of this claim.

This algorithm proposed uses the Hadoop Distributed File System (HDFS) and MapReduce which are relatively complex and its composed of four sequential-modular MapReduce programs. MapReduce uses the concept of divide and conquer for a parallel computing program. MapReduce in this case, uses K neighbors for outlierness of every spatial location. Furthermore, it’s also for identifying temporal outliers by obtaining the coordinates and non-spatial attributes of the outliers from the original data set. It also arranges the outliers in ascending order.

Despite, its complexity, Hadoop performed much better than the approach without. Figure 13 illustrates this comparison.

This paper [12] discusses the four types of outliers in a geographical setting. Figure 14 illustrates these four different types of settings. Outliers in this case are measured and geographically weighted over Mahalanobis distances or the distance between any two points P and Q and signifies how many standard deviations away is P from the mean of Q. These distances are used to transform data spaces in order to detect outliers. The data set used to assess the algorithm is simulated data bundled with freshwater chemistry data collected all over Great Britain.

Figure 14 shows non-spatial outliers, spatial outliers, relationship outliers and temporal outliers. Non-spatial outliers or univariate data can be simply removed using a boxplot and using inspection to remove the outliers. This is because outliers usually occur as a tail (right or left) in a sampled distribution.

In a bivariate case, the data considered in pairs. An outlier occurs when such a pair in an unusual in relation with respect to all the other data pairs and bagplots are used in this case treat this outlier. However, in the case of multivariate outlier detection, several different methodologies have been presented. One such is the detection with robust Mahalanobis Distances. This can be computed as follows:

MD_{i}=[(x_{i}-\mu)^{\textbf{T}}{\Sigma}^{-1}(x_{i}-\mu)]^{0.5}\ \text{for}\ i=1,...,n

(5)

For low dimensionality data, Detection with Robust Principe Component Analysis (PCA) is considered. In this case, a standard linear algebra states for $p<n$ ,

\textbf{LVL}^{\textbf{T}}=\textbf{R}

(6)

Where, V is the diagonal matrix of eigen values, L is the matrix of eigenvalues, and R is symmetric and positive definite.

And the SD (standard deviation) for a given component is:

SD_{i}=\sqrt[]{\sum^{q}_{k=1}\frac{t^{2}_{ik}}{v_{k}}}

(7)

The paper uses three geographically weighted (GWs) methods - One uses local Mahalanobis distances (MDs), whilst the other two, use outputs from a local principal component analysis (PCA). In conclusion, all the techniques perform adequately in an empirical and heuristic study. Detection performance is both illustrated using maps and measured numerically.

II Inference

There exist several different methods for evaluating outlier detection over a spatial set of data. This is because data can be multivariate, bivariate, univariate or symmetric and hence the analysis applied is different in each case. Statistical based approaches are sensitive to changes in temporal correlations and a sudden change in the data distribution helps detect outliers. But are not suitable for real time applications. [7] Furthermore, since this approach uses histograms which do not rely on data distribution, its only useful for univariate data. Furthermore, computational cost for multivariate data is high if it is to be considered [1]. The nearest neighbor algorithm approach is unsupervised and does not make presumptions with any underlying distribution of data. Furthermore, the technique is fairly simple to apply [18] and only requires the selection of appropriate values for data cleaners [6]. However, computing the intra-distance among patterns is expensive and makes scalability of the algorithm difficult. Threshold values could lead to resulting the data being considered as false negative or false positive for outlier detection as selecting this value is a tough estimate. Inter cluster analysis is difficult and might result in elimination of effective data during spatial autocorrelation [5]. Artificial Intelligence based approaches are able to “generalize from limited, noisy and incomplete data” [7]. Spatial and temporal semantics make the rules and fuzzy logic more complex and harder to implement and could poses a load on the memory. Lastly, clustering based approaches are easily adaptable and do not tend to be supervised. But is again computational expensive due to intra-cluster distances among points.

Acknowledgment

The author, Jacob John would like to thank Dr. Archana T for her continuous support throughout this paper. I would also like to thank Vellore Institute of Technology for their aid without which this paper wouldn’t have been completed.

References

[1] Lu, C.-T., et al. “Algorithms for Spatial Outlier Detection.” Third IEEE International Conference on Data Mining, doi:10.1109/icdm.2003.1250986.
[2] Ernst, Marie, and Gentiane Haesbroeck. “Comparison of Local Outlier Detection Techniques in Spatial Multivariate Data.” Data Mining and Knowledge Discovery, vol. 31, no. 2, 2016, pp. 371–399., doi:10.1007/s10618-016-0471-0.
[3] Janeja, Vandana P., et al. “Spatial Neighborhood Based Anomaly Detection in Sensor Datasets.” Data Mining and Knowledge Discovery, vol. 20, no. 2, 2010, pp. 221–258. Springer Nature, doi:10.1007/s10618-009-0147-0.
[4] Schubert, Erich, et al. “Local Outlier Detection Reconsidered: a Generalized View on Locality with Applications to Spatial, Video, and Network Outlier Detection.” Springer, vol. 28, no. 1, 18 Dec. 2012, pp. 190–237., doi:10.1007/s10618-012-0300-z.
[5] Mugglestone, M.a., et al. “Modelling and Analysing Outliers in Spatial Lattice Data.” Mathematical and Computer Modelling, vol. 32, no. 1-2, 2000, pp. 1–10. ScienceDirect, doi:10.1016/s0895-7177(00)00116-3.
[6] Michel, Bobbia, et al. “Spatial Outlier Detection in the PM 10 Monitoring Network of Normandy (France).” Atmospheric Pollution Research, vol. 6, no. 3, 2015, pp. 476–483. Science Direct, doi:10.5094/apr.2015.053.
[7] Ayadi, Aya, et al. “Outlier Detection Approaches for Wireless Sensor Networks: A Survey.” Computer Networks, vol. 129, 2017, pp. 319–333. ScienceDirect, doi:10.1016/j.comnet.2017.10.007.
[8] Bakon, M., et al. “A Data Mining Approach for Multivariate Outlier Detection in Heterogeneous 2D Point Clouds: An Application to Post-Processing of Multi-Temporal InSAR Results.” 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016. IEEE, doi:10.1109/igarss.2016.7729005.
[9] Bo, Jiang. “Spatial Outlier Detection Algorithms Based on Knowledge Discovery.” 2009 International Conference on Management and Service Science, 30 Oct. 2009. IEEE, doi:10.1109/icmss.2009.5302291.
[10] Duggimpudi, Maria Bala, et al. “Spatio-Temporal Outlier Detection Algorithms Based on Computing Behavioral Outlierness Factor.” Data & Knowledge Engineering, 2017. Data & Knowledge Engineering, doi:10.1016/j.datak.2017.12.001.
[11] Gómez-López María Teresa, and Rafael M. Gasca.“Object Relational Constraint Databases for GIS.” Encyclopedia of GIS, 2017, pp. 1449–1457., doi:10.1007/978-3-319-17885-1 $\_$ 1598.
[12] Harris, Paul, et al. “Multivariate Spatial Outlier Detection Using Robust Geographically Weighted Methods.” Mathematical Geosciences, vol. 46, no. 1, 2013, pp. 1–31., doi:10.1007/s11004-013-9491-0.
[13] Hosseinalizadeh, Mohsen, et al. “Importance of Outlier Detection in Spatial Analysis of Wind Erosion.” Procedia Environmental Sciences, vol. 7, 2011, pp. 341–346. Science Direct, doi:10.1016/j.proenv.2011.07.059.
[14] Muñoz, Alberto, and Jorge Muruzábal. “Self-Organizing Maps for Outlier Detection.” Neurocomputing, vol. 18, no. 1-3, 1998, pp. 33–60. Elsevier, doi:10.1016/s0925-2312(97)00068-4.
[15] Shahid, Nauman, et al. “Characteristics and Classification of Outlier Detection Techniques for Wireless Sensor Networks in Harsh Environments: a Survey.” Artificial Intelligence Review, vol. 43, no. 2, 2012, pp. 193–228., doi:10.1007/s10462-012-9370-y.
[16] Shi, Yan, et al. “Adaptive Detection of Spatial Point Event Outliers Using Multilevel Constrained Delaunay Triangulation.” Computers, Environment and Urban Systems, vol. 59, 2016, pp. 164–183. ScienceDirect, doi:10.1016/j.compenvurbsys.2016.06.001.
[17] Xu, Wenping, et al. “An Adaptive Spatial Outlier Detection Algorithm with No Parameter for WSN.” 2017 20th International Conference on Information Fusion (Fusion), 15 Aug. 2017, doi:10.23919/icif.2017.8009844.
[18] Xue, Anrong, et al. “Algorithm for Fast Spatial Outlier Detection.” 2008 The 9th International Conference for Young Computer Scientists, 2008, doi:10.1109/icycs.2008.346.
[19] Yan, Mingyang, et al. “Outliers Detection of Cultivated Land Quality Grade Results Based on Spatial Autocorrelation.” 2016 Fifth International Conference on Agro-Geoinformatics (Agro-Geoinformatics), 29 Sept. 2016. IEEE, doi:10.1109/agro-geoinformatics.2016.7577629.
[20] Yao, Lingling, and Zhanquan Wang. “Research on the Algorithm of Hadoop-Based Spatial-Temporal Outlier Detection.” 2015 Fifth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), 15 Feb. 2016. IEEE, doi:10.1109/imccc.2015.175.