2022

Michael Stephanou

[1] Rand Merchant Bank, Johannesburg, South Africa
[2] University of Cape Town, Cape Town, South Africa
[3] University of Western Australia, Perth, Australia
hermiter: R package for Sequential Nonparametric Estimation
Abstract
This article introduces the R package hermiter which facilitates estimation of univariate and bivariate probability density functions and cumulative distribution functions along with full quantile functions (univariate) and nonparametric correlation coefficients (bivariate) using Hermite series based estimators. The algorithms implemented in the hermiter package are particularly useful in the sequential setting (both stationary and non-stationary) and one-pass batch estimation setting for large data sets. In addition, the Hermite series based estimators are approximately mergeable allowing parallel and distributed estimation.
keywords:
fast Kendall Tau, fast quantiles, fast Spearman Rho, Hermite series estimators, online algorithms, sequential algorithms

1 Introduction
Efficient algorithms for the nonparametric statistical analysis of streaming data and massive data sets are becoming increasingly relevant. Indeed, given the recent ubiquity of such data there is a growing need for such algorithms. Ideally these algorithms should be able to update estimates with constant, i.e. O(1), time and space complexity. This would allow effective processing of streaming data along with estimation on arbitrarily large data sets with fixed amounts of memory available. In addition, these algorithms should ideally be able to treat both stationary (static) and non-stationary (dynamic) settings, where statistical estimands are constant or may vary over time respectively. Finally, it would be very useful for these algorithms to also facilitate decentralized estimation where analyses on different subsets of a larger data set can be consistently and efficiently merged. Fundamental statistical quantities of interest for online nonparametric analysis include the mean, variance, higher order moments, probability density function (PDF), cumulative distribution function (CDF) and quantiles in the univariate setting. The most complete distributional information is provided by PDFs, CDFs and associated quantiles. The quantiles also have the favorable property of being robust to outliers and are more appropriate than moments such as the variance for skewed data.
In the bivariate setting PDFs, CDFs and correlations are core quantities of interest. The most commonly applied measure of correlation is the Pearson product-moment correlation coefficient. This measure has several limitations however. The first is that it only captures the linear aspect of relationships between the random variables. The second is that the sample estimator for the Pearson product-moment correlation is not robust and is thus sensitive to outliers. Finally, this measure of correlation is not invariant under all order preserving transformations of the random variables. Nonparametric correlation measures such as the Spearman rank correlation coefficient (Spearman Rho) and the Kendall rank correlation coefficient (Kendall Tau) do not suffer from these limitations. They are appropriate to the broader class of monotonic relationships, are robust to outliers (Croux and Dehon, 2010), and are invariant under all order preserving transformations (Gibbons and Chakraborti, 2010).
In this article we present algorithms and software - i.e. the R package hermiter - based on Hermite series estimators, that focus on the most general statistical quantities, namely the PDF, CDF and quantiles in the univariate setting and the PDF, CDF and nonparametric correlation coefficients in the bivariate setting. The algorithms we present have O(1) time and space update complexity, can treat static and dynamic estimation and are fully mergeable (albeit only approximately in some cases). This mergeability implies suitability for parallelized and distributed computation of all the aforementioned estimands. Parallel computation could be carried out on a single machine, utilizing multiple cores/threads or split across several machines. Both situations allow faster computation. The fact that Hermite series based estimators are appropriate for distributed computation suggests another natural use case, namely edge computing. For example, different machines on a network could each update a Hermite series estimator with observations from attached sensors. Since the representation of the Hermite estimator is very compact, each estimator could then be efficiently communicated to a central data center to be merged into a single estimator that can provide combined metrics.
This article is organized as follows. In section 2 we review existing sequential algorithms and highlight some advantages and disadvantages of these algorithms. This is followed by a comparison of the existing algorithms with the Hermite series based algorithms in section 3. Related software is reviewed in section 4 after which we discuss our software contribution in section 5. In section 6 we review the Hermite series based estimators and algorithms that are implemented in hermiter. In section 7 we treat merging of the estimators. In section 8 we present concrete examples of code to construct a hermite_estimator object, update it, merge it and calculate univariate and bivariate statistics. In section 9 we present a numerical example utilizing the hermiter package to analyze real foreign exchange data. In section 10 we briefly review existing accuracy comparisons between Hermite series based estimators and competing approaches. In addition, we present new results comparing Hermite series based algorithms as implemented in hermiter to the tdigest algorithm for online quantile estimation in the univariate setting and to the algorithm in Xiao (2019) for online estimation of the Kendall Tau coefficient in the bivariate setting. We conclude in section 11.
2 Related work
2.1 Sequential PDF, CDF and quantile estimation
There are existing algorithms for online estimation of all of the aforementioned statistical quantities. For PDFs, existing algorithms include recursive kernel estimators as discussed in Chapters 4 and 5 of Greblicki and Pawlak (2008) and Chapter 7 of Devroye and Gyorfi (1985). See also Slaoui and Jmaei (2019) for recursive estimators based on Bernstein polynomials. In the case of CDFs, existing algorithms include those based on the empirical distribution function estimator, the smooth kernel distribution function estimator (Slaoui, 2014) and Bernstein polynomials (Jmaei et al, 2017). These PDF and CDF estimators furnish sequential estimates of the probability density and cumulative probability at predefined values of the support of the probability density function, which is not as general as the online estimation of the full PDF and CDF.
Several algorithms exist for estimating quantiles in a sequential (one-pass) manner. These algorithms can be differentiated into two broad categories. The first category is comprised of those algorithms that directly approximate arbitrary empirical quantiles of a set of numeric values using sub-linear memory. Here, the empirical p-quantile is defined as $x_{(\lceil pn \rceil)}$ for $n$ observations, where $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$ are the observations sorted in ascending order. No particular data generating process is presupposed. These algorithms provide either deterministic or probabilistic guarantees on the rank error of approximate sample quantiles and have space requirements that grow with required accuracy and potentially also with the number of observations. Noteworthy algorithms in this category include those due to Greenwald and Khanna (2001) and Karnin et al (2016). For recent reviews of algorithms of this type see Greenwald and Khanna (2016), Luo et al (2016) and Chen and Zhang (2020). Rank error accuracy is but one choice of quality metric however and it has been noted that controlling rank error accuracy does not preclude arbitrarily large relative error (for heavily skewed, or heavy tailed data) as per Masson et al (2019). In Masson et al (2019) an algorithm providing relative error accuracy guarantees for estimating sample quantiles (with space requirements that grow with accuracy) is proposed to address this (see also Epicoco et al (2020) for an enhancement of this algorithm). It is also worth noting that many of the aforementioned algorithms cannot be used to form dynamic quantile estimates based on a recent subset of observations since they are insertion-only algorithms. Here insertion-only means that the estimates can only be updated with new observations in an online manner, but previous observations cannot be removed in such a manner. Exceptions include those algorithms that also allow deletion operations and use turnstile semantics to maintain an estimate over a sliding window. These algorithms include the Dyadic Count-Sketch (Luo et al, 2016) for example. Such algorithms assume a fixed universe of potential values for the values being analyzed and typically have higher space and update time complexity than insertion-only algorithms. See also Zhao et al (2021) for recent developments using a bounded deletion model.
The second broad category of online quantile estimation algorithms, of which the Hermite series based algorithm is a member (i.e. the algorithm proposed in Stephanou et al (2017) and implemented in hermiter), assumes that the observations to be analyzed are a realization of a random process. Approximating quantiles is then a nonparametric statistical estimation problem. Algorithms in this category are numerous (Jain and Chlamtac, 1985; Raatikainen, 1987, 1990; Naumov and Martikainen, 2007; Chen et al, 2000; Yazidi and Hammer, 2017; Hammer et al, 2019b, a; Tiwari and Pandey, 2019). The aforementioned algorithms in the literature only apply to pre-specified quantiles however. A recently introduced algorithm by Gan et al (2018) involving moment-based quantile sketches, furnishes an efficient approach to quantile estimation but focusses on rapid mergeability and single quantile queries. See also Mitchell et al (2021) for a comparison of the moment-based quantile sketch algorithm with several other algorithms on a large collection of real-world data sets. In Mitchell et al (2021) orthogonal series based quantile estimators similar to those in Stephanou et al (2017) are studied, including an estimator based on the Chebyshev-Hermite polynomials (also known as the "Probabilist's" Hermite polynomials). These are different orthogonal polynomials than those used in Stephanou et al (2017), which are instead the "Physicist's" Hermite polynomials, and thus the results in Mitchell et al (2021) do not directly apply to the estimators implemented in hermiter. The popular t-digest approach (Dunning, 2021) maintains an estimate of the empirical distribution function and allows the estimation of arbitrary quantiles, presenting similar capabilities to the Hermite series based approach. This algorithm has the useful property of maintaining an accuracy which is nearly constant relative to $p(1-p)$ for the p-quantile. In addition, the t-digest algorithm has many desirable properties in practice, such as approximate mergeability, and is fast and accurate.
2.2 Sequential nonparametric correlation estimation
In the bivariate setting, there is only one alternate approach to online estimation of nonparametric correlation coefficients known to the authors, namely the algorithms introduced in Xiao (2019) which involve discretizing the joint distribution of the random variables.
3 Comparison with existing algorithms
The algorithms we implement based on Hermite series estimators have certain advantages over the existing approaches discussed above. Firstly, the Hermite series based approach allows the online estimation of the full PDF and CDF. Existing algorithms only furnish online estimates of the probability density and cumulative probability at predefined values of the support of the probability density function. This makes the Hermite series algorithms more general.
Secondly, the Hermite series based algorithms present several advantages in the online quantile estimation setting. In comparison to algorithms that directly approximate arbitrary empirical quantiles, the Hermite series based estimators introduced in Stephanou et al (2017) have fixed, i.e. O(1), space requirements, but do not provide explicit guarantees on rank error accuracy in estimating sample quantiles. As noted in the previous section, the rank error accuracy metric does have shortcomings however. In addition, the Hermite series based algorithms facilitate online quantile estimation in non-stationary settings with small and constant i.e. O(1) memory requirements, a distinct advantage over the aforementioned turnstile algorithms.
When comparing to the existing online nonparametric statistical estimators, the Hermite series based approach allows the online estimation of arbitrary quantiles at any point in time (along with a direct online estimate of the full CDF) compared to pre-specified quantiles only. The t-digest approach provides the most similar capabilities to the Hermite series based approach. As we will demonstrate though, the Hermite series based methods appear to have superior accuracy in the cases we studied with roughly comparable speed. In addition, the Hermite series based algorithms facilitate online quantile estimation in non-stationary settings with small and constant memory requirements whereas the t-digest algorithm does not allow dynamic quantile estimation directly. This is pertinent for tracking the quantile function of non-stationary processes in near real-time for example. It should be noted that a similar effect can be achieved by maintaining a fixed number of t-digest sketches for recent non-overlapping windows of data (including the most recent, potentially partially complete window) and merging these sketches prior to quantile estimation. Sketches for windows beyond a certain age could be excluded from being merged or deleted. This is perhaps best suited to maintaining continuously updated aggregates in the database context however. (We would like to thank Ted Dunning for useful discussions in this regard.)
In the bivariate setting, the Hermite series based algorithms introduced in Stephanou and Varughese (2021b) and in this article have accuracy advantages for large numbers of observations in comparison to the existing algorithms in Xiao (2019). Our algorithms also allow the Spearman Rho and Kendall Tau of non-stationary streams to be tracked via an exponential weighting scheme, without relying on a moving/sliding window approach. This is advantageous since memory requirements are fixed with the exponential weighting approach compared to growing memory requirements with moving window size when utilizing a moving/sliding window approach.
4 Review of related software
Various open-source software implementations for online statistical estimation exist. Perhaps the most comprehensive solution is the OnlineStats package (Day and Zhou, 2020) in Julia (Bezanson et al, 2017) which facilitates one-pass batch estimation, sequential estimation and merging of estimates of most of the aforementioned statistical quantities. Notable exceptions are bivariate PDFs, CDFs and nonparametric correlations. In Python (Van Rossum, 2023) the RunStats package (Jenks, 2019) has a smaller coverage of statistical quantities than OnlineStats but provides similar functionality. Focussing more specifically on rapid online estimation of quantiles and cumulative probabilities, the tdigest algorithm has been implemented in several languages including R (R Core Team, 2023) in the package tdigest (Rudis et al, 2022). A summary of the nonparametric statistics covered by each package is presented in Table 1. Note that the package OnlineStats also facilitates online parametric estimation, which is not the focus of this article.
5 The R package hermiter
The R package hermiter (Stephanou, 2023) which we introduce in this article, is a focussed package concentrating on online nonparametric estimation of specific quantities in the univariate and bivariate settings. It addresses a gap in current implementations, namely online estimation of bivariate probability density functions, cumulative distribution functions and nonparametric correlations. In addition, this package presents advantages over existing packages in certain univariate estimation problems. In particular, the hermiter package presents accuracy advantages in quantile estimation over a leading alternative R package, namely tdigest whilst having roughly comparable speed. The hermiter package can also directly handle non-stationary quantile estimation, which tdigest cannot. The unifying theme of the hermiter package is the use of recently introduced Hermite series based estimators for univariate and bivariate estimation. It collects the associated algorithms in a coherent implementation.
| Package | Language | Univariate Statistics | Bivariate Statistics |
|---|---|---|---|
| OnlineStats | Julia | Mean, Variance, Quantiles, Extrema, Skewness, Kurtosis, PDF, CDF | Pearson's Correlation |
| RunStats | Python | Mean, Variance, Extrema, Skewness, Kurtosis | Pearson's Correlation |
| tdigest | R | CDF, Quantiles | - |
| hermiter | R | Quantiles, PDF, CDF | PDF, CDF, Spearman Rho, Kendall Tau |
6 Hermite series based estimators and algorithms
6.1 Density estimation
The Hermite series density estimator for the univariate PDF, $f(x)$, is defined as follows (Schwartz, 1967; Walter, 1977; Greblicki and Pawlak, 1984, 1985; Liebscher, 1990),

$$\hat{f}(x) = \sum_{k=0}^{N} \hat{a}_k\, h_k(x), \qquad (1)$$

where $\hat{a}_k = \frac{1}{n}\sum_{i=1}^{n} h_k(x_i)$, and $h_k(x) = \left(2^k k! \sqrt{\pi}\right)^{-1/2} e^{-x^2/2} H_k(x)$ are the normalized Hermite functions defined from the Hermite polynomials, $H_k(x) = (-1)^k e^{x^2} \frac{d^k}{dx^k} e^{-x^2}$.

The bivariate Hermite series density estimator for the bivariate PDF, $f(x,y)$, has the following form:

$$\hat{f}(x,y) = \sum_{k=0}^{N}\sum_{l=0}^{N} \hat{a}_{kl}\, h_k(x)\, h_l(y), \qquad (2)$$

where $\hat{a}_{kl} = \frac{1}{n}\sum_{i=1}^{n} h_k(x_i)\, h_l(y_i)$. Note that although standardization of the observations is not required for Hermite series based estimators, our empirical studies have revealed that it often improves accuracy. Here standardization refers to subtracting the estimated mean from each observation and dividing by the estimated standard deviation. The estimates for the Hermite series coefficients based on these standardized observations are biased compared to the coefficients that would have been obtained by using the true mean and standard deviation for standardization. Despite this bias, we have typically observed better overall results in practice using standardization. The PDF estimators previously defined need to be modified for standardized observations. For the univariate case,

$$\hat{f}(x) = \frac{1}{\hat{\sigma}} \sum_{k=0}^{N} \hat{a}_k\, h_k\!\left(\frac{x - \hat{\mu}}{\hat{\sigma}}\right),$$

where $\hat{\sigma}$ is the estimated standard deviation and $\hat{\mu}$ is the estimated mean. For the bivariate case,

$$\hat{f}(x,y) = \frac{1}{\hat{\sigma}_x \hat{\sigma}_y} \sum_{k=0}^{N}\sum_{l=0}^{N} \hat{a}_{kl}\, h_k\!\left(\frac{x - \hat{\mu}_x}{\hat{\sigma}_x}\right) h_l\!\left(\frac{y - \hat{\mu}_y}{\hat{\sigma}_y}\right),$$

where $\hat{\sigma}_x$ and $\hat{\sigma}_y$ are the estimates of the standard deviation of $x$ and $y$ respectively, and $\hat{\mu}_x$ and $\hat{\mu}_y$ are the corresponding estimates of the means.
6.2 Distribution function and quantile estimation
The univariate density estimator (1) can be used to define the following univariate CDF estimator,

$$\hat{F}(x) = \int_{-\infty}^{x} \hat{f}(t)\,dt = \sum_{k=0}^{N} \hat{a}_k \int_{-\infty}^{x} h_k(t)\,dt. \qquad (3)$$

Quantiles can be obtained numerically by applying a root-finding algorithm to solve for the quantile value, $\hat{q}_p$, at cumulative probability $p$ via $\hat{F}(\hat{q}_p) = p$.

In Stephanou et al (2017) an alternate univariate CDF estimator was also introduced,

(4)
We have found that while the asymptotic properties of the alternate univariate CDF estimator (4) are the same as (3), the empirical, finite sample performance of (4) is superior for quantile estimation specifically, see Stephanou et al (2017). A vectorized implementation of the bisection root finding algorithm is a robust way to calculate the quantiles at an arbitrary vector of cumulative probability values. This is one of the two algorithms available for quantile estimation in hermiter. The second algorithm allows for more rapid but sometimes less accurate approximation of the quantiles by calculating the CDF values at a fixed vector of values and applying linear interpolation to find the vector of quantiles using the R function findInterval. It is also noteworthy that in (3) and (4), it is computationally advantageous to utilize the recurrence relations for integrals of Hermite functions in (5) and (6),
$$\int_{-\infty}^{x} h_k(t)\,dt = \sqrt{\frac{k-1}{k}} \int_{-\infty}^{x} h_{k-2}(t)\,dt - \sqrt{\frac{2}{k}}\, h_{k-1}(x), \quad k \geq 2, \qquad (5)$$

and

$$\int_{-\infty}^{x} h_0(t)\,dt = \frac{\pi^{1/4}}{\sqrt{2}}\, \operatorname{erfc}\!\left(-\frac{x}{\sqrt{2}}\right), \qquad \int_{-\infty}^{x} h_1(t)\,dt = -\sqrt{\frac{2}{\sqrt{\pi}}}\, e^{-x^2/2}, \qquad (6)$$
where $\operatorname{erfc}$ is the complementary error function. Convergence acceleration techniques can be successfully applied to improve accuracy at a given order of truncation for Hermite series. In particular, hermiter applies straightforward iterated averaging techniques to the truncated series (see Boyd and Moore (1986) and Table A.2 of Boyd (2018)) to improve the accuracy of univariate PDF and CDF estimates along with quantile estimates obtained through the CDF estimates. We have observed that empirically, applying series acceleration typically yields better results. Deriving formal guarantees that these methods are generally effective in our context is an area for future research. It is worth noting that the results presented in Stephanou et al (2017) can be reproduced by setting the algorithm used for quantile estimation to 'bisection' and the accelerate_series argument to FALSE.
In a similar manner to the univariate case, the bivariate density estimator (2) can be used to construct the following bivariate CDF estimator:

$$\hat{F}(x,y) = \int_{-\infty}^{x}\int_{-\infty}^{y} \hat{f}(s,t)\,dt\,ds = \sum_{k=0}^{N}\sum_{l=0}^{N} \hat{a}_{kl} \int_{-\infty}^{x} h_k(s)\,ds \int_{-\infty}^{y} h_l(t)\,dt. \qquad (7)$$
The recurrence relation in (5) can again be applied for rapid calculation of the integrals.
6.2.1 Limitations of Hermite series estimators
The cumulative probability estimates produced by the estimators in this section may not be monotonically non-decreasing and may lie outside the range [0, 1]. This is related to the fact that the Hermite series based distribution function estimators are obtained by integrating Hermite series density estimators (with a finite number of terms). These density estimators can, in principle, produce negative probability density estimates at certain values of the domain. In practice, these departures are usually quite minor given sufficient data and can be further ameliorated by clipping the cumulative probability estimates to lie between 0 and 1.
6.3 Nonparametric correlation estimation
We can also use the Hermite series based estimators defined above to obtain estimators for the Spearman Rho and Kendall Tau nonparametric correlation measures. For the Spearman correlation coefficient estimator, $\hat{R}_S$, we plug the bivariate Hermite PDF estimator and univariate Hermite CDF estimator into a large sample form of the Spearman correlation coefficient (see Stephanou and Varughese (2021b)) to obtain,

$$\hat{R}_S = 12 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \hat{F}_x(x)\,\hat{F}_y(y)\,\hat{f}(x,y)\,dx\,dy - 3,$$

where $\hat{F}_x$ and $\hat{F}_y$ correspond to univariate CDF estimators as defined in equation (3), updated on the $x$ and $y$ observations respectively. The function $\hat{f}(x,y)$ is the bivariate PDF estimator defined in equation (2). We can phrase the estimator above in terms of linear algebra operations for computational efficiency:
(8)

where the quantities entering (8) are the vector of coefficients associated with the Hermite CDF estimator $\hat{F}_x$, the vector of coefficients associated with the Hermite CDF estimator $\hat{F}_y$ and the matrix of coefficients associated with the Hermite bivariate PDF estimator $\hat{f}(x,y)$, together with a fixed matrix and vector of integrals involving the Hermite functions; in all the matrices and vectors we have defined, the indices run over $0, \ldots, N$. These integrals can be evaluated numerically once and the values stored for rapid calculation. The sequential nature of the estimators follows from the fact that we have obtained sequential update rules for the coefficients and the fact that there is no explicit dependence on the observations in these expressions, as discussed in section 6.4 below.
For the Kendall Tau estimator, $\hat{\tau}$, we plug the bivariate Hermite PDF estimator and bivariate Hermite CDF estimator into a form of the Kendall correlation coefficient relating directly to the probability of concordance (see Gibbons and Chakraborti (2010) for more details on this representation), to obtain,

$$\hat{\tau} = 4 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \hat{F}(x,y)\,\hat{f}(x,y)\,dx\,dy - 1,$$

where $\hat{F}(x,y)$ is the bivariate CDF estimator defined in equation (7) and $\hat{f}(x,y)$ is the bivariate PDF estimator defined in equation (2). We can again phrase the estimator above in terms of linear algebra operations for computational efficiency:

(9)
6.4 Sequential Hermite series based estimators
The particular utility of all the Hermite series based estimators in online/sequential estimation follows from the ability to update the estimates of the Hermite series coefficients in an online, O(1) manner. In particular, for the coefficients in (1), we can define the following updating algorithm,

$$\hat{\mathbf{a}}^{(i)} = \frac{i-1}{i}\,\hat{\mathbf{a}}^{(i-1)} + \frac{1}{i}\,\mathbf{h}(x_i). \qquad (10)$$

For the coefficients in (2), we can define an updating algorithm in terms of the outer product of $\mathbf{h}(x_i)$ and $\mathbf{h}(y_i)$,

$$\hat{A}^{(i)} = \frac{i-1}{i}\,\hat{A}^{(i-1)} + \frac{1}{i}\,\mathbf{h}(x_i)\,\mathbf{h}(y_i)^{T}. \qquad (11)$$

In the update rules (10) and (11) above, $\mathbf{h}(x_i)$ and $\mathbf{h}(y_i)$ are the vectors $\left(h_0(x_i), h_1(x_i), \ldots, h_N(x_i)\right)^{T}$ and $\left(h_0(y_i), h_1(y_i), \ldots, h_N(y_i)\right)^{T}$ respectively. For computational efficiency it is advantageous to calculate $\mathbf{h}(x_i)$ and $\mathbf{h}(y_i)$ making use of the recurrence relation for the Hermite polynomials, $H_{k+1}(x) = 2xH_k(x) - 2kH_{k-1}(x)$. The computational cost of updating the coefficients $\hat{\mathbf{a}}^{(i)}$ and $\hat{A}^{(i)}$ as above is manifestly constant (O(1)) with respect to the number of previous observations. In non-stationary scenarios the following exponential weighting scheme can be applied for the coefficients defined in (1),

$$\hat{\mathbf{a}}^{(i)} = (1-\lambda)\,\hat{\mathbf{a}}^{(i-1)} + \lambda\,\mathbf{h}(x_i), \qquad (12)$$

where $\lambda \in (0,1)$ controls the weight of new terms associated with new observations. For the coefficients defined in (2) we have,

$$\hat{A}^{(i)} = (1-\lambda)\,\hat{A}^{(i-1)} + \lambda\,\mathbf{h}(x_i)\,\mathbf{h}(y_i)^{T}. \qquad (13)$$

The computational cost of updating the coefficients $\hat{\mathbf{a}}^{(i)}$ and $\hat{A}^{(i)}$ is again manifestly constant (O(1)) with respect to the number of previous observations.
As noted above, the Hermite series based estimators have been found to typically perform better on standardized observations. The observations can be standardized in an online manner by subtracting an online estimate of the mean and dividing by an online estimate of the standard deviation.
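To make these update rules concrete, the following self-contained R sketch maintains a vector of univariate Hermite series coefficients directly, evaluating the Hermite functions via the Hermite polynomial recurrence. It is an illustrative re-implementation of (10) and (12) under the definitions above, not the internal code of the hermiter package.

```r
# Illustrative sketch of the O(1) coefficient updates in (10) and (12).
# hermite_fun_vec evaluates the normalized Hermite functions h_0(x), ..., h_N(x)
# using the recurrence H_{k+1}(x) = 2 x H_k(x) - 2 k H_{k-1}(x).
hermite_fun_vec <- function(x, N) {
  H <- numeric(N + 1)
  H[1] <- 1                      # H_0(x)
  if (N >= 1) H[2] <- 2 * x      # H_1(x)
  if (N >= 2) {
    for (k in 1:(N - 1)) {
      H[k + 2] <- 2 * x * H[k + 1] - 2 * k * H[k]
    }
  }
  k_idx <- 0:N
  norm_const <- 1 / sqrt(2^k_idx * factorial(k_idx) * sqrt(pi))
  norm_const * exp(-x^2 / 2) * H
}

N <- 10
coeffs <- numeric(N + 1)  # running estimates of a_0, ..., a_N
i <- 0                    # number of observations seen so far
lambda <- 0.05            # weight for the exponentially weighted variant

set.seed(1)
for (x_new in rnorm(1000)) {
  i <- i + 1
  h_vec <- hermite_fun_vec(x_new, N)
  # Standard (stationary) update, equation (10):
  coeffs <- ((i - 1) / i) * coeffs + h_vec / i
  # The exponentially weighted update, equation (12), would instead be:
  # coeffs <- (1 - lambda) * coeffs + lambda * h_vec
}
```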
7 Merging Hermite series based estimators
Hermite series based estimators can be consistently merged in both the univariate and bivariate settings. In particular, when the observations are not standardized, the results obtained from merging distinct Hermite series based estimators updated on subsets of a data set are exactly equal to those obtained by constructing a single Hermite series based estimator and updating on the full data set (corresponding to the concatenation of the aforementioned subsets). This follows from the fact that the merged coefficients of the $m$ subsets of an overall data set, where subset $j$ contains $n_j$ observations and has associated coefficients $\hat{\mathbf{a}}^{(j)}$ (univariate) and $\hat{A}^{(j)}$ (bivariate),

$$\hat{\mathbf{a}} = \frac{1}{n}\sum_{j=1}^{m} n_j\,\hat{\mathbf{a}}^{(j)}, \qquad n = \sum_{j=1}^{m} n_j, \qquad (14)$$

$$\hat{A} = \frac{1}{n}\sum_{j=1}^{m} n_j\,\hat{A}^{(j)}, \qquad (15)$$
are exactly equal to the coefficients that would have been obtained on the full data set. Since the coefficient values are identical, it is easy to see that the PDF, CDF and quantile results in the univariate case and the PDF, CDF and nonparametric correlation results in the bivariate case are identical too. When the observations are standardized, the equivalence is no longer exact but it is accurate enough to be practically useful. The algorithm we use is as follows: first, the mean and standard deviation estimates from each of the estimators constructed on the data subsets are merged using the method in Schubert and Gertz (2018). This procedure yields a merged mean, $\hat{\mu}$, and a merged standard deviation, $\hat{\sigma}$, that are exactly equal to the mean and standard deviation of the full data set. In the univariate case, we calculate the merged Hermite series coefficients as follows:

(16)

In the bivariate case, we utilize:

(17)
We use Gauss-Hermite quadrature to rapidly evaluate these integrals in the hermiter package. We conclude this section by contrasting the operation of sequential updating of a Hermite series based estimator with the operation of merging several estimators. In the case where observations are not standardized, the result of repeated sequential updates to a single estimator versus repeated merging of several individual Hermite series based estimators initialized with any partition of the same set of observations gives precisely the same results for all estimands (this is readily apparent from equations (10), (11), (14) and (15)). Merging is not as efficient however, in that there is computational overhead in constructing the individual Hermite estimators to be merged. When observations are standardized, the resulting estimand values are expected to differ between a sequentially updated estimator and an estimator formed from merging however (although the differences are likely to be small). Conceptually, sequential updating of a Hermite series based estimator is best applied to incorporating new observations into the estimator in an efficient manner. Merging estimators on the other hand is perhaps most naturally applied in calculating marginal estimands. For example, if individual Hermite series based estimators are formed from observations for each hour of the day or for different sensors, merging the estimators for all hours or all sensors respectively corresponds to marginalizing the distribution of observations over these variables. We make these considerations more concrete in the numerical example in section 9.
8 Function reference and code usage examples
8.1 Function reference summary
We present a summary of the core functions provided by the hermiter package in Table 2 and elaborate on the functions in more detail in the remainder of the section.
| Function Name | Description |
|---|---|
| cum_prob | Estimate the cumulative probability at one or more x values |
| dens | Estimate the probability density at one or more x values |
| density | Return an object that can be printed and plotted using the generic methods print and plot respectively (adds a further generic density method to R) |
| hcdf | Return an object that can be printed, plotted and summarized using the generic methods print, plot and summary respectively |
| hermite_estimator | Construct a univariate or bivariate hermite_estimator object |
| kendall | Estimate the Kendall rank correlation coefficient for a bivariate estimator |
| merge_hermite | Merge a list of Hermite estimators |
| quant | Estimate the quantiles at a vector of probability values for a univariate estimator |
| quantile | Dispatch to the quant method (adds a further generic quantile method to R) |
| spearmans | Estimate the Spearman rank correlation coefficient for a bivariate estimator |
| update_sequential | Update a Hermite series based estimator sequentially |
8.2 Constructing the estimator
A hermite_estimator R S3 object is constructed as below. The argument, N, adjusts the number of terms in the Hermite series based estimator and controls the trade-off between bias and variance. A lower N value implies a higher bias but lower variance and vice versa for higher values of N. We have found that empirically, different contexts benefit from different default values of N for good performance. We have distilled our experience on a wide array of data sets into defaults that have been implemented in the hermiter package; the default value of N differs according to whether the estimator is univariate or bivariate and whether or not exponential weighting is applied (for example, the default for bivariate, non-exponentially weighted estimators is N = 30). For further detail on selecting N in the univariate setting, see section 4.1 of Stephanou et al (2017) and for the bivariate setting, see section 3.3 of Stephanou and Varughese (2021b). The argument, standardize, controls whether or not to standardize observations before applying the estimator. Standardization usually yields better results and is recommended for most estimation settings, thus the default for standardize is set to TRUE. Finally, the estimator can be optionally initialized with a batch of observations via the argument observations.
A univariate estimator is constructed as follows (the default estimator type is univariate, so this argument does not need to be explicitly set):
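A minimal sketch of such a call is shown below; the choice N = 30 is purely illustrative.

```r
library(hermiter)

# Univariate estimator (the estimator type defaults to univariate)
hermite_est <- hermite_estimator(N = 30, standardize = TRUE)
```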
Similarly for constructing a bivariate estimator:
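A corresponding sketch, assuming the type argument is named est_type and takes the value "bivariate":

```r
# Bivariate estimator
hermite_est_biv <- hermite_estimator(N = 30, standardize = TRUE,
                                     est_type = "bivariate")
```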
If one wishes to initialize the estimator with an initial batch of observations, the following syntax can be utilized. For univariate observations we have:
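For illustration, using simulated standard normal observations as a stand-in for real data:

```r
set.seed(10)
observations <- rnorm(1000)
hermite_est <- hermite_estimator(N = 30, standardize = TRUE,
                                 observations = observations)
```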
For bivariate observations we have:
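Similarly, using a two-column numeric matrix of simulated observations (the induced correlation of 0.5 is purely illustrative):

```r
set.seed(10)
observations_mat <- matrix(rnorm(2000), nrow = 1000, ncol = 2)
# Induce a correlation of roughly 0.5 between the two columns
observations_mat[, 2] <- 0.5 * observations_mat[, 1] +
  sqrt(1 - 0.5^2) * observations_mat[, 2]
hermite_est_biv <- hermite_estimator(N = 30, standardize = TRUE,
                                     est_type = "bivariate",
                                     observations = observations_mat)
```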
Note that the constructors above only take numeric observations arguments i.e. a numeric vector or numeric matrix respectively. Initializing with data frames is not presently supported even if all variables are numeric.
8.3 Sequential estimator updating in stationary scenarios
In the sequential setting, observations are revealed one at a time. A hermite_estimator object can be updated sequentially with a single or multiple new observations by utilizing the update_sequential method. Note that when updating the Hermite series based estimator sequentially, observations are also standardized sequentially if the standardize argument is set to TRUE in the constructor. This means that using the update_sequential method with standardize set to TRUE for an initial set of observations will give a slightly different result to initializing the estimator using the observations argument. For updating with univariate observations we have:
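A sketch of one-at-a-time updates follows, assuming the usual functional convention of assigning the returned, updated object; the simulated observations are placeholders.

```r
hermite_est <- hermite_estimator(N = 30, standardize = TRUE)
set.seed(10)
for (x_new in rnorm(100)) {
  hermite_est <- update_sequential(hermite_est, x_new)
}
```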
For bivariate observations we have:
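An analogous sketch for the bivariate case, assuming a single bivariate observation is supplied as a length-two numeric vector:

```r
hermite_est_biv <- hermite_estimator(N = 30, standardize = TRUE,
                                     est_type = "bivariate")
set.seed(10)
for (i in 1:100) {
  xy_new <- rnorm(2)  # one bivariate observation
  hermite_est_biv <- update_sequential(hermite_est_biv, xy_new)
}
```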
One can verify that the new observations have been incorporated by printing the hermite_estimator object before and after the sequential updates: the number of observations can be seen to have increased by the appropriate amount. The generic method update_sequential dispatches to the appropriate implementation depending on whether the hermite_estimator object is univariate or bivariate. In the univariate case, the sequential updating of the Hermite series coefficients is based on (10). In the bivariate case, the sequential updating of the Hermite series coefficients is based on (11).
8.4 Sequential estimator updating in non-stationary scenarios
The hermiter package is also applicable to non-stationary data streams. An exponentially weighted form of the Hermite series coefficients can be applied to handle this case. The estimators will adapt to the new distribution and incrementally “forget" the old distribution. In order to use the exponentially weighted form of the hermite_estimator, the exp_weight_lambda argument must be set to a non-NA value. Typical values for this parameter are 0.01, 0.05 and 0.1. The lower the exponential weighting parameter, the slower the estimator adapts and vice versa for higher values of the parameter. However, the variance of estimands obtained from the hermite_estimator object increases with higher values of exp_weight_lambda, so there is a trade-off to bear in mind. For univariate observations we have:
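A sketch with exponential weighting enabled; the values N = 20 and exp_weight_lambda = 0.05 are illustrative choices only.

```r
hermite_est <- hermite_estimator(N = 20, standardize = TRUE,
                                 exp_weight_lambda = 0.05)
set.seed(10)
for (x_new in rnorm(1000)) {
  hermite_est <- update_sequential(hermite_est, x_new)
}
```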
For bivariate observations we have:
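And the bivariate analogue, again with illustrative parameter values:

```r
hermite_est_biv <- hermite_estimator(N = 20, standardize = TRUE,
                                     est_type = "bivariate",
                                     exp_weight_lambda = 0.05)
set.seed(10)
for (i in 1:1000) {
  hermite_est_biv <- update_sequential(hermite_est_biv, rnorm(2))
}
```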
Again the generic method update_sequential dispatches to the appropriate implementation depending on whether the hermite_estimator object is univariate or bivariate. In the univariate case, the sequential updating of the Hermite series coefficients is based on (12). In the bivariate case, the sequential updating of the Hermite series coefficients is based on (13).
8.5 Merging Hermite estimators
Hermite series based estimators can be consistently merged in both the univariate and bivariate settings as discussed previously. In particular, when standardize = FALSE, the results obtained from merging distinct hermite_estimator objects updated on subsets of a data set are exactly equal to those obtained by constructing a single hermite_estimator and updating on the full data set (corresponding to the concatenation of the aforementioned subsets). This holds true for the PDF, CDF and quantile results in the univariate case and the PDF, CDF and nonparametric correlation results in the bivariate case. When standardize = TRUE, the equivalence is no longer exact, but is accurate enough to be practically useful. Merging hermite_estimator objects is illustrated below. Note that the merge_hermite function takes in a list of hermite_estimator objects as an argument. In order to be merged, these objects must have a consistent type, either "univariate" or "bivariate". As an example in the univariate case:
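The sketch below splits a simulated sample into two subsets, merges the corresponding estimators and compares quantile estimates against an estimator built on the full sample; the object names hermite_est_merged and hermite_est_full match those referenced in the text, while the probability argument of quant is assumed to be named p.

```r
set.seed(10)
observations_full <- rnorm(2000)
observations_1 <- observations_full[1:1000]
observations_2 <- observations_full[1001:2000]

hermite_est_1 <- hermite_estimator(N = 30, standardize = FALSE,
                                   observations = observations_1)
hermite_est_2 <- hermite_estimator(N = 30, standardize = FALSE,
                                   observations = observations_2)
hermite_est_merged <- merge_hermite(list(hermite_est_1, hermite_est_2))

hermite_est_full <- hermite_estimator(N = 30, standardize = FALSE,
                                      observations = observations_full)

# Quantile estimates from the merged and full estimators coincide exactly
quant(hermite_est_merged, p = c(0.25, 0.5, 0.75))
quant(hermite_est_full, p = c(0.25, 0.5, 0.75))
```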
The above example verifies that the merged hermite_estimator, i.e. hermite_est_merged, is exactly equal to the full hermite_estimator, i.e. hermite_est_full, when standardize = FALSE.
A univariate example with standardize = TRUE is the following:
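A sketch mirroring the previous example, reusing the simulated subsets above but with standardize = TRUE:

```r
hermite_est_1 <- hermite_estimator(N = 30, standardize = TRUE,
                                   observations = observations_1)
hermite_est_2 <- hermite_estimator(N = 30, standardize = TRUE,
                                   observations = observations_2)
hermite_est_merged <- merge_hermite(list(hermite_est_1, hermite_est_2))
hermite_est_full <- hermite_estimator(N = 30, standardize = TRUE,
                                      observations = observations_full)

quant(hermite_est_merged, p = c(0.25, 0.5, 0.75))
quant(hermite_est_full, p = c(0.25, 0.5, 0.75))
```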
The quantile estimates from the merged and full estimators now agree closely but not exactly.
The above example illustrates that the merged hermite_estimator i.e. hermite_est_merged is a close approximation to the full hermite_estimator, i.e. hermite_est_full when standardize = TRUE but it is not exactly the same object. As an example for the bivariate case:
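A bivariate merging sketch using simulated two-column matrices (the default standardize = TRUE is retained):

```r
set.seed(10)
observations_mat_1 <- matrix(rnorm(2000), nrow = 1000, ncol = 2)
observations_mat_2 <- matrix(rnorm(2000), nrow = 1000, ncol = 2)

hermite_est_biv_1 <- hermite_estimator(N = 30, est_type = "bivariate",
                                       observations = observations_mat_1)
hermite_est_biv_2 <- hermite_estimator(N = 30, est_type = "bivariate",
                                       observations = observations_mat_2)
hermite_est_biv_merged <- merge_hermite(list(hermite_est_biv_1,
                                             hermite_est_biv_2))

# Bivariate statistics from the merged estimator
spearmans(hermite_est_biv_merged)
kendall(hermite_est_biv_merged)
```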
When standardize = FALSE, merging utilizes (14) and (15) to combine the univariate and bivariate Hermite series coefficients respectively. When standardize = TRUE, merging utilizes the methods in Schubert and Gertz (2018) to combine the mean and standard deviation estimates of the individual hermite_estimator objects along with (16) and (17) to combine the univariate and bivariate Hermite series coefficients respectively.
A few final notes are that the standardize parameter must be the same i.e. either all TRUE or all FALSE for the list of hermite_estimator objects to be merged. In addition, the N argument must be the same for all estimators in order to merge them.
8.6 Estimating the univariate PDF, CDF and quantile function
The main advantage of Hermite series based estimators is that they can be updated in a sequential/one-pass manner as above and subsequently probability densities and cumulative probabilities at arbitrary x values can be obtained, along with arbitrary quantiles. The hermite_estimator object only maintains a small and fixed number of coefficients and thus uses minimal memory. The syntax to calculate probability densities, cumulative probabilities and quantiles in the univariate setting is presented below. The PDF estimators (1) and (2) are implemented in the dens method, where dispatch occurs to the appropriate implementation depending on whether the hermite_estimator object is univariate or bivariate. As described in section 6.2.1, density estimates can be negative in principle. The parameter clipped can be set to TRUE in order to replace negative values with a small, but strictly positive value (i.e. 1e-8).
The CDF estimators (3) and (7) are implemented in the cum_prob method, where again dispatch occurs to the appropriate implementation depending on whether the hermite_estimator object is univariate or bivariate. Cumulative probability estimates may not be monotonically non-decreasing and might lie outside the range [0, 1] as per section 6.2.1. These limitations can be partially addressed by clipping the estimates to lie between 0 and 1; this can be done by setting the clipped parameter to TRUE. There are also various post-processing strategies that can be employed to enforce the monotone property (e.g. making use of the R function cummax). The quantile function quant makes use of the CDF estimator (4) and is only defined for univariate hermite_estimator objects. It is worth noting that since the Hermite series based quantile estimator is directly estimating the quantile function and not the empirical quantiles, we do not expect the 0-quantile to correspond to the minimum observation, nor do we expect the 1-quantile to correspond to the maximum observation. In fact, since the Hermite series based estimators are defined on the full real line, the 0-quantile (true minimum) is $-\infty$ and the 1-quantile (true maximum) is $\infty$ by definition. As a result, estimating these quantiles, i.e. estimating the minimum and maximum directly, is not sensible using these estimators. Quantiles for p very close to 0 and p very close to 1 may still be reasonably accurate however. The default behaviour of the quant function is therefore to return a quantile for a p value very close to 0 for the 0-quantile and a quantile for a p value very close to 1 for the 1-quantile.
As an example of estimating the PDF, CDF and quantiles in the univariate case:
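A sketch of these calls, using a simulated logistic sample purely for illustration (any univariate sample could be substituted):

```r
set.seed(10)
hermite_est <- hermite_estimator(N = 30, standardize = TRUE,
                                 observations = rlogis(2000))

x_grid <- seq(-10, 10, by = 0.1)
p_grid <- seq(0.05, 0.95, by = 0.05)

pdf_est <- dens(hermite_est, x_grid)
cdf_est <- cum_prob(hermite_est, x_grid, clipped = TRUE)
quantile_est <- quant(hermite_est, p = p_grid)
```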
Actual values of the PDF, CDF and quantiles are evaluated as below:
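Continuing the logistic example above, the reference values follow from the standard R distribution functions:

```r
pdf_actual <- dlogis(x_grid)
cdf_actual <- plogis(x_grid)
quantile_actual <- qlogis(p_grid)
```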
The estimated versus actual values are visualized in Figures 1, 2 and 3.



8.7 Estimating the bivariate PDF, CDF and nonparametric correlations
The aforementioned suitability of Hermite series based estimators in sequential and one-pass batch estimation settings extends to the bivariate case. Probability densities and cumulative probabilities can be obtained at arbitrary points. The syntax to calculate probability densities and cumulative probabilities along with the Spearman and Kendall correlation coefficients in the bivariate setting is presented below. Note that the functions spearmans and kendall implement the equations (8) and (9) respectively. As an example of estimating the PDF, CDF and nonparametric correlation coefficients in the bivariate case:
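A sketch of these calls, using a simulated bivariate normal sample with correlation 0.5 as an illustrative input, and assuming the bivariate dens and cum_prob methods accept a two-column matrix of evaluation points:

```r
set.seed(10)
rho <- 0.5
sims <- matrix(rnorm(2 * 10000), ncol = 2)
sims[, 2] <- rho * sims[, 1] + sqrt(1 - rho^2) * sims[, 2]

hermite_est_biv <- hermite_estimator(N = 30, est_type = "bivariate",
                                     observations = sims)

x_grid <- seq(-3, 3, by = 0.25)
grid_points <- as.matrix(expand.grid(x_grid, x_grid))

pdf_est <- dens(hermite_est_biv, grid_points)
cdf_est <- cum_prob(hermite_est_biv, grid_points, clipped = TRUE)

spearman_est <- spearmans(hermite_est_biv)
kendall_est <- kendall(hermite_est_biv)
```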
Actual values of the bivariate PDF, CDF, Spearman and Kendall correlation coefficients are evaluated as below.
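Continuing the bivariate normal example, reference values can be computed with the mvtnorm package (sample-based for Spearman Rho, analytical for Kendall Tau):

```r
library(mvtnorm)

sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
pdf_actual <- dmvnorm(grid_points, sigma = sigma)
cdf_actual <- apply(grid_points, 1,
                    function(pt) as.numeric(pmvnorm(upper = pt, sigma = sigma)))

spearman_actual <- cor(sims[, 1], sims[, 2], method = "spearman")
kendall_actual <- 2 / pi * asin(rho)  # exact for the bivariate normal
```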
The estimated versus actual values for the PDF and CDF are visualized in Figures 4, 5 respectively.


The Spearman and Kendall correlation coefficient results are presented in Table 3.
| | Spearman Rho | Kendall Tau |
|---|---|---|
| Actual | 0.479 | 0.331 |
| Estimated | 0.475 | 0.330 |
9 Numerical example
In this section, we illustrate the utility of the hermiter package on a real data set. The data set consists of one week (the first week of October 2021) of high frequency forex quote data sourced from TrueFX (Integral, 2021) for the currency pairs EUR/USD and GBP/USD. These data are comprised of tick-by-tick, top-of-book bid and offer quotes, aggregated across several bank, broker and asset manager participants on the Integral OCX platform (an ECN, i.e. an Electronic Communication Network). In particular, we use the hermiter package to summarize the quantiles of the bid-offer spreads of EUR/USD and GBP/USD per hour of day, and merge these estimates to obtain the quantiles of spread over all hours (i.e. we marginalize the distribution of spreads over hour of day). This is contrasted with tracking the spread quantiles in a sequential manner. Finally we illustrate tracking the Spearman and Kendall correlation coefficients of the spreads of the EUR/USD and GBP/USD currency pairs in a sequential manner. Bid-offer spread is reflective of the cost to liquidity takers of fulfilling immediate liquidity needs i.e. the need to buy or sell at any given time, and is defined as the prevailing offer price minus the prevailing bid price. We sample the spread at 15 second intervals.
The quantile results are obtained by creating a univariate hermite_estimator object for each hour of the day (each initialized with a batch of spread observations corresponding to a particular hour) and running the quant function for p in {0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99}. The merge_hermite function is then used to combine the estimates (marginalize the distribution of spreads over hour of day) and the quant function is again applied to the merged estimator. The results are presented in Table 4. This procedure illustrates the utility of fast batch calculation of quantiles for each hour of day and then merging the estimators to marginalize the distribution over all hours of day and obtain combined quantile estimates. It is noteworthy that estimates can just as easily be obtained for any subset of hours of the day, avoiding the need to cater for a combinatorial explosion of combinations upfront. A very practical application is to calculate statistics for different trading sessions, e.g. Tokyo, London and New York. The results suggest a mostly stable median spread throughout the day for both EUR/USD and GBP/USD, which is consistent with the fact that these are two of the most liquid currency pairs and are heavily traded in all trading sessions (covering almost 24 hours). The exceptions to this are the hours 21h and 22h i.e. 9 pm and 10 pm UTC time. These hours coincide with a period where trading in New York tapers off and trading in Tokyo has not yet picked up. These hours are thus typically associated with the lowest traded volumes of the day and this is reflected in the median spreads in Table 4 which are higher than the rest of the day.
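A sketch of the per-hour estimation and merging workflow just described is given below, using a synthetic stand-in for the sampled spreads in a hypothetical data frame spread_df with numeric columns hour and spread (the preparation of the raw TrueFX quotes is not shown).

```r
# Synthetic stand-in for the sampled spread data (hypothetical structure)
set.seed(10)
spread_df <- data.frame(hour = rep(0:23, each = 500),
                        spread = rlnorm(24 * 500, meanlog = -1, sdlog = 0.3))

p_vec <- c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)

# One estimator per hour of day, initialized with that hour's log spreads
hourly_estimators <- lapply(0:23, function(h) {
  hermite_estimator(N = 30, standardize = TRUE,
                    observations = log(spread_df$spread[spread_df$hour == h]))
})

# Hourly quantiles, transformed back from the log scale
hourly_quantiles <- lapply(hourly_estimators,
                           function(est) exp(quant(est, p = p_vec)))

# Merge across all hours to marginalize the spread distribution over hour of day
merged_estimator <- merge_hermite(hourly_estimators)
overall_quantiles <- exp(quant(merged_estimator, p = p_vec))
```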
A distinct, but related use case of hermiter is to track time-varying quantiles in a sequential manner. In this case, the exponentially weighted form of the univariate Hermite series based estimator can be applied (where the hermite_estimator object is constructed with a non-NA weighting parameter exp_weight_lambda). In the present numerical example, a hermite_estimator object was constructed with a non-NA exp_weight_lambda value, and the estimator was sequentially updated with one spread observation at a time using the update_sequential function. The updated median value was calculated on each update. The result of this procedure is illustrated in Figure 6. This figure again reflects the spike in median spread between 9 pm and 11 pm. We illustrate a similar procedure of tracking time-varying nonparametric correlations between EUR/USD and GBP/USD spreads using a bivariate hermite_estimator and the spearmans, kendall functions, calculated at each update, in Figure 7. There are several interesting features of this figure. The first is that the correlation between the currency pair spreads varies significantly throughout the day. The second interesting feature is that the local mean levels of the correlations are very slightly positive throughout the day. Lastly, there are periods where the correlation spikes above 0.5 in magnitude, which is meaningful information for risk management. A final note is that we found it advantageous to apply a log transformation to the positive spread data before constructing hermite_estimator objects in this section (we invert the transformation for the results) as this yielded more sensible summary statistics.
Currency Pair | Hour (UTC) | p_1% | p_10% | p_25% | p_50% | p_75% | p_90% | p_99% |
---|---|---|---|---|---|---|---|---|
EUR/USD | 0 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.7 |
EUR/USD | 1 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 |
EUR/USD | 2 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 3 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 4 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 5 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 6 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 |
EUR/USD | 7 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.5 |
EUR/USD | 8 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 |
EUR/USD | 9 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 |
EUR/USD | 10 | 0.1 | 0.2 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 11 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 12 | 0.2 | 0.3 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 |
EUR/USD | 13 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.4 | 0.5 |
EUR/USD | 14 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.5 |
EUR/USD | 15 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.5 |
EUR/USD | 16 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.5 |
EUR/USD | 17 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 18 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 19 | 0.2 | 0.3 | 0.3 | 0.4 | 0.4 | 0.5 | 0.6 |
EUR/USD | 20 | 0.3 | 0.3 | 0.4 | 0.5 | 0.7 | 0.8 | 1.0 |
EUR/USD | 21 | 1.8 | 2.6 | 3.0 | 4.0 | 6.0 | 8.5 | 15.9 |
EUR/USD | 22 | 0.4 | 0.5 | 0.6 | 0.7 | 0.7 | 0.9 | 1.2 |
EUR/USD | 23 | 0.3 | 0.4 | 0.5 | 0.5 | 0.6 | 0.7 | 0.7 |
GBP/USD | 0 | 0.4 | 0.5 | 0.6 | 0.7 | 0.9 | 1.0 | 1.2 |
GBP/USD | 1 | 0.4 | 0.5 | 0.6 | 0.8 | 0.9 | 1.0 | 1.2 |
GBP/USD | 2 | 0.4 | 0.5 | 0.5 | 0.7 | 0.9 | 1.0 | 1.2 |
GBP/USD | 3 | 0.4 | 0.5 | 0.6 | 0.8 | 0.9 | 1.0 | 1.1 |
GBP/USD | 4 | 0.4 | 0.5 | 0.6 | 0.7 | 0.9 | 1.0 | 1.1 |
GBP/USD | 5 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.2 |
GBP/USD | 6 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.8 | 1.0 |
GBP/USD | 7 | 0.3 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
GBP/USD | 8 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 1.0 |
GBP/USD | 9 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 1.0 |
GBP/USD | 10 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
GBP/USD | 11 | 0.4 | 0.5 | 0.6 | 0.7 | 0.7 | 0.8 | 0.9 |
GBP/USD | 12 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 1.0 |
GBP/USD | 13 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
GBP/USD | 14 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
GBP/USD | 15 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.7 | 0.9 |
GBP/USD | 16 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
GBP/USD | 17 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
GBP/USD | 18 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.8 | 0.9 |
GBP/USD | 19 | 0.3 | 0.5 | 0.6 | 0.7 | 0.8 | 0.8 | 1.0 |
GBP/USD | 20 | 0.5 | 0.7 | 0.8 | 1.0 | 1.1 | 1.3 | 2.3 |
GBP/USD | 21 | 2.1 | 2.5 | 3.3 | 5.0 | 8.9 | 12.0 | 24.7 |
GBP/USD | 22 | 0.9 | 1.0 | 1.1 | 1.3 | 1.6 | 1.9 | 2.5 |
GBP/USD | 23 | 0.7 | 0.8 | 0.9 | 1.0 | 1.2 | 1.4 | 1.6 |
EUR/USD | All | 0.2 | 0.3 | 0.3 | 0.4 | 0.5 | 0.6 | 6.4 |
GBP/USD | All | 0.4 | 0.5 | 0.6 | 0.7 | 0.9 | 1.1 | 9.0 |


10 Accuracy and performance
10.1 Univariate setting
In the univariate case, simulation studies have been conducted to demonstrate competitive performance of the Hermite series based algorithms in the sequential quantile estimation setting in Stephanou et al (2017). In addition, simulation studies have been performed to evaluate the accuracy of the Hermite series based CDF estimator in Stephanou and Varughese (2021a) where it was concluded that while smooth kernel distribution function estimators are more accurate in the non-sequential setting, the performance of the sequential Hermite series based CDF estimators is still sufficiently good to be practically useful in online settings. The recent tdigest package in R offers very similar functionality for sequential quantile estimation to hermiter. This similarity, along with the availability of an R implementation and the popularity of the tdigest algorithm suggests a new and useful comparison to the Hermite series based algorithms in terms of accuracy and performance in the stationary setting. Note that in the non-stationary setting, where quantiles may vary over time, only the hermiter package can be directly applied, implying that even if the accuracy and performance of tdigest and hermiter are similar in the stationary setting, the hermiter package still adds valuable functionality. The simulation comparison we conduct utilizes the suite of test distributions provided by the R package benchden (Mildenberger and Weinert, 2012) excluding heavy-tailed distributions. The reason we exclude the heavy-tailed densities from the benchden test suite is that most of these have undefined variances and some even have undefined means. The Hermite series based quantile estimators as implemented in hermiter require at least the first two moments to exist and thus we cannot meaningfully compare hermiter and tdigest on these densities. The simulation study is as follows: for each of the sample sizes 10,000, 100,000, 1,000,000 and 10,000,000 and each of the test distributions, the following steps are repeated.
1. For each test distribution, we initialize a hermite_estimator object (using the default N and standardize = TRUE), with observations drawn from the test distribution. We record the integrated absolute error (IAE) and partial integrated absolute error (pIAE) between estimated quantiles - obtained via the quant function with algorithm = "interpolate" and accelerate_series = TRUE - and the true quantile values. Similarly for the tdigest algorithm using the tdigest and quantile functions. The IAE is defined as,

$$\mathrm{IAE} = \int_{0}^{1} \left| \hat{q}(p) - q(p) \right| dp,$$

and the partial IAE is defined as the same integral restricted to a central sub-interval of cumulative probabilities, where $\hat{q}(p)$ is the estimated quantile value at cumulative probability $p$ and $q(p)$ is the true quantile value at cumulative probability $p$. The IAE captures the integrated absolute error across the full range of cumulative probability values. The partial IAE captures the integrated absolute error excluding extreme quantiles (namely those with $p$ near 0 and near 1). In practice, we apply Quasi-Monte Carlo integration with a Sobol sequence to numerically determine the value of these integrals.

2. We repeat the above a number of times and average across all runs to obtain estimates of the mean integrated absolute error, MIAE, and partial mean integrated absolute error, pMIAE.
| Observation Count | Hermite Estimator Better MIAE (proportion of test densities) |
|---|---|
| 10,000 | 0.57 (12/21) |
| 100,000 | 0.81 (17/21) |
| 1,000,000 | 0.86 (18/21) |
| 10,000,000 | 0.86 (18/21) |
| Observation Count | Hermite Estimator Better pMIAE (proportion of test densities) |
|---|---|
| 10,000 | 0.62 (13/21) |
| 100,000 | 0.81 (17/21) |
| 1,000,000 | 0.86 (18/21) |
| 10,000,000 | 0.91 (19/21) |
This yields MIAE and pMIAE values for each test distribution and sample size for both the Hermite series based quantile estimator and the tdigest based quantile estimation algorithm. There are 21 non heavy-tailed test distributions in total. We summarize the results of comparing the MIAE of hermiter and tdigest estimates in Table 5. The results for pMIAE are presented in Table 6.
The results reveal that the accuracy of the Hermite series based approach is better than that of the tdigest algorithm for fairly large numbers of observations even for the full MIAE error measure. In this simulation study we are particularly interested in performance on a number of common distributions across a range of non-extreme (i.e. more central) quantiles, for which the pMIAE results are more relevant. These results demonstrate significantly better accuracy using the Hermite series based algorithm for quantile estimates formed from more than 10,000 observations. The distributions in benchden include those with full real-line, half real-line and compact support and thus the simulation study should yield information about typical performance in a broad range of scenarios. We reiterate that we exclude the heavy-tailed densities from the benchden test suite but should mention that the tdigest package appears to perform well in quantile estimation on all these heavy-tailed densities. Thus, for applications involving heavy-tailed distributions, tdigest is likely better suited.
In terms of computational performance, we plot the time taken to initialize the hermiter estimator and the tdigest estimator with a large batch of observations in Figure 8. It is apparent that hermiter is faster across all values of N considered. This is mainly the result of leveraging multithreaded computation on multiple cores for batch updating. This is enabled by default. Multithreaded computation can be disabled by setting options(hermiter.parallel = FALSE). Computation time will typically increase on multi-core systems in this case however; compare Figure 8 (multithreaded computation enabled) to Figure 9 (multithreaded computation disabled) for example. Both benchmarks were conducted on the same eight core machine (see section Computational details).


Finally, we plot the time taken to calculate vectors of quantile values ranging in length from a single quantile to 100,000 quantiles in Figure 10. The performance of tdigest is better for smaller numbers of quantiles but the distributions of run-times of hermiter and tdigest are similar for larger numbers of quantiles. Calculating large vectors of quantiles simultaneously is meaningful for drawing random variables for example. Given the fact that further implementation optimization is possible for hermiter (and potentially tdigest) and that performance results are likely operating system dependent, we regard the computational performance as roughly similar (while acknowledging that tdigest appears faster for smaller vectors of quantiles, perhaps due to less computational overhead).

10.2 Bivariate setting
In the bivariate case, extensive simulation studies have been conducted to demonstrate competitive performance in the sequential estimation of the Spearman correlation coefficient versus the only known competing algorithm in Stephanou and Varughese (2021b). In this section we present a succinct simulation study using essentially the same methodology as Stephanou and Varughese (2021b) where we extend the results to the sequential estimation of the Kendall Tau coefficient. We term the competing algorithms in Xiao (2019) count matrix based algorithms.
The simulation study involves generating i.i.d. samples of increasing size, n in {10,000; 50,000; 100,000}, from a bivariate normal distribution with mean vector, $\boldsymbol{\mu}$, and covariance matrix, $\Sigma$, with unit variances and off-diagonal entries equal to the correlation parameter, $\rho$. The correlation of the bivariate normal distribution is varied by drawing samples across a range of correlation parameters, $\rho$. For each value of the sample size and correlation parameter, $\rho$, the steps below are repeated a number of times.
1. Generate a sample of size n from the bivariate normal distribution with correlation parameter $\rho$.

2. Construct a bivariate hermite_estimator using hermiter and initialize it with the full sample.

3. Estimate the Spearman Rho and Kendall Tau correlation coefficients for the hermite_estimator object using the spearmans and kendall functions in hermiter.

4. Estimate the true Spearman Rho on the full sample using the R function cor. Calculate the true Kendall Tau analytically using the relation $\tau = \frac{2}{\pi}\arcsin(\rho)$, which applies to bivariate normal distributions (Gibbons and Chakraborti, 2010). We use the exact analytical Kendall Tau since the implementation in cor is prohibitively slow for estimating Kendall Tau with larger sample sizes (as it would be in other standard implementations).

5. Calculate the absolute error between the Hermite series based Spearman Rho and Kendall Tau estimates and the respective exact values.
The mean absolute error (MAE) between the Hermite series based estimates and the exact Spearman and Kendall rank correlation coefficients, for a particular value of the correlation parameter and sample size, is then estimated by averaging the absolute errors over the repeated sets of observations. The same steps as above are repeated for the count matrix algorithms, namely algorithms 2 and 3 of Xiao (2019). For the choice of parameters, we use the default of hermiter for non-exponentially weighted, bivariate estimators, N = 30, which we have found to give good performance in a wide range of scenarios. For algorithm 2 of Xiao (2019) we use the recommended value of 30 cutpoints for estimating Spearman Rho. For algorithm 3 of Xiao (2019) we use the recommended value of 100 cutpoints for estimating Kendall Tau. It is worth noting that the number of values to be maintained in memory is significantly larger in the count matrix case with 100 cutpoints than in the Hermite estimator case with N = 30. The results are summarized as the average MAE (and standard deviation of MAE) across all the values of the correlation parameter in Table 7 for the Hermite series based algorithm with N = 30 and the count matrix algorithm with c = 30 for Spearman Rho estimation. Similarly, results are summarized as the average MAE (and standard deviation of MAE) across all the values of the correlation parameter in Table 8 for the Hermite series based algorithm with N = 30 and the count matrix algorithm with c = 100 for Kendall Tau estimation. It is clear that the Hermite series based algorithms present an accuracy advantage. The mean MAE and variation in MAE are also summarized in Figure 11.
The computational performance of the algorithms in hermiter is roughly comparable with that of the algorithms in Xiao (2019). That said, when updating the Spearman or Kendall correlation coefficient with each new observation, we have observed better computational performance with the Hermite series based algorithms. We acknowledge, however, that this may depend on the implementation of the count matrix algorithms. We omit detailed benchmarking for brevity. A final note is that the Hermite series based Kendall Tau estimator is thousands of times faster to evaluate than the cor function in R at the sample sizes considered above, and this differential grows rapidly with the sample size $n$. For large samples (e.g. millions or even billions of observations), the standard methods of calculating the Kendall correlation coefficient are intractable. The Hermite series based approach is not only tractable but very fast in practice.
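To give a feel for this timing differential (this is not the benchmarking protocol used above), one could time the Hermite series based Kendall Tau estimate against the base R cor implementation using the microbenchmark package. The sample size and number of timing repetitions below are arbitrary illustrative choices; the gap widens quickly as $n$ grows because the standard Kendall Tau computation scales quadratically in $n$.

```r
library(hermiter)
library(mvtnorm)
library(microbenchmark)

set.seed(2)
n   <- 2e4
rho <- 0.5
obs <- rmvnorm(n, sigma = matrix(c(1, rho, rho, 1), nrow = 2))

# Time the Hermite series based estimate (including one-pass initialization)
# against the standard O(n^2) Kendall Tau computation in base R
microbenchmark(
  hermite = {
    h_biv <- hermite_estimator(N = 30, standardize = TRUE,
                               est_type = "bivariate", observations = obs)
    kendall(h_biv)
  },
  base_r = cor(obs[, 1], obs[, 2], method = "kendall"),
  times = 3
)
```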
Table 7: Spearman Rho MAE Avg (x ) and Spearman Rho MAE Std (x ), comparing the count matrix and Hermite series based algorithms.

| Observations | Matrix (c=30), MAE Avg | Hermite (N=30), MAE Avg | Matrix (c=30), MAE Std | Hermite (N=30), MAE Std |
|---|---|---|---|---|
| 10,000 | 0.103 | 0.146 | 0.029 | 0.014 |
| 50,000 | 0.099 | 0.065 | 0.028 | 0.005 |
| 100,000 | 0.100 | 0.044 | 0.029 | 0.003 |
Table 8: Kendall Tau MAE Avg (x ) and Kendall Tau MAE Std (x ), comparing the count matrix and Hermite series based algorithms.

| Observations | Matrix (c=100), MAE Avg | Hermite (N=30), MAE Avg | Matrix (c=100), MAE Std | Hermite (N=30), MAE Std |
|---|---|---|---|---|
| 10,000 | 0.540 | 0.472 | 0.038 | 0.077 |
| 50,000 | 0.388 | 0.217 | 0.127 | 0.042 |
| 100,000 | 0.353 | 0.158 | 0.134 | 0.029 |

11 Summary and discussion
In this article we have introduced a new R package, hermiter, for sequential nonparametric estimation. The unifying theme of the package is the implementation of several recently developed algorithms for online sequential estimation based on Hermite series estimators. This package addresses several gaps in the R package ecosystem, as well as the broader scientific computing ecosystem, namely the online estimation of univariate PDFs, CDFs and quantiles, along with bivariate PDFs, CDFs and nonparametric correlation coefficients, in both stationary and non-stationary settings. We have demonstrated competitive performance versus leading existing approaches such as tdigest in R in the sequential quantile estimation setting. In addition, we have demonstrated improved accuracy in sequential nonparametric correlation estimation compared to the only alternative approach known to the authors. The hermiter package does not seek to be a comprehensive collection of online methods for statistical analysis in general, in contrast to a package such as OnlineStats. The authors are developing such a comprehensive online statistics package for R as a work in progress, drawing on the functionality of hermiter. In summary, we expect hermiter to be a compelling choice for online estimation on streaming data and massive data sets.
Computational details
The results in this paper were obtained using R 4.3 (R Core Team, 2023) running on Windows 10. The hardware utilized was an AMD Ryzen 4900H 8-core laptop with 64GB of RAM. The version of our R package utilized was hermiter 2.3.0 (Stephanou, 2023). Versions of hermiter in development are available at https://github.com/MikeJaredS/hermiter. A script to facilitate reproduction of the results in this article is available at https://github.com/MikeJaredS/article_hermiter. The following packages were also utilized in preparation of this work, benchden 1.0.5 (Mildenberger and Weinert, 2012), BH 1.78.0 (Eddelbuettel et al, 2021), ggplot2 3.4.0 (Wickham, 2016), microbenchmark 1.4.9 (Mersmann, 2021), mvtnorm 1.1-3 (Genz et al, 2020; Genz and Bretz, 2009), patchwork 1.1.2 (Pedersen, 2020), randtoolbox 2.0.3 (Christophe and Petr, 2022), Rcpp 1.0.9 (Eddelbuettel and François, 2011), RcppParallel 5.1.5 (Allaire et al, 2022), tdigest 0.4.1 (Rudis et al, 2022). R and all packages used are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/.
Statement and acknowledgements
The views expressed in this article are those of the authors and do not necessarily reflect the views of Rand Merchant Bank. Rand Merchant Bank does not make any representations or give any warranties as to the correctness, accuracy or completeness of the information presented; nor does Rand Merchant Bank assume liability for any losses arising from errors or omissions in the information in this article. We would like to thank Ted Dunning for useful and interesting discussions. We would also like to sincerely thank the editor, associate editor and particularly the reviewers for thorough and deeply insightful feedback that helped us greatly improve this article.
Declarations
On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
- Allaire et al (2022) Allaire J, Francois R, Ushey K, et al (2022) RcppParallel: Parallel programming tools for Rcpp, 5.1.5. https://CRAN.R-project.org/package=RcppParallel. Accessed 2022-12-15
- Bezanson et al (2017) Bezanson J, Edelman A, Karpinski S, et al (2017) Julia: A fresh approach to numerical computing. SIAM Review 59(1):65–98. 10.1137/141000671
- Boyd (2018) Boyd JP (2018) Dynamics of the Equatorial Ocean. Springer Berlin, Heidelberg
- Boyd and Moore (1986) Boyd JP, Moore DW (1986) Summability methods for Hermite functions. Dynamics of Atmospheres and Oceans 10(1):51–62. 10.1016/0377-0265(86)90009-6
- Chen et al (2000) Chen F, Lambert D, Pinheiro JC (2000) Incremental quantile estimation for massive tracking. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 516–522, 10.1145/347090.347195
- Chen and Zhang (2020) Chen Z, Zhang A (2020) A survey of approximate quantile computation on large-scale data. IEEE Access 8:34585–34597. 10.1109/ACCESS.2020.2974919
- Christophe and Petr (2022) Christophe D, Petr S (2022) randtoolbox: Generating and testing random numbers, 2.0.3. https://cran.r-project.org/package=randtoolbox. Accessed 2022-12-15
- Croux and Dehon (2010) Croux C, Dehon C (2010) Influence functions of the Spearman and Kendall correlation measures. Statistical Methods & Applications 19(4):497–515. 10.1007/s10260-010-0142-z
- Day and Zhou (2020) Day J, Zhou H (2020) OnlineStats.jl: A Julia package for statistics on data streams. Journal of Open Source Software 5(46). 10.21105/joss.01816
- Devroye and Gyorfi (1985) Devroye L, Gyorfi L (1985) Nonparametric Density Estimation: The L1 View. John Wiley & Sons Incorporated, New York
- Dunning (2021) Dunning T (2021) The t-digest: Efficient estimates of distributions. Software Impacts 7:100049. 10.1016/j.simpa.2020.100049
- Eddelbuettel and François (2011) Eddelbuettel D, François R (2011) Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40(8):1–18. 10.18637/jss.v040.i08
- Eddelbuettel et al (2021) Eddelbuettel D, Emerson JW, Kane MJ (2021) BH: Boost C++ header files, 1.78.0-0. https://CRAN.R-project.org/package=BH. Accessed 2022-12-15
- Epicoco et al (2020) Epicoco I, Melle C, Cafaro M, et al (2020) UDDSketch: Accurate tracking of quantiles in data streams. IEEE Access 8:147604–147617. 10.1109/ACCESS.2020.3015599
- Gan et al (2018) Gan E, Ding J, Tai KS, et al (2018) Moment-based quantile sketches for efficient high cardinality aggregation queries. Proceedings of the VLDB Endowment 11(11). 10.14778/3236187.3236212
- Genz and Bretz (2009) Genz A, Bretz F (2009) Computation of Multivariate Normal and t Probabilities. Lecture Notes in Statistics, Springer-Verlag, Heidelberg
- Genz et al (2020) Genz A, Bretz F, Miwa T, et al (2020) mvtnorm: Multivariate normal and t distributions, 1.1-3. https://CRAN.R-project.org/package=mvtnorm. Accessed 2022-12-15
- Gibbons and Chakraborti (2010) Gibbons JD, Chakraborti S (2010) Nonparametric Statistical Inference. CRC Press, New York
- Greblicki and Pawlak (1984) Greblicki W, Pawlak M (1984) Hermite series estimates of a probability density and its derivatives. Journal of Multivariate Analysis 15(2):174–182. 10.1016/0047-259X(84)90024-1
- Greblicki and Pawlak (1985) Greblicki W, Pawlak M (1985) Pointwise consistency of the Hermite series density estimate. Statistics & Probability Letters 3(2):65–69. 10.1016/0167-7152(85)90026-4
- Greblicki and Pawlak (2008) Greblicki W, Pawlak M (2008) Nonparametric System Identification. Cambridge University Press, Cambridge
- Greenwald and Khanna (2001) Greenwald M, Khanna S (2001) Space-efficient online computation of quantile summaries. ACM SIGMOD Record 30(2):58–66. 10.1145/376284.375670
- Greenwald and Khanna (2016) Greenwald M, Khanna S (2016) Quantiles and equi-depth histograms over streams. In: Data Stream Management. Springer, p 45–86
- Hammer et al (2019a) Hammer HL, Yazidi A, Rue H (2019a) A new quantile tracking algorithm using a generalized exponentially weighted average of observations. Applied Intelligence 49(4):1406–1420. 10.1007/s10489-018-1335-7
- Hammer et al (2019b) Hammer HL, Yazidi A, Rue H (2019b) Tracking of multiple quantiles in dynamically varying data streams. Pattern Analysis and Applications pp 1–13. 10.1007/s10044-019-00778-3
- Integral (2021) Integral (2021) TrueFX. https://www.truefx.com/. Accessed 2022-12-19
- Jain and Chlamtac (1985) Jain R, Chlamtac I (1985) The P² algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM 28(10):1076–1085. 10.1145/4372.4378
- Jenks (2019) Jenks G (2019) Runstats, 1.8.0. https://pypi.org/project/runstats/. Accessed 2022-12-15
- Jmaei et al (2017) Jmaei A, Slaoui Y, Dellagi W (2017) Recursive distribution estimator defined by stochastic approximation method using Bernstein polynomials. Journal of Nonparametric Statistics 29(4):792–805. 10.1080/10485252.2017.1369538
- Karnin et al (2016) Karnin Z, Lang K, Liberty E (2016) Optimal quantile approximation in streams. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, pp 71–78, 10.1109/FOCS.2016.17
- Liebscher (1990) Liebscher E (1990) Hermite series estimators for probability densities. Metrika 37(6):321–343. 10.1007/BF02613540
- Luo et al (2016) Luo G, Wang L, Yi K, et al (2016) Quantiles over data streams: Experimental comparisons, new analyses, and further improvements. The VLDB Journal 25(4):449–472. 10.1007/s00778-016-0424-7
- Masson et al (2019) Masson C, Rim JE, Lee HK (2019) DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. Proceedings of the VLDB Endowment 12(12):2195–2205. 10.14778/3352063.3352135
- Mersmann (2021) Mersmann O (2021) microbenchmark: Accurate timing functions, 1.4.9. https://CRAN.R-project.org/package=microbenchmark. Accessed 2022-12-15
- Mildenberger and Weinert (2012) Mildenberger T, Weinert H (2012) The benchden package: Benchmark densities for nonparametric density estimation. Journal of Statistical Software 46(i14). 10.18637/jss.v046.i14
- Mitchell et al (2021) Mitchell R, Frank E, Holmes G (2021) An empirical study of moment estimators for quantile approximation. ACM Transactions on Database Systems (TODS) 46(1):1–21. 10.1145/3442337
- Naumov and Martikainen (2007) Naumov V, Martikainen O (2007) Exponentially weighted simultaneous estimation of several quantiles. World Academy of Science, Engineering and Technology 8:563–568. 10.5281/zenodo.1077112
- Pedersen (2020) Pedersen TL (2020) patchwork: The composer of plots, 1.1.2. https://CRAN.R-project.org/package=patchwork. Accessed 2022-12-15
- R Core Team (2023) R Core Team (2023) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org/. Accessed 2023-05-01
- Raatikainen (1987) Raatikainen KE (1987) Simultaneous estimation of several percentiles. Simulation 49(4):159–163. 10.1177/003754978704900405
- Raatikainen (1990) Raatikainen KE (1990) Sequential procedure for simultaneous estimation of several percentiles. Transactions of the Society for Computer Simulation 1:21–44. 10.5555/87953.87955
- Rudis et al (2022) Rudis B, Dunning T, Werner A (2022) tdigest: Wicked fast, accurate quantiles using t-digests, 0.4.1. https://CRAN.R-project.org/package=tdigest. Accessed 2022-12-15
- Schubert and Gertz (2018) Schubert E, Gertz M (2018) Numerically stable parallel computation of (co-) variance. In: Proceedings of the 30th International Conference on Scientific and Statistical Database Management, pp 1–12, 10.1145/3221269.3223036
- Schwartz (1967) Schwartz SC (1967) Estimation of probability density by an orthogonal series. Annals of Mathematical Statistics 38:1261–1265. 10.1214/aoms/1177698795
- Slaoui (2014) Slaoui Y (2014) The stochastic approximation method for estimation of a distribution function. Mathematical Methods of Statistics 23(4):306–325. 10.3103/S1066530714040048
- Slaoui and Jmaei (2019) Slaoui Y, Jmaei A (2019) Recursive density estimators based on Robbins-Monro’s scheme and using Bernstein polynomials. Statistics and Its Interface 12(3):439–455. 10.4310/19-SII561
- Stephanou (2023) Stephanou M (2023) hermiter, 2.3.0. https://cran.r-project.org/package=hermiter. Accessed 2023-05-31
- Stephanou and Varughese (2021a) Stephanou M, Varughese M (2021a) On the properties of Hermite series based distribution function estimators. Metrika 84(4):535–559. 10.1007/s00184-020-00785-z
- Stephanou and Varughese (2021b) Stephanou M, Varughese M (2021b) Sequential estimation of Spearman rank correlation using Hermite series estimators. Journal of Multivariate Analysis 186:104783. 10.1016/j.jmva.2021.104783
- Stephanou et al (2017) Stephanou M, Varughese M, Macdonald I (2017) Sequential quantiles via Hermite series density estimation. Electronic Journal of Statistics 11(1):570–607. 10.1214/17-EJS1245
- Tiwari and Pandey (2019) Tiwari N, Pandey PC (2019) A technique with low memory and computational requirements for dynamic tracking of quantiles. Journal of Signal Processing Systems 91(5):411–422. 10.1007/s11265-017-1327-6
- Van Rossum (2023) Van Rossum G (2023) Python Programming Language. http://www.python.org. Accessed 2023-05-31
- Walter (1977) Walter GG (1977) Properties of Hermite series estimation of probability density. The Annals of Statistics 5(6):1258–1264. 10.1214/aos/1176344013
- Wickham (2016) Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York
- Xiao (2019) Xiao W (2019) Novel online algorithms for nonparametric correlations with application to analyze sensor data. In: 2019 IEEE International Conference on Big Data (Big Data), IEEE, pp 404–412, 10.1109/BigData47090.2019.9006483
- Yazidi and Hammer (2017) Yazidi A, Hammer H (2017) Multiplicative update methods for incremental quantile estimation. IEEE Transactions on Cybernetics 49(3):746–756. 10.1109/TCYB.2017.2779140
- Zhao et al (2021) Zhao F, Maiyya S, Wiener R, et al (2021) KLL±: Approximate quantile sketches over dynamic datasets. Proceedings of the VLDB Endowment 14(7):1215–1227. 10.14778/3450980.3450990