A new method for estimating the tail index using truncated sample sequence ∗
Fuquan Tang, Dong Han⋆
Department of Statistics, School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
ABSTRACT
This article proposes a new truncated estimation method for the tail index $\alpha$ of extremely heavy-tailed distributions with infinite mean or variance. We not only present two truncated estimators of $\alpha$, one for each of two ranges of the index, but also prove their asymptotic statistical properties. Numerical simulations comparing them with six known estimators in terms of estimating error, Type I error, and power show that the two new truncated estimators perform well on the whole.
∗ Supported by National Natural Science Foundation of China (11531001)
⋆ Corresponding author, E-mail: [email protected]
1. Introduction
Heavy-tailed phenomena are widespread and arise in a variety of disciplines such as physics, meteorology, computer science, biology, and finance. Probabilistic and statistical methods for heavy-tailed phenomena have been used to study the magnitude of earthquakes, the diameter of lunar craters, the size of interplanetary fragments, the frequency of words in human languages, and so on [5, 6, 21, 22].
Geography and hydrology are important areas for the study and application of heavy-tailed distributions. In 1998, Anderson and Meerschaert [2] discussed heavy-tailed time series models and provided a periodic ARMA model for the Salt River. In 2022, Merz et al. [18] provided a detailed and coherent review on understanding heavy tails of flood peak distributions and proposed nine hypotheses on the mechanisms generating heavy-tailed phenomena in flood systems. In financial markets, Mandelbrot [16] presented seminal research on cotton prices using heavy-tailed distribution theory. In 2013, Ibragimov et al. [17] found that emerging exchange markets are more pronouncedly heavy-tailed and illustrated that heavy-tailed properties did not change noticeably during the financial and economic crisis period.
There is a large literature on the estimation of the tail index $\alpha$ of heavy-tailed distributions. The size of $\alpha$ measures how thin the tail is: the smaller $\alpha$ is, the higher the probability of a heavy-tailed event. Since Hill put forward the famous Hill estimator in 1975 [11], researchers have proposed many methods for estimating $\alpha$, such as the DPR estimator [19, 20], the QQ estimator [15], the Moment estimator [1], the quantile estimator [8], estimators of the extreme value index in a censorship framework [3, 4, 9, 23], the t-Hill estimator [12, 13], the IPO estimator [14], and so on. More than 100 tail index estimators have been reviewed in two papers [7, 10].
Nearly all of these estimators are based on the order statistics of the observed sample. Moreover, estimators based on order statistics have three unsatisfactory characteristics: (1) they are relatively complex to compute, since order statistics are not easy to calculate for large sample sizes; (2) their mathematical meaning for the tail index is not obvious; (3) there is no explicit expression for their rate of strongly consistent convergence.
To make up for these shortcomings, we propose a new truncated estimation method for the tail index $\alpha$ of heavy-tailed distributions with infinite mean or variance. The two proposed estimators, one for each of two ranges of $\alpha$, are based on the truncated sample mean and the truncated sample second-order moment, respectively. They are relatively easy to calculate, and both their rate of strongly consistent convergence and their asymptotic normality can be obtained.
In Section 2, we present the two truncated estimators and obtain their asymptotic statistical properties. Section 3 compares the two truncated estimators with six known estimators in terms of estimating error, Type I error, and power by numerical simulations. Section 4 provides concluding remarks. The proofs of the three theorems are given in the Appendix.
2. Two truncated estimators
Since a random variable can be decomposed into its positive and negative parts, we consider only nonnegative random variables in this paper.
Let be independent and identically distributed (i.i.d.) with extremely heavy-tailed distribution function for , where the tail index, $\alpha$, is unknown. We know that when , the mean or variance is infinite, and when , the mean is finite but the variance is infinite.
In this section, we present the two truncated estimators of $\alpha$ and prove their asymptotic statistical properties.
To this end, let be a positive truncated sequence satisfying as . Define the truncated random variable , where is the indicator function. The truncated mean , the truncated sample mean , the truncated second-order moment , and the truncated sample second-order moment are then obtained as follows.
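As a minimal sketch of the two sample quantities just named, the Python snippet below computes a truncated sample mean and a truncated sample second-order moment. It assumes that truncation means replacing every observation above the level b by zero while still dividing by the full sample size n; the function names and the Pareto sampler (with P(X > x) = x^(-alpha) for x >= 1) are illustrative assumptions, not definitions taken from the paper.

```python
import random


def truncated_moments(xs, b):
    """Return (truncated sample mean, truncated sample second moment).

    Assumption: the truncated observation is x * 1{x <= b}, so values
    above b contribute zero and both sums are divided by the full n.
    """
    n = len(xs)
    kept = [x for x in xs if x <= b]
    mean_b = sum(kept) / n
    second_b = sum(x * x for x in kept) / n
    return mean_b, second_b


def pareto_sample(alpha, n, rng):
    """Draw n values with P(X > x) = x**(-alpha), x >= 1 (inverse transform)."""
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = pareto_sample(alpha=0.5, n=10_000, rng=rng)
    print(truncated_moments(sample, b=100.0))
```

Both quantities need only a single pass over the data and no sorting, which is the computational point the text makes against order-statistic estimators.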
Hence, we can define two truncated estimators and , which satisfy the following two equations for and for respectively, by replacing , and in equations (3) and (4) with , and , respectively; that is,
(5)
for and
(6)
for .
Taking in equation (5) and in equation (6), respectively, it follows that and . Hence, we can use the following two estimators and to estimate and , respectively.
(7)
(8)
Since it is difficult to obtain the analytic solutions (estimators) and to the two equations and respectively, we present two recursive estimators for as follows:
(9)
for and
(10)
for , where and are two constants.
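The idea of defining an estimator as the root of a truncated-moment equation can be illustrated numerically. The sketch below is not the recursion (9): it swaps in plain bisection and assumes an exact Pareto tail, P(X > x) = x^(-alpha) for x >= 1, for which the truncated mean has the closed form alpha (b^(1-alpha) - 1) / (1 - alpha). The bracket [0.2, 0.95] and the level b = 10^4 are hypothetical choices on which this closed form is decreasing in alpha (it is not monotone over the whole interval (0, 1)).

```python
def pareto_truncated_mean(alpha, b):
    """E[X * 1{X <= b}] for P(X > x) = x**(-alpha), x >= 1, with 0 < alpha < 1."""
    return alpha * (b ** (1.0 - alpha) - 1.0) / (1.0 - alpha)


def solve_alpha(target_mean, b, lo=0.2, hi=0.95, iters=60):
    """Bisection for alpha such that pareto_truncated_mean(alpha, b) == target_mean.

    Assumption: the truncated mean is decreasing in alpha on [lo, hi].
    """
    f_lo = pareto_truncated_mean(lo, b) - target_mean
    f_hi = pareto_truncated_mean(hi, b) - target_mean
    if f_lo < 0.0 or f_hi > 0.0:
        raise ValueError("target_mean is not bracketed on [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if pareto_truncated_mean(mid, b) > target_mean:
            lo = mid  # truncated mean still too large, so alpha must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    b = 1.0e4
    target = pareto_truncated_mean(0.5, b)  # population value for alpha = 0.5
    print(solve_alpha(target, b))
```

In practice the target would be a truncated sample mean rather than the population value used in this demonstration.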
The following theorem shows that the two estimators and can be approximated by the two sequences of estimators and , respectively.
Theorem 1.
Let and satisfy for and large n. Then, both the two equations , and , , have unique solutions and , respectively. If and (or and ), then and (or and ) and
(11)
(12)
where and .
Remark 1.
Take large such that and . Note that and . It follows from the two inequalities (11) and (12) that
The two inequalities above imply that and converge (almost everywhere) at least exponentially fast to and , respectively.
Remark 2.
If we do not know whether lies in the interval or the interval , we may choose the initial values and as follows: take samples (for example, ) such that for and for .
To obtain the asymptotic statistical properties of and , we first give a theorem that describes the asymptotic statistical properties of the truncated sample mean and the truncated sample second-order moment .
Theorem 2.
Assume that the conditions of Theorem 1 hold. Then
(13)
for and
(14)
for , where ”” denotes the convergence in distribution and is the normal distribution.
The following theorem gives the asymptotic statistical properties of the two truncated estimators and .
Theorem 3.
Assume that the conditions of Theorem 1 hold. Then
(15)
for and
(16)
for . Moreover,
(17)
for and , and
(18)
for and .
3. Numerical Simulations
In this section, we compare our two truncated estimators with five other estimators, namely the Hill estimator [11], the QQ estimator [15], the Moment estimator [1], the t-Hill estimator [12, 13], and the t-lgHill estimator [12], in terms of estimating error, Type I error, and power. Since the asymptotic distribution of the IPO estimator [14] is unknown, we only give the estimating error of the IPO estimator, in Section 3.1.
Apart from the two truncated estimators, the other six estimators can be written as
and
where are the order statistics of . , as , . is defined as
and the definitions of and in detail are given in the paper[References].
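For orientation, the best known of these order-statistic estimators, the Hill estimator, can be sketched as below. The code uses the standard textbook form (the reciprocal of the mean log-excess of the k largest observations over the (k+1)-th largest); the function name and the descending ordering convention are assumptions, since the paper's own display is not reproduced here.

```python
import math


def hill_tail_index(xs, k):
    """Hill estimate of the tail index from the k largest observations.

    Assumes all observations are positive and 1 <= k < len(xs).
    """
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < len(xs)")
    top = sorted(xs, reverse=True)[: k + 1]  # X_(1) >= ... >= X_(k+1)
    threshold = math.log(top[k])
    gamma = sum(math.log(x) - threshold for x in top[:k]) / k
    if gamma <= 0.0:
        raise ValueError("the k largest observations are all tied")
    return 1.0 / gamma


if __name__ == "__main__":
    data = [math.exp(3), math.exp(2), math.exp(1), 1.0]
    print(hill_tail_index(data, k=2))
```

The sort is the step that makes order-statistic estimators more expensive than a truncated mean for large samples.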
3.1. The estimating error
In the following simulations, let be the number of samples in each trial and set the truncated sequence to be in the truncated estimator or , where is the index of satisfying . To control the numerical accuracy of the two truncated estimators, we take as the number of iterations (see (11) and (12) in Theorem 1) such that or . Remark 2 provides a method for determining the initial values and .
We first consider and take the initial value . Let for the four estimators , , and and set for the . We set for .
Table 1 and Figure 1 illustrate the numerical simulation results for the seven estimators. All the numerical simulation results in this section were obtained using repetitions.
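Table 1: The estimation for different $\alpha$.

| $\alpha$ | q | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill | IPO |
|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 2.00 | 11 | 0.101 | 0.101 | 0.097 | 0.101 | 0.091 | 0.100 | 0.157 |
| 0.20 | 2.00 | 6 | 0.200 | 0.201 | 0.194 | 0.202 | 0.190 | 0.201 | 0.216 |
| 0.30 | 2.00 | 5 | 0.301 | 0.299 | 0.289 | 0.303 | 0.287 | 0.304 | 0.304 |
| 0.40 | 1.80 | 4 | 0.402 | 0.404 | 0.386 | 0.408 | 0.391 | 0.410 | 0.403 |
| 0.50 | 1.70 | 4 | 0.502 | 0.496 | 0.481 | 0.510 | 0.488 | 0.517 | 0.503 |
| 0.60 | 1.50 | 5 | 0.603 | 0.595 | 0.583 | 0.620 | 0.586 | 0.624 | 0.606 |
| 0.70 | 1.30 | 7 | 0.704 | 0.705 | 0.677 | 0.725 | 0.686 | 0.727 | 0.710 |
| 0.80 | 1.20 | 9 | 0.803 | 0.793 | 0.771 | 0.831 | 0.784 | 0.827 | 0.816 |
| 0.90 | 1.10 | 12 | 0.904 | 0.895 | 0.868 | 0.943 | 0.884 | 0.923 | 0.925 |
| AE | | | 0.002 | 0.004 | 0.017 | 0.016 | 0.013 | 0.015 | 0.016 |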
The AE in the last row of Table 1 denotes the average estimating error; the smaller the AE, the better the estimator. For each of the seven estimators, the average is taken over $\alpha$ = 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90.
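The averaging can be sketched in a few lines of Python. The sketch assumes that the estimating error is the absolute deviation of the tabulated estimate from the true index; this reading reproduces the AE row of Table 1, but the exact formula is an assumption. The function name is illustrative, and the numbers are copied from the first two estimate columns of Table 1.

```python
def average_error(true_values, estimates):
    """Mean absolute deviation between estimates and the true tail indices."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    total = sum(abs(e - t) for t, e in zip(true_values, estimates))
    return total / len(true_values)


alphas = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
# First and second estimate columns of Table 1.
column_1 = [0.101, 0.200, 0.301, 0.402, 0.502, 0.603, 0.704, 0.803, 0.904]
column_2 = [0.101, 0.201, 0.299, 0.404, 0.496, 0.595, 0.705, 0.793, 0.895]

if __name__ == "__main__":
    print(round(average_error(alphas, column_1), 3))  # 0.002
    print(round(average_error(alphas, column_2), 3))  # 0.004
```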
Figure 1: The estimation for different $\alpha$.
It can be seen from Table 1 and Figure 1 that the estimating errors of the truncated estimator are smaller than those of the other six estimators for . Only for is the estimating error of the truncated estimator larger than that of , since . When or , the estimating error of is larger than that of the other six estimators. The average estimating error AE (0.002) of the truncated estimator is the smallest among the seven estimators. That is, the truncated estimator has the best performance in estimating among the seven estimators.
Next, we consider . Take the initial value . Similar to , let be the number of samples. Let the truncated sequence be in the truncated estimator , where is the index of satisfying .
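Table 2: The estimation for different $\alpha$.

| $\alpha$ | q | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill | IPO |
|---|---|---|---|---|---|---|---|---|---|
| 1.10 | 0.80 | 3 | 1.108 | 1.089 | 1.055 | 1.116 | 1.075 | 1.098 | 0.613 |
| 1.20 | 0.70 | 3 | 1.212 | 1.187 | 1.156 | 1.283 | 1.169 | 1.181 | 0.686 |
| 1.30 | 0.65 | 5 | 1.317 | 1.274 | 1.245 | 1.399 | 1.254 | 1.257 | 0.766 |
| 1.40 | 0.63 | 8 | 1.425 | 1.365 | 1.334 | 1.513 | 1.344 | 1.327 | 0.851 |
| 1.50 | 0.61 | 7 | 1.537 | 1.448 | 1.413 | 1.611 | 1.426 | 1.392 | 0.946 |
| 1.60 | 0.60 | 9 | 1.652 | 1.543 | 1.508 | 1.736 | 1.521 | 1.454 | 1.049 |
| 1.70 | 0.60 | 15 | 1.765 | 1.620 | 1.580 | 1.846 | 1.597 | 1.513 | 1.164 |
| 1.80 | 0.58 | 19 | 1.890 | 1.714 | 1.692 | 1.985 | 1.683 | 1.566 | 1.292 |
| 1.90 | 0.45 | 24 | 1.995 | 1.790 | 1.764 | 2.166 | 1.759 | 1.618 | 1.437 |
| AE | | | 0.095 | 0.110 | 0.136 | 0.116 | 0.141 | 0.282 | 0.463 |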
Figure 2: The estimation for different $\alpha$.
Similarly, from Table 2 and Figure 2 we can see that the estimating errors of the truncated estimator are smaller than those of the other six estimators for . Only for and are the estimating errors and of the truncated estimator larger than those of and respectively, since and . The estimating error of is larger than that of the other six estimators for all . The average estimating error AE (0.095) of the truncated estimator is the smallest among the seven estimators. That is, the truncated estimator has the best performance in estimating among the seven estimators.
In short, the two truncated estimators have the best performance in estimating $\alpha$ among the seven estimators.
Remark 3.
The disadvantage of the two truncated estimators is that they need to know the value range of the unknown parameter . If we do not know whether lies in the interval or the interval , we may choose the initial values and according to the method in Remark 2.
3.2. The rejection regions and the Type I Error
To obtain the Type I error, we consider the rejection regions of these estimators, except for the IPO estimator, since we do not know the asymptotic distribution of . Let and denote the null hypothesis and the alternative hypothesis, respectively, that is,
where or . Let the significance level be and in Theorem 3. By using the inequalities (15) and (16) of Theorem 3, we have
for and
for . Therefore, we obtain the two rejection regions and as follows:
for and
for .
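The logic of such a rejection region, and of the Type I error it induces, can be sketched by simulation. The code below is illustrative only: it uses a stand-in estimator (the Pareto maximum-likelihood estimator with asymptotic standard error alpha0 / sqrt(n)) rather than the truncated estimators, assumes an exact Pareto model P(X > x) = x^(-alpha) for x >= 1, and takes the level 0.05 mentioned in the Conclusion. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def rejects(estimate, alpha0, std_error, level=0.05):
    """Two-sided test of H0: alpha = alpha0 for an asymptotically normal estimate."""
    z_crit = NormalDist().inv_cdf(1.0 - level / 2.0)
    return abs(estimate - alpha0) > z_crit * std_error


def pareto_mle(xs):
    """Stand-in estimator: MLE of alpha when P(X > x) = x**(-alpha), x >= 1."""
    return len(xs) / sum(math.log(x) for x in xs)


def type_one_error(alpha0, n, reps, seed=0, level=0.05):
    """Fraction of samples drawn under H0 on which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [(1.0 - rng.random()) ** (-1.0 / alpha0) for _ in range(n)]
        estimate = pareto_mle(xs)
        # Under H0 the MLE has asymptotic standard error alpha0 / sqrt(n).
        if rejects(estimate, alpha0, alpha0 / math.sqrt(n), level):
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    print(type_one_error(alpha0=0.5, n=400, reps=1000))
```

A well-calibrated test returns a rejection frequency close to the chosen level, which is the criterion applied to the Type I error tables in this section.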
Since the five estimators, , , , and satisfy
and
we can similarly obtain the five rejection regions , , , and at the significance level , respectively, as follows:
and
As in Section 3.1, we first consider and set the initial value . Let for the four estimators , , and and set for .
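Table 3: The Type I Error for different $\alpha$.

| $\alpha$ | q | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill |
|---|---|---|---|---|---|---|---|---|
| 0.10 | 2.00 | 11 | 0.038 | 0.058 | 0.074 | 0.053 | 0.171 | 0.009 |
| 0.20 | 2.00 | 6 | 0.052 | 0.061 | 0.073 | 0.050 | 0.106 | 0.006 |
| 0.30 | 2.00 | 5 | 0.046 | 0.062 | 0.071 | 0.052 | 0.107 | 0.017 |
| 0.40 | 1.80 | 4 | 0.050 | 0.055 | 0.065 | 0.057 | 0.069 | 0.041 |
| 0.50 | 1.70 | 4 | 0.055 | 0.050 | 0.074 | 0.042 | 0.076 | 0.068 |
| 0.60 | 1.50 | 5 | 0.055 | 0.061 | 0.068 | 0.059 | 0.088 | 0.095 |
| 0.70 | 1.30 | 7 | 0.046 | 0.050 | 0.067 | 0.054 | 0.069 | 0.053 |
| 0.80 | 1.20 | 9 | 0.026 | 0.053 | 0.072 | 0.047 | 0.071 | 0.051 |
| 0.90 | 1.10 | 12 | 0.098 | 0.053 | 0.082 | 0.070 | 0.074 | 0.020 |
| AT | | | 0.052 | 0.056 | 0.072 | 0.054 | 0.092 | 0.040 |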
Analogously to the average estimating error AE, we define the average Type I error, AT, under the significance level . The closer AT is to this level, the better the estimator.
Figure 3: The Type I Error for different $\alpha$.
From Table 3 and Figure 3 we can see that the AT of the truncated estimator (0.052) is closer to the nominal level than that of the other five estimators. Thus, the truncated estimator can be said to perform better than the other five estimators in this case.
Next, we consider and set the initial value . Similar to , let be the number of samples and the truncated sequence be in the truncated estimator , where is the index of satisfying .
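Table 4: The Type I Error for different $\alpha$.

| $\alpha$ | q | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill |
|---|---|---|---|---|---|---|---|---|
| 1.10 | 0.80 | 3 | 0.043 | 0.063 | 0.089 | 0.060 | 0.078 | 0.004 |
| 1.20 | 0.70 | 3 | 0.041 | 0.052 | 0.064 | 0.068 | 0.072 | 0.007 |
| 1.30 | 0.65 | 5 | 0.049 | 0.054 | 0.071 | 0.057 | 0.090 | 0.048 |
| 1.40 | 0.63 | 8 | 0.053 | 0.079 | 0.081 | 0.068 | 0.095 | 0.190 |
| 1.50 | 0.61 | 7 | 0.067 | 0.101 | 0.101 | 0.049 | 0.119 | 0.489 |
| 1.60 | 0.60 | 9 | 0.090 | 0.079 | 0.090 | 0.062 | 0.094 | 0.803 |
| 1.70 | 0.60 | 15 | 0.094 | 0.104 | 0.096 | 0.056 | 0.135 | 0.964 |
| 1.80 | 0.58 | 19 | 0.093 | 0.097 | 0.084 | 0.059 | 0.128 | 0.999 |
| 1.90 | 0.45 | 24 | 0.199 | 0.125 | 0.112 | 0.069 | 0.151 | 1.000 |
| AT | | | 0.081 | 0.084 | 0.088 | 0.061 | 0.107 | 0.500 |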
Figure 4: The Type I Error for different $\alpha$.
From Table 4 and Figure 4 we can see that the AT of the truncated estimator (0.081) is closer to the nominal level than that of the other estimators, except for the Moment estimator, whose average Type I error is 0.061.
3.3. Power of estimator
In this section we consider the power of each estimator, that is, the probability of correctly rejecting the null hypothesis at the significance level . We consider two null hypotheses, and , respectively. Taking and several different tail indices , we obtain the corresponding estimators , , and . We similarly define the average power as , where denotes the power of , , and .
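To make the notions of power and average power concrete, the sketch below evaluates a normal-approximation power function for a two-sided test and averages a list of powers. It assumes, purely for illustration, an estimator distributed approximately as N(alpha, alpha^2 / k), a Hill-type variance that is not taken from Theorem 3; the function names are hypothetical. The list in the usage example is the first power column of Table 5.

```python
import math
from statistics import NormalDist


def approx_power(alpha0, alpha1, k, level=0.05):
    """Approximate power of the two-sided test of H0: alpha = alpha0 at alpha1.

    Assumption: the estimate is N(alpha, alpha**2 / k) when the truth is alpha.
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - level / 2.0)
    se0 = alpha0 / math.sqrt(k)  # standard error used by the test (under H0)
    se1 = alpha1 / math.sqrt(k)  # spread of the estimate under the alternative
    upper = 1.0 - normal.cdf((alpha0 + z_crit * se0 - alpha1) / se1)
    lower = normal.cdf((alpha0 - z_crit * se0 - alpha1) / se1)
    return upper + lower


def average_power(powers):
    """Arithmetic mean of a list of powers."""
    return sum(powers) / len(powers)


if __name__ == "__main__":
    for alpha1 in (0.5, 0.6, 0.8):
        print(alpha1, approx_power(0.5, alpha1, k=100))
    table5_first_column = [0.055, 0.137, 0.552, 0.841, 0.967,
                           0.990, 0.995, 1.000, 1.000]
    print(round(average_power(table5_first_column), 3))  # 0.726
```

At alpha1 = alpha0 the function returns the level itself, so the first row of a power table doubles as a Type I error check.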
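Table 5: The power for the first null hypothesis with different $\alpha$.

| $\alpha$ | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill |
|---|---|---|---|---|---|---|---|
| | 5 | 0.055 | 0.056 | 0.062 | 0.049 | 0.064 | 0.076 |
| | 5 | 0.137 | 0.067 | 0.029 | 0.086 | 0.049 | 0.777 |
| | 6 | 0.552 | 0.194 | 0.037 | 0.215 | 0.083 | 0.999 |
| | 7 | 0.841 | 0.330 | 0.108 | 0.342 | 0.147 | 1.000 |
| | 8 | 0.967 | 0.536 | 0.179 | 0.509 | 0.269 | 1.000 |
| | 5 | 0.990 | 0.760 | 0.295 | 0.657 | 0.440 | 1.000 |
| | 9 | 0.995 | 0.875 | 0.407 | 0.772 | 0.604 | 1.000 |
| | 7 | 1.000 | 0.945 | 0.558 | 0.893 | 0.709 | 1.000 |
| | 9 | 1.000 | 0.985 | 0.628 | 0.925 | 0.835 | 1.000 |
| AP | | 0.726 | 0.528 | 0.256 | 0.494 | 0.356 | 0.872 |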
Table 5 and Figure 5 illustrate the power and the average power AP of the six estimators. The average power AP of the truncated estimator is 0.726, which is larger than that of the other estimators except the t-lgHill estimator, whose average power is 0.872.
Figure 5: The power for the first null hypothesis with different $\alpha$.
Next we consider . It can be seen from Table 6 and Figure 6 that the power of the truncated estimator is larger than that of the other five estimators for , , and that its average power AP (0.781) is the largest among all six estimators.
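Table 6: The power for the second null hypothesis with different $\alpha$.

| $\alpha$ | k | Truncated | Hill | QQ | Moment | t-Hill | t-lgHill |
|---|---|---|---|---|---|---|---|
| | 8 | 0.068 | 0.066 | 0.084 | 0.067 | 0.096 | 0.170 |
| | 8 | 0.383 | 0.038 | 0.040 | 0.088 | 0.044 | 0.006 |
| | 7 | 0.820 | 0.076 | 0.036 | 0.120 | 0.055 | 0.008 |
| | 8 | 0.978 | 0.164 | 0.058 | 0.181 | 0.107 | 0.125 |
| | 11 | 0.999 | 0.256 | 0.080 | 0.207 | 0.176 | 0.578 |
| | 12 | 1.000 | 0.414 | 0.142 | 0.331 | 0.264 | 0.925 |
| | 18 | 1.000 | 0.562 | 0.196 | 0.375 | 0.389 | 0.995 |
| | 26 | 1.000 | 0.678 | 0.254 | 0.423 | 0.518 | 1.000 |
| AP | | 0.781 | 0.282 | 0.111 | 0.224 | 0.206 | 0.476 |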
Figure 6: The power for the second null hypothesis with different $\alpha$.
The second-largest average power in Table 5 and the largest average power in Table 6 indicate that, on the whole, the two truncated estimators perform more robustly than the other five estimators.
4. Conclusion
To make up for the shortcomings of existing estimation methods, we present a new truncated estimation method for the tail index of extremely heavy-tailed distributions with infinite mean or variance. Using the truncated sample mean, the truncated sample second-order moment, and the two recursive estimators in equations (9) and (10), we obtain the two truncated estimators for the two ranges of the tail index. We not only give the rate of strongly consistent convergence of the two truncated estimators, but also prove that their asymptotic distributions are normal. Moreover, the numerical simulation results show that, among the estimators compared, the two truncated estimators have the smallest average estimating error; in addition, one of them has the average Type I error closest to the nominal 0.05, and one of them has the largest average power. In short, the two new truncated estimators perform well on the whole.
Acknowledgments
The authors are grateful to the referees for their careful reading of this paper and valuable comments.
Declaration of interest statement
The authors declare that they have no conflict of interest.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
[1] A. L. M. Dekkers, J. H. J. Einmahl, L. De Haan. (1989). A Moment Estimator for the Index of an Extreme-Value Distribution. Ann. Statist. 17(4):1833-1855. doi:10.1214/aos/1176347397.
[2] Anderson, P. L., Meerschaert, M. M.(1998). Modeling river flows with heavy tails. Water Resources Research 34(9):2271-2280. doi:10.1029/98WR01449.
[3] Beirlant, J., Worms, J., Worms, R. (2018). Estimation of the extreme value index in a censorship framework: Asymptotic and finite sample behavior. Journal of Statistical Planning and Inference 202:31-56. doi:10.1016/j.jspi.2019.01.004.
[4] Bladt, M., Albrecher, H., Beirlant, J. (2021). Trimmed extreme value estimators for censored heavy-tailed data. Electronic Journal of Statistics 15(1):3112-3136. doi:10.1214/21-EJS1857.
[5] Bowers, M. C., Tung, W. W., Gao, J. B. (2012). On the distributions of seasonal river flows: Lognormal or power law? Water Resources Research 48(5). doi:10.1029/2011WR011308.
[6] Cooke, R., Nieboer, D., Misiewicz, J. (2014). Fat-Tailed Distributions: Data, Diagnostics and Dependence. ISTE Ltd and John Wiley & Sons, Inc. doi:10.1002/9781119054207.
[7] Fedotenkov, I. (2018). A review of more than one hundred Pareto-tail index estimators. Research Papers in Economics. University Library of Munich, Germany.
[8] Girard, S., Stupfler, G., Usseglio-Carleve, A. (2020). An $L^p$-quantile methodology for tail index estimation. HAL Id: hal-02311609.
[9] Goegebeur, Y., Guillou, A., Qin, J. (2019). Bias-corrected estimation for conditional Pareto-type distributions with random censoring. Extremes 22:459-498. doi:10.1007/s10687-019-00341-7.
[10] Gomes, M. I., Guillou, A. (2015). Extreme Value Theory and Statistics of Univariate Extremes: A Review. International Statistical Review 83:263-292. doi:10.1111/insr.12058.
[11] Hill, B. M. (1975). A simple approach to inference about the tail of a distribution. Ann. Statist. 3(5):1163-1174. doi:10.1214/aos/1176343247.
[12] Jordanova, P., Fabian, Z., Hermann, P., Strelec, L., Rivera, A., Girard, S., Torres, S., Stehlik, M. (2016). Weak properties and robustness of t-Hill estimators. Extremes 19:591-626. doi:10.1007/s10687-016-0256-2.
[13] Jordanova, P. K., Pancheva, E. I. (2012). Weak asymptotic results for t-Hill estimator. Comptes rendus de l’Académie bulgare des sciences: sciences mathématiques et naturelles 65(12):1649-1656.
[14] Jordanova, P., Stehlik, M. (2020). IPO estimation of heaviness of the distribution beyond regularly varying tails. Stochastic Analysis and Applications 38(1):76-96. doi:10.1080/07362994.2019.1647786.
[15] Kratz, M. and Resnick, S. I. (1996). The qq-estimator and heavy tails. Communications in Statistics, Stochastic Models. 12(4):699-724. doi:10.1080/15326349608807407.
[16] Mandelbrot, B.B. (1963). The Variation of Certain Speculative Prices. The Journal of Business 36:371-418. doi:10.1007/978-1-4757-2763-0_14.
[17] Ibragimov, M., Ibragimov, R., Kattuman, P. (2013). Emerging markets and heavy tails. Journal of Banking & Finance 37(7):2546-2559. doi:10.1016/j.jbankfin.2013.02.019.
[18] Merz, B., Basso, S., Fischer, S., Lun, D., Blöschl, G., Merz, R., et al. (2022). Understanding heavy tails of flood peak distributions. Water Resources Research 58(6). doi:10.1029/2021WR030506.
[19] Paulauskas, V. (2003). A new estimator for a tail index. Acta Applicandae Mathematicae 79:55-67. doi:10.1023/A:1025818424104.
[20] Paulauskas, V., Vaičiulis, M. (2011). Several modifications of DPR estimator of the tail index. Lithuanian Mathematical Journal 51:36-50. doi:10.1007/s10986-011-9106-8.
[21] Resnick, S. I. (1989). Extreme values, regular variation, and point processes. Journal of the American Statistical Association. 84(407):845. doi:10.2307/2289692.
[22] Resnick, S. I. (2007). Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer Verlag. Springer Series in Operations Research and Financial Engineering.
[23] Worms, J. and Worms, R. (2018). Extreme value statistics for censored data with heavy tails under competing risks. Metrika 81:849-889. doi:10.1007/s00184-018-0662-3
Appendix: Proofs of the Theorems
Proof of Theorem 1. Let for and for . Let
It follows that both and have two real roots, and , respectively, i.e.
for large such that , and . Since
it follows that for and for and . Hence, is monotonically increasing for since for .
Let . Note that for large . By using , (5) and the probability of in (13) of Theorem 2, we can get that
with high probability for large . On the other hand, , and since for large , it follows that for large . Thus, has a unique root such that , i.e. for large . Note that as ; therefore, has a unique root for large .
Similarly, from
it follows that for and for and . Hence, is monotonically increasing for since for .
Let . By equation (6) and the probability of in (14) of Theorem 2, we can get that
with high probability for large since . On the other hand, since , and , it follows that . Thus, has a unique root such that , i.e. for large . Note that as ; therefore, has a unique root for large .
Note that the functions () is monotonically increasing for large . Let . Since , it follows and therefore, . Through step-by-step iteration, we can get . Furthermore, by (3), (5) and (9) we have
Let and . By (8), the Bernstein inequality and the Lyapunov central limit theorem, we can similarly prove (18) since for large . This completes the proof.