Law of the logarithm for the maximum interpoint distance constructed by high-dimensional random matrix
Abstract
Suppose is an array of i.i.d. real random variables. Let be positive integers. Consider the maximum interpoint distance where and denote the -th and -th rows of the matrix , respectively. This paper shows the laws of the logarithm for under two high-dimensional settings: the polynomial rate and the exponential rate. The proofs rely on the moderation deviation principle of the partial sum of i.i.d. random variables, the Chen–Stein Poisson approximation method and Gaussian approximation.
Keywords
Maximum interpoint distance, law of the logarithm, Chen–Stein Poisson approximation, moderation deviation, Gaussian approximation
Mathematics Subject Classification: Primary 60F15; secondary 60B12.
1 Introduction
Consider a -dimensional population represented by a random vector with mean and covariance matrix , where is the identity matrix. Let be a random matrix whose rows are an independent and identically distributed (i.i.d.) random sample of size from the population . Write for the Euclidean norm on . Let
(1) |
denote the maximum interpoint distance or diameter. It is clear that the resulting value of will be maximized if we choose two vectors and with nearly opposing directions and nearly the largest magnitudes, that is, is mainly obtained by outliers. Therefore, this makes less useful for goodness of fit tests, but may be appropriate for detecting outliers.
In the past thirty years, several authors mainly studied the limiting distribution of the largest interpoint distance . For the multidimensional situation, Matthews and Rukhin [15] assumed that are independent standard normal random vectors in and obtained the asymptotic distribution of . Henze and Klein [7] generalized the result of Matthews and Rukhin [15] by embedding the multivariate normal distribution into the symmetric Kotz type distribution and corrected some errors. Jammalamadaka and Janson [9] considered a more general spherically symmetric distribution and proved that with a suitable normalization asymptotically obeys a Gumbel type distribution.
If the distribution of has an unbounded support, Henze and Lao [8] proved that the limiting distribution of is none of the three types of classical extreme-value distributions (Gumbel, Weibull and Fréchet distributions) when the distribution function of satisfies as for some and some slowly varying function . Demichel et al. [4] obtained the asymptotic behavior of when the random vector in has an elliptical distribution.
In the bounded case, Appel et al. [2] discovered if has a uniform distribution in a planar set with unique major axis and sub- decay of its boundary at the endpoints, then the limiting distribution of is a convolution of two independent Weibull distributions. Furthermore, Appel and Russo [1] investigated the case where i.i.d. points are uniformly distributed on the surface of a unit hypersphere in and derived the limiting distribution of the maximum pairwise distance. Mayer and Molchanov [16] proved that the limiting distribution of is a Weibull distribution when i.i.d. points are uniformly distributed in the unit -dimensional ball. Lao [11] further assumed that obeys some distributions in the unit square, the uniform distribution in the unit hypercube, and the uniform distribution in a regular convex polygon. Jammalamadaka and Janson [9] obtained a Gumbel limit distribution for if are i.i.d. -valued random vectors with a spherically symmetric distribution. Schrempp [19] relaxed the condition to that random points are in a -dimensional ellipsoid with a unique major axis. Furthermore, Schrempp [20] considered the case where random points are in a -dimensional set with a unique diameter and a smooth boundary at the poles. Tang et al. [22] assumed are a random sample from a -dimensional population with independent sub-exponential components and obtained the limiting distribution of . Heiny and Kleemann [6] showed that converges weakly to a Gumbel distribution under some moment assumptions and corresponding conditions on the growth rate of .
The aforementioned papers mainly focus on the asymptotic distribution of as well as relaxing the range of relative to . The logarithmic law for remains largely unknown. In this paper, we assume that the sample size depends on the population dimension . We will study the strong limiting theorems for the maximum interpoint distance under two high-dimensional cases. They are the polynomial rate with and and the exponential rate with and where , , and are positive constants. Furthermore, we generalize the result for to the maximum interpoint -norm distance and perform a Monte Carlo simulation to validate our main results.
The laws of the logarithm for different statistics have previously been studied by some authors, including Jiang [10] (sample correlation matrix), Li and Rosalsky [12] (sample correlation matrix), and Ding [5] (random tensor). In addition, we refer to Li [13], Modarres [17], Modarres and Song [18], and Song and Modarres [21] for several other investigates on interpoint distance.
2 Main results
Throughout this paper, let be a random matrix where are i.i.d. random variables with mean and variance , let be the rows of . And some notations will be used in this paper. The symbol implies convergence in probability, means convergence in distribution, and means almost sure convergence. For two sequences of positive numbers and , implies , means . We write if , and if there exists a universal constant such that . In addition, is a constant and may vary from line to line.
We first study the law of the logarithm for under the assumption that is a polynomial power of .
Theorem 2.1.
Assume
where and are constants not depending on . Then, the following holds as :
(2) |
Then, we will consider that is an exponential power of .
Theorem 2.2.
Corollary 2.3.
Let be a random sample from the standard multivariate normal population. Assume for any as . Then, the following holds as :
Remark 2.4.
Note that the results of Theorems 2.1 and 2.2 still hold for arbitrary mean where , since is invariant under translation of the vectors . Therefore, without loss of generality, we assume are i.i.d. random variables with mean in the proofs. By some calculations, we have
So the condition is equivalent to as shown by Tang et al. [22].
3 Proofs
3.1 Some technical tools
We first show the Chen–Stein Poisson approximation method, which is a special case of Theorem 1 from Arratia et al. [3].
Lemma 3.1.
Let be an index set and be a set of subsets of , that is, for each . Let be random variables. For a given , set . Then we have
where
and is the -algebra generated by . In particular, if is independent of for each , then .
The following lemma is about the moderate deviation of the partial sum of i.i.d. random variables (Linnik [14]).
Lemma 3.2.
Suppose that is a sequence of i.i.d. random variables with and . Define .
(1) If for some and . Then
for any , .
(2) If for some and . Then
holds uniformly for .
3.2 Proof of Theorem 2.1
Now we are in a position to prove Theorem 2.1.
Proof.
Review Remark 2.4. Then, without loss of generality, we prove the theorem by assuming are i.i.d. random variables with mean . Recall (1). Write
(3) |
Define
It is obvious that
for each . Then, we find that
(4) |
Define
(5) |
where . To prove Theorem 2.1, it is sufficient to show that
(6) |
(7) |
and
(8) |
We will complete the proof with three steps.
Step 1: proof of (6). Given , set . For any integer , we have
(9) | ||||
where
(10) | ||||
Set and . Then, we have
Note that . By Theorem 1.1 from Zaĭtsev [23], there exists a random variable such that
(11) |
(12) |
For the exponential term, we have
(13) |
as . By the formula that , we get
(14) | ||||
as . We use the fact that in the above inequality. By (11)–(14), we can obtain
as since . By the Borel–Cantelli lemma, we arrive at
(15) |
Trivially, due to for any as is sufficiently large. By Markov inequality, we have
as . Then, by (10), Ottaviani’s inequality and the same argument as in (11)–(14), we obtain
for sufficient large . Since , we see
as . Therefore, by the Borel–Cantelli lemma again,
(16) |
Review (9)–(10) and (15)–(16). We obtain that
Choosing small enough, we then get (6).
Step 2: proof of (7). Given , set . Set
For , define
and
Note that is independent of . By Lemma 3.1, we have
(17) |
where
and
We first calculate . By the same argument as in (11)–(14), we have
(18) | ||||
as . Then, it follows from (18) that
(19) | ||||
as .
Finally, we estimate . Set . By Theorem 1.1 from Zaĭtsev [23], there exist two normal random variables and such that
(20) | ||||
where and . For the exponential term,
(21) |
as is sufficiently large. Since , we have
(22) | ||||
as . By some calculations, we have
as . Therefore, if , then as . Next, from (17)–(22), we obtain that
as is sufficiently large, where depends on and . Take an integer such that
as . Then, by the Borel–Cantelli lemma,
(23) |
Recalling (10), we find
Combining the above inequality, (16) and (23), we obtain that
Then, (7) follows from the arbitrariness of .
3.3 Proof of Theorem 2.2
Proof.
We continue to use the notations in the proof of Theorem 2.1. Note that
due to for some and . Then, by Lemma 3.2, we have
(24) | ||||
as since . Next, by the Borel–Cantelli lemma, one has
(25) |
Review the proof of (7), the definitions of , and . For each , set . By Lemma 3.1, we have
where
By Lemma 3.2 and the same argument as in (24), we have
as . Furthermore, one can get that
for sufficiently large . Write
By some calculations, we see
for each . Moreover, we know for some and . By Lemma 3.2, we then have that
as . If , then and
as for some constant depending on and . Therefore, we see from the Borel–Cantelli lemma that
Combining the above inequality, (4) and (25), the desired conclusion follows from the arbitrariness of . ∎
4 Applications
4.1 -norm
Instead of studying the Euclidean norm distance, we will consider a more general distance, that is, -norm distance in . For a -dimensional random vector , define the -norm of by
where . And the maximum interpoint -norm distance is denoted by
Similarly to Theorem 2.1, one can deduce the law of the logarithm for as follows.
Theorem 4.1.
Assume
where and are constants not depending on . Then, the following holds as :
(26) |
Proof.
By Theorem 2.2 and the same method as above, we obtain the following result.
Theorem 4.2.
4.2 Simulation resluts
In this section, we carry out a Monte Carlo simulation on the law of logarithm for the maximum interpoint distance discussed in the previous section, to illustrate the validity of our results. We assume that are i.i.d. -distributed random variables. For each Monte Carlo iteration, we calculate the value of
In our simulation, we perform Monte Carlo iterations and work on four different combinations of with =(150, 100), =(200, 200), =(500, 250), and =(600, 400). Figures 1 and 2 present scatter diagrams that compare the values of and 2. The observed results indicate that the value of are uniformly distributed around 2, with no discernible bias between the scatter plots and 2, a trend accentuated by the substantial values of both and . This outcome serves as a compelling illustration of the validity of the results derived from the law of logarithm for the maximum interpoint distance . The systematic exploration of various combinations enriches the understanding of the behavior of in the context of this simulation, adding depth and credibility to the findings.




Acknowledgements
The authors thank anonymous reviewers for giving valuable comments and suggestions in improving the manuscript. This paper was supported by National Natural Science Foundation of China (Grant No. 11771178, 12171198); the Science and Technology Development Program of Jilin Province (Grant No. 20210101467JC); Technology Program of Jilin Educational Department during the ”14th Five-Year” Plan Period (Grant No. JJKH20241239KJ) and Fundamental Research Funds for the Central Universities.
References
- [1] Appel, M. J., Russo, R. P.: Limiting distributions for the maximum of a symmetric function on a random point set. J. Theor. Probab., 19(2), 365–375 (2006)
- [2] Appel, M. J. B., Najim, C. A., Russo, R. P.: Limit laws for the diameter of a random point set. Adv. Appl. Probab., 34(1), 1–10 (2002)
- [3] Arratia, R., Goldstein, L., Gordon, L.: Two moments suffice for Poisson approximations: the Chen-Stein method. Ann. Probab., 17, 9–25 (1989)
- [4] Demichel, Y., Fermin, A. K., Soulier, P.: The diameter of an elliptical cloud. Electron. J. Probab., 20(27), 1–32 (2015)
- [5] Ding, X.: Strong limit theorem for largest entry of large-dimensional random tensor. Random Matrices Theory Appl., doi: 10.1142 (2023)
- [6] Heiny, J., Kleemann, C.: Maximum interpoint distance of high-dimensional random vectors. arXiv: 2302.06965 (2023)
- [7] Henze, N., Klein, T.: The limit distribution of the largest interpoint distance from a symmetric kotz sample. J. Multivariate Anal., 57(2), 228–239 (1996)
- [8] Henze, N., Lao, W.: The limit distribution of the largest interpoint distance for power-tailed spherically decomposable distributions and their affine images. Preprint, Karlsruhe Institute of Technology, (2010)
- [9] Jammalamadaka, S. R., Janson, S.: Asymptotic distribution of the maximum interpoint distance in a sample of random vectors with a spherically symmetric distribution. Ann. Appl. Probab., 25(6), 3571–3591 (2015)
- [10] Jiang, T.: The asymptotic distributions of the largest entries of sample correlation matrices. Ann. Appl. Probab., 14(2), 865–880 (2004)
- [11] Lao, W.: Some weak limit laws for the diameter of random point sets in bounded regions, KIT Scientific Publishing, Karlsruhe, 2010
- [12] Li, D., Rosalsky A.: Some strong limit theorems for the largest entries of sample correlation matrices. Ann. Appl. Probab., 16, 423–447 (2006)
- [13] Li, J.: Asymptotic distribution-free change-point detection based on interpoint distances for high-dimensional data. J. Nonparametr. Stat., 32(1), 157–184 (2020)
- [14] Linnik, J. V.: On the probability of large deviations for sums of independent variables. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. II, Univ. California Press, Berkeley, Calif., pp. 289–306 (1961)
- [15] Matthews, P. C., Rukhin A. L.: Asymptotic distribution of the normal sample range. Ann. Appl. Probab., 3, 454–466 (1993)
- [16] Mayer, M., Molchanov, I.: Limit theorems for the diameter of a random sample in the unit ball. Extremes, 10(3), 129–150 (2007)
- [17] Modarres, R.: Multinomial interpoint distances. Statist. Papers, 59(1), 341–360 (2018)
- [18] Modarres, R., Song, Y.: Interpoint distances: applications, properties, and visualization. Appl. Stoch. Models Bus. Ind., 36(6), 1147–1168 (2020)
- [19] Schrempp, M.: The limit distribution of the largest interpoint distance for distributions supported by a -dimensional ellipsoid and generalizations. Adv. Appl. Probab., 48(4), 1256–1270 (2015)
- [20] Schrempp, M.: Limit laws for the diameter of a set of random points from a distribution supported by a smoothly bounded set. Extremes, 22(1), 167–191 (2019)
- [21] Song, Y., Modarres, R.: Interpoint distance test of homogeneity for multivariate mixture models. Int. Stat. Rev., 87(3), 613–638 (2019)
- [22] Tang, P., Lu, R., Xie, J.: Asymptotic distribution of the maximum interpoint distance for high-dimensional data. Statist. Probab. Lett., 190, 109567 (2022)
- [23] Zaĭtsev, A. Y.: On the Gaussian approximation of convolutions under multidimensional analogues of S. N. Bernstein’s inequality conditions. Probab. Theory Related Fields, 74(4), 535–566 (1987)