Detecting Outliers in Multiple Sampling Results Without Thresholds
Abstract
Bayesian statistics emphasizes the importance of prior distributions, yet finding an appropriate one is practically challenging. When multiple sampling results are obtained for the frequency of the same event, these samples may be influenced by different selection effects. In the absence of suitable prior distributions to correct for these selection effects, it is necessary to exclude outlier sampling results to avoid compromising the final result. However, defining outliers based on different thresholds may change the result, which makes the result less persuasive. This work proposes a definition of outliers that does not require setting thresholds.
1 Introduction
People often determine the probability of occurrence of an event through random sampling, but the result is unreliable if the sample size is too small. A probability density function, which depends on both the sample size and the number of events, is superior to a single probability value. Bayesian statistics goes further by emphasizing the importance of the prior distribution; for example, if there is a strong selection effect during sampling, no matter how large the sample is, the result will still be unreliable. However, it is difficult to obtain an appropriate prior distribution in practical situations, and sometimes we may not even be aware of a selection effect during sampling, mistakenly assuming that all samples are equally weighted. This is especially common in social investigations and astronomical spectroscopic surveys: every investigation or observation is carried out under different conditions, and it is difficult to assess the selection effect of each of them. Bayes linear statistics [1] takes this issue into account, but this work argues that sampling results with strong selection effects should be identified first, and they can be defined as outliers.
Sometimes sampling results with strong selection effects can be identified manually. Regardless of whether they can be, anyone who wants to exclude some sampling results must either find clear evidence of a problem within those results or classify them as outliers using a strict definition; otherwise there may be a suspicion of cheating. Of course, sometimes the majority of samples make the same error, and the best sampling results may end up being the outliers.
Methods based on the standard score (Z-score) can be employed to find outliers. However, it is difficult for them to account for the impact of sample size unless each sampling result is weighted by its size, and there is no unified form for such weighting; moreover, thresholds must be set. Methods that require setting thresholds lack persuasiveness because the conclusions may differ under different thresholds.
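For contrast, the following is a minimal sketch of such a threshold-based check; the weighting by sample size and the threshold value used here are illustrative choices only, not part of the method proposed in this paper.

import numpy as np

def zscore_outliers(N, n, threshold=2.0):
    # Flag sampling results whose observed proportion deviates from a
    # sample-size-weighted mean by more than `threshold` standard deviations.
    # Both the weighting and the threshold are arbitrary choices.
    N, n = np.asarray(N, float), np.asarray(n, float)
    p = N / n                                  # observed proportions
    mean = np.average(p, weights=n)            # one possible weighting
    std = np.sqrt(np.average((p - mean) ** 2, weights=n))
    z = (p - mean) / std
    return np.where(np.abs(z) > threshold)[0]  # indices of flagged results

# e.g. zscore_outliers([15, 11, 7, 29, 100], [30, 20, 15, 60, 200])

Changing the weighting scheme or the threshold can change which results are flagged, which is exactly the ambiguity the following method avoids.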
The method proposed in this paper uses probability density functions to account for the impact of sample size and defines outliers for multiple random sampling results without setting thresholds. For a set of probability density functions corresponding to multiple sampling results, under the definition of this work there may be no outliers, one outlier, or multiple outliers. Sometimes all probability density functions in the set are outliers, resembling a “fragmented” set, which indicates extremely unstable sampling quality and cannot give reliable results.
2 Method for finding outliers
When there is no selection effect, assuming N events are detected, the larger the sample size n, the more reliable the estimate N/n. However, N/n alone cannot reflect n, so a probability density function is needed to replace it. The larger n is, the smaller the information entropy [5] of the corresponding probability density function. Now suppose we want to investigate how many stars in a sky area are giants; we perform spectroscopic observations of that region and obtain spectra for n stars, finding through analysis that N of them are giants. The probability of finding N giants among the n stars, given the proportion of giants θ in this sky area, follows the binomial distribution

(1)  P(N \mid n, \theta) = \binom{n}{N}\, \theta^{N} (1-\theta)^{n-N}.
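As a quick numerical check of Equation 1 (a minimal sketch; the numbers n = 30, N = 15, θ = 0.5 are illustrative only), the binomial probability can be evaluated directly with scipy:

from scipy.stats import binom

# Probability of finding exactly N giants among n observed stars,
# given a true giant fraction theta (Equation 1).
n, N, theta = 30, 15, 0.5
print(binom.pmf(N, n, theta))  # equals C(n, N) * theta**N * (1 - theta)**(n - N)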
Equation 1 indicates that even if the proportion of giants (θ) in this sky area is constant, the probability of finding N giants among n stars in that sky area is not equal to 1; the observed fraction N/n only approaches θ as n grows, which is consistent with the Law of Large Numbers [2, 3, 4]. In Bayesian statistics, we have

(2)  p(\theta \mid N, n) = \frac{P(N \mid n, \theta)\, p(\theta)}{\int_0^1 P(N \mid n, \theta)\, p(\theta)\, d\theta}.
A prior distribution p(θ) is required, but we know nothing about it. For example, in this scenario we need to test the Galactic model using the proportion of giants, so we cannot correct the observational results with prior parameters obtained from that same model. Besides, during astronomical observations we are bound to see more giants because they are brighter than non-giants (turn-off stars) at the same distance. Therefore, we do not expect to obtain a truly complete sample; it is good enough to be complete within a certain brightness (magnitude) range. However, observers also sometimes tend to select stars that are either bluer or redder, and the color distribution of giants differs from that of non-giants, so color bias can affect the measured proportion of giants. Even if the observer's color bias is known, it is difficult to quantify its impact on the proportion of giants. In summary, the prior distribution cannot be estimated, so it is assumed to be a uniform distribution,

(3)  p(\theta) = 1, \qquad 0 \le \theta \le 1,
then we have
(4)  p(\theta \mid N, n) \propto \theta^{N} (1-\theta)^{n-N}.
To ensure that \int_0^1 p(\theta \mid N, n)\, d\theta = 1, it can be shown that

(5)  p(\theta \mid N, n) = \frac{\theta^{N} (1-\theta)^{n-N}}{B(N+1,\, n-N+1)},

where

(6)  B(a, b) = \frac{\Gamma(a)\, \Gamma(b)}{\Gamma(a+b)}

is the beta function, where a = N+1, b = n-N+1, and

(7)  \Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt.

That is, p(θ | N, n) is the beta distribution with parameters N+1 and n−N+1.
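A small sketch (with illustrative numbers only) verifying numerically that the grid-normalized posterior of Equation 4 matches the closed form of Equation 5, i.e. the beta density with parameters N+1 and n−N+1:

import numpy as np
from scipy.stats import beta

n, N = 30, 15                      # illustrative sample size and event count
h = 0.001
theta = np.arange(h, 1, h)         # grid on (0, 1), avoiding the endpoints

# Equation 4: unnormalized posterior under a uniform prior ...
unnorm = theta**N * (1 - theta)**(n - N)
post = unnorm / np.sum(unnorm * h)             # normalize on the grid

# ... agrees with the closed form of Equation 5, Beta(N+1, n-N+1).
closed = beta.pdf(theta, N + 1, n - N + 1)
print(np.max(np.abs(post - closed)))           # small discretization error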
If there were only one sampling result, the analysis would end here. However, in practical situations the conditions during sampling are always changing. Even if it is unclear whether these specific conditions actually lead to selection effects, the sampling result should be divided into multiple sampling results according to these conditions. In this scenario, there are always multiple observations of the same sky area, and the selection biases differ in each observation. For instance, one observation might be biased towards bluer stars, another towards redder stars, and another might even have been pre-filtered to exclude giants, albeit with a pre-filter accuracy below one hundred percent, leaving a small number of giants behind. As a result, even an observation with a vast sample size can be unreliable, whereas a result with a much smaller sample size might actually be closer to the truth. If sampling results with strong selection biases are not treated as outliers and removed, the overall probability density function will inevitably be biased. This motivates the following definition of outliers, which requires no thresholds.
Now assume this sky area has been observed k times (k > 3); then we have N_1, N_2, …, N_k and n_1, n_2, …, n_k. So,

(8)  p_i(\theta) \equiv p(\theta \mid N_i, n_i) = \frac{\theta^{N_i} (1-\theta)^{n_i - N_i}}{B(N_i + 1,\, n_i - N_i + 1)}, \qquad i = 1, 2, \ldots, k.
If, in a few sampling results, a non-uniform prior distribution is assumed, so that the corresponding p_i(θ) takes a different form from that in Equation 5, this is acceptable and does not affect the following definitions.
Now we have the set {p_1(θ), p_2(θ), …, p_k(θ)}. Make sure there are no repeated elements in this set. Then define the similarity S_{ij} between p_i and p_j,

(9)  S_{ij} = \int_0^1 \min\!\left[p_i(\theta),\, p_j(\theta)\right] d\theta.
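A minimal numerical sketch of Equation 9, computing the overlap of two posterior densities on a grid (it anticipates the function S in the full listing at the end of the paper; the sample counts used here are illustrative):

import numpy as np
from scipy.stats import beta

def similarity(Ni, ni, Nj, nj, h=0.001):
    # S_ij = integral over [0, 1] of min(p_i, p_j), where p_i is the
    # Beta(N_i+1, n_i-N_i+1) posterior of Equation 8.
    theta = np.arange(0, 1, h)
    pi = beta.pdf(theta, Ni + 1, ni - Ni + 1)
    pj = beta.pdf(theta, Nj + 1, nj - Nj + 1)
    return np.sum(np.minimum(pi, pj) * h)

# Two illustrative sampling results: 15/30 vs. 7/15 detected events.
print(similarity(15, 30, 7, 15))   # values close to 1 mean very similar posteriors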
Then collect all pairwise similarities into the set

(10)  \mathcal{S} = \{\, S_{ij} \mid 1 \le i < j \le k \,\},

which contains k(k−1)/2 elements. Sort \mathcal{S} in ascending order and let \mathcal{S}_{k-1} \subset \mathcal{S} denote the subset formed by the k−1 smallest similarities. For each observation i, count how many of these smallest-similarity pairs it belongs to,

(11)  c(i) = \#\{\, j \mid S_{ij} \in \mathcal{S}_{k-1} \ \text{or}\ S_{ji} \in \mathcal{S}_{k-1} \,\}.

Since observation i enters exactly k−1 pairs in total, c(i) ≤ k−1. Observation i is defined to be an outlier among the k observations if and only if

(12)  c(i) = k-1,

that is, if p_i is less similar to every other distribution in the set than any two of the remaining distributions are to each other. If no index satisfies Equation 12, the set contains no outlier. No threshold on the similarity values themselves is needed.

When an outlier is found, it is removed from the set, k is decreased by one, and the criterion is applied again to the remaining distributions; the procedure stops when no further outlier is found. If the iteration only stops because a single distribution is left, then every element of the original set has been flagged as an outlier. Such a set is “fragmented”; the sampling quality is extremely unstable, and no reliable results should be given by a “fragmented” set.
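To make the selection rule concrete, the following sketch applies the criterion of Equations 10–12 to a hypothetical, hand-written similarity table for k = 4 (the numbers are invented for illustration): collect the k−1 smallest pairwise similarities and report an outlier only if a single index appears in all of them.

from collections import Counter

# Hypothetical pairwise similarities S_ij for k = 4 observations;
# observation 3 overlaps poorly with every other one.
S = {(0, 1): 0.91, (0, 2): 0.87, (0, 3): 0.12,
     (1, 2): 0.85, (1, 3): 0.15, (2, 3): 0.09}
k = 4

# Take the k-1 pairs with the smallest similarity ...
smallest = sorted(S, key=S.get)[:k - 1]
# ... and count how often each index occurs in them.
counts = Counter(i for pair in smallest for i in pair)
index, hits = counts.most_common(1)[0]

# Outlier only if one index is involved in all k-1 of these pairs.
print(index if hits == k - 1 else "no outlier")   # -> 3 for this table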
Figure 1 shows some examples.
Python code: the code for drawing the sketch map is also included.
import numpy as np
from scipy.stats import beta
from collections import Counter

def S(Ni, ni, Nj, nj):
    # Similarity of Equation 9: overlap of two posterior beta densities.
    h = 0.001  # integration step size
    theta = np.arange(0, 1, h)
    pi = beta.pdf(theta, Ni + 1, ni - Ni + 1)
    pj = beta.pdf(theta, Nj + 1, nj - Nj + 1)
    minij = np.minimum(pi, pj)
    return np.sum(minij * h)

def out(N, n):
    # Return the index of the outlier among the k sampling results,
    # or -1 if there is none.
    k = len(N)
    Slist, ilist, jlist = [], [], []
    for i in range(k):
        for j in range(k):
            if i >= j:
                continue
            Slist.append(S(N[i], n[i], N[j], n[j]))
            ilist.append(i)
            jlist.append(j)
    ilist = np.array(ilist)
    jlist = np.array(jlist)
    Slist = np.array(Slist)
    # keep the indices belonging to the k-1 pairs with the smallest similarity
    order = np.argsort(Slist)
    ilist = ilist[order][:k - 1].tolist()
    jlist = jlist[order][:k - 1].tolist()
    counter = Counter(ilist + jlist)
    mce, mcc = counter.most_common(1)[0]
    if mcc < k - 1:
        return -1  # no index appears in all k-1 pairs: no outlier
    return mce

def main(N, n):
    # Iteratively remove outliers; return the remaining results and the
    # outliers as (N, n, outN, outn).
    if len(N) != len(n):
        print('len(N)!=len(n)')
        return -1
    d = np.array(n) - np.array(N)
    if len(d[d < 0]) > 0:
        print("n < N")
        return -1
    outN = []
    outn = []
    output = len(N) + 1
    while output >= 0:
        output = out(N, n)
        if output >= 0:
            outN.append(N[output])
            outn.append(n[output])
            N = N[:output] + N[output + 1:]
            n = n[:output] + n[output + 1:]
            if len(N) == 1:
                # every original element has been flagged: "fragmented" set
                print('Fragmented!')
                outN.append(N[0])
                outn.append(n[0])
                return N, n, outN, outn
        else:
            return N, n, outN, outn

# example:
import matplotlib.pyplot as plt

N = [15, 11, 7, 29, 100]
n = [30, 20, 15, 60, 200]
newN, newn, outN, outn = main(N, n)
print(outN)
print(outn)

theta = np.arange(0, 1, 0.001)
for i in range(len(newN)):
    plt.plot(theta,
             beta.pdf(theta, newN[i] + 1, newn[i] - newN[i] + 1),
             color='black')
for i in range(len(outN)):
    plt.plot(theta,
             beta.pdf(theta, outN[i] + 1, outn[i] - outN[i] + 1),
             color='red')
plt.xlabel('$\\theta$', fontsize=16)
plt.ylabel('Probability density', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.tight_layout()
plt.show()
References
- [1] Annis, D. H. (2008). Bayes Linear Statistics: Theory and Methods. Journal of the American Statistical Association 103, 1319.
- [2] Bernoulli, J. (1713). Ars conjectandi, opus posthumum. Accedit Tractatus de seriebus infinitis, et epistola Gallice scripta De ludo pilae reticularis. Impensis Thurnisiorum, fratrum.
- [3] Khintchine, A. Ya. (1936). Su una legge dei grandi numeri generalizzata. Giorn. Ist. Ital. Attuari 7, 365–377.
- [4] Loève, M. (1977). Elementary Probability Theory. Springer.
- [5] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal 27, 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x