
A discussion on two celebrated examples of application of linear regression

Evangelos Matsinos
Abstract

This work aims at providing some (further) perspective into the analysis of two well-known datasets in terms of the methods of linear regression. The first set has been taken from Hubble’s 1929 investigation of the relation between the distances of galaxies and their recessional velocities. The second set relates to Galton’s family data on human stature, collected for the purposes of his 1889 book on natural inheritance.
PACS 2010: 07.05.Kf, 02.60.Pn

keywords:
Linear least-squares, linear regression, multiple linear regression, numerical optimisation, data analysis: algorithms and implementation

1 Introduction

Regression methods have been indispensable tools in the analysis of observations for over two centuries. The first description of linear regression may be traced to one of Legendre’s books [1], published in 1805. It seems that Gauss had already developed and employed the technique before Legendre’s publication, but had considered it “trivial” and unworthy of any special attention, e.g., see Ref. [2].

The purpose of regression analysis is to provide optimal solutions (and predictions obtained thereof) to overdetermined systems, i.e., to situations where the available amount of information (i.e., the number of input datapoints) exceeds the number of parameters involved in the modelling of the processes yielding the observations. The simplest of the regression models involves the replacement of the entirety of the available information (e.g., of all lifespans of incandescent light bulbs produced by a specific manufacturer) by one ‘constant’ (i.e., by the mean product lifetime) and by one uncertainty (e.g., by the ‘root mean square’ of the distribution of the lifetimes of the tested products): information is thus imparted to the potential customer that a purchased bulb of that specific brand and type is expected to withstand, say, 8 000 h of operation with an estimated uncertainty of, say, 500 h. Consequently, it will be very surprising if such a bulb eventually exceeds 12 000 h of operation, and (in case the underlying distribution of the product lifetime is symmetrical about the mean value) equally surprising if it becomes dysfunctional before it reaches 4 000 h of operation (in practice, however, the latter tends to be more probable than the former, though Statistics is hardly to blame for that mishap). The next-to-simplest modelling involves a straight line and two adjustable parameters, intercept and slope (although a constant may be thought of as a straight line without slope, such a ‘semantic narrowing’ in the definition of the straight line will be avoided in this work).

Numerous authors have occupied themselves with the details of the methods of linear regression (ordinary, weighted, multiple, etc.). The task of listing even the most influential of these contributions would inevitably result in a very long reference list, and, in all probability, even such a list would be incomplete. Therefore, I have decided to simply quote a few relevant books from my own collection [3, 4, 5, 6, 7] and rather refer the interested reader to the works cited therein, as well as to two articles which discuss the results of the application of some standard methods of linear regression to simulated and experimentally acquired datasets, see Refs. [8, 9].

This work discusses details of the application of linear regression to two celebrated datasets, relating to different branches of Science and obtained in the distant past: the first dataset (Section 2.1) is taken from Astronomy, from the 1929 paper by Edwin Powell Hubble (1889-1953) on the relation between the distance r and the radial velocity v of galaxies [10], which became known as ‘Hubble’s law’ later on; the second dataset (Section 2.2) is taken from Anthropometry, from the 1889 (the year Hubble was born) book by Francis Galton (1822-1911) on natural inheritance [11]. As the present paper grew longer than I initially thought, I decided to abandon my plan to include herein a discussion on two additional datasets.

All important details of the methods of linear regression which are applied to the two aforementioned datasets may be found in Appendix A. That appendix serves as a short guide to obtaining the formalism which is relevant to the various methods of linear regression, including the determination of the optimal values of the parameters and of their uncertainties, as well as of any predictions based on the results of the fits (including all correlations among the fit parameters). The reader is advised to start with the material in the appendices, in order to become familiar, if not with the ‘theoretical’ details about the various analysis options, at least with the definitions.

For the sake of brevity, a number of acronyms will be used in this paper:

  • PDF will stand for ‘Probability Density Function’;

  • CDF will stand for ‘Cumulative Distribution Function’;

  • DoF will stand for ‘degree of freedom’;

  • NDF will stand for ‘number of DoFs’;

  • WLR will stand for ‘weighted linear regression’ (‘weighted least-squares optimisation’); and

  • rms will stand for ‘root mean square’ (square root of the unbiased variance of a dataset).

Finally, pc will stand for parsec, defined as the distance at which a length of one astronomical unit (1 AU ≔ 149 597 870.7 km), representative of the Sun-Earth distance, subtends an angle of one arcsecond: 1 pc ≔ 1 AU (tan θ)⁻¹, where θ ≔ π/648 000 rad; 1 pc ≈ 3.085678·10¹³ km or (after using the former unit of length for cosmic distances, namely the light year) 1 pc ≈ 3.262 ly.
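
As a quick numerical check, the conversions above can be reproduced in a few lines (a minimal sketch; the AU value is the one quoted in the text, and the light-year length in km is the standard IAU value):

```python
from math import pi, tan

AU_KM = 149_597_870.7          # one astronomical unit, in km (as quoted above)
ARCSEC = pi / 648_000          # one arcsecond, in radians

pc_km = AU_KM / tan(ARCSEC)    # 1 pc in km; approximately 3.085678e13
pc_ly = pc_km / 9.460730472580800e12   # one light year is about 9.4607e12 km
```

The last line recovers 1 pc ≈ 3.262 ly.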

2 The datasets

2.1 Hubble’s dataset

2.1.1 The beginnings

The story behind Hubble’s determination of the famous constant, which has been named after him, started over 100 years before Hubble was born: Clotho’s spindle started spinning when Edward Pigott (1753-1825), an English astronomer, developed a fondness for stars of variable brightness. Pigott’s family moved to York in 1781, where Pigott met John Goodricke (1764-1786), a deaf-mute with unparalleled visual acuity and sensitivity. Working in a private observatory built by Pigott’s father, Pigott and Goodricke embarked on a project aiming at exploring the properties of the most promising variable stars, taken from Pigott’s precompiled list [12]. It did not take long before Goodricke’s skills and attention to detail paid off: one year into the project, he announced to the Royal Society the discovery of the periodical variation of the brightness of the star Algol (β Persei) [13]. Apart from proposing a mechanism to account for the effect (and attributing it to the transits of Algol’s dimmer binary companion), Goodricke also examined the variations of brightness of several other stars, including η Aquilae (A) and δ Cephei (A) [14], which became textbook examples of the Cepheid family of variable stars (Cepheid variable stars pulsate radially, varying in both diameter and temperature [15]), featuring very regular ‘sharply-up-gradually-down’ light curves, dissimilar to the one obtained from Algol; evidently, the variation of brightness of these stars could not be attributed to stellar eclipses. Goodricke’s contributions to Astronomy did not go unnoticed: at the age of 21, he was elected a Fellow of the Royal Society, a few days before his sad passing.

2.1.2 The Harvard College Observatory

Over the course of the next century, many more Cepheid variable stars were detected. With the invention of photography, the acquisition of accurate brightness data was boosted. The Harvard College Observatory became the hub of the developments in Observational Astronomy, and its new director in 1877, Edward Charles Pickering (1846-1919), launched an ambitious programme aiming at photographing and cataloguing all observable stars. Whether due to budget issues (e.g., see Ref. [16], p. 205) or to the poor quality of the work of Pickering’s male assistants [17], over 80 College-educated women were recruited to work at the Observatory as human computers, i.e., as skilled employees, comparing photographic plates, processing astronomical data, and contributing to Astronomy for the first time after Hypatia of Alexandria and Aglaonice of Thessaly.

The computers excelled in their assignments. For instance, Annie Jump Cannon (1863-1941), who was deaf after she turned twenty, conducted the main work in the development of the Harvard Classification Scheme of stars, the first attempt towards a categorisation of the stars in terms of their surface temperature and spectral type. In an article entitled ‘Woman making index of 100 000 stars for a catalogue’, the Danville Morning News reported on 10 February 1913 [18]: “When the work was new she could analyze at the rate of 1 000 stars in three years. Now she analyzes 5 000 stars in one month, 200 stars an hour [evidently an erratum: the author of the article intended to write ‘day’, not ‘hour’]. On Jan. 1 she had examined about 65 000, which means that about two-fifths of the work is completed.”

Without doubt, Pickering’s research programme reached its climax with another computer, Henrietta Swan Leavitt (1868-1921). Leavitt’s methodical work was destined to make an indelible impact on Astronomy (and Cosmology). Having been given the task of studying variable stars in the Small and Large Magellanic Clouds (SMC and LMC, respectively), Leavitt catalogued 1 777 such objects between 1903 and 1908 [19]. The decisive moment came when she focussed her attention on a small subset of these objects, i.e., on the Cepheid variable stars in the SMC (see Table VI of Ref. [19]); on p. 107 of her report, she inconspicuously commented: “It is worthy of notice that in Table VI the brighter variables have the longer period.”

Following her intuition in subsequent years, Leavitt examined the relation between the period of the variation of the apparent magnitude (which is associated with the apparent brightness) of the Cepheid variable stars in the SMC and the extrema of the apparent magnitude for each such object. Her decision to restrict her analysis to the SMC turned out to be of pivotal importance, because the selected 25 celestial bodies (i.e., the 16 from Table VI of Ref. [19] and the 9 which were added to her database between 1908 and 1913) were, in practical terms, equidistant from the Earth. (It is currently known that the SMC is a dwarf galaxy, with a typical diameter of about 5.78 kpc, gravitationally attracted to the Milky Way and separated from the Earth by 60.5(7.0) kpc [20].)

After plotting the period-brightness datapoints of the 25 Cepheid variable stars on a linear-log plot (logarithmic scale on the x axis, linear scale on the y axis), Leavitt obtained the “remarkable” result shown in Fig. 2 of Ref. [21]: two nearly parallel straight lines, modelling the apparent-magnitude maxima and minima of these stars, separated by about 1.2 mag (difference between the corresponding intercepts); I shall use ‘mag’ to denote the unit of apparent magnitude (which, of course, is not a physical unit). I use the short term ‘period-brightness’ when referring to the datapoints corresponding to the period and to the apparent magnitude of variable stars; other authors prefer the term ‘period-luminosity’. The fact that Pickering, rather than Leavitt, signed the 1912 report is baffling: the report starts with the sentence, “The following statement regarding the periods of 25 variable stars in the Small Magellanic Cloud has been prepared by Miss Leavitt.” Although Pickering’s motive for signing Leavitt’s report is a mystery to me, I decided to include him in the author list of that report, yet as second author. Two remarks from that report serve as a prelude to the significance of Leavitt’s research:

  • “…there is a simple relation between the brightness of the variables and their periods” and

  • “Since the variables are probably at nearly the same distance from the Earth, their periods are apparently associated with their actual emission of light, as determined by their mass, density, and surface brightness.”

The consequences of Leavitt’s findings were beyond reckoning. The measurement of the period of a Cepheid variable star, whichever neighbourhood of the Universe that specific celestial body occupies, could lead to an estimate for the apparent brightness b_SMC which that body would have if it had been confined within the SMC. By comparing the body’s measured apparent brightness b to b_SMC, one could obtain an estimate for the distance between the specific Cepheid variable star and the Earth, expressed (of course) in terms of the SMC-Earth distance. As the distances of all Cepheid variable stars would be expressed as multiples of the SMC-Earth distance, one would need to determine only one such distance (e.g., using the stellar-parallax method) in order that all distances become known. Within one year of Leavitt’s breakthrough, Ejnar Hertzsprung (1873-1967) provided a solution, however flawed, to this problem [22]. On p. 204 of Hertzsprung’s article [22], one reads: “…so wird die entsprechende visuelle Sterngröße gleich 13ᵐ.0. Diese Überlegung führt also zu einer Parallaxe p der kleinen Magellanschen Wolke, welche durch 5 log p = -7.3 - 13.0 = -20.3 gegeben ist. Man erhält p = 0″.0001, einem Abstand von etwa 3 000 Lichtjahren entsprechend.” This part is translated as follows: “…in which case the corresponding apparent magnitude is equal to 13ᵐ.0. This consideration leads to a parallax p of the Small Magellanic Cloud, which is given by the relation 5 log₁₀ p = -7.3 - 13.0 = -20.3. One obtains p = 0″.0001, which corresponds to a distance of about 3 000 light years.” It is a puzzle to me how Hertzsprung obtained this result: expressed in arcseconds, the parallax would be equal to about 8.710·10⁻⁵, yielding a distance of about 3.543·10²⁰ m or about 37 448 ly. However that may be, it is known today that the SMC-Earth distance is significantly larger than both aforementioned values. In spite of the serious underestimation of the cosmic distances when using Hertzsprung’s results, Harlow Shapley (1885-1972), then at the Mount Wilson Observatory, elaborated on Hertzsprung’s method and obtained a better calibration of the cosmic distances on the basis of the apparent brightness of the Cepheid variable stars [23]. The yardstick to measure cosmic distances had been invented. Or had it?
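
Hertzsprung’s arithmetic is easy to re-run with present-day tools; the sketch below solves 5 log₁₀ p = -20.3 for the parallax and converts it to a distance via d(pc) = 1/p(arcsec), using the conversion factors of Section 1:

```python
p_arcsec = 10 ** (-20.3 / 5)   # parallax: about 8.71e-5 arcsec, not 0".0001
d_pc = 1.0 / p_arcsec          # distance in pc (d = 1/p, for p in arcsec)
d_ly = d_pc * 3.2616           # about 3.7e4 light years, as argued in the text
```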

Looking at Leavitt’s period-brightness data from the comfort of my armchair (next to a fast desktop, running present-day software) over one century after she (arduously) collected the data of the 25 Cepheid variable stars, I notice that the values of Pearson’s correlation coefficient between the logarithm of the period and the two apparent-magnitude extrema are very close to -1: -0.966 in case of the apparent-magnitude maxima and -0.962 in case of the apparent-magnitude minima, both calling for linear regression. To advance to the modelling of these data, one first needs to establish a method to assess the uncertainties relevant to the apparent magnitude (there is no mention of uncertainties in Leavitt’s reports), and, to be able to select a reasonable model, one first needs to revisit the procedure by which the apparent-magnitude data are obtained.

The apparent brightness b of a star is linked to the star’s luminosity L, which is the absolute measure of the power emitted by the star in the form of electromagnetic radiation of all frequencies, via the formula

b=\frac{L}{4\pi r^{2}}\,, \qquad (1)

where r stands for the distance between the points of emission and absorption, i.e., for the star-Earth distance. From the measurement of the apparent brightness, one obtains the apparent magnitude of the star by using the relation

m=-\frac{5}{2}\log_{10}\frac{b}{b_{f}}\,, \qquad (2)

where the quantity b_f is the reference apparent brightness (i.e., the apparent brightness which is to be associated with the apparent magnitude of 0). It must be borne in mind that smaller values of the apparent magnitude are indicative of higher apparent brightness: the apparent magnitude of Venus lies between -4.92 and -2.98, of Sirius (α Canis Majoris) about -1.46, and of Polaris (α Ursae Minoris) about 1.98; these three celestial bodies have been listed in order of decreasing apparent brightness.

Using Eq. (2), one obtains the uncertainty δm from the uncertainty of the apparent brightness δb, as follows:

\delta m=\frac{5}{2\,\ln 10}\,\frac{\delta b}{b}\,. \qquad (3)

Although m depends on the choice of the reference apparent brightness b_f, δm is independent of that choice. It is reasonable to assume that the uncertainties of the apparent brightness are proportional to the apparent brightness itself (i.e., that δb/b is constant); under this assumption, the uncertainties δm are constant. Without doubt, the matter is more complex than it has been presented, yet the discussion above suffices for ‘the sake of argument’ purposes.
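
Equations (2) and (3) translate directly into code; the sketch below uses a hypothetical 10 % relative uncertainty on the apparent brightness to illustrate that δm depends neither on b nor on b_f:

```python
from math import log10, log

def apparent_magnitude(b, b_f):
    """Apparent magnitude from the apparent brightness b and the
    reference apparent brightness b_f, Eq. (2)."""
    return -2.5 * log10(b / b_f)

def magnitude_uncertainty(db_over_b):
    """Uncertainty of the apparent magnitude obtained from the relative
    uncertainty of the apparent brightness, Eq. (3)."""
    return 2.5 / log(10) * db_over_b

dm = magnitude_uncertainty(0.10)   # about 0.109 mag, for a 10 % relative δb
```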

Following the aforementioned train of thought, constant working uncertainties (δm = 0.2 mag) were assigned to all apparent-magnitude data of Ref. [21]. It must be emphasised that, provided that all assigned uncertainties are equal (and non-zero), the exact value of δm has no bearing whatsoever on the results.

The separate linear fits to the data (compare the second and third columns of Table 1) suggest that there is nothing against the hypothesis that the two datasets be accounted for on the basis of a common slope parameter. The last column of Table 1 contains the results of the joint fit to the data, using as parameters: the intercept p_0 of the straight line which models the apparent-magnitude maxima, the common slope parameter p_1, and the (positive) difference Δp_0 between the intercepts of the two straight lines which model the apparent-magnitude extrema. The results of the joint fit to Leavitt’s period-brightness data are shown in Fig. 1.

Table 1:

The results, contained in the second and third columns of this table, correspond to the separate fits to the maxima and the minima of the apparent magnitude, see Table VI of Ref. [21]; the results of the joint (three-parameter) fit (featuring a common slope parameter) are shown in the last column. The value of the slope appears positive in Fig. 2 of Ref. [21], because the vertical axis is shown inverted in that figure, evidently to conform with the common-sense expectation that brightness maxima appear ‘above’ brightness minima.

                          Fit to maxima   Fit to minima   Joint fit
N                         25              25              50
NDF                       23              23              47
χ²_min                    40.89           47.88           88.84
p̂_0 (mag)                 15.56           16.77           15.573
p̂_1                       -2.02           -2.05           -2.033
Δp̂_0 (mag)                -               -               1.180
δp̂_0 (mag)                0.11            0.12            0.088
δp̂_1                      0.11            0.12            0.082
δ(Δp̂_0) (mag)             -               -               0.078
t-multiplier              1.022217        1.022217        1.010752
Corrected δp̂_0 (mag)      0.11            0.12            0.089
Corrected δp̂_1            0.11            0.12            0.083
Corrected δ(Δp̂_0) (mag)   -               -               0.079
Figure 1: The results of the joint fit to the period-brightness data of the 25 Cepheid variable stars of Ref. [21] (Table VI therein). Filled circles: apparent-magnitude minima; unfilled circles: corresponding apparent-magnitude maxima. The value of the slope appears positive in Fig. 2 of Ref. [21], because the vertical axis is shown inverted in that figure, see also caption of Table 1.
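
The joint (common-slope) fit summarised above can be sketched as an ordinary least-squares problem whose design matrix carries an indicator column for the intercept offset. The pure-Python implementation below is my reconstruction under equal weights; function and variable names are illustrative, not taken from the original analysis:

```python
def joint_fit(x_max, y_max, x_min, y_min):
    """Joint least-squares fit of two straight lines sharing a common slope:
    maxima y = p0 + p1*x, minima y = (p0 + dp0) + p1*x. Equal weights are
    assumed. Returns the triplet (p0, p1, dp0)."""
    rows, ys = [], []
    for x, y in zip(x_max, y_max):
        rows.append([1.0, x, 0.0]); ys.append(y)   # maxima: offset column = 0
    for x, y in zip(x_min, y_min):
        rows.append([1.0, x, 1.0]); ys.append(y)   # minima: offset column = 1
    n = 3
    # normal equations (A^T A) p = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(ata[k][c]))
        ata[c], ata[piv] = ata[piv], ata[c]
        aty[c], aty[piv] = aty[piv], aty[c]
        for k in range(c + 1, n):
            f = ata[k][c] / ata[c][c]
            for j in range(c, n):
                ata[k][j] -= f * ata[c][j]
            aty[k] -= f * aty[c]
    p = [0.0] * n
    for k in range(n - 1, -1, -1):
        p[k] = (aty[k] - sum(ata[k][j] * p[j] for j in range(k + 1, n))) / ata[k][k]
    return tuple(p)
```

Feeding it the log-periods and the apparent-magnitude extrema of the 25 stars would yield estimates of the type listed in the last column of Table 1.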

I left two comments for the very end of the story regarding Leavitt’s invaluable contributions to Science. First, Leavitt too was suffering from severe hearing problems by the time she joined Pickering’s team in 1903. Second, like Goodricke 135 years earlier, she did not live long enough to enjoy the fruits of her labour.

2.1.3 Hubble and Company

It was about time Hubble, a U.S. American from the state of Missouri, came on stage. Hubble had been offered a job at the Mount Wilson Observatory already in 1916, but arrived with a delay of about three years after he answered the call of duty and embarked on the task of defending the United Kingdom against the Quadruple Alliance. When Hubble finally arrived at the Observatory, the 100-inch Hooker Telescope had been completed and was about to yield unprecedented observations of the Cosmos for the next few decades. Shapley was still at the Observatory, at the peak of his scientific productivity [23]. That was a time when a paradigm shift was under way in the human understanding of the Cosmos, with the introduction of General Relativity a few years earlier and the gradual emergence of Quantum Mechanics. Astronomers were striving to provide answers to questions about the nature of the ‘nebulae’, which were detected at an increasing rate with the help of the modern telescopes, and the size of the Universe; a relevant question was whether or not the nebulae were part of the Milky Way. There was no shortage of arguments for a few years, as the case frequently is when reliable evidence is scarce. Intending to resolve the matter, the National Academy of Sciences proposed a public debate between representatives of the two sides in the first half of 1920 [24]; this event became known as the ‘Great Debate’, though - for the obvious reasons - I would rather replace ‘Debate’ with ‘Divide’.
The two representatives, Shapley (for those who propounded that the nebulae were within the Milky Way) and Heber Doust Curtis (1872-1942) (for those who surmised that the nebulae were independent galaxies), were different in nearly all ways possible [16], not only regarding their views on the matter at hand: they had different personalities, different ways of expressing themselves when providing explanations to others, different skills when addressing the general public, and - the icing on the cake - different social background. On 27 April 1920 (the day after the Great Debate), there was no clear winner: the unquestionable winner was announced four years later, and his name was Hubble.

Hubble’s resolution of the question of the Great Debate was based on a somewhat serendipitous discovery: on 4 October 1923, he took a long exposure of the Andromeda Nebula (M31) [16] and noticed peculiar ‘spots’ on the photographic plate. Following up on the issue the next night, he found out that two of the spots were novae, whereas the third one turned out to be a Cepheid variable star, the first one to be detected beyond the two Magellanic Clouds. He evaluated the object’s distance and found it several times larger than the typical dimensions of the Milky Way, suggesting that the Andromeda Nebula was an independent galaxy, well beyond our own. Several other Cepheid variable stars were detected in M31 shortly afterwards (also by Hubble), confirming the earlier result [25]. Following the scientific code of conduct, Hubble broke the news to Shapley (who had moved to Cambridge, Massachusetts, to assume the position of the Director of the Harvard College Observatory) in early 1924. After reading Hubble’s letter, Shapley is said to have commented to one of his doctoral students: “Here is the letter that destroyed my universe,” see Ref. [24], p. 1142. That was Hubble’s first significant contribution to Observational Astronomy. He then turned his attention to a related problem.

In the early 1910s, Vesto Melvin Slipher (1875-1969) became the first astronomer to measure the radial velocity of a nebula. Utilising the Doppler shift, he determined the radial velocity of the Andromeda Galaxy (which, as aforementioned, still had the status of a nebula at those times) relative to the Solar System, and found out that it approaches the Milky Way at a speed of 300 km/s, see first paper of Refs. [26]. Within a few years, Slipher had enlarged his database to a total of 25 nebulae (see Table 1 of the last paper of Refs. [26]), 21 of which were undoubtedly receding; in addition, four of the receding nebulae had radial velocities in excess of 1 000 km/s (which was enormous in comparison with the typical relative velocities of stars - about 30 km/s - in the Milky Way). Provided that there is no preferred direction in the motion of these objects, the probability that at most 4 nebulae (in a set of 25) either recede from or approach the Milky Way is equal to about 9.11·10⁻⁴. This probability suggests a statistically significant departure from the null hypothesis that the distribution of the radial velocities of the nebulae is symmetrical about 0. (In this study, the threshold of statistical significance, which is frequently denoted in the literature as α, is assumed to be equal to 1.00·10⁻²; this threshold is regarded by most statisticians as signifying the outset of statistical significance, whereas a usual choice in several branches of Science is α = 5.00·10⁻², which most statisticians associate with ‘probable statistical significance’.) Slipher’s results posed the inevitable question: why do so many more nebulae retreat from the Milky Way?
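
The quoted probability is a two-sided binomial (sign-test) tail: under the null hypothesis of no preferred direction, each nebula independently recedes or approaches with probability 1/2, and one asks how likely it is that the minority direction contains at most four of the twenty-five objects. A minimal sketch:

```python
from math import comb

def two_sided_sign_prob(n, k):
    """Probability that, under a 50/50 null hypothesis (no preferred
    direction), at most k of the n objects move one way, counting both
    tails (i.e., either direction may be the minority one)."""
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return 2 * one_tail

p = two_sided_sign_prob(25, 4)   # about 9.11e-4, as quoted in the text
```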

In the early 1920s, Alexander Alexandrovich Friedmann (1888-1925) derived (from General Relativity) the equations which became known in Cosmology as ‘Friedmann’s equations’ [27, 28]: one of the solutions of these equations predicted an expanding Universe (see also Ref. [29], Chapters 10.3.4 and 10.8.3). A few years later (and working independently), Georges Henri Joseph Édouard Lemaître (1894-1966), the father of the Big-Bang Theory with his seminal 1931 paper [30], also came up with the same equations and attributed the recession of the nebulae to the expansion of space [31]. It is far less known (or, perhaps, intentionally overlooked) that what many call ‘Hubble’s law’ can be found in Lemaître’s 1927 paper [31]: the ratio v/r (where, as mentioned in Section 1, r represents the distance of a nebula from the Earth and v its radial velocity), which in subsequent years became known as Hubble’s constant H_0, was actually evaluated by Lemaître to 625 km/s Mpc⁻¹ (see fifth line from the top of p. 56 of Lemaître’s paper [31]), not far from Hubble’s two estimates a few years later. (Regarding Hubble’s constant, a caustic comment I usually make is that it is neither Hubble’s nor a constant, given its dependence on the cosmological time; H_0 is the value of the Hubble parameter at the current cosmological epoch.) Unfortunately for Lemaître, his paper had been published in a journal of very limited dissemination. In addition, Lemaître’s efforts in 1927, to draw attention to his work, were not successful [32]. The consequence was that the famous relation between the distance and the radial velocity at which the galaxies recede from the Milky Way became known as ‘Hubble’s law’, instead of the fairer ‘Lemaître-Hubble law’ (see also Ref. [29], Chapter 10.3.1).

Concerning the data on which Hubble based his H_0 determination, I have heard two Cosmologists remark in their seminars: “Hubble’s data seem to be all over the place” and “I do not know what Hubble did in his paper.” I find both comments lamentable (though, regrettably, I did not raise an objection on those two occasions) and shall next address both issues. Available to Hubble in his 1929 paper [10] were pairs of measurements of the distance r and of the radial velocity v of nebulae. To account for the observations, Hubble employed the linear model

v=Kr+\Delta v\,, \qquad (4)

where K was assumed to be a constant (to be identified with H_0) and the correction

\Delta v=X\,\cos\delta\,\cos\alpha+Y\,\cos\delta\,\sin\alpha+Z\,\sin\delta \qquad (5)

compensated for the motion of the Solar System (i.e., for the motion of the observer); the velocities X, Y, and Z were parameters, to be obtained from the optimisation of the description of the observations. The angles α and δ in Eq. (5) denote the right ascension and the declination of each observed object, respectively. Hubble’s (r, v) datapoints, originating from 24 “nebulae whose distances have been estimated from stars involved or from mean luminosities in a cluster,” see Table 1 of Ref. [10], are shown in Fig. 2.

Figure 2: The pairs of measurements of the distance and of the radial velocity of the nebulae on which Hubble based his main analysis in Ref. [10].
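
Under the model of Eqs. (4) and (5), the four parameters follow from ordinary least squares with design-matrix rows (r, cos δ cos α, cos δ sin α, sin δ). The pure-Python sketch below is a reconstruction of that optimisation, not Hubble’s original (possibly constrained) procedure; function and variable names are illustrative:

```python
from math import cos, sin

def fit_hubble_model(r, alpha, delta, v):
    """Ordinary least squares for v = K r + X cosδ cosα + Y cosδ sinα + Z sinδ.
    r: distances; alpha, delta: right ascension and declination in radians;
    v: radial velocities. Returns the tuple (K, X, Y, Z)."""
    rows = [[ri, cos(d) * cos(a), cos(d) * sin(a), sin(d)]
            for ri, a, d in zip(r, alpha, delta)]
    n = 4
    # normal equations (A^T A) p = A^T v
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(n)] for i in range(n)]
    atv = [sum(row[i] * vi for row, vi in zip(rows, v)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for c in range(n):
        piv = max(range(c, n), key=lambda k: abs(ata[k][c]))
        ata[c], ata[piv] = ata[piv], ata[c]
        atv[c], atv[piv] = atv[piv], atv[c]
        for k in range(c + 1, n):
            f = ata[k][c] / ata[c][c]
            for j in range(c, n):
                ata[k][j] -= f * ata[c][j]
            atv[k] -= f * atv[c]
    p = [0.0] * n
    for k in range(n - 1, -1, -1):
        p[k] = (atv[k] - sum(ata[k][j] * p[j] for j in range(k + 1, n))) / ata[k][k]
    return tuple(p)
```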

Was Hubble wrong when suggesting that the distances of the nebulae and their radial velocities are linearly related? Do the data spread “all over the place?” The answer to both questions is negative. Pearson’s correlation coefficient for the data shown in Fig. 2 is equal to ρ ≈ 0.790. Given that the resulting value of the t-statistic is about 6.036, one obtains a p-value of about 4.48·10⁻⁶ (two-tailed). Therefore, there is ample evidence to reject the null hypothesis that the two quantities, r and v, are not correlated: equivalently, Hubble’s data do not spread all over the place.
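
The t-statistic quoted above follows from the standard formula t = ρ√(n-2)/√(1-ρ²), with n-2 = 22 DoFs; a minimal sketch (the small difference from 6.036 comes from rounding ρ to three decimals):

```python
from math import sqrt

def pearson_t(rho, n):
    """t-statistic for testing the null hypothesis of zero correlation;
    under the null it follows Student's t distribution with n-2 DoFs."""
    return rho * sqrt(n - 2) / sqrt(1.0 - rho * rho)

t = pearson_t(0.790, 24)   # about 6.04 for Hubble's 24 nebulae
```

The two-tailed p-value of about 4.48·10⁻⁶ then follows from the CDF of Student’s t distribution with 22 DoFs.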

Using Hubble’s datapoints, I obtained the following results for K (i.e., H_0), X, Y, and Z: 465(56) km/s Mpc⁻¹, -69(79) km/s, 235(152) km/s, and -200(82) km/s, respectively. Hubble quotes slightly different values and sizeably smaller uncertainties in Ref. [10] (in all probability, he employed constraints in his fit). His values read as follows: 465(50) km/s Mpc⁻¹, -65(50) km/s, 226(95) km/s, and -195(40) km/s, respectively. After enhancing his database, Hubble updated his results two years later [33]; nevertheless, his H_0 value remained enormous (in fact, the H_0 estimate of 1931 is even larger than the 1929 result!), suggesting that the typical age of the Universe (obtained from H_0⁻¹) was smaller than the age of the oldest rocks found on the Earth (established via radiometric dating), which in turn provides an estimate for the temporal interval which elapsed since the Earth’s crust was formed. Hubble’s H_0 result stood for over two decades, and so did the age-of-the-Universe paradox (also known as “timescale difficulty”).

It was Wilhelm Heinrich Walter Baade (1893-1960), another ‘user’ of the 100-inch Hooker Telescope at the Mount Wilson Observatory (as well as of the 200-inch Hale Telescope at the Palomar Observatory in the late 1940s), who came up with an explanation for the paradox. Hubble’s results had been flawed, because his estimates for the distances of the galaxies had been erroneous. It had not occurred to him that the Cepheid variable stars basically come in two types: the classical ones, which are young luminous stars, and the older ones which are fainter, e.g., see Ref. [34], figure in Section ‘Calculating distances using Cepheids’. The distance calibration had involved the second type, whereas Hubble had been processing data pertaining to classical Cepheid variable stars. The consequence was that the distances of the galaxies had seriously been underestimated (to be so luminous, the observed star had to be closer to the Milky Way). By how much had Hubble underestimated the present-day distances? It turns out that Hubble’s distances were between 6.8 and 68.7 % of the present-day estimates (available from the NASA/IPAC Extragalactic Database [20]), the median being equal to about 15.3 %. (I find it very surprising that no one validated Hubble’s determination of the distances of the galaxies for such a long time.) Within a few years of Baade’s discovery, his doctoral student, Allan Rex Sandage (1926-2010), came up with the first realistic estimate for H_0 and for the age of the Universe in his 1958 paper [35]; in the abstract of that work, one reads: “…This gives H ≈ 75 km/s Mpc⁻¹ or H⁻¹ ≈ 13×10⁹ years …”

I had been wondering for some time what Hubble’s H0 result would have been, if present-day data for the galaxies, which are mentioned in Ref. [10], had been available to him. To this end, I used the latest information from NED [20] for all 45 galaxies mentioned in Ref. [10], i.e., for the then selected 24 (whose data Hubble considered ‘reliable’), as well as for the 21 which, though mentioned in Hubble’s paper, had not been fully measured by 1929. Information about the right ascension, declination, and radial velocity can be found directly on the main NED web page of each of these galaxies, whereas all available distance estimates may be downloaded using a tab therein. It does not take long to realise that the variability of the distance estimates of these galaxies is sizeable even today, nearly one century after Hubble’s analysis. In any case, I obtained the median, the mean, and the rms of the distribution of all distance estimates (no exclusion of results) for each of these galaxies, and finally constructed the input (to the optimisation software) datafile using the means and the rms values; alternatively, one could replace the means with the medians.

The measured radial velocities were corrected for the motion of the Solar System, using updated information regarding the ‘peculiar’ velocity [36] and the rotational motion of the Local Standard of Rest (LSR) [37, 38], a coordinate system obtained after averaging the velocities of stars in our neighbourhood. The three components of the velocity of the Solar System relative to the centre of the Milky Way will be denoted as (U, V, W): in the coordinate system which follows the right-hand rule (to my great surprise, the use of left-handed coordinate systems is not uncommon in Observational Astronomy!), U is positive towards the galactic centre, V is positive in the direction of the galactic rotation, and W is positive towards the North Galactic Pole. The velocity of the Solar System relative to the centre of the Milky Way is obtained by adding the peculiar velocity of the Sun (i.e., the velocity relative to the LSR) to the velocity of the LSR relative to the centre of the Milky Way. The velocity components U and W (and their uncertainties) were taken from Ref. [36], whereas V (and its uncertainty) was fixed from Ref. [37]; the V result of Ref. [37] is compatible with the estimate for the rotational velocity of the LSR which is recommended in Ref. [38]. For the transformation from equatorial (α, δ) to galactic (latitude b and longitude l of each celestial object) angles, Ref. [39] was followed, see p. 900 therein: the radial velocities of Ref. [20] were corrected by adding to the measured velocity the term Δv = U cos b cos l + V cos b sin l + W sin b.
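The correction itself is a one-liner; the sketch below (with placeholder values for U, V, W, not the cited estimates) illustrates it:

```python
import numpy as np

# dv = U cos(b) cos(l) + V cos(b) sin(l) + W sin(b), with b, l in radians;
# the corrected radial velocity is the measured one plus dv.
def solar_motion_correction(v_measured, b, l, U, V, W):
    dv = (U * np.cos(b) * np.cos(l)
          + V * np.cos(b) * np.sin(l)
          + W * np.sin(b))
    return v_measured + dv
```

For instance, towards the galactic centre (b = 0, l = 0) the correction reduces to U alone.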

A few words are due regarding the treatment of some outliers in the dataset of the 45 galaxies mentioned in Ref. [10].

  • After the correction of its radial velocity, NGC 3031 appears to recede from the Milky Way at a speed of 74.9(3.8) km/s, whereas its distance suggests that it should recede at about 250 km/s.

  • Six galaxies (SMC, LMC, NGC 6822, NGC 598, NGC 221, and NGC 224) are separated from the Milky Way by less than 1 Mpc: the proximity of these galaxies to the Milky Way results in irregular radial velocities. The (corrected) recessional velocities of four of these galaxies are negative, implying that they approach the Milky Way (only LMC and NGC 6822 recede). Regarding the treatment of these six galaxies, one may either replace them with one representative object, situated at a typical distance (weighted average) of 0.055(21) Mpc and having a typical velocity (weighted average) of (-5 ± 24) km/s, or exclude them from the database.

The results of the WLRs (without intercept) following the two approaches of Appendix A.4, for four ways of treating the aforementioned outliers, are listed in Table 2; the input datapoints (after all seven outliers are removed from the database) and the fitted straight line in case of the EV2 method are shown in Fig. 3. The crux of the matter is that, had present-day data been available to Hubble for his 1929 paper, he probably would have ended up with H0 = (71.0 ± 2.9) km/s Mpc^-1 (or with a similar result). The Particle Data Group (PDG) recommend H0 = 67.4(0.5) km/s Mpc^-1, obtained from a plethora of precise astronomical data [40].

Table 2:

The results of the linear fit (without intercept, see Appendix A.4) to the pairs of measurements of the distance and of the radial velocity (corrected for the motion of the Solar System) of the galaxies mentioned in Ref. [10]; the input (r, v) datapoints correspond to present-day information, available from the NASA/IPAC Extragalactic Database [20]. Results are quoted for four ways of treating the seven outliers, i.e., the (r, v) measurements corresponding to NGC 3031, as well as to the six galaxies closest to the Milky Way (see text). The t-multipliers, corresponding to 1σ effects in the normal distribution, have been applied to the quoted uncertainties.

Method χ²_min χ²_min/NDF p̂_1 ≡ H0 δp̂_1 ≡ δH0
(km/s Mpc^-1) (km/s Mpc^-1)
All galaxies: 45 (r, v) datapoints
Quadratic summation (EV2) 723.12 16.43 82 14
Linear summation 494.14 11.23 75.4 9.9
All but NGC 3031: 44 (r, v) datapoints
Quadratic summation (EV2) 690.61 16.06 83 14
Linear summation 468.56 10.90 77 10
No NGC 3031, six closest galaxies replaced by one: 39 (r, v) datapoints
Quadratic summation (EV2) 48.22 1.27 71.0 2.9
Linear summation 45.60 1.20 71.0 2.9
All outliers removed: 38 (r, v) datapoints
Quadratic summation (EV2) 48.08 1.30 71.0 2.9
Linear summation 45.47 1.23 71.0 2.9
Figure 3: The pairs of measurements of the distance and of the radial velocity (corrected for the motion of the Solar System) of the galaxies mentioned in Ref. [10]; the values correspond to present-day information, obtained from the NASA/IPAC Extragalactic Database [20]. Seven outliers have been removed from the database (see text). The uncertainties of the radial velocity have been taken into account, but are too small to be displayed.
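For orientation, a weighted straight-line fit without intercept has a closed-form solution; the sketch below is the textbook chi-square version (uncertainties on the velocities only), not the EV2/effective-variance scheme of Appendix A.4, which also propagates the distance uncertainties:

```python
import numpy as np

# Minimise chi^2 = sum_i w_i (v_i - p1 r_i)^2 with w_i = 1/sigma_i^2;
# for a straight line through the origin the minimum is analytic.
def fit_no_intercept(r, v, sigma_v):
    r, v = np.asarray(r, float), np.asarray(v, float)
    w = 1.0 / np.asarray(sigma_v, float) ** 2
    p1 = np.sum(w * r * v) / np.sum(w * r * r)   # best-fit slope (H0)
    dp1 = 1.0 / np.sqrt(np.sum(w * r * r))       # its 1-sigma uncertainty
    chi2 = np.sum(w * (v - p1 * r) ** 2)
    return p1, dp1, chi2
```

Applied to noise-free data generated with a slope of 70, the function returns that slope with a vanishing chi-square.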

2.2 Of sweet peas, moths, and men: Galton’s family data on human stature

Galton was an English polymath, with important contributions to subjects in a variety of branches of Science, ranging from Anthropology and Biology to Statistics. Before I started writing this article, I did not know that, in addition, Galton was among the very first to introduce weather maps in order to obtain forecasts, in particular to predict storms. Nor was I aware of the fact that he had been one of the pioneers of Forensic Science: in 1892, he published a book on the analysis of fingerprints as a means of identifying individuals [41]. In this paper, I shall focus on his 1889 book [11], which touches upon the subject of natural inheritance.

In his book, Galton set out to explore three questions (see pp. 1-2). I let him give a summary of these issues, copying a few sentences from his book.

  • “The large do not always beget the large, nor the small the small, and yet the observed proportions between the large and the small in each degree of size and in every quality, hardly varies from one generation to another.” Galton suggests that the corresponding PDFs of the human height “hardly vary from one generation to another,” though (so far as human stature is concerned) the children are not close replicas of their parents.

  • “The second problem regards the average share contributed to the personal features of the offspring by each ancestor severally. Though one half of every child may be said to be derived from either parent, yet he may receive a heritage from a distant progenitor that neither of his parents possessed as personal characteristic. Therefore the child does not on the average receive so much as one half of his personal qualities from each parent, but something less than a half.” At this point, the laws of Mendelian inheritance, which (presumably unbeknownst to him) had been developed by the time Galton started writing his book, come to my mind. (Such inherited features from distant progenitors, categorised as ‘latent elements’, are addressed in Chapter XI of Galton’s book.)

  • “The last of the problems that I need mention now, concerns the nearness of kinship in different degrees. We are all agreed that a brother is nearer akin than a nephew, and a nephew than a cousin, and so on, but how much nearer are they in the precise language of numerical statement?” Here, Galton expresses interest in the quantification of the degree of consanguinity.

I shall next address a few issues relevant to Galton’s family data on human stature (‘Galton’s family data’ henceforth). Before skimming through Galton’s book, I was under the impression that Galton’s assistants had been responsible for all measurements of the height of the individuals on which the study [11] was based. This turned out to be untrue. Galton discusses the data-acquisition process in Chapter VI of his book, starting on p. 71: “I had to collect all my data for myself, as nothing existed, so far as I know, that would satisfy even my primary requirement. This was to obtain records of at least two successive generations of some population of considerable size. They must have lived under conditions that were of a usual kind, and in which no great varieties of nurture were to be found.” Although, according to his first sentence, Galton “had to collect all his data for himself,” the data was actually collected ‘through the offer of prizes’ by individuals who were inexperienced in acquiring measurements for scientific purposes (pp. 74-75). These individuals simply responded to Galton’s request (p. 72): “Mr. Francis Galton offers 500l. (i.e., £500) in prizes to those British Subjects resident in the United Kingdom who shall furnish him before May 15, 1884, with the best Extracts from their own Family Records.” Although the conditions regarding the use of any submitted data were thoroughly addressed in the advertisement (pp. 72-74), those pertaining to the data acquisition itself were not! A certain amount of disappointment overtook me as I read the following passage in Galton’s book (p. 78): “In many cases there remains considerable doubt whether the measurement refers to the height with the shoes on or off; not a few of the entries are, I fear, only estimates, and the heights are commonly given only to the nearest inch.” He did not mention that this was mostly the result of his glaring failure to specify the process by which the measurements of the height should have been acquired. Nevertheless, Galton’s spirits must have had a miraculous recovery as, a few lines on, he shifted his mind-set towards the thesis that “a fair share of these returns are undoubtfully careful and thoroughly trustworthy, and as there is no sign or suspicion of bias, I have reason to place confidence in the values of the Means that are derived from them.” How can there be “no sign or suspicion of bias,” if the fraction of measurements, which had been made with the shoes on, remains unknown? Without doubt, the inclusion of such measurements into the database is bound to introduce bias, in particular (for the sake of example) if most of the fathers walked around the house (and let their heights be measured) in shoes, whereas most of the other family members preferred to walk (and let their heights be measured) barefoot.

Be that as it may, Galton’s records contain rows corresponding to different families: the essential information comprises the parental heights, as well as the ordered heights of all filial descendants, first of the sons, followed by those of the daughters. To be able to analyse the data in his time (it was Karl Pearson (1857-1936), Galton’s protégé, who brought Statistics to another level) using one independent variable, Galton decided to combine the two parental heights into one measure of ‘parental tallness’. As the arithmetical mean of the two heights did not appeal to him (on p. 6 of his book, Galton writes: “A serious complexity due to sexual differences seems to await us at every step when investigating the problems of heredity …The artifice is never to deal with female measures as they are observed, but always to employ their male equivalents in the place of them. I transmute all the observations of females before taking them in hand, and thenceforward am able to deal with them on equal terms with the observed male values.”), he came up with a scheme of ‘transmuting females into male equivalents’ by increasing the heights of all females in his study (i.e., of the mothers and of the daughters) by 8 %. When I first had a look at Galton’s family data, this factor appeared arbitrary to me, but (as I shall explain shortly) it emerges from the data sample which was available to him. After ‘transmuting the mothers into father equivalents’, Galton created a hypothetical parent (by averaging the two values), a ‘mid-parent’ in his own words (p. 87): “The word ‘Mid-Parent’ …expresses an ideal person of composite sex, whose stature is half way between the stature of the father and the transmuted stature of the mother.”

Galton finally presented the data in tabular form, see Table 11 of his book, p. 208; all heights are expressed in inches (1″ = 2.54 cm). Given in this table are the frequency distributions of the filial height in nine histogram bins of the mid-parent’s height ranging between 63.5″ and 72.5″, with one overflow (mid-parent’s height in excess of 72.5″) and one underflow (mid-parent’s height below 63.5″) bin. As I was not content with this table, I started investigating whether Galton’s original data had appeared elsewhere.

This was how I came across Ref. [42] and found out that its author too, James A. Hanley, had been in pursuit of the original data over two decades ago: on p. 238 of his paper, Hanley writes: “And so, in 2000, I began my search for Galton’s ‘untransmuted’ data …I also hoped that the children would still be found with their families, that is, before they were marshalled into what Galton called ‘filial’ arrays.” Hanley then continues to recount his efforts to localise Galton’s records, which were finally ‘unearthed’ in 2001, in the possession of University College London. I can almost sense his contentment as I read his text: “The data were exactly what I had wished, in a single notebook, family by family, with sons and daughters identified, and with all female heights untransmuted. Because of the frail condition of the notebook, photocopying was not permitted …In February 2003, I requested and obtained permission to digitally photograph the material.” Digital photographs of the eight relevant pages of Galton’s notebook are now, as a result of Hanley’s commendable efforts, available online [43].

I copied the values, as they appear in the digital photos, onto an Excel file (the constant value of 60″ had been subtracted from all entries in Galton’s notebook) and edited a few family records as follows:

  • Family 70: one daughter, described as ‘tall’, was removed;

  • Family 76: one son (between the heights of 68″ and 67″) and a second daughter, both described as ‘medium’, were removed;

  • Family 92: one daughter, described as ‘tall’, was removed;

  • Family 95: one daughter, described as ‘deformed’, was removed;

  • Family 118: the fourth son, described as ‘tallish’, and all five daughters, described as ‘tallish’ (twice), ‘medium’ (twice), and ‘shortish’, were removed;

  • Family 119: the first three daughters, described as ‘tall’ (twice) and ‘medium’, were removed;

  • Family 131: three daughters, described as ‘medium’ and ‘short’ (twice), were removed;

  • Family 144: the third son, described as ‘deformed’, was removed;

  • Family 158: one daughter (between the heights of 62″ and 61″), described as ‘short’, was removed;

  • Family 159: the first three daughters, described as ‘very tall’ and ‘tall’ (twice), were removed;

  • Family 173: the last son, described as ‘short’, was removed;

  • Family 176: one son (between the heights of 68.5″ and 66.5″), described as ‘medium’, and the last daughter, described as ‘idiotic’, were removed;

  • Family 177: two daughters (between the two reported values), described as ‘tall’, were removed; and

  • Family 189: the second son and one daughter (between the two reported values), both described as ‘middle’, were removed.

No measurement of height is mentioned in the aforementioned cases of deleted entries. I thus came up with 205 rows (families) and a total of 934 children, 481 sons and 453 daughters. I cannot explain why Galton writes on p. 77 of his book: “I was able to extract …the statures of 205 couples of parents, with those of an aggregate of 930 of their adult children of both sexes.” Nor can I understand why the filial multiplicities in the eleven (i.e., 9+2) bins of the mid-parent’s height in his Table 11 (p. 208) sum up to 928.

2.2.1 Correlations between the parental and filial heights

It is interesting to compare the values of Pearson’s correlation coefficient between the sets of heights corresponding to all combinations of parental and filial types. I am aware of three methods taking on the problem of unequal dimensions of the input datasets (205 pairs of parents, 481 sons, 453 daughters).

  • Method A: The analysis is restricted to the families with either exactly one son or exactly one daughter. (Of course, families with exactly one son and exactly one daughter contribute to both cases.) As the heights of the sons and of the daughters will be analysed separately, the correspondence between each pair of parents and the relevant filial descendant (male and/or female, as the case might be) is one-to-one (bijective). In the accepted data, there are 42 families with exactly one son (the multiplicity of the daughters in these families is irrelevant) and 60 with exactly one daughter (the multiplicity of the sons in these families is irrelevant).

  • Method B: Within each family, the mean fraternal and sororal heights are evaluated. (Of course, families with no sons or with no daughters do not contribute to the corresponding dataset.) Again, a bijective correspondence between each pair of parents and the filial descendants is created (all sons and all daughters within a family are replaced by two persons - one male, another female - with stature corresponding to the aforementioned means). In the accepted data, there are 179 families with at least one son and 176 with at least one daughter.

  • Method C: The obvious commonality between the descendants within the same family is ignored, and the two parental heights are assigned to all filial descendants of each family. The correspondence is multi-valued: each pair of parents is linked to all of their filial descendants.

I do not expect methods A and B to introduce bias into the analysis.
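Method B, for instance, may be sketched as follows (toy father/sons records, invented for illustration):

```python
import numpy as np

# Method B: replace the sons of each family by one 'person' of mean height,
# then correlate the paternal heights with these fraternal means.
families = [            # (father's height, heights of his sons), in inches
    (70.0, [71.0, 69.5]),
    (68.0, [68.5]),
    (66.0, [65.0, 67.0, 66.5]),
    (72.0, [73.0, 71.5]),
    (69.0, [70.0]),
]
fathers = np.array([f for f, sons in families if sons])
son_means = np.array([np.mean(sons) for f, sons in families if sons])
r = np.corrcoef(fathers, son_means)[0, 1]   # Pearson's correlation coefficient
```

Families without sons drop out of the list comprehension, exactly as they drop out of the corresponding dataset in the text.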

The values of Pearson’s correlation coefficient, obtained when following the aforementioned methods, are given in Table 3. Regardless of the method, the correlation of the heights of both filial types with the paternal height comes out stronger.

Table 3:

Pearson’s correlation coefficient between the heights corresponding to all combinations of parental and filial types.

Paternal height Maternal height
Method A
Son’s height 0.642 0.426
Daughter’s height 0.489 0.442
Method B
Sons’ mean height 0.520 0.364
Daughters’ mean height 0.494 0.367
Method C
Sons’ heights 0.393 0.323
Daughters’ heights 0.429 0.305

2.2.2 Distributions of human height

Table 4 contains the approximate values of the mean and of the rms of the distributions of parental and filial heights corresponding to Galton’s 205 families. The apparent similarities of these statistical measures for the male (fathers and sons) and for the female (mothers and daughters) subjects, which the contents of the table suggest, require further examination.

Table 4:

Approximate mean and rms values of the distributions of parental and filial heights corresponding to Galton’s 205 families, containing acceptable entries of 481 sons and 453 daughters. The ratio of the mean parental heights is equal to about 1.083, i.e., close to the factor of 1.08 which Galton employed in order to transform the heights of the females into those of the ‘male equivalents’. In this work, there is no need for such a transformation.

Family member Mean value rms
Father 69.316″ 2.647″
Mother 64.002″ 2.333″
Son 69.234″ 2.624″
Daughter 64.103″ 2.356″

The four distributions of the parental and the filial heights will next be submitted to a number of tests. Implementations of the algorithms relevant to these tests are available in the form of classes in my C++ software library, developed within the framework of Microsoft Visual Studio. The input comprises the original arrays (i.e., not histograms of these data).

The first test concerns the symmetry of the distributions about their corresponding median values. The results of Wilcoxon’s signed-rank tests [44] are given in Table 5. On the basis of the two sets of p-values (depending on whether or not the so-called continuity correction is applied), there is no evidence in support of asymmetrical distributions about their corresponding median values.

Table 5:

The results of Wilcoxon’s signed-rank (two-tailed) tests [44] on the four distributions of parental and filial heights; owing to the large sample sizes, the Edgeworth approximation [45] is reliable (and has been used). The quantity n is the dimension of each input dataset. The T-domain and the T-value are standard quantities associated with the test. The two sets of p-values, depending on whether or not the so-called continuity correction is applied, are nearly identical.

Quantity Fathers Mothers Sons Daughters
n 205 205 481 453
T-domain [0, 21 115] [0, 21 115] [0, 115 921] [0, 102 831]
T-value 9 244.5 8 881.0 59 025.5 45 352.0
Without continuity correction
p-value 3.849·10^-1 7.021·10^-1 4.512·10^-1 4.180·10^-1
With continuity correction
p-value 3.852·10^-1 7.027·10^-1 4.513·10^-1 4.181·10^-1
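The symmetry test of Table 5 can be sketched with SciPy on a synthetic sample; note that scipy.stats.wilcoxon relies on a normal approximation for large samples, not on the Edgeworth approximation used in this work:

```python
import numpy as np
from scipy import stats

# Test the symmetry of a distribution about its median: subtract the sample
# median and apply Wilcoxon's signed-rank test to the differences.
rng = np.random.default_rng(7)
heights = rng.normal(69.25, 2.64, 481)        # synthetic 'heights' sample
diffs = heights - np.median(heights)          # one zero difference is dropped
res = stats.wilcoxon(diffs, correction=True)  # continuity correction applied
```

A large p-value (as expected for symmetric data) gives no grounds to reject symmetry about the median.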

To test the similarity of the distributions of the height between a) fathers and sons, and b) mothers and daughters, two tests were performed: the Mann-Whitney-Wilcoxon test [44, 46] and the Brunner-Munzel test [47]. The p-values of these two tests on the data at hand are nearly identical, see Table 6: as a result, there is no evidence in support of dissimilar distributions of the height between a) fathers and sons, and b) mothers and daughters. Therefore, on the basis of his data, no evidence can be produced against Galton’s first conjecture about the similarity of the two (one for male, another for female subjects) PDFs of the human height “from one generation to another” (see beginning of Section 2.2). (Without doubt, the mean height of humans gradually increased during the twentieth century, e.g., see Figs. 6-8 of Ref. [48]. However, this effect has been associated with better nutrition and improved healthcare. According to Ref. [49]: “Poor nutrition and illness in childhood limit human growth. As a consequence, the mean height of a population is strongly correlated with living standards in a population.”)

Table 6:

Results of the two (two-tailed) tests for similarity of the distributions of the height between a) fathers and sons, and b) mothers and daughters. The quantities n and m are the dimensions of the input datasets (parent and filial descendant) in each case. The U-domain, U-value, and T-value are standard quantities associated with the two tests.

Quantity Fathers, sons Mothers, daughters
n 205 205
m 481 453
Median height, parent 69.5″ 64.0″
Median height, offspring 69.2″ 64.0″
Results of the Mann-Whitney-Wilcoxon test [44, 46]
U-domain [0, 98 605] [0, 92 865]
U-value 49 939.5 46 094.0
p-value, without continuity correction 7.881·10^-1 8.805·10^-1
p-value, with continuity correction 7.883·10^-1 8.807·10^-1
Results of the Brunner-Munzel test [47]
T-value 0.2702 -0.1505
p-value 7.871·10^-1 8.805·10^-1
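Both tests of Table 6 are available in SciPy; the sketch below applies them to synthetic stand-ins for the two samples:

```python
import numpy as np
from scipy import stats

# Two-sample similarity tests: Mann-Whitney-Wilcoxon and Brunner-Munzel.
rng = np.random.default_rng(3)
fathers = rng.normal(69.3, 2.6, 205)   # synthetic parental sample
sons = rng.normal(69.2, 2.6, 481)      # synthetic filial sample
u = stats.mannwhitneyu(fathers, sons, alternative='two-sided')
bm = stats.brunnermunzel(fathers, sons)
```

Unequal sample sizes pose no problem for either test, which is precisely why they suit the parent/offspring comparison.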

The aforementioned tests suggest that the distributions of the height a) of the fathers and of the sons, and b) of the mothers and of the daughters may be combined into two distributions: one for the male, another for the female subjects. To rid the data of the rounding effects, which give rise to artefacts in the analysis (the height of 24 out of the 205 fathers in Galton’s family data is given as 70″; such shortcomings would hardly have occurred, if the data acquisition had been carried out by professionals), random uniformly-distributed noise U(0,1), shifted by −0.5 so that its expectation value corresponds to a vanishing correction, was added to all available measurements of the height; in essence, this operation neutralises Galton’s concern that “the heights are commonly given only to the nearest inch.” Two new datafiles were created after the addition of the ‘noise’, one for the male, another for the female subjects, and the resulting distributions were tested for normality.
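The de-rounding operation amounts to one line of NumPy (toy rounded heights below):

```python
import numpy as np

# Add zero-mean uniform noise U(0,1) - 0.5 to heights recorded to the
# nearest inch, so that the expected correction vanishes.
rng = np.random.default_rng(0)
recorded = np.array([70.0, 68.0, 70.0, 71.0, 69.0])       # rounded values
smoothed = recorded + rng.uniform(0.0, 1.0, recorded.size) - 0.5
```

Each smoothed value stays within half an inch of the recorded one, consistent with rounding to the nearest inch.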

Numerous algorithms are available for testing the normality of distributions, e.g., see Refs. [50, 51] and the works cited therein. In this study, the normality will be tested by means of three well-established statistical methods.

  • The formal Shapiro-Wilk normality test, which emerges as the ultimate test for normality in power studies (e.g., see Ref. [52]), was introduced in 1965 [53] for small samples (in the first version of the algorithm, a maximum of 50 observations could be tested) and was substantially extended (for application to large samples, certainly up to n = 5 000 observations, perhaps to even larger samples) in a series of studies by Royston [54].

  • The Anderson-Darling test [55] with the D’Agostino 1986 addition [56].

  • D’Agostino’s (or the D’Agostino-Pearson) K² test, which was introduced in 1973 [57] and appeared in its current form in 1990 [51].

There is only one commonality between these three tests, the obvious one: each test results in an estimate for the p-value for the acceptance of the null hypothesis, i.e., that the underlying distribution is the normal distribution N(μ, σ²). Without further ado, I present the results of the tests in Table 7. The normal probability plots, corresponding to the male and to the female subjects, are given in Figs. 4 and 5, respectively. In both cases, a few outliers may be seen in the tails of the two distributions. Although such datapoints may be excluded in a more thorough analysis of Galton’s family data, it was decided to leave them in the database of this work. On the basis of Table 7, as well as of Figs. 4 and 5, no evidence can be produced in support of a departure of the two distributions from normality. The two PDFs of the height corresponding to Galton’s male (fathers and sons) and female (mothers and daughters) subjects, obtained from the means and the rms values of Table 7, are shown (side by side) in Fig. 6.
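All three tests are implemented in SciPy (normaltest is the D’Agostino-Pearson K² test; anderson returns critical values rather than a p-value, so the D’Agostino 1986 addition used in this work is not reproduced). A sketch on a synthetic sample:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.normal(69.25, 2.64, 686)       # synthetic 'male' heights

w_stat, sw_p = stats.shapiro(sample)        # Shapiro-Wilk (Royston extension)
ad = stats.anderson(sample, dist='norm')    # Anderson-Darling statistic
k2_stat, k2_p = stats.normaltest(sample)    # D'Agostino-Pearson K^2
```

For normally-distributed input, the Shapiro-Wilk W-statistic lies close to 1, and the two p-values are uniformly distributed on (0, 1).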

Table 7:

Results of the three tests for normality of the distributions of the height of the male (fathers and sons) and of the female (mothers and daughters) subjects. The quantity n is the dimension of the input datasets (parent and filial descendant) in each case. The W-statistic, A²-statistic, and K²-statistic are the test statistics associated with the three tests.

Quantity Fathers + sons Mothers + daughters
n 686 658
Mean value 69.254″ 64.071″
Standard deviation 2.636″ 2.375″
Skewness 0.001 -0.013
Kurtosis 3.361 3.192
Excess kurtosis 0.361 0.192
Results of the Shapiro-Wilk test [53, 54]
W-statistic 0.9958 0.9976
p-value 6.45·10^-2 4.73·10^-1
Results of the Anderson-Darling test [55, 56]
A²-statistic 0.6594 0.3764
p-value 8.52·10^-2 4.11·10^-1
Results of the K² test [51, 57]
K²-statistic 3.2791 1.1738
p-value 1.94·10^-1 5.56·10^-1
Figure 4: The normal probability plot for the male subjects (fathers and sons) in Galton’s family data. The red straight line is the result of an optimisation with equal statistical weights for all datapoints (see Appendix A.2). The departure of the datapoints from linearity in the normal probability plot is equivalent to the departure of the input data from normality.
Figure 5: The normal probability plot for the female subjects (mothers and daughters) in Galton’s family data, see also caption of Fig. 4.
Figure 6: The two PDFs of the height corresponding to the male (fathers and sons) and the female (mothers and daughters) subjects in Galton’s family data. It must be borne in mind that these two distributions represent “British subjects, resident in the United Kingdom,” who decided to send the “best extracts from their own family records” to Galton in 1884.

2.2.3 Modelling

Regarding the modelling of Galton’s family data, two approaches would appeal to the physicist, both involving the extraction of the mean filial heights (for the sake of simplicity, both will be denoted as y_i, but will be separately modelled) for each family (of course, families without any sons or without any daughters do not contribute to the corresponding datafile). Provided that the filial heights are sampled from normal distributions with two family-dependent means and two family-independent variances (in the language of the analysis of variance (ANOVA), these variances are associated with the within-treatments variations), i.e., one variance relevant to the heights of the sons σ²_M, another to those of the daughters σ²_F, the two standard errors of the means δy_i are expected to be equal to σ_X/√n_i, where σ²_X is to be identified either with σ²_M or with σ²_F, and n_i stands for the fraternal or sororal multiplicity in the i-th family. As explained in Appendix A.3, a statistical weight w_i is introduced, to account for the uncertainty δy_i of each of the two filial means: evidently, w_i = n_i/σ²_X and, as the plan is to perform separate fits to the two types of mean filial heights, one may simply use w_i = n_i for the purposes of the optimisation: the rescaling of the statistical weights (i.e., their multiplication by any positive real constant) has no impact on the important results of the fits.
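The weighting argument can be verified numerically (toy multiplicities; the value of σ_X is arbitrary, as it cancels in the rescaling):

```python
import numpy as np

# delta y_i = sigma_X / sqrt(n_i)  =>  w_i = 1/(delta y_i)^2 = n_i / sigma_X^2;
# since sigma_X is common to all families, w_i is proportional to n_i.
n_i = np.array([3, 1, 2, 4])               # fraternal multiplicities (toy)
sigma_X = 2.6
dy = sigma_X / np.sqrt(n_i)                # standard errors of the means
w = 1.0 / dy ** 2                          # statistical weights
```

Multiplying w by the constant σ²_X recovers n_i exactly, which is why w_i = n_i suffices in the fits.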

I shall next address the two types of optimisation to be pursued in relation to Galton’s family data. The first involves multiple linear regression (with the parental heights as the two independent variables), as detailed in Appendix A.5. The second possibility involves the combination of the two parental heights into one value, i.e., the one corresponding to the hypothetical mid-parent in Galton’s language. However, the realisation of this possibility does not follow Galton’s simple procedure, but a more rigorous one, namely the separate standardisation of the two parental heights: the notion of tallness can be quantified (in the absolute sense) on the basis of a comparison of each parent’s height with the heights of the parents of the same sex. To this end, one may obtain an absolute measure of tallness for each of the two parents of the i-th family by introducing the quantity z_{i} via the relation

z_{i}=\frac{h_{i}-\bar{h}}{\sqrt{{\rm var}(h)}}\,\,\,, (6)

where h_{i} denotes the height of the parent, and \bar{h} and {\rm var}(h) stand for the mean height and for the (unbiased) variance of the distribution of the height of the parents of the same sex. Although the values of Table 4 could be used, I decided to employ the statistical measures of the two distributions of parental height for those of the families with at least one son (when modelling the fraternal heights) and at least one daughter (when modelling the sororal heights), see Table 8.

Table 8:

Approximate mean and rms values of the distributions of parental heights corresponding to Galton’s 179 families with at least one son (suitable for the modelling of the fraternal heights) and the 176 families with at least one daughter (suitable for the modelling of the sororal heights).

Family member Mean value rms
Obtained from the data of the 179 families with at least one son
Father 69.099′′ 2.547′′
Mother 63.994′′ 2.367′′
Obtained from the data of the 176 families with at least one daughter
Father 69.418′′ 2.732′′
Mother 64.125′′ 2.289′′

After standardising the height of each parent of each family (using separate distributions, as explained above), one may either add the two z-scores or average them (the choice in this work), and assign that score (the independent variable) to the mid-parent for the purposes of a WLR, as detailed in Appendix A.1.
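The separate standardisation and the averaging of the two z-scores may be sketched as follows (a minimal illustration with hypothetical heights; the function name `standardise` is an illustrative choice, not from the paper):

```python
import numpy as np

# Hypothetical parental heights (inches) for a few families.
fathers = np.array([70.0, 68.5, 72.1, 66.9])
mothers = np.array([64.0, 62.8, 66.5, 63.1])

def standardise(h):
    """z-score of each height against its own (same-sex) distribution,
    using the unbiased (ddof=1) variance, as in Eq. (6)."""
    return (h - h.mean()) / h.std(ddof=1)

# Average the two z-scores to obtain one 'mid-parent' score per family,
# to be used as the independent variable of the WLR.
mid_parent_score = 0.5 * (standardise(fathers) + standardise(mothers))
```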

Before advancing, I should mention that I have not studied the papers relevant to the past analyses of Galton’s family data; this work aims at applying a few standard methods of linear regression, not at compiling a review article on Galton’s dataset. The interested reader is referred to Ref. [42], as well as to the works cited therein.

2.2.3.1 Results of the optimisation in case of multiple linear regression

The results of linear regression, using two independent variables (the paternal heights x_{i} and the maternal heights z_{i}), are given in Table 9 for the two fits to the filial heights. The square of the Birge factor [58] is an estimate for the (unbiased) ‘unexplained’ variance in each case, see also Appendices A.3, B, and C.

Table 9:

The results of linear regression, using two independent variables (the paternal heights x_{i} and the maternal heights z_{i}) for the separate fits to the filial heights. All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the values of the t-multiplier are small, and modify (to the quoted precision) only the fitted uncertainty of p_{1x} (from 0.041 to 0.042) in case of the fit to the sororal heights.

Quantity Sons Daughters
N 179 176
F(\hat{p}_{0},\hat{p}_{1x},\hat{p}_{1z}) 1236.75 941.58
Birge factor 2.65 2.33
\hat{p}_{0} 19.3′′ 18.8′′
\hat{p}_{1x} 0.418 0.374
\hat{p}_{1z} 0.329 0.304
\delta\hat{p}_{0} 4.7′′ 4.2′′
\delta\hat{p}_{1x} 0.053 0.041
\delta\hat{p}_{1z} 0.052 0.049
t-multiplier 1.0028 1.0029

Scatter plots of the standardised residuals (y_{i}-\tilde{y}_{i})/\delta y_{i} versus the fitted values \tilde{y}_{i} are given in Figs. 7 and 8 for the fits to the fraternal and sororal heights, respectively.

Figure 7: Scatter plot of the standardised residuals (y_{i}-\tilde{y}_{i})/\delta y_{i} versus the fitted values \tilde{y}_{i} for the fit to the mean fraternal heights y_{i}. The red straight line corresponds to the ideal description of the input data.
Figure 8: The equivalent of Fig. 7 for the fit to the mean sororal heights.
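For readers wishing to reproduce this type of fit, a weighted multiple linear regression with two independent variables can be sketched as follows. This is a minimal illustration on synthetic data with assumed true parameters, not Galton’s records; rescaling the design matrix by \sqrt{w_{i}} is one standard way of reducing the weighted problem to an ordinary least-squares one:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: paternal heights x, maternal heights z, mean filial
# heights y, and weights w playing the role of the filial multiplicities.
# The 'true' parameters (20.0, 0.4, 0.3) are assumptions of the example.
N = 50
x = rng.normal(69.0, 2.5, N)
z = rng.normal(64.0, 2.4, N)
y = 20.0 + 0.4 * x + 0.3 * z + rng.normal(0.0, 1.0, N)
w = rng.integers(1, 6, N).astype(float)

# Minimise sum_i w_i (y_i - p0 - p1x*x_i - p1z*z_i)^2 by scaling each
# row of the design matrix (and of y) by sqrt(w_i).
A = np.column_stack([np.ones(N), x, z]) * np.sqrt(w)[:, None]
b = y * np.sqrt(w)
p0, p1x, p1z = np.linalg.lstsq(A, b, rcond=None)[0]
```

The fitted slopes should land close to the assumed values 0.4 and 0.3, with scatter set by the noise level and the sample size.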

2.2.3.2 Results of the optimisation in case of a WLR

To start with, the mean filial heights correlate equally well with the mid-parents’ z-scores: Pearson’s correlation coefficient is equal to 0.593 in case of the mean fraternal heights and to 0.592 in case of the mean sororal heights. The results for the fitted values of the parameters and for their uncertainties, obtained from the two fits to the available data, are given in Table 10. The quality of the two fits is nearly as good as that of the corresponding fits in case of the multiple linear regression of the previous section.

Table 10:

The results of a WLR, using as independent variable the average z-scores of the parental heights and as dependent variable the filial heights. All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the values of the t-multiplier are small, and do not modify (to the quoted precision) the fitted uncertainties of the two parameters of the fit.

Quantity Sons Daughters
N 179 176
F(\hat{p}_{0},\hat{p}_{1}) 1252.62 963.67
Birge factor 2.66 2.35
\hat{p}_{0} 69.20′′ 64.14′′
\hat{p}_{1} 1.82′′ 1.71′′
\delta\hat{p}_{0} 0.12′′ 0.11′′
\delta\hat{p}_{1} 0.17′′ 0.16′′
t-multiplier 1.0028 1.0029

Plots of the mean filial heights versus the mid-parents’ z-scores are shown in Figs. 9 and 10 for the sons and the daughters, respectively.

Figure 9: The values of the mean fraternal height for the 179 families with at least one son, accompanied by uncertainties corresponding to the standard error of the means. The red straight line represents the optimal result of the WLR, see Table 10. The abscissa is the mid-parent’s z-score.
Figure 10: The equivalent of Fig. 9 for the daughters; there are 176 families in Galton’s family data with at least one daughter.

I have left one issue for the very end. Was Galton’s approach of ‘transmuting females into male equivalents’ deficient? To answer this question, let us consider two populations - one of male, another of female subjects - which fulfil two conditions.

  • The two populations are of equal size.

  • So far as human stature is concerned, the correspondence between the two groups is bijective. In addition, for each male subject with height x_{i}, there exists a female subject with height z_{i}=x_{i}/k, where k is a constant.

In such a population, the means of the heights of the male subjects \bar{x} and of the female subjects \bar{z} are related: \bar{z}=\bar{x}/k; similarly linked are the rms values of the two distributions, i.e., \sigma_{x} for the male subjects and \sigma_{z} for the female subjects: \sigma_{z}=\sigma_{x}/k. Let us next focus on one family, with parental heights x_{a} and z_{b}. Galton suggested that each mid-parent’s height m_{ab} should be taken as

m_{ab}=\frac{x_{a}+kz_{b}}{2}\,\,\,. (7)

On the other hand, the average z-score, corresponding to the heights of these two parents - see Eq. (6) - would be equal to

\frac{1}{2}\left(\frac{x_{a}-\bar{x}}{\sigma_{x}}+\frac{z_{b}-\bar{z}}{\sigma_{z}}\right)\,\,\,, (8)

which yields

\frac{1}{2}\left(\frac{x_{a}-\bar{x}}{\sigma_{x}}+\frac{z_{b}-\bar{x}/k}{\sigma_{x}/k}\right)=\frac{x_{a}+kz_{b}-2\bar{x}}{2\sigma_{x}}=\frac{m_{ab}-\bar{x}}{\sigma_{x}}\,\,\,. (9)

Given the constancy of the quantities \bar{x} and \sigma_{x}, there can be no doubt that, provided that the two aforementioned conditions are fulfilled, Galton’s mid-parent approach and the method involving the average z-score of the parental heights are bound to give identical results so far as the quality of the description of the filial heights is concerned. In case of Galton’s family data, the first criterion is undoubtedly fulfilled: the 205 families contain 205 fathers and 205 mothers. However, the second criterion is not fulfilled (in the mathematical sense): given that the heights of the subjects are sampled from continuous probability distributions, one cannot expect to find any two subjects in the entire population whose heights have a ratio exactly equal to the ratio k of the mean heights in the two populations of male and female subjects. The consequence is that \sigma_{z}\neq\sigma_{x}/k, and the average z-score of the parental heights is no longer given by Eq. (9). Only one question remains: how effective (in terms of the description of the input data) was Galton’s introduction of the mid-parent’s height via Eq. (7) as the independent variable in the linear-regression problem?
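The argument of Eqs. (7)-(9) is easily verified numerically. In the sketch below, an idealised population is constructed so that both conditions hold exactly; all numbers are hypothetical, and k = 1.08 is an assumption of the example:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1.08                         # assumed male-to-female height ratio

# An idealised population fulfilling both conditions: every female
# height is exactly the corresponding male height divided by k.
x = rng.normal(69.0, 2.5, 200)   # male heights
z = x / k                        # female heights

sx = x.std(ddof=1)
sz = z.std(ddof=1)               # equals sx / k by construction

# Pick one family (arbitrary indices) and compare the two recipes.
xa, zb = x[0], z[1]
mid_parent = 0.5 * (xa + k * zb)                          # Galton, Eq. (7)
avg_zscore = 0.5 * ((xa - x.mean()) / sx + (zb - z.mean()) / sz)

# Eq. (9): the average z-score coincides with the standardised
# mid-parent's height.
assert np.isclose(avg_zscore, (mid_parent - x.mean()) / sx)
```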

The results of the two WLRs to the average filial heights, using as independent variable Galton’s mid-parent’s height of Eq. (7), are given in Table 11. The quality of the fits is comparable to that obtained with the two methods of this work. All things considered, Galton’s use of the mid-parent’s height as the independent variable in the problem, which he set out to examine, was a reasonable approximation.

Table 11:

The results of a WLR, using as independent variable Galton’s mid-parent’s height of Eq. (7) and as dependent variable the filial heights. The optimal values of k per filial type have been used: 1.080 for the fraternal heights, 1.083 for the sororal ones. All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the values of the t-multiplier are small, and modify (to the quoted precision) only the fitted uncertainty of p_{1} (from 0.068 to 0.069) in case of the fit to the fraternal heights.

Quantity Sons Daughters
N 179 176
F(\hat{p}_{0},\hat{p}_{1}) 1252.98 953.79
Birge factor 2.66 2.34
\hat{p}_{0} 19.9′′ 18.3′′
\hat{p}_{1} 0.713 0.661
\delta\hat{p}_{0} 4.7′′ 4.2′′
\delta\hat{p}_{1} 0.068 0.060
t-multiplier 1.0028 1.0029

3 Conclusions

This work aimed at providing a modicum of perspective into the analysis of two well-known datasets in terms of the methods of linear regression.

The first of these sets has been taken from Hubble’s 1929 paper [10]; that work investigated the relation between the distances r of nebulae (identified with galaxies later on) and their radial velocities v (relative to the observer). Hubble assumed a linear relation between these two quantities, and determined the slope in their scatter plot; that slope, which became known as the ‘Hubble constant’ H_{0} at later times, was estimated by Hubble at about 450-600 km/s Mpc^{-1} [10, 33]. After revisiting Hubble’s problem, using his selection of galaxies in conjunction with present-day (r,v) data [20], this work demonstrated that Hubble could have obtained an H_{0} estimate close to the currently accepted value [40] (which is several times smaller than his 1929 and 1931 estimates), had accurate information (in particular, more accurate estimates for the distances r) been available to him, see Table 2 and Fig. 3.

The second dataset of this work related to Galton’s family data on human stature [11]. Galton’s original data, retrieved from digital photographs of eight pages of his notebook available from Ref. [43], contain records of parental and filial heights for each family separately. The four distributions of the height (two parental, two filial) were first tested for symmetry (about their respective median values), see the first part of Section 2.2.2. Subsequently, the similarity of the distributions of the height for the male subjects (fathers and sons), as well as for the female subjects (mothers and daughters), was investigated and confirmed. The two distributions, obtained after combining the heights of the male subjects, as well as those of the female subjects, were successfully tested for normality. The two PDFs of the height, corresponding to the male and female “British subjects, resident in the United Kingdom, who furnished Galton with the best extracts from their own family records,” are shown, next to one another, in Fig. 6. The filial heights were subsequently modelled in two ways:

  • in terms of multiple linear regression, using the parental heights as independent variables and

  • in terms of a WLR, featuring the average z-score of the two parents as the independent variable. Each of the z-scores of the two parents was obtained on the basis of a comparison of that parent’s height with the statistical measures corresponding to the distribution of the height of parents of the same sex. My thesis is that the separate standardisation of the parental heights represents the best option towards establishing an absolute standard of ‘parental tallness’ in the general case.

The two methods of linear regression gave nearly identical results (in terms of the description of the input data), supporting the hypothesis of a positive correlation between parental and filial heights, see Tables 9 and 10, and Figs. 9 and 10.

In addition to these results, this paper may also serve as a short guide to the formalism relevant to the various methods of linear regression, including the determination of the optimal values of the parameters and of their uncertainties, as well as of any predictions based on the results of the fits. Given in Appendix A are analytical formulae for all these quantities in case of one independent variable, for different types of input uncertainties, see Appendices A.1-A.4. Appendix A.5 deals with the simplest case of multiple linear regression, the one involving two independent variables. Appendices B and C deal with two technical issues, namely with some features of the numerical (as opposed to the analytical) minimisation, as well as with the application of the Birge factor to the fitted uncertainties of the parameters.

I acknowledge several interesting discussions with Davide Sardella on a number of subjects relating to Astronomy, from the very ancient to the very modern times. Figure 11 has been created with CaRMetal [59]. The remaining figures of this paper have been created with MATLAB® (The MathWorks, Inc., Natick, Massachusetts, United States).

References

  • [1] A.-M. Legendre, ‘Nouvelles méthodes pour la détermination des orbites des comètes’ (in French), F. Didot, Paris (1805); see https://books.google.ch/books?id=FRcOAAAAQAAJ&redir_esc=y
  • [2] https://priceonomics.com/the-discovery-of-statistical-regression
  • [3] R.J. Carroll, D. Ruppert, ‘Transformation and Weighting in Regression’, Springer-Science+Business Media, B.V. (1988). ISBN-13: 978-0-4120-1421-5
  • [4] G.S. Maddala, ‘Introduction to Econometrics’, 2nd edn., Macmillan Publishing Company (1992). ISBN-10: 0-02-374545-2
  • [5] F.A. Graybill, H.K. Iyer, ‘Regression Analysis: Concepts and Applications’, Duxbury Press (1994). ISBN-10: 0-53-419869-4, ISBN-13: 978-0-5341-9869-5
  • [6] T. Hastie, R. Tibshirani, J. Friedman, ‘The Elements of Statistical Learning: Data Mining, Inference, and Prediction’, 2nd edn., Springer (2016). ISBN-10: 0-38-784857-6, ISBN-13: 978-0-3878-4857-0
  • [7] R.B. Darlington, A.F. Hayes, ‘Regression Analysis and Linear Models: Concepts, Applications, and Implementation’, The Guilford Press (2017). ISBN-13: 978-1-4625-2113-5
  • [8] C.A. Cantrell, ‘Technical Note: Review of methods for linear least-squares fitting of data and application to atmospheric chemistry problems’, Atmos. Chem. Phys. Discuss. 8(17), 6409 (2008). DOI: 10.5194/acp-8-5477-2008
  • [9] S. Mikkonen et al., ‘Technical note: Effects of uncertainties and number of data points on line fitting - a case study on new particle formation’, Atmos. Chem. Phys. 19(19), 12531 (2019). DOI: 10.5194/acp-19-12531-2019
  • [10] E. Hubble, ‘A relation between distance and radial velocity among extra-galactic nebulae’, P. Natl. Acad. Sci. USA 15, 168 (1929). DOI: 10.1073/pnas.15.3.168
  • [11] F. Galton, ‘Natural Inheritance’, Macmillan (1889).
  • [12] L.M. French, ‘John Goodricke, Edward Pigott, and their study of variable stars’, J. Amer. Assoc. Var. Star Obs. 40(1), 120 (2012); see https://app.aavso.org/jaavso/article/2742/
  • [13] J. Goodricke, ‘XXVI. A series of observations on, and a discovery of, the period of the variation of the light of the bright star in the head of medusa, called algol. In a letter from John Goodricke, Esq. to the Rev. Anthony Shepherd, D.D.F.R.S. and Plumian Professor at Cambridge’, Philos. T. R. Soc. 73, 474 (1783). DOI: 10.1098/rstl.1783.0027. ISSN: 0261-0523; see https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1783.0027
  • [14] J. Goodricke, ‘II. A series of observations on, and a discovery of, the period of the variation of the light of the star marked δ\delta by Bayer, near the head of Cepheus. In a letter from John Goodricke, Esq. to Nevil Maskelyne, D.D.F.R.S.’, Philos. T. R. Soc. 76, 48 (1786). DOI: 10.1098/rstl.1786.0002. ISSN: 0261-0523; see https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1786.0002
  • [15] H. Shapley, ‘On the nature and cause of Cepheid variation’, Astrophys. J. 40, 448 (1914). DOI: 10.1086/142137
  • [16] S.L. Singh, ‘Big Bang: The origin of the Universe’, Harper Perennial (2005). ISBN-10: 0-00-716221-9, ISBN-13: 978-0-0071-6221-5; first published (under a slightly different title) by Fourth Estate (2004).
  • [17] M.W. Rossiter, “‘Women’s Work” in Science, 1880-1910’, Isis 71(3), 381 (1980). DOI: 10.1086/352540. ISSN 0021-1753.
  • [18] https://www.newspapers.com/article/10248290
  • [19] H.S. Leavitt, ‘1777 variables in the Magellanic Clouds’, Annals of Harvard College Observatory 60(4), 87 (1908); see https://articles.adsabs.harvard.edu/pdf/1908AnHar..60...87L
  • [20] https://ned.ipac.caltech.edu
  • [21] H.S. Leavitt, E.C. Pickering, ‘Periods of 25 Variable Stars in the Small Magellanic Cloud’, Harvard College Observatory Circular 173 (1912) 1; see https://articles.adsabs.harvard.edu/pdf/1912HarCi.173....1L
  • [22] E. Hertzsprung, ‘Über die räumliche Verteilung der Veränderlichen vom δ\delta Cephei-Typus’ (in German), Astron. Nachr. 196, 201 (1913); see https://adsabs.harvard.edu/full/1913AN….196..201H
  • [23] H. Sawyer Hogg, ‘Harlow Shapley and globular clusters’, Publ. Astron. Soc. Pac. 77(458), 336 (1965).
    DOI: 10.1086/128229
  • [24] V. Trimble, ‘The 1920 Shapley-Curtis discussion: Background, issues, and aftermath’, Publ. Astron. Soc. Pac. 107(718), 1133 (1995). DOI: 10.1086/133671
  • [25] https://www.nasa.gov/mission_pages/hubble/science/star-v1.html
  • [26] V.M. Slipher, ‘The radial velocity of the Andromeda Nebula’, Lowell Observatory Bulletin 2(8), 56 (1913); ‘Spectrographic observations of nebulae’, Popular Astronomy 23, 21 (1915); and ‘Nebulae’, P. Am. Philos. Soc. 56, 403 (1917).
    Available from https://ui.adsabs.harvard.edu/abs/1913LowOB...2...56S, https://ui.adsabs.harvard.edu/abs/1915PA.....23...21S, and https://ui.adsabs.harvard.edu/abs/1917PAPhS..56..403S, respectively.
  • [27] A. Friedmann, ‘Über die Krümmung des Raumes’ (in German), Z. Phys. 10, 377 (1922). DOI: 10.1007/BF01332580
  • [28] A. Friedmann, ‘Über die Möglichkeit einer Welt mit konstanter negativer Krümmung des Raumes’ (in German), Z. Phys. 21, 326 (1924). DOI: 10.1007/BF01328280
  • [29] E. Matsinos, ‘From Copernicus to Lambda-CDM’, Nova Science Publishers Inc. (2017). ISBN-10: 1-53-612034-0, ISBN-13: 978-1-5361-2034-9
  • [30] G. Lemaître, ‘The expanding universe’, Mon. Not. R. Astron. Soc. 91, 490 (1931); see https://ui.adsabs.harvard.edu/abs/1931MNRAS..91..490L
  • [31] G. Lemaître, ‘Un univers homogène de masse constante et de rayon croissant rendant compte de la vitesse radiale des nébuleuses extra-galactiques’ (in French), Annales de la Société Scientifique de Bruxelles A 47, 49 (1927); English translation: G. Lemaître, ‘Republication of: A homogeneous universe of constant mass and increasing radius accounting for the radial velocity of extra-galactic nebulae’, Gen. Relat. Gravit. 45 (2013) 1635. DOI: 10.1007/s10714-013-1548-3
  • [32] C. O’Raifeartaigh, ‘Eddington, Lemaitre and the hypothesis of cosmic expansion’, arXiv:1907.12297 [physics.hist-ph].
    DOI: 10.48550/arXiv.1907.12297
  • [33] E. Hubble, M.L. Humason, ‘The velocity-distance relation among extra-galactic nebulae’, Astrophys. J. 74, 43 (1931). DOI: 10.1086/143323
  • [34] https://www.atnf.csiro.au/outreach/education/senior/astrophysics/variable_cepheids.html
  • [35] A. Sandage, ‘Current problems in the extragalactic distance scale’, Astrophy. J. 127(3), 513 (1958). DOI: 10.1086/146483
  • [36] R. Schönrich, J. Binney, W. Dehnen, ‘Local kinematics and the Local Standard of Rest’, Mon. Not. R. Astron. Soc. 403, 1829 (2010). DOI: 10.1111/j.1365-2966.2010.16253.x
  • [37] M.J. Reid et al., ‘Trigonometric parallaxes of high mass star forming regions: The structure and kinematics of the Milky Way’, Astrophys. J. 783(130) (2014). DOI: 10.1088/0004-637X/783/2/130
  • [38] M.J. Reid, T.M. Dame, ‘On the rotation speed of the Milky Way determined from H I emission’, Astrophys. J. 832(159) (2016). DOI: 10.3847/0004-637X/832/2/159
  • [39] B.W. Carroll, D.A. Ostlie, ‘An Introduction to Modern Astrophysics’, 2nd edn., Pearson Addison-Wesley (2007). ISBN-10: 0-32-144284-9, ISBN-13: 978-0-8053-0402-2.
  • [40] R.L. Workman et al.(Particle Data Group), Prog. Theor. Exp. Phys. 2022, 083C01 (2022); see also 2023 update.
  • [41] F. Galton, ‘Finger Prints’, Macmillan (1892).
  • [42] J.A. Hanley, ‘“Transmuting” women into men: Galton’s family data on human stature’, Am. Stat. 58(3), 237 (2004). DOI: 10.1198/000313004X1558
  • [43] http://www.epi.mcgill.ca/hanley/galton
  • [44] F. Wilcoxon, ‘Individual comparisons by ranking methods’, Biometrics Bull. 1(6), 80 (1945). DOI: 10.2307/3001968
  • [45] J.E. Kolassa, ‘Edgeworth approximations for rank sum test statistics’, Stat. Probab. Lett. 24(2), 169 (1995). DOI: 10.1016/0167-7152(95)00164-H
  • [46] H.B. Mann, D.R. Whitney, ‘On a test of whether one of two random variables is stochastically larger than the other’, Ann. Math. Stat. 18(1), 50 (1947). DOI: 10.1214/aoms/1177730491
  • [47] E. Brunner, U. Munzel, ‘The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation’, Biometrical J. 42(1), 17 (2000). DOI: 10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U
  • [48] NCD Risk Factor Collaboration (NCD-RisC), ‘A century of trends in adult human height’, eLife 5:e13410 (2016). DOI: 10.7554/eLife.13410
  • [49] M. Roser, C. Appel, H. Ritchie, ‘Human height’, Our World in Data (2013); see https://ourworldindata.org/human-height
  • [50] https://math.mit.edu/ rmd/465/shapiro.pdf
  • [51] R.B. D’Agostino, A. Belanger, R.B. D’Agostino Jr., ’A suggestion for using power and informative tests of normality’, Am. Stat. 44(4), 316 (1990). DOI: 10.2307/2684359
  • [52] N. Mohd Razali, B. Yap, ‘Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests’, J. Stat. Model. Analytics 2(1), 21 (2011). ISBN-13: 978-9-6736-3157-5
  • [53] S.S. Shapiro, M.B. Wilk, ‘An analysis of variance test for normality (complete samples)’, Biometrika 52(3/4), 591 (1965). DOI: 10.2307/2333709
  • [54] J.P. Royston, ‘An extension of Shapiro and Wilk’s WW test for normality to large samples’, Appl. Statist. 31(2), 115 (1982). DOI: 10.2307/2347973; P. Royston, ‘Approximating the Shapiro-Wilk W-test for non-normality’, Stat. Comput. 2, 117 (1992). DOI: 10.1007/BF01891203; P. Royston, ‘A toolkit for testing for non-normality in complete and censored samples’, J. R. Stat. Soc., D: The Statistician 42, 37 (1993). DOI: 10.2307/2348109; J.P. Royston, ‘A Remark on Algorithm AS 181: The W-test for normality’, J. R. Stat. Soc., C: Appl. Stat. 44(4), 547 (1995). DOI: 10.2307/2986146
  • [55] T.W. Anderson, D.A. Darling, ‘Asymptotic theory of certain goodness of fit criteria based on stochastic processes’, Ann. Math. Stat. 23(2), 193 (1952). DOI: 10.1214/aoms/1177729437
  • [56] R.B. D’Agostino, ‘Tests for the Normal Distribution’, in ‘Goodness-of-Fit Techniques’, ed. R.B. D’Agostino, M.A. Stephens, New York: Marcel Dekker (1986). ISBN-10: 0-82-477487-6
  • [57] R. D’Agostino, E.S. Pearson, ‘Tests for departure from normality. Empirical results for the distributions of b2b_{2} and b1\sqrt{b_{1}}’, Biometrika 60, 613 (1973). DOI: 10.2307/2335012
  • [58] R.T. Birge, ‘The calculation of errors by the method of least squares’, Phys. Rev. 40(2), 207 (1932). DOI: 10.1103/PhysRev.40.207
  • [59] https://en.freedownloadmanager.org/Windows-PC/CaRMetal-FREE.html
  • [60] S.F. Gull, ‘Bayesian data analysis: straight-line fitting’, in ‘Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics’, Vol. 36, ed. J. Skilling, Springer (1989), Dordrecht. DOI: 10.1007/978-94-015-7860-8_55
  • [61] W.A. Fuller, ‘Measurement Error Models’, John Wiley & Sons, New York (1987). ISBN-10: 0-47-186187-1
  • [62] J.W. Gillard, ‘An historical overview of linear regression with errors in both variables’, Technical report, Cardiff University (2006); see https://mathsdemo.cf.ac.uk/maths/resources/Gillard_Tech_Report.pdf
  • [63] G.Y. Yi, J.S. Buzas, ‘Measurement Error Models - A brief account of past developments and modern advancements’, in ‘Handbook of Measurement Error Models’, ed. G.Y. Yi, A. Delaigle, P. Gustafson, CRC Press (2021). DOI: 10.1201/9781315101279-1
  • [64] J. Tellinghuisen, ‘Least-squares analysis of data with uncertainty in xx and yy: A Monte Carlo methods comparison’, Chemom. Intell. Lab. Syst. 103(2), 160 (2010). DOI: 10.1016/j.chemolab.2010.07.003
  • [65] J. Tellinghuisen, ‘Least squares methods for treating problems with uncertainty in xx and yy’, Anal. Chem. 92(16), 10863 (2020). DOI: 10.1021/acs.analchem.0c02178
  • [66] H.-Ch. Schröder et al., ‘Determination of the πN\pi N scattering lengths from pionic hydrogen’, Phys. Lett. B 469(1), 25 (1999); H.-Ch. Schröder et al., ‘The pion-nucleon scattering lengths from pionic hydrogen and deuterium’, Eur. Phys. J. C 21(3), 473 (2001). DOI: 10.1016/S0370-2693(99)01237-X, 10.1007/s100520100754
  • [67] M. Hennebach et al., ‘Hadronic shift in pionic hydrogen’, Eur. Phys. J. A 50(12), 190 (2014); see also Erratum: Eur. Phys. J. A 55(2), 24 (2019). DOI: 10.1140/epja/i2014-14190-x, 10.1140/epja/i2019-12710-x
  • [68] A. Hirtl et al., ‘Redetermination of the strong-interaction width in pionic hydrogen’, Eur. Phys. J. A 57(2), 70 (2021). DOI: 10.1140/epja/s10050-021-00387-x
  • [69] F. James, ‘MINUIT - Function Minimization and Error Analysis’, CERN Program Library Long Writeup D506.
  • [70] J.A. Nelder, R. Mead, ‘A simplex method for function minimization’, Comput. J. 7(4), 308 (1965). DOI: 10.1093/comjnl/7.4.308
  • [71] M. Abramowitz, I.A. Stegun, ‘Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables’, United States Department of Commerce, National Bureau of Standards (1972). ISBN-13: 978-0-3181-1730-0
  • [72] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, ‘Numerical Recipes: The Art of Scientific Computing’, 3rd edn., Cambridge University Press (2007). ISBN-13: 978-0-5218-8068-8

Appendix A Cases of linear regression relevant to this work

The structure of this appendix is as follows. In the next section, I shall give a concise description of the WLR with one independent variable. The ordinary (called ‘simple’ by some authors; in my opinion, that adjective should rather be used to distinguish linear regression with one independent variable from multiple linear regression) linear regression with one independent variable will be treated in Section A.2 as a special case of Section A.1. Section A.3 addresses the problem of the WLR when ‘meaningful’ (i.e., representing 1\sigma effects in the normal distribution for each of the observations) measurement uncertainties are supplied for one variable. To deal with meaningful measurement uncertainties in both variables (a relatively new subject, which has not been addressed in any of the books [3, 4, 5, 6, 7]), several methods have been developed: Bayesian [60], errors-in-variables (e.g., see Refs. [61, 62, 63]), as well as least-squares methods [64, 65], etc. Detailed in Section A.4 are two relevant options, including the method which is recommended in Ref. [65]. The simplest of multiple linear regressions, i.e., the one involving two independent variables, is discussed in Section A.5.

Not addressed in this appendix are established tests regarding the applicability of the aforementioned methods (e.g., whether or not linearity is established in a dataset, whether or not heteroscedasticity is detected in a dataset, whether or not the fitted residuals are normally distributed, and the like). The technical details of such tests are not difficult to obtain, e.g., see Refs. [3, 4, 5, 6, 7], and there is no reason for repetition in this study.

For the purposes of this appendix, the quantity \sigma^{2} (the unbiased average of the weighted squares comprising the minimisation function F(p_{k}), where p_{k}, k=1,2,\dots,N_{p}, represent the parameters of the fit) will stand for the ratio between the minimal value of the minimisation function F(\hat{p}_{i}) and the NDF in the fit:

\sigma^{2}=\frac{F(\hat{p}_{i})}{\rm NDF}\,\,\,. (10)

The quantity NDF is equal to the number of input datapoints N reduced by the number of parameters N_{p} of the fit: {\rm NDF}=N-N_{p}. (Some authors remove one additional DoF from N, believing that, by doing so, they will obtain an unbiased variance. However, this approach is simply wrong: the variance, obtained using the relation {\rm NDF}=N-N_{p}, is already unbiased!)
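As a numerical illustration, the quantity \sigma^{2} of Eq. (10), and hence the Birge factor (its square root), can be evaluated as follows (a minimal sketch; the function name is an illustrative choice, not from the paper). With the values quoted in Table 9 for the fit to the fraternal heights (F = 1236.75, N = 179, N_{p} = 3), it reproduces the tabulated Birge factor of 2.65.

```python
import numpy as np

def birge_factor(F_min, N, Np):
    """Birge factor: square root of the unbiased variance estimate
    sigma^2 = F(p_hat) / NDF of Eq. (10), with NDF = N - Np."""
    ndf = N - Np
    return np.sqrt(F_min / ndf)

# Values quoted in Table 9 for the fit to the fraternal heights:
bf_sons = birge_factor(1236.75, 179, 3)   # about 2.65
```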

A.1 The WLR with one independent variable

In a WLR, involving two quantities - namely x (independent variable) and y (dependent variable) - which are expected to be linearly related (y=p_{0}+p_{1}x), the minimisation function has the form

F(p_{0},p_{1})=\sum_{i=1}^{N}w_{i}\left(y_{i}-p_{0}-p_{1}x_{i}\right)^{2}\,\,\,, (11)

where a statistical weight w_{i}\in\mathbb{R}_{>0} (the case w_{i}=0 is equivalent to excluding the i-th datapoint from the optimisation; all formulae in this work relate to datapoints which are allowed to contribute to the minimisation function F(p_{0},p_{1})), independent of the parameters p_{0,1}, is assigned to each pair of the N observations (x_{i},y_{i}). For the purposes of this section, the input-data structure comprises the observations (x_{i},y_{i}) and the (p_{0,1}-independent) weights w_{i}. If the observations (x_{i},y_{i}) correspond to measurements of physical quantities, they are - more often than not - accompanied by meaningful measurement uncertainties: mostly by uncertainties associated with the dependent variable (\delta y_{i}), frequently by uncertainties in both variables, and rarely only by uncertainties in the independent variable (\delta x_{i}); each statistical weight w_{i} reflects the size of such measurement uncertainties.

The optimal (or fitted) values of the two parameters entering Eq. (11), namely of the intercept p0p_{0} and of the slope p1p_{1}, correspond to the global minimum of the function F(p0,p1)F(p_{0},p_{1}), where two conditions are fulfilled:

F(p0,p1)p0=F(p0,p1)p1=0.\frac{\partial F(p_{0},p_{1})}{\partial p_{0}}=\frac{\partial F(p_{0},p_{1})}{\partial p_{1}}=0\,\,\,. (12)

From now on, the values of the parameters p0p_{0} and p1p_{1}, which correspond to the global minimum of the function F(p0,p1)F(p_{0},p_{1}), will be denoted by p^0\hat{p}_{0} and p^1\hat{p}_{1}, respectively.

The application of these conditions leads to the emergence of the system of linear equations

p^0i=1Nwi\displaystyle\hat{p}_{0}\sum_{i=1}^{N}w_{i} +p^1i=1Nwixi\displaystyle+\hat{p}_{1}\sum_{i=1}^{N}w_{i}x_{i}\quad =i=1Nwiyi\displaystyle=\sum_{i=1}^{N}w_{i}y_{i}
p^0i=1Nwixi\displaystyle\hat{p}_{0}\sum_{i=1}^{N}w_{i}x_{i} +p^1i=1Nwixi2\displaystyle+\hat{p}_{1}\sum_{i=1}^{N}w_{i}x_{i}^{2}\quad =i=1Nwixiyi,\displaystyle=\sum_{i=1}^{N}w_{i}x_{i}y_{i}\,\,\,, (13)

or, after adopting a self-explanatory notation for the sums,

p^0Sw\displaystyle\hat{p}_{0}S_{w} +p^1Swx\displaystyle+\hat{p}_{1}S_{wx}\quad =Swy\displaystyle=S_{wy}
p^0Swx\displaystyle\hat{p}_{0}S_{wx} +p^1Swxx\displaystyle+\hat{p}_{1}S_{wxx}\quad =Swxy.\displaystyle=S_{wxy}\,\,\,. (14)

Equations (A.1) may be rewritten in matrix form as

(SwSwxSwxSwxx)(p^0p^1)=(SwySwxy)or𝐆𝐏^=𝐂,\left(\begin{array}[]{cc}S_{w}&S_{wx}\\ S_{wx}&S_{wxx}\\ \end{array}\right)\left(\begin{array}[]{c}\hat{p}_{0}\\ \hat{p}_{1}\\ \end{array}\right)=\left(\begin{array}[]{c}S_{wy}\\ S_{wxy}\\ \end{array}\right)\,\,\,\text{or}\,\,\,{\bf G}{\bf\hat{P}}={\bf C}\,\,\,, (15)

where the 2×22\times 2 matrix 𝐆{\bf G} may be identified with half of the Hessian matrix

{\bf H_{F}}\coloneqq\left(\begin{array}[]{cc}\frac{\partial^{2}F(p_{0},p_{1})}{\partial p_{0}^{2}}&\frac{\partial^{2}F(p_{0},p_{1})}{\partial p_{0}\partial p_{1}}\\ \frac{\partial^{2}F(p_{0},p_{1})}{\partial p_{1}\partial p_{0}}&\frac{\partial^{2}F(p_{0},p_{1})}{\partial p_{1}^{2}}\\ \end{array}\right)\,\,\,, (16)

while 𝐏^{\bf\hat{P}} and 𝐂{\bf C} represent the column vectors of the array p^0,1\hat{p}_{0,1} and of the two constants on the right-hand side (rhs) of Eqs. (A.1), respectively. The solution of the matrix Eq. (15) is

𝐏^=𝐆1𝐂,{\bf\hat{P}}={\bf G}^{-1}{\bf C}\,\,\,, (17)

where 𝐆1{\bf G}^{-1} is evidently the inverse of the matrix 𝐆{\bf G}. In detail, the solution reads as

(p^0p^1)=1𝒟(SwySwxxSwxSwxySwSwxySwxSwy),\left(\begin{array}[]{c}\hat{p}_{0}\\ \hat{p}_{1}\\ \end{array}\right)=\frac{1}{\mathscr{D}}\left(\begin{array}[]{c}S_{wy}S_{wxx}-S_{wx}S_{wxy}\\ S_{w}S_{wxy}-S_{wx}S_{wy}\\ \end{array}\right)\,\,\,, (18)

where 𝒟=det(𝐆)=SwSwxxSwx2\mathscr{D}={\rm det}({\bf G})=S_{w}S_{wxx}-S_{wx}^{2}. A unique solution 𝐏^{\bf\hat{P}} is obtained when 𝒟0\mathscr{D}\neq 0.
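As a minimal sketch (in Python, with illustrative data and weights invented for this example), the closed-form solution of Eq. (18) may be coded as follows:

```python
# Weighted linear regression y = p0 + p1*x via the closed-form solution of Eq. (18).
def wlr_fit(x, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = Sw * Swxx - Swx**2          # determinant of the matrix G
    if D == 0.0:
        raise ValueError("singular system: no unique solution")
    p0 = (Swy * Swxx - Swx * Swxy) / D
    p1 = (Sw * Swxy - Swx * Swy) / D
    return p0, p1

# Illustrative data lying exactly on y = 1 + 2x, with arbitrary positive weights.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 2.0, 1.0, 0.5]
p0, p1 = wlr_fit(x, y, w)   # recovers intercept 1 and slope 2 up to rounding
```

The explicit check on the determinant mirrors the condition 𝒟≠0 for the existence of a unique solution.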

The statistical uncertainties of the two parameters of the fit may be obtained from the Taylor expansion of the function F(p0,p1)F(p_{0},p_{1}) in the vicinity of the minimum F(p^0,p^1)=Swyyp^0Swyp^1SwxyF(\hat{p}_{0},\hat{p}_{1})=S_{wyy}-\hat{p}_{0}S_{wy}-\hat{p}_{1}S_{wxy}. Given that (on account of the conditions of Eqs. (12)) the first derivatives identically vanish at the minimum, one obtains after retaining the first two terms in the expansion:

F(p0,p1)F(p^0,p^1)+12i=01j=01(HF)ij(pip^i)(pjp^j),F(p_{0},p_{1})\approx F(\hat{p}_{0},\hat{p}_{1})+\frac{1}{2}\sum_{i=0}^{1}\sum_{j=0}^{1}(H_{F})_{ij}(p_{i}-\hat{p}_{i})(p_{j}-\hat{p}_{j})\,\,\,, (19)

which implies that

F(p0,p1)F(p^0,p^1)+i=01j=01Gij(pip^i)(pjp^j).F(p_{0},p_{1})\approx F(\hat{p}_{0},\hat{p}_{1})+\sum_{i=0}^{1}\sum_{j=0}^{1}G_{ij}(p_{i}-\hat{p}_{i})(p_{j}-\hat{p}_{j})\,\,\,. (20)

The covariance matrix of the fit takes the form

cov(p0,p1)=σ2𝒟(SwxxSwxSwxSw).{\rm cov}(p_{0},p_{1})=\frac{\sigma^{2}}{\mathscr{D}}\left(\begin{array}[]{cc}S_{wxx}&-S_{wx}\\ -S_{wx}&S_{w}\\ \end{array}\right)\,\,\,. (21)

The statistical uncertainties of the two parameters are obtained from the diagonal elements of the covariance matrix:

(δp^0)2=σ2Swxx𝒟(\delta\hat{p}_{0})^{2}=\sigma^{2}\frac{S_{wxx}}{\mathscr{D}} (22)

and

(δp^1)2=σ2Sw𝒟.(\delta\hat{p}_{1})^{2}=\sigma^{2}\frac{S_{w}}{\mathscr{D}}\,\,\,. (23)

I shall next address two subtle issues. First, many authors associate integer multiples (e.g., 1, 2, 3, and so on) of the uncertainties of Eqs. (22,23) with the confidence intervals corresponding to the same number of σ\sigma’s in the normal distribution; the same applies to any requested new predictions ypy_{p} (see next paragraph in this section). This approximation is poor when the results of the fit have been obtained from a small number of observations NN. Although the aforementioned quantities δp^0\delta\hat{p}_{0} and δp^1\delta\hat{p}_{1} do represent the standard errors of the estimators p^0\hat{p}_{0} and p^1\hat{p}_{1}, the extraction of the correct confidence intervals necessitates the application of the so-called t-multiplier 18The quantity α\alpha is the (user-defined) threshold which is associated with the outset of statistical significance. t1α/2,NDFt_{1-\alpha/2,{\rm NDF}}, i.e., a quantity which takes account of the size of the dataset which yielded the results of Eqs. (22,23), and which follows Student’s t-distribution for NDF DoFs (e.g., for N2N-2 DoFs, if two parameters are used in the fit), see Refs. [4] (pp. 78-80) and [5] (Chapter 3.6). Using the t-multiplier, the confidence intervals of the two parameters of the fit are expressed as

pk[p^kt1α/2,N2δp^k,p^k+t1α/2,N2δp^k].p_{k}\in\,[\,\hat{p}_{k}-t_{1-\alpha/2,N-2}\,\delta\hat{p}_{k}\,\,,\,\,\hat{p}_{k}+t_{1-\alpha/2,N-2}\,\delta\hat{p}_{k}\,]\,\,\,. (24)

The importance of this correction for small samples is demonstrated in Table 12.

Table 12:

Listed in this table are a few values of the t-multiplier t1α/2,N2t_{1-\alpha/2,N-2}, a quantity which must be used in the evaluation of the confidence intervals of the parameters p0p_{0} and p1p_{1}, as well as in that of the confidence intervals of any requested new predictions ypy_{p}. This quantity depends on the significance level α\alpha and on the number NN of the input datapoints (xi,yi)(x_{i},y_{i}). The corresponding values in number of σ\sigma’s in the normal distribution, which may be thought of as the limits of t1α/2,N2t_{1-\alpha/2,N-2} when NN\to\infty, are also cited. This table demonstrates that, if the multiples of the standard errors of the estimators p^0\hat{p}_{0} and p^1\hat{p}_{1} are interpreted as numbers of σ\sigma’s in the normal distribution, then narrower confidence intervals are obtained (e.g., compare the second and last columns of this table). The confidence intervals, corresponding to the 2σ2\sigma limits in the normal distribution, are close to those favoured by authors in several branches of Science (95 % confidence level); in Physics, 1σ1\sigma intervals are predominantly used. The values were obtained with the Excel functions NORMSDIST (CDF of the standard normal distribution) and TINV (inverse CDF of Student’s t-distribution).

N    # of σ’s   α          t_{1-α/2,N-2}
5    1          0.317311   1.196881
10   1          0.317311   1.066528
20   1          0.317311   1.028560
40   1          0.317311   1.013332
80   1          0.317311   1.006451
5    2          0.045500   3.306822
10   2          0.045500   2.366416
20   2          0.045500   2.148849
40   2          0.045500   2.067963
80   2          0.045500   2.032561

I shall next touch upon the subject of forecasting, which, to my surprise, is not addressed in any of the books [3, 4, 5, 6, 7] in case of the WLR. Let us assume that the objective is to extract (from the results of the optimisation) a ‘prediction’ for the value of the dependent variable (ypy_{p}) at one specific value of the independent variable (xpx_{p}). Following the equation yielding the variance of the prediction error on p. 86 of Ref. [4], as well as the covariance matrix of the fit of Eq. (21) of this work, one obtains the expectation value of the prediction ypy_{p} and its standard error δyp\delta y_{p} as follows:

yp=p^0+p^1xpy_{p}=\hat{p}_{0}+\hat{p}_{1}x_{p} (25)

and

(δyp)2=σ2(wp1+Sw1+(xpx¯)2SwxxSwx¯2),(\delta y_{p})^{2}=\sigma^{2}\left(w_{p}^{-1}+S_{w}^{-1}+\frac{(x_{p}-\bar{x})^{2}}{S_{wxx}-S_{w}\bar{x}^{2}}\right)\,\,\,, (26)

where the quantity x¯\bar{x} stands for the weighted mean of the xix_{i} values of the input datapoints and wpw_{p} is (or would be) the (expected) statistical weight of a datapoint at x=xpx=x_{p}. It must be mentioned that Eq. (26) yields the standard error of a new prediction. The standard error of the fit does not contain the first term within the brackets on the rhs of Eq. (26), see also Chapter 3.7 of Ref. [4] and Eqs. (3.6.4,3.6.5) in Chapter 3.6 of Ref. [5] (both assuming that wi=1w_{i}=1, i\forall i).
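The prediction of Eqs. (25,26) may be sketched as follows (a self-contained Python illustration; the data, weights, and the point xpx_{p} are invented for this example):

```python
import math

# WLR prediction y_p and its standard error (Eqs. (25,26)); sigma^2 = F/(N-2) as in Eq. (10).
def wlr_predict(x, y, w, xp, wp):
    N = len(x)
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = Sw * Swxx - Swx**2
    p0 = (Swy * Swxx - Swx * Swxy) / D
    p1 = (Sw * Swxy - Swx * Swy) / D
    F = sum(wi * (yi - p0 - p1 * xi)**2 for wi, xi, yi in zip(w, x, y))
    sigma2 = F / (N - 2)                       # unbiased variance, NDF = N - 2
    xbar = Swx / Sw                            # weighted mean of the x values
    yp = p0 + p1 * xp                          # Eq. (25)
    dyp2 = sigma2 * (1.0 / wp + 1.0 / Sw + (xp - xbar)**2 / (Swxx - Sw * xbar**2))
    return yp, math.sqrt(dyp2)

# Invented, nearly linear data (roughly y = 1 + 2x), unit weights.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.9, 3.2, 4.8, 7.1, 9.0]
w = [1.0] * 5
yp, dyp = wlr_predict(x, y, w, xp=5.0, wp=1.0)
```

Note that the standard error of a new prediction grows as xpx_{p} moves away from the weighted mean x¯\bar{x}, in line with the last term of Eq. (26).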

I shall finalise this section by giving the relevant expressions in case that the theoretical straight line contains no intercept p0p_{0}; datasets relevant to this part may include force-displacement measurements following Hooke’s law, distance-velocity data of galaxies following the Lemaître-Hubble law, etc. In such a case, the minimisation function reads as

F(p1)=i=1Nwi(yip1xi)2.F(p_{1})=\sum_{i=1}^{N}w_{i}\left(y_{i}-p_{1}x_{i}\right)^{2}\,\,\,. (27)

The minimal value is F(p^1)=SwyySwxy2/SwxxF(\hat{p}_{1})=S_{wyy}-S_{wxy}^{2}/S_{wxx} and the fitted value of the slope is p^1=Swxy/Swxx\hat{p}_{1}=S_{wxy}/S_{wxx}, whereas its variance reads as

(δp^1)2=σ2Swxx,(\delta\hat{p}_{1})^{2}=\frac{\sigma^{2}}{S_{wxx}}\,\,\,, (28)

where NDF=N1{\rm NDF}=N-1, as the linear model now contains only one parameter. Regarding the expectation value of a new prediction ypy_{p} at x=xpx=x_{p} and its standard error δyp\delta y_{p}, Eq. (25) applies without the first term on the rhs (i.e., p^0\hat{p}_{0}), whereas Eq. (26) has now the form:

(δyp)2=σ2(wp1+xp2Swxx).(\delta y_{p})^{2}=\sigma^{2}\left(w_{p}^{-1}+\frac{x_{p}^{2}}{S_{wxx}}\right)\,\,\,. (29)
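The no-intercept case of Eqs. (27)-(29) admits an even shorter closed form; a sketch follows (Python, with invented Hooke-like data and unit weights):

```python
import math

# No-intercept WLR, y = p1*x: fitted slope, minimal F, and slope variance (Eqs. (27),(28)).
def wlr_fit_no_intercept(x, y, w):
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    Swyy = sum(wi * yi * yi for wi, yi in zip(w, y))
    p1 = Swxy / Swxx                    # fitted slope
    F = Swyy - Swxy**2 / Swxx           # minimal value of the minimisation function
    sigma2 = F / (len(x) - 1)           # NDF = N - 1: one parameter in the model
    dp1 = math.sqrt(sigma2 / Swxx)      # Eq. (28)
    return p1, dp1

# Invented force-displacement-like data, close to y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
w = [1.0] * 4
p1, dp1 = wlr_fit_no_intercept(x, y, w)
```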

A.2 The linear regression with constant (or ‘without’) weights

If all statistical weights wiw_{i} are equal (to a constant w0w\neq 0) in Section A.1 - a case which is usually referred to in the standard literature in Statistics as (fulfilling) ‘homoscedasticity’ - or if no weights are supplied (or are relevant) in a problem, all expressions of Section A.1 apply after simply replacing all wiw_{i}’s by 1. It so happens that none of the important quantities of the fit (i.e., the optimal values of the parameters of the fit, the standard errors of these parameters, as well as any predictions and their standard errors) depends on the value of the constant ww. Only the minimisation function F(p0,p1)F(p_{0},p_{1}) is ww-dependent, but (as the important results of the optimisation - which also involve F(p^0,p^1)F(\hat{p}_{0},\hat{p}_{1}) - are ratios of quantities containing the same powers of ww in the numerators and denominators) the ww-dependence is eliminated from the useful output. This special case of linear regression, which is generally known as ‘ordinary linear regression’ (OLR for short) or ‘ordinary linear least-squares optimisation’, provides (in the eyes of many) better insight into the nature of the various quantities. As a result, most contributions in the standard literature start the description of the methods of linear regression by treating the OLR case (and some do not even venture any further!).

I shall next cover some of the main features of the OLR, starting from the fitted value of the slope. The determinant 𝒟\mathscr{D} is equal to NSxxSx2NS_{xx}-S_{x}^{2}, which turns out to be simply the variance of the input xix_{i} values multiplied by N2N^{2}: 𝒟=N2var(x)\mathscr{D}=N^{2}{\rm var}(x). From Eq. (18), one obtains for the fitted value of the slope:

p^1=Ni=1Nxiyi(i=1Nxi)(i=1Nyi)N2var(x)=cov(x,y)var(x).\hat{p}_{1}=\frac{N\sum_{i=1}^{N}x_{i}y_{i}-\left(\sum_{i=1}^{N}x_{i}\right)\left(\sum_{i=1}^{N}y_{i}\right)}{N^{2}{\rm var}(x)}=\frac{{\rm cov}(x,y)}{{\rm var}(x)}\,\,\,. (30)

Pearson’s correlation coefficient ρx,y[1,+1]\rho_{x,y}\in[-1,+1] between two samples (of equal dimension) corresponding to two quantities xx and yy, defined by the formula

ρx,ycov(x,y)var(x)var(y),\rho_{x,y}\coloneqq\frac{{\rm cov}(x,y)}{\sqrt{{\rm var}(x)\,{\rm var}(y)}}\,\,\,, (31)

is a measure of the linear correlation between these quantities. Using the first of Eqs. (A.1) with all statistical weights wiw_{i} set to 11, one obtains

Np^0+p^1i=1Nxi=i=1Nyip^0+p^1x¯=y¯.N\hat{p}_{0}+\hat{p}_{1}\sum_{i=1}^{N}x_{i}=\sum_{i=1}^{N}y_{i}\Rightarrow\hat{p}_{0}+\hat{p}_{1}\bar{x}=\bar{y}\,\,\,. (32)

Using this equation in order to replace p^0\hat{p}_{0}, one obtains for the minimal value of the minimisation function:

F(p^0,p^1)\displaystyle F(\hat{p}_{0},\hat{p}_{1}) =i=1N(yip^0p^1xi)2\displaystyle=\sum_{i=1}^{N}\left(y_{i}-\hat{p}_{0}-\hat{p}_{1}x_{i}\right)^{2}
=i=1N(yiy¯p^1(xix¯))2\displaystyle=\sum_{i=1}^{N}\left(y_{i}-\bar{y}-\hat{p}_{1}\left(x_{i}-\bar{x}\right)\right)^{2}
=i=1N((yiy¯)2+p^12(xix¯)22p^1(xix¯)(yiy¯))\displaystyle=\sum_{i=1}^{N}\left(\left(y_{i}-\bar{y}\right)^{2}+\hat{p}_{1}^{2}\left(x_{i}-\bar{x}\right)^{2}-2\hat{p}_{1}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)\right)
=Nvar(y)+Np^12var(x)2Np^1cov(x,y).\displaystyle=N{\rm var}(y)+N\hat{p}_{1}^{2}{\rm var}(x)-2N\hat{p}_{1}{\rm cov}(x,y)\,\,\,. (33)

Making use of Eqs. (30,31), one finally obtains

F(p^0,p^1)/N=var(y)(1ρx,y2).\frac{F(\hat{p}_{0},\hat{p}_{1})}{N}={\rm var}(y)\left(1-\rho_{x,y}^{2}\right)\,\,\,. (34)

(The left-hand side is the residual variance, i.e., Eq. (33) divided by NN; it differs from the unbiased σ2\sigma^{2} of Eq. (10) by the factor N/(N2)N/(N-2).)

The quantity on the left-hand side of this equation is routinely interpreted as the part of the variance of the dependent variable which remains ‘unexplained’ (after the optimisation), whereas the fraction of the variance which is associated with the ρx,y2\rho_{x,y}^{2} term is considered ‘explained’, in the sense that this fraction of the total variance of the dependent variable is understood as essentially originating from the variation of the independent variable. Of course, all these expressions are analogous to those obtained in the WLR after replacing the numerical measures ‘mean’, ‘variance’, and ‘correlation’ with their weighted counterparts.
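The decomposition of the variance into ‘explained’ and ‘unexplained’ parts may be verified numerically; the following Python snippet (with invented data) checks Eqs. (30), (31), and the residual-variance identity:

```python
import math

# Numerical check of the OLR relations: slope = cov/var, and residual variance F/N
# equal to var(y)*(1 - rho^2). The data below are invented, roughly y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.2, 2.9, 5.1, 6.8, 9.1]
N = len(x)

xbar = sum(x) / N
ybar = sum(y) / N
var_x = sum((xi - xbar)**2 for xi in x) / N
var_y = sum((yi - ybar)**2 for yi in y) / N
cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / N
rho = cov_xy / math.sqrt(var_x * var_y)        # Pearson's correlation coefficient, Eq. (31)

p1 = cov_xy / var_x                            # fitted slope, Eq. (30)
p0 = ybar - p1 * xbar                          # intercept from Eq. (32)
F = sum((yi - p0 - p1 * xi)**2 for xi, yi in zip(x, y))
lhs = F / N                                    # 'unexplained' part of the variance of y
rhs = var_y * (1.0 - rho**2)                   # identical, up to rounding
```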

A.3 The WLR with meaningful measurement uncertainties only in one variable

At least so far as physical measurements are concerned, this is the most frequent, and hence the most useful, case. In Physics, all measurements of physical quantities are expected to be accompanied by meaningful uncertainties: measurements of physical quantities without uncertainties might provide a general impression about the order of magnitude of effects, but are hardly of any use in statistical analyses. As a result, particular attention is paid to extracting such uncertainties for the physical quantities which are under experimental exploration, i.e., uncertainties which are indicative of the likelihood of the departure of the measured from the true values. Unlike most other branches of Science, Physics favours 1σ1\sigma uncertainties in the normal distribution, representing a confidence level of about 68.27 %.

In short, the pairs of NN observations (xi,yi)(x_{i},y_{i}) are replaced in most Physics studies by triplets of NN observations (xi,yi,δyi)(x_{i},y_{i},\delta y_{i}), and the optimisation problem reduces to a WLR with statistical weights wi(δyi)2w_{i}\coloneqq(\delta y_{i})^{-2}. Had it not been for one additional consideration, the treatment of this subcategory would have been unremarkable. So, what makes this case special?

The fact that the measurement uncertainties δyi\delta y_{i} represent 1σ1\sigma effects in the normal distribution implies that each standardised residual

yiy~iδyi\frac{y_{i}-\tilde{y}_{i}}{\delta y_{i}} (35)

is expected to follow the standard normal distribution; in the expression above, the quantity y~i\tilde{y}_{i} represents the fitted value at x=xix=x_{i}, obtained via the modelling as the case might be, in particular, via the linear model y=p0+p1xy=p_{0}+p_{1}x in this work. As each of the summands in Eq. (11) is the square of a standard-normal quantity, F(p0,p1)F(p_{0},p_{1}) is expected to follow the χ2\chi^{2} distribution with NN DoFs. Given that two parameters are used in the general problem of linear regression, the minimal value F(p^0,p^1)F(\hat{p}_{0},\hat{p}_{1}) is, in the context of this section, χ2\chi^{2}-distributed with N2N-2 DoFs.

One of the obvious advantages of the use of meaningful uncertainties is that the quality of the fit (or, equivalently, the adequacy of the linear model to account for the input dataset) may be assessed on the basis of the p-value 19The p-value represents the upper tail of the CDF of the relevant distribution; of the χ2\chi^{2} distribution in this case. corresponding to the F(p^0,p^1)F(\hat{p}_{0},\hat{p}_{1}) result for the given DoFs.
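For an even number of DoFs, the upper tail of the χ2\chi^{2} CDF has a simple closed form, which makes the evaluation of the p-value a few-line exercise; a sketch (pure Python, even NDF only - for general NDF one would resort to a statistics library) follows:

```python
import math

# Upper-tail probability (p-value) of the chi-square distribution for an EVEN number
# of DoFs k: P(X > x) = exp(-x/2) * sum_{i=0}^{k/2-1} (x/2)^i / i!.
def chi2_pvalue_even(x, k):
    if k % 2 != 0 or k <= 0:
        raise ValueError("this closed form requires an even, positive number of DoFs")
    term, total = 1.0, 1.0
    for i in range(1, k // 2):
        term *= (x / 2.0) / i          # next term of the truncated exponential series
        total += term
    return math.exp(-x / 2.0) * total

# For k = 2 DoFs, the p-value reduces to exp(-x/2).
```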

When dealing with the subject of this section, there is one notable difference to the formalism developed earlier in order to treat the general-case WLR: it relates to the standard errors of the parameters of the fit, δp^0\delta\hat{p}_{0} and δp^1\delta\hat{p}_{1}. Equations (22) and (23) may be rewritten as

δp^0=BFSwxx𝒟\delta\hat{p}_{0}={\rm BF}\sqrt{\frac{S_{wxx}}{\mathscr{D}}} (36)

and

δp^1=BFSw𝒟,\delta\hat{p}_{1}={\rm BF}\sqrt{\frac{S_{w}}{\mathscr{D}}}\,\,\,, (37)

where

BF=σ2.{\rm BF}=\sqrt{\sigma^{2}}\,\,\,. (38)

The importance of the quantity BF of Eq. (38) in the problem of linear regression was (to the best of my knowledge) first addressed by Birge in 1932 [58]. Although BF should (in my opinion) be called ‘Birge factor’, the Particle Data Group (PDG) use instead the plain term ‘scale factor’ [40] in their compilations of physical constants, and recommend its application (in the context of this section) only when BF>1{\rm BF}>1; if BF<1{\rm BF}<1, the factor is omitted from Eqs. (36,37), the intention evidently being to prevent the decrease of the statistical uncertainties δp^0\delta\hat{p}_{0} and δp^1\delta\hat{p}_{1} when the input uncertainties δyi\delta y_{i} have been overly generous (which may be considered a ‘rare event’ in Physics, yet this is another story). Being a particle physicist, I abide by (and have nothing against) the PDG recommendation, see also Appendix C.

Last but not least, if only meaningful uncertainties δxi\delta x_{i} are supplied in a problem, one may swap the roles of the quantities xx and yy, and perform a WLR using the linear model: x=q0+q1yx=q_{0}+q_{1}y. The optimal values of the two parameters of the originally intended straight line y=p0+p1xy=p_{0}+p_{1}x could then be retrieved from the quantities q^0,1\hat{q}_{0,1} as follows: (p^0,p^1)=(q^0q^11,q^11)(\hat{p}_{0},\hat{p}_{1})=(-\hat{q}_{0}\hat{q}_{1}^{-1},\hat{q}_{1}^{-1}).

A.4 The WLR with meaningful measurement uncertainties in both variables

My first effort towards obtaining a solution, when I encountered such a problem in the mid-1980s, rested upon the use of the WLR method with statistical weights inversely proportional to the square of the product of the two uncertainties for each input datapoint. In that implementation, the statistical weight wiw_{i} was defined by the formula:

wi1(δxiδyi)2i=1N1(δyi)2i=1N1(δxiδyi)2.w_{i}\coloneqq\frac{1}{(\delta x_{i}\delta y_{i})^{2}}\frac{\sum_{i=1}^{N}\frac{1}{(\delta y_{i})^{2}}}{\sum_{i=1}^{N}\frac{1}{(\delta x_{i}\delta y_{i})^{2}}}\,\,\,. (39)

Given that the statistical weights of this equation fulfil i=1Nwii=1N(δyi)2\sum_{i=1}^{N}w_{i}\equiv\sum_{i=1}^{N}(\delta y_{i})^{-2}, one arrives at a situation reminiscent of Appendix A.3 with redefined statistical weights, and could trick oneself into considering the minimal value F(p^0,p^1)F(\hat{p}_{0},\hat{p}_{1}) χ2\chi^{2}-distributed with N2N-2 DoFs. Although the statistical weights wiw_{i} sum up (by construction) to the same constant in the two cases, those given in Eq. (39) also take account of the uncertainties δxi\delta x_{i}. Of course, one disadvantage of using the weights of Eq. (39) is that an input datapoint needs to be excluded if either of the two input uncertainties vanishes, whereas such an exclusion is necessary only if both input uncertainties vanish in case of the following two methods of this section.

Several methodologies were developed during the last decades for dealing with this problem in a rigorous manner 20In particular, errors-in-variables models emerged with the principal scope of providing reliable solutions to this particular situation [61, 62, 63].. A recent article [65] lists and compares the most important of these methods. Following the results of his analysis, the author recommends the use of the so-called ‘Modified effective-variance method’ (EV2), in which parameter-dependent statistical weights are assigned to the input datapoints: in the general case, each such weight is given by

wi(pk)((δyi)2+(dydx|x=xi)2(δxi)2)1,w_{i}(p_{k})\coloneqq\left((\delta y_{i})^{2}+\left(\left.\frac{dy}{dx}\right|_{x=x_{i}}\right)^{2}(\delta x_{i})^{2}\right)^{-1}\,\,\,, (40)

which, in case of linear regression, takes the form

wi(p0,p1)wi(p1)((δyi)2+p12(δxi)2)1.w_{i}(p_{0},p_{1})\to w_{i}(p_{1})\coloneqq\left((\delta y_{i})^{2}+p_{1}^{2}(\delta x_{i})^{2}\right)^{-1}\,\,\,. (41)

Owing to its obvious dependence on one of the parameters of the WLR, the application of a statistical weight of this form introduces non-linearity into a linear problem and calls for numerical minimisation, see Appendix B. I shall next elaborate on the emergence of the statistical weight of Eq. (41), by allowing myself to be guided by physical intuition.

Referring to Fig. 11, one may argue that the contribution of each input datapoint to the minimisation function F(p0,p1)F(p_{0},p_{1}) must involve two quantities:

  • the minimal distance did_{i} of that datapoint to the ‘current’ (i.e., relating to a specific iteration of the numerical minimisation) straight line and

  • the input uncertainties δxi\delta x_{i} and δyi\delta y_{i}.

One may argue that the aforementioned contribution depends on the size of the quantity did_{i} relative to a representative combined size of the two uncertainties δxi\delta x_{i} and δyi\delta y_{i}. To assess this, one first obtains the coordinates of point QQ, representing the intersection of the ‘current’ straight line and the straight line orthogonal to it which passes through point PP:

(xi,yi)=(p1(yip0)+xi1+p12,p12yi+p1xi+p01+p12)(x_{i\perp},y_{i\perp})=\left(\frac{p_{1}\left(y_{i}-p_{0}\right)+x_{i}}{1+p_{1}^{2}},\frac{p_{1}^{2}y_{i}+p_{1}x_{i}+p_{0}}{1+p_{1}^{2}}\right) (42)

and obtains the distance did_{i} as follows:

di=|yip0p1xi|1+p12.d_{i}=\frac{\left|y_{i}-p_{0}-p_{1}x_{i}\right|}{\sqrt{1+p_{1}^{2}}}\,\,\,. (43)

The two uncertainties δxi\delta x_{i} and δyi\delta y_{i} may be interpreted as representing different effects: δyi\delta y_{i} is directly associated with the statistical effects regarding the specific observation, whereas δxi\delta x_{i} may be taken to represent systematic effects, i.e., effects which are induced on yy as the result of the variation of the independent variable. Projected on the orthogonal direction, the systematic component of the uncertainty is equal to

δxi|cos(π2θ)|=δxi|p1|1+p12,\delta x_{i}\left|\cos\left(\frac{\pi}{2}-\theta\right)\right|=\delta x_{i}\frac{\left|p_{1}\right|}{\sqrt{1+p_{1}^{2}}}\,\,\,, (44)

whereas the statistical component of the uncertainty yields

δyicosθ=δyi11+p12;\delta y_{i}\cos\theta=\delta y_{i}\frac{1}{\sqrt{1+p_{1}^{2}}}\,\,\,; (45)

of course, p1=tanθp_{1}=\tan\theta. As these two uncertainties are independent, the strict application of Gauss’ law of error propagation suggests the use of the combined uncertainty in the form

(δyi)2+p12(δxi)21+p12.\sqrt{\frac{(\delta y_{i})^{2}+p_{1}^{2}(\delta x_{i})^{2}}{1+p_{1}^{2}}}\,\,\,. (46)

Therefore, the application of Gauss’ law suggests the summation of the two uncertainties of Eqs. (44,45) in quadrature, leading to the minimisation function

F(p0,p1)=i=1N(yip0p1xi)2(δyi)2+p12(δxi)2i=1Nwi(p1)(yip0p1xi)2,F(p_{0},p_{1})=\sum_{i=1}^{N}\frac{(y_{i}-p_{0}-p_{1}x_{i})^{2}}{(\delta y_{i})^{2}+p_{1}^{2}(\delta x_{i})^{2}}\equiv\sum_{i=1}^{N}w_{i}(p_{1})(y_{i}-p_{0}-p_{1}x_{i})^{2}\,\,\,, (47)

with wi(p1)w_{i}(p_{1}) given by Eq. (41).
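One simple way of handling the parameter-dependent weights of Eq. (41) numerically - a sketch only, not the MINUIT-based minimisation used in this work - is iterative reweighting: fix the weights using the current value of p1p_{1}, solve the resulting WLR exactly via Eq. (18), and repeat until p1p_{1} stabilises. (Strictly speaking, the fixed point of this scheme approximates the minimum of Eq. (47), since the p1p_{1}-dependence of the weights also contributes to the exact gradient.) The data and uncertainties below are invented:

```python
# Iteratively-reweighted solution of Eq. (47), with the EV2 weights of Eq. (41).
def ev2_fit(x, y, dx, dy, iterations=50):
    p0, p1 = 0.0, 0.0                  # initial guess: a flat line
    for _ in range(iterations):
        # Weights frozen at the current slope, then an exact WLR solve (Eq. (18)).
        w = [1.0 / (dyi**2 + p1**2 * dxi**2) for dxi, dyi in zip(dx, dy)]
        Sw   = sum(w)
        Swx  = sum(wi * xi for wi, xi in zip(w, x))
        Swy  = sum(wi * yi for wi, yi in zip(w, y))
        Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        D = Sw * Swxx - Swx**2
        p0 = (Swy * Swxx - Swx * Swxy) / D
        p1 = (Sw * Swxy - Swx * Swy) / D
    return p0, p1

# Invented data, roughly y = 1 + 2x, with uncertainties in both variables.
x  = [1.0, 2.0, 3.0, 4.0, 5.0]
y  = [3.1, 4.9, 7.2, 8.8, 11.1]
dx = [0.1, 0.2, 0.1, 0.2, 0.1]
dy = [0.2, 0.1, 0.2, 0.1, 0.2]
p0, p1 = ev2_fit(x, y, dx, dy)
```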

Figure 11: The figure facilitates the development of a solution to the problem of linear regression when meaningful measurement uncertainties in both xx and yy are supplied. The minimisation function uses the distance did_{i} of an input datapoint (xi,yi)(x_{i},y_{i}) to the ‘current’ (i.e., corresponding to a specific iteration of the numerical minimisation) straight line. The projected components of the two uncertainties on the orthogonal (to that straight line) direction may be quadratically (EV2 method [65]) or linearly summed (see text).

To cope with underestimated uncertainties (which, unfortunately, cannot be considered a sporadic phenomenon in Physics), many physicists favour the linear summation of the uncertainties corresponding to statistical and systematic effects. For the sake of example (taken from Particle Physics), two experimental groups measured (over two decades ago) the strong-interaction shift ϵ1s\epsilon_{1s} [66, 67] and the total decay width Γ1s\Gamma_{1s} [66, 68] of the ground state in pionic hydrogen, utilising the same low-energy pion beamline (πE5\pi E5) at the Paul Scherrer Institute (PSI), the same degrading device (cyclotron trap), similar crystal-spectrometer systems, and measurements corresponding (predominantly) to the same X-ray transition (3p1s3p\to 1s). In spite of the obvious similarities, the earlier collaboration (the ETHZ-Neuchâtel-PSI Collaboration) recommended the linear summation of their statistical and systematic uncertainties, whereas the later one (the Pionic Hydrogen Collaboration) favoured the quadratic summation of their corresponding uncertainties. In essence, there is no conflict between these two choices: some researchers choose to ‘err on the side of caution’ (use of the L1L_{1} norm when combining statistical and systematic effects), whereas others assume a somewhat bolder stance (use of the L2L_{2} norm).

In the context of this section, the linear summation of the uncertainties corresponding to statistical and systematic effects yields:

wi(p1)=(δyi+|p1|δxi)2.w_{i}(p_{1})=\left(\delta y_{i}+\left|p_{1}\right|\delta x_{i}\right)^{-2}\,\,\,. (48)

Last but not least, if the uncertainties δxi\delta x_{i} are (on average) considerably larger than δyi\delta y_{i}, it might make sense to swap the role of the two variables, see also ending paragraph of Appendix A.3. (Regarding Table 2, the MINUIT-based application yielded identical results with the two options.)

A.5 The WLR with two independent variables

Numerous articles in the literature have treated the problem of multiple linear regression. In this section, I shall detail the solution to the problem involving two independent variables xx and zz, which is the simplest case in multiple linear regression. Assuming the linear relation y=p0+p1xx+p1zzy=p_{0}+p_{1x}x+p_{1z}z, the minimisation function reads as

F(p0,p1x,p1z)=i=1Nwi(yip0p1xxip1zzi)2.F(p_{0},p_{1x},p_{1z})=\sum_{i=1}^{N}w_{i}\left(y_{i}-p_{0}-p_{1x}x_{i}-p_{1z}z_{i}\right)^{2}\,\,\,. (49)

The observation set now comprises NN quadruplets (xi,zi,yi,wi)(x_{i},z_{i},y_{i},w_{i}), reducing to NN triplets (xi,zi,yi)(x_{i},z_{i},y_{i}) for constant statistical weights wi=w0w_{i}=w\neq 0, in which case the WLR with two independent variables becomes ordinary. Using the notation of Appendix A.1, one obtains

𝐏^=𝐆1𝐂,{\bf\hat{P}}={\bf G}^{-1}{\bf C}\,\,\,, (50)

where 𝐆1{\bf G}^{-1} is evidently the inverse of the symmetrical matrix

𝐆=(SwSwxSwzSwxSwxxSwxzSwzSwxzSwzz),{\bf G}=\left(\begin{array}[]{ccc}S_{w}&S_{wx}&S_{wz}\\ S_{wx}&S_{wxx}&S_{wxz}\\ S_{wz}&S_{wxz}&S_{wzz}\\ \end{array}\right)\,\,\,, (51)

and the column arrays 𝐏^{\bf\hat{P}} and 𝐂{\bf C} are given by

𝐏^=(p^0p^1xp^1z){\bf\hat{P}}=\left(\begin{array}[]{c}\hat{p}_{0}\\ \hat{p}_{1x}\\ \hat{p}_{1z}\\ \end{array}\right) (52)

and

𝐂=(SwySwxySwzy).{\bf C}=\left(\begin{array}[]{c}S_{wy}\\ S_{wxy}\\ S_{wzy}\\ \end{array}\right)\,\,\,. (53)

After a few algebraic operations, one may obtain the inverse of the matrix 𝐆{\bf G} as

𝐆1=1𝒟(SwxxSwzzSwxz2SwzSwxzSwxSwzzSwxSwxzSwzSwxxSwzSwxzSwxSwzzSwSwzzSwz2SwxSwzSwSwxzSwxSwxzSwzSwxxSwxSwzSwSwxzSwSwxxSwx2),{\bf G}^{-1}=\frac{1}{\mathscr{D}}\left(\begin{array}[]{ccc}S_{wxx}S_{wzz}-S_{wxz}^{2}&\,S_{wz}S_{wxz}-S_{wx}S_{wzz}&\,S_{wx}S_{wxz}-S_{wz}S_{wxx}\\ S_{wz}S_{wxz}-S_{wx}S_{wzz}&\,S_{w}S_{wzz}-S_{wz}^{2}&\,S_{wx}S_{wz}-S_{w}S_{wxz}\\ S_{wx}S_{wxz}-S_{wz}S_{wxx}&\,S_{wx}S_{wz}-S_{w}S_{wxz}&\,S_{w}S_{wxx}-S_{wx}^{2}\\ \end{array}\right)\,\,\,, (54)

where 𝒟=det(𝐆)=SwSwxxSwzz+2SwxSwzSwxzSwSwxz2Swx2SwzzSwz2Swxx\mathscr{D}={\rm det}({\bf G})=S_{w}S_{wxx}S_{wzz}+2S_{wx}S_{wz}S_{wxz}-S_{w}S_{wxz}^{2}-S_{wx}^{2}S_{wzz}-S_{wz}^{2}S_{wxx}.

The diagonal elements of the matrix 𝐆1{\bf G}^{-1} yield the standard errors of the parameters of the fit as follows:

(δp^0)2\displaystyle(\delta\hat{p}_{0})^{2} =σ2SwxxSwzzSwxz2𝒟\displaystyle=\sigma^{2}\frac{S_{wxx}S_{wzz}-S_{wxz}^{2}}{\mathscr{D}}
(δp^1x)2\displaystyle(\delta\hat{p}_{1x})^{2} =σ2SwSwzzSwz2𝒟\displaystyle=\sigma^{2}\frac{S_{w}S_{wzz}-S_{wz}^{2}}{\mathscr{D}}
(δp^1z)2\displaystyle(\delta\hat{p}_{1z})^{2} =σ2SwSwxxSwx2𝒟,\displaystyle=\sigma^{2}\frac{S_{w}S_{wxx}-S_{wx}^{2}}{\mathscr{D}}\,\,\,, (55)

where σ2\sigma^{2} has been obtained after using NDF=N3{\rm NDF}=N-3, as the general linear model now contains three parameters.
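The 3×33\times 3 system of Eq. (50) may be solved, equivalently to the explicit inverse of Eq. (54), with Cramer’s rule; a sketch follows (Python, with invented data lying exactly on y=1+2x+3zy=1+2x+3z and arbitrary weights):

```python
# Two-independent-variable WLR (Eq. (50)), solved with Cramer's rule.
def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def wlr_two_vars(x, z, y, w):
    Sw   = sum(w)
    Swx  = sum(wi * xi for wi, xi in zip(w, x))
    Swz  = sum(wi * zi for wi, zi in zip(w, z))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swzz = sum(wi * zi * zi for wi, zi in zip(w, z))
    Swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    Swy  = sum(wi * yi for wi, yi in zip(w, y))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    Swzy = sum(wi * zi * yi for wi, zi, yi in zip(w, z, y))
    G = [[Sw, Swx, Swz], [Swx, Swxx, Swxz], [Swz, Swxz, Swzz]]   # Eq. (51)
    C = [Swy, Swxy, Swzy]                                        # Eq. (53)
    D = det3(G)
    params = []
    for col in range(3):                 # replace column 'col' of G by C
        Gc = [row[:] for row in G]
        for r in range(3):
            Gc[r][col] = C[r]
        params.append(det3(Gc) / D)
    return params                        # [p0, p1x, p1z]

x = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
z = [0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
y = [1.0 + 2.0 * xi + 3.0 * zi for xi, zi in zip(x, z)]
w = [1.0, 2.0, 1.0, 1.0, 0.5, 1.0]
p0, p1x, p1z = wlr_two_vars(x, z, y, w)  # recovers (1, 2, 3) up to rounding
```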

Following again the procedure set forth in Chapter 3.7 of Ref. [4], one obtains (after some effort) the expectation value of the prediction ypy_{p} and its standard error δyp\delta y_{p} corresponding to x=xpx=x_{p} and z=zpz=z_{p} as follows:

yp=p^0+p^1xxp+p^1zzpy_{p}=\hat{p}_{0}+\hat{p}_{1x}x_{p}+\hat{p}_{1z}z_{p} (56)

and

(δyp)2=σ2(wp1+Sw1+σz2(xpx¯)2+σx2(zpz¯)22ρx,zσxσz(xpx¯)(zpz¯)Sw(1ρx,z2)σx2σz2),(\delta y_{p})^{2}=\sigma^{2}\left(w_{p}^{-1}+S_{w}^{-1}+\frac{\sigma_{z}^{2}\left(x_{p}-\bar{x}\right)^{2}+\sigma_{x}^{2}\left(z_{p}-\bar{z}\right)^{2}-2\rho_{x,z}\sigma_{x}\sigma_{z}\left(x_{p}-\bar{x}\right)\left(z_{p}-\bar{z}\right)}{S_{w}\left(1-\rho_{x,z}^{2}\right)\sigma_{x}^{2}\sigma_{z}^{2}}\right)\,\,\,, (57)

where x¯=Swx/Sw\bar{x}=S_{wx}/S_{w}, z¯=Swz/Sw\bar{z}=S_{wz}/S_{w}, σx2=Swxx/Swx¯2\sigma_{x}^{2}=S_{wxx}/S_{w}-\bar{x}^{2}, and σz2=Swzz/Swz¯2\sigma_{z}^{2}=S_{wzz}/S_{w}-\bar{z}^{2}; the statistical weight wpw_{p} has been explained in Section A.1. Finally, the weighted version of Pearson’s correlation coefficient ρx,z\rho_{x,z} reads as

ρx,z=SwxzSwx¯z¯Swσxσz.\rho_{x,z}=\frac{S_{wxz}-S_{w}\bar{x}\bar{z}}{S_{w}\sigma_{x}\sigma_{z}}\,\,\,. (58)

For the sake of completeness, I shall finalise this section by giving the relevant expressions in case that the theoretical straight line is expected to contain no intercept p0p_{0}. In such a case, the minimisation function reads as

F(p1x,p1z)=i=1Nwi(yip1xxip1zzi)2.F(p_{1x},p_{1z})=\sum_{i=1}^{N}w_{i}\left(y_{i}-p_{1x}x_{i}-p_{1z}z_{i}\right)^{2}\,\,\,. (59)

The minimal value is F(p^1x,p^1z)=Swyyp^1xSwxyp^1zSwzyF(\hat{p}_{1x},\hat{p}_{1z})=S_{wyy}-\hat{p}_{1x}S_{wxy}-\hat{p}_{1z}S_{wzy}, whereas the fitted values of the parameters p1xp_{1x} and p1zp_{1z} are as follows:

𝐏^=(p^1xp^1z)=1𝒟(SwzzSwxySwxzSwzySwxxSwzySwxzSwxy),{\bf\hat{P}}=\left(\begin{array}[]{c}\hat{p}_{1x}\\ \hat{p}_{1z}\\ \end{array}\right)=\frac{1}{\mathscr{D}}\left(\begin{array}[]{c}S_{wzz}S_{wxy}-S_{wxz}S_{wzy}\\ S_{wxx}S_{wzy}-S_{wxz}S_{wxy}\\ \end{array}\right)\,\,\,, (60)

where the determinant 𝒟=SwxxSwzzSwxz2\mathscr{D}=S_{wxx}S_{wzz}-S_{wxz}^{2}. Finally, the standard errors of the parameters of the fit are obtained from the equations:

(δp^1x)2=σ2Swzz𝒟(\delta\hat{p}_{1x})^{2}=\sigma^{2}\frac{S_{wzz}}{\mathscr{D}} (61)

and

(δp^1z)2=σ2Swxx𝒟,(\delta\hat{p}_{1z})^{2}=\sigma^{2}\frac{S_{wxx}}{\mathscr{D}}\,\,\,, (62)

where ${\rm NDF}=N-2$, as the linear model now contains two parameters. Regarding the expectation value of a new prediction $y_{p}$ at $x=x_{p}$ and $z=z_{p}$, as well as its standard error $\delta y_{p}$, Eq. (56) applies without the first term on the rhs (i.e., without $\hat{p}_{0}$), whereas Eq. (57) now reads:

(\delta y_{p})^{2}=\sigma^{2}\left(w_{p}^{-1}+\frac{x_{p}^{2}S_{wzz}+z_{p}^{2}S_{wxx}-2x_{p}z_{p}S_{wxz}}{\mathscr{D}}\right)\,\,\,. (63)
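The no-intercept fit of Eqs. (59)-(62) admits the following minimal sketch (the function name and variable names are illustrative); the weighted sums are accumulated directly, the $2\times 2$ system of Eq. (60) is solved via the determinant $\mathscr{D}$, and $\sigma^{2}=F(\hat{p}_{1x},\hat{p}_{1z})/{\rm NDF}$ enters the standard errors.

```python
import math

def fit_no_intercept(x, z, y, w):
    """Weighted least squares for y = p1x*x + p1z*z (no intercept p0),
    following Eqs. (59)-(62): returns the fitted parameters, their
    standard errors, and the minimal value of the objective function."""
    S_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    S_wzz = sum(wi * zi * zi for wi, zi in zip(w, z))
    S_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    S_wxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    S_wzy = sum(wi * zi * yi for wi, zi, yi in zip(w, z, y))
    S_wyy = sum(wi * yi * yi for wi, yi in zip(w, y))
    D = S_wxx * S_wzz - S_wxz ** 2                    # determinant of Eq. (60)
    p1x = (S_wzz * S_wxy - S_wxz * S_wzy) / D
    p1z = (S_wxx * S_wzy - S_wxz * S_wxy) / D
    F_min = S_wyy - p1x * S_wxy - p1z * S_wzy         # minimal value of Eq. (59)
    sigma2 = F_min / (len(y) - 2)                     # sigma^2, with NDF = N - 2
    dp1x = math.sqrt(sigma2 * S_wzz / D)              # Eq. (61)
    dp1z = math.sqrt(sigma2 * S_wxx / D)              # Eq. (62)
    return p1x, p1z, dp1x, dp1z, F_min
```

For data generated exactly by $y=2x+3z$, the routine recovers $\hat{p}_{1x}=2$, $\hat{p}_{1z}=3$, and a vanishing minimal value.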

Appendix B Numerical minimisation

The numerical minimisation can be achieved by means of a plethora of commercial and non-commercial products. In my projects, I have always relied on the MINUIT software package [69] of the CERN library (FORTRAN and C++ versions). In this work, the FORTRAN version was used in order to validate the analytical solutions obtained after the C++ implementation of nearly all methods of Appendix A. The treatment of the two methods of Appendix A.4, which feature parameter-dependent weights $w_{i}(p_{1})$, was exclusively covered by MINUIT. (The same goes for the fit leading to Fig. 1.) Regarding my MINUIT-based applications, each optimisation is achieved by following the sequence: SIMPLEX, MINIMIZE, MIGRAD, and MINOS. The calls to the last two methods involve the high-level strategy of the numerical minimisation.

  • SIMPLEX uses the simplex method of Nelder and Mead [70].

  • MINIMIZE calls MIGRAD, but reverts to SIMPLEX if MIGRAD fails to converge.

  • MIGRAD, undoubtedly the warhorse of the MINUIT software package, is a variable-metric method, based on the Davidon-Fletcher-Powell algorithm. The method checks for the positive-definiteness of the Hessian matrix.

  • MINOS carries out a detailed error analysis (separately for each parameter), taking into account all correlations among the model parameters.

All aforementioned methods accept one optional argument, which fixes the maximal number of calls to that method. If this limit is reached, the corresponding method is terminated (by MINUIT, internally) regardless of whether or not it has converged. To have confidence in the results of the optimisation, the user is advised to always inspect the (copious) MINUIT output, in order to ascertain the convergence of the methods and the successful termination of the application, or to discover and resolve any setback during the execution (e.g., lack of convergence of some MINUIT methods, erroneous evaluation of the Hessian matrix, a large estimated vertical distance to the minimum, etc.).

Although MINUIT is robust, some attention is required in order to avoid cases where the application might get trapped in the ‘valley’ of a local minimum. This is not unlikely to happen in non-linear problems, e.g., when using the two methods of Section A.4 of this work, featuring parameter-dependent weights. There are several ways to safeguard against this possibility. For instance, one may repeat the optimisation using different initial guesses for the parameter values (thus modifying the point from where MINUIT starts the minimisation of the user-defined function), broaden the initial guesses for the uncertainties of the parameters of the fit, or simply restrict the parameter space by introducing lower and/or upper bounds to some or to all parameters. To cope with this issue, the MINUIT developers also offer the (time-consuming) method SCAN, which scans the parameter space for an optimal starting point in the minimisation problem.
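The multiple-start safeguard described above can be sketched outside MINUIT as follows. The sketch uses a plain gradient-descent step as a stand-in for one minimiser run, applied to an illustrative one-dimensional objective function (a tilted double well) that possesses both a local and a global minimum; all names are illustrative.

```python
def minimise_from(f, df, x0, lr=1e-2, steps=5000):
    """Plain gradient descent from the starting point x0; a stand-in for
    a single minimiser run started from one initial guess."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Tilted double well: local minimum near x = +0.96, global minimum near x = -1.04.
f = lambda x: (x * x - 1.0) ** 2 + 0.3 * x
df = lambda x: 4.0 * x * (x * x - 1.0) + 0.3

# Repeat the optimisation from several initial guesses and keep the best result.
starts = [-2.0, -0.5, 0.5, 2.0]
best = min((minimise_from(f, df, x0) for x0 in starts), key=f)
```

A single run started at $x_{0}=2$ would terminate in the shallower well near $x\approx 0.96$; scanning over the four starting points recovers the global minimum near $x\approx-1.04$, which is the essence of the safeguard.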

Finally, it must be borne in mind that the MINOS uncertainties for the parameters of the fit do not contain the Birge factor of Eq. (38). Consequently, it is up to the user to programme the user-defined function in such a way that the Birge factor be applied to the uncertainties (e.g., if ${\rm BF}>1$), or be omitted (e.g., if ${\rm BF}<1$ while the PDG recommendation is followed). Similarly, it is the user's responsibility to obtain and apply the t-multiplier to the MINOS uncertainties.

Appendix C On the application of the Birge factor to the uncertainties of the parameters of a fit

When no suspicion can be cast on the modelling in a problem, and the uncertainties of the input datapoints are meaningful (see beginning of Appendix A), the minimisation function is expected to follow the $\chi^{2}$ distribution with $N-N_{p}$ DoFs ($N_{p}$ being the number of parameters used in the fit). As a result, the expectation value of the minimisation function at the minimum should be equal to NDF, which implies that the expectation value of the reduced $\chi^{2}$, i.e., of the ratio $\sigma^{2}\coloneqq\chi^{2}_{\rm min}/{\rm NDF}$, to be identified with the square of the Birge factor BF, is equal to $1$. This property of $\chi^{2}$-distributed quantities has led most physicists to the subliminal use of the approximate criterion $\chi^{2}_{\rm min}/{\rm NDF}\approx 1$ as a 'rule of thumb' when assessing the quality of fits.

Glossing over the arbitrariness in the interpretation of an approximate equality, this practice could be meaningful for large samples; it is misleading for small ones. Even for large samples, the $\chi^{2}_{\rm min}/{\rm NDF}\approx 1$ criterion might provide a general impression of the quality of a fit, yet it represents no rigorous measure of that quality. For instance, does a resulting $\chi^{2}_{\rm min}$ value of $900$ for $800$ DoFs, yielding a $\chi^{2}_{\rm min}/{\rm NDF}$ value of $1.125$, indicate a satisfactory or an unsatisfactory fit at the $1$ % significance level? Although one might be tempted to consider such a fit 'satisfactory', the p-value associated with these numbers is about $7.77\cdot 10^{-3}$, i.e., below the $1$ % significance level of this work, suggesting that the fit is (in the strict statistical sense) anything but satisfactory.

The correct assessment of the quality of fits rests upon the use of the p-value which is associated with the $\chi^{2}$ distribution for the given NDF. To this end, a number of software implementations of dedicated algorithms are available, e.g., see Refs. [71] (Chapter on 'Gamma Function and Related Functions') and [72], the routine PROB of the FORTRAN implementation of the CERN software library (which, unlike most other CERNLIB routines, is available only in single-precision floating-point format), the functions CHIDIST or CHISQ.DIST.RT of Microsoft Excel, the function chi2cdf of MATLAB, etc.
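If none of these implementations is at hand, the p-value may be computed directly as the upper regularised incomplete gamma function $Q({\rm NDF}/2,\chi^{2}_{\rm min}/2)$. The following sketch uses the standard series/continued-fraction evaluation of $Q(a,x)$ (of the kind described in the references above); the function names are illustrative.

```python
import math

def _gser(a, x, itmax=1000, eps=1e-12):
    # Series representation of the lower regularised incomplete gamma P(a, x).
    ap, term = a, 1.0 / a
    total = term
    for _ in range(itmax):
        ap += 1.0
        term *= x / ap
        total += term
        if abs(term) < abs(total) * eps:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def _gcf(a, x, itmax=10000, eps=1e-12):
    # Continued fraction for the upper regularised incomplete gamma Q(a, x),
    # evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, itmax + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return math.exp(-x + a * math.log(x) - math.lgamma(a)) * h

def chi2_pvalue(chi2_min, ndf):
    """p-value of a fit: Q(NDF/2, chi2_min/2)."""
    if chi2_min <= 0.0:
        return 1.0
    a, x = 0.5 * ndf, 0.5 * chi2_min
    if x < a + 1.0:
        return 1.0 - _gser(a, x)   # series converges faster here
    return _gcf(a, x)              # continued fraction converges faster here
```

Applied to the example above, `chi2_pvalue(900.0, 800)` reproduces the quoted p-value of about $7.77\cdot 10^{-3}$.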

A few words regarding the application of the Birge factor are due. For the sake of argument, let me assume that the $\chi^{2}_{\rm min}/{\rm NDF}$ result of a fit exceeds $1$ (equivalently, ${\rm BF}>1$). So far as physical observations, accompanied by measurement uncertainties of the input datapoints, are concerned, the quantity BF may be interpreted as the factor by which the input uncertainties need to be rescaled, so that the new reduced $\chi^{2}_{\rm min}$ becomes equal to its expectation value of $1$. If, on the contrary, the fit is already 'satisfactory' (i.e., when the ratio $\chi^{2}_{\rm min}/{\rm NDF}$ does not exceed $1$), the PDG maintains that the rescaling of the uncertainties of the input datapoints is not called for.
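The PDG-style application of the Birge factor may be sketched as follows (the function name and arguments are illustrative): the fitted uncertainties are multiplied by ${\rm BF}=\sqrt{\chi^{2}_{\rm min}/{\rm NDF}}$ only when ${\rm BF}>1$, and are left intact otherwise.

```python
import math

def rescale_uncertainties(errors, chi2_min, ndf, follow_pdg=True):
    """Apply the Birge factor BF = sqrt(chi2_min / NDF) to the fitted
    uncertainties; under the PDG recommendation, rescale only if BF > 1."""
    bf = math.sqrt(chi2_min / ndf)
    if follow_pdg and bf <= 1.0:
        return list(errors)          # 'satisfactory' fit: no rescaling
    return [bf * e for e in errors]
```

For the numbers of the earlier example ($\chi^{2}_{\rm min}=900$, ${\rm NDF}=800$), the uncertainties are inflated by ${\rm BF}=\sqrt{1.125}\approx 1.061$.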