A discussion on two celebrated examples of application of linear regression
Abstract
This work aims at providing some (further) perspective into the analysis of two well-known datasets in terms of the methods of linear regression. The first set has been taken from Hubble’s 1929 investigation
of the relation between the distances of galaxies and their recessional velocities. The second set relates to Galton’s family data on human stature, collected for the purposes of his 1889 book on natural
inheritance.
PACS 2010: 07.05.Kf, 02.60.Pn
keywords:
Linear least-squares, linear regression, multiple linear regression, numerical optimisation, data analysis: algorithms and implementation
1 Introduction
Regression methods have been indispensable tools in the analysis of observations for over two centuries. The first description of linear regression may be traced to one of Legendre’s books [1], published in 1805. It seems that Gauss had already developed and employed the technique before Legendre’s publication, but had considered it “trivial” and unworthy of any special attention, e.g., see Ref. [2].
The purpose of regression analysis is to provide optimal solutions (and predictions obtained thereof) to overdetermined systems, i.e., to situations where the available amount of information (the number of input datapoints) exceeds the number of parameters involved in modelling the processes which yield the observations. The simplest regression model replaces the entirety of the available information (e.g., all lifespans of incandescent light bulbs produced by a specific manufacturer) by one 'constant' (the mean product lifetime) and one uncertainty (e.g., the 'root mean square' of the distribution of the lifetimes of the tested products): the potential customer is thus informed that a purchased bulb of that specific brand and type is expected to withstand a given number of hours of operation, within an estimated uncertainty. Consequently, it will be very surprising if such a bulb greatly outlives the quoted lifetime, and (in case the underlying distribution of the product lifetime is symmetrical about the mean value) equally surprising if it becomes dysfunctional well short of that lifetime. (In practice, the latter tends to be more probable than the former, though Statistics is hardly to blame for that mishap!) The next-to-simplest modelling involves a straight line, with two adjustable parameters, intercept and slope. (Although a constant may be thought of as a straight line without slope, such a 'semantic narrowing' in the definition of the straight line will be avoided in this work.)
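To make the simplest model concrete, here is a minimal sketch (Python; the lifetime values are invented for illustration) which reduces a small dataset to the two numbers just described:

```python
import numpy as np

# Invented lifetimes (in hours) of a batch of tested light bulbs.
lifetimes = np.array([1180.0, 1240.0, 1195.0, 1210.0, 1225.0, 1205.0])

mean_lifetime = lifetimes.mean()  # the single 'constant' of the model
rms = lifetimes.std(ddof=1)       # square root of the unbiased variance

print(f"Expected lifetime: {mean_lifetime:.0f} h, with an uncertainty of {rms:.0f} h")
```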
Numerous authors have occupied themselves with the details of the methods of linear regression (ordinary, weighted, multiple, etc.). The task of listing even the most influential of these contributions would inevitably result in a very long reference list, and, in all probability, even such a list would be incomplete. Therefore, I have decided to simply quote a few relevant books from my own collection [3, 4, 5, 6, 7] and rather refer the interested reader to the works cited therein, as well as to two articles which discuss the results of the application of some standard methods of linear regression to simulated and experimentally acquired datasets, see Refs. [8, 9].
This work discusses details of the application of linear regression to two celebrated datasets, relating to different branches of Science and obtained in the distant past. The first dataset (Section 2.1) is taken from Astronomy, from the 1929 paper by Edwin Powell Hubble (1889-1953) on the relation between the distance and the radial velocity of galaxies [10], which later became known as 'Hubble's law'; the second dataset (Section 2.2) is taken from Anthropometry, from the 1889 (the year Hubble was born) book by Francis Galton (1822-1911) on natural inheritance [11]. As the present paper grew longer than I had initially thought, I abandoned my plan to include herein a discussion of two additional datasets.
All important details of the methods of linear regression which are applied to the two aforementioned datasets may be found in Appendix A. That appendix serves as a short guide to the formalism relevant to the various methods of linear regression, including the determination of the optimal values of the parameters and of their uncertainties, as well as of any predictions based on the results of the fits (including all correlations among the fit parameters). The reader is encouraged to start with the material in the appendices, in order to become familiar, if not with the 'theoretical' details of the various analysis options, at least with the definitions.
For the sake of brevity, a number of acronyms will be used in this paper:
- PDF will stand for 'Probability Density Function';
- CDF will stand for 'Cumulative Distribution Function';
- DoF will stand for 'degree of freedom';
- NDF will stand for 'number of DoFs';
- WLR will stand for 'weighted linear regression' ('weighted least-squares optimisation'); and
- rms will stand for 'root mean square' (square root of the unbiased variance of a dataset).
Finally, pc will stand for parsec, defined as the distance at which the length of one astronomical unit (au), representative of the Sun-Earth distance, subtends an angle of one arcsecond; the light year (ly), the former unit of length for cosmic distances, will also appear occasionally.
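Spelled out with the standard values $1\,\mathrm{au} \approx 1.496 \times 10^{8}$ km and $1'' = \pi/648000$ rad (both universally adopted constants), the definition amounts to:

```latex
1\,\mathrm{pc} \;=\; \frac{1\,\mathrm{au}}{\tan(1'')}
\;\approx\; \frac{1.496 \times 10^{8}\,\mathrm{km}}{4.848 \times 10^{-6}}
\;\approx\; 3.086 \times 10^{13}\,\mathrm{km}
\;\approx\; 3.26\,\mathrm{ly}
```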
2 The datasets
2.1 Hubble’s dataset
2.1.1 The beginnings
The story behind Hubble’s determination of the famous constant, which has been named after him, started over years before Hubble was born: Clotho’s spindle started spinning when Edward Pigott (1753-1825), an English astronomer, developed a fondness for stars of variable brightness. Pigott’s family moved to York in 1781, where Pigott met John Goodricke (1764-1786), a deaf-mute with unparalleled visual acuity and sensitivity. Working in a private observatory built by Pigott’s father, Pigott and Goodricke embarked on a project aiming at exploring the properties of the most promising variable stars, taken from Pigott’s precompiled list [12]. It did not take long before Goodricke’s skills and attention to detail paid off: one year into the project, he announced to the Royal Society the discovery of the periodical variation of the brightness of the star Algol ( Persei) [13]. Apart from proposing a mechanism to account for the effect (and attributing it to the transits of Algol’s dimmer binary companion), Goodricke also examined the variations of brightness of several other stars, including Aquilae (A) and Cephei (A) [14], which became textbook examples of the Cepheid family of variable stars 333Cepheid variable stars pulsate radially, varying in both diameter and temperature [15]., featuring very regular ‘sharply-up-gradually-down’ light curves, dissimilar to the one obtained from Algol; evidently, the variation of brightness of these stars could not be attributed to stellar eclipses. Goodricke’s contributions to Astronomy did not go unnoticed: at the age of 21, he was elected a Fellow of the Royal Society, a few days before his sad passing.
2.1.2 The Harvard College Observatory
Over the course of the next century, many more Cepheid variable stars were detected. With the invention of photography, the acquisition of accurate brightness data was boosted. The Harvard College Observatory became the hub of the developments in Observational Astronomy, and its new director in 1877, Edward Charles Pickering (1846-1919), launched an ambitious programme aiming at photographing and cataloguing all observable stars. Whether due to budget issues (e.g., see Ref. [16], p. 205) or to the poor quality of the work of Pickering's male assistants [17], a sizeable number of College-educated women were recruited to work at the Observatory as human computers, i.e., as skilled employees, comparing photographic plates, processing astronomical data, and contributing to Astronomy for the first time since Hypatia of Alexandria and Aglaonice of Thessaly.
The computers excelled in their assignments. For instance, Annie Jump Cannon (1863-1941), who was deaf after she turned twenty, conducted the main work in the development of the Harvard Classification Scheme of stars, the first attempt towards a categorisation of the stars in terms of their surface temperature and spectral type. In an article entitled 'Woman making index of stars for a catalogue', the Danville Morning News reported on 10 February 1913 [18]: "When the work was new she could analyze at the rate of … stars in three years. Now she analyzes … stars in one month, … stars an hour. On Jan. 1 she had examined about …, which means that about two-fifths of the work is completed." ('An hour' is an erratum: evidently, the author of the article intended to write 'day', not 'hour'.)
Without doubt, Pickering’s research programme reached its climax with another computer, Henrietta Swan Leavitt (1868-1921). Leavitt’s methodical work was destined to make an indelible impact on Astronomy (and Cosmology). Having been given the task of studying variable stars in the Small and Large Magellanic Clouds (SMC and LMC, respectively), Leavitt catalogued such objects between 1903 and 1908 [19]. The decisive moment came when she focussed her attention on a small subset of these objects, i.e., on the Cepheid variable stars in the SMC (see Table VI of Ref. [19]); on p. 107 of her report, she inconspicuously commented: “It is worthy of notice that in Table VI the brighter variables have the longer period.”
Following her intuition in subsequent years, Leavitt examined the relation between the period of the variation of the apparent magnitude (which is associated with the apparent brightness) of the Cepheid variable stars in the SMC and the extrema of the apparent magnitude for each such object. Her decision to restrict her analysis to the SMC turned out to be of pivotal importance, because the selected celestial bodies (i.e., those from Table VI of Ref. [19], together with those added to her database between 1908 and 1913) were, in practical terms, equidistant from the Earth. (It is currently known that the SMC is a dwarf galaxy, a few kpc across, gravitationally attracted to the Milky Way, and separated from the Earth by about 60 kpc [20].)
After plotting the period-brightness datapoints of the Cepheid variable stars on a linear-log plot (logarithmic scale on the period axis, linear scale on the apparent-magnitude axis), Leavitt obtained the "remarkable" result shown in Fig. 2 of Ref. [21]: two nearly parallel straight lines, modelling the apparent-magnitude maxima and minima of these stars, separated by a roughly constant amount in apparent magnitude (the difference between the corresponding intercepts); I shall use 'mag' to denote the unit of apparent magnitude (which, of course, is not a physical unit), and the short term 'period-brightness' when referring to the datapoints corresponding to the period and to the apparent magnitude of variable stars (other authors prefer the term 'period-luminosity'). The fact that Pickering, rather than Leavitt, signed the 1912 report is baffling: the report starts with the sentence "The following statement regarding the periods of variable stars in the Small Magellanic Cloud has been prepared by Miss Leavitt." Although Pickering's motive for signing Leavitt's report is a mystery to me, I decided to include him in the author list of that report, yet as second author. Two remarks from that report serve as a prelude to the significance of Leavitt's research:
- "…there is a simple relation between the brightness of the variables and their periods" and
- "Since the variables are probably at nearly the same distance from the Earth, their periods are apparently associated with their actual emission of light, as determined by their mass, density, and surface brightness."
The consequences of Leavitt’s findings were beyond reckoning. The measurement of the period of a Cepheid variable star, whichever neighbourhood of the Universe that specific celestial body occupies, could lead to an estimate for its apparent brightness which that body would have if it had been confined within the SMC. By comparing the body’s measured apparent brightness to , one could obtain an estimate for the distance between the specific Cepheid variable star and the Earth, expressed (of course) in terms of the SMC-Earth distance. As the distances of all Cepheid variable stars would be expressed as multiples of the SMC-Earth distance, one would need to determine only one such distance (e.g., using the stellar-parallax method) in order that all distances become known. Within one year of Leavitt’s breakthrough, Ejnar Hertzsprung (1873-1967) provided a solution, however flawed 888On p. 204 of Hertzsprung’s article [22], one reads: “…so wird die entsprehende visuelle Sterngröße gleich . Diese Überlegung führt also zu einer Parallaxe der kleinen Magellanschen Wolke, welche durch gegeben ist. Man erhält , einem Abstand von etwa Lichtjahren entsprechend.” This part is translated as follows: “…in which case the corresponding apparent magnitude is equal to . This consideration leads to a parallax of the Small Magellanic Cloud, which is given by the relation . One obtains , which corresponds to a distance of about light years.” It is a puzzle to me how Hertzsprung obtained this result. Expressed in arcseconds, the parallax would be equal to about , yielding a distance of about m or about ly. However that may be, it is known today that the SMC-Earth distance is significantly larger than both aforementioned values., to this problem [22]. In spite of the serious underestimation of the cosmic distances when using Hertzsprung’s results, Harlow Shapley (1885-1972), then at the Mount Wilson Observatory, elaborated on Hertzsprung’s method and obtained a better calibration of the cosmic distances on the basis of the apparent brightness of the Cepheid variable stars [23]. The yardstick to measure cosmic distances had been invented. Or hardly so?
Looking at Leavitt’s period-brightness data from the comfort of my armchair (next to a fast desktop, running present-day software) over one century after she (arduously) collected the data of the Cepheid variable stars, I notice that the values of Pearson’s correlation coefficient between the logarithm of the period and both apparent-magnitude extrema are very close to : in case of the apparent-magnitude maxima and in case of the apparent-magnitude minima, both calling for linear regression. To advance to the modelling of these data, one first needs to establish a method to assess the uncertainties relevant to the apparent magnitude (there is no mention of uncertainties in Leavitt’s reports), and, to be able to select a reasonable model, one first needs to revisit the procedure by which the apparent-magnitude data are obtained.
The apparent brightness $b$ of a star is linked to the star's luminosity $L$, which is the absolute measure of the power emitted by the star in the form of electromagnetic radiation of all frequencies, via the formula

$$b = \frac{L}{4 \pi d^{2}} \, , \qquad (1)$$

where $d$ stands for the distance between the points of emission and absorption, i.e., for the star-Earth distance. From the measurement of the apparent brightness, one obtains the apparent magnitude $m$ of the star by using the relation

$$m = -2.5 \log_{10} \left( \frac{b}{b_{0}} \right) \, , \qquad (2)$$

where the quantity $b_{0}$ is the reference apparent brightness (i.e., the apparent brightness which is associated with an apparent magnitude of $0$). It must be borne in mind that smaller values of the apparent magnitude are indicative of higher apparent brightness: the apparent magnitude of Venus lies between about $-5$ and $-3$, that of Sirius (α Canis Majoris) is about $-1.5$, and that of Polaris (α Ursae Minoris) about $+2$; these three celestial bodies have been listed in order of decreasing apparent brightness.
Using Eq. (2), one obtains the uncertainty $\delta m$ of the apparent magnitude from the uncertainty $\delta b$ of the apparent brightness $b$, as follows:

$$\delta m = \frac{2.5}{\ln 10} \, \frac{\delta b}{b} \, . \qquad (3)$$

Although $m$ depends on the choice of the reference apparent brightness $b_{0}$, $\delta m$ is independent of that choice. It is reasonable to assume that the uncertainties $\delta b$ are proportional to the apparent brightness itself; under this assumption, the uncertainties $\delta m$ are constant. Without doubt, the matter is more complex than presented here, yet the discussion above suffices for 'the sake of argument' purposes.
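The last step deserves one line of algebra: if $\delta b = c\,b$ for a constant $c$, Eq. (3) yields

```latex
\delta m \;=\; \frac{2.5}{\ln 10}\,\frac{\delta b}{b}
\;=\; \frac{2.5}{\ln 10}\, c \;\approx\; 1.086\, c \,,
```

independently of the star and of the choice of $b_{0}$.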
Following the aforementioned train of thought, constant working uncertainties $\delta m$ were assigned to all apparent-magnitude data of Ref. [21]. It must be emphasised that, provided that all assigned uncertainties are equal (and non-zero), their exact value has no bearing whatsoever on the results.
The separate linear fits to the data (compare the second and third columns of Table 1) suggest that there is nothing against the hypothesis that the two datasets be accounted for on the basis of a common slope parameter. The third row of Table 1 contains the results of the joint fit to the data, using three parameters: the intercept $a$ of the straight line which models the apparent-magnitude maxima, the common slope $b$, and the (positive) difference $\Delta$ between the intercepts of the two straight lines which model the apparent-magnitude extrema. The results of the joint fit to Leavitt's period-brightness data are shown in Fig. 1.
The results, contained in the second and third columns of this table, correspond to the separate fits to the maxima and the minima of the apparent magnitude, see Table VI of Ref. [21]; the results of the joint (three-parameter) fit (featuring a common slope parameter) are shown in the last column. The value of the slope appears positive in Fig. 2 of Ref. [21], because the vertical axis is shown inverted in that figure, evidently to conform with the common-sense expectation that brightness maxima appear ‘above’ brightness minima.
| Quantity | Fit to maxima | Fit to minima | Joint fit |
|---|---|---|---|
| NDF | | | |
| (mag) | | | |
| (mag) | - | - | |
| (mag) | | | |
| (mag) | - | - | |
| t-multiplier | | | |
| Corrected (mag) | | | |
| Corrected | | | |
| Corrected (mag) | - | - | |
[Figure 1: Results of the joint (three-parameter) fit to Leavitt's period-brightness data.]
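For completeness, the joint three-parameter fit is easy to reproduce with standard least squares. Below is a minimal sketch (Python; the numerical values are invented stand-ins for Leavitt's datapoints, and the equal working uncertainties justify unit weights):

```python
import numpy as np

# Invented stand-ins for Leavitt's data: log10(period) and the
# apparent-magnitude maxima/minima of a handful of Cepheids.
logP  = np.array([0.2, 0.5, 0.8, 1.1, 1.4])
m_max = np.array([15.1, 14.4, 13.6, 12.9, 12.2])
m_min = np.array([16.0, 15.3, 14.5, 13.8, 13.1])

ones, zeros = np.ones_like(logP), np.zeros_like(logP)

# Joint model: m_max = a + b*logP and m_min = (a + Delta) + b*logP.
# Stack both datasets; solve for (a, b, Delta) in one least-squares problem.
A = np.vstack([np.column_stack([ones, logP, zeros]),
               np.column_stack([ones, logP, ones])])
y = np.concatenate([m_max, m_min])

(a, b, Delta), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"a = {a:.3f} mag, common slope b = {b:.3f}, Delta = {Delta:.3f} mag")
```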
I left two comments for the very end of the story regarding Leavitt's invaluable contributions to Science. First, Leavitt too was suffering from severe hearing problems by the time she joined Pickering's team in 1903. Second, like Goodricke more than a century earlier, she did not live long enough to enjoy the fruits of her labour.
2.1.3 Hubble and Company
It was about time Hubble, an American from the state of Missouri, came on stage. Hubble had been offered a job at the Mount Wilson Observatory already in 1916, but arrived with a delay of about three years, after he answered the call of duty and embarked on the task of defending the United Kingdom against the Quadruple Alliance. When Hubble finally arrived at the Observatory, the 100-inch Hooker Telescope had been completed and was about to yield unprecedented observations of the Cosmos for the next few decades. Shapley was still at the Observatory, at the peak of his scientific productivity [23]. That was a time when a paradigm shift was under way in the human understanding of the Cosmos, with the introduction of General Relativity a few years earlier and the gradual emergence of Quantum Mechanics. Astronomers were striving to provide answers to questions about the nature of the 'nebulae', which were being detected at an increasing rate with the help of the modern telescopes, and about the size of the Universe; a relevant question was whether or not the nebulae were part of the Milky Way. There was no shortage of arguments for a few years, as is frequently the case when reliable evidence is scarce. Intending to resolve the matter, the National Academy of Sciences proposed a public debate between representatives of the two sides in the first half of 1920 [24]; this event became known as the 'Great Debate', though (for the obvious reasons) I would rather replace 'Debate' with 'Divide'. The two representatives, Shapley (for those who propounded that the nebulae were within the Milky Way) and Heber Doust Curtis (1872-1942) (for those who surmised that the nebulae were independent galaxies), were different in nearly all ways possible [16], not only regarding their views on the matter at hand: they had different personalities, different ways of expressing themselves when providing explanations to others, different skills in addressing the general public, and (the icing on the cake) different social backgrounds. On 27 April 1920 (the day after the Great Debate), there was no clear winner: the unquestionable winner was announced four years later, and his name was Hubble.
Hubble’s resolution of the question of the Great Debate was based on a somewhat serendipitous discovery: on 4 October 1923, he took a long exposure of the Andromeda Nebula (M31) [16] and noticed peculiar ‘spots’ on the photographic plate. Following up on the issue the next night, he found out that two of the spots were novae, whereas the third one turned out to be a Cepheid variable star, the first one to be detected beyond the two Magellanic Clouds. He evaluated the object’s distance and found it several times larger than the typical dimensions of the Milky Way, suggesting that the Andromeda Nebula was an independent galaxy, well beyond our own. Several other Cepheid variable stars were detected in M31 shortly afterwards (also by Hubble), confirming the earlier result [25]. Following the scientific code of conduct, Hubble broke the news to Shapley (who had moved to Cambridge, Massachusetts, to assume the position of the Director of the Harvard College Observatory) in early 1924. After reading Hubble’s letter, Shapley is said to have commented to one of his doctoral students: “Here is the letter that destroyed my universe,” see Ref. [24], p. 1142. That was Hubble’s first significant contribution to Observational Astronomy. He then turned his attention to a related problem.
In the early 1910s, Vesto Melvin Slipher (1875-1969) became the first astronomer to measure the radial velocity of a nebula. Utilising the Doppler shift, he determined the radial velocity of the Andromeda Galaxy (which, as aforementioned, still had the status of a nebula at the time) relative to the Solar System, and found that it approaches the Milky Way at a speed of about 300 km/s, see the first paper of Refs. [26]. Within a few years, Slipher had enlarged his database to a sizeable sample of nebulae (see Table 1 of the last paper of Refs. [26]), most of which were undoubtedly receding; in addition, four of the receding nebulae had remarkably large radial velocities, enormous in comparison with the typical relative velocities of stars in the Milky Way. Provided that there is no preferred direction in the motion of these objects, the probability of so lopsided a split between receding and approaching nebulae is very small. This suggests a statistically significant departure from the null hypothesis that the distribution of the radial velocities of the nebulae is symmetrical about zero. (In this study, the threshold of statistical significance, which is frequently denoted in the literature as $\alpha$, is set at the value which is regarded by most statisticians as signifying the outset of statistical significance; a usual, laxer choice in several branches of Science is the value which most statisticians associate with 'probable statistical significance'.) Slipher's results posed the inevitable question: why do so many more nebulae retreat from the Milky Way?
In the early 1920s, Alexander Alexandrovich Friedmann (1888-1925) derived (from General Relativity) the equations which became known in Cosmology as 'Friedmann's equations' [27, 28]: one of the solutions of these equations predicted an expanding Universe (see also Ref. [29], Chapters 10.3.4 and 10.8.3). A few years later (and working independently), Georges Henri Joseph Édouard Lemaître (1894-1966), the father of the Big-Bang Theory with his seminal 1931 paper [30], also came up with the same equations and attributed the recession of the nebulae to the expansion of space [31]. It is far less known (or, perhaps, intentionally overlooked) that what many call 'Hubble's law' can be found in Lemaître's 1927 paper [31]: the ratio $v/d$ (where, as mentioned in Section 1, $d$ represents the distance of a nebula from the Earth and $v$ its radial velocity), which in subsequent years became known as Hubble's constant $H_{0}$ (regarding Hubble's constant, a caustic comment I usually make is that it is neither Hubble's nor a constant, given its dependence on the cosmological time; $H_{0}$ is the value of the Hubble parameter at the current cosmological epoch), was actually evaluated by Lemaître (see fifth line from the top of p. 56 of Lemaître's paper [31]) to a value not far from Hubble's two estimates a few years later. Unfortunately for Lemaître, his paper had been published in a journal of very limited dissemination. In addition, Lemaître's efforts in 1927 to draw attention to his work were not successful [32]. The consequence was that the famous relation between the distance and the radial velocity at which the galaxies recede from the Milky Way became known as 'Hubble's law', instead of the fairer 'Lemaître-Hubble law' (see also Ref. [29], Chapter 10.3.1).
Concerning the data on which Hubble based his determination, I have heard two Cosmologists remark in their seminars: "Hubble's data seem to be all over the place" and "I do not know what Hubble did in his paper." I find both comments lamentable (though, regrettably, I did not raise an objection on those two occasions) and shall next address both issues. Available to Hubble in his 1929 paper [10] were 24 pairs of measurements of the distance $d$ and of the radial velocity $v$ of nebulae. To account for the observations, Hubble employed the linear model

$$v = K d + v_{\odot} \, , \qquad (4)$$

where $K$ was assumed to be a constant (to be identified with $H_{0}$) and the correction

$$v_{\odot} = X \cos\alpha \cos\delta + Y \sin\alpha \cos\delta + Z \sin\delta \qquad (5)$$

compensated for the motion of the Solar System (i.e., for the motion of the observer); the velocities $X$, $Y$, and $Z$ were parameters, to be obtained from the optimisation of the description of the observations. The angles $\alpha$ and $\delta$ in Eq. (5) denote the right ascension and the declination of each observed object, respectively. Hubble's datapoints, originating from "nebulae whose distances have been estimated from stars involved or from mean luminosities in a cluster," see Table 1 of Ref. [10], are shown in Fig. 2.
[Figure 2: Hubble's datapoints, taken from Table 1 of Ref. [10].]
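Equations (4) and (5) amount to a four-parameter linear model, so the optimisation is a standard least-squares exercise. The sketch below (Python; an unweighted fit is assumed, whereas Hubble, as noted below, probably imposed constraints) is one way to carry it out:

```python
import numpy as np

def fit_hubble_1929(d, v, alpha, delta):
    """Unweighted least-squares fit of the model of Eqs. (4)-(5):
    v = K*d + X*cos(alpha)*cos(delta) + Y*sin(alpha)*cos(delta) + Z*sin(delta).
    d in Mpc, v in km/s, right ascension alpha and declination delta in radians.
    Returns the optimal (K, X, Y, Z)."""
    A = np.column_stack([d,
                         np.cos(alpha) * np.cos(delta),
                         np.sin(alpha) * np.cos(delta),
                         np.sin(delta)])
    params, *_ = np.linalg.lstsq(A, v, rcond=None)
    return params
```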
Was Hubble wrong when suggesting that the distances of the nebulae and their radial velocities are linearly related? Do the data spread "all over the place?" The answer to both questions is negative. Pearson's correlation coefficient for the data shown in Fig. 2 is large; the resulting value of the t-statistic yields a very small (two-tailed) p-value. Therefore, there is ample evidence to reject the null hypothesis that the two quantities, $d$ and $v$, are not correlated: equivalently, Hubble's data do not spread all over the place.
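For reference, the significance test behind this statement is the standard one for Pearson's $r$ with $n$ datapoints:

```latex
t \;=\; r\,\sqrt{\frac{n-2}{1-r^{2}}}\,,
\qquad
\text{p-value (two-tailed)} \;=\; 2\,\bigl[\,1 - F_{n-2}(|t|)\,\bigr]\,,
```

where $F_{n-2}$ denotes the CDF of Student's $t$ distribution with $n-2$ DoFs.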
Using Hubble’s datapoints, I obtained the following results for (i.e., ), , , and : km/s Mpc-1, km/s, km/s, and km/s, respectively. Hubble quotes slightly different values and sizeably smaller uncertainties in Ref. [10] (in all probability, he employed constraints in his fit). His values read as follows: km/s Mpc-1, km/s, km/s, and km/s, respectively. After enhancing his database, Hubble updated his results two years later [33]; nevertheless, his value remained enormous (in fact, the estimate of 1931 is even larger than the 1929 result!), suggesting that the typical age of the Universe (obtained from ) was smaller than the age of the oldest rocks found on the Earth (established via radiometric dating), which in turn provides an estimate for the temporal interval which elapsed since the Earth’s crust was formed. Hubble’s result stood for over two decades, and so did the age-of-the-Universe paradox (also known as “timescale difficulty”).
It was Wilhelm Heinrich Walter Baade (1893-1960), another 'user' of the 100-inch Hooker Telescope at the Mount Wilson Observatory (as well as of the 200-inch Hale Telescope at the Palomar Observatory in the late 1940s), who came up with an explanation for the paradox. Hubble's results had been flawed, because his estimates for the distances of the galaxies had been erroneous. It had not occurred to him that the Cepheid variable stars basically come in two types: the classical ones, which are young luminous stars, and the older ones, which are fainter, e.g., see Ref. [34], figure in Section 'Calculating distances using Cepheids'. The distance calibration had involved the second type, whereas Hubble had been processing data pertaining to classical Cepheid variable stars. The consequence was that the distances of the galaxies had been seriously underestimated (to appear so bright, a star of the assumed fainter type would have to be closer to the Milky Way). By how much had Hubble underestimated the present-day distances? It turns out that Hubble's distances were a small fraction of the present-day estimates (available from the NASA/IPAC Extragalactic Database [20]). (I find it very surprising that no one validated Hubble's determination of the distances of the galaxies for such a long time.) Within a few years of Baade's discovery, his doctoral student, Allan Rex Sandage (1926-2010), came up with the first realistic estimate for $H_{0}$ and for the age of the Universe in his 1958 paper [35]; in the abstract of that work, one reads: "…This gives … km/s Mpc$^{-1}$ or … years …"
I had been wondering for some time what Hubble's result would have been, had present-day data for the galaxies mentioned in Ref. [10] been available to him. To this end, I used the latest information from NED [20] for all galaxies mentioned in Ref. [10], i.e., for the 24 then selected ones (whose data Hubble considered 'reliable'), as well as for the 22 which, though mentioned in Hubble's paper, had not been fully measured by 1929. Information about the right ascension, the declination, and the radial velocity can be found directly on the main web page of each of these galaxies in NED, whereas all available distance estimates may be downloaded using a tab therein. It does not take long to realise that the variability of the distance estimates of these galaxies is sizeable even today, nearly one century after Hubble's analysis. In any case, I obtained the median, the mean, and the rms of the distribution of all distance estimates (no exclusion of results) for each of these galaxies, and finally constructed the input (to the optimisation software) datafile using the means and the rms values; alternatively, one could replace the means with the medians.
The measured radial velocities were corrected for the motion of the Solar System, using updated information regarding the 'peculiar' velocity of the Sun [36] and the rotational motion of the Local Standard of Rest (LSR) [37, 38], a coordinate system obtained after averaging the velocities of stars in our neighbourhood. The three components of the velocity of the Solar System relative to the centre of the Milky Way will be denoted as $(U, V, W)$: in this coordinate system, which follows the right-hand rule (to my great surprise, the use of left-handed coordinate systems is not uncommon in Observational Astronomy!), $U$ is positive towards the galactic centre, $V$ is positive in the direction of the galactic rotation, and $W$ is positive towards the North Galactic Pole. The velocity of the Solar System relative to the centre of the Milky Way is obtained by adding the peculiar velocity of the Sun (i.e., the velocity relative to the LSR) to the velocity of the LSR relative to the centre of the Milky Way. Two of the velocity components (and their uncertainties) were taken from Ref. [36], whereas the third one (and its uncertainty) was fixed from Ref. [37]; the result of Ref. [37] is compatible with the estimate for the rotational velocity of the LSR which is recommended in Ref. [38]. For the transformation from equatorial to galactic coordinates (latitude $b$ and longitude $l$ of each celestial object), Ref. [39] was followed, see p. 900 therein: the radial velocities of Ref. [20] were corrected by adding to the measured velocity the term $U \cos b \cos l + V \cos b \sin l + W \sin b$.
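A compact sketch of this correction (Python; the function name and argument layout are mine, and the inputs are assumed to be already transformed to galactic coordinates):

```python
import numpy as np

def correct_radial_velocity(v_obs, l, b, U, V, W):
    """Add the line-of-sight projection of the Solar System's velocity
    (U, V, W), in km/s, to an observed radial velocity v_obs (km/s).
    l and b are the galactic longitude and latitude, in radians."""
    return v_obs + U * np.cos(b) * np.cos(l) \
                 + V * np.cos(b) * np.sin(l) \
                 + W * np.sin(b)
```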
A few words are due regarding the treatment of a few outliers in the dataset of the galaxies mentioned in Ref. [10].
- After its radial velocity was corrected, NGC 3031 appears to recede from the Milky Way at a speed grossly at variance with the one its distance suggests.
- Six galaxies (SMC, LMC, NGC 6822, NGC 598, NGC 221, and NGC 224) are separated from the Milky Way by less than about 1 Mpc: the proximity of these galaxies to the Milky Way results in irregular radial velocities. The (corrected) recessional velocities of four of these galaxies are negative, implying that they approach the Milky Way (only the LMC and NGC 6822 recede). Regarding the treatment of these six galaxies, one may either replace them with one representative object, situated at a typical distance and having a typical velocity (the weighted averages, in the sense of the formula after this list), or exclude them from the database.
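Here, the weighted averages are understood in the usual inverse-variance sense (an assumption on my part; the weights actually used in the paper are defined in its Appendix A):

```latex
\bar{x} \;=\; \frac{\sum_i w_i\, x_i}{\sum_i w_i}\,,
\qquad w_i = \frac{1}{\sigma_i^{2}}\,,
\qquad \sigma_{\bar{x}} \;=\; \Bigl(\sum_i w_i\Bigr)^{-1/2}
```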
The results of the WLRs (without intercept) following the two approaches of Appendix A.4, for four ways of treating the aforementioned outliers, are listed in Table 2; the input datapoints (after all seven outliers are removed from the database) and the fitted straight line in case of the EV2 method are shown in Fig. 3 (a code sketch of the effective-variance idea follows the figure). The crux of the matter is that, had present-day data been available to Hubble for his 1929 paper, he probably would have ended up with a value of $H_{0}$ in the vicinity of the present-day estimates. The Particle Data Group (PDG) recommend $H_{0} \approx 67.4$ km/s Mpc$^{-1}$, obtained from a plethora of precise astronomical data [40].
The results of the linear fit (without intercept, see Appendix A.4) to the pairs of measurements of the distance and of the radial velocity (corrected for the motion of the Solar System) of the galaxies mentioned in Ref. [10]; the input datapoints correspond to present-day information, available from the NASA/IPAC Extragalactic Database [20]. Results are quoted for four ways of treating the seven outliers, i.e., the measurements corresponding to NGC 3031, as well as to the six galaxies closest to the Milky Way (see text). The appropriate t-multipliers have been applied to the quoted uncertainties.
| Method | $\chi^2$/NDF | (km/s Mpc$^{-1}$) | (km/s Mpc$^{-1}$) |
|---|---|---|---|
| All galaxies (… datapoints) | | | |
| Quadratic summation (EV2) | | | |
| Linear summation | | | |
| All but NGC 3031 (… datapoints) | | | |
| Quadratic summation (EV2) | | | |
| Linear summation | | | |
| No NGC 3031, six closest galaxies replaced by one (… datapoints) | | | |
| Quadratic summation (EV2) | | | |
| Linear summation | | | |
| All outliers removed (… datapoints) | | | |
| Quadratic summation (EV2) | | | |
| Linear summation | | | |
[Figure 3: The input datapoints (after removal of all seven outliers) and the fitted straight line obtained with the EV2 method.]
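Although the precise prescriptions live in Appendix A.4, the flavour of the 'quadratic summation' (EV2) approach can be conveyed by a short effective-variance iteration for the intercept-free model $v = H_0 d$; the following is a sketch of the idea, not the author's implementation:

```python
import numpy as np

def fit_slope_ev2(d, v, sigma_d, sigma_v, iterations=20):
    """Intercept-free weighted fit of v = H*d with uncertainties in both
    variables, combined in quadrature into an effective variance."""
    H = np.sum(v * d) / np.sum(d * d)              # unweighted starting value
    for _ in range(iterations):
        w = 1.0 / (sigma_v**2 + (H * sigma_d)**2)  # 'quadratic summation'
        H = np.sum(w * v * d) / np.sum(w * d * d)  # weighted update
    return H
```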
2.2 Of sweet peas, moths, and men: Galton’s family data on human stature
Galton was an English multidisciplinarian, with important contributions to subjects in a variety of branches of Science, ranging from Anthropology and Biology to Statistics. Before I started writing this article, I did not know that, in addition, Galton was among the very first to introduce weather maps in order to obtain forecasts, in particular to predict storms. Nor was I aware of the fact that he had been one of the pioneers of Forensic Science: in 1892, he published a book on the analysis of fingerprints as the means of identifying individuals [41]. In this paper, I shall focus on his 1889 book [11], which touches upon the subject of natural inheritance.
In his book, Galton set out to explore three questions (see pp. 1-2). I let him give a summary of these issues, copying a few sentences from his book.
- "The large do not always beget the large, nor the small the small, and yet the observed proportions between the large and the small in each degree of size and in every quality, hardly varies from one generation to another." Galton suggests that the corresponding PDFs of the human height "hardly vary from one generation to another," though (so far as human stature is concerned) the children are not close replicas of their parents.
- "The second problem regards the average share contributed to the personal features of the offspring by each ancestor severally. Though one half of every child may be said to be derived from either parent, yet he may receive a heritage from a distant progenitor that neither of his parents possessed as personal characteristic. Therefore the child does not on the average receive so much as one half of his personal qualities from each parent, but something less than a half." At this point, the laws of Mendelian inheritance, which (presumably unbeknownst to him) had been developed by the time Galton started writing his book, come to my mind. (Such inherited features from distant progenitors, categorised as 'latent elements', are addressed in Chapter XI of Galton's book.)
- "The last of the problems that I need mention now, concerns the nearness of kinship in different degrees. We are all agreed that a brother is nearer akin than a nephew, and a nephew than a cousin, and so on, but how much nearer are they in the precise language of numerical statement?" Here, Galton expresses interest in the quantification of the degree of consanguinity.
I shall next address a few issues relevant to Galton's family data on human stature ('Galton's family data' henceforth). Before skimming through Galton's book, I was under the impression that Galton's assistants had been responsible for all measurements of the height of the individuals on which the study [11] was based. This turned out to be untrue. Galton discusses the data-acquisition process in Chapter VI of his book, starting on p. 71: "I had to collect all my data for myself, as nothing existed, so far as I know, that would satisfy even my primary requirement. This was to obtain records of at least two successive generations of some population of considerable size. They must have lived under conditions that were of a usual kind, and in which no great varieties of nurture were to be found." Although, according to his first sentence, Galton "had to collect all his data for himself," the data was actually collected 'through the offer of prizes' by individuals who were inexperienced in acquiring measurements for scientific purposes (pp. 74-75). These individuals simply responded to Galton's request (p. 72): "Mr. Francis Galton offers 500l. (i.e., £500) in prizes to those British Subjects resident in the United Kingdom who shall furnish him before May 15, 1884, with the best Extracts from their own Family Records." Although the conditions regarding the use of any submitted data were thoroughly addressed in the advertisement (pp. 72-74), those pertaining to the data acquisition itself were not! A certain amount of disappointment overtook me as I read the following passage in Galton's book (p. 78): "In many cases there remains considerable doubt whether the measurement refers to the height with the shoes on or off; not a few of the entries are, I fear, only estimates, and the heights are commonly given only to the nearest inch." He did not mention that this was mostly the result of his glaring failure to specify the process by which the measurements of the height should have been acquired. Nevertheless, Galton's spirits must have made a miraculous recovery as, a few lines on, he shifted his mind-set towards the thesis that "a fair share of these returns are undoubtedly careful and thoroughly trustworthy, and as there is no sign or suspicion of bias, I have reason to place confidence in the values of the Means that are derived from them." How can there be "no sign or suspicion of bias," if the fraction of the measurements which had been made with the shoes on remains unknown? Without doubt, the inclusion of such measurements in the database is bound to introduce bias, in particular (for the sake of example) if most of the fathers walked around the house (and let their heights be measured) in shoes, whereas most of the other family members preferred to walk (and let their heights be measured) barefoot.
Be that as it may, Galton's records contain rows corresponding to different families: the essential information comprises the parental heights, as well as the ordered heights of all filial descendants, first of the sons, followed by those of the daughters. To be able to analyse the data in his time (it was Karl Pearson (1857-1936), one of Galton's doctoral students, who brought Statistics to another level) using one independent variable, Galton decided to combine the two parental heights into one measure of 'parental tallness'. As the arithmetical mean of the two heights did not appeal to him, he came up with a scheme of 'transmuting females into male equivalents' by increasing the heights of all females in his study (i.e., of the mothers and of the daughters) by 8%. On p. 6 of his book, Galton explains: "A serious complexity due to sexual differences seems to await us at every step when investigating the problems of heredity …The artifice is never to deal with female measures as they are observed, but always to employ their male equivalents in the place of them. I transmute all the observations of females before taking them in hand, and thenceforward am able to deal with them on equal terms with the observed male values." When I first had a look at Galton's family data, this factor appeared arbitrary to me, but (as I shall explain shortly) it emerges from the data sample which was available to him. After 'transmuting the mothers into father equivalents', Galton created a hypothetical parent (by averaging the two values), a 'mid-parent' in his own words (p. 87): "The word 'Mid-Parent' …expresses an ideal person of composite sex, whose stature is half way between the stature of the father and the transmuted stature of the mother."
Galton finally presented the data in tabular form, see Table 11 of his book, p. 208; all heights are expressed in inches (1 in = 2.54 cm). Given in this table are the frequency distributions of the filial height in nine histogram bins of the mid-parent's height, along with one overflow and one underflow bin. As I was not content with this table, I started investigating whether Galton's original data had appeared elsewhere.
This was how I came across Ref. [42] and found out that its author too, James A. Hanley, had been in pursuit of the original data over two decades ago: on p. 238 of his paper, Hanley writes: “And so, in 2000, I began my search for Galton’s ‘untransmuted’ data …I also hoped that the children would still be found with their families, that is, before they were marshalled into what Galton called ‘filial’ arrays.” Hanley then continues to recount his efforts to localise Galton’s records, which were finally ‘unearthed’ in 2001, in possession of the University College London. I can almost sense his contentment as I read his text: “The data were exactly what I had wished, in a single notebook, family by family, with sons and daughters identified, and with all female heights untransmuted. Because of the frail condition of the notebook, photocopying was not permitted …In February 2003, I requested and obtained permission to digitally photograph the material.” Digital photographs of the eight relevant pages of Galton’s notebook are now, as a result of Hanley’s commendable efforts, available online [43].
I copied the values, as they appear in the digital photographs, onto an Excel file (a constant value had been subtracted from all entries in Galton's notebook) and edited a few family records as follows:
- Family 70: one daughter, described as 'tall', was removed;
- Family 76: one son (listed between two reported heights) and a second daughter, both described as 'medium', were removed;
- Family 92: one daughter, described as 'tall', was removed;
- Family 95: one daughter, described as 'deformed', was removed;
- Family 118: the fourth son, described as 'tallish', and all five daughters, described as 'tallish' (twice), 'medium' (twice), and 'shortish', were removed;
- Family 119: the first three daughters, described as 'tall' (twice) and 'medium', were removed;
- Family 131: three daughters, described as 'medium' and 'short' (twice), were removed;
- Family 144: the third son, described as 'deformed', was removed;
- Family 158: one daughter (listed between two reported heights), described as 'short', was removed;
- Family 159: the first three daughters, described as 'very tall' and 'tall' (twice), were removed;
- Family 173: the last son, described as 'short', was removed;
- Family 176: one son (listed between two reported heights), described as 'medium', and the last daughter, described as 'idiotic', were removed;
- Family 177: two daughters (between the two reported values), described as 'tall', were removed; and
- Family 189: the second son and one daughter (between the two reported values), both described as 'middle', were removed.
No measurement of height is mentioned in the aforementioned cases of deleted entries. I thus came up with 205 rows (families) and a total of … children, … sons and … daughters. I cannot explain why Galton writes on p. 77 of his book: "I was able to extract …the statures of 205 couples of parents, with those of an aggregate of 930 of their adult children of both sexes." Nor can I understand why the filial multiplicities in the eleven (i.e., nine regular plus the overflow and underflow) bins of the mid-parent's height in his Table 11 (p. 208) sum up to yet another total.
2.2.1 Correlations between the parental and filial heights
It is interesting to compare the values of Pearson's correlation coefficient between the sets of heights corresponding to all combinations of parental and filial types. I am aware of three methods taking on the problem of the unequal dimensions of the input datasets (205 pairs of parents, different numbers of sons and of daughters).
- Method A: The analysis is restricted to the families with either exactly one son or exactly one daughter. (Of course, families with exactly one son and exactly one daughter contribute to both cases.) As the heights of the sons and of the daughters will be analysed separately, the correspondence between each pair of parents and the relevant filial descendant (male and/or female, as the case might be) is one-to-one (bijective). In the accepted data, there are … families with exactly one son (the multiplicity of the daughters in these families is irrelevant) and … families with exactly one daughter (the multiplicity of the sons in these families is irrelevant).
- Method B: Within each family, the mean fraternal and sororal heights are evaluated. (Of course, families with no sons or with no daughters do not contribute to the corresponding dataset.) Again, a bijective correspondence between each pair of parents and the filial descendants is created (all sons and all daughters within a family are replaced by two persons, one male, another female, with statures corresponding to the aforementioned means). In the accepted data, there are … families with at least one son and … families with at least one daughter.
- Method C: The obvious commonality between the descendants within the same family is ignored, and the two parental heights are assigned to all filial descendants of each family. The correspondence is multi-valued: each pair of parents is linked to all of their filial descendants.
I do not expect methods A and B to introduce any bias into the analysis.
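As an illustration of Method B, a sketch for one parent-child combination (Python; the input layout, one list of same-sex filial heights per family, is an assumption):

```python
import numpy as np

def method_b_r(parent_heights, children_by_family):
    """Pearson's r between a parental height and the mean filial height
    (Method B). `children_by_family` is a list with one (possibly empty)
    list of same-sex filial heights per family."""
    keep = [i for i, kids in enumerate(children_by_family) if kids]
    means = np.array([np.mean(children_by_family[i]) for i in keep])
    parents = np.asarray(parent_heights, dtype=float)[keep]
    return np.corrcoef(parents, means)[0, 1]
```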
The values of Pearson’s correlation coefficient, obtained when following the aforementioned methods, are given in Table 3. Regardless of the method, the correlation of the heights of both filial types with the paternal height comes out stronger.
Pearson’s correlation coefficient between the heights corresponding to all combinations of parental and filial types.
| | Paternal height | Maternal height |
|---|---|---|
| Method A | | |
| Son's height | | |
| Daughter's height | | |
| Method B | | |
| Sons' mean height | | |
| Daughters' mean height | | |
| Method C | | |
| Sons' heights | | |
| Daughters' heights | | |
2.2.2 Distributions of human height
Table 4 contains the approximate values of the mean and of the rms of the distributions of parental and filial heights corresponding to Galton’s 205 families. The apparent similarities of these statistical measures for the male (fathers and sons) and for the female (mothers and daughters) subjects, which the contents of the table suggest, require further examination.
Approximate mean and rms values of the distributions of parental and filial heights corresponding to Galton's 205 families, containing the acceptable entries of the sons and of the daughters. The ratio of the mean paternal to the mean maternal height is close to the factor of 1.08 which Galton employed in order to transform the heights of the females into those of the 'male equivalents'. In this work, there is no need for such a transformation.
Family member | Mean value | rms |
---|---|---|
Father | ||
Mother | ||
Son | ||
Daughter |
The four distributions of the parental and the filial heights will next be submitted to a number of tests. Implementations of the algorithms relevant to these tests are available in the form of classes in my C++ software library, developed within the framework of Microsoft Visual Studio. The input comprises the original arrays (i.e., not histograms of these data).
The first test concerns the symmetry of the distributions about their corresponding median values. The results of Wilcoxon’s signed-rank tests [44] are given in Table 5. On the basis of the two sets of p-values (depending on whether or not the so-called continuity correction is applied), there is no evidence in support of asymmetrical distributions about their corresponding median values.
The results of Wilcoxon’s signed-rank (two-tailed) tests [44] on the four distributions of parental and filial heights; owing to the largeness of the populations, the Edgeworth approximation [45] is reliable (and has been used). The quantity is the dimension of each input dataset. The T-domain and the T-value are standard quantities associated with the test. The two sets of p-values, depending on whether or not the so-called continuity correction is applied, are nearly identical.
| Quantity | Fathers | Mothers | Sons | Daughters |
|---|---|---|---|---|
| $n$ | | | | |
| T-domain | | | | |
| T-value | | | | |
| p-value (without continuity correction) | | | | |
| p-value (with continuity correction) | | | | |
To test the similarity of the distributions of the height between a) fathers and sons, and b) mothers and daughters, two tests were performed: the Mann-Whitney-Wilcoxon test [44, 46] and the Brunner-Munzel test [47]. The p-values of these two tests on the data at hand are nearly identical, see Table 6: as a result, there is no evidence in support of dissimilar distributions of the height between a) fathers and sons, and b) mothers and daughters. Therefore, on the basis of his data, no evidence can be produced against Galton's first conjecture about the similarity of the two (one for male, another for female subjects) PDFs of the human height "from one generation to another" (see beginning of Section 2.2). (Without doubt, the mean height of humans gradually increased during the twentieth century, e.g., see Figs. 6-8 of Ref. [48]. However, this effect has been associated with better nutrition and improved healthcare. According to Ref. [49]: "Poor nutrition and illness in childhood limit human growth. As a consequence, the mean height of a population is strongly correlated with living standards in a population.")
Results of the two (two-tailed) tests for similarity of the distributions of the height between a) fathers and sons, and b) mothers and daughters. The quantities $n_1$ and $n_2$ are the dimensions of the input datasets (parent and filial descendant) in each case. The U-domain, U-value, and T-value are standard quantities associated with the two tests.
The aforementioned tests suggest that the distributions of the height a) of the fathers and of the sons, and b) of the mothers and of the daughters may be combined into two distributions: one for the male, another for the female subjects. To rid the data of the rounding effects, which give rise to artefacts in the analysis (a large number of the fathers' heights in Galton's family data are given as one and the same rounded value; such shortcomings would hardly have occurred, had the data acquisition been carried out by professionals), random uniformly-distributed noise U(0,1), offset by $-1/2$ so that its expectation value corresponds to a vanishing correction, was added to all available measurements of the height; in essence, this operation neutralises Galton's concern that "the heights are commonly given only to the nearest inch." Two new datafiles were created after the addition of the 'noise', one for the male, another for the female subjects, and the resulting distributions were tested for normality.
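The de-rounding operation is straightforward; a sketch (the fixed seed is my choice, for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed, for reproducibility

def deround(heights):
    """Smear heights recorded to the nearest inch with zero-mean noise:
    U(0,1) offset by -1/2, so the expected correction vanishes."""
    h = np.asarray(heights, dtype=float)
    return h + rng.uniform(0.0, 1.0, size=h.size) - 0.5
```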
Numerous algorithms are available for testing the normality of distributions, e.g., see Refs. [50, 51] and the works cited therein. In this study, the normality will be tested by means of three well-established statistical methods.
- The formal Shapiro-Wilk normality test, which emerges as the ultimate test for normality in power studies (e.g., see Ref. [52]), was introduced in 1965 [53] for small samples (in the first version of the algorithm, a maximum of about 50 observations could be tested) and was substantially extended (for application to large samples, certainly up to a few thousand observations, perhaps to even larger samples) in a series of studies by Royston [54].
- …
- …
There is only one commonality between these three tests, the obvious one: each test results in an estimate for the p-value for the acceptance of the null hypothesis, i.e., that the underlying distribution is the normal distribution. Without further ado, I present the results of the tests in Table 7. The normal probability plots, corresponding to the male and to the female subjects, are given in Figs. 4 and 5, respectively. In both cases, a few outliers may be seen in the tails of the two distributions. Although such datapoints might be excluded in a more thorough analysis of Galton's family data, it was decided to leave them in the database of this work. On the basis of Table 7, as well as of Figs. 4 and 5, no evidence can be produced in support of a departure of the two distributions from normality. The two PDFs of the height corresponding to Galton's male (fathers and sons) and female (mothers and daughters) subjects, obtained from the means and the rms values of Table 7, are shown (side by side) in Fig. 6.
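For readers without access to the author's C++ classes, the Shapiro-Wilk part of this exercise can be reproduced with off-the-shelf tools; a sketch using scipy (the sample below is a normally-distributed stand-in, not Galton's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
heights = rng.normal(loc=69.0, scale=2.6, size=1000)  # stand-in sample (inches)

W, p = stats.shapiro(heights)
print(f"Shapiro-Wilk: W = {W:.4f}, p-value = {p:.3f}")
# A p-value above the chosen threshold offers no evidence against normality.
```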
Results of the three tests for normality of the distributions of the height of the male (fathers and sons) and of the female (mothers and daughters) subjects. The quantity $n$ is the dimension of the input dataset in each case; also given are the test statistics associated with the three tests.
[Figure 4: Normal probability plot of the heights of the male subjects (fathers and sons).]
[Figure 5: Normal probability plot of the heights of the female subjects (mothers and daughters).]
[Figure 6: The two PDFs of the height, corresponding to Galton's male and female subjects, obtained from the means and the rms values of Table 7.]
2.2.3 Modelling
Regarding the modelling of Galton’s family data, two approaches would appeal to the physicist, both involving the extraction of the mean filial heights (for the sake of simplicity, both will be denoted as , but will be separately modelled) for each family (of course, families without any sons or without any daughters do not contribute to the corresponding datafile). Provided that the filial heights are sampled from normal distributions with two family-dependent means and two family-independent variances 151515In the language of the analysis of variance (ANOVA), these variances are associated with the within-treatments variations., i.e., one variance relevant to the heights of the sons , another to those of the daughters , the two standard errors of the means are expected to be equal to , where is to be identified either with or with , and stands for the fraternal or sororal multiplicity in the -th family. As explained in Appendix A.3, a statistical weight is introduced, to account for the uncertainty of each of the two filial means: evidently, , and, as the plan is to perform separate fits to the two types of mean filial heights, one may simply use for the purposes of the optimisation: the rescaling of the statistical weights (i.e., their multiplication by any constant ) has no impact on the important results of the fits.
I shall next address the two types of optimisation to be pursued in relation to Galton’s family data. The first one involves multiple linear regression (with the parental heights as the two independent variables), as detailed in Appendix A.5. The second possibility involves the combination of the two parental heights into one value, i.e., the one corresponding to the hypothetical mid-parent in Galton’s language. However, the realisation of this possibility does not involve Galton’s simple procedure, but a more rigorous one, namely the separate standardisation of the two paternal heights: the notion of tallness can be quantified (in the absolute sense) on the basis of a comparison of each parent’s height with the heights of the parents of the same sex. To this end, one may obtain an absolute measure of tallness for each of the two parents of the -th family by introducing the quantity with the relation
$$z = \frac{h - \mu}{\sigma} \tag{6}$$
where $h$ denotes the height of the parent, and $\mu$ and $\sigma^2$ stand for the mean height and for the (unbiased) variance of the distribution of the height of the parents of the same sex; although the values of Table 4 could be used, I decided to employ the statistical measures of the two distributions of parental height for those of the families with at least one son (when modelling the fraternal heights) and at least one daughter (when modelling the sororal heights), see Table 8.
Approximate mean and rms values of the distributions of parental heights corresponding to Galton’s families with at least one son (suitable for the modelling of the fraternal heights) and the families with at least one daughter (suitable for the modelling of the sororal heights).
Family member | Mean value | rms |
---|---|---|
Obtained from the data of the families with at least one son | ||
Father | ||
Mother | ||
Obtained from the data of the families with at least one daughter | ||
Father | ||
Mother |
After standardising the height of each parent of each family (using separate distributions, as explained above), one may either proceed to add the two $z$-scores or to average them (the choice in this work), and assign that score (independent variable) to the mid-parent for the purposes of a WLR, as detailed in Appendix A.1.
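In code, the separate standardisation and the averaging of the two $z$-scores might look as follows (a Python sketch with illustrative input values; `statistics.stdev` returns the square root of the unbiased variance):

```python
import statistics

# Sketch: separate standardisation of parental heights (Eq. 6) and the
# average z-score assigned to the mid-parent. Input lists are illustrative.
father_heights = [70.0, 68.1, 72.3, 69.5]   # e.g., fathers of families with >= 1 son
mother_heights = [64.2, 62.8, 66.0, 63.1]

mu_f, sd_f = statistics.mean(father_heights), statistics.stdev(father_heights)
mu_m, sd_m = statistics.mean(mother_heights), statistics.stdev(mother_heights)

def midparent_score(h_father, h_mother):
    """Average of the two separately standardised parental heights."""
    z_f = (h_father - mu_f) / sd_f
    z_m = (h_mother - mu_m) / sd_m
    return 0.5 * (z_f + z_m)

print(midparent_score(70.0, 64.2))
```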
Before advancing, I should mention that I have not studied the papers relevant to the past analyses of Galton’s family data; this work aims at applying a few standard methods of linear regression, not at compiling a review article on Galton’s dataset. The interested reader is referred to Ref. [42], as well as to the works cited therein.
2.2.3.1 Results of the optimisation in case of multiple linear regression
The results of linear regression, using two independent variables (the paternal heights $u_i$ and the maternal heights $v_i$), are given in Table 9 for the two fits to the filial heights. The square of the Birge factor [58] is an estimate for the (unbiased) ‘unexplained’ variance in each case, see also Appendices A.3, B, and C.
The results of linear regression, using two independent variables (the paternal heights $u_i$ and the maternal heights $v_i$), for the separate fits to the filial heights. All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the t-multiplier corrections are small, and modify (to the quoted precision) only one fitted uncertainty in case of the fit to the sororal heights.
Quantity | Sons | Daughters |
---|---|---|
Birge factor | ||
t-multiplier |
Scatter plots of the standardised residuals versus the fitted values are given in Figs. 7 and 8 for the fits to the fraternal and sororal heights, respectively.
2.2.3.2 Results of the optimisation in case of a WLR
To start with, the mean filial heights correlate equally well with the mid-parents’ $z$-scores: Pearson’s correlation coefficient comes out nearly the same for the mean fraternal and the mean sororal heights. The results for the fitted values of the parameters and of their uncertainties, obtained from the two fits to the available data, are given in Table 10. The quality of the two fits is nearly as good as that of the corresponding fits in case of the multiple linear regression of the previous section.
The results of a WLR, using as independent variable the average $z$-scores of the parental heights and as dependent variable the mean filial heights. All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the t-multiplier corrections are small, and do not modify (to the quoted precision) the fitted uncertainties of the two parameters of the fit.
Quantity | Sons | Daughters |
---|---|---|
Birge factor | ||
t-multiplier |
Plots of the mean filial heights versus the mid-parents’ $z$-scores are shown in Figs. 9 and 10 for the sons and the daughters, respectively.
I have left one issue for the very end. Was Galton’s approach to ‘transmuting females into male equivalents’ deficient? To answer this question, let us consider two populations - one of male, another of female subjects - which fulfil the two conditions listed below.
- The dimensions of the populations are equal.
- So far as human stature is concerned, the correspondence between the two groups is bijective. In addition, for each male subject with height $h$, there exists a female subject with height $h/\lambda$, where $\lambda$ is a constant.
In such a population, the means of the heights of the male subjects ($\mu_1$) and of the female subjects ($\mu_2$) are related: $\mu_1 = \lambda\,\mu_2$; similarly linked are the rms values of the two distributions, i.e., $\sigma_1$ for the male subjects and $\sigma_2$ for the female subjects: $\sigma_1 = \lambda\,\sigma_2$. Let us next focus on one family, with parental heights $h_1$ (father) and $h_2$ (mother). Galton suggested that each mid-parent’s height should be taken as
$$h_{mp} = \frac{h_1 + \lambda\,h_2}{2} \tag{7}$$
On the other hand, the average $z$-score, corresponding to the heights of these two parents - see Eq. (6) - would be equal to
$$\bar z = \frac{1}{2}\left(\frac{h_1 - \mu_1}{\sigma_1} + \frac{h_2 - \mu_2}{\sigma_2}\right) \tag{8}$$
which yields
$$\bar z = \frac{h_1 + \lambda\,h_2 - 2\mu_1}{2\,\sigma_1} = \frac{h_{mp} - \mu_1}{\sigma_1} \tag{9}$$
Given the constancy of the quantities $\mu_1$ and $\sigma_1$, there can be no doubt that, provided that the two aforementioned conditions about the two populations are fulfilled, Galton’s mid-parent approach and the method involving the average $z$-score of the parental heights are bound to give identical results so far as the quality of the description of the filial heights is concerned. In case of Galton’s family data, the first criterion is undoubtedly fulfilled: the families contain fathers and mothers. However, the second criterion is not fulfilled (in the mathematical sense): given that the heights of the subjects are sampled from continuous probability distributions, one cannot expect to find any two subjects in the entire population, whose heights could possibly have a ratio exactly equal to the ratio of the means of the heights in the two populations of male and female subjects; the consequence is that $\sigma_1 \neq \lambda\,\sigma_2$, and the average $z$-score of the parental heights is no longer given by Eq. (9). Only one question remains: how effective (in terms of the description of the input data) was Galton’s introduction of the mid-parent’s height via Eq. (7) as the independent variable in the linear-regression problem?
The results of the two WLRs to the average filial heights, using as independent variable Galton’s mid-parent’s height of Eq. (7), are given in Table 11. The quality of the fits is comparable to that obtained with the two methods of this work. All things considered, Galton’s use of the mid-parent’s height as the independent variable in the problem, which he set out to examine, was a reasonable approximation.
The results of a WLR, using as independent variable Galton’s mid-parent’s height of Eq. (7) and as dependent variable the filial heights. The optimal values of $\lambda$ per filial type have been used (one value for the fraternal heights, another for the sororal ones). All quantities have been defined in Appendix A.5. Due to the largeness of the two samples, the t-multiplier corrections are small, and modify (to the quoted precision) only one fitted uncertainty in case of the fit to the fraternal heights.
Quantity | Sons | Daughters |
---|---|---|
Birge factor | ||
t-multiplier |
3 Conclusions
This work aimed at providing a modicum of perspective into the analysis of two well-known datasets in terms of the methods of linear regression.
The first of these sets has been taken from Hubble’s 1929 paper [10]; that work investigated the relation between the distances of nebulae (identified with galaxies later on) and their radial velocities (relative to the observer). Hubble assumed a linear relation between these two quantities, and determined the slope in their scatter plot; that slope, which became known as the ‘Hubble constant’ at later times, was estimated by Hubble at about 500 km/s/Mpc [10, 33]. After revisiting Hubble’s problem, using his selection of galaxies in conjunction with present-day data [20], this work demonstrated that Hubble could have obtained an estimate close to the currently accepted value [40] (which is several times smaller than his 1929 and 1931 estimates), if accurate information (in particular, more accurate estimates for the distances) had been available to him, see Table 2 and Fig. 3.
The second dataset of this work related to Galton’s family data on human stature [11]. Galton’s original data, retrieved from digital photographs of eight pages of his notebook available from Ref. [43], contain records of parental and filial heights for each family separately. The four distributions of the height (two parental, two filial) were first tested for symmetry (about their respective median values), see first part of Section 2.2.2. Subsequently, the similarity of the distributions of the height for the male subjects (fathers and sons), as well as for the female subjects (mothers and daughters), was investigated and confirmed. The two distributions, obtained after combining the heights of the male subjects, as well as those of the female subjects, were successfully tested for normality. The two PDFs of the height, corresponding to the male and female “British subjects, resident in the United Kingdom, who furnished Galton with the best extracts from their own family records” are shown, next to one another, in Fig. 6. The filial heights were subsequently modelled in two ways:
- in terms of multiple linear regression, using the parental heights as independent variables, and
- in terms of a WLR, featuring the average $z$-score of the two parents as the independent variable. Each of the $z$-scores of the two parents was obtained on the basis of a comparison of that parent’s height with the statistical measures corresponding to the distribution of the height of parents of the same sex. My thesis is that the separate standardisation of the parental heights represents the best option towards establishing an absolute standard of ‘parental tallness’ in the general case.
The two methods of linear regression gave nearly identical results (in terms of the description of the input data), supporting the hypothesis of a positive correlation between parental and filial heights, see Tables 9 and 10, and Figs. 9 and 10.
In addition to these results, this paper may also serve as a short guide to obtaining the formalism which is relevant to the various methods of linear regression, including the determination of the optimal values of the parameters and of their uncertainties, as well as of any predictions based on the results of the fits: given in Appendix A are analytical formulae for all these quantities in case of one independent variable, for different types of input uncertainties, see Appendices A.1-A.4. Appendix A.5 deals with the simplest case of multiple linear regression, the one involving two independent variables. Appendices B and C deal with two technical issues, namely with some features of the numerical (as opposed to the analytical) minimisation, as well as with the application of the Birge factor to the fitted uncertainties of the parameters.
References
- [1] A.-M. Legendre, ‘Nouvelles méthodes pour la détermination des orbites des comètes’ (in French), F. Didot, Paris (1805); see https://books.google.ch/books?id=FRcOAAAAQAAJ&redir_esc=y
- [2] https://priceonomics.com/the-discovery-of-statistical-regression
- [3] R.J. Carroll, D. Ruppert, ‘Transformation and Weighting in Regression’, Springer-Science+Business Media, B.V. (1988). ISBN-13: 978-0-4120-1421-5
- [4] G.S. Maddala, ‘Introduction to Econometrics’, 2nd edn., Macmillan Publishing Company (1992). ISBN-10: 0-02-374545-2
- [5] F.A. Graybill, H.K. Iyer, ‘Regression Analysis: Concepts and Applications’, Duxbury Press (1994). ISBN-10: 0-53-419869-4, ISBN-13: 978-0-5341-9869-5
- [6] T. Hastie, R. Tibshirani, J. Friedman, ‘The Elements of Statistical Learning: Data Mining, Inference, and Prediction’, 2nd edn., Springer (2016). ISBN-10: 0-38-784857-6, ISBN-13: 978-0-3878-4857-0
- [7] R.B. Darlington, A.F. Hayes, ‘Regression Analysis and Linear Models: Concepts, Applications, and Implementation’, The Guilford Press (2017). ISBN-13: 978-1-4625-2113-5
- [8] C.A. Cantrell, ‘Technical Note: Review of methods for linear least-squares fitting of data and application to atmospheric chemistry problems’, Atmos. Chem. Phys. Discuss. 8(17), 6409 (2008). DOI: 10.5194/acp-8-5477-2008
- [9] S. Mikkonen et al., ‘Technical note: Effects of uncertainties and number of data points on line fitting - a case study on new particle formation’, Atmos. Chem. Phys. 19(19), 12531 (2019). DOI: 10.5194/acp-19-12531-2019
- [10] E. Hubble, ‘A relation between distance and radial velocity among extra-galactic nebulae’, P. Natl. Acad. Sci. USA 15, 168 (1929). DOI: 10.1073/pnas.15.3.168
- [11] F. Galton, ‘Natural Inheritance’, Macmillan (1889).
- [12] L.M. French, ‘John Goodricke, Edward Pigott, and their study of variable stars’, J. Amer. Assoc. Var. Star Obs. 40(1), 120 (2012); see https://app.aavso.org/jaavso/article/2742/
- [13] J. Goodricke, ‘XXVI. A series of observations on, and a discovery of, the period of the variation of the light of the bright star in the head of medusa, called algol. In a letter from John Goodricke, Esq. to the Rev. Anthony Shepherd, D.D.F.R.S. and Plumian Professor at Cambridge’, Philos. T. R. Soc. 73, 474 (1783). DOI: 10.1098/rstl.1783.0027. ISSN: 0261-0523; see https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1783.0027
- [14] J. Goodricke, ‘II. A series of observations on, and a discovery of, the period of the variation of the light of the star marked by Bayer, near the head of Cepheus. In a letter from John Goodricke, Esq. to Nevil Maskelyne, D.D.F.R.S.’, Philos. T. R. Soc. 76, 48 (1786). DOI: 10.1098/rstl.1786.0002. ISSN: 0261-0523; see https://royalsocietypublishing.org/doi/epdf/10.1098/rstl.1786.0002
- [15] H. Shapley, ‘On the nature and cause of Cepheid variation’, Astrophys. J. 40, 448 (1914). DOI: 10.1086/142137
- [16] S.L. Singh, ‘Big Bang: The origin of the Universe’, Harper Perennial (2005). ISBN-10: 0-00-716221-9, ISBN-13: 978-0-0071-6221-5; first published (under a slightly different title) by Fourth Estate (2004).
- [17] M.W. Rossiter, “‘Women’s Work” in Science, 1880-1910’, Isis 71(3), 381 (1980). DOI: 10.1086/352540. ISSN 0021-1753.
- [18] https://www.newspapers.com/article/10248290
- [19] H.S. Leavitt, ‘1777 variables in the Magellanic Clouds’, Annals of Harvard College Observatory 60(4), 87 (1908); see https://articles.adsabs.harvard.edu/pdf/1908AnHar..60…87L
- [20] https://ned.ipac.caltech.edu
- [21] H.S. Leavitt, E.C. Pickering, ‘Periods of Variable Stars in the Small Magellanic Cloud’, Harvard College Observatory Circular 173 (1912) 1; see https://articles.adsabs.harvard.edu/pdf/1912HarCi.173….1L
- [22] E. Hertzsprung, ‘Über die räumliche Verteilung der Veränderlichen vom Cephei-Typus’ (in German), Astron. Nachr. 196, 201 (1913); see https://adsabs.harvard.edu/full/1913AN….196..201H
- [23] H. Sawyer Hogg, ‘Harlow Shapley and globular clusters’, Publ. Astron. Soc. Pac. 77(458), 336 (1965). DOI: 10.1086/128229
- [24] V. Trimble, ‘The 1920 Shapley-Curtis discussion: Background, issues, and aftermath’, Publ. Astron. Soc. Pac. 107(718), 1133 (1995). DOI: 10.1086/133671
- [25] https://www.nasa.gov/mission_pages/hubble/science/star-v1.html
- [26] V.M. Slipher, ‘The radial velocity of the Andromeda Nebula’, Lowell Observatory Bulletin 2(8), 56 (1913); ‘Spectrographic observations of nebulae’, Popular Astronomy 23, 21 (1915); and ‘Nebulae’, P. Am. Philos. Soc. 56, 403 (1917). Available from https://ui.adsabs.harvard.edu/abs/1913LowOB…2…56S, https://ui.adsabs.harvard.edu/abs/1915PA…..23…21S, and https://ui.adsabs.harvard.edu/abs/1917PAPhS..56..403S, respectively.
- [27] A. Friedmann, ‘Über die Krümmung des Raumes’ (in German), Z. Phys. 10, 377 (1922). DOI: 10.1007/BF01332580
- [28] A. Friedmann, ‘Über die Möglichkeit einer Welt mit konstanter negativer Krümmung des Raumes’ (in German), Z. Phys. 21, 326 (1924). DOI: 10.1007/BF01328280
- [29] E. Matsinos, ‘From Copernicus to Lambda-CDM’, Nova Science Publishers Inc. (2017). ISBN-10: 1-53-612034-0, ISBN-13: 978-1-5361-2034-9
- [30] G. Lemaître, ‘The expanding universe’, Mon. Not. R. Astron. Soc. 91, 490 (1931); see https://ui.adsabs.harvard.edu/abs/1931MNRAS..91..490L
- [31] G. Lemaître, ‘Un univers homogène de masse constante et de rayon croissant rendant compte de la vitesse radiale des nébuleuses extra-galactiques’ (in French), Annales de la Société Scientifique de Bruxelles A 47, 49 (1927); English translation: G. Lemaître, ‘Republication of: A homogeneous universe of constant mass and increasing radius accounting for the radial velocity of extra-galactic nebulae’, Gen. Relat. Gravit. 45 (2013) 1635. DOI: 10.1007/s10714-013-1548-3
- [32] C. O’Raifeartaigh, ‘Eddington, Lemaitre and the hypothesis of cosmic expansion’, arXiv:1907.12297 [physics.hist-ph]. DOI: 10.48550/arXiv.1907.12297
- [33] E. Hubble, M.L. Humason, ‘The velocity-distance relation among extra-galactic nebulae’, Astrophys. J. 74, 43 (1931). DOI: 10.1086/143323
- [34] https://www.atnf.csiro.au/outreach/education/senior/astrophysics/variable_cepheids.html
- [35] A. Sandage, ‘Current problems in the extragalactic distance scale’, Astrophys. J. 127(3), 513 (1958). DOI: 10.1086/146483
- [36] R. Schönrich, J. Binney, W. Dehnen, ‘Local kinematics and the Local Standard of Rest’, Mon. Not. R. Astron. Soc. 403, 1829 (2010). DOI: 10.1111/j.1365-2966.2010.16253.x
- [37] M.J. Reid et al., ‘Trigonometric parallaxes of high mass star forming regions: The structure and kinematics of the Milky Way’, Astrophys. J. 783(130) (2014). DOI: 10.1088/0004-637X/783/2/130
- [38] M.J. Reid, T.M. Dame, ‘On the rotation speed of the Milky Way determined from H I emission’, Astrophys. J. 832(159) (2016). DOI: 10.3847/0004-637X/832/2/159
- [39] B.W. Carroll, D.A. Ostlie, ‘An Introduction to Modern Astrophysics’, 2nd edn., Pearson Addison-Wesley (2007). ISBN-10: 0-32-144284-9, ISBN-13: 978-0-8053-0402-2.
- [40] R.L. Workman et al.(Particle Data Group), Prog. Theor. Exp. Phys. 2022, 083C01 (2022); see also 2023 update.
- [41] F. Galton, ‘Finger Prints’, Macmillan (1892).
- [42] J.A. Hanley, ‘“Transmuting” women into men: Galton’s family data on human stature’, Am. Stat. 58(3), 237 (2004). DOI: 10.1198/000313004X1558
- [43] http://www.epi.mcgill.ca/hanley/galton
- [44] F. Wilcoxon, ‘Individual comparisons by ranking methods’, Biometrics Bull. 1(6), 80 (1945). DOI: 10.2307/3001968
- [45] J.E. Kolassa, ‘Edgeworth approximations for rank sum test statistics’, Stat. Probab. Lett. 24(2), 169 (1995). DOI: 10.1016/0167-7152(95)00164-H
- [46] H.B. Mann, D.R. Whitney, ‘On a test of whether one of two random variables is stochastically larger than the other’, Ann. Math. Stat. 18(1), 50 (1947). DOI: 10.1214/aoms/1177730491
- [47] E. Brunner, U. Munzel, ‘The nonparametric Behrens-Fisher problem: Asymptotic theory and a small-sample approximation’, Biometrical J. 42(1), 17 (2000). DOI: 10.1002/(SICI)1521-4036(200001)42:1<17::AID-BIMJ17>3.0.CO;2-U
- [48] NCD Risk Factor Collaboration (NCD-RisC), ‘A century of trends in adult human height’, eLife 5:e13410 (2016). DOI: 10.7554/eLife.13410
- [49] M. Roser, C. Appel, H. Ritchie, ‘Human height’, Our World in Data (2013); see https://ourworldindata.org/human-height
- [50] https://math.mit.edu/~rmd/465/shapiro.pdf
- [51] R.B. D’Agostino, A. Belanger, R.B. D’Agostino Jr., ’A suggestion for using power and informative tests of normality’, Am. Stat. 44(4), 316 (1990). DOI: 10.2307/2684359
- [52] N. Mohd Razali, B. Yap, ‘Power comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling tests’, J. Stat. Model. Analytics 2(1), 21 (2011). ISBN-13: 978-9-6736-3157-5
- [53] S.S. Shapiro, M.B. Wilk, ‘An analysis of variance test for normality (complete samples)’, Biometrika 52(3/4), 591 (1965). DOI: 10.2307/2333709
- [54] J.P. Royston, ‘An extension of Shapiro and Wilk’s test for normality to large samples’, Appl. Statist. 31(2), 115 (1982). DOI: 10.2307/2347973; P. Royston, ‘Approximating the Shapiro-Wilk W-test for non-normality’, Stat. Comput. 2, 117 (1992). DOI: 10.1007/BF01891203; P. Royston, ‘A toolkit for testing for non-normality in complete and censored samples’, J. R. Stat. Soc., D: The Statistician 42, 37 (1993). DOI: 10.2307/2348109; J.P. Royston, ‘A Remark on Algorithm AS 181: The W-test for normality’, J. R. Stat. Soc., C: Appl. Stat. 44(4), 547 (1995). DOI: 10.2307/2986146
- [55] T.W. Anderson, D.A. Darling, ‘Asymptotic theory of certain goodness of fit criteria based on stochastic processes’, Ann. Math. Stat. 23(2), 193 (1952). DOI: 10.1214/aoms/1177729437
- [56] R.B. D’Agostino, ‘Tests for the Normal Distribution’, in ‘Goodness-of-Fit Techniques’, ed. R.B. D’Agostino, M.A. Stephens, New York: Marcel Dekker (1986). ISBN-10: 0-82-477487-6
- [57] R. D’Agostino, E.S. Pearson, ‘Tests for departure from normality. Empirical results for the distributions of and ’, Biometrika 60, 613 (1973). DOI: 10.2307/2335012
- [58] R.T. Birge, ‘The calculation of errors by the method of least squares’, Phys. Rev. 40(2), 207 (1932). DOI: 10.1103/PhysRev.40.207
- [59] https://en.freedownloadmanager.org/Windows-PC/CaRMetal-FREE.html
- [60] S.F. Gull, ‘Bayesian data analysis: straight-line fitting’, in ‘Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics’, Vol. 36, ed. J. Skilling, Springer (1989), Dordrecht. DOI: 10.1007/978-94-015-7860-8_55
- [61] W.A. Fuller, ‘Measurement Error Models’, John Wiley & Sons, New York (1987). ISBN-10: 0-47-186187-1
- [62] J.W. Gillard, ‘An historical overview of linear regression with errors in both variables’, Technical report, Cardiff University (2006); see https://mathsdemo.cf.ac.uk/maths/resources/Gillard_Tech_Report.pdf
- [63] G.Y. Yi, J.S. Buzas, ‘Measurement Error Models - A brief account of past developments and modern advancements’, in ‘Handbook of Measurement Error Models’, ed. G.Y. Yi, A. Delaigle, P. Gustafson, CRC Press (2021). DOI: 10.1201/9781315101279-1
- [64] J. Tellinghuisen, ‘Least-squares analysis of data with uncertainty in and : A Monte Carlo methods comparison’, Chemom. Intell. Lab. Syst. 103(2), 160 (2010). DOI: 10.1016/j.chemolab.2010.07.003
- [65] J. Tellinghuisen, ‘Least squares methods for treating problems with uncertainty in and ’, Anal. Chem. 92(16), 10863 (2020). DOI: 10.1021/acs.analchem.0c02178
- [66] H.-Ch. Schröder et al., ‘Determination of the scattering lengths from pionic hydrogen’, Phys. Lett. B 469(1), 25 (1999); H.-Ch. Schröder et al., ‘The pion-nucleon scattering lengths from pionic hydrogen and deuterium’, Eur. Phys. J. C 21(3), 473 (2001). DOI: 10.1016/S0370-2693(99)01237-X, 10.1007/s100520100754
- [67] M. Hennebach et al., ‘Hadronic shift in pionic hydrogen’, Eur. Phys. J. A 50(12), 190 (2014); see also Erratum: Eur. Phys. J. A 55(2), 24 (2019). DOI: 10.1140/epja/i2014-14190-x, 10.1140/epja/i2019-12710-x
- [68] A. Hirtl et al., ‘Redetermination of the strong-interaction width in pionic hydrogen’, Eur. Phys. J. A 57(2), 70 (2021). DOI: 10.1140/epja/s10050-021-00387-x
- [69] F. James, ‘MINUIT - Function Minimization and Error Analysis’, CERN Program Library Long Writeup D506.
- [70] J.A. Nelder, R. Mead, ‘A simplex method for function minimization’, Comput. J. 7(4), 308 (1965). DOI: 10.1093/comjnl/7.4.308
- [71] M. Abramowitz, I.A. Stegun, ‘Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables’, United States Department of Commerce, National Bureau of Standards (1972). ISBN-13: 978-0-3181-1730-0
- [72] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, ‘Numerical Recipes: The Art of Scientific Computing’, 3rd edn., Cambridge University Press (2007). ISBN-13: 978-0-5218-8068-8
Appendix A Cases of linear regression relevant to this work
The structure of this appendix is as follows. In the next section, I shall give a concise description of the WLR with one independent variable. The ordinary (called ‘simple’ by some authors; in my opinion, that adjective should rather be used in order to distinguish linear regression with one independent variable from multiple linear regression) linear regression with one independent variable will be treated in Section A.2 as a special case of Section A.1. Section A.3 addresses the problem of the WLR when ‘meaningful’ (i.e., representing 1σ effects in the normal distribution for each of the observations) measurement uncertainties are supplied for one variable. To deal with meaningful measurement uncertainties in both variables (a relatively new subject, which has not been addressed in any of the books [3, 4, 5, 6, 7]), several methods have been developed: Bayesian [60], errors-in-variables (e.g., see Refs. [61, 62, 63]), as well as least-squares methods [64, 65], etc. Detailed in Section A.4 are two relevant options, including the method which is recommended in Ref. [65]. The simplest of multiple linear regressions, i.e., the one involving two independent variables, is discussed in Section A.5.
Not addressed in this appendix are established tests regarding the applicability of the aforementioned methods (e.g., whether or not linearity is established in a dataset, whether or not heteroscedasticity is detected in a dataset, whether or not the fitted residuals are normally distributed, and the like). The technical details of such tests are not difficult to obtain, e.g., see Refs. [3, 4, 5, 6, 7], and there is no reason for repetition in this study.
For the purposes of this appendix, the quantity $\hat\sigma^2$ (the unbiased average of the weighted squares comprising the minimisation function $\chi^2(p_1, p_2, \ldots, p_k)$, where $p_1, p_2, \ldots, p_k$ represent the parameters of the fit) will stand for the ratio between the minimal value $\chi^2_{\min}$ of the minimisation function and the NDF in the fit:
$$\hat\sigma^2 = \frac{\chi^2_{\min}}{\mathrm{NDF}} \tag{10}$$
The quantity NDF is equal to the number $N$ of input datapoints reduced by the number $k$ of the parameters of the fit: $\mathrm{NDF} = N - k$. (Some authors remove one additional DoF from NDF, believing that, by doing so, they will obtain an unbiased variance. However, this approach is simply wrong: the variance, obtained using the relation $\mathrm{NDF} = N - k$, is already unbiased!)
A.1 The WLR with one independent variable
In a WLR, involving two quantities - namely $x$ (independent variable) and $y$ (dependent variable) - which are expected to be linearly related ($y = a + b\,x$), the minimisation function has the form
$$\chi^2(a, b) = \sum_{i=1}^{N} w_i\,\left(y_i - a - b\,x_i\right)^2 \tag{11}$$
where a statistical weight $w_i > 0$ (the case $w_i = 0$ is equivalent to excluding the $i$-th datapoint from the optimisation; all formulae in this work relate to datapoints which are allowed to contribute to the minimisation function $\chi^2$), independent of the parameters $a$ and $b$, is assigned to each pair of observations $(x_i, y_i)$. For the purposes of this section, the input-data structure comprises the observations $(x_i, y_i)$ and the (independent of $a$ and $b$) weights $w_i$. If the observations correspond to measurements of physical quantities, they are - more often than not - accompanied by meaningful measurement uncertainties, mostly by uncertainties associated with the dependent variable ($\sigma_{y_i}$), frequently by uncertainties in both variables, and rarely only by uncertainties in the independent variable ($\sigma_{x_i}$); each statistical weight $w_i$ reflects the size of such measurement uncertainties.
The optimal (or fitted) values of the two parameters entering Eq. (11), namely of the intercept $a$ and of the slope $b$, correspond to the global minimum of the function $\chi^2(a, b)$, where two conditions are fulfilled:
$$\frac{\partial \chi^2}{\partial a} = 0\,, \qquad \frac{\partial \chi^2}{\partial b} = 0 \tag{12}$$
From now on, the values of the parameters $a$ and $b$, which correspond to the global minimum of the function $\chi^2(a, b)$, will be denoted by $\hat a$ and $\hat b$, respectively.
The application of these conditions leads to the emergence of the system of linear equations
$$a \sum_i w_i + b \sum_i w_i x_i = \sum_i w_i y_i\,, \qquad a \sum_i w_i x_i + b \sum_i w_i x_i^2 = \sum_i w_i x_i y_i \tag{13}$$
or, after adopting a self-explanatory notation for the sums (e.g., $S_w = \sum_i w_i$, $S_x = \sum_i w_i x_i$, $S_{xy} = \sum_i w_i x_i y_i$, and so on),
$$a\,S_w + b\,S_x = S_y\,, \qquad a\,S_x + b\,S_{xx} = S_{xy} \tag{14}$$
Equations (14) may be rewritten in matrix form as
$$H\,\mathbf p = \mathbf c \tag{15}$$
where the matrix $H$ may be identified with half of the Hessian matrix of the function $\chi^2$
$$H = \begin{pmatrix} S_w & S_x \\ S_x & S_{xx} \end{pmatrix} = \frac{1}{2} \begin{pmatrix} \partial^2\chi^2/\partial a^2 & \partial^2\chi^2/\partial a\,\partial b \\ \partial^2\chi^2/\partial b\,\partial a & \partial^2\chi^2/\partial b^2 \end{pmatrix} \tag{16}$$
while $\mathbf p = (a,\ b)^T$ and $\mathbf c = (S_y,\ S_{xy})^T$ represent the column vectors of the parameter array and of the two constants on the right-hand side (rhs) of Eqs. (14), respectively. The solution of the matrix Eq. (15) is
$$\hat{\mathbf p} = H^{-1}\,\mathbf c \tag{17}$$
where $H^{-1}$ is evidently the inverse of the matrix $H$. In detail, the solution reads as
$$\hat a = \frac{S_{xx}\,S_y - S_x\,S_{xy}}{D}\,, \qquad \hat b = \frac{S_w\,S_{xy} - S_x\,S_y}{D} \tag{18}$$
where $D = \det H = S_w\,S_{xx} - S_x^2$. A unique solution is obtained when $D \neq 0$.
The statistical uncertainties of the two parameters of the fit may be obtained from the Taylor expansion of the function $\chi^2(a, b)$ in the vicinity of the minimum $(\hat a, \hat b)$. Given that (on account of the conditions of Eqs. (12)) the first derivatives identically vanish at the minimum, one obtains after retaining the first two terms in the expansion:
$$\chi^2(a, b) \approx \chi^2_{\min} + \frac{1}{2}\left[\frac{\partial^2\chi^2}{\partial a^2}\,\Delta a^2 + 2\,\frac{\partial^2\chi^2}{\partial a\,\partial b}\,\Delta a\,\Delta b + \frac{\partial^2\chi^2}{\partial b^2}\,\Delta b^2\right] \tag{19}$$
which implies that
$$\Delta\chi^2 \equiv \chi^2(a, b) - \chi^2_{\min} = \Delta\mathbf p^{T} H\,\Delta\mathbf p\,, \qquad \Delta\mathbf p = (a - \hat a,\ b - \hat b)^T \tag{20}$$
The covariance matrix of the fit takes the form
$$V = \hat\sigma^2\,H^{-1} = \frac{\hat\sigma^2}{D} \begin{pmatrix} S_{xx} & -S_x \\ -S_x & S_w \end{pmatrix} \tag{21}$$
The statistical uncertainties of the two parameters are obtained from the diagonal elements of the covariance matrix:
$$\delta\hat a = \sqrt{V_{11}} = \hat\sigma\,\sqrt{\frac{S_{xx}}{D}} \tag{22}$$
and
$$\delta\hat b = \sqrt{V_{22}} = \hat\sigma\,\sqrt{\frac{S_w}{D}} \tag{23}$$
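For convenience, the closed-form solution of Eqs. (18) and (22,23) is summarised in the following sketch (a minimal Python implementation; the function name and the plain-list input format are choices of this sketch, not of any established library):

```python
import math

def wlr(x, y, w):
    """Closed-form WLR of Eqs. (14)-(23): returns (a_hat, b_hat, da, db, chi2_min).

    The standard errors include the factor sigma_hat = sqrt(chi2_min / (N - 2))."""
    Sw  = sum(w)
    Sx  = sum(wi * xi for wi, xi in zip(w, x))
    Sy  = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = Sw * Sxx - Sx * Sx                        # determinant of H
    a = (Sxx * Sy - Sx * Sxy) / D                 # Eq. (18)
    b = (Sw * Sxy - Sx * Sy) / D
    chi2_min = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    sigma_hat = math.sqrt(chi2_min / (len(x) - 2))   # Eq. (10), NDF = N - 2
    da = sigma_hat * math.sqrt(Sxx / D)           # Eq. (22)
    db = sigma_hat * math.sqrt(Sw / D)            # Eq. (23)
    return a, b, da, db, chi2_min
```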
I shall next address two subtle issues. First, many authors associate integer multiples (e.g., 1, 2, 3, and so on) of the uncertainties of Eqs. (22,23) with the confidence intervals corresponding to the same number (e.g., 1, 2, 3, and so on) of σ’s in the normal distribution; the same applies to any requested new predictions (see next paragraph in this section). This approximation is poor when the results of the fit have been obtained from a small number of observations $N$. Although the aforementioned quantities $\delta\hat a$ and $\delta\hat b$ do represent the standard errors of the estimators $\hat a$ and $\hat b$, the extraction of the correct confidence intervals necessitates the application of the so-called t-multiplier $t_{1-\alpha/2,\,\mathrm{NDF}}$ (the quantity $\alpha$ is the (user-defined) threshold which is associated with the outset of statistical significance), i.e., a quantity which takes account of the size of the dataset which yielded the results of Eqs. (22,23), and which follows Student’s t-distribution for NDF DoFs (e.g., for $N-2$ DoFs, if two parameters are used in the fit), see Refs. [4] (pp. 78-80) and [5] (Chapter 3.6). Using the t-multiplier, the confidence intervals of the two parameters of the fit are expressed as
$$\hat a \pm t_{1-\alpha/2,\,N-2}\,\delta\hat a\,, \qquad \hat b \pm t_{1-\alpha/2,\,N-2}\,\delta\hat b \tag{24}$$
The importance of this correction for small samples is demonstrated in Table 12.
Listed in this table are a few values of the t-multiplier $t_{1-\alpha/2,\,N-2}$, a quantity which must be used in the evaluation of the confidence intervals of the parameters $\hat a$ and $\hat b$, as well as in that of the confidence intervals of any requested new predictions $\hat y_0$. This quantity depends on the significance level $\alpha$ and on the number of the input datapoints $N$. The corresponding values in number of σ’s in the normal distribution, which may be thought of as the limits of the t-multiplier when $N \to \infty$, are also cited. This table demonstrates that, if the multiples of the standard errors of the estimators $\hat a$ and $\hat b$ are taken to represent numbers of σ’s in the normal distribution, then (too) narrow confidence intervals are obtained (e.g., compare the second and last columns of this table). The confidence intervals, corresponding to the 2σ limits in the normal distribution, are close to those favoured by authors in several branches of Science (95% confidence level); in Physics, 1σ intervals are predominantly used. The values were obtained with the Excel methods NORMSDIST (CDF of the standard normal distribution) and TINV (inverse CDF of Student’s t-distribution).
# of σ’s | | |
---|---|---|
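As an illustration (hypothetical $N$ and $\alpha$; scipy is assumed to be available), the t-multiplier and its normal-distribution limit may be obtained as follows:

```python
from scipy.stats import norm, t

# Sketch: t-multiplier versus the normal-distribution limit, for a
# two-parameter fit to N (hypothetical) datapoints at significance level alpha.
alpha, N = 0.05, 10
t_mult = t.ppf(1.0 - alpha / 2.0, df=N - 2)   # Student's t, NDF = N - 2
z_mult = norm.ppf(1.0 - alpha / 2.0)          # limit of t_mult as N -> infinity
print(f"t-multiplier = {t_mult:.3f}, normal limit = {z_mult:.3f}")
# For small N the t-multiplier exceeds the normal value, i.e., the
# naive +/- k*sigma intervals are too narrow.
```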
I shall next touch upon the subject of forecasting, which, to my surprise, is not addressed in any of the books [3, 4, 5, 6, 7] in case of the WLR. Let us assume that the objective is to extract (from the results of the optimisation) a ‘prediction’ for the value of the dependent variable ($y_0$) at one specific value of the independent variable ($x_0$). Following the equation yielding the variance of the prediction error on p. 86 of Ref. [4], as well as the covariance matrix of the fit of Eq. (21) of this work, one obtains the expectation value $\hat y_0$ of the prediction and its standard error $\delta\hat y_0$ as follows:
$$\hat y_0 = \hat a + \hat b\,x_0 \tag{25}$$
and
$$\delta\hat y_0 = \hat\sigma\left[\frac{1}{w_0} + \frac{1}{S_w} + \frac{(x_0 - \bar x_w)^2}{\sum_i w_i\,(x_i - \bar x_w)^2}\right]^{1/2} \tag{26}$$
where the quantity $\bar x_w = S_x/S_w$ stands for the weighted mean of the $x$ values of the input datapoints and $w_0$ is (or would be) the (expected) statistical weight of a datapoint at $x_0$. It must be mentioned that Eq. (26) yields the standard error of a new prediction. The standard error of the fit does not contain the first term ($1/w_0$) within the brackets on the rhs of Eq. (26), see also Chapter 3.7 of Ref. [4] and Eqs. (3.6.4,3.6.5) in Chapter 3.6 of Ref. [5] (both assuming that $w_i = 1$ for all $i$, hence $w_0 = 1$).
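A sketch of Eqs. (25,26), under the same conventions as the WLR sketch above (plain-list inputs; the function name is a choice of this sketch, and the sums are repeated so that the block is self-contained):

```python
import math

def wlr_prediction(x, y, w, x0, w0):
    """Expectation value and standard error of a new prediction at x0 (Eqs. 25, 26).

    w0 is the expected statistical weight of a datapoint at x0."""
    Sw  = sum(w)
    Sx  = sum(wi * xi for wi, xi in zip(w, x))
    Sy  = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    D = Sw * Sxx - Sx * Sx
    a, b = (Sxx * Sy - Sx * Sxy) / D, (Sw * Sxy - Sx * Sy) / D
    chi2_min = sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    sigma_hat = math.sqrt(chi2_min / (len(x) - 2))
    xbar = Sx / Sw                                   # weighted mean of the x values
    Sxx_c = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    y0 = a + b * x0                                  # Eq. (25)
    dy0 = sigma_hat * math.sqrt(1.0 / w0 + 1.0 / Sw + (x0 - xbar) ** 2 / Sxx_c)  # Eq. (26)
    return y0, dy0
```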
I shall finalise this section by giving the relevant expressions in case that the theoretical straight line contains no intercept ($a = 0$); datasets relevant to this part may include force-displacement measurements following Hooke’s law, distance-velocity data of galaxies following the Lemaître-Hubble law, etc. In such a case, the minimisation function reads as
$$\chi^2(b) = \sum_{i=1}^{N} w_i\,(y_i - b\,x_i)^2 \tag{27}$$
The minimal value $\chi^2_{\min} = \chi^2(\hat b)$, the fitted value $\hat b = S_{xy}/S_{xx}$, whereas its variance
$$(\delta\hat b)^2 = \frac{\hat\sigma^2}{S_{xx}} \tag{28}$$
where $\hat\sigma^2 = \chi^2_{\min}/(N-1)$, as the linear model now contains only one parameter. Regarding the expectation value $\hat y_0$ of a new prediction at $x_0$ and its standard error $\delta\hat y_0$, Eq. (25) applies without the first term on the rhs (i.e., $\hat y_0 = \hat b\,x_0$), whereas Eq. (26) has now the form:
$$\delta\hat y_0 = \hat\sigma\left[\frac{1}{w_0} + \frac{x_0^2}{S_{xx}}\right]^{1/2} \tag{29}$$
A.2 The linear regression with constant (or ‘without’) weights
If all statistical weights are equal (to a constant $c$) in Section A.1, a case which is usually referred to in the standard literature in Statistics as (fulfilling) ‘homoscedasticity’, or if no weights are supplied (or are relevant) in a problem, all expressions of Section A.1 apply after simply replacing all $w_i$’s by $c$. It so happens that there is no dependence of any of the important quantities of the fit (i.e., of the optimal values of the parameters of the fit, of the standard errors of these parameters, as well as of any predictions and of their standard errors) on the value of the constant $c$. Only the minimisation function is $c$-dependent, but (as the important results of the optimisation - which also involve $\chi^2_{\min}$ - are ratios of quantities containing the same powers of $c$ in the numerators and denominators) the $c$-dependence is eliminated from the useful output. This special case of linear regression, which is generally known as ‘ordinary linear regression’ (OLR for short) or ‘ordinary linear least-squares optimisation’, provides (in the eyes of many) better insight into the nature of the various quantities. As a result, most contributions in the standard literature start the description of the methods of linear regression by treating the OLR case (and some do not even venture any further!).
I shall next cover some of the main features of the OLR, starting from the fitted value of the slope. The determinant $D$ is equal to $N\,S_{xx} - S_x^2 = N \sum_i (x_i - \bar x)^2$, which turns out to be simply the (biased) variance $\sigma_x^2$ of the input $x$ values multiplied by $N^2$: $D = N^2 \sigma_x^2$. From Eq. (18), one obtains for the fitted value of the slope:
$$\hat b = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} \tag{30}$$
Pearson’s correlation coefficient between two samples (of equal dimension $N$) corresponding to two quantities $x$ and $y$, defined by the formula
$$r = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_i (x_i - \bar x)^2\ \sum_i (y_i - \bar y)^2}} \tag{31}$$
is a measure of the linear correlation between these quantities. Using the first of Eqs. (14) with all statistical weights set to unity, one obtains
$$\hat a = \bar y - \hat b\,\bar x \tag{32}$$
Using this equation in order to replace $\hat a$, one obtains for the minimal value of the minimisation function:
$$\chi^2_{\min} = \sum_i (y_i - \bar y)^2 - \hat b^2 \sum_i (x_i - \bar x)^2 \tag{33}$$
Making use of Eqs. (30,31), one finally obtains
$$\frac{\chi^2_{\min}}{\sum_i (y_i - \bar y)^2} = 1 - r^2 \tag{34}$$
The quantity on the left-hand side of this equation is routinely interpreted as the part of the variance of the dependent variable which remains ‘unexplained’ (after the optimisation), whereas the fraction of the variance which is associated with the term $r^2$ is considered ‘explained’, in the sense that this fraction of the total variance of the dependent variable is understood as essentially originating from the variation of the independent variable. Of course, all these expressions are analogous to those obtained in the WLR after replacing the numerical measures ‘mean’, ‘variance’, and ‘correlation’ with their weighted counterparts.
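The identity of Eq. (34) is easily verified numerically; the following sketch uses synthetic data (all numbers are illustrative):

```python
import math, random

# Numerical check of Eq. (34): chi2_min / sum_i (y_i - ybar)^2 = 1 - r^2 in the OLR.
random.seed(1)
x = [float(i) for i in range(20)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
N = len(x)
xbar, ybar = sum(x) / N, sum(y) / N
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
b = Sxy / Sxx                        # Eq. (30)
a = ybar - b * xbar                  # Eq. (32)
chi2_min = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
r = Sxy / math.sqrt(Sxx * Syy)       # Eq. (31)
print(chi2_min / Syy, 1.0 - r * r)   # the two numbers coincide
```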
A.3 The WLR with meaningful measurement uncertainties only in one variable
At least so far as physical measurements are concerned, this is the most frequent, and hence useful case. In Physics, all measurements of physical quantities are expected to be accompanied by meaningful uncertainties: measurements of physical quantities without uncertainties might provide a general impression about the order of magnitude of effects, but are hardly of any use in statistical analyses. As a result, particular attention is paid in extracting such uncertainties for the physical quantities which are under experimental exploration, i.e., uncertainties which are indicative of the likelihood of the departure of the measured from the true values. Unlike most other branches of Science, Physics favours 1σ uncertainties in the normal distribution, representing a confidence level of about 68%.
In short, the pairs of observations $(x_i, y_i)$ are replaced in most Physics studies by triplets of observations $(x_i, y_i, \sigma_i)$, and the optimisation problem reduces to a WLR with statistical weights $w_i = 1/\sigma_i^2$. Had it not been for one additional consideration, the treatment of this subcategory would have been unremarkable. So, what makes this case special?
The fact that the measurement uncertainties $\sigma_i$ represent 1σ effects in the normal distribution implies that each standardised residual
$$r_i = \frac{y_i - \hat y_i}{\sigma_i} \tag{35}$$
is expected to follow the standard normal distribution; in the expression above, the quantity $\hat y_i$ represents the fitted value at $x_i$, obtained via the modelling as the case might be, in particular, via the linear model $\hat y_i = \hat a + \hat b\,x_i$ in this work. As each of the quantities under the sum in Eq. (11) follows (the square of) the standard normal distribution, $\chi^2$ is expected to follow the $\chi^2$ distribution with $N$ DoFs. Given that two parameters are used in the general problem of linear regression, the minimal value $\chi^2_{\min}$ is, in the context of this section, $\chi^2$-distributed with $N-2$ DoFs.
One of the obvious advantages of the use of meaningful uncertainties is that the quality of the fit (or, equivalently, the adequacy of the linear model to account for the input dataset) may be assessed on the basis of the p-value (i.e., the upper tail of the CDF of the relevant distribution; of the $\chi^2$ distribution in this case) corresponding to the result $\chi^2_{\min}$ for the given DoFs.
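Assuming scipy is available, the p-value in question is simply the upper tail of the $\chi^2$ CDF (the numbers below are hypothetical, for illustration only):

```python
from scipy.stats import chi2

# Sketch: assessing the quality of a fit with meaningful 1-sigma uncertainties.
chi2_min, ndf = 27.4, 18           # hypothetical fit result and DoFs
p_value = chi2.sf(chi2_min, ndf)   # upper tail of the chi-square CDF
print(f"p-value = {p_value:.4f}")  # compare with the chosen significance level
```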
When dealing with the subject of this section, there is one notable difference to the formalism developed earlier in order to treat the general-case WLR: it relates to the standard errors of the parameters of the fit, $\delta\hat a$ and $\delta\hat b$. Equations (22) and (23) may be rewritten as
$$\delta\hat a = \mathrm{BF}\,\sqrt{\frac{S_{xx}}{D}} \tag{36}$$
and
$$\delta\hat b = \mathrm{BF}\,\sqrt{\frac{S_w}{D}} \tag{37}$$
where
$$\mathrm{BF} = \sqrt{\frac{\chi^2_{\min}}{\mathrm{NDF}}} \tag{38}$$
The importance of the quantity BF of Eq. (38) in the problem of linear regression was (to the best of my knowledge) first addressed by Birge in 1932 [58]. Although BF should (in my opinion) be called ‘Birge factor’, the Particle Data Group (PDG) use instead the plain term ‘scale factor’ [40] in their compilations of physical constants, and recommend its application (in the context of this section) only when $\mathrm{BF} > 1$; if $\mathrm{BF} \leq 1$, the factor is omitted from Eqs. (36,37), the intention evidently being to prevent the decrease of the statistical uncertainties $\delta\hat a$ and $\delta\hat b$ when the input uncertainties have been overly generous (which may be considered a ‘rare event’ in Physics, yet this is another story). Being a particle physicist, I abide by (and have nothing against) the PDG recommendation, see also Appendix C.
Last but not least, if only meaningful uncertainties $\sigma_{x_i}$ are supplied in a problem, one may swap the roles of the quantities $x$ and $y$, and perform a WLR using the linear model: $x = a' + b'\,y$. The optimal values of the two parameters of the originally intended straight line could then be retrieved from the quantities $\hat a'$ and $\hat b'$ as follows: $\hat b = 1/\hat b'$, $\hat a = -\hat a'/\hat b'$.
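A sketch of the conditional application of the Birge factor, following the PDG recommendation as described above (function and argument names are choices of this sketch):

```python
import math

def apply_birge_factor(da, db, chi2_min, ndf, pdg_rule=True):
    """Rescale the fitted uncertainties by the Birge factor of Eq. (38).

    Following the PDG recommendation, the factor is applied only when BF > 1."""
    bf = math.sqrt(chi2_min / ndf)
    if pdg_rule and bf <= 1.0:
        return da, db          # overly generous input uncertainties: no rescaling
    return bf * da, bf * db
```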
A.4 The WLR with meaningful measurement uncertainties in both variables
My first effort towards obtaining a solution, when I first encountered such a problem in the mid-1980s, rested upon the use of the WLR method with statistical weights inversely proportional to the square of the product of the two uncertainties for each input datapoint. In that implementation, the statistical weight was defined by the formula:
$$w_i = \frac{C}{\left(\sigma_{x_i}\,\sigma_{y_i}\right)^2} \tag{39}$$
Given that the statistical weights of this equation fulfil (after fixing the constant $C$ appropriately) $\sum_i w_i = \sum_i \sigma_{y_i}^{-2}$, one arrives at a situation reminiscent of Appendix A.3 with redefined statistical weights, and could trick oneself into considering the minimal value $\chi^2_{\min}$ $\chi^2$-distributed with $N-2$ DoFs. Although the statistical weights sum up (by construction) to the same constant in the two cases, those given in Eq. (39) also take account of the uncertainties $\sigma_{x_i}$. Of course, one disadvantage of using the weights of Eq. (39) is that an input datapoint needs to be excluded if either of the two input uncertainties vanishes, whereas such an exclusion is necessary only if both input uncertainties vanish in case of the following two methods of this section.
Several methodologies were developed during the last decades for dealing with this problem in a rigorous manner (in particular, errors-in-variables models emerged with the principal scope of providing reliable solutions to this particular situation [61, 62, 63]). A recent article [65] lists and compares the most important of these methods. Following the results of his analysis, the author recommends the use of the so-called ‘Modified effective-variance method’ (EV2), in which parameter-dependent statistical weights are assigned to the input datapoints: in the general case of a model $y = f(x)$, each such weight is given by
$$w_i = \left[\sigma_{y_i}^2 + \left(\frac{\partial f}{\partial x}\bigg|_{x_i}\right)^{2} \sigma_{x_i}^2\right]^{-1} \tag{40}$$
which, in case of linear regression, takes the form
$$w_i = \frac{1}{\sigma_{y_i}^2 + b^2\,\sigma_{x_i}^2} \tag{41}$$
Owing to its obvious dependence on one of the parameters of the WLR, the application of a statistical weight of this form introduces non-linearity into a linear problem and calls for numerical minimisation, see Appendix B. I shall next elaborate on the emergence of the statistical weight of Eq. (41), by allowing myself to be guided by physical intuition.
Referring to Fig. 11, one may argue that the contribution of each input datapoint $(x_i, y_i)$ to the minimisation function must involve two quantities:
- the minimal distance $d_i$ of that datapoint to the ‘current’ (i.e., relating to a specific iteration of the numerical minimisation) straight line, and
- the input uncertainties $\sigma_{x_i}$ and $\sigma_{y_i}$.
One may argue that the aforementioned contribution depends on the relative largeness of the quantity $d_i$, judged in terms of a representative combined size of the two uncertainties $\sigma_{x_i}$ and $\sigma_{y_i}$. To assess this, one first obtains the coordinates of point $P$, representing the intersection of the ‘current’ straight line $y = a + b\,x$ and its orthogonal straight line passing through point $(x_i, y_i)$:
$$x_P = \frac{x_i + b\,(y_i - a)}{1 + b^2}\,, \qquad y_P = a + b\,x_P \tag{42}$$
and obtains the distance $d_i$ as follows:
$$d_i = \frac{\lvert y_i - a - b\,x_i \rvert}{\sqrt{1 + b^2}} \tag{43}$$
The two uncertainties $\sigma_{x_i}$ and $\sigma_{y_i}$ may be interpreted as representing different effects: $\sigma_{y_i}$ is directly associated with the statistical effects regarding the specific observation, whereas $\sigma_{x_i}$ may be taken to represent systematic effects, i.e., effects which are induced on $y_i$ as the result of the variation of the independent variable. Projected on the orthogonal direction, the systematic component of the uncertainty is equal to
$$\sigma_i^{\mathrm{sys}} = \frac{\lvert b \rvert\,\sigma_{x_i}}{\sqrt{1 + b^2}} \tag{44}$$
whereas the statistical component of the uncertainty yields
$$\sigma_i^{\mathrm{stat}} = \frac{\sigma_{y_i}}{\sqrt{1 + b^2}} \tag{45}$$
of course, both components refer to the direction orthogonal to the ‘current’ straight line. As these two uncertainties are independent, the strict application of Gauss’ law of error propagation suggests the use of the combined uncertainty in the form
$$\sigma_{\perp i} = \sqrt{\left(\sigma_i^{\mathrm{stat}}\right)^2 + \left(\sigma_i^{\mathrm{sys}}\right)^2} = \sqrt{\frac{\sigma_{y_i}^2 + b^2\,\sigma_{x_i}^2}{1 + b^2}} \tag{46}$$
Therefore, the application of Gauss’ law suggests the summation of the two uncertainties of Eqs. (44,45) in quadrature, leading to the minimisation function
$$\chi^2(a, b) = \sum_{i=1}^{N} \frac{d_i^2}{\sigma_{\perp i}^2} = \sum_{i=1}^{N} w_i\,(y_i - a - b\,x_i)^2 \tag{47}$$
with $w_i$ given by Eq. (41).
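Since the weights of Eq. (41) depend on the slope $b$, the minimisation must proceed numerically (cf. Appendix B). A minimal sketch, using scipy’s Nelder-Mead minimiser as a stand-in for MINUIT’s SIMPLEX (the data arrays are illustrative placeholders):

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the EV2 ('modified effective-variance') fit of Eq. (47):
# parameter-dependent weights require numerical minimisation.
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sx = np.full_like(x, 0.1)     # 1-sigma uncertainties in x
sy = np.full_like(y, 0.2)     # 1-sigma uncertainties in y

def chi2_ev2(p):
    a, b = p
    return np.sum((y - a - b * x) ** 2 / (sy ** 2 + (b * sx) ** 2))  # Eq. (47)

# Several starting points guard against local minima (cf. Appendix B).
res = minimize(chi2_ev2, x0=[0.0, 1.0], method="Nelder-Mead")
print(res.x, res.fun)   # fitted (a, b) and chi2_min
```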
To cope with underestimated uncertainties (which, unfortunately, cannot be considered a sporadic phenomenon in Physics), many physicists favour the linear summation of the uncertainties corresponding to statistical and systematic effects. For the sake of example (taken from Particle Physics), two experimental groups measured (over two decades ago) the strong-interaction shift [66, 67] and the total decay width [66, 68] of the ground state in pionic hydrogen, utilising the same low-energy pion beamline at the Paul Scherrer Institute (PSI), the same degrading device (cyclotron trap), similar crystal-spectrometer systems, and measurements corresponding (predominantly) to the same X-ray transition. In spite of the obvious similarities, the earlier collaboration (the ETHZ-Neuchâtel-PSI Collaboration) recommended the linear summation of their statistical and systematic uncertainties, whereas the later one (the Pionic Hydrogen Collaboration) favoured the quadratic summation of their corresponding uncertainties. In essence, there is no conflict between these two choices: some researchers choose to ‘err on the side of caution’ (use of the $L_1$ norm when combining statistical and systematic effects), whereas others assume a somewhat bolder stance (use of the $L_2$ norm).
In the context of this section, the linear summation of the uncertainties corresponding to statistical and systematic effects yields:
$$w_i = \frac{1}{\left(\sigma_{y_i} + \lvert b \rvert\,\sigma_{x_i}\right)^2} \tag{48}$$
A.5 The WLR with two independent variables
Numerous articles in the literature have treated the problem of multiple linear regression. In this section, I shall detail the solution to the problem involving two independent variables $u$ and $v$, which is the simplest case in multiple linear regression. Assuming the linear relation $y = a + b\,u + c\,v$, the minimisation function reads as
$$\chi^2(a, b, c) = \sum_{i=1}^{N} w_i\,(y_i - a - b\,u_i - c\,v_i)^2 \tag{49}$$
The observation set now comprises quadruplets $(u_i, v_i, y_i, w_i)$, reducing to triplets $(u_i, v_i, y_i)$ for constant statistical weights $w_i$, in which case the WLR with two independent variables becomes ordinary. Using the notation of Appendix A.1, one obtains
$$\hat{\mathbf p} = H^{-1}\,\mathbf c \tag{50}$$
where $H^{-1}$ is evidently the inverse of the symmetrical matrix
$$H = \begin{pmatrix} S_w & S_u & S_v \\ S_u & S_{uu} & S_{uv} \\ S_v & S_{uv} & S_{vv} \end{pmatrix} \tag{51}$$
and the column arrays $\mathbf p$ and $\mathbf c$ are given by
$$\mathbf p = (a,\ b,\ c)^{T} \tag{52}$$
and
$$\mathbf c = (S_y,\ S_{uy},\ S_{vy})^{T} \tag{53}$$
After a few algebraic operations, one may obtain the inverse of the matrix $H$ as
$$H^{-1} = \frac{1}{D} \begin{pmatrix} S_{uu} S_{vv} - S_{uv}^2 & S_v S_{uv} - S_u S_{vv} & S_u S_{uv} - S_v S_{uu} \\ S_v S_{uv} - S_u S_{vv} & S_w S_{vv} - S_v^2 & S_u S_v - S_w S_{uv} \\ S_u S_{uv} - S_v S_{uu} & S_u S_v - S_w S_{uv} & S_w S_{uu} - S_u^2 \end{pmatrix} \tag{54}$$
where $D = \det H$.
The diagonal elements of the matrix $H^{-1}$ yield the standard errors of the parameters of the fit as follows:
$$\delta\hat a = \hat\sigma\,\sqrt{(H^{-1})_{11}}\,, \qquad \delta\hat b = \hat\sigma\,\sqrt{(H^{-1})_{22}}\,, \qquad \delta\hat c = \hat\sigma\,\sqrt{(H^{-1})_{33}} \tag{55}$$
where $\hat\sigma^2$ has been obtained after using $\mathrm{NDF} = N - 3$, as the general linear model now contains three parameters.
Following again the procedure set forth in Chapter 3.7 of Ref. [4], one obtains (after some effort) the expectation value $\hat y_0$ of the prediction and its standard error $\delta\hat y_0$ corresponding to $u_0$ and $v_0$ as follows:
$$\hat y_0 = \hat a + \hat b\,u_0 + \hat c\,v_0 \tag{56}$$
and
$$\delta\hat y_0 = \hat\sigma\left[\frac{1}{w_0} + \mathbf x_0^{T} H^{-1}\,\mathbf x_0\right]^{1/2} \tag{57}$$
where $\mathbf x_0 = (1,\ u_0,\ v_0)^{T}$; the statistical weight $w_0$ has been explained in Section A.1. Finally, the weighted version of Pearson’s correlation coefficient reads as
$$r_w = \frac{\sum_i w_i\,(x_i - \bar x_w)(y_i - \bar y_w)}{\sqrt{\sum_i w_i\,(x_i - \bar x_w)^2\ \sum_i w_i\,(y_i - \bar y_w)^2}} \tag{58}$$
where $\bar x_w$ and $\bar y_w$ denote the weighted means.
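A compact numerical rendering of Eqs. (49)-(55) follows (a numpy sketch; the normal equations are solved directly, without forming Eq. (54) explicitly):

```python
import numpy as np

def wlr_two_vars(u, v, y, w):
    """WLR with two independent variables (Eqs. 49-55): y = a + b*u + c*v.

    Returns the fitted parameters (a, b, c) and their standard errors."""
    u, v, y, w = map(np.asarray, (u, v, y, w))
    X = np.column_stack([np.ones_like(u), u, v])
    H = X.T @ (w[:, None] * X)             # the matrix of Eq. (51)
    c = X.T @ (w * y)                      # the column array of Eq. (53)
    p = np.linalg.solve(H, c)              # Eq. (50)
    chi2_min = np.sum(w * (y - X @ p) ** 2)
    sigma2 = chi2_min / (len(y) - 3)       # NDF = N - 3
    errs = np.sqrt(sigma2 * np.diag(np.linalg.inv(H)))   # Eq. (55)
    return p, errs
```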
For the sake of completeness, I shall finalise this section by giving the relevant expressions in case that the theoretical straight line is expected to contain no intercept ($a = 0$). In such a case, the minimisation function reads as
$$\chi^2(b, c) = \sum_{i=1}^{N} w_i\,(y_i - b\,u_i - c\,v_i)^2 \tag{59}$$
The minimal value $\chi^2_{\min} = \chi^2(\hat b, \hat c)$, whereas the fitted values of the parameters $b$ and $c$ are as follows:
$$\hat b = \frac{S_{vv}\,S_{uy} - S_{uv}\,S_{vy}}{D'}\,, \qquad \hat c = \frac{S_{uu}\,S_{vy} - S_{uv}\,S_{uy}}{D'} \tag{60}$$
where the determinant $D' = S_{uu}\,S_{vv} - S_{uv}^2$. Finally, the standard errors of the parameters of the fit are obtained from the equations:
$$\delta\hat b = \hat\sigma\,\sqrt{\frac{S_{vv}}{D'}} \tag{61}$$
and
$$\delta\hat c = \hat\sigma\,\sqrt{\frac{S_{uu}}{D'}} \tag{62}$$
where $\hat\sigma^2 = \chi^2_{\min}/(N-2)$, as the linear model now contains two parameters. Regarding the expectation value $\hat y_0$ of a new prediction at $u_0$ and $v_0$, as well as its standard error $\delta\hat y_0$, Eq. (56) applies without the first term on the rhs (i.e., $\hat y_0 = \hat b\,u_0 + \hat c\,v_0$), whereas Eq. (57) now reads:
$$\delta\hat y_0 = \hat\sigma\left[\frac{1}{w_0} + \frac{S_{vv}\,u_0^2 - 2\,S_{uv}\,u_0 v_0 + S_{uu}\,v_0^2}{D'}\right]^{1/2} \tag{63}$$
Appendix B Numerical minimisation
The numerical minimisation can be achieved by means of a plethora of commercial and non-commercial products. In my projects, I have always relied on the MINUIT software package [69] of the CERN library (FORTRAN and C++ versions). In this work, the FORTRAN version was used in order that the analytical solutions, obtained after the C++ implementation of nearly all methods of Appendix A, be validated. The treatment of the two methods of Appendix A.4, which feature parameter-dependent weights , was exclusively covered by MINUIT. (The same goes for the fit leading to Fig. 1.) Regarding my MINUIT-based applications, each optimisation is achieved by following the sequence: SIMPLEX, MINIMIZE, MIGRAD, and MINOS. The calls to the last two methods involve the high-level strategy of the numerical minimisation.
- SIMPLEX uses the simplex method of Nelder and Mead [70].
- MINIMIZE calls MIGRAD, but reverts to SIMPLEX if MIGRAD fails to converge.
- MIGRAD, undoubtedly the warhorse of the MINUIT software package, is a variable-metric method, based on the Davidon-Fletcher-Powell algorithm. The method checks for the positive-definiteness of the Hessian matrix.
- MINOS carries out a detailed error analysis (separately for each parameter), taking into account all correlations among the model parameters.
All aforementioned methods admit one optional argument, fixing the maximal number of calls to each method. If this limit is reached, the corresponding method is terminated (by MINUIT, internally) regardless of whether or not that method had converged. To have confidence in the results of the optimisation, it is recommended to the user to always inspect the (copious) MINUIT output, so that the convergence of the methods and the successful termination of the application be ascertained, or any setback during the execution of the application (e.g., lack of convergence of some MINUIT methods, erroneous evaluation of the Hessian matrix, large estimated vertical distance to the minimum, etc.) be discovered and resolved.
Although MINUIT is robust, some attention is required in order to avoid cases where the application might get trapped in the ‘valley’ of a local minimum. This is not unlikely to happen in non-linear problems, e.g., when using the two methods of Section A.4 of this work, featuring parameter-dependent weights. There are several ways to safeguard against this possibility. For instance, one may repeat the optimisation using different initial guesses for the parameter values (thus modifying the point from where MINUIT starts the minimisation of the user-defined function), broaden the initial guesses for the uncertainties of the parameters of the fit, or simply restrict the parameter space by introducing lower and/or upper bounds to some or to all parameters. To cope with this issue, the MINUIT developers also offer the (time-consuming) method SCAN, which scans the parameter space for an optimal starting point in the minimisation problem.
Finally, it must be borne in mind that the MINOS uncertainties for the parameters of the fit do not contain the Birge factor of Eq. (38). Consequently, it is up to the user to programme the user-defined function in such a way that the Birge factor be applied to the uncertainties (e.g., if $\mathrm{BF} > 1$), or be omitted (e.g., if $\mathrm{BF} \leq 1$ while the PDG recommendation is followed). Similarly, it is the user’s responsibility to obtain and apply the t-multiplier to the MINOS uncertainties.
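For readers without access to the CERN library, the same SIMPLEX-MIGRAD-MINOS sequence is exposed by the iminuit Python frontend to MINUIT; the following sketch (with placeholder data, and not the author’s actual setup) illustrates the flow:

```python
import numpy as np
from iminuit import Minuit

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.1, 3.9, 5.2])
sy = np.full_like(y, 0.1)    # meaningful 1-sigma uncertainties in y

def chi2(a, b):
    return np.sum(((y - a - b * x) / sy) ** 2)

m = Minuit(chi2, a=0.0, b=1.0)
m.errordef = Minuit.LEAST_SQUARES   # 1-sigma contour at chi2_min + 1
m.simplex()
m.migrad()      # variable-metric minimisation
m.minos()       # detailed (asymmetric) error analysis
print(m.values, m.errors)   # note: MINOS errors contain no Birge factor (see text)
```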
Appendix C On the application of the Birge factor to the uncertainties of the parameters of a fit
When no suspicion can be cast on the modelling in a problem, and the uncertainties of the input datapoints are meaningful (see beginning of Appendix A), the minimisation function is expected to follow the $\chi^2$ distribution with $N - k$ DoFs ($k$ being the number of parameters used in the fit). As a result, the expectation value of the minimisation function at the minimum should be equal to NDF, which implies that the expectation value of the reduced $\chi^2$, i.e., of the ratio $\chi^2_{\min}/\mathrm{NDF}$, to be identified with the square of the Birge factor BF, is equal to $1$. This property of $\chi^2$-distributed quantities has led most physicists to the subliminal use of the approximate criterion $\chi^2_{\min}/\mathrm{NDF} \approx 1$ as a ‘rule of thumb’ when assessing the quality of fits.
Glossing over the arbitrariness in the interpretation of an approximate equality, this practice could be meaningful in case of large samples; however, it is misleading in case of small ones. Nevertheless, even for large samples, the criterion might provide a general impression about the quality of a fit, yet it represents no rigorous measure of that quality. For instance, does a reduced $\chi^2$ moderately above $1$, obtained with a large number of DoFs, indicate a satisfactory or an unsatisfactory fit at the chosen significance level? Although one might be tempted to consider such a fit ‘satisfactory’, the associated p-value can easily drop below the significance level of this work, suggesting that the fit is (in the strict statistical sense) anything but satisfactory.
The correct assessment of the quality of fits rests upon the use of the p-value which is associated with the distribution for the given NDF. To this end, a number of software implementations of dedicated algorithms are available, e.g., see Refs. [71] (Chapter on ‘Gamma Function and Related Functions’) and [72], the routine PROB of the FORTRAN implementation of the CERN software library (which, unlike most other CERNLIB routines, is available only in single-precision floating-point format), the functions CHIDIST or CHISQ.DIST.RT of Microsoft Excel, the function chi2cdf of MATLAB, etc.
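For instance (with hypothetical numbers in the spirit of the example above):

```python
from scipy.stats import chi2

# Illustration (hypothetical numbers): a reduced chi2 of 1.3 for 100 DoFs
# may look 'satisfactory' by the rule of thumb, yet its p-value is small.
chi2_min, ndf = 130.0, 100
print(chi2.sf(chi2_min, ndf))   # approx 0.024, below a 5% significance level
```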
A few words regarding the application of the Birge factor are due. For the sake of argument, let me assume that the result of a fit exceeds expectation, i.e., $\chi^2_{\min} > \mathrm{NDF}$ (equivalently, $\mathrm{BF} > 1$). So far as physical observations, accompanied by measurement uncertainties of the input datapoints, are concerned, the quantity BF may be interpreted as the amount by which the input uncertainties need to be rescaled, so that the new reduced $\chi^2$ becomes equal to its expectation value of $1$. If, on the contrary, the fit is already ‘satisfactory’ (i.e., when the ratio $\chi^2_{\min}/\mathrm{NDF}$ does not exceed $1$), the PDG maintains that the rescaling of the uncertainties of the input datapoints is not called for.