LAWS OF ERROR]
PROBABILITY


of an exploding shell so far as to know the distance of each mark measured (from an origin) along a right line, say the line of an extended fortification, and it was known that the shell was fired perpendicular to the fortification from a distant ridge parallel to the fortification, and that the shell was of a kind of which the fragments are scattered according to a normal law[1] with a known coefficient of dispersion; the question is at what position on the distant ridge was the enemy's gun probably placed? By received principles the probability, say P, that the given set of observations should have resulted from measuring (or aiming at) an object of which the real position was between x and x + ∆x is

$$\Delta x \, J \exp\{-[(x - x_1)^2 + (x - x_2)^2 + \dots + (x - x_n)^2]/c^2\};$$

where J is a constant obtained by equating to unity the integral of this expression over all values of x (since the given set of observations must have resulted from some position on the axis of x). The value of x, from which the given set of observations most probably resulted, is obtained by making P a maximum. Putting $dP/dx = 0$, we have for the maximum ($d^2P/dx^2$ being negative for this value) the arithmetic mean of the given observations. The accuracy of the determination is measured by a probability-curve with modulus $c/\sqrt{n}$. Thus in the course of a very long siege, if every case in which the given group of shell-marks $x_1, x_2, \dots, x_n$ was presented could be investigated, it would be found that the enemy's cannon was fired from the position x′, the (point right opposite to the) arithmetic mean of $x_1, x_2, \dots, x_n$, with a frequency assigned by the equation

$$z = \frac{\sqrt{n}}{\sqrt{\pi}\,c} \exp[-n(x - x')^2/c^2].$$
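The result just stated may be checked numerically. The following is a minimal sketch only: the function names, the four shell-mark distances and the value of the modulus are illustrative assumptions, not taken from the text. It scans a grid of positions for the greatest value of log P and compares the result with the arithmetic mean, and it computes the modulus $c/\sqrt{n}$ of the determination.

```python
import math

def log_p(x, marks, c):
    """Logarithm of P up to an additive constant: -sum((x - x_i)^2) / c^2."""
    return -sum((x - xi) ** 2 for xi in marks) / c ** 2

def most_probable_position(marks):
    """Setting dP/dx = 0 gives the arithmetic mean of the marks."""
    return sum(marks) / len(marks)

def modulus_of_determination(c, n):
    """Modulus of the probability-curve for the determination: c / sqrt(n)."""
    return c / math.sqrt(n)

# Hypothetical shell-mark distances and an assumed modulus (illustration only).
marks = [12.0, 15.5, 9.0, 13.5]
c = 4.0

x_hat = most_probable_position(marks)

# Brute-force check: scan a grid from 0 to 30 in steps of 0.01
# and keep the position at which log P is greatest.
grid = [i / 100 for i in range(0, 3001)]
x_grid = max(grid, key=lambda x: log_p(x, marks, c))

print(x_hat, x_grid, modulus_of_determination(c, len(marks)))
```

With these assumed figures the grid search and the closed form should agree, and the modulus of the determination is half that of a single observation, there being four marks.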

The reasoning is applicable without material modification to the case in which the data and the quaesitum are not absolute quantities, but proportions; for instance, given the percentage of white balls in several large batches drawn at random from an immense urn containing black and white balls, to find the percentage of white balls in the urn—the inverse problem associated with the name of Bayes.

131. Simple as this solution is, it is not the one which has most recommended itself to Laplace. He envisages the quaesitum not so much as that point which is most probably the real one, as that point which may most advantageously be put for the real one. In our illustration it is as if it were required to discover from a number of shot-marks not the point[2] which in the course of a long siege would be most frequently the position of the cannon which had scattered the observed fragments but the point which it would be best to treat as that position (to fire at, say, with a view of silencing the enemy's gun), having regard not so much to the frequency with which the direction adopted is right, as to the extent to which it is wrong in the long run. As the measure of the detriment of error, Laplace[3] takes “la valeur moyenne de l'erreur à craindre,” the mean first power of the errors taken positively on each side of the real point. The mean square of errors is proposed by Gauss as the criterion.[4] Any mean power indeed, the integral of any function which increases in absolute magnitude with the increase of its variable, taken as the measure of the detriment, will lead to the same conclusion, if the normal law prevails.[5]
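The agreement of the two criteria under the normal law may be illustrated by a rough numerical sketch. The centre, the modulus, the candidate points and the midpoint-rule integration below are all assumptions made for the illustration. For each candidate point the mean first power (Laplace) and the mean square (Gauss) of the error are approximated, and the candidate of least detriment is selected under each criterion.

```python
import math

def normal_density(x, centre, modulus):
    """Normal law of error with the given modulus: exp(-(x-centre)^2/c^2)/(sqrt(pi) c)."""
    return math.exp(-((x - centre) / modulus) ** 2) / (math.sqrt(math.pi) * modulus)

def mean_detriment(a, centre, modulus, power, half_width=8.0, steps=4000):
    """Midpoint-rule approximation to the integral of |x - a|^power times the density."""
    lo = centre - half_width * modulus
    h = 2 * half_width * modulus / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(x - a) ** power * normal_density(x, centre, modulus) * h
    return total

def best_point(centre, modulus, power, candidates):
    """Candidate point for which the mean detriment is least."""
    return min(candidates, key=lambda a: mean_detriment(a, centre, modulus, power))

# Assumed determination: centre 12.5 with modulus 2 (illustrative values only).
centre, modulus = 12.5, 2.0
candidates = [centre + k * 0.25 for k in range(-8, 9)]

laplace_choice = best_point(centre, modulus, 1, candidates)  # mean first power
gauss_choice = best_point(centre, modulus, 2, candidates)    # mean square

print(laplace_choice, gauss_choice)
```

Both criteria should single out the centre of the curve. At that point the mean square approximates half the square of the modulus, and the mean first power approximates the modulus divided by $\sqrt{\pi}$.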

132. Yet another speculative difficulty occurs in the simplest, and recurs in the more complicated inverse problem. In putting P as the probability, deduced from the observations that the real point for which they stand is x (between x and x + ∆x), it is tacitly assumed that prior to observation one value of x is as probable as another. In our illustration it must be assumed that the enemy's gun was as likely to be at one point as another of (a certain tract of) the ridge from which it was fired. If, apart from the evidence of the shell-marks, there was any reason for thinking that the gun was situated at one point rather than another, the formula would require to be modified. This a priori probability is sometimes grounded on our ignorance; according to another view, the procedure is justified by a rough general knowledge that over a tract of x for which P is sensible one value of x occurs about as often as another.[6]
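How the formula would require to be modified may be sketched under one particular assumption, which is not made in the text: that, apart from the shell-marks, the gun was thought to lie about a point `prior_centre` according to a normal law with modulus `prior_modulus`. The two exponentials then multiply, and the most probable position becomes a weighted mean of the a priori point and the observations. All names and figures below are hypothetical.

```python
def most_probable_with_prior(marks, c, prior_centre, prior_modulus):
    """Maximum of exp(-(x - m0)^2 / c0^2) * exp(-sum((x - x_i)^2) / c^2).

    The normal a priori law is an assumption made for this illustration only.
    """
    w_prior = 1.0 / prior_modulus ** 2   # weight of the a priori point
    w_obs = 1.0 / c ** 2                 # weight of each observation
    return (w_prior * prior_centre + w_obs * sum(marks)) / (w_prior + w_obs * len(marks))

# Hypothetical shell-marks and modulus.
marks = [12.0, 15.5, 9.0, 13.5]
c = 4.0

# An a priori presumption as precise as a single observation, centred at 20.
informed = most_probable_with_prior(marks, c, prior_centre=20.0, prior_modulus=4.0)

# A very diffuse presumption: one value of x nearly as probable as another.
nearly_flat = most_probable_with_prior(marks, c, prior_centre=20.0, prior_modulus=1e6)

print(informed, nearly_flat)
```

As the a priori modulus is made very great the result falls back upon the arithmetic mean of the marks, which is the case tacitly assumed in the simple solution.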

133. Subject to similar speculative difficulties, the solution which has been obtained may be extended to the analogous problem in which the quaesitum is not the real value of an observed magnitude, but the mean to which a series of statistics indefinitely prolonged converges.[7]

134. Next, let the modulus, still supposed given, not be the same for all the observations, but $c_1$ for $x_1$, $c_2$ for $x_2$, &c. Then P becomes proportional to

$$\exp\{-[(x - x_1)^2/c_1^2 + (x - x_2)^2/c_2^2 + \dots]\}.$$

And the value of x which is both the most probable and the “most advantageous” is $(x_1/c_1^2 + x_2/c_2^2 + \dots)/(1/c_1^2 + 1/c_2^2 + \dots)$; each observation being weighted with the inverse mean square of observations made under similar conditions.[8] This is the rule prescribed by the “method of least squares”; but as the rule in this case has been deduced by genuine inverse probability, the problem does not exemplify what is most characteristic in that method, namely, that a rule deducible from the hypothesis that the errors of observations obey the normal law of error is employed in cases where the normal law is not known, or even is known not, to hold good. For example, let the curve of error for each observation be of the form of

$$z = \frac{1}{\sqrt{\pi}\,c} \exp[-x^2/c^2 - 2j(x/c - 2x^3/3c^3)],$$

where j is a small fraction, so that z may equally well be equated to $\frac{1}{\sqrt{\pi}\,c}[1 - 2j(x/c - 2x^3/3c^3)] \exp(-x^2/c^2)$, a law which is actually very prevalent. Then, according to the genuine inverse method, the most probable value of x is given by the quadratic equation $\frac{d}{dx}\log P = 0$, where $\log P = \text{const.} - \sum (x - x_r)^2/c_r^2 - \sum 2j[(x - x_r)/c_r - 2(x - x_r)^3/3c_r^3]$, $\sum$ denoting summation over all the observations. According to the “method of least squares,” the solution is the weighted arithmetic mean of the observations, the weight of any observation being inversely proportional to the corresponding mean square, i.e. $c_r^2/2$ (the terms of the integral which involve j vanishing), which would be the solution if the j's were all zero. We put for the solution of the given case what is known to be the solution of an essentially different case. How can this paradox be justified?
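The two solutions here contrasted may be compared numerically. The sketch below assumes that log P has the form const. − ∑(x − x_r)²/c_r² − ∑2j[(x − x_r)/c_r − 2(x − x_r)³/3c_r³], in agreement with the curve of error given above. The observations, the moduli, the value of j and the function names are hypothetical. The quadratic equation is solved directly, and of its two roots that nearer to the weighted arithmetic mean is taken, which is a choice made for this illustration.

```python
import math

def weighted_mean(obs, moduli):
    """Least-squares solution: each observation weighted by 1 / c_r^2."""
    weights = [1.0 / c ** 2 for c in moduli]
    return sum(w * x for w, x in zip(weights, obs)) / sum(weights)

def score(x, obs, moduli, j):
    """-(1/2) d(log P)/dx for the assumed law; it vanishes at the most probable x."""
    return sum((x - xr) / c ** 2 + j * (1.0 / c - 2.0 * (x - xr) ** 2 / c ** 3)
               for xr, c in zip(obs, moduli))

def inverse_solution(obs, moduli, j):
    """Root of the quadratic score(x) = 0 lying nearest the weighted mean."""
    s1 = sum(1.0 / c for c in moduli)
    s2 = sum(1.0 / c ** 2 for c in moduli)
    s3 = sum(1.0 / c ** 3 for c in moduli)
    t2 = sum(xr / c ** 2 for xr, c in zip(obs, moduli))
    t3 = sum(xr / c ** 3 for xr, c in zip(obs, moduli))
    u3 = sum(xr ** 2 / c ** 3 for xr, c in zip(obs, moduli))
    # score(x) = a x^2 + b x + c0, obtained by expanding the sums above.
    a = -2.0 * j * s3
    b = s2 + 4.0 * j * t3
    c0 = -t2 + j * s1 - 2.0 * j * u3
    if a == 0.0:                      # j = 0: the equation is linear
        return -c0 / b
    disc = b * b - 4.0 * a * c0
    if disc < 0.0:
        raise ValueError("no real root; j is too large for this sketch")
    roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1.0, -1.0)]
    wm = weighted_mean(obs, moduli)
    return min(roots, key=lambda r: abs(r - wm))

# Hypothetical observations with their moduli, and a small assumed j.
obs = [10.0, 12.0, 15.0]
moduli = [2.0, 2.0, 4.0]

least_squares = weighted_mean(obs, moduli)
genuine_inverse = inverse_solution(obs, moduli, j=0.05)

print(least_squares, genuine_inverse)
```

For a small j the two values lie close together but do not coincide; with j put equal to zero the function returns the weighted arithmetic mean itself.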

135. Many of the answers which have been given to this question seem to come to this. When the data are unmanageable, it is legitimate to attend to a part thereof, and to determine the most probable (or the “most advantageous”) value of the quaesitum, and the degree of its accuracy, from the selected portion of the data as if it formed the whole. This throwing overboard of part of the data in order to utilize the remainder has often to be resorted to in the rough course of applied probabilities. Thus an insurance office only takes account of the age and some other simple attributes of its customers, though a better bargain might be made in particular cases by taking into account all available details. The nature of the method is particularly clear in the case where the given set of observations consists of several batches, the observations in any batch ranging under the same law of frequency with mean $x_r$ and mean square of error $k_r$, the function and the constants being different for different batches; then if we confine our attention to those parts of the data which are of the type $x_r$ and $k_r$ (ignoring what else may be given as to the laws of error) we may treat the $x_r$'s as so many observations, each ranging under the normal law of error with its coefficient of dispersion; and apply the rules proper to the normal law. Those rules applied to the data, considered as a set of derivative observations (each formed by averaging a batch of the original observations), give as the most probable (and also the most advantageous) combination of the observations the arithmetic mean weighted according to the inverse mean square pertaining to each observation, and for the law of the error to which the determination is liable the normal law with standard deviation[9] √(∑k/n), the very rules that are prescribed by the method of least squares.
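The rule may be tried upon observations which do not obey the normal law. The sketch below is an illustration under assumptions of its own: errors are drawn from a uniform law between −1 and +1, of which the mean square k is 1/3; the true point, the number of trials and the seed are arbitrary. It compares the observed standard deviation of the arithmetic mean of n observations with the value √(k/n) given in the note for observations of equal weight.

```python
import math
import random

def simulate_mean_error(n, trials, seed=0):
    """Root-mean-square error of the arithmetic mean of n uniform (non-normal) errors."""
    rng = random.Random(seed)          # fixed seed so that the run is repeatable
    true_point = 5.0                   # hypothetical real position
    squared_errors = []
    for _ in range(trials):
        sample = [true_point + rng.uniform(-1.0, 1.0) for _ in range(n)]
        error = sum(sample) / n - true_point
        squared_errors.append(error * error)
    return math.sqrt(sum(squared_errors) / trials)

k = 1.0 / 3.0                          # mean square of a uniform error on (-1, 1)
n = 100

observed = simulate_mean_error(n, trials=2000)
predicted = math.sqrt(k / n)           # standard deviation for equal weights

print(observed, predicted)
```

The two figures should agree to within a few per cent, although nothing in the law of error employed is normal except, approximately, the distribution of the mean.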

136. The principle involved might be illustrated by the proposal to make the economy of data a little less rigid: to utilize, not indeed all, but a little more of our materials, namely not only the mean square of error for each batch, but also the mean cube of error. To begin with the simple case of a single homogeneous batch: suppose that in our example the fragments of the shell are no longer scattered according to the normal law. By the method of least squares it would still be proper to put the arithmetic mean of the given observations for the true point required, and to measure the accuracy of that determination by a probability-curve of which the modulus is √(2k), where k is the mean square of deviation (of fragments from their mean). If it is thought desirable to utilize more of the data that are available, the proposition that the arithmetic mean of a


  1. If normally in any direction indifferently according to the two- or three-dimensioned law of error, then normally in one dimension when collected and distributed in belts perpendicular to a horizontal right line, as in the example cited below, par. 155.
  2. Or small interval (cf. preceding section).
  3. “Toute erreur soit positive soit négative doit être considérée comme un désavantage ou une perte réelle à un jeu quelconque,” Théorie analytique, art. 20 seq., especially art. 25. As to which it is acutely remarked by Bravais (op. cit. p. 258), “Cette règle simple laisse à désirer une démonstration rigoureuse, car l'analogie du cas actuel avec celui des jeux de hasard est loin d'être complète.”
  4. Theoria combinationis, pt. i. § 6. Simon Newcomb is conspicuous by walking in the way of Laplace and Gauss in his preference of the most advantageous to the most probable determinations. With Gauss he postulates that “the evil of an error is proportioned to the square of its magnitude” (American Journal of Mathematics, vol. viii. No. 4).
  5. As argued by the present writer, Camb. Phil. Trans. (1885), vol. xiv. pt. ii. p. 161. Cf. Glaisher, Mem. Astronom. Soc. xxxix. 108.
  6. The view taken by the present writer on the “Philosophy of Chance,” in Mind (1880; approved by Professor Pearson, Grammar of Science, 2nd ed. p. 146). See also “A priori Probabilities,” Phil. Mag. (Sept. 1884), and Camb. Phil. Trans. (1885), vol. xiv. pt. ii. p. 147 seq.
  7. Above, pars. 6, 7.
  8. The mean square of error under the normal law with modulus $c$ is $c^2/2$.
  9. The standard deviation pertaining to a set of (n/r) composite observations, each derived from the original n observations by averaging a batch thereof numbering r, is √(k/r)/√(n/r) = √(k/n), when the given observations are all of the same weight; mutatis mutandis when the weights differ.