PROBABILITY (Lat. probabilis, probable or credible), a term which in general implies credibility short of certainty.

Description and Division of the Subject.—The mathematical theory of probabilities deals with certain phenomena which are employed to measure credibility. This measurement is well exemplified by games of chance. If a pack of cards is shuffled and a card dealt, the probability that the card will belong to a particular suit is measured by—we may say, is—the ratio 1:4, or 1/4; there being four suits to any one of which the card might have belonged. So the probability that an ace will be drawn is 4/52, as out of the 52 cards in the pack 4 are aces. So the probability that ace will turn up when a die is thrown is 1/6. The probability that one or other of the two events, ace or deuce, will occur is 1/3. If simultaneously a die is thrown and a card is dealt from a pack which has been shuffled, the probability that the double event will consist of two aces is 1 × 4 divided by 6 × 52, as the total number of double events formed by combining a face of a die with a face of a card is 6 × 52, and out of these 1 × 4 consist of two aces.
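The ratios just stated may be checked by direct enumeration. The following Python sketch is offered only as an illustration; the names PACK, DIE and probability are introduced here for the purpose and form no part of the original exposition. It counts favourable cases against all equally possible cases for the card, the die, and the double event.

```python
from fractions import Fraction
from itertools import product

# A pack as (rank, suit) pairs; rank 1 stands for the ace.
SUITS = ["clubs", "diamonds", "hearts", "spades"]
PACK = [(rank, suit) for rank in range(1, 14) for suit in SUITS]
DIE = range(1, 7)  # face 1 stands for the ace of the die


def probability(cases, favourable):
    """Ratio of favourable cases to all equally possible cases."""
    cases = list(cases)
    hits = sum(1 for case in cases if favourable(case))
    return Fraction(hits, len(cases))


p_suit = probability(PACK, lambda card: card[1] == "hearts")
p_card_ace = probability(PACK, lambda card: card[0] == 1)
p_die_ace = probability(DIE, lambda face: face == 1)
p_ace_or_deuce = probability(DIE, lambda face: face in (1, 2))
# Double event: every face of the die combined with every card of the pack.
p_two_aces = probability(
    product(DIE, PACK), lambda pair: pair[0] == 1 and pair[1][0] == 1
)

print(p_suit, p_card_ace, p_die_ace, p_ace_or_deuce, p_two_aces)
```

Exact fractions are used so that 4/52 and 4/(6 × 52) appear in lowest terms rather than as rounded decimals.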

The data of probabilities are often prima facie at least of a type different from that which has been described. For example, the probability that a child about to be born will be a boy is about 0.51. This statement is founded solely on the observed fact that about 51% of children born (alive, in European countries) prove to be boys. The probability is not, as in the instance of dice and cards, measured by the proportion between a number of cases favourable to the event and a total number of possible cases. Those instances indeed also admit of the measurement based on observed frequency. Thus the number of times that a die turns up ace is found by observation to be about 1/6 of the number of throws; and similar statements are true of cards and coins.[1] The probabilities with which the calculus deals admit generally of being measured by the number of times that the event is found by experience to occur, in proportion to the number of times that it might possibly occur.
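The measurement by observed frequency may be imitated with a simulated die. This is a minimal sketch under stated assumptions: the pseudo-random generator of Python stands in for a physical die, and the function name observed_frequency, the seed and the number of throws are arbitrary choices made for illustration.

```python
import random


def observed_frequency(throws, seed=1911):
    """Proportion of simulated throws of a fair die that turn up ace (face 1)."""
    rng = random.Random(seed)  # fixed seed so that the run can be repeated
    aces = sum(1 for _ in range(throws) if rng.randint(1, 6) == 1)
    return aces / throws


freq = observed_frequency(60_000)
print(freq, 1 / 6)
```

With some tens of thousands of throws the observed proportion may be expected to lie close to 1/6, though not to coincide with it exactly.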

The idea of a probable or expected number is not confined to the number of times that an event occurs; if the occurrence of the event is associated with a certain amount of money or any other measurable article there will be a probable or expected amount of that article. For instance, if a person throwing dice is to receive two shillings every time that six turns up, he may expect in a hundred throws to win about 2 × 100 × 1/6 (about 33) shillings. If he is to receive two shillings for every six and one shilling for every ace, his expectation will be 2 × 100 × 1/6 + 100 × 1/6 (50) shillings. The expectation of lifetime is calculated on this principle. Of 1000 males aged ten say the probable number who will die in their next year is 490, in the following year 397, and so on; if we (roughly) estimate that those who die in the first year will have enjoyed one year of life after ten, those who die in the next year will have enjoyed two years of life, and so on; then the total number of years which the 1000 males[2] aged ten may be expected to live is

1 × 490 + 2 × 397 + . . .
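The expectations of the dice-player mentioned earlier in this paragraph may be verified as follows. The function expected_winnings and its argument names are assumptions introduced for illustration only; the sketch multiplies the number of throws by the sum of each prize weighted by the probability 1/6 of its face.

```python
from fractions import Fraction


def expected_winnings(throws, prizes):
    """prizes maps a face of the die to the shillings paid when it turns up."""
    per_throw = sum(Fraction(1, 6) * amount for amount in prizes.values())
    return throws * per_throw


six_only = expected_winnings(100, {6: 2})           # two shillings for a six
six_and_ace = expected_winnings(100, {6: 2, 1: 1})  # and one shilling for an ace

print(float(six_only), six_and_ace)
```

The first figure is 100/3, or about 33 shillings; the second is exactly 50 shillings.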

Space as well as time may be the subject of expectation. If drops of rain fall in the long run with equal frequency on one point—or rather on one small interval, say of a centimetre or two—as on another of a band of finite length and negligible breadth, the distance which is to be expected between a point of impact in the upper half of the line and a point of impact in the lower half has a definite proportion to the length of the given line.[3]

Expectation in the general sense may be considered as a kind of average.[4] The doctrine of averages and of the deviations therefrom technically called “errors” is distinguished from the other portion of the calculus by the peculiar difficulty of its method. The paths struck out by Laplace and Gauss have hardly yet been completed and made quite secure. The doctrine is also distinguished by the importance of its applications. The theory of errors enables the physicist so to combine discrepant observations as to obtain the best measurement. It may abridge the labour of the statistician by the use of samples.[5] It may assist the statistician in testing the validity of inductions.[6] It promises to be of special service to him in perfecting the logical method of concomitant variations; especially in investigating the laws of heredity. For instance the correlation between the height of parents and that of children is such that if we take a number of men all of the same height and observe the average height of their adult sons, the deviation of the latter average from the general average of adult males bears a definite proportion—about a half—to the similarly measured deviation of the height common to the fathers. The same kind and amount of correlation between parents and children with respect to many other attributes besides stature has been ascertained by Professor Karl Pearson and his collaborators.[7] The kinetics of free molecules (gases) forms another important branch of science which involves the theory of errors.

The description of the subject which has been given will explain the division which it is proposed to adopt. In Part I. probability and expectation will be considered apart from the peculiar difficulties incident to errors or deviation from averages. The first section of the first part will be devoted to a preliminary inquiry into the evidence of the primary data and axioms of the science. Freed from philosophical difficulties the mathematical calculation of probabilities will proceed in the second section. The analogous calculation of expectation will follow in the third section. The contents of the first three sections will be illustrated in the fourth by a class of examples dealing with space measurements—the so-called “local” or “geometrical” probabilities. Part II. is devoted to averages and the deviations therefrom, or more generally that grouping of statistics which may be called a law of frequency. Part II. is divided into two sections distinguished by differences in character and extent between the principal generalizations respecting laws of frequency.

Part I.—Probability and Expectation

Section I.—First Principles.

1. As in other mathematical sciences, so in probabilities, or even more so, the philosophical foundations are less clear than the calculations based thereon. On this obscure and controversial topic absolute uniformity is not to be expected. But it is hoped that the following summary in which diverse authoritative judgments are balanced may minimize dissent.

2. (1) How the Measure of Probability is Ascertained.—The first question which arises under this head is: on what evidence are the facts obtained which are employed to measure probability? A very generally accepted view is that which Laplace has thus expressed:

“The probability of an event is the ratio of the number of cases which favour it to the number of all the possible cases, when nothing leads us to believe that one of these cases ought to occur rather than the others; which renders them, for us, equally possible.”[8] Against this view it is urged that merely psychological facts can at best afford a measure of belief, not of credibility. Accordingly, the ground of probability is sought in the observed fact of a class or “series”[9] such that if we take a great many members of the class, or terms of the series, the number of members which belong to a certain assigned species, compared with the total number taken, tends to a certain fraction as a limit. Thus the series which consists of heads and tails obtained by tossing up a well-made coin is such that out of a large number of throws the proportion giving heads is nearly half.

3. These views are not so diametrically opposed as may at first appear. On the one hand, those who follow Laplace would of course admit that the presumption afforded by the “number of favourable cases” with respect to the probability of throwing either five or six with a die must be modified in accordance with actual experience such as that below cited[10] respecting particular dice that they turn up five or six rather oftener than once in three times. On the other hand, the series which is regarded as the empirical basis of probability is not a simple matter of fact. There are implied conditions which are not satisfied by the sort of uniformity which ordinarily characterizes scientific laws; which would not be satisfied for instance by the proportionate frequency of any one digit, e.g. 8, in the expansion of any vulgar fraction, though the expression may consist of a circulating decimal with a very long period.[11]

4. The type of the series is rather the frequency of the several digits in the expansion of an incommensurable constant such as √2, log 11, π, &c.[12] The observed fact that the digits occur with equal frequency is fortified by the absence of a reason why one digit should occur oftener than another.[13]
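The contrast between the digits of a vulgar fraction and those of an incommensurable constant may be exhibited by counting. The following sketch is illustrative only: the choice of 1/7 and of √2, the number of places taken, and the function names are assumptions made here. The digits of √2 are obtained exactly by an integer square root, and those of the fraction by long division.

```python
from collections import Counter
from math import isqrt


def sqrt2_digits(places):
    """Leading digit 1 of sqrt(2) followed by `places` decimals, as a string."""
    return str(isqrt(2 * 10 ** (2 * places)))


def fraction_digits(numerator, denominator, places):
    """First `places` decimals of a proper vulgar fraction, by long division."""
    digits = []
    remainder = numerator % denominator
    for _ in range(places):
        remainder *= 10
        digits.append(str(remainder // denominator))
        remainder %= denominator
    return "".join(digits)


root_counts = Counter(sqrt2_digits(1000)[1:])         # 1000 decimals of sqrt(2)
seventh_counts = Counter(fraction_digits(1, 7, 600))  # 0.142857142857...

print(sorted(root_counts.items()))
print(sorted(seventh_counts.items()))
```

In the expansion of 1/7 the digit 8 occupies exactly one place in six and the digits 0, 3, 6 and 9 never occur, whereas the counts for √2 may be compared with the hundred apiece which equal frequency would suggest.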

5. The most perfect types of probability appear to present the two aspects: proportion of favourable cases given a priori and frequency of occurrence observed a posteriori. When one of these attributes is not manifested it is often legitimate to infer its existence from the presence of the other. Given numerous batches of balls, each batch numbering say 100 and consisting partly of white and partly of black balls; if the percentages of white balls presented by the set of batches averaged, and, as it were, hovered about some particular percentage, e.g. 50, though we knew as an independent datum, or by inspection of the given percentages, that the series was not obtained by simply extracting a hundred balls from a jar containing a mélange of white and black balls, we might still be justified in concluding that the observed phenomenon resulted from a system equivalent to a number of jars of various constitution, compounded in some complicated fashion. So Laplace may be justified in postulating behind frequencies embodied in vital statistics the existence of a “constitution” analogous to games of chance, “possibilities” or favourable cases which might conceivably be “developed” or discussed.[14] On the other hand, it is often legitimate to infer from the known proportion of favourable cases a corresponding frequency of occurrence. The cogency of the inference will vary according to the degree of experience. That one face of a die or a coin will turn up nearly as often as another might be affirmed with perfect confidence of the particular dice which Weldon threw some thousands of times,[15] or the coins with which Professor Pearson similarly operated.[16] It may be affirmed with much confidence of ordinary coins and dice without specific experience, and generally, where fairplay is presumed, of games of chance. 
This confidence is based not only on experiments like those tried by Buffon, Jevons and many others,[17] but also on a continuous, extensive, almost unconsciously registered experience in pari materia. It is this sort of experience which justifies our expectation that commonly in mathematical tables one digit will occur as often as another, that in a shower about as many drops will fall on one element of area as upon a neighbouring spot of equal size. Doubtless the presumption must be extended with caution to phenomena with which we are less familiar. For example, is a meteor equally likely to hit one square mile as another of the earth's surface? We seem to descend in the scale of credibility from absolute certainty that alternative events occur with about equal frequency to absolute ignorance whether one occurs more frequently than the other. The empirical basis of probability may appear to become evanescent in a case like the following, which has been discussed by many writers on Probabilities.[18] What is the probability of drawing a white ball from a box of which we only know that it contains balls both black and white and none of any other colour? In this case, unlike the case of an urn containing a mixture of white and black balls in equal proportions, we have no reason to expect that if we go on drawing balls from the urn, replacing each ball after it has been drawn, the series so presented will consist of black and white in about equal numbers. But there is ground for believing that in the long course of experiences in pari materia—other urns of similar constitution, other cases in which there is no reason to expect one alternative more than another—an event of one kind will occur about as often as one of another kind. A “cross-series”[19] is thus formed which seems to rest on as extensive if not so definite an empirical basis as the series which we began by considering. 
Thus the so-called “intellectual probability”[20] which it has been sought to separate from the material probability verified by frequency of occurrence, may still rest on a similar though less obvious ground of experience. This type of probability not verified by specific experience is presented in two particularly important classes.

6. Unverified Probabilities.—In applying the theory of errors to the art of measurement it is usual to assume that prior to observation one value of the quantity under measurement is as likely as another. “When the probability is unknown,” says Laplace,[21] “we may equally suppose it to have any value between zero and unit.” The assumption is fundamentally similar whether the quantum is a ratio to be determined by the theorem of Bayes,[22] or an absolute quantity to be determined by the more general theory of error. Of this first principle it is well observed by Professor Karl Pearson[23]: “There is an element of human experience at the bottom of Laplace's assumption.” Professor Pearson quotes with approbation[24] the following account of the matter: “The assumption that any probability-constant about which we know nothing in particular is as likely to have one value as another is grounded upon the rough but solid experience that such constants do as a matter of fact as often have one value as another.”

7. It may be objected, no doubt, that one value (of the object under measurement) is often known beforehand not to be as likely as another. The barometric height for instance is not equally likely to be 29 in. or to be 2 in. The reply is that the postulate is only required with respect to a small tract in a certain neighbourhood, some 2 in. above and below 29½ in. in the case of barometric pressure.

8. It is further objected that the assumption in question involves inconsistencies in cases like the following. Suppose observations are made on the length of a pendulum together with the time of its oscillation. As the time is proportional to the square root of the length, it follows that if the values of the length occur with equal frequency those of the time cannot do so; and, inversely, if the proposition is true of the times it cannot be true of the lengths.[25] One reply to this objection is afforded by the reply to the former one. For where we are concerned only with a small tract of values it will often happen that both the square and the square root and any ordinary function of a quantity which assumes equivalent values with equal probability will each present an approximately equal distribution of probabilities.[26] It may further be replied that in general the reasoning does not require the a priori probabilities of the different values to be very nearly equal; it suffices that they should not be very unequal;[27] and this much seems to be given by experience.
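The first reply may be illustrated numerically. If a quantity is distributed with equal probability over a tract from low to high, its square root has a probability-density proportional to the root itself, so that the greatest density exceeds the least in the ratio of the square root of high/low. The sketch below is a hedged illustration: the function name is hypothetical, and the figures 27.5 and 31.5 are borrowed from the barometric tract of par. 7 merely as convenient numbers.

```python
from math import sqrt


def density_ratio_of_root(low, high):
    """If L is uniform on [low, high], T = sqrt(L) has density 2t / (high - low)
    on [sqrt(low), sqrt(high)]; return the largest density over the smallest."""
    return sqrt(high / low)


narrow = density_ratio_of_root(27.5, 31.5)  # a small tract about 29.5
wide = density_ratio_of_root(2.0, 31.5)     # a tract running down to 2

print(round(narrow, 3), round(wide, 3))
```

Over the small tract the densities of the square root differ by about 7 per cent. only, so that both the quantity and its root are approximately equally distributed; over the wide tract they differ nearly fourfold.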

9. Whenever we can justify Laplace's first principle[28] that “probability is the ratio of the number of favourable cases to the number of all possible cases” no additional difficulty is involved in his second principle, of which the following may be taken as an equivalent. If we distribute the favourable cases into several groups the probability of the event will be the sum of the probabilities pertaining to each group.[29]

10. Another important instance of unverified probabilities occurs when it is assumed without specific experience that one phenomenon is independent of another in such wise that the probability of a double event is equal to the probability of the one event multiplied by the probability of the other—as in the instance already given of two aces occurring. The assumption has been verified with respect to “runs” in some games of chance;[30] but it is legitimately applied far beyond those instances. The proposition that very long runs of particular digits, e.g. of 7, may be expected in the development of a constant like π (e.g. a run of six consecutive sevens if the expansion of the constant were carried to a million places of decimals) may be given as an instance in which our conviction greatly transcends specific verification. In the calculation of probable, and improbable, errors, it[31] has to be assumed without specific verification that the observations on which the calculation is based are independent of each other in the sense now under consideration. With these explanations we may accept Laplace's third principle “If the events are independent of each other the probability of their concurrence (l'existence de leur ensemble) is the product of their separate probabilities.”[32]
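The expectation of a run of six sevens may be made definite under the assumption of independence. In this sketch, which is an illustration and not a statement about the actual digits of π, each digit is supposed equally likely and independent of its neighbours; the function name expected_runs is introduced here for the purpose.

```python
from fractions import Fraction


def expected_runs(places, run_length, digit_probability=Fraction(1, 10)):
    """Expected number of starting positions at which a given digit repeats
    `run_length` times, assuming independent, equally frequent digits."""
    positions = places - run_length + 1
    return positions * digit_probability ** run_length


runs_of_six_sevens = expected_runs(1_000_000, 6)
print(float(runs_of_six_sevens))
```

The product of six separate probabilities of 1/10, taken over nearly a million starting places, gives an expected number very nearly equal to one.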

11. Interdependent Probabilities.—Among the principles of probabilities it is usual to enunciate, after Laplace, several other propositions.[33] But these may here be rapidly passed over as they do not seem to involve any additional philosophical difficulty.

12. It has been shown that when two events are independent of each other the product of their separate probabilities forms the probability of their concurrence. It follows that the probability of the double event divided by the probability of either, say the first, component gives the probability of the other, the second component event. The quotient, we might say, is the probability that when the first event has occurred, the second will occur. The proposition in this form is true also of events which are not independent of one another. Laplace exemplifies the composition of such interdependent probabilities by the instance of three urns, A, B, C, about which it is known that two contain only white balls and one only black balls.[34] The probability of drawing a white ball from an assigned urn, say C, is ⅔. The probability that, a white ball having been drawn from C, a ball drawn from B will be white, is ½. Therefore the probability of the double event drawing a white ball from C and also from B is ⅔ × ½ or ⅓. The question now arises: supposing we know only the probability of the double event, which probability we will call [BC], and the probability of one of them, say [C] (but not, as in the case instanced, the mechanism of their interdependence); what can we infer about the probability [B] of the other event (an event such as in the above instance drawing a white ball from the urn B)—the separate probability irrespective of what has happened as to the urn C? We cannot in general say that [B] = [BC] divided by [C] but rather that quotient + k, where k is an unknown coefficient which may be either positive or negative. It might, however, be improper to treat k as zero on the ground that it is equally likely (in the long run of similar data) to be positive or negative. 
For given values of [BC] and [C], k has not this equiprobable character, since its positive and negative ranges are not in general equal; as appears from considering that [B] cannot be less than [BC], nor greater than unity.[35]
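Laplace's three urns may be enumerated directly. The sketch below is illustrative and its names (CASES, prob, gap) are assumptions; it supposes that the single black urn is equally likely to be A, B or C, computes [C], [BC] and [B], and then measures the gap between [B] and the quotient [BC]/[C] together with the limits which that gap cannot pass, given that [B] lies between [BC] and unity.

```python
from fractions import Fraction

URNS = ("A", "B", "C")
# Three equally possible cases: the single black urn is A, B or C.
CASES = [
    {urn: ("black" if urn == black else "white") for urn in URNS}
    for black in URNS
]


def prob(event):
    """Ratio of the cases in which `event` holds to all the cases."""
    return Fraction(sum(1 for case in CASES if event(case)), len(CASES))


p_c = prob(lambda case: case["C"] == "white")
p_bc = prob(lambda case: case["B"] == "white" and case["C"] == "white")
p_b = prob(lambda case: case["B"] == "white")

p_b_given_c = p_bc / p_c  # white from B, a white having been drawn from C
gap = p_b - p_bc / p_c    # difference between [B] and the quotient
gap_range = (p_bc - p_bc / p_c, 1 - p_bc / p_c)  # since [BC] <= [B] <= 1

print(p_c, p_b_given_c, p_bc, p_b, gap, gap_range)
```

In this instance the quotient is ½ while [B] is ⅔; the admissible range of the difference runs from −1/6 to +1/2, and is therefore not symmetrical about zero.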

13. Probability of Causes and Future Effects.—The first principles which have been established afford an adequate ground for the reasoning which is described as deducing the probability of a cause from an observed event.[36] If with the poet[37] we may represent a perfect mixture by the waters of the Po in which the “two Doras” and other tributaries are indiscriminately commingled, there is no great difference in respect of definition and deduction between the probability that a certain particle of water should have emanated from a particular source, or should be discharged through a particular mouth of the river. “This principle,” we may say with De Morgan, “of the retrospective or ‘inverse’ probability is not essentially different from the one first stated (Principle I.).”[38] Nor is a new first principle necessarily involved when after ascending from an effect to a cause we descend to a collateral effect.[39] It is true that in the investigation of causes it is often necessary to have recourse to the unverified species of probability. An instance has already been given of several approximately equiprobable causes, the several values of a quantity under measurement, from one of which the observed phenomena, the given set of observations, must have, so to speak, emanated. A simpler instance of two alternative causes occurs in the investigation which J. S. Mill[40] has illustrated—whether an event, such as a succession of aces, has been produced by a particular cause, such as loading of the die, or by that mass of “fleeting causes” called chance. It is sufficient for the argument that the “a priori” probabilities of the alternatives should not be very unequal.[41]
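The investigation of two alternative causes may be sketched numerically. All the figures below are hypothetical and are labelled as such: it is assumed that a loaded die throws ace with probability ½, that an unloaded die throws it with probability 1/6, and that four aces have been observed in succession. The function posterior_loaded is introduced here merely to show how the a priori probabilities enter.

```python
from fractions import Fraction


def posterior_loaded(prior_loaded, aces, p_ace_loaded=Fraction(1, 2)):
    """Probability that the die is loaded after `aces` successive aces.
    p_ace_loaded is an assumed figure, not one given in the text."""
    fair = (1 - prior_loaded) * Fraction(1, 6) ** aces
    loaded = prior_loaded * p_ace_loaded ** aces
    return loaded / (loaded + fair)


equal_priors = posterior_loaded(Fraction(1, 2), 4)
unequal_priors = posterior_loaded(Fraction(1, 10), 4)

print(equal_priors, unequal_priors)
```

Whether the alternatives are taken as equally likely beforehand, or the loading as only one-ninth as likely as chance, the conclusion points the same way; which illustrates why it suffices that the a priori probabilities should not be very unequal.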

14. (2) Whether Credibility is Measurable.—The domain of probabilities according to some authorities does not extend much, if at all, beyond the objective phenomena which have been described in the preceding paragraphs. The claims of the science to measure the subjective quantity, degree of belief, are disallowed or minimized. Belief, it is objected, depends upon a complex of perceptions and emotions not amenable[42] to calculus. Moreover, belief is not credibility; even if we do believe with more or less confidence in exact conformity with the measure of probability afforded by the calculus, ought we so to believe? In reply it must be admitted that many of the beliefs on which we have to act are not of the kind for which the calculus prescribes. It was absurd of Craig[43] to attempt to evaluate the credibility of the Christian religion by mathematical calculation. But there seem to be a number of simpler cases of which we may say with De Morgan[44] “that in the universal opinion of those who examine the subject, the state of mind to which a person ought to be able to bring himself” is in accordance with the regulation measure of probability. If in the ordeal to which Portia's suitors were subjected there had been a picture of her not in one only, but in two of the caskets, then—though the judgment of the principal parties might be distorted by emotion—the impartial spectator would normally expect with greater confidence than before that at any particular trial a casket containing the likeness of the lady would be chosen. So the indications of a thermometer may not correspond to the sensations of a fevered patient, but they serve to regulate the temperature of a public library so as to secure the comfort of the majority. 
This view does not commit us to the quantitative precision of De Morgan that in a case such as above supposed we ought to “look three times as confidently upon the arrival as upon the non-arrival” of the event.[45] Two or three roughly distinguished degrees of credibility—very probable, as probable as not, very improbable, practically impossible—suffice for the more important applications of the calculus. Such is the character of the judgments which the calculus enables us to form with respect to the occurrence of a certain difference between the real value of any quantity under measurement and the value assigned to it by the measurement. The confidence that the constants which we have determined are accurate within certain limits is a subjective feeling which cannot be dislodged from an important part of probabilities.[46] This sphere of subjective probability is widened by the latest developments of the science[47] so far as they add to the number of constants for which it is important to determine the probable—and improbable—error. For instance, a measure of the deviation of observations from an average or mean value was required by the older writers only as subordinate to the determination of the mean, but now this “standard deviation” (below, par. 98) is often treated as an entity for which it is important to discover the limits of error.[48] Some of the newer methods may also serve to countenance the measurement of subjective quantity, in so far as they successfully apply the calculus to quantities not admitting of a precise unit, such as colour of eye or curliness of hair.[49] A closer analogy is supplied by the older writers who boldly handle “moral” or subjective advantage, as will be shown under the next head.

15. (3) Axioms of Expectation.—Expectation so far as it involves probability presents the same philosophical questions. They occur chiefly in connexion with two principles analogous to and deducible from propositions which have been stated with respect to probability.[50] (i.) The expectation of the sum of two quantities subject to risk is the sum of the expectations of each. (ii.) The expectation of the product of two quantities subject to risk is the product of the expectations of each; provided that the risks are independent. For example, let one of the fortuitously fluctuating quantities be the winnings of a player at a game in which he takes the amount A if he throws ace with a die (and nothing if he throws another face). Then the expectation of that quantity is (1/6)A; or, in n trials (n being large), the player may expect to win about n × (1/6)A. Let the other fortuitously fluctuating quantity be winnings of a player at a game in which he takes the amount B when an ace of any suit is dealt from an ordinary pack of cards. The expectation of this quantity is (1/13)B; or in n trials the player may expect to win about n × (1/13)B. Now suppose a compound trial at which one simultaneously throws a die and deals a card; and let his winning at a compound trial be the sum of the amounts which he would have received for the die and the card respectively at a simple trial. In n such compound trials he may expect to win about n × (1/6)A + n × (1/13)B, or the expectation of the winning at a compound trial is the sum of the separate expectations. Next suppose the winning at a compound trial to be the product of the two amounts which he would have received for the die and the card if played at a simple trial. It is zero unless the player obtains two aces. It is A × B when this double event occurs. But this double event occurs in the long run only once in 78 times. 
Accordingly the expectation of the winning at a compound trial at which the winning is the product of the winnings at two simple trials is (1/78)AB, that is (1/6)A × (1/13)B, the product of the separate expectations. What has been shown for two expectations of the simplest type, where α is the probability of an event which has been associated with a quantity a, may easily be extended to several expectations each of the type

a₁α₁ + a₂α₂ + a₃α₃ + . . .

where aᵣαᵣ is an expectation of the simplest type, above exemplified, or of the type a₁α₁ × a₂α₂ × a₃α₃ × . . . or a mixture of these types. For by the law which has been exemplified the sum of r expectations can always be reduced to the sum of r − 1, and then the r − 1 to r − 2, and so on; and the like is true of products.
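The two axioms may be verified for the die and the card by enumerating the 6 × 52 compound cases. In this sketch the stakes A = 12 and B = 39 are hypothetical figures chosen only to give whole numbers, and the helper names are introduced for illustration.

```python
from fractions import Fraction
from itertools import product

A, B = 12, 39  # hypothetical amounts taken for an ace of the die and of the pack
DIE = range(1, 7)
PACK = [rank for rank in range(1, 14) for _ in range(4)]  # ranks of 52 cards


def die_win(face):
    return A if face == 1 else 0


def card_win(rank):
    return B if rank == 1 else 0


def expectation(values):
    """Average of the winnings over all equally possible cases."""
    values = list(values)
    return Fraction(sum(values), len(values))


e_die = expectation(die_win(f) for f in DIE)
e_card = expectation(card_win(r) for r in PACK)
e_sum = expectation(die_win(f) + card_win(r) for f, r in product(DIE, PACK))
e_product = expectation(die_win(f) * card_win(r) for f, r in product(DIE, PACK))

print(e_die, e_card, e_sum, e_product)
```

Here the die and the card are independent, so that the expectation of the product, AB/78, coincides with the product of the separate expectations.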

16. It should be remarked that the proviso as to the independence of the probabilities involved is required only by the second of the two fundamental propositions. It may be dispensed with by the first. Thus in the example of interdependent probabilities given by Laplace[51]—three urns about which it is known that two contain only white balls and one only black—if a person drawing a ball first from C and then from B is to receive x shillings every time he draws a white ball, from one or other of the urns, he may expect if he performs the compound operation n times to receive n × 2 × ⅔x shillings. But the expectation of the product of the number of shillings won by drawing a white ball from C and the number of shillings won by afterwards drawing a white ball from B is not n(⅔)²x², but n × ⅓x².
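The same enumeration serves for the interdependent urns. The sketch assumes, as in Laplace's example cited in par. 12, that two urns hold only white balls and one only black, the black urn being equally likely to be A, B or C; the payment x = 3 shillings is a hypothetical figure, and the expectations shown are for a single compound operation.

```python
from fractions import Fraction

URNS = ("A", "B", "C")
CASES = [
    {urn: ("black" if urn == black else "white") for urn in URNS}
    for black in URNS
]
x = 3  # hypothetical payment in shillings for each white ball drawn


def win(case, urn):
    return x if case[urn] == "white" else 0


def expectation(quantity):
    """Average of `quantity` over the three equally possible cases."""
    return Fraction(sum(quantity(case) for case in CASES), len(CASES))


e_c = expectation(lambda case: win(case, "C"))
e_b = expectation(lambda case: win(case, "B"))
e_sum = expectation(lambda case: win(case, "C") + win(case, "B"))
e_product = expectation(lambda case: win(case, "C") * win(case, "B"))

print(e_sum, e_c + e_b, e_product, e_c * e_b)
```

The expectation of the sum is 2 × ⅔x notwithstanding the interdependence, while the expectation of the product is ⅓x² and not (⅔)²x².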

17. The first of the two principles is largely employed in the practical applications of probabilities. The second principle is largely employed in the higher generalizations of the science[52] (the laws of error demonstrated in Part II.); the requisite independence of the involved probabilities being mostly of the unverified[53] species.

18. Expectation of Utility.—A philosophical difficulty peculiar to expectation[54] arises when the quantity expected has not the objective character usually presupposed in the applications of mathematics. The most signal instance occurs when the expectation relates to an advantage, and that advantage is estimated subjectively by the amount of utility or satisfaction afforded to the possessor. Mathematicians have commonly adopted the assumption made by Daniel Bernoulli that a small increase in a person's material means or “physical fortune” causes an increase of satisfaction or “moral fortune,” inversely proportional to the physical fortune; and accordingly that the moral fortune is equateable to the logarithm of the physical fortune.[55] The spirit in which this assumption should be employed is well expressed by Laplace when he says[56] that the expectation of subjective advantage (l'espérance morale) “depends on a thousand variable circumstances which it is almost always impossible to define and still more to submit to calculation.” “One cannot give a general rule for appreciating this relative value,” yet the principle above stated in “applying to the commonest cases leads to results which are often useful.”
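Bernoulli's assumption may be illustrated by a small calculation. This is a sketch only, in the cautious spirit which Laplace recommends: the fortunes of 100 and 10,000 and the even bet of 50 are hypothetical figures, and the function names are introduced for illustration. The moral fortune is taken as the logarithm of the physical fortune, and the sum of money whose certain receipt would be morally equivalent to the bet is computed.

```python
from math import exp, log


def moral_expectation(fortune, outcomes):
    """outcomes: (probability, gain) pairs; moral fortune taken as log(fortune)."""
    return sum(p * log(fortune + gain) for p, gain in outcomes)


def certainty_equivalent(fortune, outcomes):
    """Certain gain having the same moral value as the risky outcomes."""
    return exp(moral_expectation(fortune, outcomes)) - fortune


fair_bet = [(0.5, 50.0), (0.5, -50.0)]  # mathematically fair: expectation zero
poor = certainty_equivalent(100.0, fair_bet)
rich = certainty_equivalent(10_000.0, fair_bet)

print(round(poor, 2), round(rich, 3))
```

On this assumption a bet which is mathematically fair is morally disadvantageous to both parties, and much more so to the person of small fortune.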

19. In this spirit we may regard the logarithm in Bernoulli's (as in Malthus's) theory as representative of a more general relation. Thus generalized the principle has been accepted by economists and utilitarian philosophers whose judgment on the relation between material goods and utility or satisfaction carries weight. Thus Professor Alfred Marshall writes:[57] “in accordance with a suggestion made by Daniel Bernoulli, we may perhaps suppose that the satisfaction which a person derives from his income may be regarded as beginning when he has enough to support life and afterwards as increasing by equal amounts with every equal successive percentage that is added to his income; and vice versa for loss of income.”[58] The general principle is embodied in Bentham's utilitarian reasoning which has been widely accepted.[59] The possibility of formulating the relation between feeling and its external cause is further supported by Fechner's investigations. This branch of Probabilities also obtains support from another part of the science, the calculation sanctioned by Laplace, of the disutility incident to error of measurement.[60] Altogether it seems impossible to deny that some simple mathematical operations prescribed by the calculus of probabilities are sometimes serviceably employed to estimate prospective benefit in the subjective sense of desirable feeling.

20. Single Cases and “Series.”—Analogous to the question regarding the standard of belief which arose under a former head, a question regarding the standard of action arises under the head of expectation. The former question, it may be observed, arises chiefly with respect to events which are considered as singular, not forming part of a series. There is no doubt, there is a full belief, that if we go on tossing (unloaded) dice the event which consists of obtaining either a five or a six will occur in approximately one-third of the trials. The important question is what is or should be our state of mind with regard to the result of a trial which is sui generis and not to be repeated, like the choice of a casket in the Merchant of Venice.[61] A similar difficulty is presented by singular events, with respect to volition. Is the chance of one to a thousand of the prize £1000 at a lottery approximately equivalent to £1 in the eyes of a person who for once, and once only, has the offer of such a stake? The question is separable from one with which it is often confounded, the one discussed in the last paragraph: what is the “moral” value of the prize? The person might be a millionaire for whom £1 and £1000 both belong to the category of small change. The stake and the prize might both be “moral.” The better opinion seems that apart from a system of transactions like that which an insurance company undertakes, or at least a “cross-series”[62] of the kind which seem largely to operate in ordinary life, expectations in which the risks are very different are no longer equateable. 
So De Morgan with regard to the “single case” (the solitary transaction in question) declares that the “mathematical expectation is not a sufficient approximation to the actual phenomenon of the mind when benefits depend upon very small probabilities; even when the fortune of the player forms no part of the consideration”[63] [without making allowance for the difference between “moral” and mathematical probabilities]. So Condorcet, “If one considers a single man and a single event there can be no kind of equality”[64] (between expectations with very different risks). It is only for the long run—lorsqu'on embrasse la suite indéfinie des évènements—that the rule is valid. To the same effect, at greater length, argue the logicians Dr Venn[65] and von Kries.[66] Some of the mathematical writers have much to learn from their logical critics[67] on this and other questions relating to first principles.

Section II.—Calculation of Probability.

21. Object of the Section.—In the following calculations the principal object is to ascertain the number of cases favourable to an event in proportion to the total number of possible cases.[68] “The difficulty consists in the enumeration of the cases,” as Lagrange says. Sometimes summation is the only mathematical operation employed; but very commonly it is necessary to apply the theory of permutations and combinations involving multiplication.[69]

22. Fundamental Theorem.—One of the simplest problems of this sort is one of the most important. Given a mélange of things consisting of two species, if n things are taken at random what is the probability that s out of these n things will be of a certain species? For example, the mélange might be a well-shuffled pack of cards, and the species black and red; the quaesitum, what is the probability that if n cards are dealt, s of them will be black? There are two varieties of the problem: either after each card is dealt it is returned to the pack, which is reshuffled, or all the n cards are dealt (as in ordinary games of cards) without replacement. The first variety of the problem deserves its place as being not only the simpler, but also the more important, of the two.

23. At the first deal there are 26 cases favourable to black, 26 to red. When two deals have been made (in the manner prescribed), out of $52^2$ cases formed by combinations between a card turned up at the first deal and a card turned up at the second, 26 × 26 cases are combinations of two blacks, 26 × 26 are combinations of two reds, and the remainder 2(26 × 26) are made up of combinations between one black and one red; 26 × 26 cases of black at the first deal and red at the second, and 26 × 26 cases of red at the first and black at the second deal. The number of cases favourable to each alternative is evidently given by the several terms in the expansion of $(26 + 26)^2$. The corresponding probabilities are given by dividing each term by the total number of cases, viz. $52^2$. Similarly, when we go on to a third deal, the respective probabilities of the four possible cases, three blacks, two blacks and one red, two reds and one black, three reds, are given by the successive terms in the binomial expansion of $(26 + 26)^3$, each divided by $52^3$, and so on. The reasoning is quite general. Thus for the event which consists of dealing either clubs or spades (black) we might substitute an event of which the probability at a single trial is not ½, e.g. dealing hearts. Generally, if p and 1 − p are the respective probabilities of the event occurring or not occurring at a single trial, the respective probabilities that in n trials the event will occur n times, n − 1 times . . . twice, once or not at all, are given by the successive terms in the expansion of $[p + (1 - p)]^n$; of which expansion the general term is $\frac{n!}{s!\,(n-s)!}\,p^s(1-p)^{n-s}$.

24. The probability may also be calculated as follows. Taking for example the case in which the event consists of dealing hearts; consider any particular arrangement of the n cards, of which s are hearts, e.g. the arrangement in which the s cards first dealt are hearts and the following n − s all belong to other suits. The probability of the first s cards being all hearts is $(\tfrac14)^s$; the probability that none of the last (n − s) cards are hearts is $(\tfrac34)^{n-s}$. Hence the probability of that particular arrangement occurring is $(\tfrac14)^s(\tfrac34)^{n-s}$. But this arrangement is but one of many, e.g. that in which the s hearts are the last dealt, which are equally likely to occur. There are as many different arrangements of this type as there are combinations of n things taken together s times, that is $n!/\{s!\,(n-s)!\}$. The probability thus calculated agrees with the preceding result.
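The rule of paragraphs 23 and 24 may be illustrated by a short computation. The following is only a sketch in Python, using exact fractions; the function name `binomial_probability` and the choice of five deals are assumptions made for illustration and form no part of the original treatment.

```python
from fractions import Fraction
from math import comb


def binomial_probability(n, s, p):
    """Chance of exactly s occurrences in n independent trials,
    the chance at a single trial being p (deals with replacement)."""
    p = Fraction(p)
    # Number of arrangements times the chance of any one arrangement.
    return comb(n, s) * p**s * (1 - p) ** (n - s)


# Illustrative case: exactly two hearts in five deals, p = 1/4.
two_hearts = binomial_probability(5, 2, Fraction(1, 4))

# The terms for s = 0, 1, ..., n exhaust all the cases.
total = sum(binomial_probability(5, s, Fraction(1, 4)) for s in range(6))
```

Here `two_hearts` is the term of the expansion for n = 5, s = 2, and `total` adds up all six terms of that expansion.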

25. It follows from the law of expansion for $[p + (1 - p)]^n$ that as n is increased, the value of the fractions which form the terms at either extremity diminishes. When n becomes very large, the terms which are in the neighbourhood of the greatest term of the expansion overbalance the sum total of the remaining terms.[70] Thus in the example above given, if we go on and on dealing cards (with replacement) the ratio of the red cards dealt to all the cards dealt tends to become more and more nearly approximate to the limit ½. These statements are comprised in the theorem known as James Bernoulli's. Stated in its simplest form—that “in the long run all events will tend to occur with a relative frequency proportional to their objective probabilities”[71]—this theorem has been regarded as tautological or circular. Yet the proofs of the theorem which have been given by great mathematicians may deserve attention as at least showing the consistency of first principles.[72] Moreover, as usually stated, James Bernoulli's theorem imports something more than the first axiom of probabilities.[73]

26. The generalization of the Binomial Theorem which is called the “Multinomial Theorem”[74] gives the rule when there are more than two alternatives at each trial. For instance, if there are three alternatives, hearts, diamonds or a card belonging to a black suit, the probability that if n cards are dealt (with replacement) there will occur s hearts, t diamonds, and n − s − t cards which are either clubs or spades is

$$\frac{n!}{s!\,t!\,(n-s-t)!}\left(\tfrac14\right)^{s}\left(\tfrac14\right)^{t}\left(\tfrac12\right)^{n-s-t}.$$

27. Applications of Fundamental Theorem.—The peculiar interest of the problem which is here placed first is that its solution represents a law of almost universal application: the law assigning the frequency with which different values are assumed by a quantity which, like most of the quantities with which statistics has to do, depends upon several independent agencies. It is remarkable that the problem in probabilities which historically was almost the first belongs to the kind which is first in interest. Of this character is a question which occupied Galileo and before him Cardan, and an even earlier writer: what are the chances that, when two or three dice are thrown, the sum of the points or pips turned up should amount to a certain number? A particular case of this problem is presented by the old game of “passedix”: what is the probability that if three dice are thrown the sum of the pips should exceed ten?[75] The answer is obtained by considering the number of combinations that are favourable to each of the different alternatives, 18 pips, 17, 16 . . . . 11 pips, which make up the event in question. Thus out of the total of 216 ($6^3$) combinations, one is favourable to 18, three to 17, and so on. There are twenty-five chances, as we may call the permutations, in favour of 12, twenty-seven in favour of 11.[76] The sum of all these being 108, we have for the event in question 108/216, an even chance. More generally it may be inquired: what is the probability that, if n dice are thrown, the number of points turned up will be exactly s? By an extension of the reasoning which was employed in the first problem it is seen that the required probability is the coefficient of the term of which the index is s (the term in $x^s$) in the expansion of the expression

$$\left\{\tfrac16\left(x + x^2 + x^3 + x^4 + x^5 + x^6\right)\right\}^n.$$

The calculation may be simplified by writing this expression in the form

$$\frac{x^n}{6^n}\left(\frac{1 - x^6}{1 - x}\right)^n.$$

The successive terms of the expansion give the respective probabilities that the number in question should be n, n + 1 . . . 6n comprising all the possible numbers among which s is presumably included (otherwise the answer is zero). Of course we are not limited to six alternatives; instead of a die we may have a teetotum with any number of sides. The series expressing the probabilities of the different sums can be written out in general terms, as Laplace and others have done; but it seems to be of less interest than the approximate formula which will be given later.[77]
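The expansion described in paragraph 27 amounts to multiplying together n copies of the series for a single die. A minimal sketch of that multiplication follows, in Python with exact fractions; the name `sum_distribution` and the parameter `sides` (which allows for a teetotum with any number of sides) are assumptions introduced for the illustration.

```python
from fractions import Fraction


def sum_distribution(n_dice, sides=6):
    """Return a dict mapping each possible total of n_dice fair dice
    to its probability, built up one die at a time."""
    dist = {0: Fraction(1)}
    for _ in range(n_dice):
        new = {}
        for total, prob in dist.items():
            for face in range(1, sides + 1):
                # Each face carries 1/sides of the probability forward.
                new[total + face] = new.get(total + face, 0) + prob / sides
        dist = new
    return dist


three = sum_distribution(3)
# Passe-dix: the sum of three dice exceeds ten.
passe_dix = sum(p for total, p in three.items() if total > 10)
```

The dictionary `three` reproduces the chances quoted in the text (one in 216 for 18 pips, twenty-seven for 11), and `passe_dix` collects the terms from 11 to 18.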

28. Variant of the Fundamental Theorem.—The second variety of our first problem may next be considered. Suppose that after each trial the card dealt (ball drawn, &c.) is not replaced in statu quo ante. For instance, if r cards are dealt in the ordinary way from a shuffled pack, what is the probability that s of them will be hearts (s < 13)? Consider any particular arrangement of the r cards, whereof s are hearts, e.g. that in which the s cards first dealt are all hearts, the remaining r − s belonging to other suits. The probability of the first card being a heart is 13/52; the probability that, the first having been a heart, the second should be a heart is 12/51 (since a heart having been removed there are now 12 favourable cases out of a total of 51 cases). And so on. Likewise the probability of the (s + 1)th card being not a heart, all the preceding s having been hearts, is 39/(52 − s); the probability of the (s + 2)th card being not a heart is similarly reckoned. And thus the probability of the particular arrangement considered is found to be

$$\frac{13\cdot 12\cdots(13 - s + 1)\times 39\cdot 38\cdots\{39 - (r - s) + 1\}}{52\cdot 51\cdots(52 - r + 1)}.$$

Now consider any other arrangement of the r cards, e.g. t of the s hearts to occur first and the remaining s − t last. The denominator in the above expression will remain the same; and in the numerator only the order of the factors will be altered. The probability of the second arrangement is therefore the same as that of the first; and the probability that some one or other of the arrangements will occur is given by multiplying the probability of any one arrangement by the number of different arrangements, which, as in the simpler case of the problem,[78] is the same as the number of combinations formed by r things taken together s times, that is $r!/\{s!\,(r-s)!\}$. The formula thus obtained may be generalized by substituting n for 52, pn for 13, qn for 39 (where p + q = 1; pn and qn are integers). A formula thus generalized is proposed by Professor Karl Pearson[79] as proper to represent the frequency with which different values are assumed by a quantity depending on causes which are not independent.
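The reasoning of paragraph 28 may be followed step by step in a computation. The sketch below, in Python, multiplies out the successive fractions for one particular arrangement and then multiplies by the number of arrangements; the function names and the default figures of 13 hearts and 39 other cards are assumptions taken from the example of the pack.

```python
from fractions import Fraction
from math import comb


def arrangement_probability(r, s, hearts=13, others=39):
    """Chance that the first s cards dealt are hearts and the
    following r - s are not, dealing without replacement."""
    prob = Fraction(1)
    remaining = hearts + others
    for i in range(s):
        prob *= Fraction(hearts - i, remaining)   # 13/52, 12/51, ...
        remaining -= 1
    for j in range(r - s):
        prob *= Fraction(others - j, remaining)   # 39/(52 - s), ...
        remaining -= 1
    return prob


def hearts_without_replacement(r, s, hearts=13, others=39):
    """Chance of exactly s hearts among r cards dealt in the ordinary way."""
    return comb(r, s) * arrangement_probability(r, s, hearts, others)
```

The result agrees with the form in which the same probability is often written, the product of the combinations of hearts and of other cards divided by the combinations of the whole pack; that identity is used as a check in the accompanying test rather than asserted of the original text.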

29. Miscellaneous Examples: Games of Chance.—The majority of the problems under this heading cannot, like the preceding two, be regarded as conducing directly to statistical methods which are required in investigating some parts of nature. They are at best elegant exercises in a kind of mathematical reasoning which is required in most of such methods. Games of chance present some of the best examples. We may begin with one of the oldest, the problem which the Chevalier de Méré put to Pascal when he questioned: How many times must a pair of dice be thrown in order that it may be an even chance that double six—the event called sonnez—may occur at least once?[80] The answer may be obtained by finding a general expression for the probability that the event will occur at least once in n trials; and then determining n so that this expression = ½. The probability of the event occurring is the difference between unity and the probability of its failing. Now the probability of “sonnez” failing at a single throw (of two dice) is 35/36. Therefore the probability of its failing in n throws is $(35/36)^n$. Whence we obtain, to determine n, the equation $1 - (35/36)^n = \tfrac12$, which gives n = 24.605 nearly.
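The figure for de Méré's problem can be checked by taking logarithms of the condition that the chance of failure in n throws is one half. The following is a small sketch in Python; the names `chance_at_least_once` and `n_even` are assumptions for the illustration, and floating-point arithmetic is used, so the values are approximate.

```python
import math


def chance_at_least_once(n, p_single=1 / 36):
    """Chance that an event of single-trial probability p_single
    occurs at least once in n independent trials."""
    return 1 - (1 - p_single) ** n


# Solve (35/36)^n = 1/2 for n by logarithms.
n_even = math.log(2) / math.log(36 / 35)
```

Since n must be a whole number of throws, the chance falls short of one half at 24 throws and exceeds it at 25.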

30. In the preceding problem the quaesitum was (unity minus) the probability that out of all the possible events an assigned one (“sonnez”) should fail to occur in the course of n trials. In the following problem the quaesitum is the probability that out of all the possible events one or other should fail—that they should not all be represented in the course of n trials. A die being thrown n times, what is the probability that all three of the following events will not be represented (that one or other of the three will not occur at least once); viz. (a) either ace or deuce turning up, (b) either 3 or 4, (c) either 5 or 6. The number of cases in which one at least of these events fails to occur is equal to the number of cases in which (a) fails, plus the number in which (b) fails, plus the number in which (c) fails, minus the number of cases in which two of the events fail concurrently (which cases without this subtraction would be counted twice).[81] Now the number of cases in which (a) fails to occur in the course of the n trials is $(\tfrac23)^n$ of all the possible cases numbering $3^n$. Like propositions are true of (b) and (c). The number of cases in which both (a) and (b) fail is $(\tfrac13)^n$ of the total;[82] and the like is true of the cases in which both (a) and (c) fail and the cases in which both (b) and (c) fail. Accordingly the probability that one at least of the events will fail to occur in the course of n trials is

$$3\left(\tfrac23\right)^n - 3\left(\tfrac13\right)^n.$$

31. One more step is required by the following problem: If n cards are dealt from a pack, each card after it has been dealt being returned to the pack, which is then reshuffled, what is the probability that one or other of the four suits will not be represented? The probability that hearts will fail to occur in the course of the n deals is $(\tfrac34)^n$; and the like is true of the three other suits. From the sum of these probabilities is to be subtracted the sum of the probabilities that there will be concurrent failures of any two suits; but from this subtrahend are to be subtracted the proportional number of cases in which there are concurrent failures of any three suits (otherwise cases such as that in which e.g. hearts, diamonds and clubs concurrently failed[83] would not be represented at all). Now the probability of any assigned two suits failing is $(\tfrac24)^n$; the probability of any assigned three suits failing is $(\tfrac14)^n$. Accordingly the required probability is

$$4\left(\tfrac34\right)^n - 6\left(\tfrac24\right)^n + 4\left(\tfrac14\right)^n.$$

The analogy of the Binomial Theorem supplies the clue to the solution of the general problem of which the following is an example. If a die is thrown n times the probability that every face will have turned up at least once is[84]

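$$1 - 6\left(\tfrac56\right)^n + 15\left(\tfrac46\right)^n - 20\left(\tfrac36\right)^n + 15\left(\tfrac26\right)^n - 6\left(\tfrac16\right)^n.$$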
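The alternate addition and subtraction employed in paragraphs 30 and 31 can be written as a single sum with binomial coefficients, and checked against a direct enumeration of cases. The sketch below, in Python, is offered under that reading; the function names are assumptions, and the enumeration is practicable only for small n.

```python
from fractions import Fraction
from itertools import product
from math import comb


def all_represented(n, k):
    """Chance that each of k equally likely alternatives occurs at
    least once in n trials, by alternate addition and subtraction."""
    return sum((-1) ** j * comb(k, j) * Fraction(k - j, k) ** n
               for j in range(k + 1))


def all_represented_brute(n, k):
    """The same chance found by counting every sequence of n trials."""
    hits = sum(1 for seq in product(range(k), repeat=n)
               if len(set(seq)) == k)
    return Fraction(hits, k ** n)


# Chance that one or other of the four suits fails in five deals.
some_suit_missing = 1 - all_represented(5, 4)
```

With k = 3 the sum answers the problem of the three pairs of faces, with k = 4 that of the four suits, and with k = 6 that of the six faces of the die.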

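32. If in the (first) problem stated in paragraph 31 the cards are dealt in the ordinary way (without replacement), we must substitute for $(\tfrac34)^n$ the continued product $\frac{39}{52}\cdot\frac{38}{51}\cdots\frac{39-n+1}{52-n+1}$; for $(\tfrac24)^n$ the continued product $\frac{26}{52}\cdot\frac{25}{51}\cdots\frac{26-n+1}{52-n+1}$, and so on.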

33. Still considering miscellaneous examples relating to games of chance let us inquire what is the probability that at whist each of the two parties should have two honours?[85] If the turned-up card is an honour, the probability that of the three other honours an assigned one is among the twenty-five which are in the hands of the dealer or his partner, while the remaining two honours are in the hands of the other party, is 25/51 × 26/50 × 25/49. But the assigned card may with equal probability be any one of three honours; and accordingly the above written probability is to be multiplied by 3. If the turned-up card is not an honour then the probability that an assigned pair of honours is in the hands of the dealer or his partner, while the remaining two honours are in the hands of their adversaries, is 25/51 × 24/50 × 26/49 × 25/48; this probability is to be multiplied by six, as the assigned pair may be any of the six binary combinations formed by the four honours. Now the probability of the alternative first considered—the turned-up card being an honour—is 4/13; and the probability of the second alternative, 9/13. Accordingly the required probability is

$$\frac{4}{13}\cdot 3\cdot\frac{25\cdot 26\cdot 25}{51\cdot 50\cdot 49} + \frac{9}{13}\cdot 6\cdot\frac{25\cdot 24\cdot 26\cdot 25}{51\cdot 50\cdot 49\cdot 48} = \frac{325}{833}.$$

34. The probability that each of the four players should have an honour may be calculated thus.[86] If the card turned up is an honour then ipso facto the dealer has one honour and the probability that the remaining players have each an assigned one of the three remaining honours, is 13/51 × 13/50 × 13/49. Which probability is to be multiplied by 3!, as there are that number of ways in which the three cards may be assigned. If the card turned up is not an honour the probability that each player has an assigned honour is 13/51 × 13/50 × 13/49 × 12/48. Which probability is to be multiplied by 4!. Accordingly the required probability is

$$\frac{4}{13}\cdot 3!\cdot\frac{13^3}{51\cdot 50\cdot 49} + \frac{9}{13}\cdot 4!\cdot\frac{13^3\cdot 12}{51\cdot 50\cdot 49\cdot 48} = \frac{2197}{20825}$$

(the chance not being affected by the character of the card turned up).

35. The probability of all the trumps being held by the dealer is

$$\frac{12}{51}\cdot\frac{11}{50}\cdot\frac{10}{49}\cdots\frac{1}{40}$$

or $12!\,39!/51!$, which being calculated by means of tables for (logarithms of) factorials[87] or directly,[88] is 1/158,753,389,900.
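The whist calculations of paragraphs 33 to 35 lend themselves to verification by exact arithmetic. The following Python sketch simply multiplies out the fractions named in the text; the function names are assumptions for the illustration, and the last quantity rests on the reading that the dealer, already holding the turned-up trump, must receive the other twelve trumps among his twelve remaining cards.

```python
from fractions import Fraction
from math import comb


def two_honours_each_side():
    """Chance that each of the two parties holds two honours."""
    upturned_honour = 3 * Fraction(25, 51) * Fraction(26, 50) * Fraction(25, 49)
    upturned_plain = (6 * Fraction(25, 51) * Fraction(24, 50)
                      * Fraction(26, 49) * Fraction(25, 48))
    # Weight the two alternatives by 4/13 and 9/13.
    return Fraction(4, 13) * upturned_honour + Fraction(9, 13) * upturned_plain


def one_honour_each_player():
    """Chance that each of the four players holds one honour."""
    upturned_honour = 6 * Fraction(13, 51) * Fraction(13, 50) * Fraction(13, 49)
    upturned_plain = (24 * Fraction(13, 51) * Fraction(13, 50)
                      * Fraction(13, 49) * Fraction(12, 48))
    return Fraction(4, 13) * upturned_honour + Fraction(9, 13) * upturned_plain


# Chance that the dealer holds all thirteen trumps.
all_trumps = Fraction(1, comb(51, 12))
```

In both honour problems the two alternatives, taken separately, yield the same fraction, which accords with the remark that the chance is not affected by the character of the card turned up.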

36. There is a set of dominoes which goes from double blank to double nine (each domino presenting either a combination—which occurs only once—of two digits, or a repetition of the same digit). What is the probability that a domino drawn from the set will prove to be one assigned beforehand? The probability is the reciprocal of the number of dominoes: which is 10 × 9/2 (the number of combinations of different digits) + 10 (the number of doubles) = 55.

37. Choice and Chance.—When we leave the sphere of games of chance and frame questions relating to ordinary life there is a danger of assuming distributions of probability which are far from probable. For example, let this be the question. The House of Commons formerly consisting of 489 English members, 60 Scottish and 103 Irish, what was the probability that a committee of three members should represent the three nationalities? An assumption of indifference where it does not exist is involved in the answer that the required probability is the ratio of the number of favourable triplets, viz. 489 × 60 × 103, to the total number of triplets, viz. (652 × 651 × 650)/3!. A similar absence of selection is postulated by the ordinary treatment of a question like the following. There being s candidates at an examination and n optional subjects from which each candidate chooses one (n > s), what is the probability that no two candidates should choose the same subject? If the candidates be arranged in any order, the probability that the second candidate should not choose the same subject as the first candidate is (n − 1)/n. The probability that the third candidate will not choose either of the two subjects taken by the aforesaid candidates is (n − 2)/n, and so on. Thus the required probability is

$$\frac{n(n-1)(n-2)\cdots\{n-(s-1)\}}{n^s}.$$
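The continued product just obtained is easily evaluated. The sketch below, in Python, writes n for the number of subjects and s for the number of candidates, as in the formula; the function name is an assumption, and the postulate of indifference noted in the text (every subject equally likely to be chosen by every candidate) is built into the calculation.

```python
from fractions import Fraction


def all_choices_distinct(n, s):
    """Chance that s candidates, each choosing one of n subjects at
    random and independently, all choose different subjects."""
    prob = Fraction(1)
    for k in range(s):
        # The (k + 1)th candidate must avoid the k subjects already taken.
        prob *= Fraction(n - k, n)
    return prob
```

With n = 365 and s = 23 the same product gives the familiar result that among 23 persons it is rather more likely than not that two share a birthday.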

38. When as in these cases the interest of the problem lies chiefly in the application of the theory of combinations, or permutations, there is a propriety in Whitworth's enunciation of the questions under the head of choice rather than chance. It comes to the same whether we say that there are x ways in which an event may happen, or that the probability of its happening in an assigned one of those ways is 1/x. For example, suppose that there are n couples waltzing at a ball; if the names of the men are arranged in alphabetical order, what is the probability that the names of their partners will also be in alphabetical order? The probability that the man who is first in alphabetical order should have for partner the lady who is first in that order is 1/n. The probability that the man who is second in alphabetical order should have for partner the lady who is second in that order is 1/(n − 1), and so on. Therefore the required probability is 1/n!. Or it may be easier to say that the number of ways, each consisting of a set of couples, in which the party can be arranged is n!; of which only one is favourable.

39. The same principle governs the following question. For how many days can a family of 10 continue to sit down to dinner in a different order each day; it not being indifferent who sits at the head of the table—what is the absolute, as well as the relative, position of the members? The number of permutations, viz. 10!, is the answer. If we are to attend to the relative position only—as would be natural if the question related to 10 children turning round a flypole—the number of different arrangements would be only 9!

40. Method of Equations in Finite Differences.—The last question may serve to introduce a method which Laplace has applied with great éclat to problems in probabilities. Let $y_n$ be the number of ways in which n men can take their places at a round table, without respect to their absolute position; and consider how the number will be increased by introducing an additional man. From every particular arrangement of the original n men can now be obtained n different arrangements of the n + 1 men (since the additional man may sit between any two of the party of n). Hence $y_{n+1} = n\,y_n$, an equation of differences of which the solution is $y_n = C\,(n-1)!$. The constant may be determined by considering the case in which n is 2.

41. The following example is not quite so simple. If a coin is thrown n times, what is the chance that head occurs at least twice running? Calling each sequence of n throws a “case,” consider the number of cases in which head never occurs twice running; let $u_n$ be this number, then $2^n - u_n$ must be the number of cases when head occurs at least twice successively. Consider the value of $u_{n+2}$; if the last or (n + 2)th throw be tail, $u_{n+2}$ includes all the cases ($u_{n+1}$) of the n + 1 preceding throws which gave no succession of heads; and if the last be head the last but one must be tail, and these two may be preceded by any one of the $u_n$ favourable cases for the first n throws. Consequently,

$$u_{n+2} = u_{n+1} + u_n.$$

If α, β are the roots of the quadratic $x^2 - x - 1 = 0$, this equation gives[89]

$$u_n = A\alpha^n + B\beta^n.$$

Here A and B are easily found from the conditions $u_1 = 2$, $u_2 = 3$; viz., taking α for the greater root,

$$A = \frac{3 + \sqrt5}{2\sqrt5}, \qquad B = -\frac{3 - \sqrt5}{2\sqrt5},$$

whence $u_n = \dfrac{1}{\sqrt5}\left\{\left(\dfrac{1+\sqrt5}{2}\right)^{n+2} - \left(\dfrac{1-\sqrt5}{2}\right)^{n+2}\right\}$.

The probability that head never turns up twice running is found by dividing this by $2^n$, the whole number of cases. This probability, of course, becomes smaller and smaller as the number of trials (n) is increased. This is a particular case of a more general problem solved by Laplace[90] as to the occurrence i times running of an event of which the probability at one trial is p.
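The equation of differences of paragraph 41 can be solved numerically by simple iteration from the two initial conditions, and the result compared with a direct count of sequences. The following is a sketch in Python; the function names are assumptions for the illustration, and the direct count is practicable only for moderate n.

```python
from fractions import Fraction
from itertools import product


def count_no_double_head(n):
    """Number of sequences of n throws in which head never occurs
    twice running, from u(n+2) = u(n+1) + u(n), u(1) = 2, u(2) = 3."""
    a, b = 2, 3
    if n == 1:
        return a
    for _ in range(n - 2):
        a, b = b, a + b
    return b


def count_brute(n):
    """The same number found by examining all 2**n sequences."""
    return sum(1 for seq in product("HT", repeat=n)
               if "HH" not in "".join(seq))


def chance_of_run(n):
    """Chance that head occurs at least twice running in n throws."""
    return 1 - Fraction(count_no_double_head(n), 2 ** n)
```

For ten throws the recurrence gives 144 cases free from a double head out of 1024, so that the chance of head occurring at least twice running is 55/64.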

42. In such problems where we now employ the calculus of finite difference Laplace employed his method of generating functions. A distinguished instance is afforded by the problem of points which was put by the Chevalier de Méré to Pascal and has exercised generations of mathematicians. It is thus stated by Laplace.[91] Two players of equal skill have staked equal sums; the stakes to belong to the player who shall have won a certain number of games. Suppose they agree to leave off playing when one player, A, wants x “points” (games to be won) in order to complete the assigned number, while the second player wants x′ points: how ought they to divide the stakes? This is a question in Expectation, but its difficulty consists in determining the probability that one of the players, say A, shall win the stakes. Let that probability be $y_{x,x'}$. Then, after the next game, if A has won, the probability of his winning the stakes will be $y_{x-1,x'}$. But if A loses, B winning, the probability will be $y_{x,x'-1}$. But these alternatives are equally likely. Accordingly the probability of A winning the stakes may be written

$$\tfrac12\,y_{x-1,x'} + \tfrac12\,y_{x,x'-1}.$$

This is the same probability as that which was before written $y_{x,x'}$. Equating the two expressions we have, for the function y, an equation of finite difference involving two variables, of which the solution is[92]

$$y_{x,x'} = \frac{1}{2^x}\left\{1 + \frac{x}{1}\cdot\frac12 + \frac{x(x+1)}{1\cdot 2}\cdot\frac{1}{2^2} + \cdots + \frac{x(x+1)\cdots(x+x'-2)}{1\cdot 2\cdots(x'-1)}\cdot\frac{1}{2^{x'-1}}\right\}.$$
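The equation of finite difference for the problem of points may be worked directly by recursion, the boundary conditions being that A is certain of the stakes when he wants no more points and certain to lose them when his adversary wants none. The sketch below, in Python, does this and compares the result with a series form of the solution; the function names, the boundary conditions as stated here, and the particular series used for comparison are assumptions of the illustration.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def win_probability(x, x_prime):
    """Chance that A, wanting x points, wins the stakes against B,
    wanting x_prime points, the players being of equal skill."""
    if x == 0:
        return Fraction(1)
    if x_prime == 0:
        return Fraction(0)
    half = Fraction(1, 2)
    return (half * win_probability(x - 1, x_prime)
            + half * win_probability(x, x_prime - 1))


def win_probability_series(x, x_prime):
    """Series form: A wins his last point after losing k games,
    for k = 0, 1, ..., x_prime - 1."""
    return sum(comb(x - 1 + k, k) * Fraction(1, 2) ** (x + k)
               for k in range(x_prime))
```

When A wants two points and B three, the stakes should accordingly be divided in the proportion of 11 to 5.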

43. The problem of points is to be distinguished from another classical problem, relating to a contest in which the winner has not simply to win a certain number of games, but to win a certain number of counters from his opponent.[93] Space does not admit even the enunciation of other complicated problems to which Laplace has applied the method of generating functions.

44. Probability of Causes Deduced from Observed Events.—Problems relating to the probability of alternative causes, deduced from observed effects, are usually placed in the separate category of “inverse” probability, though, as above remarked,[94] they do not necessarily involve different principles. The difference principally consists in the need of evidence, other than that which is afforded by the observed event, as to the probability of the alternative causes existing and operating. The following is an example free from the difficulty incident to unverified a priori probabilities, which commonly besets this kind of problem. A digit having been taken at random from mathematical tables (or the expansion of an endless constant such as π); a second digit is obtained by taking from a random succession of digits one that added to the first digit makes a sum greater than 9. Given a result thus formed, what are the respective probabilities that the second digit should have been 0, 1, 2, . . . 8 or 9? In the long run the first digit assumes with equal frequency the values 0, 1, 2 . . . 8, 9. Accordingly the second digit can never be 0. There is only one chance of its being 1, namely when the first digit is 9. If the second digit is 2, and the first either 8 or 9, the observed effect will be produced. And so on. If the second digit is 9, the effect may occur in nine ways. Accordingly in the long run of pairs thus formed it will occur that the cases or causes which are defined by the circumstances that the second digit is 0, 1, 2, . . . 8, 9, respectively, will occur with frequencies in the following ratios 0 : 1 : 2 . . . 8 : 9. The probability of the observed event having been caused by a particular (second) digit, e.g. 7, is 7/(0 + 1 + 2 + . . + 9) = 7/45.
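The example of the two digits can be verified by counting all the pairs which produce the observed effect. The following Python sketch assumes, with the text, that both digits are in the long run equally likely to take any of the values 0 to 9; the function name is an assumption for the illustration.

```python
from fractions import Fraction
from itertools import product


def second_digit_posterior():
    """Probabilities that the second digit was 0, 1, ..., 9, given
    that the sum of the two digits is greater than 9."""
    counts = [0] * 10
    for first, second in product(range(10), repeat=2):
        if first + second > 9:
            counts[second] += 1   # one more way for this second digit
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


posterior = second_digit_posterior()
```

The list `posterior` is indexed by the value of the second digit, so that `posterior[7]` is the probability assigned to the digit 7.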

45. The following example taken from Laplace[95] is of a more familiar type. An urn is known to contain three balls made up of white and black balls in some unknown proportion. From this urn a ball is extracted m times (being each time replaced after extraction). If a white ball is drawn every time, what are the respective probabilities that the number of white balls in the urn is 3, 2, 1 or 0? By parity of reasoning it appears that in the first case the result is certain, its probability 1, in the second case the probability of the observed event occurring is $(\tfrac23)^m$, in the third case that probability is $(\tfrac13)^m$, in the fourth case zero. Accordingly the respective inverse probabilities are in the ratios

$$1 : \left(\tfrac23\right)^m : \left(\tfrac13\right)^m : 0;$$

provided that (as in the preceding example, with respect to the second digits) the alternative causes, the four possible constitutions of the urn, are (a priori) equally probable. This is rather a bold assumption with respect to the contents of concrete urns[96] and similar groupings; but with regard to things in general may perhaps be justified on the principle of cross-series.[97]
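The inverse probabilities for the urn may be obtained by weighting each supposed constitution by the chance it gives to the observed event and dividing by the sum of the weights. The sketch below, in Python, follows that rule; the function name and the optional argument `prior` are assumptions, the latter being included only to show where the bold assumption of equal a priori probabilities enters.

```python
from fractions import Fraction


def urn_posterior(m, balls=3, prior=None):
    """Probabilities that the urn holds 0, 1, ..., balls white balls,
    after m drawings with replacement have all given white."""
    if prior is None:
        # Assumption: every constitution equally probable a priori.
        prior = [Fraction(1, balls + 1)] * (balls + 1)
    weights = [prior[w] * Fraction(w, balls) ** m for w in range(balls + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

After two white drawings, for example, `urn_posterior(2)` gives the constitutions of 0, 1, 2 and 3 white balls the probabilities 0, 1/14, 4/14 and 9/14, which are in the ratios stated in the text for m = 2.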

46. Often in the investigation of causes we are not thrown back on unverified a priori probabilities. We have some specific evidence though of a very rough character. An example has been cited from Mill in a preceding paragraph.[98] Against the improbabilities calculated by the methods of the present section there has often to be balanced an improbability evidenced by common sense, which does not admit of mathematical calculation. Bertrand[99] puts the following case. The manager of a gambling house has purchased a roulette table which is found to give red 5300 times, black 4700 times, out of 10,000 trials. The purchaser claims an indemnity from the maker. What can the calculus tell us as to the justice of the claim? Nothing precise, yet something worth knowing. The a priori improbability of the maker's inaccuracy must be very great to overcome the improbability of such an event occurring by chance if the machine is accurately made (accuracy being defined, say, by the condition that the ratio of red to [red + black] would prove to be in the indefinitely long run of trials between 0.499 and 0.501). The odds against the so defined event occurring are found to be some millions to one.[100]

47. The difficulty recurs in more practical problems: for instance, certain symptoms having been observed, to find the probability that they are produced by a particular disease. Such concrete applications of probabilities are often open to the sort of objections which have been urged against the classical use of the calculus to determine the probability that witnesses are true, or judges just.

48. Probability of Testimony.—The application of probabilities to testimony proceeds upon two assumptions: (1) that to each witness there pertains a coefficient of probability representing the average frequency with which he speaks the truth or untruth, (2) that the statements of witnesses are independent in the sense proper to probabilities. Thus if two witnesses concur in making a statement which must be either true or false, their agreement is a circumstance which is only to be accounted for by one of two alternatives: either that they are both speaking the truth, or both false. If the average truthfulness—the credibility—of one witness is p, that of the other p′, then the probabilities of the two alternative explanations are to each other in the ratio pp′ : (1 − p)(1 − p′); the probability that the statement is true is pp′/{pp′ + (1 − p)(1 − p′)}. So far no account is taken of the a priori probability of the statement. This evidence may be treated as an independent witness. Thus, if a person whose credibility is p asserts that he has seen at whist a hand consisting entirely of trumps dealt from a well-shuffled pack of cards, there are two alternative explanations of his assertion, with probabilities in the ratio

p × 0.000,000,000,0063 : (1 - p) × 0.999,999,999,993.

The truthfulness of the witness must be very great to outweigh the a priori improbability of the fact.[101] These formulae are easily extended to the case of three or more witnesses. The probability of a statement made by three witnesses of respective credibilities p, p′, p″ is

pp′p″/{pp′p″ + (1 − p)(1 − p′)(1 − p″)}.

For r witnesses we have

$$\frac{p_1p_2\cdots p_r}{p_1p_2\cdots p_r + (1-p_1)(1-p_2)\cdots(1-p_r)}.$$

Dividing both the numerator and the denominator by $p_1p_2\cdots p_r$, we see that the probability of the statement increases with the number of the witnesses, provided that for every witness (1 − p)/p is a proper fraction, and accordingly p > ½. As an example of several witnesses, let us inquire how many witnesses to a fact such as a hand at whist consisting entirely of trumps would be required in order to make it an even chance that the fact occurred, supposing the credibility of each witness to be 9/10.[102] Let x be the required number of witnesses. We have then $1/\{1 + (\tfrac19)^x / 0.000{,}000{,}000{,}006\} = \tfrac12$, or x log 9 = 11.2, so that x = 11.7 nearly. Whence, if x is 12, it is more than an even chance that the statement is true.
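The formulae for concurrent testimony, with the a priori probability treated as an additional independent witness, can be put into a few lines of computation. The following is a sketch in Python under the two assumptions on which the text proceeds (a fixed credibility for each witness, and independence); the function name and the argument `prior` are assumptions of the illustration, and credibilities should be passed as exact fractions.

```python
from fractions import Fraction


def statement_probability(credibilities, prior=Fraction(1, 2)):
    """Probability that a statement is true, given that witnesses of
    the listed credibilities all concur in asserting it."""
    true_weight = Fraction(prior)        # fact occurred, all speak truth
    false_weight = 1 - Fraction(prior)   # fact did not occur, all speak falsely
    for p in credibilities:
        true_weight *= Fraction(p)
        false_weight *= 1 - Fraction(p)
    return true_weight / (true_weight + false_weight)


# A priori chance of a hand consisting entirely of trumps.
all_trumps = Fraction(1, 158753389900)
thirteen = statement_probability([Fraction(9, 10)] * 13, all_trumps)
```

With the default `prior` of one half the function reduces to the expressions for two, three or r witnesses; the same function may be evaluated for any number of concordant witnesses to the hand of trumps.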

49. When an event may occur in two or more ways equally probable a priori, the formulae show that the probability of the statement will depend on the credibility of the witnesses; and accordingly the explicit consideration of a priori probabilities may, as in our first instance, be omitted. One who reports the number of a ticket obtained at a lottery ordinarily makes a statement against which there is no a priori improbability; but if the number is one which had been predicted, there is an a priori improbability 1/n that an assigned ticket should be drawn out of a mélange of n tickets. Similar reasoning is applicable to the probability that the decisions of judges, or the verdicts of juries, are right.

50. The assumptions upon which all this reasoning is based are open to serious criticisms. The postulated independence of witnesses and judges is frequently not realized. The revolutionary tribunal which condemned Condorcet was affected by an identity of illusions and passions which that mathematician had not taken into account when he calculated “that the probability of a decision being conformable to truth will increase indefinitely as the number of voters is increased.”[103]

51. The use of coefficients based on the average truthfulness or justice of each witness and judge involves the neglect of particulars which ought to influence our estimate of probability, such as the consistency of a witness's statements and the relation of the case to the interests, prejudices and capacities of the witness or the judge.[104] Thus even in so simple a case as the alleged occurrence of an extraordinary hand at whist, the “truthfulness” of the witness in the general sense of the term may not adequately represent his liability to have made a mistake about the shuffling.[105] A neglect of particulars, however, is sometimes practised with success in the applications of statistics (insurance, for instance). Perhaps there are broad results and general rules to which the mathematical theory may be applicable. Perhaps the laborious researches of Poisson on the “probability of judgments” are not, as they have been called by an eminent mathematician, absolument rien.[106] More than mathematical interest may attach to Laplace's investigation of a rule appropriate to cases like the following. An event (suppose the death of a certain person) must have proceeded from one of n causes A, B, C, &c., and a tribunal has to pronounce on which is the most probable. Professor Morgan Crofton's original proof of Laplace's rule is here reproduced.[107]

52. Let each member of the tribunal arrange the causes in the order of their probability according to his judgment, after weighing the evidence. To compare the presumption thus afforded by any one judge in favour of a specified cause with that afforded by the other judges, we must assign a value to the probability of the cause derived solely from its being, say, the rth on his list. As he is supposed to be unable to pronounce any closer to the truth than to say (suppose) H is more likely than D, D more likely than L, &c., the probability of any cause will be the average value of all those which that probability can have, given simply that it always occupies the same place on the list of the probabilities arranged in order of magnitude. As the sum of the n probabilities is always 1, the question reduces to this:—

Any whole (such as the number 1) is divided at random into n parts, and the parts are arranged in the order of their magnitude—least, second, third, . . . greatest; this is repeated for the same whole a great number of times; required the mean value of the least, of the second, &c., parts, up to that of the greatest.

[Figure: a line AB, prolonged beyond B by a small increment Bb.]


Let the whole in question be represented by a line AB = a, and let it be divided at random into n parts by taking n − 1 points indiscriminately on it. Let the required mean values be

$\lambda_1 a,\ \lambda_2 a,\ \lambda_3 a,\ \ldots,\ \lambda_n a,$

where $\lambda_1, \lambda_2, \lambda_3, \ldots$ must be constant fractions. As a great number of positions is taken in AB for each of the $n - 1$ points, we may take a as representing that number; and the whole number N of cases will be

$N = a^{n-1}.$

The sum of the least parts, in every case, will be

$S_1 = N\lambda_1 a = \lambda_1 a^n.$

Let a small increment, Bb = $\delta a$, be added on to the line AB at the end B; the increase in this sum is $\delta S_1 = n\lambda_1 a^{n-1}\,\delta a$. But, in dividing the new line Ab, either the $n - 1$ points all fall on AB as before, or $n - 2$ fall on AB and 1 on Bb (the cases where 2 or more fall on Bb are so few we may neglect them). If all fall on AB, the least part is always the same as before except when it is the last, at the end B of the line, and then it is greater than before by $\delta a$; as it falls last in $n^{-1}$ of the whole number of trials, the increase in $S_1$ is $n^{-1}a^{n-1}\,\delta a$. But if one point of division falls on Bb, the number of new cases introduced is $(n - 1)a^{n-2}\,\delta a$; but, the least part being now an infinitesimal, the sum $S_1$ is not affected; we have therefore

$\delta S_1 = n\lambda_1 a^{n-1}\,\delta a = n^{-1}a^{n-1}\,\delta a;$

$\therefore\ \lambda_1 = n^{-2}.$

To find $\lambda_2$, reasoning exactly in the same way, we find that where one point falls on Bb and $n - 2$ on AB, as the least part is infinitesimal, the second least part is the least of the $n - 1$ parts made by the $n - 2$ points; consequently, if we put $\lambda_1'$ for the value of $\lambda_1$ when there are $n - 1$ parts only, instead of n,

$\delta S_2 = n\lambda_2 a^{n-1}\,\delta a = n^{-1}a^{n-1}\,\delta a + (n - 1)a^{n-2}\lambda_1' a\,\delta a,$

$n\lambda_2 = n^{-1} + (n - 1)\lambda_1'$; but $\lambda_1' = (n - 1)^{-2}$;

$n\lambda_2 = n^{-1} + (n - 1)^{-1}.$

In the same way we can show generally that

$n\lambda_r = n^{-1} + (n - 1)\lambda'_{r-1};$

and thus the required mean value of the rth part is

$\lambda_r a = a n^{-1}\{n^{-1} + (n - 1)^{-1} + (n - 2)^{-1} + \cdots + (n - r + 1)^{-1}\}.$

Thus each judge implicitly assigns the probabilities

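$\frac{1}{n^2},\quad \frac{1}{n}\left(\frac{1}{n} + \frac{1}{n-1}\right),\quad \frac{1}{n}\left(\frac{1}{n} + \frac{1}{n-1} + \frac{1}{n-2}\right),\ \ldots$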

to the causes as they stand on his list, beginning from the lowest. The values assigned for the probability of each alternative cause may be treated as so many equally authoritative observations representing a quantity which it is required to determine. According to a general rule given below[108] the observations are to be added and divided by their number; but here if we are concerned only with the relative magnitudes of the probabilities in favour of each alternative it suffices to compare the sums of the observations. We thus arrive at Laplace's rule. Add the numbers found on the different lists for the cause A, for the cause B, and so on; that cause which has the greatest sum is the most probable.
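The mean values obtained above for the ordered parts of a randomly divided line can be checked roughly by simulation. The Python sketch below is only an illustration: the function names, the choice n = 4, the number of trials and the random seed are assumptions made here and form no part of Crofton's proof.

```python
import random


def mean_ordered_parts(n, trials=20000, seed=0):
    """Estimate the mean of the least, second, ..., greatest of n random parts of a unit line."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        # n - 1 points taken indiscriminately on the line divide it into n parts.
        cuts = sorted(rng.random() for _ in range(n - 1))
        edges = [0.0] + cuts + [1.0]
        parts = sorted(edges[i + 1] - edges[i] for i in range(n))
        for r, part in enumerate(parts):
            sums[r] += part
    return [s / trials for s in sums]


def crofton_mean(n, r):
    """Mean of the r-th smallest part: (1/n) * (1/n + 1/(n-1) + ... + 1/(n-r+1))."""
    return sum(1.0 / (n - j) for j in range(r)) / n


n = 4
simulated = mean_ordered_parts(n)
predicted = [crofton_mean(n, r) for r in range(1, n + 1)]
for r, (s, q) in enumerate(zip(simulated, predicted), start=1):
    print(f"part {r}: simulated {s:.4f}, formula {q:.4f}")
```

The predicted values necessarily add up to unity, since the parts always make up the whole line.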

53. Probability of Future Effects deduced from Causes.—Another class of problems which it is usual to place in a separate category are those which require that, having ascended from an observed event to probable causes, we should descend to the probability of collateral effects. But no new principle is involved in such problems. The reason may be illustrated by the following modification of the problem about digits which was above set[109] to illustrate the method of deducing the probability of alternative causes. What is the probability that if to the second digit which contributed to the effect there described there is added a third digit taken at random, the sum of the second and third will be greater than 10 (or any other assigned figure)? The probabilities—the a posteriori probabilities derived from the observed event (that the sum of the first and second digit exceeds 9)—each multiplied by 45, of the alternatives constituted by the different values 0, 1, 2, . . . 8, 9 of the second figure are written in the first of the subjoined rows.

0 1 2 3  4  5  6  7  8  9
0 0 1 2  3  4  5  6  7  8
0 0 2 6 12 20 30 42 56 72

Below each of these probabilities is written the probability (× 10) that if the corresponding cause existed the effect under consideration would result. The product of the two probabilities pertaining to each alternative way of producing the event gives the probability of the event occurring in that way. The sum of these products, which are written in the third row, divided by 45 × 10, viz. 240/450 = 8/15, is the required probability. It may be expected that actual trial would verify this result.
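The result can also be confirmed by direct enumeration. The following Python sketch assumes that the three digits are independent and that each is equally likely to be any of 0 to 9; the function name and its parameter are illustrative only.

```python
from fractions import Fraction


def prob_second_plus_third_exceeds(threshold=10):
    """P(second + third > threshold), given that first + second exceeds 9."""
    favourable = 0
    total = 0
    for first in range(10):
        for second in range(10):
            if first + second <= 9:
                continue  # the observed event did not occur; discard this case
            for third in range(10):
                total += 1
                if second + third > threshold:
                    favourable += 1
    return Fraction(favourable, total)


print(prob_second_plus_third_exceeds())  # 8/15
```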

54. “Rule of Succession.”—One case of inferred future effects, sometimes called the “rule of succession,” claims special notice as having been thought to furnish a test for the cogency of induction. A white ball has been extracted (with replacement after extraction) n times from an immense number of black and white balls mixed in some unknown proportion; what is the probability that at the (n + 1)th trial a white ball will be drawn? It is assumed that each constitution of the mélange[110] formed by the proportion of white balls (the probability of drawing a white ball), say p, is a priori as likely to have any one value as another of the series

Δp, 2Δp, 3Δp, . . . 1 - 2Δp, 1 - Δp, 1.

Whence a posteriori the probability of any particular value of p as the cause of the observed recurrence is $p^n/\sum p^n$, where p in the denominator receives every value from Δp to 1. The probability that this cause, if it exists, will produce the effect in question, the extraction of a white ball at the (n + 1)th trial, is p. The probability of the event, obtained by summing the probabilities of all the different ways in which it may occur, is accordingly $\sum p^{n+1}/\sum p^n$, where p both in the numerator and the denominator is to receive all possible values between Δp and 1. In the limit we have

$\int_0^1 p^{n+1}\,dp \Big/ \int_0^1 p^n\,dp = \frac{n+1}{n+2}.$

In particular if n = 1, the probability that an event which has been observed once will recur on a second trial is ⅔. These results are perhaps not so absurd as they have seemed to some critics, when the principle of “cross-series”[111] is taken into account. Among authorities who seem to attach importance to the rule of succession, in addition to the classical writers on Probabilities, may be mentioned Lotze[112] and Karl Pearson.[113]
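The passage from the finite sums to the limit may be illustrated numerically. The sketch below evaluates the ratio of sums over the series Δp, 2Δp, . . . 1 for a small Δp; the function name and the number of steps are assumptions chosen for illustration.

```python
def succession_ratio(n, steps=10000):
    """Ratio sum(p**(n+1)) / sum(p**n) over p = dp, 2*dp, ..., 1 with dp = 1/steps."""
    dp = 1.0 / steps
    values = [k * dp for k in range(1, steps + 1)]
    return sum(p ** (n + 1) for p in values) / sum(p ** n for p in values)


for n in (1, 2, 5):
    print(n, succession_ratio(n), (n + 1) / (n + 2))
```

As the number of steps is increased the ratio approaches (n + 1)/(n + 2).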

Section III.—Calculation of Expectation.

55. Analogues of Preceding Problems.—This section presents problems analogous to the preceding. If n balls are extracted from an urn containing black and white balls mixed up in the proportions p : (1 - p), each ball being replaced after extraction, the expected number of white balls in the set of n is by definition np.[114] It may be instructive to verify the consistency of first principles by demonstrating this axiomatic proposition.[115] Consider the respective probabilities that in the series of n trials there will occur no white balls, exactly one white ball, exactly two white balls, and so on, as shown in the following scheme:—

No. of white balls: 0, 1, 2, . . . n
Corresponding probability: $(1-p)^n$, $\frac{n!}{(n-1)!\,1!}(1-p)^{n-1}p$, $\frac{n!}{(n-2)!\,2!}(1-p)^{n-2}p^2$, . . . $p^n$

To calculate the expectation of white balls it is proper to multiply 1 by the probability that exactly one white ball will occur, 2 by the probability of two white balls, and so on. We have thus for the required expectation

$\frac{n!}{(n-1)!}(1-p)^{n-1}p + \frac{n!}{(n-2)!}(1-p)^{n-2}p^2 + \ldots + \frac{n!}{(n-r)!\,(r-1)!}(1-p)^{n-r}p^r + \ldots + np^n$

$= np\left[(1-p)^{n-1} + (n-1)(1-p)^{n-2}p + \ldots + \frac{(n-1)!}{(n-r)!\,(r-1)!}(1-p)^{n-r}p^{r-1} + \ldots + p^{n-1}\right]$

$= np[(1-p) + p]^{n-1} = np.$
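The same verification can be carried out in exact arithmetic. This short Python sketch sums r times the probability of exactly r white balls; the function name and the sample values of n and p are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def expected_white(n, p):
    """Sum of r * P(exactly r white balls in n draws with replacement)."""
    return sum(r * comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1))


p = Fraction(1, 3)
print(expected_white(7, p), 7 * p)  # both 7/3
```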

The expectation in the case where the balls are not replaced—not similarly axiomatic—may be found by approximative formulae.[116]

56. Games of Chance.—With reference to the topic which occurred next under the head of probabilities, a distinction must be drawn between the number of trials which make it an even chance that all the faces of a die will have turned up at least once, and the number of trials which are made on an average before that event occurs. We may pass from the probability to expectation in such cases by means of the following theorem. If s is the number of trials in which on an average success (such as turning up every face of a die at least once) is obtained, then s = 1 + f1 + f2 + . . . ; where fr denotes the probability of failing in the first r trials. For the required expectation is equal to 1 × probability of succeeding at the first trial + 2 × probability of succeeding at the second trial + &c. Now the probability of succeeding at the first trial is 1 − f1; the probability of succeeding at the second trial (after failing at the first) is f1 − f2; the probability of succeeding at the third trial is similarly f2 − f3; and so on. Substituting these values for the expression for the expectation, we have the proposition which was to be proved. In the proposed problem

$f_n = 6(5/6)^n - 15(4/6)^n + 20(3/6)^n - 15(2/6)^n + 6(1/6)^n.$

Assigning to n in each of these terms every value from 1 to ∞ we have $6 \cdot \frac{5/6}{1 - 5/6} = 30$ for the sum of the first set, with corresponding expressions for the sets formed from the following terms. Whence s = 1 + 30 − 30 + 20 − 15/2 + 6/5 = 14.7. By parity of reasoning it is proved that on an average 7 419/630 (that is, 4829/630, or about 7.67) cards[117] must be dealt before at least one card of every suit has turned up.[118]
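Both averages can be recomputed exactly from the theorem that the expected number of trials is the sum of the probabilities of failure. The Python sketch below is a hedged illustration: the function names are invented here, the die is assumed fair, and the cards are assumed to be dealt without replacement from a full pack of 52.

```python
from fractions import Fraction
from math import comb


def expected_throws_all_faces(faces=6):
    """s = 1 + f_1 + f_2 + ..., with f_n found by inclusion and exclusion."""
    total = Fraction(1)
    for k in range(1, faces):
        x = Fraction(faces - k, faces)  # chance that k assigned faces are all avoided in one throw
        # The geometric series x + x**2 + ... sums to x / (1 - x).
        total += (-1) ** (k + 1) * comb(faces, k) * x / (1 - x)
    return total


def expected_cards_all_suits(suits=4, per_suit=13):
    """Sum over d = 0, 1, 2, ... of P(some suit is still absent after d cards)."""
    deck = suits * per_suit
    total = Fraction(0)
    for dealt in range(deck + 1):
        total += sum(
            (-1) ** (k + 1) * comb(suits, k)
            * Fraction(comb(deck - k * per_suit, dealt), comb(deck, dealt))
            for k in range(1, suits + 1)
        )
    return total


print(expected_throws_all_faces())   # 147/10
print(expected_cards_all_suits())    # 4829/630
```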

57. Dominoes are taken at random (with replacement after each extraction) from the set of the kind described in a preceding paragraph.[119] What is the difference (irrespective of sign) to be expected between the two numbers on each domino? The digit 9, according as it is combined with itself, or any smaller digit, gives the sum of differences

0 + 1 + 2 + . . . + 9.

The digit 8 combined with itself or any smaller digit gives the sum of differences 0 + 1 + 2 + . . . + 8, and so on. The sum of the differences is $\sum \tfrac{1}{2}r(r+1)$, where r has every integer value from 1 to 9 inclusive, $= \frac{9(9+1)(9+2)}{2 \cdot 3} = 165$. And the number of the differences is 10 + 9 + 8 + . . . + 2 + 1 = 55. Therefore the required expectation is 165/55 = 3.

58. Digits taken at Random.—The last question is to be distinguished from the following. What is the difference (irrespective of sign) between two digits, taken at random from mathematical tables, or the expansion of an endless constant like π? The combinations of different digits will now occur twice as often as the repetitions of the same digit. The sum of the differences may now be obtained from the consideration that the sum of the positive differences must be equal to the sum of the negative differences when the null differences are distributed equally between the positive and the negative set. The sum of the positive set is, as before, 165. But the denominator to be placed under this numerator is not the same as before, but less by half the number of null differences, that is 5. We thus obtain for the required expectation 165/50 = 3.3.
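The distinction between the two questions is easily exhibited by enumeration. In the Python sketch below the dominoes are taken to be the 55 unordered pairs of digits from 0 to 9, each equally likely, and the random digits to be the 100 ordered pairs, each equally likely; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product


def mean_domino_difference(max_digit=9):
    """Each unordered pair (including doubles) counts once."""
    dominoes = list(combinations_with_replacement(range(max_digit + 1), 2))
    return Fraction(sum(b - a for a, b in dominoes), len(dominoes))


def mean_digit_difference(max_digit=9):
    """Each ordered pair counts once, so unlike digits occur twice as often as doubles."""
    pairs = list(product(range(max_digit + 1), repeat=2))
    return Fraction(sum(abs(a - b) for a, b in pairs), len(pairs))


print(mean_domino_difference())  # 3
print(mean_digit_difference())   # 33/10
```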

59. A simple verification of this prediction may thus be obtained. In a table of logarithms note any two digits so situated as to afford no presumption of close correlation; for instance, in the last place of the logarithm of 10009 the digit 7 and in the last place of the logarithm of 10019 the digit 4, and take the difference between these two, viz. 3, irrespective of sign. Proceed similarly with the similarly situated pair which form the last places of the logarithms of 10029 and 10039; for which the difference is 1, and so on. The mean of the differences thus found ought to be approximately 3.3. Experimenting thus on the last digits of logarithms, in Hutton's tables extending to seven places, from the logarithm of 10009 to the logarithm of 10909, the writer has found for the mean of 250 differences, 3.2.

60. Points taken at Random.—By parity of reasoning it may be shown that if two different milestones are taken at random on a road n miles long (there being a stone at the starting-point) their average distance apart is ⅓(n + 2).

61. If instead of finite differences as in the last two problems the intervals between the numbers or degrees which may be selected are indefinitely small, we have the theorem that the mean distance between two points taken at random on a finite straight line is a third of the length of that straight line.
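The milestone result, and its passage to the continuous case, may be checked by counting. The sketch assumes stones at every whole mile from 0 to n and every pair of distinct stones equally likely; the function name is an illustrative choice.

```python
from fractions import Fraction
from itertools import combinations


def mean_milestone_distance(n):
    """Average distance between two different stones among those at 0, 1, ..., n."""
    pairs = list(combinations(range(n + 1), 2))
    return Fraction(sum(b - a for a, b in pairs), len(pairs))


for n in (1, 5, 50):
    # The last column is the mean distance as a fraction of the length of the road.
    print(n, mean_milestone_distance(n), float(mean_milestone_distance(n) / n))
```

As n increases, the mean distance divided by the length of the road tends to one third, in agreement with the theorem for points on a line.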

62. Rules for Voting at Elections.—The fortuitous division of a straight line is happily employed by Professor Morgan Crofton to exhibit Laplace's method of determining the worth of several candidates by combining the votes of electors. There is a close relation between this method and the method above given for determining the probabilities of several alternatives by combining the judgments of different judges.[120] But there is this difference—that the several estimates of worth, unlike those of probability, are not subject to the condition that their sum should be equal to a constant quantity (unity). The quaesita are now expectations, not probabilities. Professor Morgan Crofton's version[121] of the argument is as follows. Suppose there are n candidates for an office; each elector is to arrange them in what he believes to be the order of merit; and we have first to find the numerical value of the merit he thus implicitly attributes to each candidate. Fixing on some limit a as the maximum of merit, n arbitrary values less than a are taken and then arranged in order of magnitude—least, second, third, . . . greatest; to find the mean value of each.





[Figure: a line AB with the points X, Y, Z marked in order between A and B.]

Take a line AB = a, and set off n arbitrary lengths AX, AY, AZ . . . beginning at A; that is, n points are taken at random in AB. Now the mean values of AX, XY, YZ, . . . are all equal; for if a new point P be taken at random, it is equally likely to be 1st, 2nd, 3rd, &c., in order beginning from A, because out of n + 1 points the chance of an assigned one being 1st is $(n + 1)^{-1}$; of its being 2nd $(n + 1)^{-1}$; and so on. But the chance of P being 1st is equal to the mean value of AX divided by AB; of its being 2nd M(XY) ÷ AB; and so on. Hence the mean value of AX is $AB\,(n + 1)^{-1}$; that of AY is $2AB\,(n + 1)^{-1}$; and so on. Thus the mean merit assigned to the several candidates is

$a(n + 1)^{-1},\ 2a(n + 1)^{-1},\ 3a(n + 1)^{-1},\ \ldots,\ na(n + 1)^{-1}.$

Thus the relative merits may be estimated by writing under the names of the candidates the numbers 1, 2, 3, . . . n. The same being done by each elector, the probability will be in favour of the candidate who has the greatest sum.
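The rule just stated amounts to adding, for each candidate, the places assigned to him on the several lists. A minimal Python sketch follows; the function name and the three sample ballots are hypothetical, and each ballot is assumed to list the candidates from least to greatest merit.

```python
def laplace_rank_sums(ballots):
    """Add up the numbers 1, 2, ..., n written under each name on every list."""
    totals = {}
    for ballot in ballots:
        # The candidate placed lowest receives 1, the next 2, and so on.
        for score, candidate in enumerate(ballot, start=1):
            totals[candidate] = totals.get(candidate, 0) + score
    return totals


ballots = [
    ["C", "B", "A"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]
totals = laplace_rank_sums(ballots)
print(totals)                        # {'C': 4, 'B': 6, 'A': 8}
print(max(totals, key=totals.get))   # A
```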

Practically it is to be feared that this plan would not succeed, because, as Laplace observes, not only are electors swayed by many considerations independent of the merit of the candidates, but they would often place low down in their list any candidate whom they judged a formidable competitor to the one they preferred, thus giving an unfair advantage to candidates of mediocre merit.

63. This objection is less appropriate to competitive examinations, to which the method may seem applicable. But there is a more fundamental objection in this case, if not indeed in every case, to the reasoning on which the method rests: viz. that there is supposed an a priori distribution of values which is in general not supposable; viz. that the several estimates of worth, the marks given to different candidates by the same examiner, are likely to cover evenly the whole of the tract between the minimum and maximum, e.g. between 0 and 100. Experience, fortified by theory, shows that very generally such estimates are not thus indifferently disposed, but rather in an order which will presently be described as the normal law of error.[122] The theorem governing the case would therefore seem to be not that which is applied by Laplace and Morgan Crofton, but that which has been investigated by Karl Pearson,[123] a theorem which does not lend itself so readily to the purpose in hand.[124]

64. Expectation of Advantage.—The general examples of expectation which have been given may be supplemented by some appropriate to that special use of the term which Laplace has sanctioned when he considers the subject of expectation as a “good”; in particular money, or that for the sake of which money is desired, “moral” advantage, in more modern phrase utility or satisfaction.

65. Pecuniary Advantage.—The most important calculations of pecuniary expectation relate to annuities and insurance; based largely on life tables from which the expectation of life itself, as well as of money value at the end, or at any period, of life is predicted. The reader is referred to these heads for practical exemplifications of the calculus. It must suffice here to point out how the calculations are facilitated by the adoption of a law of frequency, the Gompertz or the Gompertz-Makeham law, which on the one hand can hardly be ranked with hypotheses resting on a vera causa, yet on the other hand is not purely empirical, but is recommended, as germane to the subject-matter, by colourable suppositions.[125]

66. There is space here only for one or two simple examples of money as the subject of expectation. Two persons A and B throw a die alternately, A beginning, with the understanding that the one who first throws an ace is to receive a prize of £1. What are their respective expectations?[126] The chance that the prize should be won at the first throw is 1/6, the chance that it should be won at the second throw is $\frac{5}{6} \cdot \frac{1}{6}$; at the third throw $(\frac{5}{6})^2 \cdot \frac{1}{6}$, at the fourth throw $(\frac{5}{6})^3 \cdot \frac{1}{6}$, and so on. Accordingly the expectation of A

$= £1 \times \frac{1}{6}\{1 + (\frac{5}{6})^2 + (\frac{5}{6})^4 + \ldots\};$

of B

$= £1 \times \frac{1}{6} \cdot \frac{5}{6}\{1 + (\frac{5}{6})^2 + (\frac{5}{6})^4 + \ldots\}.$

Thus A's expectation is to B's as 1 : 5/6. But their expectations must together amount to £1. Therefore A's expectation is 6/11 of a pound, B's 5/11.
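The two series can be summed exactly. In the sketch below the function name and its parameters are illustrative; p is the chance of throwing an ace at a single throw.

```python
from fractions import Fraction


def alternate_throw_expectations(p=Fraction(1, 6), prize=1):
    """Expectations of the first and second thrower when they throw alternately."""
    q = 1 - p
    # Each series is geometric with ratio q**2, so its sum is 1 / (1 - q**2).
    first = prize * p / (1 - q * q)
    second = prize * p * q / (1 - q * q)
    return first, second


print(alternate_throw_expectations())  # (Fraction(6, 11), Fraction(5, 11))
```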

67. There are n tickets in a bag, numbered 1, 2, 3, . . . n. A man draws two tickets at once, and is to receive a number of sovereigns equal to the product of the numbers drawn. What is his expectation?[127] It is the number of pounds expressed by a fraction of which the denominator is the number of possible products, $\tfrac{1}{2}n(n - 1)$, and the numerator is the sum of all possible products $= \tfrac{1}{2}\{(1 + 2 + 3 + \ldots + n)^2 - (1^2 + 2^2 + \ldots + n^2)\}$. Whence the required number (of pounds) is found to be $\tfrac{1}{12}(n + 1)(3n + 2)$. The result may be contrasted with what it would be if the two tickets were not to be drawn at once, but the second after replacement of the first. On this supposition the expectation in respect of one of the tickets separately is $\tfrac{1}{2}(n + 1)$. Therefore, as the two events are now independent, the expectation of the product,[128] being the product of the expectations, is $\{\tfrac{1}{2}(n + 1)\}^2$.
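Both expectations can be checked by enumerating the possible draws. The function names below are illustrative, and every pair of tickets is assumed equally likely.

```python
from fractions import Fraction
from itertools import combinations, product


def expected_product_drawn_at_once(n):
    """Two different tickets drawn together from those numbered 1..n."""
    pairs = list(combinations(range(1, n + 1), 2))
    return Fraction(sum(i * j for i, j in pairs), len(pairs))


def expected_product_with_replacement(n):
    """Second ticket drawn after the first has been put back."""
    pairs = list(product(range(1, n + 1), repeat=2))
    return Fraction(sum(i * j for i, j in pairs), len(pairs))


n = 10
print(expected_product_drawn_at_once(n), Fraction((n + 1) * (3 * n + 2), 12))
print(expected_product_with_replacement(n), Fraction(n + 1, 2) ** 2)
```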

68. Peter throws three coins, Paul two. The one who obtains the greater number of heads wins £1. If the number of heads are equal, they play again, and so on, until one or other obtains a greater number of heads. What are their respective expectations?[129] At the first trial there are three alternatives: (α) Peter obtains more heads than Paul, (β) an equal number, (γ) fewer. The cases in favour of α are (1) Peter obtains three heads, (2) Peter, two heads, while Paul one or none, (3) Peter one head, Paul none. The cases in favour of β are (1) two heads for both, or (2) one head, or (3) none, for both. The remaining cases favour γ. The probability of α is 1/8 + 3/8 × 3/4 + 3/8 × 1/4 = 1/2. The probability of β is 3/8 × 1/4 + 3/8 × 1/2 + 1/8 × 1/4 = 5/16. The probability of γ is 1 − 13/16 = 3/16. Alternative β is to be split up into three, α′, β′, γ′, of which the probabilities (when β has occurred) are as before, 8/16, 5/16, 3/16. β′ is similarly split up, and so on. Thus Peter's expectation is $\frac{8}{16}\{1 + \frac{5}{16} + (\frac{5}{16})^2 + \ldots\} \times £1 = £\frac{8}{11}$. Paul's expectation is $£\frac{3}{11}$.

An urn contains m balls marked 1, 2, 3, . . . m. Paul extracts successively the m balls, under an agreement to give Peter a shilling every time that a ball comes out in its proper order. What is Peter's expectation? The expectation with respect to any one ball is 1/m, and therefore the expectation with respect to all is 1 (shilling).[130]
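The two preceding expectations may be verified in exact arithmetic. The Python sketch below is illustrative only: the function names are invented here, the coins are assumed fair, and the m! orders of extraction are assumed equally likely.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, factorial


def heads_distribution(coins):
    """Probabilities of 0, 1, ..., coins heads with fair coins."""
    return [Fraction(comb(coins, h), 2 ** coins) for h in range(coins + 1)]


def peter_paul_shares(peter_coins=3, paul_coins=2):
    """Shares of the stake when equal throws are replayed until decided."""
    peter = heads_distribution(peter_coins)
    paul = heads_distribution(paul_coins)
    win = sum(a * b for i, a in enumerate(peter) for j, b in enumerate(paul) if i > j)
    tie = sum(a * b for i, a in enumerate(peter) for j, b in enumerate(paul) if i == j)
    lose = 1 - win - tie
    # Replaying ties multiplies each share by 1 + tie + tie**2 + ... = 1 / (1 - tie).
    return win / (1 - tie), lose / (1 - tie)


def mean_matches(m):
    """Average number of balls that come out in their proper order."""
    matches = sum(
        sum(1 for place, ball in enumerate(order) if place == ball)
        for order in permutations(range(m))
    )
    return Fraction(matches, factorial(m))


print(peter_paul_shares())  # (Fraction(8, 11), Fraction(3, 11))
print(mean_matches(5))      # 1
```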

69. Advantage subjectively estimated.—Elaborate calculations are paradoxically employed by Laplace and other mathematicians to determine the expectation of subjective advantage in various cases of risk. The calculation is based on Daniel Bernoulli's formula which may be written thus: If x denote a man's physical fortune, and y the corresponding moral fortune

y = k log (x/h),

k, h being constants. x and y are always positive, and x > h; for every man must possess some fortune, or its equivalent, in order to live. To estimate now the value of a moral expectation. Suppose a person whose fortune is a to have the chance p of obtaining a sum α, q of obtaining β, r of obtaining γ, &c., and let

p + q + r + . . . = 1,

only one of the events being possible. Now his moral expectation from the first chance—that is, the increment of his moral fortune multiplied by the chance—is

$pk\left\{\log\frac{a + \alpha}{h} - \log\frac{a}{h}\right\} = pk\{\log(a + \alpha) - \log a\}.$

Hence his whole moral expectation is[131]

E = kp log (a + α) + kq log (a + β) + kr log (a + γ) + . . . - k log a;

and, if Y stands for his moral fortune including this expectation, that is, k log (a/h) + E, we have

Y = kp log (a + α) + kq log (a + β) + . . . - k log h.

To find X, the physical fortune corresponding to this moral one, we have

Y = k log X − k log h.

Hence

$X = (a + \alpha)^p (a + \beta)^q (a + \gamma)^r \cdots,$

and X − a will be the actual or physical increase of fortune which is of the same value to him as his expectation, and which he may reasonably accept in lieu of it. The mathematical value of the same expectation is[132]

pα + qβ + rγ + . . .
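A numerical instance may make the contrast between the two values plainer. In the Python sketch below the function names and the sample figures (a fortune of 100 and an even chance of gaining 100 or nothing) are assumptions chosen for illustration.

```python
import math


def physical_equivalent(fortune, outcomes):
    """X - a, where X is the product of (a + gain)**chance over all outcomes."""
    x = math.prod((fortune + gain) ** chance for chance, gain in outcomes)
    return x - fortune


def mathematical_expectation(outcomes):
    """Sum of chance * gain."""
    return sum(chance * gain for chance, gain in outcomes)


outcomes = [(0.5, 100.0), (0.5, 0.0)]  # (chance, gain); the chances sum to 1
print(physical_equivalent(100.0, outcomes))   # about 41.42
print(mathematical_expectation(outcomes))     # 50.0
```

On Bernoulli's formula the sum which the person may reasonably accept in lieu of the expectation is thus less than its mathematical value.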

70. Gambling and Insurance.—These formulae are employed, often with the aid of refined mathematical theorems, to demonstrate received propositions of great practical importance: that in general gambling is disadvantageous, insurance beneficial, and that in speculative operations it is better to subdivide risks—not to “have all your eggs in one basket.”

71. These propositions may be deduced by the use of a formula which perhaps keeps closer to the facts: viz. that utility or satisfaction is a function of material goods not definitely ascertainable, defined only by the conditions that the function continually increases with the increase of the variable, but at a continually decreasing rate (and some additional postulate as to the lower limit of the variable), say $y = \psi(x)$ (if x as before denotes physical fortune, and y the corresponding utility or satisfaction); where all that is known in general of $\psi$ is that $\psi'(x)$ is positive, $\psi''(x)$ is negative; and $\psi(x)$ is never less, and x is always greater, than zero. Suppose a gambler whose (physical) fortune is a, to have the chance p of obtaining a sum α and the chance q (= 1 − p) of losing the sum β. If the game is fair in the usual sense of the term, pα = qβ. Accordingly the prospective psychical advantage of the party is $p\psi(a + \alpha) + q\psi(a - \beta) = p\psi(a + \alpha) + q\psi\{a - (p/q)\alpha\}$, say $y_\alpha$. When α is zero the expression reduces to the first state of the man, $\psi(a)$, say $y_0$. To compare this state with what it becomes by the gambling transaction, let α receive continually small increments of Δα. When α is zero the first differential coefficient of $(y_\alpha - y_0)$, viz. $p\psi'(a) - p\psi'(a)$, = 0. Also the second differential coefficient, viz. $p\psi''(a) + (p^2/q)\psi''(a)$, is negative, since by hypothesis $\psi''$ is continually negative. And as α continues to increase from zero, the second differential coefficient of $(y_\alpha - y_0)$, viz. $p\psi''(a + \alpha) + (p^2/q)\psi''\{a - (p/q)\alpha\}$, continues to be negative. Therefore the increments received by the first differential coefficient of $(y_\alpha - y_0)$ are continually negative; and therefore $(y_\alpha - y_0)$ is continually negative; $y_\alpha < y_0$[133] for finite values of α (not exceeding qa/p).[134]
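The conclusion may be illustrated with any particular function satisfying the stated conditions. The sketch below assumes, purely for illustration, that ψ is the square root; the function name and the sample fortune and chances are likewise assumptions.

```python
import math


def fair_gamble_utility(a, p, alpha, psi=math.sqrt):
    """p*psi(a + alpha) + q*psi(a - beta), with the stake beta fixed by p*alpha = q*beta."""
    q = 1 - p
    beta = p * alpha / q
    return p * psi(a + alpha) + q * psi(a - beta)


a, p = 100.0, 0.25
for alpha in (1.0, 50.0, 150.0, 300.0):  # alpha may not exceed q*a/p = 300
    print(alpha, fair_gamble_utility(a, p, alpha), math.sqrt(a))
```

For every admissible α the prospective advantage falls short of ψ(a), and the deficiency increases with the sum at stake.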

72. To show the advantage of insurance, let us suppose with Morgan Crofton[135] that a merchant, whose fortune is represented by 1, will realize a sum ε if a certain vessel arrives safely. Let the probability of this be p. To make up exactly for the risk run by the insurance company, he should pay them a sum (1 − p)ε. If he does, his moral fortune becomes, according to the formula now proposed, $\psi(1 + p\varepsilon)$, since his physical fortune is increased by the secured sum ε, minus the payment (1 − p)ε; while if he does not insure it will be $p\psi(1 + \varepsilon) + (1 - p)\psi(1)$. We have then to compare $\psi(1 + p\varepsilon)$, say $y_1$, with $p\psi(1 + \varepsilon) + (1 - p)\psi(1)$, say $y_2$. By reasoning analogous to that of the preceding paragraph it appears that $(y_2 - y_1)$ is zero when ε = 0 and continually diminishes as ε increases up to any assigned finite (admissible) value. Similarly it may be shown that it is better to expose one's fortune in a number of separate sums to risks independent of each other than to expose the whole to the same danger. Suppose a merchant, having a fortune 1, has besides a sum ε which he must receive if a ship arrives in safety. Then, if the chance of the ship arriving = p, and q = 1 − p, his prospective advantage is $p\psi(1 + \varepsilon) + q\psi(1)$. Now instead of exposing the lump sum ε to a single risk, let him subdivide ε into n equal parts, each exposed to an independent equal risk (q) of being lost. As n is made larger[136] it becomes more and more nearly a certainty that he will realize pε out of the total ε exposed to risk. Therefore his condition (in respect of the sort of advantage which is under consideration) will be approximately $\psi(1 + p\varepsilon)$. Then we have to compare $\psi(1 + p\varepsilon)$, say $y_1$, with $p\psi(1 + \varepsilon) + q\psi(1)$, say $y_2$. By reasoning analogous to that which has been above employed—observing that $(p - p^2)\psi''(1)$ is negative for all possible values of p—we conclude that $y_2 < y_1$.
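The comparison of the insured with the uninsured state can be tried numerically. In this sketch ψ is assumed, for illustration only, to be the natural logarithm, and the function names and sample values are hypothetical.

```python
import math


def insured_utility(p, eps, psi=math.log):
    """psi(1 + p*eps): the sum eps is secured for a payment of (1 - p)*eps."""
    return psi(1 + p * eps)


def uninsured_utility(p, eps, psi=math.log):
    """p*psi(1 + eps) + (1 - p)*psi(1): the whole sum eps is left at risk."""
    return p * psi(1 + eps) + (1 - p) * psi(1)


for p in (0.1, 0.5, 0.9):
    for eps in (0.1, 1.0, 10.0):
        print(p, eps, uninsured_utility(p, eps) - insured_utility(p, eps))
```

The printed differences are all negative, and for a given p they grow in magnitude as ε increases.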

73. The Petersburg Problem.—The doctrine of “moral fortune” was first formulated by Daniel Bernoulli[137] with reference to the celebrated “Petersburg Problem,” which is thus stated by Todhunter[138]: “A throws a coin in the air: if head appears at the first throw he is to receive a shilling from B, if head does not appear until the second throw he is to receive 2s., if head does not appear until the third throw he is to receive 4s., and so on, required the expectation of A.” So many lessons are presented by this problem that there has been room for disputing what is the lesson. Laplace and other high authorities follow Daniel Bernoulli. Poisson finds the explanation in the fact that B could not be expected to pay up so large a sum. Whitworth, who regards the disadvantage of gambling as consisting mainly in the danger of becoming “cleaned out,”[139] finds this moral in the Petersburg problem. All have not noticed what some regard as the principal lesson to be obtained from the paradox: viz. that a transaction which cannot be regarded as one of a series—at least a “cross-series”[140]—is not subject to the general rule for expectations of advantage whether material or moral.[141]
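The source of the paradox is that every possible throw contributes the same amount to the mathematical expectation. The sketch below computes the expectation when the game is cut short after a fixed number of throws; the truncation, like the function name, is an assumption introduced here for illustration.

```python
from fractions import Fraction


def truncated_expectation(max_throws):
    """Expectation in shillings if the game is stopped after max_throws throws."""
    # Head first appears at throw k with chance 1/2**k, and then pays 2**(k - 1) shillings.
    return sum(Fraction(1, 2 ** k) * 2 ** (k - 1) for k in range(1, max_throws + 1))


for throws in (1, 10, 100):
    print(throws, truncated_expectation(throws))  # 1/2, 5, 50
```

Each throw adds half a shilling, so that the sum increases without limit as more throws are admitted.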

Section IV.—Geometrical Applications.

74. Under this head occur some interesting illustrations of principles employed in the preceding sections; in particular of a priori probabilities and of the relation between probability and expectation.

75. Illustrations of a priori Probabilities.—The assumption which has been made under preceding heads that the probability of certain alternatives is approximately equal appears to rest on evidence of much the same character as the assumption which is made under this head that one point in a line, plane or volume is as likely to occur as another, under certain circumstances. Thus consider the proposition: if a given area S is included within a given area A, the chance of a point P, taken at random on A, falling on S is S/A. In a great variety of circumstances such a size can be assigned to the spaces, and “taking at random” can be so defined that the proposition is more or less directly based on experience. The fact that the points of incidence are equally distributed in space is observed, or connected by inference with observation, in many cases, e.g. raindrops and molecules. There is a solid substratum of evidence for the premiss employed in the solution of problems like the following: On a chess-board, on which the side of every square is a, there is thrown a coin of diameter b (b < a) so as to be entirely on the board, which may be supposed to have no border. What is the probability that the coin is entirely on one square?[142] The area on which the coin can fall is $(8a - b)^2$. The portion of the area which is favourable to the event is $64(a - b)^2$. Therefore the required probability is $(a - b)^2/(a - \tfrac{1}{8}b)^2$.
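The chess-board result lends itself to a simple trial by simulation. The Python sketch below assumes that the centre of the coin is uniformly distributed over the region in which the coin lies wholly on the board; the function names, the sample dimensions, the number of trials and the seed are illustrative choices.

```python
import random


def exact_probability(a, b, squares=8):
    """(a - b)**2 / (a - b/squares)**2 for a board of squares x squares cells."""
    return (a - b) ** 2 / (a - b / squares) ** 2


def simulated_probability(a, b, squares=8, trials=50000, seed=1):
    """Throw the coin many times and count how often it lies within one square."""
    rng = random.Random(seed)
    side = squares * a
    hits = 0
    for _ in range(trials):
        # The centre must keep at least b/2 from every edge of the board.
        x = rng.uniform(b / 2, side - b / 2)
        y = rng.uniform(b / 2, side - b / 2)
        # Within its own square the centre must keep b/2 from each side of the square.
        if all(b / 2 <= coord % a <= a - b / 2 for coord in (x, y)):
            hits += 1
    return hits / trials


print(exact_probability(1.0, 0.5))
print(simulated_probability(1.0, 0.5))
```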

76. Random Lines.—Speculative difficulties recur when we have to define a straight line taken at random in a plane; for instance, in the following problem proposed by Buffon.[143]

A floor is ruled with equidistant parallel lines; a rod, shorter than the distance between each pair, being thrown at random on the floor, to find the chance of its falling on one of the lines. The problem is usually solved as follows:—

Let x be the distance of the centre of the rod from the nearest line, θ the inclination of the rod to a perpendicular to the parallels, 2a the common distance of the parallels, 2c the length of rod; then, as all values of x and θ between their extreme limits are equally probable, the whole number of cases will be represented by

$\int_{-\pi/2}^{\pi/2}\int_0^a dx\,d\theta = \pi a.$

Now if the rod crosses one of the lines we must have c > x/cos θ; so that the favourable cases will be measured by

$\int_{-\pi/2}^{\pi/2} d\theta \int_0^{c\cos\theta} dx = 2c.$

Thus the probability required is $p = 2c/(\pi a)$.

It may be asked—why should we take the centre of the rod as the point where distance from the nearest line has all its values equally probable? Why not one extremity of the line, or some other point suited to the circumstances of projection? Fortunately it makes no difference in the result to what point in the rod we assign this pre-eminence.

77. The legitimacy of the assumption obtains some verification from the success of a test suggested by Laplace. If a rod is actually thrown, as supposed in the problem, a great number of times, and the frequency with which it falls on one of the parallels is observed, that proportionate number thus found, say p, furnishes a value for the constant π. For π ought to equal 2c/(pa). The experiment has been made by Professor Wolf of Frankfort. Having thrown a needle of length 36 mm. on a plane ruled with parallel lines at a distance from each other of 45 mm. 5000 times, he observed that the needle crossed a parallel 2532 times. Whence the value of π is deduced 3.1596, with a probable error[144] ± .05.
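The deduction of π from the recorded throws is a matter of simple arithmetic, which the following sketch reproduces. The function name and parameter names are illustrative; the figures are those quoted above.

```python
def pi_from_needle(rod_length, line_spacing, throws, crossings):
    """Value of pi implied by the observed frequency, from pi = 2c / (p * a)."""
    c = rod_length / 2      # the rod has length 2c
    a = line_spacing / 2    # the parallels are 2a apart
    p = crossings / throws  # observed proportion of crossings
    return 2 * c / (p * a)


print(round(pi_from_needle(36, 45, 5000, 2532), 4))  # 3.1596
```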

78. More hesitation may be felt when we have to define a random chord of a circle,[145] for instance, with reference to the question, what is the probability that a chord taken at random will be greater than the side of an equilateral triangle inscribed in the circle? For some purposes it would no doubt be proper to assume that the chord is constructed by taking any point on the circumference and joining it to another point on the circumference, the points from which one is taken at random being distributed at equal intervals around the circumference. On this understanding the probability in question would be ⅓. But in other connexions, for instance, if the chord is obtained by the intersection with the circle of a rod thrown in random fashion, it seems preferable to consider the chord as a case of a straight line falling at random on a plane. Morgan Crofton[146] himself gives the following definition of such a line: If an infinite number of straight lines be drawn at random in a plane, there will be as many parallel to any given direction as to any other, all directions being equally probable; also those having any given direction will be disposed with equal frequency all over the plane. Hence, if a line be determined by the co-ordinates p, ω, the perpendicular on it from a fixed origin O, and the inclination of that perpendicular to a fixed axis, then, if p, ω be made to vary by equal infinitesimal increments, the series of lines so given will represent the entire series of random straight lines. Thus the number of lines for which p falls between p and p + dp, and ω between ω and ω + dω, will be measured by dpdω, and the integral ∬dpdω, between any limits, measures the number of lines within those limits.
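That the answer depends on the definition adopted may be seen by trying both constructions. The Python sketch below estimates the probability in question first for chords joining two random points of the circumference, and then for chords cut off by random lines in Crofton's sense, taking the origin at the centre so that only the perpendicular p matters. The function name, the number of trials and the seed are illustrative assumptions.

```python
import math
import random


def chord_probabilities(trials=50000, seed=2, radius=1.0):
    """Estimate P(chord > side of inscribed equilateral triangle) under two definitions."""
    rng = random.Random(seed)
    side = math.sqrt(3) * radius
    by_endpoints = 0
    by_lines = 0
    for _ in range(trials):
        # Definition 1: two points taken at random on the circumference.
        t1 = rng.uniform(0, 2 * math.pi)
        t2 = rng.uniform(0, 2 * math.pi)
        if 2 * radius * abs(math.sin((t1 - t2) / 2)) > side:
            by_endpoints += 1
        # Definition 2: a random line meeting the circle, its perpendicular
        # distance p from the centre being uniform between 0 and the radius.
        p = rng.uniform(0, radius)
        if 2 * math.sqrt(radius ** 2 - p ** 2) > side:
            by_lines += 1
    return by_endpoints / trials, by_lines / trials


print(chord_probabilities())
```

The two estimates differ markedly, which is the difficulty discussed in this paragraph.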

79. Authoritative and useful as this definition is, it is not entirely free from difficulty. It amounts to this, that if we write the equation of the random line

$$x\cos\alpha + y\sin\alpha - p = 0,$$

we ought to take α and p as those variables, of which, the equicrescent values are equally probable—the equiprobable variables, as we may say. But might we not also write the equation in either of the following forms

(1) x/a + y/b − 1 = 0,  
(2) ax + by − 1 = 0,

and take a and b in either system as the equiprobable variables? To be sure, if the equal distribution of probabilities is extended to infinity we shall be landed in the absurdity that of the random lines passing through any point on the axis of y a proportion differing infinitesimally from unity—100%—are either (1) parallel or (2) perpendicular to the axis of x. But the admission of infinite values will render any scheme for the equal distribution of probabilities absurd. If Professor Crofton's constant p, for example, becomes infinite, the origin being thus placed at an infinite distance, all the random chords intersecting a finite circle would be parallel!

80. However this may be, Professor Crofton's conception has the distinction of leading to a series of interesting propositions, of which specimens are here subjoined.[147] The number of random lines which meet any closed convex contour of length L is measured by L. For, taking O inside the contour, and integrating first for p, from 0 to p, the perpendicular on the tangent to the contour, we have ∫pdω; taking this through four right angles for ω, we have by Legendre's theorem on rectification, N being the measure of the number of lines,

$$N = \int_0^{2\pi} p\,d\omega = L.$$[148]

Thus, if a random line meet a given contour, of length L, the chance of its meeting another convex contour, of length l, internal to the former is p = l/L. If the given contour be not convex, or not closed, N will evidently be the length of an endless string, drawn tight around the contour.
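The rectification formula N = ∫p dω = L can be checked numerically on a particular contour. The sketch below assumes an ellipse with semi-axes a and b centred at the origin O, for which the perpendicular on the tangent is p(ω) = √(a² cos² ω + b² sin² ω); the function names and the chosen dimensions are illustrative only.

```python
import math

def perimeter_from_support(a, b, steps=20_000):
    """Integrate the perpendicular p(omega) through four right angles."""
    h = 2.0 * math.pi / steps
    total = 0.0
    for k in range(steps):
        w = (k + 0.5) * h                       # midpoint rule
        total += math.hypot(a * math.cos(w), b * math.sin(w)) * h
    return total

def perimeter_from_polygon(a, b, steps=20_000):
    """Length of a finely inscribed polygon, as an independent check."""
    pts = [(a * math.cos(2.0 * math.pi * k / steps),
            b * math.sin(2.0 * math.pi * k / steps)) for k in range(steps + 1)]
    return sum(math.dist(pts[k], pts[k + 1]) for k in range(steps))

L_support = perimeter_from_support(3.0, 1.0)
L_polygon = perimeter_from_polygon(3.0, 1.0)
# Chance that a random line meeting the ellipse also meets an inner
# circle of radius 1 (a convex contour of length l = 2*pi): l / L.
chance_inner = 2.0 * math.pi / L_support
print(L_support, L_polygon, chance_inner)
```

The two lengths agree closely, and the ratio l/L gives the chance of meeting the inner contour as stated above.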

Fig. 1.

81. If a random line meet a closed convex contour of length L, the chance of it meeting another such contour, external to the former, is p = (X − Y)/L, where X is the length of an endless band enveloping both contours, and crossing between them, and Y that of a band also enveloping both, but not crossing. This may be shown by means of Legendre's integral above; or as follows:—

Call, for shortness, N(A) the number of lines meeting an area A; N(A, A′) the number which meet both A and A′; then (fig. 1)

N(SROQPH) + N(S′Q′OR′P′H′) = N(SROQPH + S′Q′OR′P′H′) + N(SROQPH, S′Q′OR′P′H′),

since in the first member each line meeting both areas is counted twice. But the number of lines meeting the non-convex figure consisting of OQPHSR and OQ′S′H′P′R′ is equal to the band Y, and the number meeting both these areas is identical with that of those meeting the given areas Ω, Ω′; hence X = Y + N(Ω, Ω′). Thus the number meeting both the given areas is measured by X − Y. Hence the theorem follows.

82. Two random chords cross a given convex boundary, of length L, and area Ω; to find the chance that their intersection falls inside the boundary.

Consider the first chord in any position; let C be its length; considering it as a closed area, the chance of the second chord meeting it is 2C/L; and the whole chance of its coordinates falling in dp, dω and of the second chord meeting it in that position is

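$$\frac{2C}{L}\cdot\frac{dp\,d\omega}{\iint dp\,d\omega} = \frac{2}{L^{2}}\,C\,dp\,d\omega.$$

But the whole chance is the sum of these chances for all its positions;

$$\therefore\ \text{prob.} = 2L^{-2}\iint C\,dp\,d\omega.$$

Now, for a given value of ω, the value of ∫C dp is evidently the area Ω; then, taking ω from 0 to π, we have

$$\text{required probability} = 2\pi\Omega L^{-2}.$$

The mean value of a chord drawn at random across the boundary is

$$M = \frac{\iint C\,dp\,d\omega}{\iint dp\,d\omega} = \frac{\pi\Omega}{L}.$$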
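Both results can be tried on a circle of unit radius, for which Ω = π and L = 2π, so that the probability 2πΩ/L² becomes 1/2 and the mean chord πΩ/L becomes π/2. The following simulation is a sketch under that assumption; the helper names are hypothetical, and each random line is drawn with p uniform on [0, 1] and ω uniform through four right angles.

```python
import math
import random

def random_chord(rng):
    """End points of the chord cut from the unit circle by a random line (p, omega)."""
    p = rng.uniform(0.0, 1.0)
    w = rng.uniform(0.0, 2.0 * math.pi)
    half = math.acos(p)                 # half the angle subtended at the centre
    return ((math.cos(w - half), math.sin(w - half)),
            (math.cos(w + half), math.sin(w + half)))

def side(a, b, q):
    """Sign of the cross product (b - a) x (q - a)."""
    return (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0])

def chords_cross(c1, c2):
    """True when the two chords intersect inside the circle."""
    a, b = c1
    c, d = c2
    return side(a, b, c) * side(a, b, d) < 0 and side(c, d, a) * side(c, d, b) < 0

def simulate(trials=200_000, seed=2):
    rng = random.Random(seed)
    total_length, crossings = 0.0, 0
    for _ in range(trials):
        c1, c2 = random_chord(rng), random_chord(rng)
        total_length += math.dist(c1[0], c1[1])
        crossings += chords_cross(c1, c2)
    return total_length / trials, crossings / trials

mean_chord, crossing_frequency = simulate()
print(mean_chord, crossing_frequency)
```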

83. A straight band of breadth c being traced on a floor, and a circle of radius r thrown on it at random; to find the mean area of the band which is covered by the circle. (The cases are omitted where the circle falls outside the band.)[149]

If S be the space covered, the chance of a random point on the circle falling on the band is $p = M(S)/\pi r^{2}$; this is the same as if the circle were fixed, and the band thrown on it at random. Now let A (fig. 2) be a position of the random point; the favourable cases are when HK, the bisector of the band, meets a circle, centre A, radius ½c; and the whole number are when HK meets a circle, centre O, radius r + ½c; hence the probability is

$$p = \frac{2\pi\cdot\tfrac12 c}{2\pi(r + \tfrac12 c)} = \frac{c}{2r + c}.$$

Fig. 2.

This is constant for all positions of A; hence, equating these two values of p, the mean value required is $M(S) = c(2r + c)^{-1}\pi r^{2}$.

The mean value of the portion of the circumference which falls on the band is the same fraction c/(2r + c) of the whole circumference.
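The mean area for the circle may be verified by direct integration over the positions of the centre. The sketch below makes its own assumptions: the band occupies |y| ≤ ½c, the centre of the circle stands at a height d distributed uniformly over the range in which the circle meets the band, and the function names are illustrative.

```python
import math

def area_below(t, r):
    """Area of the part of a circle of radius r below a line at height t above its centre."""
    t = max(-r, min(r, t))
    return r * r * (math.asin(t / r) + math.pi / 2.0) + t * math.sqrt(r * r - t * t)

def covered_area(d, r, c):
    """Area of the circle lying on the band |y| <= c/2, the centre being at height d."""
    return area_below(c / 2.0 - d, r) - area_below(-c / 2.0 - d, r)

def mean_covered_area(r, c, steps=100_000):
    """Average of the covered area over all positions in which circle and band meet."""
    reach = r + c / 2.0
    h = 2.0 * reach / steps
    total = sum(covered_area(-reach + (k + 0.5) * h, r, c) for k in range(steps))
    return total * h / (2.0 * reach)

numerical = mean_covered_area(r=1.0, c=0.5)
formula = 0.5 / (2.0 * 1.0 + 0.5) * math.pi * 1.0 ** 2   # c (2r + c)^-1 pi r^2
print(numerical, formula)
```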

If any convex area whose surface is Ω and circumference L be thrown on the band, instead of a circle, the mean area covered is

$$M(S) = \pi c(L + \pi c)^{-1}\Omega.$$

For as before, fixing the random point at A, the chance of a random point in Ω falling on the band is p = 2π ⋅ ½c/L′, where L′ is the perimeter of a parallel curve to L, at a normal distance ½c from it. Now

$$L' = L + 2\pi\cdot\tfrac12 c,$$

$$\therefore\ \frac{M(S)}{\Omega} = \frac{\pi c}{L + \pi c}.$$

Fig. 3.

84. Buffon's problem may be easily deduced in a similar manner. Thus, if 2r = length of line, a = distance between the parallels, and we conceive a circle (fig. 3) of diameter a with its centre at the middle O of the line,[150] rigidly attached to the latter, and thrown with it on the parallels, this circle must meet one of the parallels; if it be thrown an infinite number of times we shall thus have an infinite number of chords crossing it at random. Their number is measured by 2π ⋅ ½a, and the number which meet 2r is measured by 4r. Hence the chance that the line 2r meets one of the parallels is p = 4r/πa.
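Buffon's result lends itself to a direct trial by random numbers. The sketch below assumes that the line is not longer than the distance between the parallels, and it describes each throw by the distance of the middle point from the nearest parallel and the acute angle made with the parallels; the names and the seed are arbitrary choices.

```python
import math
import random

def buffon_frequency(length, spacing, trials=200_000, seed=3):
    """Observed frequency with which a line of the given length meets a parallel."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = rng.uniform(0.0, spacing / 2.0)        # middle point to nearest parallel
        theta = rng.uniform(0.0, math.pi / 2.0)     # acute angle with the parallels
        hits += (length / 2.0) * math.sin(theta) >= y
    return hits / trials

# Dimensions of Wolf's experiment: 2r = 36, a = 45, so p = 4r / (pi a).
observed = buffon_frequency(36.0, 45.0)
predicted = 4.0 * 18.0 / (math.pi * 45.0)
print(observed, predicted)
```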

Fig. 4.

85. To investigate the probability that the inclination of the line joining any two points in a given convex area Ω shall lie within given limits. We give here a method of reducing this question to calculation, for the sake of an integral to which it leads, and which is not easy to deduce otherwise.

First let one of the points A (fig. 4) be fixed; draw through it a chord PQ = C, at an inclination θ to some fixed line; put AP = r, AQ = r′; then the number of cases for which the direction of the line joining A and B lies between θ and θ + dθ is measured by $\tfrac12(r^{2} + r'^{2})\,d\theta$.

Now let A range over the space between PQ and a parallel chord distant dp from it, the number of cases for which A lies in this space and the direction of AB from θ to θ + dθ is (first considering A to lie in the element dr dp)

$$\tfrac12\,dp\,d\theta\int_0^{C}(r^{2} + r'^{2})\,dr = \tfrac13 C^{3}\,dp\,d\theta,$$

since r + r′ = C.

Let p be the perpendicular on C from a given origin O, and let ω be the inclination of p (we may put dω for dθ), C will be a given function of p, ω; and, integrating first for ω constant, the whole number of cases for which ω falls between given limits ω′, ω″ is

$$\tfrac13\int_{\omega'}^{\omega''} d\omega \int C^{3}\,dp;$$

the integral $\int C^{3}\,dp$ being taken for all positions of C between two tangents to the boundary parallel to PQ. The question is thus reduced to the evaluation of this double integral, which, of course, is generally difficult enough; we may, however, deduce from it a remarkable result; for, if the integral $\tfrac13\iint C^{3}\,dp\,d\omega$ be extended to all possible positions of C, it gives the whole number of pairs of positions of the points A, B which lie inside the area; but this number is $\Omega^{2}$; hence

$$\iint C^{3}\,dp\,d\omega = 3\Omega^{2};$$

the integration extending to all possible positions of the chord C,—its length being a given function of its co-ordinates p, ω.[151]

Cor. Hence if L, Ω be the perimeter and area of any closed convex contour, the mean value of the cube of a chord drawn across it at random is $3\Omega^{2}/L$.
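The integral may be verified in the simplest case. The following worked check assumes a circle of radius R (a symbol introduced here for illustration), with the origin at the centre, p ranging from 0 to R and ω through four right angles.

```latex
% Circle of radius R: C = 2\sqrt{R^2 - p^2}, \Omega = \pi R^2, L = 2\pi R.
\begin{aligned}
\iint C^{3}\,dp\,d\omega
  &= 2\pi \int_0^{R} 8\,(R^{2}-p^{2})^{3/2}\,dp
   = 16\pi \cdot \frac{3\pi R^{4}}{16}
   = 3\pi^{2}R^{4} = 3\Omega^{2},\\[4pt]
\text{mean of } C^{3}
  &= \frac{3\Omega^{2}}{L} = \frac{3\pi^{2}R^{4}}{2\pi R} = \tfrac{3}{2}\pi R^{3}.
\end{aligned}
```

The intermediate value uses the standard integral of (R² − p²)^{3/2} from 0 to R, which is 3πR⁴/16.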

Fig. 5.

86. Let there be any two convex boundaries (fig. 5) so related that a tangent at any point V to the inner cuts off a constant segment S from the outer (e.g. two concentric similar ellipses); let the annular area between them be called A; from a point X taken at random on this annulus draw tangents XA, XB to the inner. The mean value of the arc AB, M(AB) = LS/A, L being the whole length of the inner curve ABV.

The following lemma will first be proved:—

Fig. 6.

If there be any convex arc AB (fig. 6), and if N1 be (the measure of) the number of random lines which meet it once, N2 the number which meet it twice,

2 arc AB = N1 + 2N2.

For draw the chord AB; the number of lines meeting the convex-figure so formed is N1 + N2 = arc + chord (the perimeter); but N1 = number of lines meeting the chord = 2 chord;

∴ 2 arc + N1 = 2N1 + 2N2, ∴ 2 arc = N1 + 2N2.

Now fix the point X, in fig. 5, and draw XA, XB. If a random line cross the boundary L, and p1 be the probability that it meets the arc AB once, p2 that it does so twice,

2AB/L = p1 + 2p2;

and if the point X range all over the annulus, and p1, p2 are the same probabilities for all positions of X,

2M(AB)/L = p1 + 2p2.

Fig. 7.

Let now IK (fig. 7) be any position of the random line; drawing tangents at I, K, it is easy to see that it will cut the arc AB twice when X is in the space marked α, and once when X is in either space marked β; hence, for this position of the line, p1 + 2p2 = 2(α + β)/A = 2S/A, which is constant; hence M(AB)/L = S/A.

Hence the mean value of the arc is the same fraction of the perimeter that the constant area S is of the annulus.

If L be not related as above to the outer boundary, M(AB)/L = M(S)/A, M(S) being the mean area of the segment cut off by a tangent at a random point on the perimeter L.

The above result may be expressed as an integral. If s be the arc AB included by tangents from any point (x, y) on the annulus,

$$\iint s\,dx\,dy = LS.$$

It has been shown (Phil. Trans., 1868, p. 191) that, if θ be the angle between the tangents XA, XB,

$$\iint \theta\,dx\,dy = \pi(A - 2S).$$

The mean value of the tangent XA or XB may be shown to be M(XA) = SP/2A, where P = perimeter of locus of centre of gravity of the segment S.

87. When we go on to space of three dimensions further speculative difficulties occur. How is a random line through a given point to be defined? Since it is usual to define a vector by two angles (viz. φ the angle made with the axis X by a vector r in the plane XY, and θ (or ½π − θ) the angle made by the vector ρ with r in the plane containing both ρ and r and the axis Z) it seems natural to treat the angles φ and θ as the equiprobable variables. In other words, if we take at random any meridian on the celestial globe and combine it with any right ascension the vector joining the centre to the point thus assigned is a random line.[152] It is possible that for some purposes this conception may be appropriate. For many purposes surely it is proper to assume a more symmetrical distribution of the terminal points on the surface of a sphere, a distribution such that each element of the surface shall contain an approximately equal number of points. Such an assumption is usually made in the kinetic theory of molecules with respect to the direction of the line joining the centres of two colliding spheres in a “molecular chaos.”[153] It is safe to say with Czuber, “No discussion can remove indeterminateness.” Let us hope with him that “though this branch of probability can for the present claim only a theoretic interest, in the future it will perhaps also lead to practical results.”[154]
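The practical difference between the two conceptions is seen by counting how many terminal points fall near the poles. The sketch below is an illustration only: it takes θ as a latitude measured from the plane XY, ignores φ (which is uniform on either view), and counts the points whose latitude exceeds 60° in absolute value; all names are hypothetical.

```python
import math
import random

def latitude_equal_angles(rng):
    """The angle theta itself treated as the equiprobable variable."""
    return rng.uniform(-math.pi / 2.0, math.pi / 2.0)

def latitude_equal_area(rng):
    """Equal numbers of points on equal elements of surface: sin(theta) is uniform."""
    return math.asin(rng.uniform(-1.0, 1.0))

def polar_cap_fraction(sampler, trials=200_000, seed=4, min_latitude=math.pi / 3.0):
    """Frequency of terminal points lying within the two polar caps."""
    rng = random.Random(seed)
    return sum(abs(sampler(rng)) > min_latitude for _ in range(trials)) / trials

caps_equal_angles = polar_cap_fraction(latitude_equal_angles)
caps_equal_area = polar_cap_fraction(latitude_equal_area)
print(caps_equal_angles, caps_equal_area)
```

On the first view about a third of the lines point into the caps; on the second only about 1 − sin 60°, or roughly 0.134, which is the proportion of the sphere's surface that the caps occupy.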

88. Illustrations of probability and expectation.—The close relation between probability and expectation is well illustrated by geometrical examples. As above stated, when a given space S is included within a given space A, if p is the probability that a point P, taken at random on A, falls on S, p = S/A. If now the space S be variable, and M(S) be its mean value

p = M(S)/A.

For, if we suppose S to have n equally probable values S1, S2, S3 . . ., the chance of any one S1 being taken, and of P falling on S1, is

$$p_1 = n^{-1}S_1/A;$$

now the whole probability p = p1 + p2 + p3 + . . . , which leads at once to the above expression. The chance of two points falling on S is, in the same way,

$$p = M(S^{2})/A^{2},$$

and so on.

In such a case, if the probability be known, the mean value follows, and vice versa. Thus, we might find the mean value of the nth power of the distance XY between two points taken at random in a line of length l, by considering the chance that, if n more points are so taken, they shall all fall between X and Y. This chance is

$$M(XY)^{n}/l^{n} = 2(n + 1)^{-1}(n + 2)^{-1};$$

for the chance that X shall be one of the extreme points, out of the whole (n + 2), is $2(n + 2)^{-1}$; and, if it is, the chance that the other extreme point is Y is $(n + 1)^{-1}$. Therefore

$$M(XY)^{n} = 2l^{n}(n + 1)^{-1}(n + 2)^{-1}.$$
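The mean value just found may be compared with an average taken over a large number of random pairs. The sketch below is a simple trial of this kind; the function names, the seed and the number of trials are arbitrary choices made for illustration.

```python
import random

def mean_power_of_distance(n, length=1.0, trials=200_000, seed=5):
    """Average of (XY)^n for two points taken at random on a line of the given length."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += abs(rng.uniform(0.0, length) - rng.uniform(0.0, length)) ** n
    return total / trials

def distance_power_formula(n, length=1.0):
    """The value 2 l^n / ((n + 1)(n + 2)) given in the text."""
    return 2.0 * length ** n / ((n + 1) * (n + 2))

trial_n1 = mean_power_of_distance(1)
trial_n3 = mean_power_of_distance(3)
print(trial_n1, distance_power_formula(1))
print(trial_n3, distance_power_formula(3))
```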

A line l is divided into n segments by n − 1 points taken at random; to find the mean value of the product of the n segments. Let a, b, c, . . . be the segments in one particular case. If n new points are taken at random in the line, the chance that one falls on each segment is

$$1\cdot 2\cdot 3 \cdots n\;abc\ldots/l^{n};$$

hence the chance that this occurs, however the line is divided, is

$$n!\,l^{-n}M(abc\ldots).$$

Now the whole number of different orders in which the whole 2n − 1 points may occur is (2n − 1)!; out of these the number in which one of the first series falls between every two of the second is easily found by the theory of permutations to be n!(n − 1)!. Hence the required mean value of the product is

$$M(abc\ldots) = \frac{(n - 1)!}{(2n - 1)!}\,l^{n}.$$
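The mean product of the segments admits the same kind of trial. In the sketch below the line is of unit length by default, the n − 1 points of division are drawn uniformly, and the names are illustrative; for n = 2 the formula gives 1/6 and for n = 3 it gives 1/60.

```python
import math
import random

def mean_segment_product(n, length=1.0, trials=200_000, seed=6):
    """Average product of the n segments cut off by n - 1 random points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        cuts = sorted(rng.uniform(0.0, length) for _ in range(n - 1))
        ends = [0.0] + cuts + [length]
        total += math.prod(ends[k + 1] - ends[k] for k in range(n))
    return total / trials

def segment_product_formula(n, length=1.0):
    """The value (n - 1)! l^n / (2n - 1)! given in the text."""
    return math.factorial(n - 1) * length ** n / math.factorial(2 * n - 1)

trial_two = mean_segment_product(2)
trial_three = mean_segment_product(3)
print(trial_two, segment_product_formula(2))
print(trial_three, segment_product_formula(3))
```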

89. Additional examples of the relation between probability and expectation appear in the following series of propositions: (1) If M be the mean value of any quantity depending on the positions of two points (e.g. their distance) which are taken, one in a space A, the other in a space B (external to A); and if M′ be the same mean when both points are taken indiscriminately in the whole space A + B; Ma, Mb, the same mean when both points are taken in A and both in B respectively; then

$$(A + B)^{2}M' = 2AB\,M + A^{2}M_a + B^{2}M_b.$$

If the space A = B, $4M' = 2M + M_a + M_b$; if, also, $M_a = M_b$, then $2M' = M + M_a$.

(2) The mean distance of a point P within a given area from a fixed straight line (which does not meet the area) is evidently the distance of the centre of gravity G of the area from the line. Thus, if A, B are two fixed points on a line outside the area, the mean value of the area of the triangle APB = the triangle AGB. From this it will follow that, if X, Y, Z are three points taken at random in three given spaces on a plane (such that they cannot all be cut by any straight line), the mean value of the area of the triangle XYZ is the triangle GG′G″, determined by the three centres of gravity of the spaces.

(3) This proposition is of use in the solution of the following problem:—

Two points X, Y are taken at random within a triangle. What is the mean area M of the triangle XYC, formed by joining them with one of the angles of the triangle?

Bisect the triangle by the line CD; let M1 be the mean value when both points fall in the triangle ACD, and M2 the value when one falls in ACD and the other in BCD; then 2M = M1 + M2. But M1 = ½M; and M2 = GG′C, where G, G′ are the centres of gravity of ACD, BCD; hence M2 = (2/9)ABC, and M = (4/27)ABC.

(4) From this mean value we pass to probabilities. The chance that a new point Z falls on the triangle XYC is 4/27; and the chance that three points X, Y, Z taken at random form, with a vertex C, a re-entrant quadrilateral, is 4/9.
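The fraction 4/27 may be tried by simulation. The sketch below assumes the particular triangle A = (0, 0), B = (1, 0), C = (0, 1); as ratios of areas are unaltered by an affine transformation, the choice of triangle does not affect the fraction. The helper names are illustrative.

```python
import random

def random_point(rng):
    """A point taken at random within the triangle (0,0), (1,0), (0,1)."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:                 # fold the unit square onto the triangle
        u, v = 1.0 - u, 1.0 - v
    return u, v

def triangle_area(p, q, r):
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])) / 2.0

def mean_xyc_fraction(trials=200_000, seed=7):
    """Mean area of XYC expressed as a fraction of the area of ABC."""
    rng = random.Random(seed)
    c = (0.0, 1.0)
    total = sum(triangle_area(random_point(rng), random_point(rng), c)
                for _ in range(trials))
    return (total / trials) / 0.5   # the area of ABC is 1/2

xyc_fraction = mean_xyc_fraction()
print(xyc_fraction, 4 / 27)
```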

90. The calculation of geometrical probability and expectation is much facilitated by the following general principle: If M be a mean value depending on the positions of n points falling on a space A; and if this space receive a small increment α, and M′ be the same mean when the n points are taken on A + α, and M1 the same mean when one point falls on α and the remaining n − 1 on A; then, the sum of all the cases being $M'(A + \alpha)^{n}$, and this sum consisting of the cases (1) when all the points are on A, (2) when one is on α the others on A (as we may neglect all where two or more fall on α), we have

$$M'(A + \alpha)^{n} = MA^{n} + nM_1\alpha A^{n-1};$$

$$\therefore\ (M' - M)A = n\alpha(M_1 - M),$$

as M′ nearly = M. For example, suppose two points X, Y are taken in a line of length l, to find the mean value M of $(XY)^{n}$. If l receives an increment dl, l dM = 2dl(M1 − M). Now M1 here = the mean nth power of the distance of a single point taken at random in l from one extremity of l; and this is $l^{n}(n + 1)^{-1}$ (as is shown by finding the chance of n other points falling on that distance); hence

$$l\,dM = 2\,dl\,\{l^{n}(n + 1)^{-1} - M\};$$

$$l\,dM + 2M\,dl = 2(n + 1)^{-1}l^{n}\,dl,$$

or

$$l^{-1}\,d(Ml^{2}) = 2(n + 1)^{-1}l^{n}\,dl;$$

$$\therefore\ Ml^{2} = 2(n + 1)^{-1}\int l^{n+1}\,dl = \frac{2l^{n+2}}{(n + 1)(n + 2)} + C;$$

$$\therefore\ M = \frac{2l^{n}}{(n + 1)(n + 2)},$$

C being evidently 0.

91. The corresponding principle for probabilities may thus be stated: If p is the probability of a certain condition being satisfied by the n points within A in art. 90, p′ the same probability when they fall on the space A + α, and p1 the same when one point falls on α and the rest on A, then, since the numbers of favourable cases are respectively $p'(A + \alpha)^{n}$, $pA^{n}$, $np_1\alpha A^{n-1}$, we find,

$$(p' - p)A = n\alpha(p_1 - p).$$

Fig. 8.

Hence if p′ = p then p1 = p. For example, if we have to find the chance of three points within a circle forming an acute-angled triangle, by adding an infinitesimal concentric ring to the circle, we have evidently p′ = p; hence the required chance is unaltered by assuming one of the three points taken on the circumference. Again, in finding the chance that four points within a triangle shall form a convex quadrilateral, if we add to the triangle a small band between the base and a line parallel to it, the chance is clearly unaltered. Therefore we may take one of the points, W, at random on the base (fig. 8), the others X, Y, Z within the triangle. Now the four lines from the vertex B to the four points are as likely to occur in any specified order as any other. Hence it is an even chance that X, Y, Z fall on one of the triangles ABW, CBW, or that two fall on one of these triangles and the remaining one on the other. Hence the probability of a re-entrant quadrilateral is

½p1 + ½p2,

where

p1 = prob. (WXYZ re-entrant), X, Y, Z in one triangle;  
p2 = do., X in one triangle, Y in the other, Z in either.

But p1 = 4/9. Now to find p2; the chance of Z falling within the triangle WXY is the mean area of WXY divided by ABC. Now by par. 89, for any particular position of W, M(WXY) = WGG′, where G, G′ are the centres of gravity of ABW, CBW. It is easy to see that WGG′ = 1/9ABC = 1/9, putting ABC = 1. Now if Z falls in CBW, the chance of WXYZ re-entrant is 2M(IYW), for Y is as likely to fall in WXZ as Z to fall in WXY; also if Z falls in ABW the chance of WXYZ re-entrant is 2M(IXW). Thus the whole chance is p2 = 2M(IYW + IXW) = 2/9. Hence the probability of a re-entrant quadrilateral is

$$\tfrac12\cdot\tfrac49 + \tfrac12\cdot\tfrac29 = \tfrac13.$$

That of its being convex is ⅔.
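The chances of the re-entrant and the convex figure can be tried by taking four points at random within a triangle many times over. The sketch below assumes the triangle (0, 0), (1, 0), (0, 1) and calls a set of four points re-entrant when one of them lies within the triangle formed by the other three; the names are illustrative.

```python
import random

def random_point(rng):
    """A point taken at random within the triangle (0,0), (1,0), (0,1)."""
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        u, v = 1.0 - u, 1.0 - v
    return u, v

def cross(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def inside(p, a, b, c):
    """True when p lies within the triangle abc."""
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0) == (d2 > 0) == (d3 > 0)

def reentrant_frequency(trials=100_000, seed=8):
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        pts = [random_point(rng) for _ in range(4)]
        count += any(inside(pts[i], *(pts[j] for j in range(4) if j != i))
                     for i in range(4))
    return count / trials

reentrant = reentrant_frequency()
print(reentrant, 1.0 - reentrant)
```

The first frequency should lie near 1/3 and the second near ⅔.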

Fig. 9.

92. From this probability we may pass to the mean value of the area XYZ. If M be this mean, and A the given area, the chance of a fourth point falling on the triangle is M/A; and the chance of a re-entrant quadrilateral is four times this, or 4M/A. This chance has just been shown to be 1/3; and accordingly M = (1/12)A.

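93. The preceding problem is a particular case of a more general problem investigated by Sylvester. For another instance, let the given area A be a circle of radius a; within it three points are taken at random; and let M be the mean value of the triangle thus formed. Adding a concentric ring α, we have, since M′ : M as the areas of the circles, M′ = M(A + α)/A.

$$\therefore\ A\cdot\frac{M\alpha}{A} = 3\alpha(M_1 - M);\quad \therefore\ M = \tfrac34 M_1,$$

where M1 is the value of M when one of the points is on the circumference. Take O fixed; we have to find the mean value of OXY (fig. 9). Taking (ρ, θ), (ρ′, θ′) as co-ordinates of X, Y,

$$M_1 = (\pi a^{2})^{-2}\iint \rho\,d\rho\,d\theta \iint \rho'\,d\rho'\,d\theta'\cdot(OXY).$$

$$\therefore\ M_1 = (\pi^{2}a^{4})^{-1}\iint\!\!\iint \tfrac12\rho\rho'\sin(\theta - \theta')\,\rho\,\rho'\,d\rho\,d\rho'\,d\theta\,d\theta'$$

$$= (\pi^{2}a^{4})^{-1}\cdot\tfrac12\iint \tfrac19 r^{3}r'^{3}\sin(\theta - \theta')\,d\theta\,d\theta',$$

putting r = OH, r′ = OK; as r = 2a sin θ, r′ = 2a sin θ′,

$$M_1 = \frac{1}{\pi^{2}a^{4}}\cdot\frac{(2a)^{6}}{9}\int_0^{\pi}\!\!\int_0^{\theta}\sin^{3}\theta\,\sin^{3}\theta'\,\sin(\theta - \theta')\,d\theta'\,d\theta.$$

Professor Sylvester has remarked that this double integral, by means of the theorem

,

is easily shown to be identical with

.

$$\therefore\ M_1 = \frac{35a^{2}}{36\pi};\quad \therefore\ M = \frac{35}{48\pi^{2}}\,\pi a^{2}.$$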

From this mean value we pass to the probability that four points within a circle shall form a re-entrant figure, viz.

$$p = \frac{35}{12\pi^{2}}.$$
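Both the mean area and the derived probability can be tried on a circle of unit radius, where the mean triangle is 35/48π (about 0.232) and the chance of a re-entrant figure is 35/12π² (about 0.296). The following simulation is a sketch under that assumption, with hypothetical helper names; a figure is counted re-entrant when one of the four points lies within the triangle of the other three.

```python
import math
import random

def point_in_disk(rng):
    """A point taken at random within the unit circle."""
    r = math.sqrt(rng.random())
    t = rng.uniform(0.0, 2.0 * math.pi)
    return r * math.cos(t), r * math.sin(t)

def cross(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def inside(p, a, b, c):
    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    return (d1 > 0) == (d2 > 0) == (d3 > 0)

def simulate(trials=100_000, seed=9):
    """Return (mean area of a random triangle, frequency of re-entrant four-point figures)."""
    rng = random.Random(seed)
    area_sum, reentrant = 0.0, 0
    for _ in range(trials):
        pts = [point_in_disk(rng) for _ in range(4)]
        area_sum += abs(cross(pts[0], pts[1], pts[2])) / 2.0
        reentrant += any(inside(pts[i], *(pts[j] for j in range(4) if j != i))
                         for i in range(4))
    return area_sum / trials, reentrant / trials

mean_area, reentrant_frequency = simulate()
print(mean_area, 35.0 / (48.0 * math.pi))
print(reentrant_frequency, 35.0 / (12.0 * math.pi ** 2))
```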

94. The function of expectation in this class of problem appears to afford an additional justification of the position here assigned to this conception[155] as distinguished from an average in the more general sense which is proper to the following Part.

Part II.—Averages and Laws of Error

95. Averages.—An average may be defined as a quantity derived from a given set of quantities by a process such that, if the constituents become all equal, the average will coincide with the constituents, and the constituents not being equal, the average is greater than the least and less than the greatest of the constituents. For example, if x1, x2, . . . xn, are the constituents, the following expressions form averages (called respectively the arithmetic, geometric and harmonic means):—

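$$\frac{x_1 + x_2 + \ldots + x_n}{n},$$

$$(x_1 \times x_2 \times \ldots \times x_n)^{1/n},$$

$$\frac{1}{\dfrac{1}{n}\left(\dfrac{1}{x_1} + \dfrac{1}{x_2} + \ldots + \dfrac{1}{x_n}\right)}.$$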

The conditions of an average are likewise satisfied by innumerable other symmetrical functions, for example:—

$$\left(\frac{x_1^{2} + x_2^{2} + \ldots + x_n^{2}}{n}\right)^{1/2}.$$

The conception may be extended from symmetrical to unsymmetrical functions by supposing any one or more of the constituents in the former to be repeated several times. Thus if in the first of the averages above instanced (the arithmetic mean) the constituent $x_r$ occurs l times, the expression is to be modified by putting $lx_r$ for $x_r$ in the numerator, and in the denominator, for n, n + l − 1. The definition of an average covers a still wider field. The process employed need not be a function.[156] One of the most important averages is formed by arranging the constituents in the order of magnitude and taking for the average a value which has as many constituents above it as below it, the median. The designation is also extended to that value about which the greatest number of the constituents cluster most closely, the “centre of greatest density,” or (with reference to the geometrical representation of the grouping of the constituents) the greatest ordinate, or, as recurring most frequently, the mode.[157] But to comply with the definition there must be added the condition that the mode does not occur at either extremity of the range between the greatest and the least of the constituents. There should be also in general added a definition of the process by which the mode is derived from the given constituents.[158] Perhaps this specification may be dispensed with when the number of the constituents is indefinitely large. For then it may be presumed that any method of determining the mode will lead to the same result. This presumption presupposes that the constituents are quantities of the kind which form the sort of “series” which is proper to Probabilities.[159] A similar presupposition is to be made with respect to the constituents of the other averages, so far as they are objects of probabilities.
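The several averages described in this article may be computed side by side for a small set of constituents. The sketch below uses Python's standard `statistics` module; the constituents 1, 2, 4, 8 are chosen here purely for illustration and do not come from the article.

```python
import math
import statistics

values = [1, 2, 4, 8]  # illustrative constituents, not taken from the article

arithmetic = statistics.fmean(values)
geometric = statistics.geometric_mean(values)
harmonic = statistics.harmonic_mean(values)
quadratic = math.sqrt(statistics.fmean([v * v for v in values]))
median = statistics.median(values)

# Unsymmetrical variant: the constituent 4 is supposed to occur l = 3 times,
# so the numerator carries 3 * 4 and the denominator becomes n + l - 1 = 6.
weighted = (1 + 2 + 3 * 4 + 8) / (len(values) + 3 - 1)

averages = {"arithmetic": arithmetic, "geometric": geometric,
            "harmonic": harmonic, "quadratic": quadratic,
            "median": median, "weighted": weighted}
for name, value in averages.items():
    print(name, value)
```

Each value lies between the least and the greatest of the constituents, as the definition requires.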

96. The Law of Error.—Of the propositions respecting average with which Probabilities is concerned the most important are those which deal with the relation of the average to its constituents, and are commonly called “laws of error.” Error is defined in popular dictionaries as “deviation from truth”; and since truth commonly lies in a mean, while measurements are some too large and some too small, the term in scientific diction is extended to deviations of statistics from their average, even when that average—like the mean of human or barometric heights—does not stand for any real objective thing. A “law of error” is a relation between the extent of a deviation and the frequency with which it occurs: for instance, the proposition that if a digit is taken at random from mathematical tables, the difference between that figure and the mean of the whole series (indefinitely prolonged) of figures so obtained, namely, 4.5, will in the long run prove to be equally often ±0.5, ±1.5, ±2.5, ±3.5, ±4.5.[160] The assignment of frequency to discrete values—as 0, 1, 2, &c., in the preceding example—is often replaced by a continuous curve with a corresponding equation. The distinction of being the law of error is bestowed on a function which is applicable not merely to one sort of statistics—such as the digits above instanced—but to the great variety of miscellaneous groups, generally at least, if not universally. What form is most deserving of this distinction is not decided by uniform usage; different authorities do not attach the same weight to the different grounds on which the claim is based, namely the extent of cases to which the law may be applicable, the closeness of the application, and the presumption prior to specific experience in favour of the law. 
The term “the law of error” is here employed to denote (1) a species to which the title belongs by universal usage, (2) a wider class in favour of which there is the same sort of a priori presumption as that which is held to justify the more familiar species. The law of error thus understood forms the subject of the first section below.

97. Laws of Frequency.—What other laws of error may require notice are included in the wider genus “laws of frequency,” which forms the subject of the second section. Laws of frequency, so far as they belong to the domain of Probabilities, relate to much the same sort of grouped statistics as laws of error, but do not, like them, connote an explicit reference to an average. Thus the sequence of random digits above instanced as affording a law of error, considered without reference to the mean value, presents the law of frequency that one digit occurs as often as another (in the long run). Every law of error is a law of frequency; but the converse is not true. For example, it is a law of frequency (discovered by Professor Pareto[161]) that the number of incomes of different size (above a certain size) is approximately represented by the equation $y = A/x^{a}$, where x denotes the size of an income, y the number of incomes of that size. But whether this generalization can be construed as a law of error (in the sense here defined) depends on the nice inquiry whether the point from which the frequency diminishes as the income x increases can be regarded as a “mode,” y diminishing as x decreases from that point.

Section I.—The Law of Error.

98. (1) The Normal Law of Error.—The simplest and best recognized statement of the law of error, often called the “normal law,” is the equation

$$z = \frac{1}{\sqrt{\pi}\,c}\,e^{-(x - a)^{2}/c^{2}},$$

more conveniently written $z = (1/\sqrt{\pi}\,c)\exp[-(x - a)^{2}/c^{2}]$, where x is the magnitude of an observation or “statistic,” z is the proportional frequency of observations measuring x, a is the arithmetic mean of the group (supposed indefinitely[162] multiplied) of similar statistics: c is a constant sometimes called the “modulus”[163] proper to the group; and the equation signifies that if any large number N of such a group is taken at random, the number of observations between x and x + ∆x is (approximately) equal to the right-hand side of the equation multiplied by N∆x. A graphical representation of the corresponding curve, sometimes called the “probability-curve,” is here given (fig. 10), showing the general shape of the curve, and how its dimensions vary with the magnitude of the modulus c. The area being constant (viz. unity), the curve is furled up when c is small, spread out when c is large. There is added a table of integrals, corresponding to areas subtended by the curve; in a form suited for calculations of probability, the variable, τ, being the length of the abscissa referred to (divided by) the modulus.[164] It may be noted that the points of inflexion in the figure are each at a distance from the origin of 1/√2 modulus, a distance equal to the square root of the mean square of error, often called the “standard deviation.” Another notable value of the abscissa is that which divides the area on either side of the origin into two equal parts; commonly called the “probable error.” The value of τ which corresponds to this point is 0.4769. . . .
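The properties of the normal law stated in this article may be checked numerically. The sketch below assumes the form z = (1/√π c) exp[−(x − a)²/c²] with modulus c, and uses the error function of Python's `math` module for the area lying within τ moduli of the mean; the function names are illustrative.

```python
import math

def normal_law(x, a=0.0, c=1.0):
    """Proportional frequency z at x, for mean a and modulus c."""
    return math.exp(-((x - a) / c) ** 2) / (math.sqrt(math.pi) * c)

def area_within(tau):
    """Area of the curve between a - tau*c and a + tau*c."""
    return math.erf(tau)

def mean_square_of_error(c, steps=200_000, span=10.0):
    """Midpoint-rule value of the integral of x^2 z dx over a wide range."""
    h = 2.0 * span * c / steps
    total = 0.0
    for k in range(steps):
        x = -span * c + (k + 0.5) * h
        total += x * x * normal_law(x, 0.0, c) * h
    return total

probable_error_area = area_within(0.4769)      # should be close to one half
standard_deviation = math.sqrt(mean_square_of_error(2.0))
print(probable_error_area, standard_deviation, 2.0 / math.sqrt(2.0))
```

The area within τ = 0.4769 comes out very nearly one half, and the root of the mean square of error agrees with c/√2, the distance of the points of inflexion from the mean.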

Fig. 10.

99. An a priori proof of this law was given by Herschel[165] as follows: “The probability of an error depends solely on its magnitude and not on its direction;” positive and negative errors are equally probable. “Suppose a ball dropped from a given height with the intention that it should fall on a given mark,” errors in all directions are equally probable, and errors in perpendicular directions are independent. Accordingly the required law, “which must necessarily be general and apply alike in all cases, since the causes of error are supposed alike unknown,”[166] is for one dimension of the form $\varphi(x^{2})$, for two dimensions $\varphi(x^{2} + y^{2})$; and $\varphi(x^{2} + y^{2}) \equiv \varphi(x^{2}) \times \varphi(y^{2})$; a functional equation of which the solution is the function above written. A reason which satisfied Herschel is entitled to attention, especially if it is endorsed by Thomson and Tait.[167] But it must be confessed that the claim to universality is not, without some strain of interpretation,[168] to be reconciled with common experience.

Table of the Values of the Integral $I = \frac{2}{\sqrt{\pi}}\int_0^{\tau} e^{-t^{2}}\,dt$.

τ I


  0.00   0.00000  
  .01 .01128
  .02 .02256
  .03 .03384
  .04 .04511
  .05 .05637
  .06 .06762
  .07 .07886
  .08 .09008
  .09 .10128
 .1 .11246
 .2 .22270
 .3 .32863
 .4 .42839
 .5 .52050
 .6 .60386
 .7 .67780
 .8 .74210
 .9 .79691
1.0 .84270
1.1 .88020
1.2 .91031
1.3 .93401
1.4 .95229
1.5 .96611
1.6 .97635
1.7 .98379
1.8 .98909
1.9 .99279
2.0 .99532
2.1 .99702
2.2 .99814
2.3 .99886
2.4 .99931
2.5 .99959
2.6 .99976
2.7 .99986
2.8 .99992
2.9 .99996
3.0 .99998
∞ 1.00000

100. There is, however, one class of phenomena to which Herschel's reasoning applies without reservation. In a “molecular chaos,” such as the received kinetic theory of gases postulates, if a molecule be placed at rest at a given point and the distance which it travels from that point in a given time, driven hither and thither by colliding molecules, is regarded as an “error,” it may be presumed that errors in all directions are equally probable and errors in perpendicular directions are independent. It is remarkable that a similar presumption with respect to the velocities of the molecules was employed by Clerk Maxwell, in his first approach to the theory of molecular motion, to establish the law of error in that region.

101. The Laplace-Quetelet Hypothesis.—That presumption has, indeed, not received general assent; and the law of error appears to be better rested on a proof which was originated by Laplace. According to this view, the normal law of error is a first approximation to the frequency with which different values are apt to be assumed by a variable magnitude dependent on a great number of independent variables, each of which assumes different values in random fashion over a limited range, according to a law of error, not in general the law, nor in general the same for each variable. The normal law prevails in nature because it often happens—in the world of atoms, in organic and in social life—that things depend on a number of independent agencies. Laplace, indeed appears to have applied the mathematical principle on which this explanation depends only to examples (of the law of error) artificially generated by the process of taking averages. The merit of accounting for the prevalence of the law in rerum natura belongs rather to Quetelet. He, however, employed too simple a formula[169] for the action of the causes. The hypothesis seems first to have been stated in all its generality both of mathematical theory and statistical exemplification by Glaisher.[170]

102. The validity of the explanation may best be tested by first (A) deducing the law of error from the condition of numerous independent causes; and (B) showing that the law is adequately fulfilled in a variety of concrete cases, in which the condition is probably present. The condition may be supposed to be perfectly fulfilled in games of chance, or, more generally, sortitions, characterized by the circumstance that we have a knowledge prior to specific experience of the proportion of what Laplace calls favourable cases[171] to all cases, a category which includes, for instance, the distribution of digits obtained by random extracts from mathematical tables, as well as the distribution of the numbers of points on dominoes.

103. Games of Chance. The genesis of the law of error is most clearly illustrated by the simplest sort of “game,” that in which the sortition is between two alternatives, heads or tails, hearts or not-hearts, or, generally, success or failure, the probability of a success being p and that of a failure q, where p + q = 1. The number of such successes in the course of n trials may be considered as an aggregate made up of n independently varying elements, each of which assumes the values 0 or 1 with respective frequency q and p. The frequency of each value of the aggregate is given by a corresponding term in the expansion of $(q + p)^n$, and by a well-known theorem[172] this term is approximately equal to $\frac{1}{\sqrt{2\pi npq}}\,e^{-\nu^2/2npq}$; where ν is the number of integers by which the term is distant from np (or an integer close to np); provided that ν is of (or <) the order √n. Graphically, let the sortition made for each element be represented by the taking or not taking with respective frequency p and q a step of length i. If a body starting from zero takes successively n such steps, the point at which it will most probably come to a stop is at npi (measured from zero); the probability of its stopping at any neighbouring point within a range of ± √n · i is given by the above-written law of frequency, νi being the distance of the stopping-point from npi. Put νi = x and $2npqi^2 = c^2$; then the probability may be written $\frac{i}{\sqrt{\pi}\,c}\,e^{-x^2/c^2}$, that is, $\frac{1}{\sqrt{\pi}\,c}\,e^{-x^2/c^2}$ for each unit of length along the axis.
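The approximation quoted in this paragraph can be checked by direct computation. The following is a minimal sketch, not part of the original argument: the values n = 1000 and p = 0.5 and the function names are chosen here purely for illustration.

```python
from math import comb, exp, pi, sqrt

def binomial_term(n, p, successes):
    """Exact term of (q + p)^n: probability of the given number of successes."""
    q = 1 - p
    return comb(n, successes) * p**successes * q**(n - successes)

def normal_term(n, p, nu):
    """Approximate value of the term lying nu places from n*p."""
    npq = n * p * (1 - p)
    return exp(-nu**2 / (2 * npq)) / sqrt(2 * pi * npq)

n, p = 1000, 0.5                # illustrative values only
centre = round(n * p)
rows = []
for nu in (0, 5, 10, 20, 30):   # all of the order of sqrt(n) or less
    rows.append((nu, binomial_term(n, p, centre + nu), normal_term(n, p, nu)))

for nu, exact, approx in rows:
    print(f"nu={nu:3d}  exact={exact:.6f}  approx={approx:.6f}")
```

Within the stated range of ν the two columns should agree to well within one per cent; the agreement is expected to deteriorate as ν grows beyond the order √n.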

104. It is a short step, but a difficult one, from this case, in which the element is binomial—heads or tails—to the general case, in which the element has several values, according to the law of frequency—consists, for instance, of the number of points presented by a randomly-thrown die. According to the general theorem, if Q is the sum[173] of numerous elements, each of which assumes different magnitudes according to a law of frequency, $z = f_r(x)$, the function f being in general different for different elements, the number of times that Q assumes magnitudes between x and x + ∆x in the course of N trials is Nz∆x, if $z = \frac{1}{\sqrt{2\pi k}}\,e^{-(x-a)^2/2k}$; where a is the sum of the arithmetic means of all the elements, any one of which $a_r = [\int x f_r(x)\,dx]$, the square brackets denoting that the integrations extend between the extreme limits of the element's range, if the frequency-locus for each element is continuous, it being understood that $[\int f_r(x)\,dx] = 1$; and k is the sum of the mean squares of error for each element, $= \sum[\int \xi^2 f_r(a_r + \xi)\,d\xi]$, if the frequency-locus for each element is continuous, where $a_r$ is the arithmetic mean of one of the elements, and ξ the deviation of any value assumed by that element from $a_r$, ∑ denoting summation over all the elements. When the frequency-locus for the element is not continuous, the integrations which give the arithmetic mean and mean square of error for the element must be replaced by summations. For example, in the case of the dice above instanced, the law of frequency for each element is that it assumes equally often each of the values 1, 2, 3, 4, 5, 6. Thus the arithmetic mean for each element is 3.5, and the mean square of error $\{(3.5 - 1)^2 + (3.5 - 2)^2 + \text{\&c.}\}/6 = 35/12$, or 2.9167 nearly. Accordingly, the sum of the points obtained by tossing a large number, n, of dice at random will assume a particular value x with a frequency which is approximately assigned by the equation

$$z = \frac{1}{\sqrt{2\pi n \times 2.9167}}\exp\left[-\frac{(x - 3.5n)^2}{2n \times 2.9167}\right].$$

The rule equally applies to the case in which the elements are not similar; one might be the number of points on a die, another the number of points on a domino, and so on. Graphically, each element is no longer represented by a step which is either null or i, but by a step which may be, with an assigned probability, one or other of several degrees between those limits, the law of frequency and the range of i being different for the different elements.
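The example of the dice lends itself to a direct check. The sketch below is an illustration only: it forms the exact distribution of the total of n dice by repeated convolution and sets it beside the normal law with mean 3.5n and mean square of error 35n/12. The choice n = 30 and the function names are assumptions made here.

```python
from math import exp, pi, sqrt

FACES = (1, 2, 3, 4, 5, 6)

def sum_distribution(n):
    """Exact probability of every possible total of n fair dice (by convolution)."""
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for total, prob in dist.items():
            for face in FACES:
                nxt[total + face] = nxt.get(total + face, 0.0) + prob / len(FACES)
        dist = nxt
    return dist

mean = sum(FACES) / len(FACES)                                   # 3.5
mean_square = sum((f - mean) ** 2 for f in FACES) / len(FACES)   # 35/12

def normal_frequency(x, n):
    """Normal law with centre 3.5n and mean square of error n * 35/12."""
    k = n * mean_square
    return exp(-(x - n * mean) ** 2 / (2 * k)) / sqrt(2 * pi * k)

n = 30                      # illustrative number of dice
exact = sum_distribution(n)
for x in (105, 110, 115, 120):
    print(x, round(exact[x], 5), round(normal_frequency(x, n), 5))
```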

105. Variant Proofs.—The evidence of these statements can only be indicated here. All the proofs which have been offered involve some postulate as to the deviation of the elements from their respective centres of gravity, their “errors.” If these errors extended to infinity, it might well happen that the law of error would not be fulfilled by a sum of such elements.[174] The necessary and sufficient postulate appears to be that the mean powers of deviation for the elements, the second (above written) and the similarly formed third, fourth, &c., powers (up to some assigned power), should be finite.[175]

106. (1) The proof which seems to flow most directly from this postulate proceeds thus. It is deduced that the mean powers of deviation for the proposed representative curve, the law of error (up to a certain power), differ from the corresponding powers of the actual locus by quantities which are negligible when the number of the elements is large.[176] But loci which have their mean powers of deviation (up to some certain power) approximately equal may be considered as approximately coincident.[177]

107. (2) The earliest and best-known proof is that which was originated by Laplace and generalized by Poisson.[178] Some idea of this celebrated theory may be obtained from the following free version, applied to a simple case. The case is that in which all the elements have one and the same locus of frequency, and that locus is symmetrical about the centre of gravity. Let the locus be represented by the equation $\eta = \varphi(\xi)$, where the centre of gravity is the origin, and $\varphi(+\xi) = \varphi(-\xi)$; the construction signifying that the probability of the element having a value ξ (between say $\xi - \tfrac{1}{2}\Delta\xi$ and $\xi + \tfrac{1}{2}\Delta\xi$) is $\varphi(\xi)\Delta\xi$. Square brackets denoting summation between extreme limits, put $\chi(a)$ for $[S\varphi(\xi)e^{\sqrt{-1}\,a\xi}\Delta\xi]$, where ξ is an integer multiple of ∆ξ (or ∆x) $= \rho\Delta x$, say. Form the mth power of $\chi(a)$. The coefficient of $e^{\sqrt{-1}\,ar\Delta x}$ in $(\chi(a))^m$ is the probability that the sum of the values of the m elements should be equal to r∆x; a probability which is equal to $\Delta x\,y_r$, where y is the ordinate of the locus representing the frequency of the compound quantity (formed by the sum of the elements). Owing to the symmetry of the function φ the value of $y_r$ will not be altered if we substitute for $e^{\sqrt{-1}\,ar\Delta x}$, $e^{-\sqrt{-1}\,ar\Delta x}$, nor if we substitute $\tfrac{1}{2}(e^{+\sqrt{-1}\,ar\Delta x} + e^{-\sqrt{-1}\,ar\Delta x})$, that is $\cos ar\Delta x$. Thus $(\chi(a))^m$ becomes a sum of terms of the form $\Delta x\,y_r\cos ar\Delta x$, where $y_{-r} = y_{+r}$. Now multiply $(\chi(a))^m$ thus expressed by $\cos t\Delta x\,a$, where, t being an integer, t∆x = x, the abscissa of the “error” the probability of whose occurrence is to be determined. The product will consist of a sum of terms of the form $\Delta x\,y_r\,\tfrac{1}{2}(\cos a(r + t)\Delta x + \cos a(r - t)\Delta x)$. As every value of r − t (except zero) is matched by a value equal in absolute magnitude, r + t, and likewise every value of r + t is matched by a value r − t, the series consists of terms of the form $\Delta x\,y_r\cos qa\Delta x$ together with the term $\Delta x\,y_t$, where q has all possible integer values from 1 to the largest value of |r|[179] increased by |t|; and the term free from circular functions is the equivalent of $\tfrac{1}{2}\Delta x\,y_r\cos a(r + t)\Delta x$, when r = −t, together with $\tfrac{1}{2}\Delta x\,y_r\cos a(r - t)\Delta x$, when r = +t, since $y_{-t} = y_{+t}$. Now substitute for a∆x a new symbol β; and integrate with respect to β the thus transformed $(\chi(a))^m\cos t\Delta x\,a$ between the limits β = 0 and β = π. The integrals of all the terms which are of the form $\Delta x\,y_r\cos q\beta$ will vanish, and there will be left surviving only $\pi\Delta x\,y_t$. We thus obtain, as equal to $\pi\Delta x\,y_t$, $\int_0^{\pi}(\chi(\beta/\Delta x))^m\cos t\beta\,d\beta$. Now change the independent variable to a; then as dβ = da∆x,

$$\Delta x\,y_t = \frac{\Delta x}{\pi}\int_0^{\pi/\Delta x}(\chi(a))^m\cos(at\Delta x)\,da.$$

Replacing t∆x by x, and dividing both sides by ∆x, we have

$$y_x = \frac{1}{\pi}\int_0^{\pi/\Delta x}(\chi(a))^m\cos ax\,da.$$

Now expanding the $\cos a\xi$ which enters into the expression for $\chi(a)$, we obtain

$$\chi(a) = [S\varphi(\xi)\Delta\xi] - \frac{1}{2!}[S\varphi(\xi)\xi^2\Delta\xi]\,a^2 + \frac{1}{4!}[S\varphi(\xi)\xi^4\Delta\xi]\,a^4 - \dots$$

Performing the summations indicated, we express $\chi(a)$ in terms of the mean powers of deviation for an element. Whence $(\chi(a))^m$ is expressible in terms of the mean powers of the compound locus. First and chief is the mean second power of deviation for the compound, which is the sum of the mean second powers of deviation for the elements, say k. It is found that the sought probability may be equated to a series of which the first term is $\frac{1}{\pi}\int_0^{\pi/\Delta x}e^{-\frac{1}{2}ka^2}\cos ax\,da$, and of which the succeeding terms involve $k_2$, the coefficient defined below.[180] Here π/∆x may be replaced by ∞, since the finite difference ∆x is small with respect to unity when the number of the elements is large;[181] and thus the integrals involved become equatable to known definite integrals. If it were allowable to neglect all the terms of the series but the first the expression would reduce to $\frac{1}{\sqrt{2\pi k}}\,e^{-x^2/2k}$, the normal law of error. But it is allowable to neglect the terms after the first, in a first approximation, for values of x not exceeding a certain range, the number of the elements being large, and if the postulate above enunciated is satisfied.[182] With these reservations it is proved that the sum of a number of similar and symmetrical elements conforms to the normal law of error. The proof is by parity extended to the case in which the elements have different but still symmetrical frequency functions; and, by a bolder use of imaginary quantities, to the case of unsymmetrical functions.
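The inversion on which this proof turns can be tried numerically on a small case. The sketch below assumes a hypothetical symmetrical element taking the values −1, 0, +1 with frequencies ¼, ½, ¼ (so that ∆x = 1), and m = 20 such elements; neither choice comes from the text. It evaluates (1/π)∫ χ(β)^m cos tβ dβ between 0 and π by the trapezoid rule, and compares the result with the probability found by direct convolution and with the normal law having k equal to the sum of the mean second powers.

```python
from math import cos, exp, pi, sqrt

# Hypothetical symmetrical element: values -1, 0, +1 (so that Delta x = 1).
PHI = {-1: 0.25, 0: 0.5, 1: 0.25}

def chi(beta):
    """chi for one element; by symmetry only the cosine part survives."""
    return sum(prob * cos(beta * xi) for xi, prob in PHI.items())

def inverted(t, m, steps=2000):
    """(1/pi) * integral from 0 to pi of chi(beta)^m * cos(t*beta), trapezoid rule."""
    h = pi / steps
    total = 0.0
    for j in range(steps + 1):
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * chi(j * h) ** m * cos(t * j * h)
    return total * h / pi

def exact(m):
    """Distribution of the sum of m elements by direct convolution."""
    dist = {0: 1.0}
    for _ in range(m):
        nxt = {}
        for total, prob in dist.items():
            for xi, p in PHI.items():
                nxt[total + xi] = nxt.get(total + xi, 0.0) + prob * p
        dist = nxt
    return dist

m = 20
k = m * sum(p * xi**2 for xi, p in PHI.items())   # mean second power of the sum
dist = exact(m)
for t in range(4):
    normal = exp(-t**2 / (2 * k)) / sqrt(2 * pi * k)
    print(t, dist[t], inverted(t, m), normal)
```

The first two columns should coincide to rounding error, the integral being an exact expression of the probability; the third column differs from them only by the neglected terms of the series.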

108. (3) De Forest[183] has given a proof unencumbered by imaginaries of what is the fundamental proposition in Laplace's theory that, if a polynomial of the form

$$A_0 + A_1 z + A_2 z^2 + \dots + A_m z^m$$

be raised to the nth power and expanded in the form

$$B_0 + B_1 z + B_2 z^2 + \dots + B_{mn} z^{mn}$$

then the magnitudes of the B's in the neighbourhood of their maximum (say B1) will be disposed in accordance with a “probability-curve,” or normal law of error.
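The proposition may be illustrated by raising a small polynomial to a high power and comparing the coefficients near their maximum with a probability-curve. This is a sketch only: the coefficients 0.5, 0.3, 0.2 (scaled to sum to unity, so that the B's are probabilities) and the power n = 200 are hypothetical, and the comparison curve is the normal law having the mean and mean square of deviation implied by those coefficients.

```python
from math import exp, pi, sqrt

def poly_power(coeffs, n):
    """Coefficients B_0 ... B_mn of (A_0 + A_1 z + ... + A_m z^m)^n."""
    result = [1.0]
    for _ in range(n):
        nxt = [0.0] * (len(result) + len(coeffs) - 1)
        for i, b in enumerate(result):
            for j, a in enumerate(coeffs):
                nxt[i + j] += a * b
        result = nxt
    return result

A = [0.5, 0.3, 0.2]    # hypothetical coefficients, scaled so that they sum to 1
n = 200
B = poly_power(A, n)

mean = sum(j * a for j, a in enumerate(A))                 # per factor
var = sum((j - mean) ** 2 * a for j, a in enumerate(A))    # per factor
peak = max(range(len(B)), key=B.__getitem__)               # index of the largest B

def curve(j):
    """Normal law with centre n*mean and mean square of deviation n*var."""
    k = n * var
    return exp(-(j - n * mean) ** 2 / (2 * k)) / sqrt(2 * pi * k)

for j in range(peak - 10, peak + 11, 5):
    print(j, round(B[j], 5), round(curve(j), 5))
```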

109. (4) Professor Morgan Crofton's original proof of the law of error is based on a datum obtained by observing the effect which the introduction of a new element produces on the frequency-locus for the aggregate of elements. It seems to be assumed, very properly, that the sought function involves as constants some at least of the mean powers of the aggregate, in particular the mean second power, say k. We may without loss of generality refer each of the elements (and accordingly the aggregate) to its respective centre of gravity. Then if y, = f(x), is the ordinate of the frequency-locus for the aggregate before taking in a new element, and y + ∂y the ordinate after that operation, by a well-known principle,[184] $y + \partial y = [S\varphi_m(\xi)f(x - \xi)\Delta\xi]$, where η, $= \varphi_m(\xi)$, is the frequency-locus for the new element, and the square brackets indicate that the summation is to extend over the whole range of values assumed by that element. Expanding in ascending powers of (each value of) ξ and neglecting powers above the second, as is found to be legitimate under the conditions specified, we have (since the first mean power of the element vanishes)

$$\partial y = \tfrac{1}{2}[S\xi^2\varphi_m(\xi)\Delta\xi]\,\frac{d^2f}{dx^2}.$$

From the fundamental proposition that the mean square for the aggregate equals the sum of mean squares for the elements it follows that $[S\xi^2\varphi_m(\xi)\Delta\xi]$, the mean second power of deviation for the mth element, is equal to ∂k, the addition to k, the mean second power of deviation for the aggregate. There is thus obtained a partial differential equation of the second order

$$\frac{dy}{dk} = \tfrac{1}{2}\,\frac{d^2y}{dx^2}. \qquad (1)$$

A subsidiary equation is (in effect) obtained by Professor Crofton from the property that if the unit according to which the axis of x is graduated is altered in any assigned ratio, there must be a corresponding alteration both of the ordinate expressing the frequency of the aggregate and of the mean square of deviation for the aggregate. By supposing the alteration indefinitely small he obtains a second partial differential equation, viz. (in the notation here adopted)

$$y + x\,\frac{dy}{dx} + 2k\,\frac{dy}{dk} = 0. \qquad (2)$$

From these two equations, regard being had to certain other conditions of the problem,[185] it is deducible that $y = Ce^{-x^2/2k}$, where C is a constant of which the value is determined by the condition that

$$\int_{-\infty}^{+\infty} y\,dx = 1.$$
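That the normal law satisfies both of the conditions of this proof, namely dy/dk = ½ d²y/dx² and y + x dy/dx + 2k dy/dk = 0, may be verified numerically. The sketch below is an illustration under stated assumptions: it takes y = e^(−x²/2k)/√(2πk), replaces the derivatives by central differences with a step h chosen here, and reports the residual of each equation at a few arbitrary points.

```python
from math import exp, pi, sqrt

def y(x, k):
    """Normal law of error with mean second power of deviation k."""
    return exp(-x * x / (2 * k)) / sqrt(2 * pi * k)

def residuals(x, k, h=1e-4):
    """Residuals of equations (1) and (2), derivatives by central differences."""
    dy_dk = (y(x, k + h) - y(x, k - h)) / (2 * h)
    dy_dx = (y(x + h, k) - y(x - h, k)) / (2 * h)
    d2y_dx2 = (y(x + h, k) - 2 * y(x, k) + y(x - h, k)) / h**2
    eq1 = dy_dk - 0.5 * d2y_dx2
    eq2 = y(x, k) + x * dy_dx + 2 * k * dy_dk
    return eq1, eq2

points = [(0.0, 1.0), (0.7, 2.0), (-1.5, 0.5)]   # arbitrary test points (x, k)
checks = [residuals(x, k) for x, k in points]
for (x, k), (eq1, eq2) in zip(points, checks):
    print(f"x={x:5.2f} k={k:4.2f}  residual(1)={eq1:.2e}  residual(2)={eq2:.2e}")
```

Both residuals should be negligible, being limited only by the error of the finite differences.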

110. (5) The condition on which Professor Crofton's proof is based may be called differential, as obtained from the introduction of a single new element. There is also an integral condition obtained from the introduction of a whole set of new elements. For let A be the sum of $m_1$ elements, fluctuating according to the sought law of error. Let B be the sum of another set of elements $m_2$ in number ($m_1$ and $m_2$ both large). Then Q, a quantity formed by adding together each pair of concurrent values presented by A and B, must also conform to the law of error, since Q is the sum of $m_1 + m_2$ elements. The general form which satisfies this condition of reproductivity is limited by other conditions to the normal law of error.[186]
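The property of reproductivity can be exhibited for the normal law itself. In the sketch below, offered as an illustration only, A and B are supposed to follow normal laws with hypothetical mean squares of deviation 3 and 5; the law of Q = A + B is obtained by numerical integration over the concurrent values and compared with the normal law whose mean square is 3 + 5.

```python
from math import exp, pi, sqrt

def normal(x, k):
    """Normal law of error with mean square of deviation k, centred at zero."""
    return exp(-x * x / (2 * k)) / sqrt(2 * pi * k)

def law_of_sum(x, k_a, k_b, half_width=40.0, steps=8000):
    """Frequency of Q = A + B at x: integrate over the value u taken by A."""
    h = 2 * half_width / steps
    total = 0.0
    for j in range(steps + 1):
        u = -half_width + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * normal(u, k_a) * normal(x - u, k_b)
    return total * h

k_a, k_b = 3.0, 5.0     # hypothetical mean squares for A and B
for x in (0.0, 1.5, -4.0):
    print(x, law_of_sum(x, k_a, k_b), normal(x, k_a + k_b))
```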

111. The list of variant proofs is not yet exhausted,[187] but enough has been said to establish the proposition that a sum of numerous elements of the kind described will fluctuate approximately according to the normal law of error.

112. Varieties of Linear Function. As the number of elements is increased, the constant above designated k continually increases; so that the curve representing the frequency of the compound magnitude spreads out from its centre. It is otherwise if instead of the simple sum we consider the linear function formed by adding the m elements each multiplied by 1/m. The “spread” of the average thus constituted will continually diminish as the number of the elements is increased; the sides closing in as the vertex rises up. The change in “spread” produced by the accession of new elements is illustrated by the transition from the high to the low curve, in fig. 10, in the case of a sum; in the case of an average (arithmetic mean) by the reverse relation.
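The contrary behaviour of the sum and of the average may be shown with exact fractions. The sketch below takes dice as the elements (an assumption made for illustration) and computes the mean square of deviation for the sum of n dice, and for their average, in which each element is multiplied by 1/n.

```python
from fractions import Fraction

FACE_PROB = {face: Fraction(1, 6) for face in range(1, 7)}

def sum_distribution(n):
    """Exact distribution of the total of n fair dice."""
    dist = {0: Fraction(1)}
    for _ in range(n):
        nxt = {}
        for total, prob in dist.items():
            for face, p in FACE_PROB.items():
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + prob * p
        dist = nxt
    return dist

def variance(dist):
    """Mean square of deviation from the arithmetic mean."""
    mean = sum(v * p for v, p in dist.items())
    return sum((v - mean) ** 2 * p for v, p in dist.items())

spreads = {}
for n in (1, 4, 16):
    k_sum = variance(sum_distribution(n))   # grows in proportion to n
    k_avg = k_sum / n**2                    # each element multiplied by 1/n
    spreads[n] = (k_sum, k_avg)
    print(n, k_sum, k_avg)
```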

113. Extension to Non-linear Functions. The proposition which has been proved for linear functions may be extended to any other function of numerous variables, each representing the value assumed by an independently fluctuating element; if the function may be expanded in ascending powers of the variables, according to Taylor's theorem, and all the powers after the first may be neglected. The matter is not so simple as it is often represented, when the variable elements may assume large, perhaps infinite, values; but with the aid of the postulate above enunciated the difficulty can be overcome.[188]

114. Extension to two or more Dimensions. All the proofs which have been noticed have been extended to errors in two (or more) dimensions.[189] Let Q be the sum of a number of elements, each of which, being a function of two variables, x and y, assumes different pairs of values according to a law of frequency $z_r = f_r(x, y)$, the functions being in general different for different elements. The frequency with which Q assumes values of the variables between x and x + ∆x and between y and y + ∆y is z∆x∆y, if

$$z = \frac{1}{2\pi\sqrt{km - l^2}}\exp\left[-\frac{m(x - a)^2 - 2l(x - a)(y - b) + k(y - b)^2}{2(km - l^2)}\right];$$

where, as in the simpler case, $a = \sum a_r$, $a_r$ being the arithmetic mean of the values of x assumed in the long run by one of the elements, b is the corresponding sum for values of y, and

$$k = \sum\left[\iint (x - a_r)^2 f_r(x, y)\,dx\,dy\right]$$

$$m = \sum\left[\iint (y - b_r)^2 f_r(x, y)\,dx\,dy\right]$$

$$l = \sum\left[\iint (x - a_r)(y - b_r) f_r(x, y)\,dx\,dy\right];$$

the summation extending over all the elements, and the integration between the extreme limits of each; supposing that the law of frequency for each element is continuous, otherwise summation is to be substituted for integration. For example, let each element be constituted as follows: Three coins having been tossed, the number of heads presented by the first and second coins together is put for x, the number of heads presented by the second and third coins together is put for y. The law of frequency for the element is represented in fig. 11, the integers outside denoting the values of x or y, the fractions inside probabilities of particular values of x and y concurring.

          x = 0    x = 1    x = 2
y = 2       0        ⅛        ⅛
y = 1       ⅛        ¼        ⅛
y = 0       ⅛        ⅛        0

Fig. 11.

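If i is the distance from 0 to 1 and from 1 to 2 on the abscissa, and i′ the corresponding distance on the ordinate, the mean of the values of x for the element—∆a, as we may say,—is i, and the corresponding mean square of horizontal deviations, ∆k, is $\tfrac{1}{2}i^2$. Likewise ∆b = i′; ∆m $= \tfrac{1}{2}i'^2$; and ∆l $= \tfrac{1}{8}\{(+i)(+i') + (-i)(-i')\} = \tfrac{1}{4}ii'$. Accordingly, if n such elements are put together (if n steps of the kind which the diagram represents are taken), so that a = ni, b = ni′, $k = \tfrac{1}{2}ni^2$, $m = \tfrac{1}{2}ni'^2$ and $l = \tfrac{1}{4}nii'$, the frequency with which a particular pair of aggregates x and y will concur, with which a particular point on the plane of xy, namely, x = ri and y = r′i′, will be reached, is given by the equation

$$z\,\Delta x\,\Delta y = \frac{2}{\sqrt{3}\,\pi n}\exp\left\{-\frac{4}{3n}\left[(r - n)^2 - (r - n)(r' - n) + (r' - n)^2\right]\right\},$$

where ∆x = i and ∆y = i′.[190]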
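The constants of the three-coin element can be recovered by simple enumeration. The sketch below lists the eight equally likely results of tossing three coins, forms x and y as defined in the text, and computes the means, the mean squares and the mean product of deviations, taking i = i′ = 1 as the unit (an assumption made for convenience).

```python
from fractions import Fraction
from itertools import product

# Joint law of x (heads on coins 1 and 2) and y (heads on coins 2 and 3).
joint = {}
for c1, c2, c3 in product((0, 1), repeat=3):   # 1 = head, 0 = tail
    point = (c1 + c2, c2 + c3)
    joint[point] = joint.get(point, Fraction(0)) + Fraction(1, 8)

mean_x = sum(p * x for (x, _), p in joint.items())
mean_y = sum(p * y for (_, y), p in joint.items())
k_el = sum(p * (x - mean_x) ** 2 for (x, _), p in joint.items())
m_el = sum(p * (y - mean_y) ** 2 for (_, y), p in joint.items())
l_el = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())

# Print the table with y decreasing down the page, as in the figure.
for y in (2, 1, 0):
    print(y, [str(joint.get((x, y), Fraction(0))) for x in (0, 1, 2)])
print(mean_x, mean_y, k_el, m_el, l_el)
```

The mean product of deviations is positive because the second coin enters into both x and y.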

115. A verification is afforded by a set of statistics obtained with dice by Weldon, and here reproduced by his permission. A success is in this experiment defined, not by obtaining a head when a coin is tossed, but by obtaining a face with more than three points on it when a die is tossed; the probabilities of the two events are the same, or rather would be if coins and dice were perfectly symmetrical.[191] Professor Weldon virtually took six steps of the sort above described when, six painted dice having been thrown, he added the number of successes in that painted batch to the number of successes in another batch of six to form his x, and to the number of successes in a third batch of six to form his y. The result is represented in the annexed table, where each degree on the axis of x and y respectively corresponds to the i and i′ of the preceding paragraphs, and i = i′. The observed frequencies being represented by numerals, a general correspondence between the facts and the formula is apparent. The maximum frequency is, as it ought to be, at the point x = 6i, y = 6i′. The density is particularly great along a line through that point, making 45° with the axis of x; particularly small in the complementary direction. This also is as it ought to be. For if the centre is made the origin by substituting x for (x − a) and y for (y − b), and then new co-ordinates X and Y are taken, making an angle θ with x and y respectively, the curve which is traced on the plane of zX by its intersection with the surface is of the form

$$z = J\exp\left\{-\frac{X^2\,[k\sin^2\theta - 2l\cos\theta\sin\theta + m\cos^2\theta]}{2(km - l^2)}\right\},$$

a probability-curve which will be more or less spread out according as the factor $k\sin^2\theta - 2l\cos\theta\sin\theta + m\cos^2\theta$ is less or greater. Now this expression has a minimum or maximum when $(k - m)\sin 2\theta - 2l\cos 2\theta = 0$; a minimum when $(k - m)\cos 2\theta + 2l\sin 2\theta$ is positive, and a maximum when that criterion is negative; that is, in the present case, where k = m, a minimum when θ = ¼π and a maximum when θ = ¾π.

0 1 2 3 4 5 6 7 8 9 10 11 12
12
11   1   1   5   1    1
10   2   6  28  27  19   2
9   1   2  11  43  76  57  54  15   4
8   6  18  49  116  138  118  59  25   5
7  12  47  109  208  213  118  71  23   1
6   9  29  77  199  244  198  121  32   3
5   3  12  51  119  181  200  129  69  18   3
4   2  16  55  100  117  91  46  19   3
3   2  14  28  53  43  34  17   1
2   7  12  13  18   4   1   1
1   2   4   1   2   1
0
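The directions of greatest and least spread stated in paragraph 115 may be confirmed by scanning the factor k sin²θ − 2l cos θ sin θ + m cos²θ over a grid of angles. The sketch assumes six steps of the three-coin kind with i = i′ = 1, so that k = m = 3 and l = 1.5; the grid of 1800 angles is an arbitrary choice.

```python
from math import cos, pi, sin

def spread_factor(theta, k, m, l):
    """Factor in the exponent; the section is most spread out where it is least."""
    return k * sin(theta) ** 2 - 2 * l * cos(theta) * sin(theta) + m * cos(theta) ** 2

n = 6                       # six steps, as in the experiment described
k = m = n * 0.5             # mean squares of deviation for x and y
l = n * 0.25                # mean product of deviations

steps = 1800
thetas = [j * pi / steps for j in range(steps)]
theta_min = min(thetas, key=lambda t: spread_factor(t, k, m, l))
theta_max = max(thetas, key=lambda t: spread_factor(t, k, m, l))
print(theta_min / pi, theta_max / pi)   # fractions of pi
```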

116. Characteristics of the Law of Error[192]—As may be presumed from the examples just given, in order that there should be some approximation to the normal law the number of elements need not be very great. A very tolerable imitation of the probability-curve has been obtained by superposing three elements, each obeying a law of frequency quite different from the normal one,[193] namely, that simple law according to which one value of a variable occurs as frequently as another between the limits within which the variation is confined (y = 1/2a, between limits x = +a, x = −a). If the component elements obey unsymmetrical laws of frequency, the compound will indeed be to some extent unsymmetrical, unlike the “normal” probability-curve. But, as the number of the elements is increased, the portion of the compound curve in the neighbourhood of its centre of gravity tends to be rounded off into the normal shape. The portion of the compound curve which is sensibly identical with a curve of the “normal” family becomes greater the greater the number of independent elements; caeteris paribus, and granted certain conditions as to the equality and the range of the elements. It will readily be granted that if one component predominates, it may unduly impress its own character on the compound. But it should be pointed out that the characteristic with which we are now concerned is not average magnitude, but deviation from the average. The component elements may be very unequal in their contributions to the average magnitude of the compound without prejudice to its “normal” character, provided that the fluctuation of all or many of the elements is of one and the same order. The proof of the law requires that the contribution made by each element to the mean square of deviation for the compound, k, should be small, capable of being treated as differential with respect to k. 
It is not necessary that all these small quantities should be of the same order, but only that they should admit of being rearranged, by massing together those of a smaller order, as a numerous set of independent elements in which no two or three stand out as sui generis in respect of the magnitude of their fluctuation. For example, if one element consist of the number of points on a domino (the sum of two digits taken at random), and other elements, each of either 1 or 0 according as heads or tails turn up when a coin is cast, the first element, having a mean square of deviation 16.5, will not be of the same order as the others, each having 0.25 for its mean square of deviation. But sixty-six of the latter taken together would constitute an independent element of the same order as the first one; and accordingly if there are several times sixty-six elements of the latter sort, along with one or two of the former sort, the conditions for the generation of the normal distribution will be satisfied. These propositions would evidently be unaffected by altering the average magnitude, without altering the deviation from the average, for any element, that is, by adding a greater or less fixed magnitude to each element. The propositions are adapted to the case in which the elements fluctuate according to a law of frequency other than the normal. For if they are already normal, the aforesaid conditions are unnecessary. The normal law will be obeyed, by the sum of elements which each obey it, even though they are not numerous and not independent and not of the same order in respect of the extent of fluctuation. A similar distinction is to be drawn with respect to some further conditions which the reasoning requires. A limitation as to the range of the elements is not necessary when they are already normal, or even have a certain affinity to the normal curve. Very large values of the element are not excluded, provided they are sufficiently rare. 
What has been said of curves with special reference to one dimension is of course to be extended to the case of surfaces and many dimensions. In all cases the theorem that under the conditions stated the normal law of error will be generated is to be distinguished from the hypothesis that the conditions are fairly well fulfilled in ordinary experience.
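The figures quoted for the domino and the coin in paragraph 116 follow from a short computation. The sketch below assumes, as the text indicates, that the domino element is the sum of two digits each taken at random from 0 to 9, and that the coin element is 1 or 0 with equal frequency.

```python
from fractions import Fraction
from itertools import product

def mean_square_of_deviation(values):
    """Mean square of deviation for equally frequent values."""
    values = list(values)
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)

domino = [a + b for a, b in product(range(10), repeat=2)]   # sum of two digits
coin = [0, 1]                                               # tails or heads

k_domino = mean_square_of_deviation(domino)
k_coin = mean_square_of_deviation(coin)
print(k_domino, k_coin, k_domino / k_coin)
```

The ratio of the two mean squares is the number of coin elements which, massed together, form an element of the same order as one domino.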

117. (B) Verification of the Normal Law. Having deduced the genesis of the law of error from ideal conditions such as are attributed to perfectly fair games of chance, we have next to inquire how far these conditions are realized and the law fulfilled in common experience.

118. Errors proper. Among important concrete cases errors of observation occupy a leading place. The theory is brought to bear on this case by the hypothesis that an error is the algebraic sum of numerous elements, each varying according to a law of frequency special to itself. This hypothesis involves two assumptions: (1) that an error is dependent on numerous independent causes; (2) that the function expressing that dependence can be treated as a linear function, by expanding in terms of ascending powers (of the elements) according to Taylor's theorem and neglecting higher powers, or otherwise. The first assumption seems, in Dr Glaisher's words, “most natural and true. In any observation where great care is taken, so that no large error can occur, we can see that its accuracy is influenced by a great number of circumstances which ultimately depend on independent causes: the state of the observer's eye and his physiological condition in general, the state of the atmosphere, of the different parts of the instrument, &c., evidently depend on a great number of causes, while each contributes to the actual error.”[194] The second assumption seems to be frequently realized in nature. But the assumption is not always safe. For example, where the velocities of molecules are distributed according to the normal law of error, with zero as centre, the energies must be distributed according to a quite different law. This rationale is applicable not only to the fallible perceptions of the senses, but also to impressions into which a large ingredient of inference enters, such as estimates of a man's height or weight from his appearance,[195] and even higher acts of judgments.[196] Aiming at an object is an act similar to measuring an object; misses are produced by much the same variety of causes as mistakes; and, accordingly, it is found that shots aimed at the same bull's-eye are apt to be distributed according to the normal law, whether in two dimensions on a target or according to their horizontal deviations, as exhibited below (par. 156).

Miscellaneous Statistics. A residual class comprises miscellaneous statistics, physical as well as social, in which the normal law of error makes its appearance, presumably in consequence of the action of numerous independent influences. Well-known instances are afforded by human heights and other bodily measurements, as tabulated by Quetelet[197] and others.[198] Professor Pearson has found that “the normal curve suffices to describe within the limits of random sampling the distribution of the chief characters in man.”[199] The tendency of social phenomena to conform to the normal law of frequency is well exemplified by A. L. Bowley's grouping of the wages paid to different classes.[200]

119. A Variant Classification. The division of concrete errors which has been proposed is not to be confounded with another twofold classification, namely, observations which stand for a real objective thing, and such statistics as are not thus representative of something outside themselves, groups of which the mean is called “subjective.” This division would be neither clear nor useful. On the one hand so-called real means are often only approximately equal to objective quantities. Thus the proportional frequency with which one face of a die—the six suppose—turns up is only approximately given by the objective fact that the six is one face of a nearly perfect cube. For a set of dice with which Weldon experimented, the average frequency of a throw, presenting either five or six points, proved to be not ⅓ (0.3333 nearly), but 0.3377.[201] The difference of this result from the regulation ⅓ is as unpredictable from objective data, prior to experiment, as any of the means called subjective or fictitious. So the mean of errors of observation often differs from the thing observed by a so-called “constant error.” So shots may be constantly deflected from the bull's-eye by a steady wind or “drift.”

120. On the other hand, statistics, not purporting to represent a real object, have more or less close relations to magnitudes which cannot be described as fictitious. Where the items averaged are ratios, e.g. the proportion of births or deaths to the total population in several districts or other sections, it sometimes happens that the distribution of the ratios exactly corresponds to that which is obtained in the simplest games of chance—“combinational” distribution in the phrase of Lexis.[202] There is unmistakably suggested a sortition of the simplest type, with a real ascertainable relation between the number of “favourable cases” and the total number of cases. The most remarkable example of this property is presented by the proportion of male to female (or to total) births. Some other instances are given by Lexis[203] and Westergaard.[204] A similar correspondence between the actual and the “combinational” distribution has been found by Bortkevitch[205] in the case of very small probabilities (in which case the law of error is no longer “normal”). And it is likely that some ratios—such as general death-rates—not presenting combinational distribution, might be broken up into subdivisions—such as death-rates for different occupations or age periods—each distributed in that simple fashion.

121. Another sort of averages which it is difficult to class as subjective rather than objective occurs in some social statistics, under the designation of index-numbers. The percentage which represents the change in the value of money between two epochs is seldom regarded as the mere average change in the price of several articles taken at random, but rather as the measure of something, e.g. the variation in the price of a given amount of commodities, or of a unit of commodity.[206] So something substantive appears to be designated by the volume of trade, or that of the consumption of the working classes, of which the growth is measured by appropriate index-numbers,[207] the former due to Bourne and Sir Robert Giffen,[208] the latter to George Wood.[209]

122. But apart from these peculiarities, any set of statistics may be related to a certain quaesitum, very much as measurements are related to the object measured. That quaesitum is the limiting or ultimate mean to which the series of statistics, if indefinitely prolonged, would converge, the mean of the complete group; this conception of a limit applying to any frequency-constant, to “c,” for instance, as well as “a” in the case of the normal curve.[210] The given statistics may be treated as samples from which to reason up to the true constant by that principle of the calculus which determines the comparative probability of different causes from which an observed event may have emanated.[211]

123. Thus it appears that there is a characteristic more essential to the statistician than the existence of an objective quaesitum, namely, the use of that method which is primarily, but not exclusively, proper to that sort of quaesitum: inverse probability.[212] Without that delicate instrument the doctrine of error can seldom be fully utilized; but some of its uses may be indicated before the introduction of technical difficulties.

124. Applications of the Normal Law. Having established the prevalence of the law of error,[213] we go on to its applications. The mere presumption that wherever three or four independent causes co-operate, the law of error tends to be set up, has a certain speculative interest.[214] The assumption of the law as a hypothesis is legitimate. When the presumption is confirmed by specific experience this knowledge is apt to be turned to account. It is usefully applied to the practice of gunnery,[215] to determine the proportion of shots which under assigned conditions may be expected to hit a zone of given size. The expenditure of ammunition required to hit an object can thence be inferred. Also the comparison between practice under different conditions is facilitated. In many kinds of examination it is found that the total marks given to different candidates for answers to the same set of questions range approximately in conformity with the law of error. It is understood that the civil service commissioners have founded on this fact some practical directions to examiners. Apart from such direct applications, it is a useful addition to our knowledge of a class that the measurable attributes of its members range in conformity with this general law. Something is added to the truth that “the days of a man are threescore and ten,” if we may regard that epoch, or more exactly for England, 72, as “Nature's aim, the length of life for which she builds a man, the dispersion on each side of this point being . . . nearly normal.”[216] So Herschel says: “An [a mere] average gives us no assurance that the future will be like the past. A [normal] mean may be reckoned on with the most complete confidence.”[217] The existence of independent causes,[218] inferred from the fulfilment of the normal law, may be some guarantee of stability. In natural history especially have the conceptions supplied by the law of error been fruitful. 
Investigators are already on the track of this inquiry: if those members of a species whose size or other measurable attributes are above (or below) the average are preferred—by “natural” or some other kind of selection—as parents, how will the law of frequency as regards that attribute be modified in the next generation?

125. Normal Distribution of Molecular Velocities. A particularly perfect application of the normal law of error in more than one dimension is afforded by the movements of the molecules in a homogeneous gas. A general idea of the rôle played by probabilities in the explanation of these movements may be obtained without entering into the more complicated and controverted parts of the subject, without going beyond the initial very abstract supposition of perfectly elastic equal spheres. For convenience of enunciation we may confine ourselves to two dimensions. Let us imagine, then, an enormous billiard-table with perfectly elastic cushions and a frictionless cloth on which millions of perfectly elastic balls rush hither and thither at random, colliding with each other: a homogeneous chaos, with that sort of uniformity in the midst of diversity which is characteristic of probabilities. Upon this hypothesis, if we fix attention on any n balls taken at random (they need not be, according to some they ought not to be, contiguous), then if n is very large, the average properties will be approximately the same as those of the total mixture. In particular the average energy of the n balls may be equated to the average energy of the total number of balls, say T/N, if T is the total energy and N the total number of the balls. Now if we watch any one of the n specimen balls long enough for it to undergo a great number of collisions, we observe that either of its velocity-components, say that in the direction of x, viz. u, receives accessions from an immense number of independent causes in random fashion. We may presume, therefore, that these will be distributed (among the n balls) according to the law of error.
The law will not be of the type which was first supposed, where the “spread” continually increases as the number of the elements is increased.[219] Nor will it be of the type which was afterwards mentioned[220] where the spread diminishes as the number of the elements is increased. The linear function by which the elements are aggregated is here of an intermediate type; such that the mean square of deviation corresponding to the velocity remains constant. The method of composition might be illustrated by the process of taking r digits at random from mathematical tables, adding the differences between each digit and 4.5, the mean value of digits, and dividing the sum by √r. Here are some figures obtained by taking at random batches of sixteen digits from the expansion of π, subtracting 16 × 4.5 from the sum of each batch, and dividing the remainder by √16:

+1.25, +0.75, −1, −1, +5.5, −2.75, +0.75, −2, +1.75, +3.25, +0.25, −2.75, −2.25, −0.5, +4.75, +0.25.

If, instead of sixteen, a million digits went to each batch, the general character of the series would be much the same; the aggregate figures would continue to hover about zero with a standard deviation of √8.25 (about 2.87), a probable error of nearly 2. Here for instance are seven aggregates formed by recombining 252 out of the 256 digits above utilized into batches of 36 according to the prescribed rule: viz. subtracting 36 × 4.5 from the sum of each batch of 36 and dividing the remainder by √36:

−0.5, +3.3, +2.6, −0.6, +1.5, −2, +1.
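The rule of composition just described can be tried numerically. The following is a minimal sketch, assuming pseudo-random digits as a stand-in for the digits of π; the function names `batch_aggregates` and `mean_square` are illustrative only. It shows that the mean square of the aggregates stays near 8.25 (the mean square of deviation of a single digit from 4.5) whatever the size of the batch.

```python
import math
import random


def batch_aggregates(digits, batch_size):
    """For each full batch: (sum of digits - batch_size * 4.5) / sqrt(batch_size)."""
    n_batches = len(digits) // batch_size
    aggregates = []
    for b in range(n_batches):
        batch = digits[b * batch_size:(b + 1) * batch_size]
        aggregates.append((sum(batch) - 4.5 * batch_size) / math.sqrt(batch_size))
    return aggregates


def mean_square(values):
    """Mean of the squares of the values (deviations are taken from zero)."""
    return sum(v * v for v in values) / len(values)


# Assumption: pseudo-random digits replace the digits of pi used in the text.
rng = random.Random(0)
digits = [rng.randrange(10) for _ in range(160_000)]

for size in (16, 36, 400):
    aggregates = batch_aggregates(digits, size)
    # The mean square should hover about 8.25 for every batch size.
    print(size, len(aggregates), round(mean_square(aggregates), 2))
```

The spread neither grows nor shrinks with the batch size, because the sum is divided by the square root of the number of elements rather than by the number itself.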

The illustration brings into view the circumstance that though the system of molecules may start with a distribution of velocities other than the normal, yet by repeated collisions the normal distribution will be superinduced. If both the velocities u and v are distributed according to the law of error for one dimension, we may presume that the joint values of u and v conform to the normal surface. Or we may reason directly that as the pair of velocities u and v is made up of a great number of elementary pairs (the co-ordinates in each of which need not, initially at least, be supposed uncorrelated) the law of frequency for concurrent values of u and v must be of the normal form which may be written[221]

$$z = \frac{1}{2\pi\sqrt{km(1 - r^2)}} \exp\left\{-\frac{1}{2(1 - r^2)}\left[\frac{x^2}{k} - \frac{2rxy}{\sqrt{km}} + \frac{y^2}{m}\right]\right\}.$$

It may be presumed that r, the coefficient of correlation, is zero, for, owing to the symmetry of the influences by which the molecular chaos is brought about, it is not to be supposed that there is any connexion or repugnance between one direction of u, say south to north, and one direction of v, say west to east. For a like reason k must be supposed equal to m. Thus the average square of the velocity (the mean of u² + v²) = 2k; which multiplied by m, the mass of a sphere, is to be equated to the average energy T/N. The reasoning may be extended with confidence to three dimensions, and with caution to contiguous molecules.

126. Normal Correlation in Biology. Correlation cannot be ignored in another application of the many-dimensioned law of error, its use in biological inquiries to investigate the relations between different generations. It was found by Galton that the heights and other measurable attributes of children of the same parents range about a mean which is not that of the parental heights, but nearer the average of the general population. The amount of this “regression” is simply proportional to the distance of the “mid-parent's” height from the general average. This is a case of a very general law which governs the relations not only between members of the same family, but also between members of the same organism, and generally between two (or more) coexistent or in any way co-ordinated observations, each belonging to a normal group. Let x and y be the measurements of a pair thus constituted. Then[222] it may be expected that the conjunction of particular values for x and y will approximately obey the two-dimensioned normal law which has been already exhibited (see par. 114).

127. Regression-lines. In the expression above given, put l/√(km) = r, and the equation for the frequency of pairs having values of the attribute under measurement becomes

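$$z = \frac{1}{2\pi\sqrt{km}\sqrt{1 - r^2}} \exp\left\{-\frac{1}{2(1 - r^2)}\left[\frac{(x - a)^2}{k} - \frac{2r(x - a)(y - b)}{\sqrt{k}\sqrt{m}} + \frac{(y - b)^2}{m}\right]\right\}.$$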

This formula is of very general application.[223] If two sets of measurements were made on the height, or other measurable feature, of the proverbial “Goodwin Sands” and “Tenterden Steeple,” and the first measurement of one set was coupled with the first of the other set, the second with the second, and so on, the pairs of magnitudes thus presented would doubtless vary according to the above-written law, only in that case r would presumably be zero; the expression for z would reduce to the product of the two independent probabilities that particular values of x and y should concur. But slight interdependences between things supposed to be totally unconnected would often be discovered by this law of error in two or more dimensions.[224] It may be put in a more convenient form by substituting ξ for (x − a)/√k and η for (y − b)/√m. The equation of the surface then becomes z = [1/(2π√(1 − r²))] exp{−[ξ² − 2rξη + η²]/[2(1 − r²)]}. If the frequency of observations in the vicinity of a point is represented by the number of dots in a small increment of area, when r = 0 the dots will be distributed uniformly about the origin, and the curves of equal probability will be circles. When r is different from zero the dots will be distributed so that the majority will be massed in two quadrants: in those for which ξ and η are both positive or both negative when r is positive, in those for which ξ and η have opposite signs when r is negative. In the limiting case, when r = 1 the whole host will be massed along the line η = ξ, every deviation ξ being attended with an equal deviation η. In general, to any deviation of one of the variables ξ′ there corresponds a set or “array” (Pearson) of values of the other variable; for which the frequency is given by substituting ξ′ for ξ in the general equation. The section thus obtained proves to be a normal probability-curve with standard deviation √(1 − r²). The most probable value of η corresponding to the assigned value of ξ is rξ′.
The equation η = rξ, or rather what it becomes when translated back to our original co-ordinates, (y − b)/σ₂ = r(x − a)/σ₁, where σ₁, σ₂ are our √k, √m respectively,[225] is often called a regression-equation. A verification is to hand in the above-cited statistics, which Weldon obtained by casting batches of dice. If the dice were perfect, r (= l/√(km)) would equal ½, and as the dice proved not to be very imperfect, the coefficient is doubtless approximately = ½. Accordingly, we may expect that, if axes x and y are drawn through the point of maximum-frequency at the centre of the compartment containing 244 observations, corresponding to any value of x, say 2νi (where i is the side of each square compartment), the most probable value of y should be νi, and corresponding to y = 2νi the most probable value of x should be νi. And in fact these regression-equations are fairly well fulfilled for the integer values of ν (more than which could not be expected from discrete observations): e.g. when x = +4i, the value of y, for which the frequency (25) is a maximum, is as it ought to be +2i; when x = −2i the maximum (119) is at y = −i; when x = −4i the maximum (16) is at y = −2i; when y is +2i the maximum (138) is at x = +i; when y is −2i the maximum (117) at x = −i, and in the two cases (x = +2i and y = +4i), where the fulfilment is not exact, the failure is not very serious.
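The regression-equation may be illustrated by a small simulation. The sketch below is not a reconstruction of Weldon's actual procedure; it assumes a hypothetical arrangement in which two counts of successful dice (faces above 3) are taken from two batches of twelve dice that have six dice in common, so that the coefficient of correlation for perfect dice is ½. The names `shared_dice_pairs` and `correlation_and_slope` are illustrative.

```python
import random


def shared_dice_pairs(n_pairs, rng):
    """Each pair counts successes (face > 3) in two batches of 12 dice sharing 6."""
    pairs = []
    for _ in range(n_pairs):
        hits = [rng.randint(1, 6) > 3 for _ in range(18)]
        x = sum(hits[:12])   # dice 0..11
        y = sum(hits[6:])    # dice 6..17; dice 6..11 are common to both counts
        pairs.append((x, y))
    return pairs


def correlation_and_slope(pairs):
    """Return (r, slope of the regression of y on x), using divisor n throughout."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs) / n
    syy = sum((y - mean_y) ** 2 for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx  # equals r * sigma_2 / sigma_1
    return r, slope


r, slope = correlation_and_slope(shared_dice_pairs(20_000, random.Random(1)))
# Both figures should lie close to one half for fair dice.
print(round(r, 3), round(slope, 3))
```

Since the two standard deviations are equal in this arrangement, the slope of the regression-line and the coefficient r coincide, which is why a deviation of 2ν compartments in x corresponds to a most probable deviation of ν in y.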

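128. Analogous statements hold good for the case of three or more dimensions of error.[226] The normal law of error for any number n of variables, x₁, x₂, x₃, . . ., may be put in the form

$$z = \frac{1}{(2\pi)^{n/2}\sqrt{\Delta}} \exp\left[-\frac{R_{11}x_1^2 + R_{22}x_2^2 + \dots + 2R_{12}x_1x_2 + \dots}{2\Delta}\right],$$

where ∆ is the determinant

$$\Delta = \begin{vmatrix} 1 & r_{12} & r_{13} & \cdots \\ r_{21} & 1 & r_{23} & \cdots \\ r_{31} & r_{32} & 1 & \cdots \\ \vdots & \vdots & \vdots & \end{vmatrix};$$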

each r, e.g. r₂₃ (= r₃₂), is the coefficient of correlation between two of the variables, e.g. x₂, x₃; R₁₁ is the first minor of the determinant formed by omitting the first row and first column; R₂₂ is the first minor formed by omitting the second row and the second column, and so on; R₁₂ (= R₂₁) is the first minor formed by omitting the first column and second row (or vice versa). The principle of correlation plays an important rôle in natural history. It has replaced the notion that there is a simple proportion between the size of organs by the appropriate conception that there are simple proportions existing between the deviation from the average of one organ and the most probable value for the coexistent deviation of the other organ from its average.[227] Attributes favoured by “natural” or other selection are found to be correlated with other attributes which are not directly selected. The extent to which the attributes of an individual depend upon those of his ancestors is measured by correlation.[228] The principle is instrumental to most of the important “mathematical contributions” which Professor Pearson has made to the theory of evolution.[229] In social inquiries, also, the principle promises a rich harvest. Where numerous fluctuating causes go to produce a result like pauperism or immunity from small-pox, the ideal method of eliminating chance would be to construct “regression-equations” of the following type: “Change % in pauperism [in the decade 1871-1881] in rural districts = −27.07%, +0.299 (change % out-relief ratio), +0.271 (change % on proportion of old), +0.064 (change % in population).”[230]
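As a worked illustration of how such a regression-equation is read, the quoted formula may be evaluated for assumed inputs. In the sketch below the coefficients are those of the quotation; the function name `pauperism_change` and the sample percentages are hypothetical and are not taken from any return.

```python
def pauperism_change(out_relief_change, old_change, population_change):
    """Evaluate the quoted regression-equation; all arguments are percentage changes."""
    return (-27.07
            + 0.299 * out_relief_change
            + 0.271 * old_change
            + 0.064 * population_change)


# Hypothetical district: out-relief ratio up 10%, proportion of old and
# population unchanged.
print(round(pauperism_change(10.0, 0.0, 0.0), 2))  # -27.07 + 2.99
```

Each coefficient states the most probable change in pauperism attending a unit change in one factor, the other factors being held at their assigned values.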

129. Determination of Constants by the Inverse Method. In order to determine the best values of the coefficients involved in the law of error, and to test the worth of the results obtained by using any values, recourse must be had to inverse probability.

130. The simplest problem under this head is where the quaesitum is a single real object and the data consist of a large number of observations, x1, x2, . . . xn, such that if the number were indefinitely increased, the completed series would form a normal probability-curve with the true point as its centre, and having a given modulus c. It is as if we had observed the position of the dints made by the fragments of an exploding shell so far as to know the distance of each mark measured (from an origin) along a right line, say the line of an extended fortification, and it was known that the shell was fired perpendicular to the fortification from a distant ridge parallel to the fortification, and that the shell was of a kind of which the fragments are scattered according to a normal law[231] with a known coefficient of dispersion; the question is at what position on the distant ridge was the enemy's gun probably placed? By received principles the probability, say P, that the given set of observations should have resulted from measuring (or aiming at) an object of which the real position was between x and x + ∆x is

$$\Delta x \, J \exp\left[-\frac{(x - x_1)^2 + (x - x_2)^2 + \dots}{c^2}\right];$$

where J is a constant obtained by equating to unity the sum of P over all values of x (since the given set of observations must have resulted from some position on the axis of x). The value of x, from which the given set of observations most probably resulted, is obtained by making P a maximum. Putting dP/dx = 0, we have for the maximum (d²P/dx² being negative for this value) the arithmetic mean of the given observations. The accuracy of the determination is measured by a probability-curve with modulus c/√n. Thus in the course of a very long siege, if every case in which the given group of shell-marks x₁, x₂, . . . xₙ was presented could be investigated, it would be found that the enemy's cannon was fired from the position x′, the (point right opposite to the) arithmetic mean of x₁, x₂, &c., xₙ, with a frequency assigned by the equation

$$z = \frac{\sqrt{n}}{\sqrt{\pi}\,c} \exp\left[-\frac{n(x - x')^2}{c^2}\right].$$
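That the arithmetic mean makes P a maximum can be checked by direct search. The following is a minimal sketch, assuming five hypothetical shell-marks and a modulus of 1.5; the names `log_p` and `most_probable_position` are illustrative. It scans a grid of positions and compares the best of them with the arithmetic mean.

```python
import math


def log_p(x, observations, c):
    """Logarithm of P up to an additive constant: -sum (x - x_i)^2 / c^2."""
    return -sum((x - xi) ** 2 for xi in observations) / c ** 2


def most_probable_position(observations, c, step=0.001):
    """Grid search for the x that maximizes P between the extreme observations."""
    low, high = min(observations), max(observations)
    n_steps = int((high - low) / step)
    grid = [low + i * step for i in range(n_steps + 1)]
    return max(grid, key=lambda x: log_p(x, observations, c))


marks = [3.1, 4.7, 2.9, 5.2, 4.1]  # hypothetical distances of the shell-marks
c = 1.5                            # assumed known modulus of the scatter

best = most_probable_position(marks, c)
mean = sum(marks) / len(marks)
# The grid maximum agrees with the arithmetic mean to within one step;
# the last figure is the modulus c / sqrt(n) of the determination.
print(round(best, 3), round(mean, 3), round(c / math.sqrt(len(marks)), 3))
```

A grid search is used here only to avoid presupposing the result; the calculus argument in the text gives the mean directly.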

The reasoning is applicable without material modification to the case in which the data and the quaesitum are not absolute quantities, but proportions; for instance, given the percentage of white balls in several large batches drawn at random from an immense urn containing black and white balls, to find the percentage of white balls in the urn—the inverse problem associated with the name of Bayes.

131. Simple as this solution is, it is not the one which has most recommended itself to Laplace. He envisages the quaesitum not so much as that point which is most probably the real one, as that point which may most advantageously be put for the real one. In our illustration it is as if it were required to discover from a number of shot-marks not the point[232] which in the course of a long siege would be most frequently the position of the cannon which had scattered the observed fragments, but the point which it would be best to treat as that position (to fire at, say, with a view of silencing the enemy's gun), having regard not so much to the frequency with which the direction adopted is right, as to the extent to which it is wrong in the long run. As the measure of the detriment of error, Laplace[233] takes “la Valeur moyenne de l'erreur à craindre,” the mean first power of the errors taken positively on each side of the real point. The mean square of errors is proposed by Gauss as the criterion.[234] Any mean power indeed, the integral of any function which increases in absolute magnitude with the increase of its variable, taken as the measure of the detriment, will lead to the same conclusion, if the normal law prevails.[235]

132. Yet another speculative difficulty occurs in the simplest, and recurs in the more complicated inverse problem. In putting P as the probability, deduced from the observations that the real point for which they stand is x (between x and x + ∆x), it is tacitly assumed that prior to observation one value of x is as probable as another. In our illustration it must be assumed that the enemy's gun was as likely to be at one point as another of (a certain tract of) the ridge from which it was fired. If, apart from the evidence of the shell-marks, there was any reason for thinking that the gun was situated at one point rather than another, the formula would require to be modified. This a priori probability is sometimes grounded on our ignorance; according to another view, the procedure is justified by a rough general knowledge that over a tract of x for which P is sensible one value of x occurs about as often as another.[236]

133. Subject to similar speculative difficulties, the solution which has been obtained may be extended to the analogous problem in which the quaesitum is not the real value of an observed magnitude, but the mean to which a series of statistics indefinitely prolonged converges.[237]

134. Next, let the modulus, still supposed given, not be the same for all the observations, but c1 for x1, c2 for x2, &c. Then P becomes proportional to

$$\exp\left\{-\left[\frac{(x - x_1)^2}{c_1^2} + \frac{(x - x_2)^2}{c_2^2} + \dots\right]\right\}.$$

Method of Least Squares. And the value of x which is both the most probable and the “most advantageous” is (x₁/c₁² + x₂/c₂² + &c.)/(1/c₁² + 1/c₂² + &c.); each observation being weighted with the inverse mean square of observations made under similar conditions.[238] This is the rule prescribed by the “method of least squares”; but as the rule in this case has been deduced by genuine inverse probability, the problem does not exemplify what is most characteristic in that method, namely, that a rule deducible from the hypothesis that the errors of observations obey the normal law of error is employed in cases where the normal law is not known, or even is known not, to hold good. For example, let the curve of error for each observation be of the form of

$$z = \frac{1}{\sqrt{\pi}\,c} \exp\left[-\frac{x^2}{c^2} - 2j\left(\frac{x}{c} - \frac{2x^3}{3c^3}\right)\right],$$

where j is a small fraction, so that z may equally well be equated to [1/(√π c)][1 − 2j(x/c − 2x³/3c³)] exp(−x²/c²), a law which is actually very prevalent. Then, according to the genuine inverse method, the most probable value of x is given by the quadratic equation (d/dx) log P = 0, where log P = const. − ∑(x − xᵣ)²/cᵣ² − ∑2j[(x − xᵣ)/cᵣ − 2(x − xᵣ)³/3cᵣ³], ∑ denoting summation over all the observations. According to the “method of least squares,” the solution is the weighted arithmetic mean of the observations, the weight of any observation being inversely proportional to the corresponding mean square, i.e. cᵣ²/2 (the terms of the integral which involve j vanishing), which would be the solution if the j's were all zero. We put for the solution of the given case what is known to be the solution of an essentially different case. How can this paradox be justified?
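The rule of the weighted mean referred to in this paragraph may be set out as a short computation. This is a minimal sketch with hypothetical observations and moduli; the name `weighted_mean` is illustrative. Each observation receives the weight 1/c², the inverse square of its modulus.

```python
def weighted_mean(observations, moduli):
    """Least-squares combination: weights are the inverse squares of the moduli."""
    weights = [1.0 / c ** 2 for c in moduli]
    total = sum(w * x for w, x in zip(weights, observations))
    return total / sum(weights)


# Hypothetical case: the second observation has twice the modulus of the
# first, and therefore one quarter of its weight.
print(weighted_mean([10.0, 12.0], [1.0, 2.0]))

# With equal moduli the rule reduces to the plain arithmetic mean.
print(weighted_mean([10.0, 12.0, 17.0], [3.0, 3.0, 3.0]))
```

The computation uses nothing but the observations and their mean squares, which is exactly the economy of data that the paradox concerns.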

135. Many of the answers which have been given to this question seem to come to this. When the data are unmanageable, it is legitimate to attend to a part thereof, and to determine the most probable (or the “most advantageous”) value of the quaesitum, and the degree of its accuracy, from the selected portion of the data as if it formed the whole. This throwing overboard of part of the data in order to utilize the remainder has often to be resorted to in the rough course of applied probabilities. Thus an insurance office only takes account of the age and some other simple attributes of its customers, though a better bargain might be made in particular cases by taking into account all available details. The nature of the method is particularly clear in the case where the given set of observations consists of several batches, the observations in any batch ranging under the same law of frequency with mean xᵣ and mean square of error kᵣ, the function and the constants different for different batches; then if we confine our attention to those parts of the data which are of the type xᵣ and kᵣ (ignoring what else may be given as to the laws of error) we may treat the xᵣ's as so many observations, each ranging under the normal law of error with its coefficient of dispersion; and apply the rules proper to the normal law. Those rules applied to the data, considered as a set of derivative observations (each formed by a batch of the original observations averaged), give as the most probable (and also the most advantageous) combination of the observations the arithmetic mean weighted according to the inverse mean square pertaining to each observation, and for the law of the error to which the determination is liable the normal law with standard deviation[239] √(∑k/n): the very rules that are prescribed by the method of least squares.

136. The principle involved might be illustrated by the proposal to make the economy of datum a little less rigid: to utilize, not indeed all, but a little more of our materials; not only the mean square of error for each batch, but also the mean cube of error. To begin with the simple case of a single homogeneous batch: suppose that in our example the fragments of the shell are no longer scattered according to the normal law. By the method of least squares it would still be proper to put the arithmetic mean of the given observations for the true point required, and to measure the accuracy of that determination by a probability-curve of which the modulus is √(2k), where k is the mean square of deviation (of fragments from their mean). If it is thought desirable to utilize more of the data, there is available the proposition that the arithmetic mean of a numerous set of observations, say x₁, x₂, . . . xₙ (taken as a sample from an indefinitely large group obeying any one law of frequency) varies from set to set approximately according to the following law (to be established later)

, say;

where c²/2 is the mean square of deviation, and the mean cube of deviation divided by c³, say j, is small. Then, by abstraction analogous to that which has just been attributed to the method of least squares, we may regard the datum as a single observation, the arithmetic mean (of a sample batch of observations) subject to the law of error z = f(x). The most probable value of the quaesitum is therefore given by the equation f′(x − x′) = 0, where x′ is the arithmetic mean of the given observations. From the resulting quadratic equation, putting x = x′ + ε, and recollecting that ε is small, we have ε = jc. That is the correction due to the utilization of the mean cube of error. The most advantageous solution cannot now be determined,[240] f(x) being unsymmetrical, without assuming a particular form for the function of detriment. This method of least squares plus cubes may easily be extended to the case of several batches.

137. This application of probabilities not to the actual data but to a selected part thereof, this economy of the inverse method, is widely practised in miscellaneous statistics, where the object is to determine whether the discrepancy between two sets of observation is accidental or significant of a real difference.[241] For instance, let the data be ages at death of individuals of two classes (e.g. temperate or not so, urban or rural, &c.) who have been under observation since the age of, say, 20. Granted that the ages at death conform to Gompertz's law; the determination of the modal age at death, that age at which the proportion of the total observed dying (per unit of time) is a maximum for each class, would most perfectly be effected by the genuine inverse method. That method will also enable us to determine the probability that the two modes should have differed to the observed extent by mere accident.[242] According to the abridged method it suffices to proceed as if our data consisted of two observations x′ and y′, the average ages at death of the two classes, each average obeying the normal law of error, with respective moduli √[2∑(x′ − xᵣ)²]/n and √[2∑(y′ − yᵣ)²]/n′, where x₁, x₂, &c., y₁, y₂, &c., are the respective sets of observed ages at death; as follows from the law of error, whatever the law of distribution of the given observations. According to a well-known property of the normal law, the difference between the averages of n and n′ observations respectively will range under a probability-curve with modulus equal to the square root of the sum of the squares of those two moduli, say c. Whence for the probability that a difference as great as the observed one, say e, should have occurred by chance we have ½[1 − θ(τ)], where τ = e/c, and θ(x) is the integral (2/√π)∫₀ˣ exp(−t²) dt, given in many treatises.
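The abridged test of a difference between two averages can be sketched as follows. Several points are assumptions rather than statements of the text: θ is taken to be the ordinary error function (Python's `math.erf`), the modulus of each average is taken as √(2 × mean square of deviation / number of observations), the two moduli are compounded by the square root of the sum of their squares, and the ages are invented for illustration. The function names are hypothetical.

```python
import math


def modulus_of_mean(observations):
    """Modulus of the average: sqrt(2 * mean square of deviation / n)."""
    n = len(observations)
    mean = sum(observations) / n
    sum_sq = sum((x - mean) ** 2 for x in observations)
    return math.sqrt(2.0 * sum_sq) / n


def chance_probability(obs_x, obs_y):
    """One-sided probability that so great a difference of averages arose by chance."""
    mean_x = sum(obs_x) / len(obs_x)
    mean_y = sum(obs_y) / len(obs_y)
    e = abs(mean_x - mean_y)
    # Moduli of independent normal quantities compound like the sides of a right triangle.
    c = math.hypot(modulus_of_mean(obs_x), modulus_of_mean(obs_y))
    tau = e / c
    return 0.5 * (1.0 - math.erf(tau))


# Hypothetical ages at death for two small classes.
class_a = [61, 70, 58, 74, 66, 69, 72, 63]
class_b = [55, 68, 60, 71, 59, 64, 62, 57]
print(round(chance_probability(class_a, class_b), 4))
```

A small value of the result indicates that the observed difference is unlikely to be accidental; identical samples give exactly ½, as the formula requires when e = 0.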

138. Abridged Methods. This sort of abridgment may be extended to other kinds of average besides the arithmetic, in particular the median (that point which has as many of the given observations above as below it). By simple induction we know that the median of a large sample of observations is a probable value for the true median; how probable is determined as follows from a selection of our data. First suppose that all the observations are of the same weight. If x′ were the true median, the probability that as many as ½n + r of the observations should fall on either side of that point is given by the normal law for which the exponent is −2r²/n.[243] This probability that the observed median will differ from the true one by a certain number of observations is connected with the probability that they will differ by a certain extent of the abscissa, by the proposition that the number of observations contained between the true and apparent median is equal to the small difference between them multiplied by the density of observations at the median (in the case of normal and generally symmetrical curves, the greatest ordinate). This is the second datum we require to select. In the case of the normal curve it may be calculated from the modulus itself, determined by induction from a selection of data. If the observations are not all of the same worth, weight may be assigned by counting one observation as if it occurred oftener than another. This is the essence of Laplace's Method of Situation.[244]

139. In its simplest form, where all the given observations are of equal weight, this method is of wide applicability. Compared with the genuine inverse method, it is always more convenient, seldom much less accurate, sometimes even more accurate. If the given observations obey the normal law, the precision of the median is less than the precision of the arithmetic mean by only some 25%—a discrepancy not very serious where only a rough estimate of the worth of an average is required. If the observations do not obey the normal law—especially if the extremities are abnormally divergent—the precision of the median may be greater than that of the arithmetic mean.[245]
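The comparison between the median and the arithmetic mean under the normal law may be checked by simulation. The sketch below assumes samples of 101 observations drawn from a standard normal law by Python's pseudo-random generator; the name `spread_of_estimates` is illustrative. It measures how widely each kind of average scatters from sample to sample.

```python
import random
import statistics


def spread_of_estimates(n_obs, n_trials, rng):
    """Standard deviations of the sample mean and the sample median over many trials."""
    means, medians = [], []
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(sum(sample) / n_obs)
        medians.append(statistics.median(sample))
    return statistics.pstdev(means), statistics.pstdev(medians)


sd_mean, sd_median = spread_of_estimates(101, 2000, random.Random(2))
# Under the normal law the median scatters about a quarter more widely
# than the mean (the limiting ratio is sqrt(pi / 2)).
print(round(sd_median / sd_mean, 2))
```

Replacing `rng.gauss` by a law with heavier extremities is the natural way to observe the opposite case mentioned in the text, where the median becomes the more precise of the two.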

140. Determination of Frequency-Constants. Yet another instance of the contrast between genuine and abridged inversion is afforded by the problem to determine the modulus as well as the mean for a set of observations known to obey the normal law; what the first problem[246] becomes when the coefficient of dispersion is not given. By inverse probability we ought in that case, in addition to the equation dP/dx = 0, to put dP/dc = 0. Whence c² = 2[(x′ − x₁)² + (x′ − x₂)² + &c. + (x′ − xₙ)²]/n, and x′ = (x₁ + x₂ + &c. + xₙ)/n. This solution differs from that which is often given in the textbooks[247] in that there, in the expression for c², (n − 1) occurs in the denominator instead of n. The difference is explained by the fact that the authorities referred to determine c, not by genuine inversion, but by ordinary induction, by a condition which certainly would be fulfilled in the long run, but does not express the whole of our data; a condition in this respect like the equation of c to √π ∑e/n, where e is the difference (taken positively, without regard to its sign) between any observation and the arithmetic mean of all the observations.[248]
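The three determinations of the modulus mentioned in this paragraph can be placed side by side. This is a minimal sketch with hypothetical observations; the function names are illustrative, and the third function assumes the standard property that for a normal curve of modulus c the mean error taken without regard to sign is c/√π.

```python
import math


def modulus_inverse(observations):
    """Genuine inversion: c^2 = 2 * sum of squared deviations / n."""
    n = len(observations)
    mean = sum(observations) / n
    return math.sqrt(2.0 * sum((x - mean) ** 2 for x in observations) / n)


def modulus_textbook(observations):
    """Textbook form: the same with (n - 1) in the denominator."""
    n = len(observations)
    mean = sum(observations) / n
    return math.sqrt(2.0 * sum((x - mean) ** 2 for x in observations) / (n - 1))


def modulus_from_mean_error(observations):
    """Induction from the mean absolute deviation: c = sqrt(pi) * sum|e| / n."""
    n = len(observations)
    mean = sum(observations) / n
    return math.sqrt(math.pi) * sum(abs(x - mean) for x in observations) / n


sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observations
print(modulus_inverse(sample),
      round(modulus_textbook(sample), 4),
      round(modulus_from_mean_error(sample), 4))
```

For a numerous set of observations the three values approach one another; the divergence is sensible only when n is small, as in this sample of five.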

141. Of course the determination of the most probable value is subject to the speculative difficulties proper to a priori probability: which are particularly striking in this case, as it appears equally natural to take as that constant, of which the values are a priori equally probable, k (= c²/2), or even[249] h (= 1/c²), the measure of weight, as in fact Laplace has done;[250] yet no two of these assumptions can be exactly true.[251]

142. A more convenient determination is obtained from simple induction by equating the modulus to some datum of the observed group to which it would be equal if the group were complete—in particular to the distance from the median of some percentile (or point which marks off a certain percentage, e.g. 25 of the given observations) multiplied by a factor corresponding to the percentile obtainable from a familiar table. Mr Sheppard has given an interesting proof[252] that we cannot by way of percentiles obtain such good[253] results for the frequency-constants as by the use of “the average and average square” [the method prescribed by inverse probability].

143. Entangled Measurements. The same philosophical subtleties, with greater mathematical complications, meet us when we pass on to the case of two or more quaesita. The problem under this head which mainly exercised the older writers was to determine a number of unknown quantities, given a larger number, n, of equations involving them.

144. Supposing the true values approximately known, by substituting the approximate values in the given equations and expanding according to Taylor's theorem, there will be obtained for the corrections, say x, y . . ., n linear equations of the form

$$a_1x + b_1y + \dots = f_1,$$

$$a_2x + b_2y + \dots = f_2, \quad \text{\&c.},$$

where each a and b is a known coefficient, and each f is a fallible observation. Suppose that the error to which each is liable obeys the normal law, and that the modulus pertaining to each observation is the same (which latter condition can be secured by multiplying each equation by a proper factor); then if x′ and y′ are the true values of the quaesita, the frequency with which (a₁x′ + b₁y′ − f₁) assumes different values is given by the equation z = [1/(√π c₁)] exp[−(a₁x′ + b₁y′ − f₁)²/c₁²], where c₁ is a constant which, if not known beforehand, may be inferred, as in the simpler case, from a set of observations. Similar statements holding for the other equations, the probability that the given set of observations f₁, f₂, &c., should have resulted from a particular system of values for x, y . . . is P = J exp −[(a₁x + b₁y − f₁)²/c₁² + (a₂x + b₂y − f₂)²/c₂² + &c.], where J is a constant determined on the same principle as in the analogous simpler cases.[254] The condition that P should be a maximum gives as many linear equations for the determination of x′, y′ . . . as there are unknown quantities.
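For two unknown quantities the linear equations which result from making P a maximum can be written out and solved directly. The sketch below assumes equal moduli for all the observations and a hypothetical set of four equations; the name `solve_two_unknowns` is illustrative. It forms the two so-called normal equations and solves them by determinants.

```python
def solve_two_unknowns(equations):
    """Each equation is (a, b, f) for a*x + b*y = f; moduli assumed equal.

    Minimizes the sum of (a*x + b*y - f)^2, which maximizes P.
    """
    saa = sum(a * a for a, _, _ in equations)
    sab = sum(a * b for a, b, _ in equations)
    sbb = sum(b * b for _, b, _ in equations)
    saf = sum(a * f for a, _, f in equations)
    sbf = sum(b * f for _, b, f in equations)
    # Normal equations: saa*x + sab*y = saf and sab*x + sbb*y = sbf.
    det = saa * sbb - sab * sab
    x = (saf * sbb - sab * sbf) / det
    y = (saa * sbf - sab * saf) / det
    return x, y


# Hypothetical observations, slightly discordant, near x = 1 and y = 2.
observed = [(1.0, 0.0, 1.1), (0.0, 1.0, 1.9), (1.0, 1.0, 3.0), (1.0, -1.0, -1.0)]
print(solve_two_unknowns(observed))
```

When the moduli differ, each equation is first multiplied by the proper factor, as the text directs, after which the same computation applies.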

145. The solution proper to the case where the observations are known to arrange according to the normal law may be extended to numerous observations ranging under any law, on the principles which justify the use of the Method of Least Squares in the case of a single quaesitum.

146. As in that simple case, the principle of economy will now justify the use of the median, e.g. in the case of two quaesita, putting for the true values of x and y that point for which the sum of the perpendiculars let fall from it on each of a set of lines representing the given equations (properly weighted) is a minimum.[255]

147. Normal Correlation. The older writers have expressed the error in the determination of one of the variables without reference to the error in the other. But the error of one variable may be regarded as correlated with that of another; that is, if the system x′, y′ . . . forms the solution of the given equations, while x′ + ξ, y′ + η . . . is the real system, the (small) values of ξ, η . . . which will concur in the long run of systems from which the given set of observations result are normally correlated. From this point of view Bravais, in 1846, was led to several theorems which are applicable to the now more important case of correlation in which ξ and η are given (not in general small) deviations from the means of two or more correlated members (organs or attributes) forming a normal group.

148. To determine the frequency-constants of such a group it is proper to proceed on the analogy of the simple case of one-dimensioned error. In the case of two dimensions, for instance, the probability p₁ that a given pair of observations (x₁, y₁) should have resulted from a normal group of which the means are x′, y′ respectively, the standard deviations σ₁ and σ₂ and the coefficient of correlation r, may be written

xyσ1σ2∆r(1/2π),

where E2 = (x′ − x1)2/σ12 − 2r(x′ − x1)(y′ − y1)/σ1σ2 + (y′ − y1)2/σ22. A similar statement holds for each other pair of observations (x2y2), (x3y3). . .; with analogous expressions for p2, p3. . . Whence, as in the simpler case, we have p1 × p2 × &c. × pn/J (a constant) for P, the a posteriori probability that the given observations should have resulted from an assigned system of the frequency-constants. The most probable system is determined by making P a maximum, and accordingly equating to zero each of the following expressions—

dP/dx′, dP/dy′, dP/dσ1, dP/dσ2, dP/dr.

The values of the arithmetic mean and of the standard deviation for each variable are what have been obtained in the simple case of one dimension. The value of r is $\sum (x' - x_r)(y' - y_r)/n\sigma_1\sigma_2$, where (x_r, y_r) is the rth of the n pairs of observations.[256] The probable error of the determination is assigned on the assumption that the errors to which it is liable are small.[257] Such coefficients have already been calculated for a great number of interesting cases. For instance, the coefficient of correlation between the human stature and femur is 0.8, between the right and left femur is 0.96, between the statures of husbands and wives is 0.28.[258]
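The determination of the five frequency-constants from a set of paired observations may be sketched as follows. This is an illustrative Python fragment, assuming the product-sum in the formula for r is divided by the number of pairs n; the function name and the sample pairs are hypothetical.

```python
import math


def frequency_constants(xs, ys):
    """Return (mean_x, mean_y, sigma1, sigma2, r) for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations taken about the arithmetic means.
    sigma1 = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma2 = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Coefficient of correlation: mean product of deviations over sigma1*sigma2.
    product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = product_sum / (n * sigma1 * sigma2)
    return mean_x, mean_y, sigma1, sigma2, r


# Hypothetical measurements of two correlated organs.
constants = frequency_constants([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```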

149. This application of inverse probability to determine correlation-coefficients and the error to which the determination is liable has been largely employed by Professor Pearson[259] and other recent writers. The use of the normal formula to measure the probable—and improbable—errors incident to such determinations is justified by reasoning akin to that which has been employed in the general proof of the law of error.[260] Professor Pearson has pointed out a circumstance which seems to be of great importance in the theory of evolution: that the errors incident to the determination of different frequency-coefficients are apt to be mutually correlated. Thus if a random selection be made from a certain population, the correlation-coefficient which fits the organs of that set is apt to differ from the coefficient proper to the complete group in the same sense as some other frequency-coefficients.

150. The last remark applies also to the determination of the coefficients, in particular those of correlation, by abridged methods, on principles explained with reference to the simple case; for instance by the formula r = ∑η/∑ξ, where ∑ξ is the sum of (some or all) the positive (or the negative) deviations of the values for one organ or attribute measured by the modulus pertaining to that member, and ∑η is the sum of the values of the other member, which are associated with the constituents of ∑ξ. This variety of the method is certainly much less troublesome, and is perhaps not much less accurate, than the method prescribed by genuine inversion.

151. A method of rejecting data analogous to the use of percentiles in one dimension is practised when, given the frequency of observations for each increment of area, e.g. each ∆x∆y, we utilize only the frequency for integral areas. Mr Sheppard has given an elegant solution of the problem: to find the correlation between two attributes, given the medians L and M of a normal group for each attribute and the distribution of the total group, as thus.[261]

            Below L    Above L
Below M        P          R
Above M        R          P

Fig. 12.

If cos D is put for r, the coefficient of correlation, it is found that D = πR/(P + R). For example, let the group of statistics relating to dice already cited[262] from Professor Weldon be arranged in four quadrants by a horizontal and a vertical line, each of which separates the total group into two halves: lines whose equations prove to be respectively y = 6.11 and x = 6.156. For R we have 1360.5, and for P 687.5 roughly. Whence D = π × 0.66; r = cos(0.66 × π) = −½ nearly, as it ought; the negative sign being required by the circumstance that the lower part of Mr Sheppard's diagram shown in fig. 12 corresponds to the upper part of Professor Weldon's diagram shown in par. 115.
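The arithmetic of this example can be checked directly from the rule D = πR/(P + R), r = cos D. The short Python sketch below uses the quadrant totals quoted in the text; the function name is hypothetical.

```python
import math


def sheppard_correlation(p_count, r_count):
    """Correlation from the quadrant totals P and R of fig. 12.

    The group is divided at the two medians; P is the content of each of
    one pair of opposite quadrants and R that of each of the other pair.
    """
    d = math.pi * r_count / (p_count + r_count)
    return math.cos(d)


# Totals quoted for Professor Weldon's dice statistics.
r_weldon = sheppard_correlation(687.5, 1360.5)
```

The result falls a little short of −½ in absolute value, which agrees with the "nearly" of the text.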

152. Necessity rather than convenience is sometimes the motive for resort to percentiles. Professor Pearson has applied the median method to determine the correlation between husbands and wives in respect of the darkness of eye-colour, a character which does not admit of exact graduation: “our numbers merely refer to certain groupings, arranged, it is true, in increasing darkness of colour, but in no way corresponding to equal increases in colour-intensity.”[263] From data of this sort, having ascertained the number of husbands with eye-colours above the median tint who marry wives with eye colour above the median tint, Professor Pearson finds for r the coefficient of correlation +0.1. A general method for determining the frequency-constants when the data are, or are taken to be, of the integral sort has been given by Professor Pearson.[264] Attention should also be called to Mr Yule's treatment of the problem by a sort of logical calculus on the lines of Boole and Jevons.[265]

153. Abnormal Correlation.—In the cases of correlation which have been so far considered, it has been presupposed that the things correlated range according to the normal law of error. But now, suppose the law of distribution to be no longer normal: for instance, that the dots on the plane of xy,[266] representing each a pair of members, are no longer grouped in elliptic (or circular) rings of equal frequency, that the locus of the maximum y deviation, corresponding to an assigned x deviation, is no longer a right line. How is the interdependence of these deviations to be formulated? It is submitted that such data may be treated as if they were normal: by an extension of the Method of Least Squares, in two or more dimensions.[267] Thus when the amount of pauperism together with the amount of outdoor relief is plotted in several unions there is obtained a distribution far from normal. Nevertheless if the average pauperism and average outdoor relief are taken for aggregates—say quintettes or decades—of unions taken at random, it may be expected that these means will conform to the normal law, with coefficients obtained from the original data, according to the rule which is proper to the case of the normal law.[268] By obtaining averages conforming to the normal law, as by the simple application of the method of least squares, we should not indeed have utilized the whole of our data, but we shall put a part of it in a very useful shape. Although the regression-equations obtained would not accurately fit the original material, yet they would have a certain correspondence thereto. What sort of correspondence may be illustrated by an example in games of chance, which Professor Weldon kindly supplied. Three half-dozen of dice having been thrown, the number of dice with more than three points in that dozen which is made up of the first and the second half-dozen is taken for y, the number of sixes in the dozen made up of the first and the third half-dozen is taken for x.

y
2 |  10/72    7/72    1/72
1 |  25/72   10/72    1/72
0 |  15/72    3/72     0
  +---------------------------
       0        1       2     x

Fig. 13.

Thus each twofold observation (x, y) is the sum of six twofold elements, each of which is subject to a law of frequency represented in fig. 13; where[269] the figures outside denote the number of successes of each kind, for the ordinate the number of dice with more than three points (out of a cast of two dice), for the abscissa the number of sixes (out of a cast of two dice, one of which is common to the aforesaid cast); and the figures inside denote the comparative probabilities of each twofold value (e.g. the probability of obtaining in the first cast two dice each with more than three points, and in the second cast two sixes, is 1/72). Treating this law of frequency according to the rule which is proper to the normal law, we have (for the element), if the sides of the compartments each = i,

$\sigma_1^2 = \tfrac{5}{18}i^2$; $\sigma_2^2 = \tfrac{1}{2}i^2$; $r = 1/(2\sqrt{5})$.

Whence for the regression-equation which gives the value of the ordinate most probably associated with an assigned value of the abscissa we have y = x × rσ2/σ1 = 0.3x; and for the other regression-equation, x = y/6.
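The two regression-coefficients can be recovered from the probabilities of fig. 13 by exact arithmetic. The sketch below is illustrative only: the dictionary `FIG13` transcribes the figure (probabilities in seventy-seconds, indexed first by y and then by x), and the function name is hypothetical.

```python
from fractions import Fraction

# Probabilities of fig. 13 in seventy-seconds: FIG13[y][x].
FIG13 = {
    0: {0: 15, 1: 3, 2: 0},
    1: {0: 25, 1: 10, 2: 1},
    2: {0: 10, 1: 7, 2: 1},
}


def element_constants(table, denominator=72):
    """Return (variance of x, variance of y, mean product of deviations)."""
    probs = {
        (x, y): Fraction(weight, denominator)
        for y, row in table.items()
        for x, weight in row.items()
    }
    mean_x = sum(p * x for (x, _), p in probs.items())
    mean_y = sum(p * y for (_, y), p in probs.items())
    var_x = sum(p * (x - mean_x) ** 2 for (x, _), p in probs.items())
    var_y = sum(p * (y - mean_y) ** 2 for (_, y), p in probs.items())
    cov = sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in probs.items())
    return var_x, var_y, cov


var_x, var_y, cov = element_constants(FIG13)
slope_y_on_x = cov / var_x          # coefficient in y = (r * sigma2 / sigma1) * x
slope_x_on_y = cov / var_y          # coefficient in x = (r * sigma1 / sigma2) * y
r_squared = cov ** 2 / (var_x * var_y)
```

The exact fractions obtained are 3/10 and 1/6, agreeing with the regression-equations stated in the text.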

0 1 2 3 4 5 6 7 8 9 10 11 12
12   1
11   4   3   3   3   1
10   3  17  15  13  10   4   3   1
9  12  51  59  61  36  14   5   3
8  36  135  154  150  64  21   5   2   1
7  74  195  260  179  112  35   5   1
6  90  248  254  170  75  26   3
5  93  220  230  124  51   8   2
4  86  162  127  75  19   4   1
3  37  86  56  17   6   2
2  14  23  23   4   3
1   2   4
0

Accordingly, in Professor Weldon's statistics, which are reproduced in the annexed diagram, when x = 3 the most probable value of y ought to be 1. And in fact this expectation is verified, x and y being measured along lines drawn through the centre of the compartment, which ought to have the maximum of content, representing the concurrence of one dozen with two sixes and another dozen with six dice having each more than three points, the compartment which in fact contains 254 (almost the maximum content). In the absence of observations at x = −3i or y = ±6i, the regression-equations cannot be further verified. At least they have begun to be verified by batches composed of six elements, whereas they are not verifiable at all for the simple elements. The normal formula describes the given statistics as they behave, not when by themselves, but when massed in crowds; the regression-equation does not tell us that if x′ is the magnitude of one member the most probable magnitude of the other member associated therewith is rx′, but that if x′ is the average of several samples of the first member, then rx′ is the most probable average for the specimens of the other member associated with those samples. Mr Yule's proposal to construct regression-equations according to the normal rule “without troubling to investigate the normality of the distribution”[270] admits of this among other explanations.[271] Mr Yule's own view of the subject is well worthy of attention.

154. Sheppard's Corrections.—In the determination of the standard-deviation proper to the law of error (and other constants proper to other laws of frequency) it commonly happens that besides the inaccuracy, which has been estimated, due to the paucity of the data, there is an inaccuracy due to their discrete character: the circumstance that measurements, e.g. of human heights, are given in comparatively large units, e.g. inches, while the real objects are more perfectly graduated. Mr Sheppard has prescribed a remedy for this imperfection. For the standard deviation let μ2 be the rough value of the mean square of deviation obtained on the supposition that the observations are massed at intervals of unit length (not spread out continuously, as ideal measurements would be); then the proper value, the mean integral of deviation squared, say $(\mu_2) = \mu_2 - \tfrac{1}{12}h^2$, where h is the size of a unit, e.g. an inch. It is not to be objected to this correction that it becomes nugatory when it is less than the probable error to which the measurement is liable on account of the paucity of observations. For, as the correction is always in one direction, that of subtraction, it tends in the long run to be advantageous even though masked in particular instances by larger fluctuating errors.[272]
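The effect of the correction may be seen on an ideal case. The following Python sketch, offered as an illustration under stated assumptions, masses an exactly normal group (standard deviation 2.5 units, a hypothetical figure) at unit intervals, computes the rough mean square, and then subtracts h²/12. All function names are hypothetical.

```python
import math


def normal_cdf(x, sigma):
    """Proportion of a normal group (mean 0) lying below x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def rough_mean_square(sigma, h, half_range=40):
    """Mean square of deviation when observations are massed at multiples of h."""
    total = 0.0
    for k in range(-half_range, half_range + 1):
        centre = k * h
        # Proportion of the group falling in the interval centred on k*h.
        share = normal_cdf(centre + h / 2, sigma) - normal_cdf(centre - h / 2, sigma)
        total += share * centre ** 2
    return total


def sheppard_corrected(mu2_rough, h):
    """Apply the correction: subtract one-twelfth of the squared unit."""
    return mu2_rough - h * h / 12.0


sigma, h = 2.5, 1.0
rough = rough_mean_square(sigma, h)
corrected = sheppard_corrected(rough, h)   # close to sigma**2 = 6.25
```

In this ideal case the rough value exceeds the true mean square by almost exactly h²/12, so the corrected figure returns to 6.25.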

155. Pearson's Criterion of Empirical Verification.—Professor Pearson has given a beautiful application of the theory of correlation to test the empirical evidence that a given group conforms to a proposed formula, e.g. the normal law of error.[273]

Supposing the constants of the proposed function to be known—in the case of the normal law the arithmetic mean and modulus—we could determine the position of any percentile, e.g. the median, say a. Now the probability that if any sample numbering n were taken at random from the complete group, the median of the sample, a′, would lie at such a distance from a that there should be r observations between a and a′ is

$\sqrt{\frac{2}{\pi n}}\int_r^\infty \exp\left(-\frac{2x^2}{n}\right)dx$.[274]

If, then, any observed set has an excess which makes the above written integral very small, the set has probably not been formed by a random selection from the supposed given complete group. To extend this method to the case of two, or generally n, percentiles, forming (n + 1) compartments, it must be observed that the excesses, say e and e′, are not independent but correlated. To measure the probability of obtaining a pair of excesses respectively as large as e and e′, we have now (corresponding to the extremity of the probability-curve in the simple case) the solid content of a certain probability-surface outside the curve of equal probability which passes through the points on the plane xy assigned by e, e′ (and the other data). This double, or in general multiple, integral, say P, is expressed by Professor Pearson with great elegance in terms of the quadratic factor, called by him χ², which forms the exponent of the expression for the probability that a particular system of the values of the correlated e, e′, &c., should concur—

$P = \sqrt{\frac{2}{\pi}}\int_\chi^\infty e^{-\frac{1}{2}\chi^2}\,d\chi + \sqrt{\frac{2}{\pi}}\,e^{-\frac{1}{2}\chi^2}\left(\frac{\chi}{1} + \frac{\chi^3}{1\cdot 3} + \frac{\chi^5}{1\cdot 3\cdot 5} + \ldots + \frac{\chi^{n-2}}{1\cdot 3\cdot 5\cdots(n-2)}\right)$

when n is odd; with an expression different in form, but nearly coincident in result, when n is even. The practical rule derived from this general theorem may thus be stated. Find from the given observations the probable values of the coefficients pertaining to the formula which is supposed to represent the observations. Calculate from the coefficients a certain number, say n, of percentiles; thereby dividing the given set into n + 1 sections, any of which, according to calculation, ought to contain say m of the observations, while in fact it contains m′. Put e for m′ − m; then χ² = ∑e²/m. Professor Pearson has given in an appended table the values of P corresponding to values of n + 1 up to 20, and values of χ² up to 70. He does not conceal that there is some laxity involved in the circumstance that the coefficients employed are not known exactly, only inferred with probability.[275]

156. Here is one of Professor Pearson's illustrations. The annexed table gives the distribution of 1000 shots fired at a line in a target, the hits being arranged in belts drawn on the target parallel to the line. The “normal distribution” is obtained from a normal curve, of which the coefficients are determined from the observations. From the value of χ², viz. 45.8, and of (n + 1), viz. 11, we deduce, with sufficient accuracy from Professor Pearson's table, or more exactly from the formula on which the table is based, that P = .000,001,5 . . . “In other words, if shots are distributed on a target according to the normal law, then such a distribution as that cited could only be expected to occur on an average some 15 or 16 times in 10,000,000 times.”

157. The Criterion Criticized.—“Such a distribution” in this argument must be interpreted as a distribution for which it is claimed that the observations are all independent of each other. Suppose that there were only 500 independent observations, the remainder being merely duplicates of these 500. Then in the annexed table the columns for the normal distribution and for the discrepancy e should each be halved; and accordingly the column for e²/m should be halved. Thus ∑e²/m being reduced to 22.9, P as found from Professor Pearson's table is between 995 and 629. That is, such a distribution might be expected to occur on an average some once or twice in a hundred times. If actual duplication of this sort is not common in statistics,[276] yet in all such applications of the Pearsonian criterion—and in other calculations involving the number of observations, in particular the determinations of probable error—a good margin is to be left for the possibility that the n observations are not perfectly independent: e.g. the accidents of wind or nerve which affected one shot may have affected other shots immediately before or after.

Belt    Observed     Normal          e       e²/m
        Frequency    Distribution
  1         1            1            0      0
  2         4            6           −2      0.667
  3        10           27          −17     10.704
  4        89           67          +22      7.224
  5       190          162          +28      4.839
  6       212          242          −30      3.719
  7       204          240          −36      5.400
  8       193          157          +36      8.255
  9        79           70           +9      1.157
 10        16           26          −10      3.846
 11         2            2            0      0
Total    1000         1000                  45.811
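The figures of the table of shots, and the effect of supposing half the observations to be duplicates, can be recomputed as follows. This Python sketch assumes the usual closed form of P for an even number n of percentiles, namely e^(−χ²/2) multiplied by the first n/2 terms of the exponential series in χ²/2; that form and the function names are assumptions of the sketch, not quoted from the text.

```python
import math

# Columns of the table of 1000 shots (belts 1 to 11).
observed = [1, 4, 10, 89, 190, 212, 204, 193, 79, 16, 2]
normal = [1, 6, 27, 67, 162, 242, 240, 157, 70, 26, 2]


def chi_squared(observed_counts, expected_counts):
    """Sum of e^2/m, with e the excess of the observed over the calculated m."""
    return sum((o - m) ** 2 / m for o, m in zip(observed_counts, expected_counts))


def pearson_p_even(chi2, n):
    """P for an even number n of percentiles (n + 1 compartments)."""
    if n % 2 != 0:
        raise ValueError("this closed form applies only when n is even")
    half = chi2 / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, n // 2):
        term *= half / j          # builds (chi2/2)^j / j!
        total += term
    return math.exp(-half) * total


chi2 = chi_squared(observed, normal)           # about 45.8
p_independent = pearson_p_even(chi2, 10)       # 11 belts, so n = 10
p_duplicated = pearson_p_even(chi2 / 2.0, 10)  # only 500 independent shots
```

On the first supposition P is of the order of 15 or 16 in ten millions; on the second it rises to between one and two in a hundred, which is the point of the criticism.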

158. (2) The Generalized Law of Error.—That the normal law of error should not be exactly fulfilled is not disconcerting to those who ground the law upon the plurality of independent causes. On that view the normal law would only be exact when the number of elements from which it is generated is very great. In general, when that number is large, but not indefinitely great,[277] there is required a correction owing to one or other of the following imperfections: that the elements do not fluctuate according to the normal law of frequency; that their fluctuations are not independent of each other; that the function whereby they are aggregated is not linear. The correction is formed by a series of terms descending in the order of magnitude.

159. Second and Third Approximations.—The first term of this series may be written

$-2\frac{k_1}{c^3}\left[\frac{x}{c} - \frac{2x^3}{3c^3}\right]$;

where c²/2 is the mean square of deviation for the compound and also the sum of the mean squares of deviations for the component elements, k1 is the mean cube of deviations for the compound and the sum of the mean cubes for the components, and the elements are supposed to be such and so numerous that k1/c³ is of the order 1/√n. This second approximation, first given by Poisson, was rediscovered by De Forest.[278] The present writer has obtained it[279] by a variety of methods. By a further extension of these methods a third and further approximations may be found. The corrected normal law is then of the form[280]

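$y = \frac{1}{\sqrt{\pi}\,c}\,e^{-x^2/c^2}\left[1 - 2\kappa_1\left(\frac{x}{c} - \frac{2x^3}{3c^3}\right) + \kappa_1^2\left(-\frac{5}{3} + \frac{10x^2}{c^2} - \frac{20x^4}{3c^4} + \frac{8x^6}{9c^6}\right) + \kappa_2\left(\frac{1}{2} - \frac{2x^2}{c^2} + \frac{2x^4}{3c^4}\right)\right]$;

where κ1 = k1/c³, κ2 = k2/c⁴; k1 and c are defined as above; k2 is the sum of the respective differences for each element between its mean fourth power of error and thrice the square of its mean square of error,[281] and also the corresponding difference for the compound. The formula may be verified by the case of the binomial, considered as a simple case of the law of great numbers. Here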

$c^2 = 2npq$, $k_1 = npq(q - p)$, $k_2 = npq(1 - 6pq)$.[282]

These values being substituted for the coefficients in the general formula, there results an expression which may be obtained directly by continuing[283] to expand the expression for a term of the binomial.
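The verification by the binomial can also be made numerically. The Python sketch below compares the terms of a binomial with the first, second and third approximations, assuming the corrected law is written in the standard form with κ1 = k1/c³ and κ2 = k2/c⁴; the bracketed polynomials in the code are that assumed form, and the function names and the choice n = 100, p = 0.2 are hypothetical.

```python
import math


def corrected_normal_law(x, c, kappa1, kappa2):
    """Return the first, second and third approximations at deviation x."""
    t = x / c
    y0 = math.exp(-t * t) / (math.sqrt(math.pi) * c)   # normal law, modulus c
    second = -2.0 * kappa1 * (t - 2.0 * t ** 3 / 3.0)
    third = (
        kappa1 ** 2 * (-5.0 / 3.0 + 10.0 * t ** 2 - 20.0 * t ** 4 / 3.0 + 8.0 * t ** 6 / 9.0)
        + kappa2 * (0.5 - 2.0 * t ** 2 + 2.0 * t ** 4 / 3.0)
    )
    return y0, y0 * (1.0 + second), y0 * (1.0 + second + third)


def worst_errors(n, p):
    """Greatest discrepancy from the binomial terms for each approximation."""
    q = 1.0 - p
    c = math.sqrt(2.0 * n * p * q)
    kappa1 = n * p * q * (q - p) / c ** 3
    kappa2 = n * p * q * (1.0 - 6.0 * p * q) / c ** 4
    worst = [0.0, 0.0, 0.0]
    for s in range(n + 1):
        exact = math.comb(n, s) * p ** s * q ** (n - s)
        approximations = corrected_normal_law(s - n * p, c, kappa1, kappa2)
        for i, value in enumerate(approximations):
            worst[i] = max(worst[i], abs(value - exact))
    return worst


first_err, second_err, third_err = worst_errors(100, 0.2)
```

For this skew binomial each further approximation reduces the greatest discrepancy, as the descending order of the terms would lead one to expect.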

In virtue of the second approximation a set of observations is not to be excluded from the affinity to the normal curve because, like the curve of barometric heights,[284] it is slightly asymmetrical. In virtue of the third approximation it is not excluded because, like the group of shot-marks above examined, it is, though almost perfectly symmetrical, in other respects apparently somewhat abnormal.

160. Higher Approximations.—If the third approximation is not satisfactory there is still available a fourth, or a still higher degree of approximation.[285] The general expression for y which (multiplied by ∆x) represents the probability that an error will occur at a particular point (within a particular small interval) may be written

,

where y0 is (the normal error-function) $\frac{1}{\sqrt{2\pi k}}e^{-x^2/2k}$, k is the mean square of deviation; k1, k2, . . ., &c., are coefficients formed from the mean powers of deviation according to the rule that kt is the difference between the (t+2)th mean power as it actually is and what it would be if the tth approximation were perfectly correct. Thus k1 is the difference between the actual mean third power and what the third power would be if the first approximation, the normal law, were perfectly correct, that is, the difference between the actual mean third power, often written μ3, and zero, that is μ3. Similarly k2 is the difference between the actual mean fourth power of deviation, say μ4, and what that mean power would be if the second approximation were perfectly correct, viz. 3k². Thus k2 = μ4 − 3k². The series k1, k3, k5, &c., and k, k2, k4, &c., form each a succession of terms descending in the order of magnitude, when each k, e.g. kt, has been divided by the corresponding power, i.e. the power (t+2) of the parameter or modulus c = √(2k), which division is secured by the successive differentiations of y0, with which each k is associated, e.g. kt with $d^{t+2}y_0/dx^{t+2}$. Moreover, the first term of the odd series of k's when divided by the proper power of the parameter, viz. c³, is small in comparison with the first term of the even series, viz. k, properly referred—divided by c² (= 2k).
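The rule for the first coefficients, k the mean square, k1 = μ3 and k2 = μ4 − 3k², may be illustrated on a binomial, for which the values npq, npq(q − p) and npq(1 − 6pq) are stated above. A minimal Python sketch follows; the function name and the choice n = 10, p = 0.3 are hypothetical.

```python
import math


def k_coefficients(values, probabilities):
    """Return (k, k1, k2) from the mean powers of deviation of a group."""
    mean = sum(p * v for v, p in zip(values, probabilities))

    def mean_power(t):
        # t-th mean power of deviation from the centre of gravity.
        return sum(p * (v - mean) ** t for v, p in zip(values, probabilities))

    k = mean_power(2)                  # mean square of deviation
    k1 = mean_power(3)                 # differs from zero, the normal value
    k2 = mean_power(4) - 3.0 * k * k   # differs from 3k^2, the normal value
    return k, k1, k2


n, p = 10, 0.3
q = 1.0 - p
values = list(range(n + 1))
probabilities = [math.comb(n, v) * p ** v * q ** (n - v) for v in values]
k, k1, k2 = k_coefficients(values, probabilities)
```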

161. Character of the Approximation.—Whatever the degree of approximation employed, it is to be remembered that the law in general is only applicable to a certain range of the compound magnitude here represented by the abscissa x.[286] The curve of error, even when generalized as here proposed, coincides only with the central portion—the body, as distinguished from the extremities—of the actual locus; a greater or less proportion.

162. Extension to Two or More Dimensions.—The law thus generalized may be extended, with similar reservations, to two or more dimensions. For example, the second approximation in two dimensions may be written

;

where z0 is (the normal error-function)

$\frac{1}{\pi\sqrt{1 - r^2}}\exp\left[-\frac{x^2 - 2rxy + y^2}{1 - r^2}\right]$,

x and y are (as before) co-ordinates measured from the centre of gravity of the group as origin, each referred to (divided by) its proper modulus; r is the ordinary coefficient of regression; ${}_{3,0}k$ is the mean value of the cubes x³, ${}_{2,1}k$ is the mean value of the products x²y, and so on; all these k's being quantities of an order less than unity. This form lends itself readily to the determination of a second approximation to the regression-curve, which is the locus of that y, which is the most probable value of the ordinate corresponding to an assigned value of x. Form the logarithm of the above-written expression (for the frequency-surface); and differentiate that logarithm with respect to y. The required locus is given by equating this differential to zero (the second differential being always negative). The resulting equation is of the form

y − rx − T − αx2 − 2βxy − γy2 = 0,

where T, α, β, γ are all small, linear functions of the k’s. As y is nearly equal to rx, it is legitimate to substitute rx for y, when y is multiplied by a small coefficient. The curve of regression thus reduces to a parabola with equation of the form

y − T = rx − qx2;

where q is a linear function of the third mean powers and moments of the given group.

163. Dissection of certain Heterogeneous Groups.—Under the head of law of error may be placed the case in which statistics relating to two (or more) different types, each separately conforming to the normal law, are mixed together; for instance, the measurements of human heights in a country comprising two distinct races.

In this case the quaesita are the constants in a curve of the form:

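$y = \frac{\alpha}{\sqrt{\pi}\,c_1}\exp\left[-\frac{(x-a)^2}{c_1^2}\right] + \frac{\beta}{\sqrt{\pi}\,c_2}\exp\left[-\frac{(x-b)^2}{c_2^2}\right]$,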

where α and β are the proportionate sizes of the two groups (α+β = 1); a and b are the respective centres of gravity; and c1, c2 the respective moduli. The data are measurements each of which relates to one or other of these component curves. A splendid solution of this difficult problem has been given by Professor Pearson. The five unknown quantities are connected by him with the centre of gravity of the given observations, and the mean second, third, fourth and fifth powers of their deviations from that centre of gravity, by certain rational algebraic equations, which reduce to an equation in one variable of the ninth dimension. In an example worked by Professor Pearson this fundamental equation had three possible roots, two of which gave very fair solutions of the problem, while the third suggested that there might be a negative solution, importing that the given system would be obtained by subtracting one of the normal groups from the other; but the coefficients for the negative solution proved to be imaginary. “In the case of crabs' foreheads, therefore, we cannot represent the frequency curve for their forehead length as the difference of two normal curves.” In another case, which primâ facie seemed normal, Professor Pearson found that “all nine roots of the fundamental nonic lead to imaginary solutions of the problem. The best and most accurate representation is the normal curve.”

164. This laborious method of separation seems best suited to cases in which it is known beforehand that the statistics are a mixture of two normal groups, or at least this is strongly suggested by the two-headed character of the given group. Otherwise the less troublesome generalized law of error may be preferable, as it is appropriate both to the mixture of two—not very widely different—normal groups, and also to the other cases of composition. Even when a group of statistics can be broken up into two or three frequency curves of the normal—or not very abnormal—type, the group may yet be adequately represented by a single curve of the “generalized” type, provided that the heterogeneity is not very great, not great enough to prevent the constants k1, k2, k3, &c., from being small. Thus, suppose the given group to consist of two normal curves each having the same modulus c, and that the distance between the centres is considerable, so considerable as just to cause the central portion of the total group to become saddle-backed. This phenomenon sets in when the distance between the centre of gravity of the system and the centre of either component = $c/\sqrt{2}$.[287] Even in this case k2 is only −0.125; k4 is 0.25 (the odd k’s are zero).
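The two figures just quoted can be reproduced from the mean powers of the mixed group. The sketch below is a hedged illustration in Python: it assumes two equal normal groups of modulus c with centres at ±c/√2, and it assumes that the k2 and k4 of the text are the fourth and sixth coefficients of the whole group (μ4 − 3μ2² and μ6 − 15μ4μ2 + 30μ2³) divided by the fourth and sixth powers of the modulus of the whole group. The function names are hypothetical.

```python
import math


def mixture_mean_powers(sigma, d):
    """Second, fourth and sixth mean powers of two equal normal groups.

    Each component has standard deviation sigma; the centres lie at +d
    and -d from the common centre of gravity.
    """
    m2 = sigma ** 2 + d ** 2
    m4 = 3 * sigma ** 4 + 6 * sigma ** 2 * d ** 2 + d ** 4
    m6 = 15 * sigma ** 6 + 45 * sigma ** 4 * d ** 2 + 15 * sigma ** 2 * d ** 4 + d ** 6
    return m2, m4, m6


def reduced_even_coefficients(sigma, d):
    """k2 and k4 of the whole group, each divided by its power of the modulus."""
    m2, m4, m6 = mixture_mean_powers(sigma, d)
    modulus_squared = 2.0 * m2             # squared modulus of the whole group
    k2 = (m4 - 3.0 * m2 ** 2) / modulus_squared ** 2
    k4 = (m6 - 15.0 * m4 * m2 + 30.0 * m2 ** 3) / modulus_squared ** 3
    return k2, k4


c = 1.0
sigma = c / math.sqrt(2.0)    # standard deviation answering to modulus c
d = c / math.sqrt(2.0)        # distance at which the saddle-back sets in
k2_reduced, k4_reduced = reduced_even_coefficients(sigma, d)
```

Under these assumptions the code returns −0.125 and 0.25; a smaller separation d gives coefficients smaller still.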

Section II.—Laws of Frequency.

165. The “Generalized Probability Curve.”—A formula much more comprehensive than the corrected normal law is proposed by Professor Pearson under the designation of the “generalized probability-curve.” The ground and scope of the new law cannot be better stated than in the words of the author: “The slope of the normal curve is given by a relation of the form

$\frac{1}{y}\frac{dy}{dx} = -\frac{x}{c_1}$.

The slope of the curve correlated to the skew binomial, as the normal curve to the symmetrical binomial, is given by a relation of the form

$\frac{1}{y}\frac{dy}{dx} = -\frac{x}{c_1 + c_2x}$.

Finally, the slope of the curve correlated to the hypergeometrical series (which expresses a probability distribution in which the contributory causes are not independent, and not equally likely to give equal deviations in excess and defect), as the above curves to their respective binomials, is given by a relation of the form

$\frac{1}{y}\frac{dy}{dx} = -\frac{x}{c_1 + c_2x + c_3x^2}$.

This latter curve comprises the two others as special cases, and, so far as my investigations have yet gone, practically covers all homogeneous statistics that I have had to deal with. Something still more general may be conceivable, but I have found no necessity for it.”[288] The “hypergeometrical series,” it should be explained, had appeared as representative of the distribution of black balls,[289] in the following case. “Take n balls in a bag, of which pn are black and qn are white, and let r balls be drawn and the number of black be recorded. If r > pn, the range of black balls will lie between 0 and pn; the resulting frequency-polygon is given by a hypergeometrical series.”
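The frequency-polygon of the bag of balls can be computed term by term. The following Python sketch is illustrative; the function name and the particular bag (20 balls, 6 black, 8 drawn without replacement) are hypothetical.

```python
import math


def black_ball_frequencies(n, black, r):
    """Probability of each possible number of black balls among r drawn.

    The bag holds n balls of which `black` are black; the r balls are
    drawn together (without replacement).
    """
    white = n - black
    ways = math.comb(n, r)
    lowest = max(0, r - white)
    highest = min(black, r)
    return {
        s: math.comb(black, s) * math.comb(white, r - s) / ways
        for s in range(lowest, highest + 1)
    }


# Here r exceeds the number of black balls, so the range is 0 to 6.
frequencies = black_ball_frequencies(20, 6, 8)
mean_black = sum(s * f for s, f in frequencies.items())
```

The mean number of black balls is r multiplied by the proportion p of black in the bag, while the limited range and the dependence between successive draws distinguish the polygon from the binomial.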

Further reasons in favour of his construction are given by Professor Pearson in a later paper.[290] “The immense majority, if not the totality, of frequency distributions in homogeneous material show, when the frequency is indefinitely increased, a tendency to give a smooth curve characterized by the following properties. (i.) The frequency starts from zero, increases slowly or rapidly to a maximum and then falls again to zero—probably at a quite different rate—as the character for which the frequency is measured is steadily increased. This is the almost universal unimodal distribution of the frequency of homogeneous series . . . (ii.) In the next place there is generally contact of the frequency-curve at the extremities of the range. These characteristics at once suggest the following form of frequency curve, if yδx measure the frequency falling between x and x+δx:—

dy/dx = y(x + a)/F(x) ⋅ ⋅ ⋅

Now let us assume that F(x) can be expanded by Maclaurin’s theorem. Then our differential equation to the frequency will be

$\frac{1}{y}\frac{dy}{dx} = \frac{x + a}{b_0 + b_1x + b_2x^2 + \cdots}$

Experience shows that the form (x) [“keeping b0, b1, b2, only”] suffices for certainly the great bulk of frequency distributions.”[291]

166. The “generalized probability-curve” presents two main forms[292]

$y = y_0\left(1 + \frac{x}{a_1}\right)^{\nu a_1}\left(1 - \frac{x}{a_2}\right)^{\nu a_2}$,

and $y = y_0\,\frac{1}{(1 + x^2/a^2)^m}\,e^{\nu\tan^{-1}(x/a)}$.

When a1, a2, ν are all finite and positive, the first form represents, in general, a skew curve, with limited range in both directions; in the particular case, when a1 = a2, a symmetrical curve, with range limited in both directions. If a2 = ∞, the curve reduces to

$y = y_0\left(1 + \frac{x}{a_1}\right)^{\nu a_1}e^{-\nu x}$;

representing an asymmetrical binomial with ν = 2μ2/μ3, and $a_1 = 2\mu_2^2/\mu_3 - \mu_3/(2\mu_2)$, μ2 and μ3 being respectively the mean second and mean third power of deviation measured from the centre of gravity. In the particular case, when μ3 is small, this form reduces to what is above called the “quasi-normal” curve; and when μ3 is zero, a1 becoming infinite, to the simple normal curve. The pregnant general form yields two less familiar shapes apt to represent curves of the character shown in figs. 14 and 15—the one occurring in a good number of instances, such as infant deaths, the values of houses, the number of petals in certain flowers; the other less familiarly illustrated by Consumptivity and Cloudiness.[293] The second solution represents a skew curve with unlimited range in both directions.[294] Professor Pearson has successfully applied these formulae to a number of beautiful specimens culled in the most diverse fields of statistics. The flexibility with which the generalized probability-curve adapts itself to every variety of existing groups no doubt gives it a great advantage over the normal curve, even in its extended form. It is only in respect of a priori evidence that the latter can claim precedence.[295]

Fig. 14. Fig. 15.

167. Skew Correlation.—Professor Pearson has extended his method to frequency-loci of two dimensions;[296] constructing for the curve of regression (as a substitute for the normal right line), in the case of “skew correlation,” a parabola,[297] with constants based on the higher moments of the given group.

168. In this connexion reference may again be made to Mr Yule’s method of treating skew surfaces as if they were normal. It is certainly remarkable that the correlation should be so well represented by a line—the property of a normal surface—in cases of which normality cannot be predicated: for instance, the statistics of the number of husbands (or wives) living at each age who have wives (or husbands) living at different ages.[298] It may be suggested that though in this case there is one dominant cause, the continual decrease of the population, inconsistent with the plurality of causes postulated for the law of error, yet there is a sufficient degree of accidental variation to realize one property at least of the normal locus.

169. Relations between Frequency and Probability.—There is possibly an extensive class of phenomena of which the frequency depends largely on fortuitous causes, yet not so completely as to present the genuine law of error.[299] This mixed class of phenomena might be amenable to a kind of law of frequency that would be different from, yet have some affinity to, the law of error. The double character may be taken as the definition of the laws proper to the present section. The definition of the class is more distinct than its extent. Consider for example the statistics which represent the numbers out of a million born that die in each year of age after thirty or forty—the latter part of the column in a life-table. These are well represented by a species of Professor Pearson’s “generalized probability-curve,”[300] his type iii. of the form

$y = y_0\left(1 + \frac{x}{a}\right)^{\gamma a}e^{-\gamma x}$.

The statistics also lend themselves to the Gompertz-Makeham formula for the number living at the age x

$l_x = s^x g^{c^x}$.

The former law, the simplest species of the “generalized probability-curve,” may well be attributed in part to the operation of a plexus of causes such as that which is apt to generate the law of error. In fact, a high authority, Professor Lexis, has seen in these statistics—or continental statistics in pari materia—a fulfilment of the normal law of error.[301] They at least fulfil tolerably the generalized law of error above described. But the Gompertz-Makeham formula is not thus to be accounted for; at least it is not thus that it was regarded by its discoverers. Gompertz justifies his law[302] by a “hypothetical deduction congruous with many natural effects,” such as the exhaustion of air by a pump; and Makeham follows[303] in the same track of explanation by way of natural laws. Of course it is not denied that mortality is subject to accident. But the Gompertz-Makeham law purports to be fulfilled in spite of, not by reason of, fortuitous agencies. The formula is accounted for not by the interaction of fleeting causes which is characteristic of probability, but by causes of that ordinary kind of which the investigation constitutes the greater part of natural science. Laws of frequency thus conceived do not belong to the theory of Probabilities.

Authorities.—As a comprehensive and masterly treatment of the subject as a whole, in its philosophical as well as mathematical character, there is nothing similar or second to Laplace’s Théorie analytique des probabilités. But this “ne plus ultra of mathematical skill and power” as it is called by Herschel (Edinburgh Review, 1850) is not easy reading. Much of its difficulty is connected with the use of a mathematical method which is now almost superseded, “Generating Functions.” Not all parts of the book are as rewarding as the Introduction (published separately as Essai philosophique sur les probabilités) and the fourth and subsequent chapters of the second book. Among numerous general treatises E. Czuber’s Wahrscheinlichkeitstheorie (1899) may be noticed as terse, lucid and abounding in references. Other authorities may be mentioned in relation to the different parts of the subject as above divided. First principles are discussed with remarkable acumen by J. Venn in Logic of Chance (1st ed., 1866, 3rd ed., 1888) and by J. v. Kries in Principien der Wahrscheinlichkeitsrechnung (1886). As a repertory of neat problems involving the calculation of probability and expectation W. A. Whitworth’s Choice and Chance (5th ed., 1901), and DCC. Exercises . . . in Choice and Chance (1897) deserve mention. But this advantage is afforded in nearly as great perfection by more comprehensive works. Bertrand’s Calcul des probabilités (1889) abounds in choice examples, while it excels in almost every other branch of the subject. Special mention is also deserved by H. Poincaré’s Calcul des probabilités (leçons professées, 1893–1894). On local or geometrical probability Professor Morgan Crofton is one of the highest authorities. His paper on “Local Probability” in Phil. Trans. (1868), and on “Geometrical Theorems,” Proc. Lond. Math. Soc. (1887), viii., should be read in connexion with the section on “Local Probability” in his article on “Probability” in the 9th edition of the Ency. Brit., from which section several paragraphs have been transferred en bloc to the section on Geometrical Applications in the present article. The topic is treated exhaustively by Czuber in Geometrische Wahrscheinlichkeiten und Mittelwerte (1884). Czuber is also to be mentioned as the author of Theorie der Beobachtungsfehler, in which he has reproduced, often with improvement, or referred to, almost everything of importance in the work of his predecessors. A. L. Bowley’s Elements of Statistics, pt. 2 (2nd ed., 1902), forms an introduction to the law of error which leads the beginner easily, yet far. References to other writers are given in Section I. of Part II. above. A list of writings on the cognate topic, the method of least squares, has been given by Merriman (Connecticut Trans. vol. iv.). On laws of frequency, as above defined, Professor Karl Pearson is the highest authority. His “Contributions to the Mathematical Theory of Evolution,” of which twelve have appeared in the Trans. Roy. Soc. (1894–1903) and others are being published by the Drapers' Company, teem with new theories in Probabilities. (F. Y. E.[304])


  1. Cf. note to par. 5 below.
  2. It is more usual to speak of the mean expectation, the average number of years per head.
  3. Below, par. 88.
  4. For more exact definition see below, par. 95.
  5. See Bowley's Address to Section F. of the British Association (1906).
  6. Edgeworth, “Methods of Statistics,” Journal of the Statistical Society (Jubilee volume, 1885).
  7. See Biometrika, vol. iii. “Inheritance of Mental Characters.”
  8. Laplace, Théorie analytique des probabilités, liv. II. ch. i. No. 1. Cf. Introduction, IIᵉ principe.
  9. The term employed by Venn in his important Logic of Chance.
  10. Below, par. 119.
  11. E.g. 1/1861 in the expansion of which the digit 8 occurs once in ten times in seemingly random fashion (see Mess. of Maths. 1864, vol. 2, pp. 1 and 39).
  12. The type shows that the phenomena which are the object of probabilities do not constitute a distinct class of things. Occurrences which perfectly conform to laws of nature and are capable of exact prediction yet in certain aspects present the appearance of chance. Cf. Edgeworth, “Law of Error,” Camb. Phil. Trans., 1905, p. 128.
  13. Cf. Venn. op. cit. ch. v. § 14; and v. Kries on the “Prinzip des mangelnden Grundes” in his Wahrscheinlichkeitsrechnung, ch. i. § 4, et passim.
  14. In a passage criticized unfavourably by Dr Venn, Logic of Chance, ch. iv. § 14.
  15. Below, par. 115.
  16. Chances of Death, i. 44.
  17. A summary of such experiments, comprising above 100,000 trials, is given by Professor Karl Pearson in his Chances of Death, i. 48.
  18. E.g. J. S. Mill, Logic, bk. III., ch. xviii. § 2.
  19. Cf. Venn, Logic of Chance, ch. vi. § 24.
  20. Boole, Trans. Roy. Soc. (1862), ix. 251.
  21. Op. cit. Introduction.
  22. Below, par. 130.
  23. Grammar of Science, ed. 2, p. 146.
  24. From the article by the present writer on the “Philosophy of Chance” in Mind, No. ix., in which some of the views here indicated are stated at greater length than is here possible.
  25. Cf. v. Kries, op. cit. ch. ii.
  26. On the principle of Taylor's theorem; cf. Edgeworth, Phil. Mag. (1892), xxxiv. 431 seq.
  27. Cf. J. S. Mill, in the passage referred to below, par. 13, on the use that may be made of an “antecedent probability,” though “it would be impossible to estimate that probability with anything like numerical precision.”
  28. Op. cit. Introduction.
  29. Bertrand on “Probabilités composées,” op. cit. art. 23.
  30. In some of the experiences referred to at par. 5.
  31. See below pars. 132, 159.
  32. Op. cit. Introduction.
  33. There is a good statement of them in Boole's Laws of Thought, ch. xvi. § 7. Cf. De Morgan “Theory of Probabilities” (Encyc. Metrop.), §§ 12 seq.
  34. Laplace, op. cit. Introduction, IVe Principe; cf. Ve Principe and liv. II. ch. i. § 1.
  35. In such a case there seems to be a propriety in expressing the indeterminate element in our data, not as above, but as proposed by Boole in his remarkable Laws of Thought, ch. xvii., ch. xviii., § 1 (cf. Trans. Edin. Roy. Soc., (1857), vol. xxi.; and Trans. Roy. Soc., 1862, vol. ix., vol. clii. pt. i. p. 251); the undetermined constant now representing the probability that if the event C does not occur the event B will. The values of this constant—in the absence of specific data, and where independence is not presumable—are, it should seem, equally distributed between the values 0 and 1. Cf. as to Boole's Calculus, Mind, loc. cit., ix. 230 seq.
  36. Laplace's Sixth Principle.
  37. Manzoni.
  38. De Morgan, Theory of Probabilities, § 19; cf. Venn, Logic of Chance, ch. vii. § 9; Edgeworth, “On the Probable Errors of Frequency Constants,” Journ. Stat. Soc. (1908), p. 653. The essential symmetry of the inverse and the direct methods is shown by an elegant proof which Professor Cook Wilson has given for the received rules of inverse probability (Nature, 1900, Dec. 13).
  39. Laplace's Seventh Principle.
  40. Logic, book III., ch. xviii. § 6.
  41. Cf. above, par. 8; below, par. 46.
  42. Cf. Venn, Logic of Chance, p. 126.
  43. See the reference to Craig in Todhunter, History . . . of Probability.
  44. Formal Logic, p. 173.
  45. Ibid. Cf. “Theory of Probabilities” (Encyc. Metrop.), note to § 5, “Wherever the term greater or less can be applied there twice, thrice, &c., can be conceived, though not perhaps measured by us.”
  46. It is well remarked by Professor Irving Fisher (Capital and Income, 1907, ch. xvi.), that Bernoulli's theorem involves a “subjective” element, a “psychological magnitude.” The remark is applicable to the general theory of error of which the theorem of Bernoulli is a particular case (see below, pars. 103, 104).
  47. In the hands of Professor Karl Pearson, Mr Sheppard and Mr Yule. Cf. par. 149, below.
  48. Cf. Edgeworth, Journ. Stat. Soc. (Dec. 1908).
  49. Below, par. 152.
  50. Consider the equivalent of Laplace's second principle given at par. 9, above, and his third principle quoted at par. 10.
  51. Above, par. 12.
  52. In the more familiar form; that (of two independently fluctuating quantities) the mean of the product is the product of the means (cf. Czuber, Theorie der Beobachtungsfehler, p. 133).
  53. Above, par. 6.
  54. These peculiarities afford some justification for Laplace's restriction of the term expectation to “goods.” As to the wider definition here adopted see below, par. 94 and par. 95, note.
  55. Each fortune referred to is divided by a proper parameter. See below, par. 69.
  56. Op. cit. liv. II. ch. xiii. No. 41. Cf. liv. II. ch. i. No. 2.
  57. Principles of Economics, book III., ch. vi. § 6, p. 209, ed. 4.
  58. Cf. below, par. 71.
  59. Some further references bearing on the subject are given in a paper by the present writer on the “Pure Theory of Taxation,” No. III. Economic Journ. (1897), vii. 550-551.
  60. Below, par. 131.
  61. Above, par. 14.
  62. Above, par. 5.
  63. Article on “Probabilities” (Encyc. Metrop.), § 40.
  64. Essai (1785), pp. 142 et seq.
  65. Logic of Chance, ch. vi. §§ 24-28.
  66. Wahrscheinlichkeitsrechnung, pp. 184 seq.
  67. The relations of recent logicians to the older mathematical writers on Probabilities may be illustrated by the relations of modern “historical” economists to their more abstract predecessors.
  68. Of the two properties which have been found to characterize probability (above, par. 5)—proportionate (1) number of (equally) favourable cases and (2) frequency of observed occurrence—the former especially pertain to the data and quaesita of this section.
  69. Cf. Bertrand's distinction between “Probabilités totales,” and “Probabilités composées,” Calcul des probabilités, ch. ii. arts. 23, 24.
  70. Cf. Todhunter, History. . . of Probability, p. 360, and other statements of James Bernoulli's Theorem, referred to in the index.
  71. Venn, op. cit. p. 91.
  72. Some of these proofs are adduced, and a new and elegant one added by Bertrand, op. cit. ch. v.
  73. When the degree in which a certain range of central terms tends to preponderate over the residue of the series is formulated with precision, as in the statement given by Todhunter (op. cit. p. 548) when he is interpreting Laplace, then James Bernoulli's theorem presents a particular case of the law of error—the case considered below in par. 103.
  74. See Chrystal, Algebra, ch. xxiii. § 12; or other textbook of algebra.
  75. See Todhunter, History . . . of Probability, art. 8; Bertrand, Calcul des probabilités, p. vii., or the original documents.
  76. As Galileo discerned. A friend of his had observed that 11 occurred 1080 times to 1000 times of 12.
  77. The law of error given below, par. 104.
  78. Above, par. 24.
  79. Trans. Roy. Soc. (1895). See below, par. 165.
  80. Todhunter, History . . . of Probability, and Bertrand, Calcul des probabilités, p. 9.
  81. All three events cannot fail.
  82. (c) occurring n times.
  83. The reasoning may be illustrated by using the area of a circle to represent the frequency with which hearts fail, another (equal) circle for diamonds; for the case in which both hearts and diamonds fail the area common to the circles interlapping, and so on.
  84. See Whitworth, Exercises in Choice and Chance, No. 502 (p. 125); referring to prop. xiv. of the same author's Choice and Chance.
  85. Cf. Whitworth, Choice and Chance, question 143, p. 183, ed. 4.
  86. Ibid.
  87. There is such a table at the end of De Morgan's article in the Calculus of Probabilities in the Ency. Brit. “Pure Sciences,” vol. ii.
  88. Cancelling factors common to the numerator and denominator.
  89. Cf. Boole's Finite Differences, ch. vii. § 5.
  90. Op. cit. liv. II. ch. ii., No. 12.
  91. Op. cit. liv. II. ch. ii., No. 8.
  92. A clear and corrected version of Laplace's reasoning is given by Todhunter, History. . . of Probability, art. 973, p. 528, with reference to the more general cases in which the “skills” of each party—their chances of winning a single game—are not equal but respectively p and q (p + q = 1). See also Czuber, Wahrscheinlichkeitstheorie, pp. 30 seq.
  93. See Todhunter, op. cit. art. 107, and other articles referring to duration of play. See also Boole, Finite Differences, ch. xiv., art. 7, ex. 6.
  94. Above, par. 13.
  95. Op. cit. liv. II. ch. i. No. 1.
  96. Cf. Bertrand, op. cit. § 118.
  97. Above, par. 5.
  98. Par. 13.
  99. Op. cit. § 134.
  100. By a calculation based on the fundamental theorem (above, par. 23; cf. below, par. 103).
  101. But see below, par. 51.
  102. Morgan Crofton, loc. cit. p. 778, par. 1.
  103. Essai, p. 6 (there is postulated a proviso analogous to that which has been stated in par. 49 above, with reference to witnesses: that the probability of any one voter being right is > ½).
  104. See Mill's forcible remarks on this use of probabilities, which he places among the “misapplications of the calculus which have made it the real opprobrium of mathematics” (Logic, Book III, ch. xviii. § 3). Cf. Bertrand, Calcul des probabilités; Venn, Logic of Chance, ch. xvi. §§ 5-7; v. Kries, Principien der Wahrscheinlichkeitsrechnung, ch. ix., preface, § v., and ch. xiii. §§ 12, 13; Laplace's general reflections on this matter seem more valuable than his calculations: “Tant de passions et d'intérêts particuliers y mêlent si souvent leur influence qu'il est impossible de soumettre au calcul cette probabilité,” op. cit. Introduction (Des Choix et décisions des assemblées).
  105. As to the possibility of mistake in this respect, see Proctor, How to play Whist, p. 121.
  106. Bertrand, loc. cit.
  107. Loc. cit. § 43.
  108. Below, pars. 135, 136. A difficulty raised by Cournot with respect to the determination of several quantities which are connected by an equation does not here arise. The system of values determined for the several causes fulfils by construction the condition that the sum of the values should be equal to unity.
  109. Above, par. 44.
  110. It comes to the same to suppose the total number of balls in the mixture to be N; and to assume that the number of white balls is a priori equally likely to have any one of the values 1, 2, . . . N - 1, N.
  111. Above, par. 5.
  112. Logic, bk. ii. ch. ix. § 5.
  113. Grammar of Science, ch. iv. § 16. Cf. the article in Mind above referred to, ix. 234.
  114. See the introductory remarks headed “Description and Division of the Subject.”
  115. Cf. above, par. 25.
  116. See Pearson, Phil. Trans. (1895), A.
  117. Whitworth, Exercises, No. 502.
  118. Ibid. No. 504, cf. above, par. 29.
  119. Ibid. par. 36.
  120. Above par. 52.
  121. Loc. cit. § 45.
  122. See Edgeworth, “Elements of Chance in Examinations,” Journ. Stat. Soc. (1890). Cf. below, par. 124.
  123. Biometrika, i. 390.
  124. Moore, of Columbia University, New York, has attempted to trace Karl Pearson's theory in the statistics relating to the efficiency of wages (Economic Journal, Dec. 1907; and Journ. Stat. Soc., Dec. 1907).
  125. Cf. below, par. 169.
  126. Whitworth, Choice and Chance, question 126.
  127. Whitworth, Exercises, No. 567.
  128. According to the principle above enounced, par. 15.
  129. Bertrand, id. § 44, prob. xlvii.
  130. Bertrand, id. § 39, prob. xliii. It is not to be objected that the probabilities on which the several expectations are calculated are not independent (above, par. 16).
  131. It is important to remark that we should be wrong in thus adding the expectations if the events were not mutually exclusive. For the mathematical expectations it is not so.
  132. This paragraph is taken from Morgan Crofton's article on “Probability,” in the 9th edition of the Ency. Brit.
  133. Cf. Marshall, Principles of Economics, Mathematical Appendix, note ix.
  134. Or should we rather say, not exceeding the limit at which ψ(a − pα/q) becomes 0? (The value of ψ(0) may be regarded as −∞.) Neither of the proposed limitations materially affects the validity of the theorem.
  135. Loc. cit. par. 25.
  136. See above, par. 25 (James Bernoulli's theorem).
  137. Specimen theoriae novae de mensura sortis (16), translated (into German) with notes by Pringsheim (1906).
  138. Op. cit. art. 389.
  139. Choice and Chance, pp. 211, 232. The danger of a party to a game of chance being “ruined” (by losing more than his whole fortune), which forms a separate chapter in some treatises, is readily deducible from the theory of deviations from an average which will be stated in pt. ii.
  140. Above, par. 5.
  141. Above, par. 20.
  142. Whitworth, Exercises, No. 500.
  143. Cf. Morgan Crofton, loc. cit.
  144. As recorded by Czuber, Geometrische Wahrscheinlichkeiten, p. 90.
  145. Cf. Bertrand, Calcul des probabilités, pp. 4 seq. The matter has been much discussed in the Educational Times. See Mathematical Questions . . . from the Educational Times [a reprint], xxix. 17-20, containing references to earlier discussions, e.g. x. 33 (by Woolhouse).
  146. Loc. cit. § 75.
  147. The whole of p. 787 of Morgan Crofton's article is often referred to, and parts of pp. 786, 788 are transferred here.
  148. This result also follows by considering that, if an infinite plane be covered by an infinity of lines drawn at random, it is evident that the number of these which meet a given finite straight line is proportional to its length, and is the same whatever be its position. Hence, if we take l the length of the line as the measure of this number, the number of random lines which cut any element ds of the contour is measured by ds, and the number which meet the contour is therefore measured by ½L, half the length of the boundary. If we take 2l as the measure for the line, the measure for the contour will be L, as above. Of course we have to remember that each line must meet the contour twice. It would be possible to rectify any closed curve by means of this principle. Suppose it traced on the surface of a circular disk, of circumference, and the disk thrown a great number of times on a system of parallel lines, whose distance asunder equals the diameter, if we count the number of cases in which the closed curve meets one of the parallels, the ratio of this number to the whole number of trials will be ultimately the ratio of the circumference of the curve to that of the circle. [Morgan Crofton's note.]
  149. Or the floor may be supposed painted with parallel bands, at a distance asunder equal to the diameter; so that the circle must fall on one.
  150. The line might be anywhere within the circle without altering the question.
  151. This integral was given by Morgan Crofton in the Comptes rendus (1869), p. 1469. An analytical proof was given by Serret, Annales scient. de l'école normale (1869), p. 177.
  152. Cf. Bertrand, op. cit. § 135.
  153. See e.g. Watson, Kinetic Theory of Gases, p. 2; Tait, Trans. Roy. Soc., Edin. (1888), xxxiii. 68.
  154. Wahrscheinlichkeitstheorie, p. 64.
  155. See introductory remarks and note to par. 95.
  156. A great variety of (functional) averages, including those which are best known, are comprehended in the following general form $\phi^{-1}\{M[\phi(x_1), \phi(x_2), \ldots, \phi(x_n)]\}$; where $\phi$ is an arbitrary function, $\phi^{-1}$ is its inverse (such that $\phi^{-1}(\phi(x)) \equiv x$), and M is any (functional) mean. When M denotes the arithmetic mean, if $\phi(x) \equiv \log x$ ($\phi^{-1}(x) \equiv e^x$) we have the geometric mean; if $\phi(x) \equiv 1/x$, we have the harmonic mean. Of this whole class of averages it is true that the average of several averages is equal to the average of all their constituents.
  157. This convenient term was introduced by Karl Pearson.
  158. E.g. some specified method of smoothing the given statistics.
  159. See above, pt. i., pars. 3 and 4. Accordingly the expected value of the sum of n (similar) constituents $(x_1 + x_2 + \ldots + x_n)$ may be regarded as an average, the average value of $n x_r$ where $x_r$ is any one of the constituents.
  160. See as to the fact and the evidence for it, Venn, Logic of Chance, 3rd ed., pp. 111, 114. Cf. Ency. Brit., 8th ed., art. “Probability,” p. 592; Bertrand, op. cit., preface § ii.; above, par. 59.
  161. See his Cours d'économie politique, ii. 306. Cf. Bowley, Evidence before the Select Committee on Income Tax (1906, No. 365, Question 1163 seq.); Benini, Metodologica statistica, p. 324, referred to in the Journ. Stat. Soc. (March, 1909).
  162. On this conception see below, par. 122.
  163. E.g. in the article on “Probability” in the 9th ed. of the Ency. Brit.; also by Airy and other authorities. Bravais, in his article Sur la probabilité des erreurs. . . . “Mémoires présentés par divers savants” (1846), p. 257, takes as the “modulus or parameter” the inverse square of our c. Doubtless different parameters are suited to different purposes and contexts; c when we consult the common tables, and in connexion with the operator, as below, par. 160; $k\,(= \tfrac{1}{2}c^2)$ when we investigate the formation of the probability-curve out of independent elements (below, par. 104); $h\,(= 1/c^2)$ when we are concerned with weights or precisions (below, par. 134). If one form of the coefficient must be uniformly adhered to, probably, $\sigma\,(= c/\sqrt{2})$, for which Professor Pearson expresses a preference, appears the best. It is called by him the “standard deviation.”
  164. Fuller tables are to be found in many accessible treatises. Burgess's tables in the Trans. of the Edin. Roy. Soc. for 1900 are carried to a high degree of accuracy. Thorndike, in his Mental and Social Measurements, gives, among other useful tables, one referred to the standard deviation as the argument. New tables of the probability integral are given by W. F. Sheppard, Biometrika, ii. 174 seq.
  165. Edinburgh Review (1850), xcii. 19.
  166. The italics are in the original. The passage continues: “And it is on this ignorance, and not on any peculiarity in cases, that the idea of probability in the abstract is formed.” Cf. above, par. 6.
  167. Natural Philosophy, pt. i. art. 391. For other a priori proofs see Czuber, Theorie der Beobachtungsfehler, th. i.
  168. Cf. note to par. 127.
  169. He considered the effect as the sum of causes each of which obeys the simplest law of frequency, the symmetrical binomial.
  170. Memoirs of Astronomical Society (1878), p. 105. Cf. Morgan Crofton, “On the Law of Errors of Observation,” Trans. Roy. Soc. (1870), vol. clx. pt. i. p. 178.
  171. Above, par. 2.
  172. By the use of Stirling's and Bernoulli's theorems, Todhunter, History. . . of Probability.
  173. The statement includes the case of a linear function, since an element multiplied by a constant is still an element.
  174. E.g. if the frequency-locus of each element were $1/[\pi(1 + x^2)]$, extending to infinity in both directions. But extension to infinity would not be fatal, if the form of the element's locus were normal.
  175. For a fuller exposition and a justification of many of the statements which follow, see the writer's paper on “The Law of Error” in the Camb. Phil. Trans. (1905).
  176. Loc. cit. pt. i. § 1.
  177. On this criterion of coincidence see Karl Pearson's paper “On the Systematic Fitting of Curves,” Biometrika, vols. i. and ii.
  178. Laplace, Théorie analytique des probabilités, bk. ii. ch. iv.; Poisson, Recherches sur la probabilité des judgements. Good restatements of this proof are given by Todhunter, History . . . of Probability art. 1004, and by Czuber, Theorie der Beobachtungsfehler, art. 38 and Th. 2, §4.
  179. The symbol || is used to denote absolute magnitude, abstraction being made of sign.
  180. Below, pars. 159, 160.
  181. Loc. cit. app. I.
  182. Loc. cit. p. 53 and context.
  183. The Analyst (Iowa), vols. v., vi., vii. passim; and especially vi. 142 seq., vii. 172 seq.
  184. Morgan Crofton, loc. cit. p. 781, col. a. The principle has been used by the present writer in the Phil. Mag. (1883). xvi. 301.
  185. For a criticism and extension of Crofton's proof see the already cited paper on “The Law of Error,” Camb. Phil. Trans. (1905), pt. i. § 2. Space does not permit the reproduction of Crofton's proof as given in the 9th ed. of the Ency. Brit. (art. “Probability,” § 48).
  186. Loc. cit. pt. I. § 4; and app. 6.
  187. Loc. cit. p. 122 seq.
  188. Loc. cit. pt. ii. § 7.
  189. The second by Burbury, in Phil. Mag. (1894), xxxvii. 145; the third by its author in the Analyst for 1881; and the remainder by the present writer in Phil. Mag. (1896), xii. 247; and Camb. Phil. Trans. (1905), loc. cit.
  190. Compare the formula for the simple case above, § 4.
  191. On the irregularity of the dice with which Weldon experimented, see Pearson, Phil. Mag. (1900), p. 167.
  192. Experiments in pari materia performed by A. D. Darbishire afford additional illustrations. See “Some Tables for illustrating Statistical Correlation,” Mem. and Proc. Man. Lit., and Phil. Soc., vol. li. pt. iii.
  193. Journ. Stat. Soc. (March 1900), p. 73, referring to Burton, Phil. Mag. (1883), xvi. 301.
  194. Memoirs of Astronomical Society (1878), p. 105.
  195. Journ. Stat. Soc. (1890), p. 462 seq.
  196. E.g. the marking of the same work by different examiners. Ibid.
  197. Lettres sur la théorie des probabilités and Physique sociale.
  198. E.g. the measurements of Italian recruits, adduced in the Atlante statistico, published under the direction of the Ministero di Agricoltura (Rome, 1882); and Weldon's measurements of crabs, Proc. Roy. Soc. liv. 321; discussed by Pearson in the Trans. Roy. Soc. (1894), vol. clxxxv. A.
  199. Biometrika, iii. 395. Cf. ibid. p. 141.
  200. Wages in the United Kingdom in the Nineteenth Century; and art. “Wages” in the Ency. Brit., 10th ed., vol. xxxiii.
  201. Phil. Mag. (1900), p. 168.
  202. Cf. Journ. Stat. Soc., Jubilee No., p. 192.
  203. Massenerscheinungen.
  204. Grundzüge der Statistik. Cf. Bowley, Elements of Statistics, p. 302.
  205. Das Gesetz der kleinen Zahlen.
  206. See for other definitions Report of the British Association (1889), pp. 136 and 161, and compare Walsh's exhaustive Measurement of General Exchange-Value.
  207. Cf. Bowley, Elements of Statistics, ch. ix.
  208. Journ. Stat. Soc. (1874 and later). Parly. Papers [C. 2247] and [C. 3079].
  209. “Working-Class Progress since 1860,” Journ. Stat. Soc. (1899), p. 639.
  210. On this conception compare Venn, Logic of Chance, chs. iii. and iv., and Sheppard, Proc. Lond. Math. Soc., p. 363 seq.
  211. Laplace's 6th principle, Théorie analytique, intro. x.
  212. See above, pars. 13 and 14.
  213. Cf. above, par. 102.
  214. Cf. Galton's enthusiasm, Natural Inheritance, p. 66.
  215. A lucid statement of the methods and results of probabilities applied to gunnery is given in the Official Text-book of Gunnery (1902).
  216. Venn, Journ. Stat. Soc. (1891), p. 443.
  217. Ed. Rev. (1850), xcii. 23.
  218. Cf. Galton, Phil. Mag. (1875), xlix. 44.
  219. Above, par. 112.
  220. Ibid.
  221. Above, par. 114, and below, par. 127.
  222. Some plurality of independent causes is presumable.
  223. Herschel's a priori proposition concerning the law of error in two dimensions (above, par. 99) might still be defended either as generally true, so many phenomena showing no trace of interdependence, or on the principle which justifies our putting ½ for a probability that is unknown (above, par. 6), or 5 for a decimal place that is neglected; correlation being equally likely to be positive or negative. The latter sort of explanation may be offered for the less serious contrast between the a priori and the empirical proof of the law of error in one dimension (below, par. 158).
  224. Cf. above, par. 115.
  225. Cf. note to par. 98, above.
  226. Phil. Mag. (1892), p. 200 seq.; 1896, p. 211; Pearson, Trans. Roy. Soc. (1896), 187, p. 302; Burbury, Phil. Mag. (1894), p. 145.
  227. Pearson, “On the Reconstruction of Prehistoric Races,” Trans. Roy. Soc. (1898), A, p. 174 seq.; Proc. Roy. Soc. (1898), p. 418.
  228. Pearson, “The Law of Ancestral Heredity,” Trans. Roy. Soc.; Proc. Roy. Soc. (1898).
  229. Papers in the Royal Society since 1895.
  230. An example instructively discussed by Yule, Journ. Stat. Soc. (1899).
  231. If normally in any direction indifferently according to the two- or three-dimensioned law of error, then normally in one dimension when collected and distributed in belts perpendicular to a horizontal right line, as in the example cited below, par. 155.
  232. Or small interval (cf. preceding section).
  233. “Toute erreur soit positive soit négative doit être considérée comme un désavantage ou une perte réelle à un jeu quelconque,” Théorie analytique, art. 20 seq., especially art. 25. As to which it is acutely remarked by Bravais (op. cit. p. 258), “Cette règle simple laisse à désirer une démonstration rigoureuse, car l'analogie du cas actuel avec celui des jeux de hasard est loin d'être complète.”
  234. Theoria combinationis, pt. i. § 6. Simon Newcomb is conspicuous by walking in the way of Laplace and Gauss in his preference of the most advantageous to the most probable determinations. With Gauss he postulates that “the evil of an error is proportioned to the square of its magnitude” (American Journal of Mathematics, vol. viii. No. 4).
  235. As argued by the present writer, Camb. Phil. Trans. (1885), vol. xiv. pt. ii. p. 161. Cf. Glaisher, Mem. Astronom. Soc. xxxix. 108.
  236. The view taken by the present writer on the “Philosophy of Chance,” in Mind (1880; approved by Professor Pearson, Grammar of Science, 2nd ed. p. 146). See also “A priori Probabilities,” Phil. Mag. (Sept. 1884), and Camb. Phil. Trans. (1885), vol. xiv. pt. ii. p. 147 seq.
  237. Above, pars. 6, 7.
  238. The mean square .
  239. The standard deviation pertaining to a set of (n/r) composite observations, each derived from the original n observations by averaging a batch thereof numbering r, is √(k/r)/√(n/r) = √(k/n), when the given observations are all of the same weight; mutatis mutandis when the weights differ.
  240. The use of the cubes is also contrasted with that of the squares (only) in this respect: that it is no longer a matter of indifference how many of the original observations we assign to the batch of which the mean constitutes the single (compound) observation.
  241. The object of the writer's paper on “Methods of Statistics” in the Jubilee number of the Journ. Stat. Soc. (1885).
  242. See on the use of the inverse method to determine the mode of a group, the present writer's paper on “Probable Errors” in the Journ. Stat. Soc. (Sept. 1908).
  243. Above, par. 103.
  244. Théorie analytique, 2nd supp. p. 164. Mécanique céleste, bk. iii. art. 40; on which see the note in Bowditch's translation. The method may be extended to other percentiles. See Czuber, Beobachtungsfehler, § 58. Cf. Phil. Mag. (1886), p. 375; and Sheppard, Trans. Roy. Soc. (1899), 192, p. 135, ante, where the error incident to this kind of determination is ascertained with much precision.
  245. Cf. Phil. Mag. (1887), xxiv. 269 seq., where the median is prescribed in case of “discordant” (heterogeneous) observations. If the more drastic remedy of rejecting part of the data is resorted to Sheppard's method of performing that operation may be recommended (Proc. Lond. Math. Soc. vol. 31). He prescribes for cases to which the median may not be appropriate, namely, the determination of other frequency-constants besides the mean of the observations.
  246. Above, par. 134.
  247. E.g. Airy, Theory of Errors, art. 60.
  248. It is a nice point that the expression for $c^2$, which has (n − 1) instead of n for denominator, though not the more probable, may yet be the more advantageous (supposing that there were any sensible difference between the two). Cf. Camb. Phil. Trans. (1885), vol. xiv. pt. ii. p. 165; and “Probable Errors,” Journ. Stat. Soc. (June 1908).
  249. Above, par. 96, note.
  250. Théorie analytique, 2nd supp. ed. 1847, p. 578.
  251. See the matter discussed in Camb. Phil. Trans., loc. cit.
  252. Trans. Roy. Soc. (1899), A, cxcii. 135.
  253. Good as tested by a comparison of the mean squares of errors in the frequency-constant determined by the compared methods.
  254. Above, par. 130.
  255. See Phil. Mag. (1888), “On a New Method of Reducing Observations”; where a comparison in respect of convenience and accuracy with the received method is attempted.
  256. Corresponding to the of pars. 14, 127 above.
  257. Pearson, Trans. Roy. Soc., A, 191, p. 234.
  258. Pearson, Grammar of Science, 2nd ed. pp. 402, 431.
  259. Trans. Roy. Soc. (1898), A, vol. 191; Biometrika, ii. 273.
  260. Above, par. 107. Compare the proof of the “Subsidiary Law of Error,” as the law in this connexion may be called, in the paper on “Probable Errors,” Journ. Stat. Soc. (June 1908).
  261. Trans. Roy. Soc. (1899), A, 192, p. 141.
  262. Above, par. 115.
  263. Grammar of Science, p. 432.
  264. Trans. Roy. Soc., A, vol. 195. In this connexion reference should also be made to Pearson's theory of “Contingency” in his thirteenth contribution to the “Mathematical Theory of Evolution” (Drapers' Company Research Memoirs).
  265. Trans. Roy. Soc. (1900), A, 194, p. 257; (1901), A, 197, p. 91.
  266. Above, par. 127.
  267. Above, par. 116.
  268. If from the given set of n observations (each corresponding to a point on the plane xy) there is derived a set of n/s observations, each obtained by averaging a batch numbering s of the original observations, the coefficient of correlation for the derived system is the same as that which pertains to the original system. As to the standard deviation for the new system see note to par. 135.
  269. Cf. above, par. 115.
  270. Proc. Roy. Soc., vol. 60, p. 477.
  271. Below, par. 168.
  272. Just as the removal of a tax tends to be in the long run beneficial to the consumer, though the benefit on any particular occasion may be masked by fluctuations of price due to other causes.
  273. Phil. Mag. (July, 1900).
  274. As shown above, par. 103.
  275. Loc. cit. p. 166.
  276. It is frequent in the statistics of wages.
  277. See on this subject, in addition to the paper on the “Law of Error” already cited (Camb. Phil. Trans., 1905), another paper by the present writer, on “The Generalized Law of Error,” in the Journ. Stat. Soc. (September, 1906).
  278. The Analyst (Iowa), vol. ix.
  279. Phil. Mag. (Feb., 1896) and Camb. Phil. Trans. (1905).
  280. The part of the third approximation affected with k2 may be found by proceeding to another step in the method described (Phil. Mag., 1896, p. 96). The remaining part of the third approximation is found by the same method (or the variant on p. 97) from the new partial differential equation dy/di = (1/24) d⁴y/dx⁴, where k2 is the difference between the actual mean fourth power of deviation and what it would be if the normal law held good. Further approximations may be obtained on the same principle.
  281. μ₄ − 3μ₂² in the notation which Professor Pearson has made familiar.
  282. Cf. Pearson, Trans. Roy. Soc. (1895), A, clxxxvi. 347.
  283. Above, § 103, referring to Todhunter, History, art. 993. The third (or second additional term of) approximation for the binomial, given explicitly by Professor Pearson, Trans. Roy. Soc. (1895), A, footnote of p. 347, will be found to agree with the general formula above given, when it is observed that the correction affecting the absolute term, his y₀, disappears in his formula by division.
  284. Journ. Stat. Soc. (1899), p. 550, referring to Pearson, Trans. Roy. Soc. (1898), A.
  285. Practically no doubt the law is not available beyond the third or fourth approximation, for a reason given by Pearson, with reference to his generalized probability-curve, that the probable error incident to the determination of the higher moments becomes very great.
  286. This consideration does not prevent the determination of the true moments from the complete set of observations if homogeneous, according as the system of elements fulfils more or less perfectly certain conditions.
  287. Cf. Journ. Stat. Soc. (1899), lxii. 131. A similar substitution of the generalized law of error may be recommended in preference to the method of translating a normal law of error (putting x = f(x), where x obeys the normal law of error) suggested by the present writer (Journ. Stat. Soc., 1898), and independently by Professor J. C. Kapteyn (Skew Frequency Curves, 1903).
  288. Trans. Roy. Soc. (1895), A, p. 381.
  289. Ibid. p. 360.
  290. “Mathematical Contributions to the Theory of Evolution” (Drapers' Company Research Memoirs, Biometric Series II.), xiv. 4.
  291. p. 7, loc. cit.
  292. Ibid. p. 367.
  293. Pearson, loc. cit., p. 364, and Proc. Roy. Soc.
  294. A lucid exposition of Professor Pearson’s various methods is given by W. Palin Elderton in Frequency-curves and Correlation (1906).
  295. Journ. Stat. Soc. (1895), p. 506.
  296. “Contributions,” No. xiv. (above cited).
  297. Not the same parabola as that proposed at par. 162.
  298. Census of England and Wales, General Report (Cd. 2174), p. 226. Cf. p. 70, as to the rationale of the phenomenon.
  299. A good example of the suggested blend between law and chance is presented by an hypothesis which Benini (in a passage referred to above, par. 97) has proposed to account for Pareto’s income-curve.
  300. “Contributions,” No. ii., Phil. Trans. (1895), vol. 186, A.
  301. Lexis, Massenerscheinungen, § 46. Cf. Venn, cited above, par. 124.
  302. Phil. Trans. (1825).
  303. Assurance Magazine (1866), xi. 315.
  304. These initials do not apply to certain passages in the above article, namely, the greater part of paragraphs 41, 52, 62 and 72, and almost the whole of the 4th section of Part I. (pars. 76-93), which have been adopted from the article “Probability” in the 9th edition of the Ency. Brit., written by Professor Morgan Crofton.
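
The statement of note 268, that averaging the observations in batches of s leaves the coefficient of correlation unchanged, may be checked numerically. The following is only a rough sketch under assumptions that are not in the article: the observations are taken to be independent pairs drawn from a normal correlation surface, and the names pearson and batch_means, the correlation 0.6, the batch size and the seed are all hypothetical choices made for illustration. With a finite sample the two coefficients agree only approximately.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample coefficient of correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def batch_means(values, s):
    """Means of consecutive, non-overlapping batches of s values each."""
    return [sum(values[i:i + s]) / s for i in range(0, len(values) - s + 1, s)]


# Illustrative assumptions: independent normal pairs with correlation 0.6.
rng = random.Random(0)
r_true = 0.6
n, s = 50000, 10

xs, ys = [], []
for _ in range(n):
    u = rng.gauss(0.0, 1.0)
    v = rng.gauss(0.0, 1.0)
    xs.append(u)
    # y shares the part r_true * u with x, so corr(x, y) = r_true.
    ys.append(r_true * u + sqrt(1.0 - r_true ** 2) * v)

r_original = pearson(xs, ys)
r_batched = pearson(batch_means(xs, s), batch_means(ys, s))

print("original system:", round(r_original, 3))
print("derived system :", round(r_batched, 3))
```

The agreement rests on the independence of the successive pairs: the covariance and both variances of the batch means are each reduced in the ratio 1/s, so that their ratio, the coefficient of correlation, is unaffected.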