The first three reasons are inexcusable; only the last reason is scientifically valid. In statistics, as in any other field, assumptions narrow the scope of possibilities and enable one to draw conclusions with greater confidence, provided that the assumptions are valid. For example, various nonparametric techniques require 5-50% more measurements than parametric techniques need to achieve the same level of confidence in conclusions. Thus nonparametric techniques are said to be less efficient than parametric techniques, and the latter are preferable if the assumption of a normal distribution is valid. If this assumption is invalid but made anyway, then parametric techniques not only overestimate the confidence of conclusions but also give somewhat biased estimates.

The nonparametric analogues of parametric techniques are:

Measure	Parametric	Nonparametric
Average:	Mean	Median
Dispersion:	Standard deviation	Interquartile range
Confidence limits:	Conf. limits on mean	Conf. limits on median

Nonparametric statistics are easy to use, whether or not they are an option in one’s spreadsheet or graphics program. The first step in nearly all nonparametric techniques is to sort the measurements into increasing order. This step is a bit time consuming to do by hand for large datasets, but today most datasets are on the computer, and many software packages include a ‘sort’ command. We will use the symbol Ι_i to refer to the data value in sorted array position i; for example, Ι₁ would be the smallest data value.

The nonparametric measure of the true average value of the parent population is the median. For an odd number of measurements, the median is simply the middle measurement (Ι_(N+1)/2), i.e., that measurement for which half of the other measurements are larger and half are smaller. For an even number of measurements there is no single middle measurement, so the median is the average (midpoint) of the two measurements that bracket the middle. For example, if a sorted dataset of five points is 2.1, 2.1, 3.4, 3.6, and 4.7, then the median is 3.4; if a sorted dataset of six points is 2.1, 2.1, 3.4, 3.6, 4.7, and 5.2, then the median is (3.4+3.6)/2 = 3.5.
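The odd and even cases can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `median_of` is a hypothetical choice, not something from the text, and the two sample lists are the five-point and six-point datasets quoted above.

```python
def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    data = sorted(values)          # first step: sort into increasing order
    n = len(data)
    if n == 0:
        raise ValueError("median_of() needs at least one measurement")
    mid = n // 2                   # 0-based index of the middle (or upper-middle) value
    if n % 2 == 1:
        # Odd N: the single middle measurement, position (N+1)/2 in 1-based counting.
        return data[mid]
    # Even N: midpoint of the two measurements that bracket the middle.
    return (data[mid - 1] + data[mid]) / 2


five_points = [2.1, 2.1, 3.4, 3.6, 4.7]
six_points = [2.1, 2.1, 3.4, 3.6, 4.7, 5.2]
print(f"{median_of(five_points):.1f}")   # prints 3.4
print(f"{median_of(six_points):.1f}")    # prints 3.5
```

Python's standard library also provides `statistics.median`, which should give the same results for these inputs; the hand-written version is shown only to mirror the steps described in the text.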

The median divides the data population at the 50% level: 50% are larger and 50% are smaller. One can also divide a ranked dataset into four equally sized groups, or quartiles. One quarter of the data are smaller than the first quartile, the median is the second quartile, and one quarter of the data are larger than the third quartile.

The range is a frequently used nonparametric measure of data dispersion. The range is the data pair of smallest (Ι₁) and largest (Ι_N) values. For example, the range of the latter dataset above is 2.1–5.2. The range is a very inefficient measure of data dispersion; one measurement can change it dramatically. A more robust measure of dispersion is the interquartile range, the difference between the third and first quartiles. The interquartile range ignores extreme values. It is conceptually analogous to the standard deviation: the interquartile range encompasses the central 50% of the data, and ± 1 standard deviation encompasses the central 68% of a normal distribution.
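The contrast between the range and the interquartile range can be illustrated with a short Python sketch. The text does not say how quartiles should be interpolated for small datasets, so the choice of `statistics.quantiles` with `method="inclusive"` is an assumption made here; other conventions give slightly different quartiles. The helper names and the outlier value 50.0 are likewise hypothetical.

```python
from statistics import quantiles


def data_range(values):
    """Return the (smallest, largest) pair of a non-empty dataset."""
    data = sorted(values)
    return data[0], data[-1]


def interquartile_range(values):
    """Return third quartile minus first quartile.

    Assumption: quartiles use the 'inclusive' interpolation rule of
    statistics.quantiles; the source text does not specify a rule.
    """
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


sample = [2.1, 2.1, 3.4, 3.6, 4.7, 5.2]
with_outlier = sample + [50.0]     # one hypothetical extreme measurement

for label, data in (("original", sample), ("with outlier", with_outlier)):
    low, high = data_range(data)
    print(f"{label}: range {low} to {high}, "
          f"interquartile range {interquartile_range(data):.2f}")
```

Adding the single extreme value moves the upper end of the range from 5.2 to 50.0, while the interquartile range changes only slightly, which is the robustness described above.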

For non-normal distributions, confidence limits for the median are the best way to express the reliability with which the true average of the parent population can be estimated. Confidence limits are determined by finding the positions Ι_k and Ι_N-k+1 in the sorted data array Ι_i, where k is determined from Table 5 below. Because these confidence limits use an integer number of array positions, they do not correspond exactly to 95% or 99% confidence limits. Therefore Table 5 gives the largest k yielding a probability of at least the desired probability. For example, suppose that we have 9 ranked measurements: 4.5, 4.6, 4.9, 4.9, 5.2, 5.4, 5.7, 5.8, and 6.2. Then N=9, k=3 yields less