## Statistics of dispersion

Summarizing data from a measurement variable requires a number that represents the "middle" of a set of numbers (known as a "statistic of central tendency" or "statistic of location"), along with a measure of the "spread" of the numbers (known as a "statistic of dispersion"). Statistics of dispersion are used to give a single number that describes how compact or spread out a distribution of observations is. Although statistics of dispersion are usually not very interesting by themselves, they form the basis of most statistical tests used on measurement variables.

**Range:** This is simply the difference between the largest
and smallest observations. This is the statistic of dispersion that
people use in everyday conversation, but it is not very informative for
statistical purposes. The range depends only on the largest and smallest
values, so that two sets of data with very different distributions could
have the same range. In addition, the range is expected to increase as
the sample size increases. The range can be found in Excel by using
=MAX(Ys)-MIN(Ys), where Ys represents a set of cells.
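For readers working outside Excel, here is a minimal Python sketch of the same calculation. The data values are made up for illustration, and the second list is only there to show that two very differently shaped data sets can share a range.

```python
# Hypothetical measurements; any list of numbers would do.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Range: largest observation minus smallest observation.
data_range = max(ys) - min(ys)
print(data_range)  # 7.0

# A very different distribution (all values piled at the two extremes)
# that nevertheless has exactly the same range.
other = [2.0, 2.0, 2.0, 2.0, 9.0, 9.0, 9.0, 9.0]
other_range = max(other) - min(other)
print(other_range)  # 7.0
```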

**Sum of squares:** This is not really a statistic of dispersion
by itself, but it is mentioned here because it forms the basis of the
variance and standard deviation. Subtract the sample mean from an
observation and square this "deviate". Squaring the deviates makes all of
the squared deviates positive and has other statistical advantages. Do
this for each observation, then sum these squared deviates. This sum of
the squared deviates from the mean is known as the sum of squares. It is
given by the Excel function DEVSQ(Ys), *not* by the function SUMSQ.
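The steps above can be sketched in Python as follows. The data are hypothetical; the second quantity is included only to show how the plain sum of squared values (what SUMSQ returns) differs from the sum of squared deviates (what DEVSQ returns).

```python
from statistics import mean

# Hypothetical measurements, chosen so the mean is a round number.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

y_bar = mean(ys)  # sample mean, 5.0 for these data

# Sum of squares: square each deviate from the mean, then add them up.
sum_of_squares = sum((y - y_bar) ** 2 for y in ys)
print(sum_of_squares)  # 32.0

# Not the same thing: the sum of the squared raw values, with no mean subtracted.
sum_of_squared_values = sum(y ** 2 for y in ys)
print(sum_of_squared_values)  # 232.0
```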

**Parametric variance:** If you take the sum of squares and divide
it by the number of observations (N), you are computing the average
squared deviation from the mean. As observations get more and more spread
out, they get farther from the mean, and the average squared deviate gets
larger. This average squared deviate, or sum of squares divided by N, is
the variance. You can only calculate the variance this way if you have
observations for every member of a population, which is almost never the
case. I can't think of a good biological example where using the
parametric variance would be appropriate. The parametric variance is
given by the Excel function VARP(Ys).
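As a rough illustration, the parametric variance can be computed by hand or with `statistics.pvariance` from the Python standard library, which plays the same role as VARP. The data are hypothetical and are treated here as if they were the entire population.

```python
from statistics import mean, pvariance

# Hypothetical data, treated as a complete population for this example.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(ys)
y_bar = mean(ys)
sum_of_squares = sum((y - y_bar) ** 2 for y in ys)

# Parametric variance: sum of squares divided by N.
parametric_variance = sum_of_squares / n
print(parametric_variance)  # 4.0

# The standard library function divides by N as well.
library_value = pvariance(ys)
```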

**Sample variance:** We almost always have a sample of observations
that we are using to estimate a population parameter. To get an
unbiased estimate of the population variance, divide the sum of squares
by N-1, not by N. This sample variance, which is the one you will almost
always use, is given by the Excel function VAR(Ys). From here on, when you see "variance," it means the sample variance.
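A small sketch of the difference between dividing by N-1 and dividing by N, again with made-up data. `statistics.variance` in the Python standard library divides by N-1, so it corresponds to VAR.

```python
from statistics import mean, variance

# Hypothetical sample of observations.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(ys)
y_bar = mean(ys)
sum_of_squares = sum((y - y_bar) ** 2 for y in ys)

# Sample variance: sum of squares divided by N-1.
sample_variance = sum_of_squares / (n - 1)

# Dividing by N instead always gives a smaller number.
divided_by_n = sum_of_squares / n

print(sample_variance, divided_by_n)

# The standard library function uses the N-1 denominator.
library_value = variance(ys)
```

The gap between the two versions shrinks as the sample size grows, since N-1 and N become nearly the same.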

**Standard deviation:** Variance, while it has useful statistical properties that make it the basis of many statistical tests, is in squared units. A set of heights measured in centimeters would have a variance expressed in square centimeters, which is just weird. Taking the square root of the variance gives a measure of dispersion that is in the original units. The square root of the parametric variance is the parametric standard deviation, which you will almost never use; it is given by the Excel function STDEVP(Ys). The sample standard deviation, given by the Excel function STDEV(Ys), is the square root of the sample variance. (A fully unbiased estimate of the population standard deviation would require a rather complicated correction factor, which is rarely applied.) You will almost always use the sample standard deviation; from here on, when you see "standard deviation," it means the sample standard deviation.
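For comparison, here is a short Python sketch of both standard deviations. The data are hypothetical; `statistics.pstdev` and `statistics.stdev` are the standard library counterparts of STDEVP and STDEV.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical measurements, say heights in centimeters.
ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Parametric standard deviation: square root of the parametric variance.
parametric_sd = math.sqrt(pvariance(ys))
print(parametric_sd)  # 2.0

# Sample standard deviation: square root of the sample variance.
sample_sd = math.sqrt(variance(ys))
print(round(sample_sd, 3))  # 2.138

# The library functions give the same answers directly.
library_parametric_sd = pstdev(ys)
library_sample_sd = stdev(ys)
```

Both results are in the original units (centimeters here), not squared units.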

In addition to being more understandable than the variance as a measure of the amount of variation in the data, the standard deviation summarizes how close observations are to the mean in a very nice way. Many variables in biology fit the normal probability distribution fairly well. If a variable fits the normal distribution, 68.3 percent (or roughly two-thirds) of the values are within one standard deviation of the mean, 95.4 percent are within two standard deviations of the mean, and 99.7 percent (or almost all) are within three standard deviations of the mean.

[Figure: histogram illustrating these proportions for a normal distribution]
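The percentages quoted for the normal distribution can be checked with a few lines of Python. This sketch relies on the standard result that, for a normal variable, the probability of falling within k standard deviations of the mean is erf(k/√2); the function name `normal_fraction_within` is just a label chosen for this example.

```python
import math

def normal_fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    percent = 100 * normal_fraction_within(k)
    print(f"within {k} SD: {percent:.1f}%")
```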

The proportions of the data that are within 1, 2, or 3 standard deviations of the mean are different if the data do not fit the normal distribution.

[Figure: histograms of two very non-normal data sets]
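To see how far the proportions can drift for non-normal data, here is a minimal sketch that counts the observations within k sample standard deviations of the mean. The skewed data set is invented for illustration, and `fraction_within` is a helper written for this example, not a library function.

```python
from statistics import mean, stdev

def fraction_within(ys, k):
    """Fraction of observations within k sample standard deviations of the mean."""
    y_bar = mean(ys)
    s = stdev(ys)
    inside = [y for y in ys if abs(y - y_bar) <= k * s]
    return len(inside) / len(ys)

# Hypothetical, strongly skewed data: many small values and one very large one.
skewed = [1, 1, 1, 1, 2, 2, 2, 3, 3, 50]

for k in (1, 2, 3):
    print(k, fraction_within(skewed, k))
```

For these made-up data, 90 percent of the observations fall within one standard deviation of the mean, well above the 68.3 percent expected for normal data, because the single large value inflates the standard deviation.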

**Coefficient of variation:** The coefficient of variation is the standard deviation divided by the mean; it summarizes the amount of variation as a percentage or proportion of the mean. It is useful when comparing the amount of variation among groups with different means. For example, let's say you wanted to know which had more variation, pinkie finger length or little toe length, because you want to know whether stabilizing selection is stronger on fingers than toes, since we use our fingers for more precise activities than our toes. Pinkie fingers would almost certainly have a higher standard deviation than little toes, because fingers are several times longer than toes. However, the coefficient of variation might show that the standard deviation, as a percentage of the mean, was greater for toes.
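The finger-and-toe comparison might look something like this in Python. The lengths are entirely made up (in millimeters) to illustrate the point; they are not real measurements, and `coefficient_of_variation` is a helper defined just for this sketch.

```python
from statistics import mean, stdev

def coefficient_of_variation(ys):
    """Sample standard deviation divided by the mean, as a proportion."""
    return stdev(ys) / mean(ys)

# Hypothetical lengths in millimeters (invented for illustration).
fingers = [57, 60, 63, 58, 62]
toes = [18, 20, 22, 19, 21]

sd_fingers, sd_toes = stdev(fingers), stdev(toes)
cv_fingers = coefficient_of_variation(fingers)
cv_toes = coefficient_of_variation(toes)

print(f"fingers: SD = {sd_fingers:.2f} mm, CV = {100 * cv_fingers:.1f}%")
print(f"toes:    SD = {sd_toes:.2f} mm, CV = {100 * cv_toes:.1f}%")
```

With these invented numbers, fingers have the larger standard deviation but toes have the larger coefficient of variation, which is the pattern described above.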

### How to calculate the statistics

I have made a spreadsheet that calculates the range, sample variance, sample standard deviation, and coefficient of variation, for up to 1000 observations.

This web page calculates range, variance, standard deviation, and coefficient of variation for up to 80 observations.

This web page calculates range, variance, and standard deviation. I don't know the maximum number of observations it can handle.

### Example

Here are the statistics of dispersion for the blacknose dace data from the central tendency web page. In reality, you would rarely have any reason to report all of these:

| Statistic | Value |
|---|---|
| Range | 90 |
| Variance | 1029.5 |
| Standard deviation | 32.09 |
| Coefficient of variation | 45.8% |
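As a quick consistency check on the reported numbers (the raw dace data are not reproduced here), the standard deviation should be the square root of the variance, and dividing the standard deviation by the coefficient of variation gives back the approximate mean. This sketch uses only the values in the table; the implied mean is derived arithmetic, not a figure taken from the data.

```python
import math

# Values as reported in the example above.
reported_variance = 1029.5
reported_sd = 32.09
reported_cv = 0.458  # 45.8% expressed as a proportion

# Standard deviation is the square root of the variance.
sd_from_variance = math.sqrt(reported_variance)
print(round(sd_from_variance, 2))  # 32.09

# CV = SD / mean, so mean = SD / CV (approximate, because CV is rounded).
implied_mean = sd_from_variance / reported_cv
print(round(implied_mean, 1))
```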

### Reference

Sokal and Rohlf 1995, pp. 48-53, 57-59, 98-105.



Return to John McDonald's home page

This page was last revised October 15, 2006. Its URL is statdispersion.html