

Statistics of dispersion

Summarizing data from a measurement variable requires a number that represents the "middle" of a set of numbers (known as a "statistic of central tendency" or "statistic of location"), along with a measure of the "spread" of the numbers (known as a "statistic of dispersion"). Statistics of dispersion are used to give a single number that describes how compact or spread out a distribution of observations is. Although statistics of dispersion are usually not very interesting by themselves, they form the basis of most statistical tests used on measurement variables.

Range: This is simply the difference between the largest and smallest observations. This is the statistic of dispersion that people use in everyday conversation, but it is not very informative for statistical purposes. The range depends only on the largest and smallest values, so that two sets of data with very different distributions could have the same range. In addition, the range is expected to increase as the sample size increases. The range can be found in Excel by using =MAX(Ys)-MIN(Ys), where Ys represents a set of cells.
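As a rough illustration of why the range is not very informative, here is a minimal Python sketch. The two data sets and the helper name `sample_range` are made up for this example; they are not from any real study.

```python
def sample_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)

# Hypothetical data: most values bunched at 5, versus values spread evenly.
clustered = [0, 5, 5, 5, 5, 10]
spread_out = [0, 2, 4, 6, 8, 10]

# Both print 10, even though the distributions look very different.
print(sample_range(clustered))
print(sample_range(spread_out))
```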

Sum of squares: This is not really a statistic of dispersion by itself, but it is mentioned here because it forms the basis of the variance and standard deviation. Subtract the sample mean from an observation and square this "deviate". Do this for each observation, then sum these squared deviates. Squaring makes every deviate positive, so that deviates above and below the mean do not cancel each other out, and it has other statistical advantages. This sum of the squared deviates from the mean is known as the sum of squares. It is given by the Excel function DEVSQ(Ys), not by the function SUMSQ, which sums the squares of the raw values.
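The calculation can be sketched in a few lines of Python. The function names and the four observations are hypothetical; the second function is included only to show how the SUMSQ-style quantity differs.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (what DEVSQ computes)."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)

def sum_of_squared_values(values):
    """Sum of the squared raw values (what SUMSQ computes); a different quantity."""
    return sum(y ** 2 for y in values)

ys = [2, 4, 4, 6]  # hypothetical observations; their mean is 4

# Deviates are -2, 0, 0, 2, so the squared deviates are 4, 0, 0, 4.
print(sum_of_squares(ys))         # 8.0
print(sum_of_squared_values(ys))  # 72
```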

Parametric variance: If you take the sum of squares and divide it by the number of observations (N), you are computing the average squared deviation from the mean. As observations get more and more spread out, they get farther from the mean, and the average squared deviate gets larger. This average squared deviate, or sum of squares divided by N, is the variance. You can only calculate the variance this way if you have observations for every member of a population, which is almost never the case. I can't think of a good biological example where using the parametric variance would be appropriate. The parametric variance is given by the Excel function VARP(Ys).

Sample variance: We almost always have a sample of observations that we are using to estimate a population parameter. To get an unbiased estimate of the population variance, divide the sum of squares by N-1, not by N. This sample variance, which is the one you will almost always use, is given by the Excel function VAR(Ys). From here on, when you see "variance," it means the sample variance.
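To make the difference between the two divisors concrete, here is a small Python sketch on made-up numbers. It assumes the standard library `statistics` module, whose `pvariance` and `variance` functions divide by N and N-1 respectively.

```python
import statistics

ys = [2.0, 4.0, 4.0, 6.0]  # hypothetical observations
n = len(ys)
mean = sum(ys) / n
ss = sum((y - mean) ** 2 for y in ys)  # sum of squares, 8.0 here

parametric_variance = ss / n      # divide by N: 2.0
sample_variance = ss / (n - 1)    # divide by N-1: about 2.667

# The library functions should agree with the hand calculation.
print(parametric_variance, statistics.pvariance(ys))
print(sample_variance, statistics.variance(ys))
```

The sample variance is always the larger of the two, and the gap shrinks as N grows.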

Standard deviation: Variance, while it has useful statistical properties that make it the basis of many statistical tests, is in squared units. A set of heights measured in centimeters would have a variance expressed in square centimeters, which is just weird. Taking the square root of the variance gives a measure of dispersion that is in the original units. The square root of the parametric variance is the parametric standard deviation, which you will almost never use; it is given by the Excel function STDEVP(Ys). The sample standard deviation is the square root of the sample variance and is given by the Excel function STDEV(Ys). You will almost always use the sample standard deviation; from here on, when you see "standard deviation," it means the sample standard deviation.
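A short Python sketch of the units argument, using four invented heights. It relies on the standard library functions `statistics.variance`, `statistics.stdev`, and `statistics.pstdev`; the variable names are only for illustration.

```python
import math
import statistics

heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical heights in centimeters

sample_var = statistics.variance(heights_cm)  # in square centimeters
sample_sd = math.sqrt(sample_var)             # back in centimeters

print(f"sample variance: {sample_var:.1f} cm^2")
print(f"sample standard deviation: {sample_sd:.2f} cm")

# statistics.stdev takes the same square root of the sample variance;
# statistics.pstdev is the parametric version (sum of squares divided by N).
print(statistics.stdev(heights_cm), statistics.pstdev(heights_cm))
```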

In addition to being more understandable than the variance as a measure of the amount of variation in the data, the standard deviation summarizes how close observations are to the mean in a very nice way. Many variables in biology fit the normal probability distribution fairly well. If a variable fits the normal distribution, 68.3 percent (or roughly two-thirds) of the values are within one standard deviation of the mean, 95.4 percent are within two standard deviations of the mean, and 99.7 percent (or almost all) are within three standard deviations of the mean. Here's a histogram that illustrates this:

Left: The theoretical normal distribution. Right: Frequencies of 5,000 numbers randomly generated to fit the normal distribution. The proportions of these data within 1, 2, or 3 standard deviations of the mean match quite nicely those expected from the theoretical normal distribution.
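The theoretical percentages can be checked directly. This is a minimal Python sketch that assumes the standard library class `statistics.NormalDist` (available in Python 3.8 and later); the helper name `proportion_within` is made up here.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def proportion_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```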



The proportions of the data that are within 1, 2, or 3 standard deviations of the mean are different if the data do not fit the normal distribution, as shown for these two very non-normal data sets:

Left: Frequencies of 5,000 numbers randomly generated to fit a distribution skewed to the right. Right: Frequencies of 5,000 numbers randomly generated to fit a bimodal distribution.
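A simulation along these lines can be sketched in Python. This page does not say how the skewed and bimodal data were generated, so the exponential distribution and the two-humped mixture of normals below are stand-ins chosen for illustration, and the helper name is hypothetical.

```python
import random
import statistics

def proportions_within_sd(values, ks=(1, 2, 3)):
    """Fraction of values within k sample standard deviations of the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [sum(abs(v - mean) <= k * sd for v in values) / len(values) for k in ks]

rng = random.Random(42)  # fixed seed so repeated runs give the same numbers

# Assumed stand-ins, not the distributions used for the original figures:
# an exponential distribution (skewed to the right) and an equal mixture
# of two normal distributions centered at -3 and +3 (bimodal).
skewed = [rng.expovariate(1.0) for _ in range(5000)]
bimodal = [rng.gauss(rng.choice((-3.0, 3.0)), 1.0) for _ in range(5000)]

skewed_props = proportions_within_sd(skewed)
bimodal_props = proportions_within_sd(bimodal)
print("skewed: ", skewed_props)
print("bimodal:", bimodal_props)
```

With these particular stand-ins, noticeably more than 68.3 percent of the skewed values and noticeably fewer than 68.3 percent of the bimodal values should fall within one standard deviation of the mean.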



Coefficient of variation: This is the standard deviation divided by the mean; it summarizes the amount of variation as a percentage or proportion of the mean. It is useful when comparing the amount of variation among groups with different means. For example, let's say you wanted to know which had more variation, pinkie finger length or little toe length; you want to know whether stabilizing selection is stronger on fingers than toes, since we use our fingers for more precise activities than our toes. Pinkie fingers would almost certainly have a higher standard deviation than little toes, because fingers are several times longer than toes. However, the coefficient of variation might show that the standard deviation, as a percentage of the mean, was greater for toes.
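The finger and toe comparison might look like this in Python. The lengths are invented purely to show the arithmetic and are not real measurements; the function name is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a proportion."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical lengths in millimeters, made up for this example.
finger_mm = [54.0, 58.0, 62.0, 66.0, 70.0]
toe_mm = [17.0, 19.5, 22.0, 24.5, 27.0]

for name, data in (("finger", finger_mm), ("toe", toe_mm)):
    sd = statistics.stdev(data)
    cv = coefficient_of_variation(data)
    print(f"{name}: SD = {sd:.2f} mm, CV = {100 * cv:.1f}%")
```

In this made-up case the fingers have the larger standard deviation, but the toes have the larger coefficient of variation.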

How to calculate the statistics

I have made a spreadsheet that calculates the range, sample variance, sample standard deviation, and coefficient of variation, for up to 1000 observations.

This web page calculates range, variance, standard deviation, and coefficient of variation for up to 80 observations.

This web page calculates range, variance, and standard deviation. I don't know the maximum number of observations it can handle.

Example

Here are the statistics of dispersion for the blacknose dace data from the central tendency web page. In reality, you would rarely have any reason to report all of these:

Range                      90
Variance                 1029.5
Standard deviation         32.09
Coefficient of variation   45.8%
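These numbers can be reproduced with a short Python sketch. The nine counts below are not shown on this page; they are assumed here to be the blacknose dace values from the central tendency page, so treat them as an assumption and check them against that page before relying on them.

```python
import statistics

# Assumed blacknose dace counts (not listed on this page; see the note above).
dace = [76, 102, 12, 39, 55, 93, 98, 53, 102]

data_range = max(dace) - min(dace)
variance = statistics.variance(dace)   # sample variance, divides by N-1
sd = statistics.stdev(dace)            # sample standard deviation
cv = sd / statistics.mean(dace)        # coefficient of variation

print(f"Range                    {data_range}")
print(f"Variance                 {variance:.1f}")
print(f"Standard deviation       {sd:.2f}")
print(f"Coefficient of variation {100 * cv:.1f}%")
```

With the assumed counts, the output matches the four values in the table above.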

Reference

Sokal and Rohlf 1995, pp. 48-53, 57-59, 98-105.




Return to the Research Methods in Biology syllabus

Return to John McDonald's home page

This page was last revised October 15, 2006. Its URL is statdispersion.html