the median of the lower half of the batch, although the rules used for finding the
quartiles differ slightly from those used for finding the median. (In exploratory data
analysis the quartiles are often called the hinges.) To find the quartiles, first divide
the number of numbers in the batch by 4. If the result is a fraction, round it up to the
next whole number. Then count in that many numbers from the highest number in
the batch to arrive at the upper quartile and from the lowest number in the batch to
arrive at the lower quartile.
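
The counting rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names quartiles and midspread are invented here, and the sample batch is made up for the purpose rather than taken from Table 3.1.

```python
def quartiles(batch):
    """Return (lower, upper) quartiles found by the counting rule in the text."""
    if not batch:
        raise ValueError("batch must contain at least one number")
    ordered = sorted(batch)
    # Divide the batch size by 4, rounding any fraction up to the next whole
    # number. (n + 3) // 4 is integer arithmetic for "n / 4, rounded up".
    depth = (len(ordered) + 3) // 4
    lower = ordered[depth - 1]  # count in from the lowest number
    upper = ordered[-depth]     # count in from the highest number
    return lower, upper


def midspread(batch):
    """Return the upper quartile minus the lower quartile."""
    lower, upper = quartiles(batch)
    return upper - lower


# Hypothetical batch of 13 numbers (NOT the Table 3.1 weights):
# 13 / 4 = 3.25, which rounds up to 4, so we count in four from each end.
example = [2, 4, 5, 7, 8, 9, 10, 12, 13, 15, 18, 20, 25]
print(quartiles(example))  # (7, 15)
print(midspread(example))  # 8
```

Statistical software often interpolates between neighboring numbers when locating quartiles, so its results may differ slightly from this simple counting rule.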
For example, there are 12 flakes from Pit 1 for which weights are given in Table 3.1.
We divide 12 by 4 and get 3. The upper quartile is the third number from the top of
the stem-and-leaf, or 12.9 g. The lower quartile is the third number from the bottom
of the stem-and-leaf, or 9.2 g. The midspread is then 12.9 g − 9.2 g = 3.7 g. For Pit 2,
we have a batch of 13 weights; 13/4 = 3.25, which we round up to 4. The upper quartile
is the fourth number from the top of the stem-and-leaf, or 13.5 g. The lower quartile
is the fourth number from the bottom of the stem-and-leaf, or 9.8 g. The midspread is
thus 13.5 g − 9.8 g = 3.7 g.
The midspread gives us better results for this example than the range, indicating
that both batches are spread out to the same degree (a midspread of 3.7 g for both
batches). This is at least closer to the mark than using a numerical index that shows
the Pit 1 batch to be much more spread out than the Pit 2 batch.
The procedure for finding the midspread also reveals why it is sometimes called
the interquartile range (at least by those who never use two syllables when five
will do). The midspread is simply the range between the quartiles, and interquartile
range is the traditional term for it. The midspread is used more in exploratory data
analysis than in traditional statistics, and it works particularly well with the median
to give us a quick indication of the level and spread of a batch.
THE VARIANCE AND STANDARD DEVIATION
The variance and the standard deviation are based on the mean. They are consider-
ably more cumbersome to calculate than the range or the midspread, and they lack
some of the immediately intuitive meaning that the range and midspread have. They
have technical properties, however, that make them extraordinarily useful, and so
they will be of considerable importance in many of the following chapters.
The basic concept on which the variance is based is that of difference from the
mean. Clearly the vast majority of numbers in a batch are likely to be rather different
from the mean of the batch. We can easily see how different any number in a batch
is from the mean by subtracting the mean from it. The first two columns of Table 3.2
illustrate this procedure for all the numbers in the batch of weights of flakes from
Pit 2 in Table 3.1. As is logical, the higher numbers in the batch have positive deviations
from the mean (because they are above the mean), and the lower numbers have
negative deviations from the mean (because they are below the mean). The numbers
at the extreme ends of the batch, of course, deviate quite strongly from the mean in