Drennan R.D. Statistics for Archaeologists: A Common Sense Approach

Подождите немного. Документ загружается.

28 CHAPTER 3

Table 3.1. Weights of Flakes Recovered from Two Bell-Shaped Pits

Flake weights (g) Back-to-back stem-and-leaf plot

Pit1 Pit2 Pit1 Pit2

9.2 11.3 6 28

12.9 9.8 27

11.4 14.1 26

9.1 13.5 25

28.6 9.7 24

10.5 12.0 23

11.7 7.8 22

10.1 10.6 21

7.6 11.5 20

11.8 14.3 19

14.2 13.6 18

10.8 9.3 17

10.9 16

X 12.33 11.42 2

14 13

Md 11.10 11.30

13 56

12 0

Range 21.0 6.5 74

11 35

Midspread 3.7 3.7 851

10 69

9 378

6 7 8

of Pit 1 because the central bunch (which is always the most important part of the

batch) is more dispersed along the stem. Nevertheless, the range for Pit 1 is much

greater, entirely because of the one very high outlier in the Pit 1 batch. Although

the range is simple to calculate and easily understood by everyone, it is likely to be

very misleading unless all outliers can be removed. It is not much used as an index

of spread.

THE MIDSPREAD OR INTERQUARTILE RANGE

The midspread is the range of the middle half of a batch. The highest 25% of

the numbers and the lowest 25% of the numbers are thus disregarded. It could be

thought of as a sort of trimmed range, thinking back to the trimmed mean discussed

in Chapter

In practice the midspread is found by locating the quartiles and subtracting the

lower quartile from the upper quartile. The upper quartile is something like the

median of the upper half of the batch and the lower quartile is something like

THE SPREAD OR DISPERSION OF A BATCH 29

the median of the lower half of the batch, although the rules used for ﬁnding the

quartiles differ slightly from those used for ﬁnding the median. (In exploratory data

analysis the quartiles are often called the hinges.) To ﬁnd the quartiles, ﬁrst divide

the number of numbers in the batch by 4. If the result is a fraction, round it up to the

next whole number. Then count in that many numbers from the highest number in

the batch to arrive at the upper quartile and from the lowest number in the batch to

arrive at the lower quartile.

For example, there are 12 ﬂakes from Pit 1 for which weights are given in

Table

3.1. We divide 12 by 4 and get 3. The upper quartile is the third num-

ber from the top of the stem-and-leaf, or 12.9 g. The lower quartile is the third

number from the bottom of the stem-and-leaf, or 9.2 g. The midspread is then

12.9g−9.2g = 3.7g. For Pit 2, we have a batch of 13 weights; (13/4)=3.25,

which we round up to 4. The upper quartile is the fourth number from the top of the

stem-and-leaf, or 13.5 g. The lower quartile is the fourth number from the bottom of

the stem-and-leaf, or 9.8 g. The midspread is thus 13.5g−9.8g= 3.7g.

The midspread gives us better results for this example than the range, indicating

that both batches are spread out to the same degree (a midspread of 3.7 g for both

batches). This is at least closer to the mark than using a numerical index that shows

the Pit 1 batch to be much more spread out than the Pit 2 batch.

The procedure for ﬁnding the midspread also reveals why it is sometimes called

the interquartile range (at least by those who never use two syllables when ﬁve

will do). The midspread is simply the range between the quartiles, and interquartile

range is the traditional term for it. The midspread is used more in exploratory data

analysis than in traditional statistics, and it works particularly well with the median

to give us a quick indication of the level and spread of a batch.

THE VARIANCE AND STANDARD DEVIATION

The variance and the standard deviation are based on the mean. They are consider-

ably more cumbersome to calculate than the range or the midspread, and they lack

some of the immediately intuitive meaning that the range and midspread have. They

have technical properties, however, that make them extraordinarily useful, and so

they will be of considerable importance to many of the following chapters.

The basic concept on which the variance is based is that of difference from the

mean. Clearly the vast majority of numbers in a batch are likely to be rather different

from the mean of the batch. We can easily see how different any number in a batch

is from the mean by subtracting the mean from it. The ﬁrst two columns of Table

3.2

illustrate this procedure for all the numbers in the batch of weights of ﬂakes from Pit

2inTable

3.1. As is logical, the higher numbers in the batch have positive deviations

from the mean (because they are above the mean), and the lower numbers have

negative deviations from the mean (because they are below the mean). The numbers

at the extreme ends of the batch, of course, deviate quite strongly from the mean in

30 CHAPTER 3

Table 3.2. Calculating the Standard Deviationof Flake Weights from Pit 2 (Table

3.1)

Deviations from mean Squared deviations from mean

x (g) x−



x−X



14.3 2.88 8.29

14.1 2.68 7.18

13.6 2.18 4.75

13.5 2.08 4.33

12.0 0.58 0.34

11.5 0.08 0.01

11.3 −0.12 0.01

10.9 −0.52 0.27

10.6 −0.82 0.67

9.8 −1.62 2.62

9.7 −1.72 2.96

9.3 −2.12 4.49

7.8 −3.62 13.10

X = 11.42

∑

(x−X)=−0.06

∑



x−X



= 49.02

(sum of squares)

∑

(x−

n −1

49.02

= 4.09

s =

√

4.09 = 2.02

either positive or negative direction. The more spread out a batch is, the more strong

deviations from the mean there are.

If we want to summarize these deviations numerically, it might occur to us to

take the mean of the deviations. This won’t do, however, because we can see that

the deviations must always add up to 0; hence, their mean will always be 0. Indeed,

a different way to think of the mean is to consider it a “balance point” that makes

these deviations add up to 0. (You may notice that the second column of Table

3.2

actually adds up to −0.06 rather than 0. This is a consequence of rounding error,

which commonly occurs. All the deviations are rounded off to two digits following

the decimal point, and in this case by pure chance a little more rounding down has

occurred than rounding up.)

What we are interested in, as an index of spread, is the set of deviations from the

mean without their signs. We could simply drop the signs and add up the absolute

values of the deviations, but it turns out to be preferable to get rid of the signs

by squaring the deviations from the mean. (The squares of the deviations from the

mean are, of course, all positive, as squares must all be.) This calculation is shown

in the third column of Table

3.2. It is this third column that we sum up. This sum is

sometimes referred to as the sum of the squared deviations from the mean or simply

the sum of squares.

This sum of squares will, other things being equal, be larger for a larger batch

of numbers than for a smaller batch because a larger batch has more deviations to

add up. To arrive at an index that is not affected by the size of the batch but only

THE SPREAD OR DISPERSION OF A BATCH 31

by its spread, what we need is something like the average squared deviation from

the mean. Instead of dividing the sum of squares by the number of numbers in the

batch, however, we divide it by one less than the number of numbers in the batch.

We do this for purely technical reasons to make the result more useful in future

chapters where we take batches of numbers to be samples from larger populations.

The equation for the variance, then, is

∑

(x −

n −1

where s

is the variance of x, X is the mean of x,andn is the number of numbers in

the batch of x.

Table

3.2 provides an example of the calculations that correspond to this equa-

tion. The variance has a rather arbitrary character compared to the range or the

midspread. The value of the variance is not as easy to relate intuitively to the values

in the batch as was the case with the range or midspread. We can at least remove the

confusing effect of squaring the deviations by taking the square root of the variance.

The result is s, the standard deviation:

s =

√



∑

(x −

n −1

The standard deviation, unlike the variance, is at least expressed in the same units

as the original batch. Thus it is appropriate to think of the standard deviation of the

weights of ﬂakes from Pit 2 as not just 2.02, but 2.02 g. If we relate the standard

deviation to the stem-and-leaf plot in Table

3.1, we see that the standard devia-

tion delineates the portion of the stem within which most of the ﬂake weights fall.

That is, most weights are within 2.02 g above or below the mean of 11.42 g, which

is to say, most of the weights are between 9.40 (11.42g −2.02g = 9.40g) and

13.44 g (11.42g + 2.02g = 13.44g). These two numbers (9.40 and 13.44 g) pro-

vide an approximation of the limits of the main bunch of numbers. That is what it

means to say that most of the ﬂake weights are within one standard deviation of the

mean. Only a few fall farther than one standard deviation from the mean, that is,

farther than 2.02g from the mean. We can (and will) specify much more about this

way of using the standard deviation in later chapters. For the moment, sufﬁce it to

say that the standard deviation often provides just this kind of indication about the

spread of a batch.

The standard deviation does not behave so satisfactorily for the ﬂake weights

from Pit 1. Table

3.3 shows the calculation of the standard deviation for this batch.

When we ﬁrst compared these two batches of numbers (the weights of ﬂakes from

Pits 1 and 2) on the basis of the stem-and-leaf plots in Table

2.1, we noted that the

ﬂake weights from Pit 1 were (except for the high outlier) more closely bunched

up than those from Pit 2. The variance and the standard deviation for ﬂake weights

from Pit 1, however, are much larger than those for Pit 2, indicating a much larger

32 CHAPTER 3

Table 3.3. Calculating the Standard Deviation of Flake Weights from Pit 1 (Table

3.1)

Deviations from mean Squared deviations from mean

x (g) x−



x−X



28.6 16.27 264.71

14.2 1.87 3.50

12.9 0.57 0.32

11.8 −0.53 0.28

11.7 −0.63 0.40

11.4 −0.93 0.86

10.8 −1.53 2.34

10.5 −1.83 3.35

10.1 −2.23 4.97

9.2 −3.13 9.80

9.1 −3.23 10.43

7.6 −4.73 22.37

X = 12.33

∑

(x−X)=−0. 06

∑



x−X



= 323.33

(sum of squares)

∑

(x−

n −1

323.33

= 29.39

s =

√

29.39 = 5.42

spread for the ﬂakes from Pit 1 – exactly opposite the conclusion the stem-and-leaf

plot clearly indicates.

Table

3.3 shows very clearly why the variance and the standard deviation are so

large for Pit 1: the value for the one heaviest ﬂake deviates very strongly from the

mean. That one ﬂake is alone responsible for such a high sum of squares and thus

for such a high variance and standard deviation. Clearly, like the mean, the variance

and the standard deviation are not at all resistant to the effects of outliers. Using

the variance or the standard deviation as a numerical index of the spread of a batch,

then, is not a good idea at all if the batch has outliers.

Table

3.3 also provides a convenient illustration of why the mean lacks resistance

along the lines of observations made in Chapter

2. Think of the mean as the balance

point of a see-saw. The high outlier is like a person far out at one end of the see-saw.

In order to make the see-saw balance, the mean must be moved substantially toward

that end so that most of the numbers are on the other side. In that position it is far

off to one side of the center of the main bunch of numbers. It was precisely this

undesirable effect that we complained about in Chapter

THE TRIMMED STANDARD DEVIATION

The basic idea of the trimmed standard deviation is exactly like that of the trimmed

mean: outliers are excluded from the sample so that they will not have an undue

THE SPREAD OR DISPERSION OF A BATCH 33

Table 3.4. Calculating the 5% Trimmed Standard Deviation

of Flake Weights from Pit 1 (Table

3.1)

Original batch Winsorized batch Deviations from mean Squared deviations from mean

x (g) x

(g) x

−X



−X



28.614.22.95 8.70

14.214.22.95 8.70

12.912.91.65 2.72

11.811.80.55 0.30

11.711.70.45 0.20

11.411.40.15 0.02

10.810.8 −0.45 0.20

10.510.5 −0.75 0.56

10.110.1 −1.15 1.32

9.29.2 −2.05 4.20

9.19.1 −2.15 4.62

7.69.1 −2.15 4.62

= 11.25

∑

−X

)=0.00

∑



−X



= 36.16

(sum of squares)

∑

−X

)

n −1

36.16

= 3.29



(n−1)s

−1



(12−1)3.29

(10−1)

= 2.01

effect on the result. Calculation of the trimmed standard deviation, however,

becomes more involved. Instead of simply reducing the size of the batch by trim-

ming off numbers at the top and bottom, we must maintain the size of the batch by

replacing trimmed numbers with the numbers next in line for trimming. Table

3.4

shows this process for calculating a 5% trimmed standard deviation of the batch of

ﬂake weights from Pit 1. When, in Chapter

2, we calculated the 5% trimmed mean

of this same batch, we trimmed the single highest and lowest number from the batch.

This time, we replace the highest number with the next highest number (the high-

est number that remained in the batch after trimming). Thus 28.6g becomes 14.2 g.

Similarly, we replace the lowest number with the next lowest number (the lowest

number that remained in the batch after trimming). Thus 7.6g becomes 9.1 g.

The new batch that results is a Winsorized batch. The Winsorized variance is

calculated simply as the ordinary variance of this Winsorized batch. Note, though,

that the mean involved in calculating the Winsorized variance is the mean of the

Winsorized batch (which is not the same as the trimmed mean) and that the trimmed

standard deviation is not simply the square root of the variance of the Winsorized

batch. The trimmed standard deviation is derived from the Winsorized variance by

the following equation:



(n −1)s

−1

34 CHAPTER 3

Statpacks

Midspreads and standard deviations are pretty common fare in statpacks, and

statpacks are truly helpful here because calculating a standard deviation with

a calculator is time consuming (unless your calculator has a special key for

doing it automatically). Trimmed standard deviations, however, are much less

often provided for in statpacks. Just as in calculating a trimmed mean with your

statpack, you are likely to have to adjust the batch yourself ﬁrst. In this case

instead of replacing extreme values with missing data, you replace extreme

values with the adjacent nonextreme value in the data. Once this modiﬁcation

has been made, the batch has been Winsorized, and the variance your statpack

calculates on these numbers is the Winsorized variance, which you can con-

vert into the trimmed standard deviation with your calculator, as illustrated in

Table

3.4. Be sure not to forget this last step!

where s

is the trimmed standard deviation, n is the number of numbers in the

untrimmed batch, s

is the variance of the Winsorized batch, and n

is the number

of numbers in the trimmed batch.

Table

3.4 shows the full calculation of the trimmed standard deviation for the

ﬂake weights from Pit 1. Comparison of the calculation columns for Tables

3.3

and 3.4 shows quite clearly how the trimmed standard deviation avoids the over-

whelming effect of outliers.

Just as the trimmed mean can be calculated for various trimming fractions, so

can the trimmed standard deviation. In Chapter

2 we calculated a 25% trimmed

mean of the ﬂake weights from Pit 1 by trimming the three highest and the three

lowest numbers from the batch. Calculation of the 25% trimmed standard deviation

would begin with the creation of a Winsorized batch of 12 numbers in which the

three highest numbers were replaced with the fourth highest and the three lowest

numbers were replaced with the fourth lowest. From there on the calculation of the

variance of the Winsorized batch and the trimmed standard deviation follow exactly

the same path we have just taken for the 5% trimmed standard deviation. When a

trimmed mean and standard deviation are used, the trimming fraction should always

be speciﬁed.

WHICH INDEX TO USE

The range, the midspread, the standard deviation, and the trimmed standard devi-

ation are all numerical indexes of the spread of a batch. Just as we asked when to

use which index of the center of the batch, we must ask when to use which index

of spread. The answer parallels that given in Chapter

2. The range is very widely

understood but so badly affected by outliers that it is not often of much use. The mid-

spread has been emphasized in exploratory data analysis. It is not as familiar as it

THE SPREAD OR DISPERSION OF A BATCH 35

should be to archaeologists, but it is easy to ﬁnd and of wide utility for basic descrip-

tive purposes. Its resistance to the effects of outliers makes it particularly attractive.

The standard deviation is quite widely familiar (at least the term is, whether or not

many archaeologists are really at home with the concept or not). Its statistical prop-

erties, like those of the mean, will serve us well in the rest of this book. It is of such

importance that we will spend some effort on techniques to overcome its poor resis-

tance to the effects of outliers. Some of these techniques are based on the trimmed

standard deviation. Indexes of center and spread work together in pairs: the median

with the midspread, the mean with the standard deviation, or the trimmed mean with

the trimmed standard deviation (both with the same trimming fraction). Using the

median together with the standard deviation, for example, is like wearing one white

sock and one brown sock – only worse.

Table 3.5. Areas of Bronze Age Sites

Near Nanxiong

Site area (ha)

Early Bronze Age Late Bronze Age

1.810.4

1.05.9

1.912.8

0.64.6

2.37.8

1.24.1

0.82.6

4.28.4

1.55.2

2.64.5

2.14.1

1.74.0

2.311.2

2.46.7

0.65.8

2.93.9

2.09.2

2.25.6

1.95.4

1.14.8

2.64.2

2.23.0

1.76.1

1.15.1

6.3

12.3

3.9

36 CHAPTER 3

PRACTICE

Imagine you have conducted a regional survey of a small valley north of Nanxiong

and have carefully measured the areas of the surface scatters that indicate the Bronze

Age sites you encountered. The areas (in hectares) are given in Table

3.5.

1. Begin to explore these two batches of numbers with a back-to-back stem-and-leaf

plot.

2. Continue your exploration by calculating the median, the mean, and the 10%

trimmed mean for each batch and then the index of spread that corresponds to

each of these indexes of level. Which pair of indexes makes most sense to use

here? Why?

3. Based on the stem-and-leaf plots and the indexes of level and spread, what obser-

vations would you make about changes in site size from Early Bronze Age to Late

Bronze Age near Nanxiong?

Chapter 4

Comparing Batches

The Box-and-Dot Plot .............................................................................. 37

Removing the Level ................................................................................ 42

Removing the Spread............................................................................... 42

Unusualness......................................................................................... 45

Standardizing Based on the Mean and Standard Deviation ...................................... 48

Practice.............................................................................................. 49

We have already compared batches with back-to-back stem-and-leaf plots, but there

are quicker and more effective tools for graphically comparing batches. The numer-

ical indexes of the center and spread of a batch that we have discussed in the last two

chapters provide the basis for such tools. A standard way of plotting some of these

indexes in exploratory data analysis is called the box-and-dot plot (or the box-and-

whisker plot). The box-and-dot plot could, in theory, be based on any of the indexes

of center and spread, but in practice the median and the midspread are used. This

is such standard practice that a box-and-dot plot is automatically taken to represent

the median and midspread, and this convention should not be violated.

THE BOX-AND-DOT PLOT

Construction of a box-and-dot plot begins in exactly the same way as construction

of a stem-and-leaf plot: with the establishment of a scale along which the numbers

in the batch will lie. Figure

4.1 presents a stem-and-leaf plot of post hole diame-

ters from the Smith site (taken from Tables

1.7 and 1.8). To the right the stem is

converted into a scale for drawing a box-and-dot plot. A horizontal line is placed

next to 17.2cm on this scale to represent the median. Two more lines at 18.8 and

15.7 cm represent the upper and lower quartiles. These three lines are framed with

two vertical lines to form a box with a line across it near its center. This box graph-

ically represents the midspread, that is, the central half of the numbers – those that

fall between the two quartiles. The box provides a clear, clean picture of the most

important central bunch of numbers in the batch, one that is more quickly perceived

than the stem-and-leaf plot.

R.D. Drennan, Statistics for Archaeologists, Interdisciplinary Contributions

to Archaeology, DOI 10.1007/978-1-4419-0413-3

 Springer Science+Business Media, LLC 2004, 2009