Zelditch M.L. et al. Geometric Morphometrics for Biologists: A Primer

chap-08 4/6/2004 17: 25 page 198

198 GEOMETRIC MORPHOMETRICS FOR BIOLOGISTS

To form a bootstrap version of the t-test, we will use the bootstrap approach to simulate

the null hypothesis we wish to reject. This simple principle is the key to understanding

how to form your own bootstrap tests when asking novel statistical questions. The null

hypothesis of the t-test is that the means of the two groups are equal, which we can also

phrase as the hypothesis that the two groups in question came from a single underlying

distribution that was arbitrarily subdivided into two groups. If this were the case, any

difference between the means would arise simply by chance. So to test this hypothesis,

we assume that the null hypothesis is true – i.e. that X and Y were drawn from the same

population. Therefore we merge the two sets of observations (X and Y) into a common

pool of specimens (Z) and draw (with replacement) two bootstrap sets from Z, one of size

$N_X$ and one of size $N_Y$, and compute the differences in means between the two bootstrap

sets. This is repeated $N_{Bootstrap}$ times. We can then determine the number of times in which

the difference between the means of paired bootstrap sets exceeds the observed difference

between the means of X and Y. Expressed as a proportion of the total, we get an estimate of

the probability that the observed difference is due to chance; i.e. if the difference between

means of pairs of bootstrap samples exceeds the observed differences in 5% (or fewer)

of the total number of iterations, we can reject the null hypothesis that the means are

equal. This is simply another way of phrasing the statement that the observed difference is

statistically significant at a 5% confidence level if the observed difference between means

exceeds the 95th percentile of differences between means of the bootstrap sets.

A symbolic example of this merging and subsequent formation of two bootstrap sets

may help to develop an understanding of how the test operates. Suppose we have a set C

of five elements, and a set D of four elements:

$$C = \{C_1, C_2, C_3, C_4, C_5\} \tag{8.16}$$

$$D = \{D_1, D_2, D_3, D_4\} \tag{8.17}$$

The merged set, M, would have nine elements:

$$M = \{C_1, C_2, C_3, C_4, C_5, D_1, D_2, D_3, D_4\} \tag{8.18}$$

To draw two bootstrap sets out of M, we would form a list of five random integers (because

there are five elements in C), and the elements in M corresponding to this list would be

the elements in the bootstrap version of C:

$$\{7, 5, 1, 8, 5\} \tag{8.19}$$

$$C_{Bootstrap} = \{D_2, C_5, C_1, D_3, C_5\} \tag{8.20}$$

Note that two elements in $C_{Bootstrap}$ come from D. A second list of four integers is used

to form a bootstrap version of D:

$$\{2, 4, 9, 9\} \tag{8.21}$$

$$D_{Bootstrap} = \{C_2, C_4, D_4, D_4\} \tag{8.22}$$

The formation of the bootstrap versions of C and D reflects the null hypothesis that C

and D come from a common underlying distribution. The elements of C and D are thus

interchangeable.
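The symbolic example can be traced in a few lines of Python. This is only a sketch: the labels are strings standing in for specimens, the helper name `select` is our own, and the random seed is arbitrary.

```python
import random

# Merged pool M: positions 1-5 hold the elements of C, positions 6-9 those of D.
C = ["C1", "C2", "C3", "C4", "C5"]
D = ["D1", "D2", "D3", "D4"]
M = C + D


def select(pool, positions):
    """Return the pool elements at the given 1-based positions."""
    return [pool[p - 1] for p in positions]


# The two index lists used in the text. Repeated integers are allowed
# because bootstrap sets are drawn with replacement.
c_bootstrap = select(M, [7, 5, 1, 8, 5])
d_bootstrap = select(M, [2, 4, 9, 9])
print(c_bootstrap)  # ['D2', 'C5', 'C1', 'D3', 'C5']
print(d_bootstrap)  # ['C2', 'C4', 'D4', 'D4']

# In practice the integers are drawn at random (arbitrary seed).
rng = random.Random(0)
random_positions = [rng.randint(1, len(M)) for _ in C]
random_c_bootstrap = select(M, random_positions)
```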

The difference between means of the bootstrapped versions of C and D can be determined

by many repetitions, developing a bootstrap estimate of the distribution of the

differences between means produced by the null hypothesis (given the data). When we

carry out this bootstrap t-test on our numerical example, sets X and Y, we find that 268 of

1000 bootstrap sets (26.8%) have a difference between means as large or larger than that

between the means of X and Y. Thus, we cannot reject the null hypothesis that these samples

were drawn from populations with equal means, the difference between them being

due to chance. Using a t-test based on the normal distribution, we would have rejected

that null hypothesis. Because both samples appear to have non-normal distributions, as

discussed earlier, it seems reasonable to attribute the difference between results to violating

the assumption of normality.
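A minimal sketch of this bootstrap test, using the X and Y values listed in Equations 8.35 and 8.36, might look as follows. The function name, the seed and the use of absolute differences are our assumptions; the text does not say whether the sign of the difference is kept, so the proportion printed need not match the 26.8% reported above.

```python
import random

X = [2, 2, 3, 4, 2, 5, 3, 2, 6, 2, 3, 4, 6, 2, 1, 4, 3, 7, 2, 3,
     4, 4, 5, 8, 5, 2, 1, 3, 4, 4, 3]
Y = [2, 2, 3, 2, 4, 2, 3, 2, 8, 9, 2, 9, 3, 2, 3, 3, 3, 9]


def mean(values):
    return sum(values) / len(values)


def bootstrap_mean_test(x, y, n_bootstrap=1000, seed=0):
    """Proportion of bootstrap pairs whose difference in means is at
    least as large (in absolute value) as the observed difference."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pool = x + y  # merged set Z: the null hypothesis of one distribution
    count = 0
    for _ in range(n_bootstrap):
        boot_x = rng.choices(pool, k=len(x))  # drawn with replacement
        boot_y = rng.choices(pool, k=len(y))
        if abs(mean(boot_x) - mean(boot_y)) >= observed:
            count += 1
    return observed, count / n_bootstrap


observed, proportion = bootstrap_mean_test(X, Y)
print(round(observed, 3), proportion)
```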

The bootstrap method is probably the most popular of the computer-based methods

for estimating confidence intervals, and it is also one of the easiest to implement.

Permutation tests

Permutation tests pre-date the bootstrap test. They were introduced by R. A. Fisher in

the 1930s as a basis for supporting the ideas of the Student’s t-test rather than as a tool

for computation. With the advent of computers, permutation methods could be used

profitably for statistical inference. Permutation tests operate in much the same manner as

bootstrap tests, but differ in that they resample groups without replacement. This makes

permutation tests suitable for hypothesis testing, but not for the estimation of confidence

intervals (Efron and Tibshirani, 1993).

Again, we can look at a simple, abstract example of how a permutation set is formed to

get a sense of how the approach works, and how it differs from the bootstrap. Consider

two data sets C and D:

$$C = \{C_1, C_2, C_3, C_4, C_5\} \tag{8.23}$$

$$D = \{D_1, D_2, D_3, D_4\} \tag{8.24}$$

with sample sizes of five and four respectively. We form the merged set M of nine elements:

$$M = \{C_1, C_2, C_3, C_4, C_5, D_1, D_2, D_3, D_4\} \tag{8.25}$$

To produce permutation set versions of C and D, we want to resample M without

replacement. To do this, write a list of nine integers, then randomly permute it to form a

list L:

$$L = \{5, 2, 6, 8, 7, 3, 9, 4, 1\} \tag{8.26}$$

The first five values in L are the ordinal values of the elements in M, placed in the

permuted version of C:

$$C_{permutation} = \{C_5, C_2, D_1, D_3, D_2\} \tag{8.27}$$

The last four values in the list are the ordinal values of the elements in M that are placed

in the permuted version of D:

$$D_{permutation} = \{C_3, D_4, C_4, C_1\} \tag{8.28}$$

Note the different way that the permutation sets (Equations 8.27, 8.28) and bootstrap sets

(Equations 8.20, 8.22) are constructed from C and D.
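The permutation sets can be traced the same way. In this sketch (string labels and an arbitrary seed are our own choices) every position of M is used exactly once, which is what distinguishes it from the bootstrap draw.

```python
import random

C = ["C1", "C2", "C3", "C4", "C5"]
D = ["D1", "D2", "D3", "D4"]
M = C + D

# The permuted index list L from the text: each position appears exactly once.
L = [5, 2, 6, 8, 7, 3, 9, 4, 1]
c_permutation = [M[i - 1] for i in L[:len(C)]]
d_permutation = [M[i - 1] for i in L[len(C):]]
print(c_permutation)  # ['C5', 'C2', 'D1', 'D3', 'D2']
print(d_permutation)  # ['C3', 'D4', 'C4', 'C1']

# A fresh permutation is obtained by shuffling the positions 1..9.
rng = random.Random(0)
positions = list(range(1, len(M) + 1))
rng.shuffle(positions)
```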

To carry out a permutation test of the hypothesis that the means of the two groups

X and Y (see Equations 8.13 and 8.14) are equal, we would first compute the difference

between the means of the two groups, which have sample sizes of $N_X = 31$ and $N_Y = 18$.

The second step is to merge the two data sets into a single larger one and form a series of

paired permutation sets, each drawn from the merged data set. The first permutation set

in each pair, containing $N_X$ specimens, is drawn randomly without replacement from the

merged set. The second permutation set of the pair contains the remaining $N_Y$ elements of

the merged data set. (No element of the original sets appears twice in the paired permutation

sets, and none is omitted.) The difference between means of the two permutation sets is

then calculated, and repeated for $N_{Permutation}$ sets. The proportion of times in which the

difference between the means of the paired permutation sets exceeds that between the

original data sets is taken as the probability that the observed value could have arisen by

a random splitting of a single underlying distribution.

The permutation test of the difference between the means of sets X and Y indicates

that 21.3% of the permuted sets had a difference in means equal to or greater than the

observed difference of 0.428, so we cannot reject the null hypothesis that the means are

equal at a 5% level of confidence. The permutation test has produced results agreeing with

the bootstrap test (in which 26.8% of the bootstrap sets had a difference between means

as large or larger than the observed data set).
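The procedure can be sketched in Python with the X and Y values of Equations 8.35 and 8.36. As with any such sketch, the function name and seed are ours, and comparing absolute differences is an assumption, so the printed proportion need not equal the 21.3% quoted above.

```python
import random

X = [2, 2, 3, 4, 2, 5, 3, 2, 6, 2, 3, 4, 6, 2, 1, 4, 3, 7, 2, 3,
     4, 4, 5, 8, 5, 2, 1, 3, 4, 4, 3]
Y = [2, 2, 3, 2, 4, 2, 3, 2, 8, 9, 2, 9, 3, 2, 3, 3, 3, 9]


def mean(values):
    return sum(values) / len(values)


def permutation_mean_test(x, y, n_permutation=1000, seed=0):
    """Proportion of random splits of the merged data whose difference in
    means is at least as large (in absolute value) as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pool = x + y
    count = 0
    for _ in range(n_permutation):
        shuffled = rng.sample(pool, len(pool))  # reordering, no replacement
        perm_x = shuffled[:len(x)]   # first N_X specimens
        perm_y = shuffled[len(x):]   # the remaining N_Y specimens
        if abs(mean(perm_x) - mean(perm_y)) >= observed:
            count += 1
    return observed, count / n_permutation


observed, proportion = permutation_mean_test(X, Y)
print(round(observed, 3), proportion)
```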

It is possible to form permutation tests for a wide variety of statistical hypotheses in a

manner similar to the bootstrap (see Efron and Tibshirani, 1993; Good, 1994). However,

there is an important difference between the permutation and bootstrapping approaches

due to fundamental differences in how they operate. Permutation tests are not suited to

the estimation of confidence intervals because the standard deviation of the estimates of

a parameter (such as a mean or median) is not a reliable estimate of the standard error

in that parameter. Rather, the permutation test yields an estimate of the range of parameter

values possible under the null model simulated by the test. In contrast, the standard

deviation of the bootstrap estimates of the same parameter yields a reliable estimate of

its standard error because the bootstrap resampling simulates a repetition of the process

of selecting specimens from the population (Efron and Tibshirani, 1993). When used for

hypothesis testing, both methods tend to give very similar results, so it is difficult (and

perhaps unnecessary) to determine which approach is preferable in most cases. To some

extent, the choice between them appears to be a matter of preference among writers of

software. There are some reasons to think that permutation tests may yield a more exact

achieved significance level (ASL) than bootstrap approaches (Efron and Tibshirani, 1993),

but this is at the cost of precluding estimates of confidence intervals (or standard errors)

on the statistics involved.

The jackknife

Jackknife methods (Quenouille, 1949; Tukey, 1958) also preceded bootstrap methods,

and, to some extent, have been supplanted by them. Jackknife estimates are obtained by

resampling such that one element is left out at a time (hence the name – to use a jackknife,

you have to leave one out, either one blade or one specimen). If there are N specimens in

a sample, then it is possible to form N jackknife data sets, each with N − 1 specimens. If

we again look at the set C:

$$C = \{C_1, C_2, C_3, C_4, C_5\} \tag{8.29}$$

The five possible jackknife versions of C are:

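$$\{C_2, C_3, C_4, C_5\} \tag{8.30}$$

$$\{C_1, C_3, C_4, C_5\} \tag{8.31}$$

$$\{C_1, C_2, C_4, C_5\} \tag{8.32}$$

$$\{C_1, C_2, C_3, C_5\} \tag{8.33}$$

$$\{C_1, C_2, C_3, C_4\} \tag{8.34}$$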

Jackknife data sets will always be more similar to the original data set than bootstrap

sets are because the bootstrap offers a greater variety of ways of resampling the data. The

jackknife may be viewed as an approximation to the bootstrap (Efron and Tibshirani,

1993), and it is a good approximation when the changes in the statistic are smooth or

linear with respect to changes in the data. The mean is a linear statistic, but the median

is not (because the median may change abruptly as observations are added or subtracted

from the sample); therefore the jackknife estimate of the mean will not differ much from the

bootstrap estimate of the mean, but their estimates of the median may differ considerably.
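The contrast between the mean and the median can be checked with a short sketch. The standard-error formula used here is the usual jackknife one and is not given in the text above; the function names are ours, and X is the set from Equation 8.35.

```python
from math import sqrt
from statistics import mean, median, stdev

X = [2, 2, 3, 4, 2, 5, 3, 2, 6, 2, 3, 4, 6, 2, 1, 4, 3, 7, 2, 3,
     4, 4, 5, 8, 5, 2, 1, 3, 4, 4, 3]


def jackknife_sets(data):
    """Return the N leave-one-out versions of data, each with N - 1 items."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]


def jackknife_standard_error(data, statistic):
    """Usual jackknife standard error of `statistic` (an assumed formula,
    not one stated in this chapter)."""
    n = len(data)
    estimates = [statistic(s) for s in jackknife_sets(data)]
    centre = sum(estimates) / n
    return sqrt((n - 1) / n * sum((e - centre) ** 2 for e in estimates))


print(jackknife_sets(["C1", "C2", "C3", "C4", "C5"]))
print(jackknife_standard_error(X, mean), stdev(X) / sqrt(len(X)))
print(jackknife_standard_error(X, median))
```

For the mean, the jackknife value agrees with the analytic standard error. For these X values every leave-one-out median is the same, so the jackknife standard error of the median comes out as zero, which illustrates why the jackknife is a poor guide for statistics that change abruptly.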

There are some approaches to combining the bootstrap and the jackknife (see particularly

Efron, 1992; Efron and Tibshirani, 1993, Chapter 19, on assessing the error of

bootstrap estimates), but otherwise the jackknife appears to offer few advantages over the

bootstrap.

Monte Carlo methods

Monte Carlo methods compare the value of an observed statistic to the range of values

expected under a given null hypothesis, assuming a model of the populations involved.

Like analytical statistical methods, Monte Carlo methods require making assumptions

about the nature of the distribution from which populations are drawn. They then fit

parameters of the distributional models to the observed samples. In contrast, analytic

statistical approaches use algebraic derivations to estimate the values of statistics (and

standard errors in those statistics) based on the nature of the underlying distributions.

The distinction is that Monte Carlo approaches generate random data sets based on the

parameters and distribution of the model; those random data sets are drawn from model

distributions having the same sample size as the original one. The distribution of the

statistic of interest (estimated over many computer-generated Monte Carlo sets) is used to

estimate the mean and standard deviation of that statistic, under the null model and the

model distribution used. Monte Carlo methods can be used both for hypothesis testing

and for generating confidence intervals.

Monte Carlo methods use numerical simulations to avoid the need for extensive algebraic

computations and approximations. It may often be easier to program a Monte Carlo

simulation than to determine analytically the distribution of an intricate statistical function,

particularly when the statistic is not a linear function. Because it is necessary to assume

a model of the distributions of the samples, the Monte Carlo method shares most of the

primary weaknesses of analytic statistics; if the observed distribution departs substantially

from the model, the Monte Carlo sets will not represent the actual system of interest. One

useful feature of the Monte Carlo method is the ability to determine the effect of different

distributional models (the ones typically used are the uniform, normal or Gaussian, and

Poisson) on the range of values estimated by the Monte Carlo sets. The comparison of

observed distributions to those produced by Monte Carlo methods is a powerful approach

to hypothesis testing.

For example, if we wish to determine the significance of the observed difference in the

means of sets X and Y:

X ={2, 2, 3, 4, 2, 5, 3, 2, 6, 2, 3, 4, 6, 2, 1, 4, 3, 7, 2, 3, 4, 4, 5, 8, 5, 2, 1, 3, 4, 4, 3} (8.35)

Y ={2, 2, 3, 2, 4, 2, 3, 2, 8, 9, 2, 9, 3, 2, 3, 3, 3, 9} (8.36)

we will test the null hypothesis that the two sets (X and Y) came from the same underlying

distribution, with the observed difference between them being due to a random assignment

of specimens into groups. To form the Monte Carlo set, we will assume that the single

underlying distribution is normal. We then estimate the mean and standard deviation of

this underlying distribution by merging the data sets into a single group. The mean of the

single distribution is 3.67 and the standard deviation is 2.1. To determine the significance

of the observed difference in the means of the two groups, we generate a series of paired

Monte Carlo sets, one with a sample size $N_X = 31$, one with a sample size $N_Y = 18$, and

we determine the difference between the two means. We then determine the proportion

of $N_{Monte Carlo}$ sets in which the difference between the means of the paired Monte Carlo

sets exceeds that observed between the means of the original data sets.

For the sets X and Y above, the Monte Carlo sets were generated under the assumption

that both samples were drawn from the same normal distribution, with a mean of 3.67 and

a standard deviation of 2.1 (the mean and standard deviation of the combined data sets).

In 480 of 1000 pairs of Monte Carlo sets (48%), the difference between the means of the

paired Monte Carlo sets exceeds the observed difference between the means of the original

data sets, thus the null hypothesis of a single underlying normal distribution cannot be

rejected. It should be noted that the combined data set (of all specimens in X and Y) is

probably not normally distributed, so we might want to repeat the Monte Carlo test using

other models of the underlying distribution.
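A sketch of this Monte Carlo test is given below. It takes the sample sizes, the observed difference (0.428) and the fitted normal model (mean 3.67, standard deviation 2.1) from the text; the function name, the seed and the use of absolute differences are our assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def monte_carlo_mean_test(n_x, n_y, observed, mu, sigma,
                          n_sets=1000, seed=0):
    """Proportion of paired normal samples whose difference in means is
    at least as large (in absolute value) as the observed difference."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_sets):
        # Both sets come from the same model distribution (the null model).
        sim_x = [rng.gauss(mu, sigma) for _ in range(n_x)]
        sim_y = [rng.gauss(mu, sigma) for _ in range(n_y)]
        if abs(mean(sim_x) - mean(sim_y)) >= observed:
            count += 1
    return count / n_sets


proportion = monte_carlo_mean_test(31, 18, 0.428, mu=3.67, sigma=2.1)
print(proportion)
```

With these inputs the proportion should come out near one half, in line with the 48% reported above. Swapping `rng.gauss` for another generator is one way to repeat the test under a different model of the underlying distribution.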

Monte Carlo simulations are particularly useful for testing different hypothetical situations

when the underlying distributions are believed to be well known. Monte Carlo

methods can be used in cases when bootstrap methods cannot, such as to estimate the

effect of increasing the sample size on the estimated variance; Monte Carlo simulations

are not limited by the observed sample sizes (as bootstrap methods are).

Example: computer-based tests and regression models

To this point, we have focused on t-tests, but computer-based methods are useful for a

wide variety of tests. To develop a more general understanding of these methods, we now

show how bootstrap and permutation methods can be used in regression analysis (the

subject of Chapter 10). Both approaches can be used to determine if one set of measured

variables Y (the dependent variable) has a statistically significant dependence on a second

set of measured variables X (the independent variable). If we have N observations, each

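of a pair of measurements $(X_i, Y_i)$, then the typical linear regression model is:

$$Y_i = A + BX_i + \varepsilon_i \tag{8.37}$$

The regression slope, B, is given by:

$$B = \frac{SS_{XY}}{SS_{XX}} \tag{8.38}$$

The intercept term, A, is given by:

$$A = \langle Y \rangle - B\langle X \rangle \tag{8.39}$$

where $\langle X \rangle$ and $\langle Y \rangle$ are the expected values (means) of the $X_i$ and $Y_i$ values, and

$$SS_{XX} = \sum_{i=1}^{N} (X_i - \langle X \rangle)^2 \tag{8.40}$$

$$SS_{XY} = \sum_{i=1}^{N} (X_i - \langle X \rangle)(Y_i - \langle Y \rangle) \tag{8.41}$$

These are the values of A and B that minimize the summed squared residuals ($\varepsilon_i$). This sum of

squared error terms is:

$$Error = \sum_{i=1}^{N} (Y_i - A - BX_i)^2 = \sum_{i=1}^{N} (\varepsilon_i)^2 \tag{8.42}$$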

under the assumption that the residuals are independently and identically normally

distributed.
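The slope, intercept and summed squared error can be computed directly from these sums. The sketch below uses five made-up (X, Y) pairs purely for illustration; they are not data from this book, and the function names are ours.

```python
def fit_line(x, y):
    """Least-squares intercept A and slope B from the sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def squared_error(x, y, intercept, slope):
    """Sum of squared residuals for a given line."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical illustration data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
print(a, b, squared_error(x, y, a, b))

# Any other line gives a larger summed squared error.
print(squared_error(x, y, a, b + 0.1) > squared_error(x, y, a, b))  # True
```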

To show that there is a statistically significant dependence of Y on X, it is sufficient to

show that the confidence interval on the slope excludes zero. This is equivalent to showing

that there is a non-zero correlation between Y and X, which may be tested using the

squared value of the correlation coefficient ($R^2$) between X and Y, which indicates the

fraction of the variance in the dependent variable (Y) that is explained by the independent

variable (X). The expression for $R^2$ is:

$$R^2 = \frac{SS_{XY}^2}{SS_{XX}\,SS_{YY}} \tag{8.43}$$

where

$$SS_{YY} = \sum_{i=1}^{N} (Y_i - \langle Y \rangle)^2 \tag{8.44}$$

It is very common to interpret high $R^2$ values as being indicative of high explanatory power

in a regression model. There is a method of testing whether an $R^2$ value is statistically

significant (under the assumption of normality of the residuals), by the expression:

$$z = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right) \tag{8.45}$$

which is a normally distributed variable, with variance equal to 1/(N − 3), where N is the

sample size.
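As a rough sketch of how this analytic test might be applied, the code below computes R from the sums of squares and compares the transformed value with a normal distribution centred on zero. Centring on zero (the null hypothesis of no correlation) and the two-sided p-value are our assumptions, and the data are made up for illustration.

```python
import math


def correlation(x, y):
    """Correlation coefficient R from the sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def fisher_z_test(r, n):
    """Transformed correlation and a two-sided p-value, assuming the
    transformed value is normal with mean 0 and variance 1/(n - 3)."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    standardized = z * math.sqrt(n - 3)
    p_value = math.erfc(abs(standardized) / math.sqrt(2))
    return z, p_value


# Hypothetical illustration data (not from the text).
x = list(range(1, 21))
y = [0.5 * xi + (1.0 if i % 2 else -1.0) for i, xi in enumerate(x)]

r = correlation(x, y)
print(r ** 2, fisher_z_test(r, len(x)))
```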

The significance of the slope can be assessed by a permutation test. The objective is

to determine the range of slopes that could be generated by random permutations of the

associations among X and Y values. Thus, we again adopt the strategy of assuming that the

null hypothesis is true (which, in this case, is that the associations among X and Y values

are random). The associations of the $X_i$ values with the $Y_i$ values are then randomized, generating

a permutation set of paired X and Y values with the same distribution of X and Y values

as in the data, but with randomized combinations of X and Y. The regression model is

then fitted to each permutation set, and the slope (or correlation coefficient) is calculated.

The distribution of the regression slopes (or the correlation coefficients) generated by the

permutation sets can be used to determine if the observed regression slope (or correlation

coefficient) could have been produced by a random association among X and Y variables.

If the observed slope (or correlation coefficient) is outside the 95% confidence interval of

the permutation sets, then we can reject the null hypothesis that the slope (or correlation

coefficient) does not differ from zero. Note that the permutation test estimates the range of

slopes (or correlation coefficients) produced by the null model, not by the observed data.

Thus we reject the null hypothesis by showing that the observed statistic lies outside the

range of the values predicted by the null model.
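A sketch of the permutation test for a regression slope follows. The data are made up for illustration, the function names and seed are ours, and the comparison of absolute slopes (a two-sided test) is an assumption.

```python
import random


def slope(x, y):
    """Least-squares regression slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_xx


def permutation_slope_test(x, y, n_permutation=1000, seed=0):
    """Proportion of permutation sets whose slope is at least as far
    from zero as the observed slope."""
    rng = random.Random(seed)
    observed = slope(x, y)
    count = 0
    for _ in range(n_permutation):
        # Randomize which Y goes with which X; both marginal
        # distributions are unchanged.
        shuffled_y = rng.sample(y, len(y))
        if abs(slope(x, shuffled_y)) >= abs(observed):
            count += 1
    return observed, count / n_permutation


# Hypothetical illustration data (not from the text).
x = list(range(1, 21))
y = [0.5 * xi + (1.0 if i % 2 else -1.0) for i, xi in enumerate(x)]

observed, proportion = permutation_slope_test(x, y)
print(observed, proportion)
```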

To carry out a bootstrap test of the significance of the regression line, two approaches

are available: one is to bootstrap (resample with replacement) the paired observations

$(X_i, Y_i)$; the other is to bootstrap the residuals from the regression. When bootstrapping

specimens, we form bootstrap sets by sampling (with replacement) from the paired specimen

values $(X_i, Y_i)$ to form a bootstrap set. The regression model is fitted and the slope (or

correlation coefficient) is determined for each bootstrap set, forming a bootstrap estimate

of the confidence intervals for the slope (or correlation coefficient). This yields a confidence

interval on the slope itself, so that if it excludes zero, we can reject a null hypothesis that

the regression slope (or correlation) is zero.
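Bootstrapping the paired observations can be sketched as below. The percentile interval used here is one common choice and is our assumption, as the text does not name a particular method; the data are made up for illustration.

```python
import random


def slope(x, y):
    """Least-squares regression slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_xx


def bootstrap_slope_interval(x, y, n_bootstrap=1000, level=0.95, seed=0):
    """Percentile confidence interval for the slope, resampling (X, Y) pairs."""
    rng = random.Random(seed)
    pairs = list(zip(x, y))
    slopes = []
    while len(slopes) < n_bootstrap:
        sample = rng.choices(pairs, k=len(pairs))  # pairs stay together
        boot_x, boot_y = zip(*sample)
        if len(set(boot_x)) < 2:
            continue  # slope undefined when all X values coincide
        slopes.append(slope(boot_x, boot_y))
    slopes.sort()
    lower = slopes[int((1 - level) / 2 * n_bootstrap)]
    upper = slopes[int((1 + level) / 2 * n_bootstrap) - 1]
    return lower, upper


# Hypothetical illustration data (not from the text).
x = list(range(1, 21))
y = [0.5 * xi + (1.0 if i % 2 else -1.0) for i, xi in enumerate(x)]

lower, upper = bootstrap_slope_interval(x, y)
print(lower, upper)  # an interval excluding zero rejects a zero slope
```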

The alternative is to bootstrap the residuals, by first determining the residuals from the

regression, and the Y values that are predicted by the regression model for each X value:

$$Y_{predicted} = A + BX \tag{8.46}$$

Then the residuals are randomly combined with the paired $X_i$ and $Y_{predicted}$ values, both

of which are resampled (with replacement). This approach produces a wider variety of

possible paired values of $X_i$ and $Y_i$; it can be thought of as bootstrapping the variable

part of the distribution, independently of the portion that is dependent on X. The range

of slopes (or correlation coefficients) is determined over many bootstrap sets; if the 95%

confidence interval for the slope (or correlation coefficient) excludes zero, we can infer

that there is a statistically significant dependence of Y on X at a 5% confidence level.
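One common reading of the residual bootstrap is sketched below: each $X_i$ keeps its predicted Y value, and a residual drawn with replacement is added to it. This is our interpretation of the description above rather than a statement of the authors' exact procedure, and the data are made up for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares intercept A and slope B."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = ss_xy / ss_xx
    return mean_y - slope * mean_x, slope


def residual_bootstrap_slopes(x, y, n_bootstrap=1000, seed=0):
    """Sorted slopes from data sets built as predicted Y plus a resampled residual."""
    rng = random.Random(seed)
    intercept, slope = fit_line(x, y)
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    slopes = []
    for _ in range(n_bootstrap):
        # Only the residual part is redrawn (with replacement).
        new_y = [pi + rng.choice(residuals) for pi in predicted]
        slopes.append(fit_line(x, new_y)[1])
    return sorted(slopes)


# Hypothetical illustration data (not from the text).
x = list(range(1, 21))
y = [0.5 * xi + (1.0 if i % 2 else -1.0) for i, xi in enumerate(x)]

slopes = residual_bootstrap_slopes(x, y)
lower, upper = slopes[25], slopes[974]  # approximate 95% percentile limits
print(lower, upper)
```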

The discussion of how a permutation test is used to determine the statistical significance

of a regression slope serves as a useful illustration of the differences in approach between

bootstrap and permutation methods. In the permutation method, the approach is to estimate

the confidence interval under the null model, given the distribution of observed data.

Thus, if the observed statistic is outside the confidence interval of the null, the observed

statistic is judged to be significant. In contrast, the bootstrap approach estimates the range

of the statistic on the observed data (rather than the range under the null). Permutation

tests almost always focus on estimating distributions under the assumption that the null

model is true, whereas bootstrap methods can be used to estimate the distribution of a

statistic either over the observed data or under an assumption that the null is true.

Issues common to all computer-based methods

Statistical power

When evaluating the utility of statistical tests we are faced with Type I error (i.e. falsely

rejecting the null hypothesis when it is true), which is controlled by setting the alpha level

of the test. Because that is under control, statistical tests cannot be said to differ in their

rates of Type I error. In contrast, statistical tests can differ in their rates of Type II error

(i.e. failure to reject the null hypothesis when it is false and an alternative is true). The

rate of Type II error depends on the nature of the test, the null hypothesis and the alternative

hypotheses used. The power of a statistical test is its ability to distinguish between the

false null hypothesis and the true alternative, and it is sometimes expressed as 1 minus the

rate of Type II error.

Estimating the power of statistical tests turns out to be both difficult and neglected

by many researchers. Some work indicates that permutation, bootstrap and analytic

tests have equivalent statistical power when the data meet the requirements of the analytic

tests (Hoeffding, 1952; Robinson, 1973; Romano, 1989; Manly, 1997). Edgington

(1995) reports higher statistical power for randomization tests when there are violations

of the assumptions of the analytic statistical tests. Efron and Tibshirani (1993) present

an approach to estimating power, given a specific sample size. The approach offered by

Sheets and Mitchell (2001) is to use Monte Carlo methods to estimate the rates of Type

II error under several plausible alternatives to the null hypothesis. Despite the attendant

difficulty in estimating the statistical power of different tests, computer-based tests seem

to have at least as much statistical power as the more familiar analytical tests.

How many repetitions?

Regardless of the method used, the researcher is always faced with the question of how

many replications or repetitions should be made. We want a small bias and standard

deviation, but it is not clear how many replications are required to achieve this end.

The number of independent bootstrap samples that one may form out of N specimens

is $(2N - 1)!/[N!\,(N - 1)!]$ (Efron and Tibshirani, 1993), which is over 90,000 for N = 10

specimens. In most cases, even thousands of bootstrap replicates will not come close to

exhausting all possible bootstrap sets. Typically, a modest subset of all possible sets is adequate

for most statistical questions. Estimates of standard errors can usually be produced

using only 100 or fewer bootstrap sets (Efron and Tibshirani, 1993), but reliable estimates

of confidence intervals may require using many more. It does not appear that there is

complete consensus on this issue (see Efron, 1992; Efron and Tibshirani, 1993; Jackson

and Somers, 1989; Manly, 1997), but it does seem that more repetitions are necessary for

estimating confidence intervals, where we must estimate a specific percentile point value,

than either for hypothesis testing (see Manly, 1997) or for estimating of standard errors

(Efron and Tibshirani, 1993). If computer time is not an issue, a range of 1000 to 2000

bootstrap tests is recommended for estimating a 95% confidence interval on a parameter

(Efron, 1987; Efron and Tibshirani, 1993). When the time necessary to complete a calculation

is a factor, one approach is to increase the number of bootstrap sets steadily until arriving at a

value that is stable with respect to further increases in that number. The stability criterion is

perhaps most applicable to hypothesis testing, where we may not need to know the exact

confidence level of the observed statistic – only that we can (or cannot) reject the null

hypothesis at a 5% confidence level.

For example, if we run a bootstrap t-test and find that in 100 bootstrap tests the difference

in means exceeds the observed difference 40 times (yielding p = 0.40), it is probably

safe to state that we cannot reject the null at a 5% confidence level. A repetition of the

bootstrap procedure might yield a slightly different confidence level, even changing by

several percentage points, but it is highly unlikely to yield p < 0.05. Similarly, in such a

bootstrap t-test, if the difference in bootstrap means never exceeds the observed difference

in means (in 100 bootstrap sets), a single repetition of the bootstrap calculations at 100

bootstrap sets should be enough to confirm that p < 0.05 is reasonable. The difficulty arises when

the bootstrap estimate of the p-value is very close to the desired confidence level (p = 0.05

in this example). In such a case, a large number of bootstrap sets may be warranted.

It is worth remembering that for $N_{Bootstrap}$ sets, the smallest confidence level we could

possibly estimate is $1/N_{Bootstrap}$; e.g. for 1000 bootstraps, the smallest confidence level

we could ever hope to estimate is 1/1000 = 0.001. The estimate of the confidence interval

at 0.001, using 1000 bootstrap sets, is essentially based on the value obtained from a

single bootstrap set (the one producing the largest or smallest value out of the 1000 sets

examined). This suggests that it would be more appropriate to use 10,000 to 20,000 sets

to obtain an estimate of the confidence interval at 0.001, so that the estimate is based on

the results of 10 to 20 bootstrap sets (the 10 or 20 most extreme values out of the 10,000

or 20,000 total sets). In most cases it is not necessary to estimate confidence intervals at

0.1% (0.001); 5% confidence intervals are the standard, and are achievable with lower

numbers of bootstraps.
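The arithmetic in this section is easy to check. The sketch below (function names are ours) counts the distinct bootstrap samples for N specimens, gives the smallest proportion that a given number of bootstrap sets can resolve, and the number of sets needed for a tail to contain a chosen number of extreme values.

```python
from math import comb


def distinct_bootstrap_samples(n):
    """Number of distinct bootstrap samples of size n from n specimens,
    (2n - 1)! / (n! (n - 1)!)."""
    return comb(2 * n - 1, n)


def smallest_proportion(n_sets):
    """Smallest non-zero proportion that n_sets bootstrap sets can resolve."""
    return 1 / n_sets


def sets_needed(extreme_sets, one_in):
    """Sets required so that a tail of probability 1/one_in is expected
    to hold about `extreme_sets` bootstrap values."""
    return extreme_sets * one_in


print(distinct_bootstrap_samples(10))                 # 92378
print(smallest_proportion(1000))                      # 0.001
print(sets_needed(10, 1000), sets_needed(20, 1000))   # 10000 20000
```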

When in doubt about the number of bootstrap sets that should be used to establish a

particular confidence interval, the safest approach is to repeat the analysis after doubling

the number of bootstrap sets (to determine whether that doubling alters the confidence

level). This doubling should be repeated until the estimate stabilizes; the iterative approach

may be time-consuming, but it is preferable to a blind reliance on a rule of thumb.
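The doubling strategy can be sketched as a simple loop. The starting number of sets, the tolerance used to decide that the estimate has stabilized, and the upper limit are all hypothetical choices made for illustration; X and Y are the sets from Equations 8.35 and 8.36, and absolute differences are again an assumption.

```python
import random

X = [2, 2, 3, 4, 2, 5, 3, 2, 6, 2, 3, 4, 6, 2, 1, 4, 3, 7, 2, 3,
     4, 4, 5, 8, 5, 2, 1, 3, 4, 4, 3]
Y = [2, 2, 3, 2, 4, 2, 3, 2, 8, 9, 2, 9, 3, 2, 3, 3, 3, 9]


def mean(values):
    return sum(values) / len(values)


def bootstrap_p_value(x, y, n_sets, rng):
    """Bootstrap proportion of differences at least as large as observed."""
    observed = abs(mean(x) - mean(y))
    pool = x + y
    count = 0
    for _ in range(n_sets):
        boot_x = rng.choices(pool, k=len(x))
        boot_y = rng.choices(pool, k=len(y))
        if abs(mean(boot_x) - mean(boot_y)) >= observed:
            count += 1
    return count / n_sets


def double_until_stable(x, y, start=100, tolerance=0.01,
                        max_sets=12800, seed=0):
    """Double the number of bootstrap sets until two successive
    estimates differ by no more than `tolerance` (or max_sets is hit)."""
    rng = random.Random(seed)
    n_sets = start
    previous = bootstrap_p_value(x, y, n_sets, rng)
    while n_sets < max_sets:
        n_sets *= 2
        current = bootstrap_p_value(x, y, n_sets, rng)
        if abs(current - previous) <= tolerance:
            return n_sets, current
        previous = current
    return n_sets, previous


n_sets, p_value = double_until_stable(X, Y)
print(n_sets, p_value)
```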

Summary

Computer-based statistics provide a useful alternative to the more familiar analytical statistical

approaches, particularly when the observed distribution departs substantially from the

assumptions of analytic models, or when no analytic estimate is available for the confidence

interval of a specific statistic needed for the analysis. The performance of computer-based

methods appears to be equal to that of analytic methods, although the greater flexibility

of computer-based methods comes at the cost of increased computational time (and the

need to produce specialized software for specific tests).

References

Edgington, E. S. (1995). Randomization Tests. Marcel Dekker.

Efron, B. (1979). Computers and the theory of statistics, thinking the unthinkable. Society for

Industrial and Applied Mathematics Review, 21, 460–480.

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical

Association, 82, 171–185.

Efron, B. (1992). Jackknife-after-bootstrap standard errors and influence functions. Journal of the

Royal Statistical Society Series B, Methodological, 54, 83–127.

Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall.

Good, P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing

Hypotheses. Springer-Verlag.

Hoeffding, W. (1952). The large-sample power of tests based on permutation of observations. Annals

of Mathematical Statistics, 23, 169–192.

Jackson, D. A. and Somers, K. M. (1989). Are probability estimates from the permutation models

of Mantel’s test stable? Canadian Journal of Zoology, 67, 766–779.

Manly, B. F. (1997). Randomization, Bootstrap and Monte Carlo Methods in Biology.

Chapman & Hall.

Quenouille, M. (1949). Approximate tests of correlation in time series. Journal of the Royal

Statistical Society B, 11, 18–44.

Raspé, R. E. (1785). Baron Münchhausen’s narrative of his Marvelous Travels and Campaigns in

Russia.

Robinson, J. (1973). Large-sample power of permutation tests for randomization models. Annals of

Statistics, 1, 291–296.

Romano, J. P. (1989). Bootstrap and randomization tests of some non-parametric hypotheses. Annals

of Statistics, 17, 141–159.

Sheets, H. D. and Mitchell, C. E. (2001). Why the null matters: statistical tests, random walks and

evolution. Genetica, 112, 105–125.

Sokal, R. R. and Rohlf, F. J. (1995). Biometry: The Principles and Practice of Statistics in Biological

Research, 3rd edn. Freeman.

Tukey, J. W. (1958). Bias and confidence in not quite large samples. (Abstract) Annals of Mathematical

Statistics, 29, 614.