Predictive Quantitative Structure–Activity Relationships Modeling 221
new method, namely training and test set resampling, a nonparametric technique that
can be used to estimate statistics and confidence intervals for a population of objects
when only a representative subset of the population (the dataset used to build the
models) is available. “Resampling” means multiple random divisions of the dataset
into training and test sets of the same sizes as those used for building the models.
The training sets are
used for the calculation of internal prediction accuracy, such as cross-validation q²
for continuous problems, or CCR and/or A for classification or category QSAR (see
Formulas 7.8 through 7.18). The corresponding test sets are used to calculate, for
continuous problems, the correlation coefficient between the predicted and observed
response variables, as well as the coefficients of determination and the slopes of the
regression lines (through the origin) of predicted versus observed and of observed
versus predicted response variables. In the case of a classification or category
response variable, the test sets are used to estimate the total classification accuracy
as well as the classification accuracy for each class or category. Prior to the
prediction of compounds from the test
set, the AD for the corresponding training set should be defined (see Section 7.5).
Prediction should be made only for those compounds of the test sets that are within
the ADs of the training sets. We argue that predictive models should have similar
statistics to those obtained with the initial training and test sets. Large differences
between model statistics would indicate that the model is unstable. Average statistics
values obtained using the training and test set resampling approach are expected to
be better estimates of the population statistics than those obtained with the initial
training and test sets. It will also be possible to estimate confidence intervals of the
model statistics, which are important characteristics of the model stability.
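The resampling procedure described above can be sketched in code. This is only an illustrative sketch, not the authors' implementation: the function names (`r_squared`, `resample_statistics`), the one-descriptor least-squares model, and the use of test-set R² as the external statistic are assumptions made to keep the example runnable; in practice the model and statistics (q², CCR, etc.) would be those of the QSAR workflow.

```python
# Illustrative sketch of training/test set resampling. The helper names,
# the one-descriptor least-squares model, and the choice of test-set R^2
# as the statistic are hypothetical stand-ins for a real QSAR setup.
import random
import statistics

def r_squared(y_true, y_pred):
    """Coefficient of determination between observed and predicted values."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def resample_statistics(xs, ys, n_train, n_rounds=1000, seed=0):
    """Repeatedly split the dataset at random into training and test sets
    of fixed sizes, refit a simple model on each training set, and return
    the mean test-set R^2 with an empirical 95% confidence interval."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    stats = []
    for _ in range(n_rounds):
        rng.shuffle(idx)                    # a fresh random division
        train, test = idx[:n_train], idx[n_train:]
        # Least-squares fit y = a*x + b on the training set only.
        mx = statistics.fmean(xs[i] for i in train)
        my = statistics.fmean(ys[i] for i in train)
        a = (sum((xs[i] - mx) * (ys[i] - my) for i in train)
             / sum((xs[i] - mx) ** 2 for i in train))
        b = my - a * mx
        stats.append(r_squared([ys[i] for i in test],
                               [a * xs[i] + b for i in test]))
    stats.sort()
    return (statistics.fmean(stats),        # average statistic
            stats[int(0.025 * n_rounds)],   # lower 95% bound
            stats[int(0.975 * n_rounds)])   # upper 95% bound
```

Averaging the test-set statistic over many random divisions, and reading the confidence interval off the sorted values, follows the logic of the paragraph above: a model whose interval is wide would be flagged as unstable.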
A similar method of validation, which is used in QSAR and other data analysis
areas, is bootstrapping [9–11]. Like the resampling of training and test sets, boot-
strapping is a nonparametric approach to obtain estimates of statistics and confidence
intervals for a population of objects when only a representative subset of the popu-
lation is available. Bootstrapping consists of choosing N objects with replacement
from a dataset of N objects. Because the selected objects are returned to the initial
dataset, some
objects will be included in the bootstrapped datasets several times, while others will
not be included at all. It has been shown that if the procedure is repeated many times
(about 1000 times or more), average bootstrapped statistics are good estimates of
population statistics. Bootstrapping can be used separately for training and test sets.
Selecting the same compound several times into a training set is unacceptable for
some QSAR methodologies, such as kNN. Training and test set resampling is free
from this disadvantage: in different realizations, the same object can appear in either
the training set or the test set, but never more than once within each. Thus, after
many realizations, both
training and test sets will be represented by all objects included in the dataset. To
obtain population statistics estimates, we shall use the same approaches as used for
bootstrapping. They are described elsewhere [10–12].
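The sampling-with-replacement step, and the resulting confidence-interval estimate, can be sketched as follows. This is a minimal illustration: the function name `bootstrap_mean_ci` and the use of the sample mean as the statistic are hypothetical stand-ins for a real model statistic such as q² or CCR.

```python
# Illustrative sketch of bootstrapping: draw N objects with replacement
# from a dataset of N objects, so some objects recur while others are
# left out. The function name and the sample-mean statistic are
# hypothetical stand-ins for a real model statistic.
import random
import statistics

def bootstrap_mean_ci(values, n_rounds=1000, seed=0):
    """Return the bootstrap estimate of the mean together with an
    empirical 95% confidence interval."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_rounds):
        # Sampling with replacement: the same index may be drawn repeatedly.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistics.fmean(sample))
    estimates.sort()
    return (statistics.fmean(estimates),
            estimates[int(0.025 * n_rounds)],
            estimates[int(0.975 * n_rounds)])
```

Because each draw is with replacement, on average only about 63.2% (1 − 1/e) of the distinct objects appear in any one bootstrap sample; the rest are duplicates, which is why this scheme conflicts with methods like kNN that cannot tolerate repeated training compounds.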
The authors of a recent publication [13] assert that cross-validation and bootstrapping
are not reliable for estimating the true predictive power of a classifier if a
dataset includes fewer than about 1000 objects, and suggest that Bayesian confidence
intervals should be used instead. Cross-validation and bootstrapping are particularly
unreliable for small datasets (those with fewer than about 100 compounds). However, 95%
Bayesian confidence intervals for these datasets are very wide [13]. The authors show