
Predictive accuracy is relatively simple to measure, and for that reason may provide your best initial guide to the merit of a model. Unfortunately, predictive accuracy is not your best final guide to a model’s merit: there are excellent reasons to prefer other measures. A fundamental problem is that predictive accuracy entirely disregards the confidence of the prediction. In the mushroom classification problem, for example, a prediction of edibility with a probability of 0.51 counts exactly the same as a prediction of edibility with a probability of 1.0. Now, if we were confronted with the first prediction, we might rationally hesitate to consume such a mushroom. The predictive accuracy measure does not hesitate. According to standard practice, any degree of confidence is as good as any other if it leads to the same prediction: that is, all predictions in the end are categorical, rather than probabilistic. Any business, or animal, which behaved this way would have a very short life span! In language we have previously introduced (see Section 9.3.3.1), predictive accuracy pays no attention to the calibration of the model’s probabilities. So, we will now consider three alternative metrics which do reward calibration and penalize miscalibration.
10.5.2 Expected value
In recognition of this kind of problem, there is a growing movement in the machine learning community to employ cost-sensitive classification methods. Instead of preferring an algorithm or model which simply has the highest predictive accuracy, the idea is to prefer an algorithm or model with the best weighted average cost or benefit computed from its probabilistic predictions. In other words, the best model is that which produces the highest expected utility for the task at hand.
Since classifications are normally done with some purpose in mind, such as selecting a treatment for a disease, it should come as no surprise that utilities should be relevant to judging the predictive model. Indeed, it is clear that an evaluation which ignores utilities, such as predictive accuracy, cannot be optimal in the first place, since it will, for example, penalize a false negative of some nasty cancer no more and no less than a false positive, even if the consequences of the former swamp the latter.
Here is a simple binomial example of how we can use expected value to evaluate a model. Suppose that the utilities associated with categorically predicting $X = x$ or $X = \neg x$ are as shown in Table 10.5, where $U(\text{prediction} \mid \text{actual})$ is the utility of making that prediction when the actual value is as given. Then the model’s score would be
\[ P(x)\,U(x \mid x) + P(\neg x)\,U(\neg x \mid x) \]
across those cases where in fact $X = x$, and
\[ P(x)\,U(x \mid \neg x) + P(\neg x)\,U(\neg x \mid \neg x) \]
otherwise, where $P(x)$ is the probability the model attaches to $X = x$. Its average score would be the sum of these divided by the number of test cases, which is the best estimate of the expected utility of its predictions, if the model is forced to make a choice about the target class.
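This per-case score is straightforward to compute. Below is a minimal sketch in Python; the utility matrix stands in for Table 10.5 (not reproduced in this section), so its entries, and the name expected_utility_score, are illustrative assumptions rather than values from the text.

```python
import numpy as np

# Illustrative utility matrix in the role of Table 10.5 (entries assumed).
# Indexed as U[predicted, actual], with 1 = x (say, edible) and 0 = not-x.
U = np.array([
    [ 1.0, -1.0],   # predict not-x: U(~x|~x), U(~x|x)
    [-5.0,  2.0],   # predict x:     U(x|~x),  U(x|x)
])

def expected_utility_score(probs, actuals, utilities):
    """Average probability-weighted utility over a test set.

    probs:   the model's probability that X = x, one per test case
    actuals: true class labels (1 for x, 0 for not-x)
    For each case the score is
        P(x) * U(x | actual) + P(~x) * U(~x | actual),
    and the result is the mean over all cases.
    """
    total = 0.0
    for p, a in zip(probs, actuals):
        total += p * utilities[1, a] + (1 - p) * utilities[0, a]
    return total / len(probs)
```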
This expected value measurement explicitly takes into account the full probability distribution over the target class. If a model is overconfident, then for many cases it will, for example, attach a high probability $P(x)$ to $X = x$ when the facts are otherwise, and it will be penalized with the negative utility $U(x \mid \neg x)$ multiplied with that high probability.
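To see this penalty in action, compare two hypothetical models that make identical categorical predictions (and so have identical predictive accuracy) but differ in how confidently they err on the final case; the probabilities below are, again, made up for illustration:

```python
actuals       = [1, 0, 1, 0]            # true classes for four test cases
hedged        = [0.9, 0.1, 0.8, 0.6]    # wrong on the last case, but unsure
overconfident = [0.9, 0.1, 0.8, 0.9]    # wrong on the last case, and sure of it

print(expected_utility_score(hedged, actuals, U))         # 0.225
print(expected_utility_score(overconfident, actuals, U))  # -0.225
```

Both models classify three of four cases correctly, yet the hedged model scores higher solely because it attached less probability to the costly false positive.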