
lower than 50% in a population, say 1%, then a model which reflects no understanding
of the domain at all (beyond that base rate) can accumulate quite a high
information reward simply by predicting a 1% chance of disease for all future cases.
Rather than Good’s intended zero reward for ignorance, we have quite a high reward
for ignorance. As a Bayesian reward function, we propose one which takes the prior
into account and, in particular, which rewards with zero any prediction at the prior
probability for the target class.
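To see how large that reward can be, here is a rough check, assuming the binary form of Good's information reward introduced earlier, I = 1 + log2(p), where p is the probability the model assigned to the outcome that actually occurred (that form, and the figures below, are offered only as an illustrative sketch):

    # Expected per-case reward for a model that always predicts a 1% chance of disease,
    # assuming Good's binary information reward I = 1 + log2(p), with p the probability
    # the model gave to the outcome that actually occurred.
    from math import log2

    base_rate = 0.01                          # prior probability of disease in the population
    reward_healthy = 1 + log2(1 - base_rate)  # reward on the 99% of cases that are healthy
    reward_diseased = 1 + log2(base_rate)     # reward on the 1% of cases with the disease

    expected = (1 - base_rate) * reward_healthy + base_rate * reward_diseased
    print(round(expected, 3))                 # about 0.92, out of a maximum reward of 1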
The second deficiency is simply that Good's information reward is limited to binomial predictions.
Bayesian networks are very commonly applied to more complex prediction problems, so we propose the Bayesian information reward, IR_B, which applies to multinomial prediction:

$$ IR_B = \frac{1}{n}\sum_i I_i \qquad\qquad (10.6) $$

where n is the number of test cases, I_i = I^+_i for the true class and I_i = I^-_i otherwise, and

$$ I^+_i = \log\frac{p_i}{p'_i} \qquad\qquad I^-_i = \log\frac{1 - p_i}{1 - p'_i} $$

where p_i is the model's probability for the variable taking the particular value at issue
and p'_i is the prior probability of the variable taking that value. Note that this version
of information reward is directly related to Kullback-Leibler divergence (§3.6.5),
where the prior probability takes the role of the reference probability, but with two
differences. First, the weighting is done implicitly by the frequency with which the
different values arise in the test set, presumably corresponding to the prior proba-
bility of the different values. Second, the ratio between the model probability and
the prior is inverted, to reflect the idea that this is a reward for the model to diverge
from the reference prior, rather than to converge on it — so long as that divergence
is approaching the truth more closely than does the prior. This latter distinction, re-
warding divergence toward the truth and penalizing divergence away from the truth,
is enforced by the distinction between I^+ for the true value and I^- for false values.
This generalizes Good’s information reward to cover multinomial prediction by
introducing a reward value for values which the variable in the test case fails to take
(i.e., I^-). In the case of the binomial this was handled implicitly, since the
probability of the alternative to the true value is constrained to be 1 - p. However,
for multinomials there is more than one alternative to the true value and they must be
treated individually, as we do above.
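As a concrete illustration of equation (10.6) as reconstructed above, the following sketch computes the reward for a set of multinomial predictions, treating the true value and each false value of every test case individually and averaging over the n test cases; the function name and data layout are illustrative rather than anything prescribed by the text.

    # A minimal sketch of the Bayesian information reward of (10.6): for each test case,
    # the true value contributes I+ = log(p/p') and every other value contributes
    # I- = log((1 - p)/(1 - p')); the total is averaged over the n test cases.
    from math import log

    def bayesian_information_reward(cases, priors):
        """cases: list of (predicted, true_value) pairs, where `predicted` and
        `priors` map each value of the class variable to a probability."""
        total = 0.0
        for predicted, true_value in cases:
            for value, prior in priors.items():
                p = predicted[value]
                if value == true_value:
                    total += log(p / prior)              # I+: reward probability moved toward the truth
                else:
                    total += log((1 - p) / (1 - prior))  # I-: reward probability moved away from false values
        return total / len(cases)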
IR_B has the following meritorious properties:
If a model predicts any value with that value's prior probability, then the IR_B
for that prediction is zero. That is, ignorance is never rewarded (as the short check
below illustrates). This raises the question of where
the priors for the test cases should come from. The simplest, and usually
satisfactory, answer is to use the frequency of the value for the variable within
the training cases.
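Continuing the sketch above, a quick check of this property: a model that simply reports the priors for every case, here taken as the training-set frequencies, earns exactly zero.

    # Using the bayesian_information_reward sketch above: predicting every value at
    # its prior yields a reward of exactly zero, whatever the mix of test cases.
    priors = {"disease": 0.01, "healthy": 0.99}
    cases = [(priors, "healthy")] * 99 + [(priors, "disease")]
    print(bayesian_information_reward(cases, priors))   # 0.0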