providing training derivatives, or slopes, as additional information for each training example (e.g., $\langle x_i, f(x_i), \frac{\partial f(x_i)}{\partial x} \rangle$). By fitting both the training values $f(x_i)$ and these training derivatives $\frac{\partial f(x_i)}{\partial x}$, the learner has a better chance to correctly
generalize from the sparse training data. To summarize, the impact of including
the training derivatives is to override the usual syntactic inductive bias of BACKPROPAGATION that favors a smooth interpolation between points, replacing it by
explicit input information about required derivatives. The resulting hypothesis
$h$ shown in the rightmost plot of the figure provides a much more accurate estimate of the true target function $f$.
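As a concrete illustration of fitting both training values and training derivatives, the following sketch (in Python with JAX, not code from the book) trains a small network on triples $\langle x_i, f(x_i), \frac{\partial f(x_i)}{\partial x} \rangle$ by adding a squared slope-error term to the usual squared value-error. The network shape, the weighting constant mu, and all names here are illustrative assumptions.

import jax
import jax.numpy as jnp

def net(params, x):
    # Tiny one-hidden-layer network f_hat: R -> R (an illustrative stand-in
    # for the learned network, not TANGENTPROP's own architecture).
    w1, b1, w2, b2 = params
    h = jnp.tanh(w1 * x + b1)      # hidden layer activations
    return jnp.dot(w2, h) + b2     # scalar output

# Derivative of the network output with respect to its input x.
dnet_dx = jax.grad(net, argnums=1)

def loss(params, xs, ys, slopes, mu=1.0):
    # Squared error on training values plus squared error on training slopes.
    value_err = sum((net(params, x) - y) ** 2 for x, y in zip(xs, ys))
    slope_err = sum((dnet_dx(params, x) - s) ** 2 for x, s in zip(xs, slopes))
    return value_err + mu * slope_err

# Three hypothetical training triples <x_i, f(x_i), df/dx at x_i>.
xs = jnp.array([0.0, 1.0, 2.0])
ys = jnp.array([1.0, 0.5, 0.2])
slopes = jnp.array([-0.6, -0.4, -0.2])

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
params = (0.1 * jax.random.normal(k1, (8,)), jnp.zeros(8),
          0.1 * jax.random.normal(k2, (8,)), jnp.array(0.0))

grads = jax.grad(loss)(params, xs, ys, slopes)   # gradient for one descent step

Minimizing this loss drives the network to match the given slopes as well as the given values at the training points.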
In the above example, we considered only simple kinds of derivatives of
the target function. In fact, TANGENTPROP can accept training derivatives with
respect to various transformations of the input x. Consider, for example, the task
of learning to recognize handwritten characters. In particular, assume the input
x corresponds to an image containing a single handwritten character, and the
task is to correctly classify the character. In this task, we might be interested in
informing the learner that "the target function is invariant to small rotations of
the character within the image." In order to express this prior knowledge to the
learner, we first define a transformation $s(\alpha, x)$, which rotates the image $x$ by $\alpha$ degrees. Now we can express our assertion about rotational invariance by stating
that for each training instance $x_i$, the derivative of the target function with respect to this transformation is zero (i.e., that rotating the input image does not alter the value of the target function). In other words, we can assert the following training derivative for every training instance $x_i$:

$$\frac{\partial f(s(\alpha, x_i))}{\partial \alpha} = 0$$
where $f$ is the target function and $s(\alpha, x_i)$ is the image resulting from applying the transformation $s$ to the image $x_i$.
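To make this assertion concrete, here is a sketch (again in JAX, not TANGENTPROP's own code) of one possible differentiable rotation $s(\alpha, x)$ and of the derivative $\frac{\partial \hat{f}(s(\alpha, x_i))}{\partial \alpha}$ at $\alpha = 0$. The bilinear-interpolation rotation, the use of radians, and the name f_hat (assumed to map an image to a scalar output) are assumptions made only for illustration.

import jax
import jax.numpy as jnp
from jax.scipy.ndimage import map_coordinates

def rotate(alpha, image):
    # s(alpha, x): rotate `image` about its center by alpha (radians here),
    # using bilinear interpolation so the result is differentiable in alpha.
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = jnp.meshgrid(jnp.arange(h) - cy, jnp.arange(w) - cx, indexing="ij")
    c, s = jnp.cos(alpha), jnp.sin(alpha)
    src_y = c * ys - s * xs + cy    # inverse-rotated sampling coordinates
    src_x = s * ys + c * xs + cx
    return map_coordinates(image, [src_y, src_x], order=1, mode="constant")

def rotation_derivative(f_hat, image):
    # d f_hat(s(alpha, image)) / d alpha evaluated at alpha = 0, where f_hat
    # maps an image to a scalar output (e.g., one class score).
    g = lambda alpha: f_hat(rotate(alpha, image))
    return jax.grad(g)(0.0)

# Rotational invariance asserts this quantity should be (near) zero:
# rotation_derivative(trained_network, some_image)

Note that rotate(0.0, image) returns the image unchanged, so the transformation reduces to the identity at $\alpha = 0$, as required below.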
How are such training derivatives used by TANGENTPROP to constrain the
weights of the neural network? In TANGENTPROP these training derivatives are
incorporated into the error function that is minimized by gradient descent. Recall
from Chapter
4
that the BACKPROPAGATION algorithm performs gradient descent to
attempt to minimize the sum of squared errors

$$E \equiv \sum_i \left( f(x_i) - \hat{f}(x_i) \right)^2$$

where $x_i$ denotes the $i$th training instance, $f$ denotes the true target function, and $\hat{f}$ denotes the function represented by the learned neural network.
In TANGENTPROP an additional term is added to the error function to penalize discrepancies between the training derivatives and the actual derivatives of the learned neural network function $\hat{f}$.
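A rough sketch of such an augmented error follows. It is not the exact error function TANGENTPROP minimizes, but it illustrates the idea of adding, for each asserted invariance, a penalty on the squared difference between the asserted training derivative (zero) and the learned network's derivative along the transformation. The names f_hat and transforms and the weighting constant mu are illustrative assumptions.

import jax

def augmented_error(params, f_hat, xs, ys, transforms, mu=0.1):
    # Usual squared error on training values, plus a penalty that pushes the
    # network's derivative along each transformation s_j (at alpha = 0)
    # toward the asserted training derivative of zero.
    err = 0.0
    for x, y in zip(xs, ys):
        err += (f_hat(params, x) - y) ** 2
        for s_j in transforms:                   # each s_j(alpha, x) with s_j(0, x) = x
            g = lambda alpha: f_hat(params, s_j(alpha, x))
            d = jax.grad(g)(0.0)                 # d f_hat(s_j(alpha, x)) / d alpha at 0
            err += mu * d ** 2                   # asserted training derivative is zero
    return err

# Gradient descent then minimizes this error with respect to params, e.g.:
# grads = jax.grad(augmented_error)(params, f_hat, xs, ys, [rotate, translate])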
In general, TANGENTPROP accepts multiple transformations (e.g., we might wish to assert both rotational invariance and translational invariance of the character identity). Each transformation must be of the form $s_j(\alpha, x)$ where $\alpha$ is a continuous parameter, where $s_j$ is differentiable, and where $s_j(0, x) = x$ (e.g., for rotation of zero degrees the transformation is the identity function). For each such transformation, $s_j(\alpha, x)$, TANGENT-