greater detail. The network in Figure 4.7 was trained using the algorithm shown in Table 4.2, with initial weights set to random values in the interval (-0.1, 0.1), learning rate η = 0.3, and no weight momentum (i.e., α = 0). Similar results
were obtained by using other learning rates and by including nonzero momentum.
The hidden unit encoding shown in Figure
4.7
was obtained after 5000 training
iterations through the outer loop of the algorithm (i.e., 5000 iterations through each
of the eight training examples). Most of the interesting weight changes occurred,
however, during the first 2500 iterations.
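
As a concrete point of reference, the following is a minimal NumPy sketch of the training setup just described: an 8x3x8 network of sigmoid units, the stochastic weight-update rule of Table 4.2, learning rate 0.3, zero momentum, initial weights drawn from (-0.1, 0.1), and 5000 passes through the eight one-hot training examples. The variable names, random seed, and code organization are illustrative choices of this sketch, not part of the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # The eight training examples: each input is a one-hot vector of eight bits,
    # and the target output is identical to the input.
    X = np.eye(8)
    T = np.eye(8)

    eta = 0.3                                  # learning rate from the text
    W1 = rng.uniform(-0.1, 0.1, size=(3, 9))   # input-to-hidden weights; last column is the bias weight
    W2 = rng.uniform(-0.1, 0.1, size=(8, 4))   # hidden-to-output weights; last column is the bias weight

    def forward(x):
        """Return (hidden activations, output activations) for one input vector."""
        h = sigmoid(W1 @ np.append(x, 1.0))    # append the constant 1 bias input
        o = sigmoid(W2 @ np.append(h, 1.0))
        return h, o

    def train_epoch():
        """One pass through the eight examples, updating the weights after each
        example (stochastic gradient descent, no momentum)."""
        global W1, W2
        for x, t in zip(X, T):
            h, o = forward(x)
            delta_o = o * (1.0 - o) * (t - o)                  # output error terms
            delta_h = h * (1.0 - h) * (W2[:, :3].T @ delta_o)  # hidden error terms
            W2 += eta * np.outer(delta_o, np.append(h, 1.0))
            W1 += eta * np.outer(delta_h, np.append(x, 1.0))

    for epoch in range(5000):                  # 5000 passes through the training set
        train_epoch()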
We can directly observe the effect of
BACKPROPAGATION'S gradient descent
search by plotting the squared output error as a function of the number of gradient
descent search steps. This is shown in the top plot of Figure
4.8.
Each line in
this plot shows the squared output error summed over all training examples, for
one of the eight network outputs. The horizontal axis indicates the number of
iterations through the outermost loop of the BACKPROPAGATION algorithm. As this
plot indicates, the sum of squared errors for each output decreases as the gradient
descent procedure proceeds, more quickly for some output units and less quickly
for others.
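
A plot like the top one in Figure 4.8 can be approximated by recording, after every pass through the training set, each output unit's squared error summed over all eight examples. The sketch below assumes the network and train_epoch() defined earlier; the use of matplotlib and the history-list name are illustrative choices.

    import numpy as np
    import matplotlib.pyplot as plt

    # Replaces the plain 5000-iteration loop above: after each pass, record the
    # squared error of every output unit, summed over the eight training examples.
    sse_history = []
    for epoch in range(5000):
        train_epoch()
        sse = sum((t - forward(x)[1]) ** 2 for x, t in zip(X, T))  # length-8 vector
        sse_history.append(sse)

    plt.plot(np.array(sse_history))            # eight curves, one per output unit
    plt.xlabel("iterations through the training set")
    plt.ylabel("sum of squared errors per output unit")
    plt.show()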
The evolution of the hidden layer representation can be seen in the second
plot of Figure
4.8.
This plot shows the three hidden unit values computed by the
learned network for one of the possible inputs (in particular, 01000000). Again, the
horizontal axis indicates the number of training iterations. As this plot indicates,
the network passes through a number of different encodings before converging to
the final encoding given in Figure
4.7.
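
The evolving hidden-layer encoding can be tracked the same way: after each pass, ask the current network for the three hidden activations it computes on the probe input 01000000 and append them to a history list in the loop above. A small sketch, again using the names from the first code block:

    import numpy as np

    x_probe = np.zeros(8)
    x_probe[1] = 1.0                           # the input string 01000000

    # The three-valued hidden encoding the current network assigns to 01000000.
    current_code = forward(x_probe)[0]

    # Appending forward(x_probe)[0].copy() to a list after each train_epoch()
    # call in the recording loop above, and plotting the resulting
    # (iterations x 3) array, gives three curves like the second plot of Figure 4.8.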
Finally, the evolution of individual weights within the network is illustrated
in the third plot of Figure
4.8.
This plot displays the evolution of weights connecting the eight input units (and the constant 1 bias input) to one of the three
hidden units. Notice that significant changes in the weight values for this hidden
unit coincide with significant changes in the hidden layer encoding and output
squared errors. The weight that converges to a value near zero in this case is the
bias weight w₀.
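
The bookkeeping for this third plot is analogous: record, after each pass, the nine weights feeding one of the hidden units, that is, the eight input weights plus the bias weight w₀ on the constant 1 input. In the sketch below (which watches hidden unit 0, an arbitrary choice), these weights are one row of the W1 array from the first code block.

    # Row 0 of W1 holds the weights from the eight inputs into hidden unit 0,
    # followed by its bias weight w0 (the weight on the constant 1 input).
    weights_into_hidden0 = W1[0]               # shape (9,): eight input weights + bias
    w0 = weights_into_hidden0[-1]

    # Appending W1[0].copy() to a list after each train_epoch() call and plotting
    # the resulting (iterations x 9) array gives nine weight trajectories like the
    # third plot of Figure 4.8; in the run described in the text, it is the bias
    # weight that settles near zero.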
4.6.5 Generalization, Overfitting, and Stopping Criterion
In the description of the BACKPROPAGATION algorithm in Table 4.2, the termination condition for the algorithm has been left unspecified. What is an appropriate condition for terminating the weight update loop? One obvious choice is to continue training until the error E on the training examples falls below some predetermined threshold. In fact, this is a poor strategy because BACKPROPAGATION is susceptible to overfitting the training examples at the cost of decreasing generalization accuracy over other unseen examples.
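
Stated as code, the threshold-based criterion just described might look like the sketch below, which reuses the names from the earlier code blocks; the threshold value of 0.01 is an arbitrary illustrative choice, and the point of the surrounding discussion is that this criterion is not a good one.

    import numpy as np

    def training_error():
        """E: squared error summed over all output units and all training examples."""
        return float(sum(np.sum((t - forward(x)[1]) ** 2) for x, t in zip(X, T)))

    # The "obvious" termination condition discussed above: keep looping through
    # the training examples until E falls below a predetermined threshold.
    # As the text notes, this ignores performance on unseen examples and so
    # invites overfitting.
    threshold = 0.01
    while training_error() > threshold:
        train_epoch()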
To see the dangers of minimizing the error over the training data, consider
how the error
E
varies with the number of weight iterations. Figure
4.9
shows
The source code to reproduce this example is available at http://www.cs.cmu.edu/~tom/mlbook.html.