left, (0,1,0,0) to encode a face looking straight, etc. Instead of 0 and 1 values,
we use values of 0.1 and 0.9, so that (0.9,0.1,0.1,0.1) is the target output vector
for a face looking to the left. The reason for avoiding target values of 0 and 1
is that sigmoid units cannot produce these output values given finite weights. If
we attempt to train the network to fit target values of exactly 0 and 1, gradient
descent will force the weights to grow without bound. On the other hand, values
of 0.1 and 0.9 are achievable using a sigmoid unit with finite weights.
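The point can be checked numerically. The short sketch below (in Python, not from the text; the ordering of directions other than "left" in the target vectors is assumed for illustration) shows that a sigmoid unit only approaches 0 and 1 asymptotically for finite inputs, and how the 0.1/0.9 target vectors would be encoded.

```python
import numpy as np

def sigmoid(y):
    """Standard logistic (sigmoid) unit output."""
    return 1.0 / (1.0 + np.exp(-y))

# Even for a large weighted sum, the sigmoid only approaches 1
# asymptotically; it never reaches 0 or 1 for finite input.
for net in (2.0, 5.0, 10.0, 20.0):
    print(net, sigmoid(net))   # 0.88..., 0.993..., 0.99995..., 0.999999998

# Target encoding for the four face directions, using 0.1/0.9 rather than
# 0/1.  Only the "left" vector is given in the text; the rest are assumed.
targets = {
    "left":     np.array([0.9, 0.1, 0.1, 0.1]),
    "straight": np.array([0.1, 0.9, 0.1, 0.1]),
    "right":    np.array([0.1, 0.1, 0.9, 0.1]),
    "up":       np.array([0.1, 0.1, 0.1, 0.9]),
}
```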
Network graph structure. As described earlier, BACKPROPAGATION can be applied
to any acyclic directed graph of sigmoid units. Therefore, another design
choice we face is how many units to include in the network and how to inter-
connect them. The most common network structure is a layered network with
feedforward connections from every unit in one layer to every unit in the next.
In the current design we chose this standard structure, using two layers of sig-
moid units (one hidden layer and one output layer). It is common to use one or
two layers of sigmoid units and, occasionally, three layers. It is not common to
use more layers than this because training times become very long and because
networks with three layers of sigmoid units can already express a rich variety of
target functions (see Section 4.6.2). Given our choice of a layered feedforward
network with one hidden layer, how many hidden units should we include? In the
results reported in Figure 4.10, only three hidden units were used, yielding a test
set accuracy of 90%. In other experiments 30 hidden units were used, yielding a
test set accuracy one to two percent higher. Although the generalization accuracy
varied only a small amount between these two experiments, the second experiment
required significantly more training time. Using 260 training images, the training
time was approximately 1 hour on a Sun
Sparc5 workstation for the 30 hidden unit
network, compared to approximately 5 minutes for the 3 hidden unit network. In
many applications it has been found that some minimum number of hidden units
is required in order to learn the target function accurately and that extra hidden
units above this number do not dramatically affect generalization accuracy, pro-
vided cross-validation methods are used to determine how many gradient descent
iterations should be performed. If such methods are not used, then increasing the
number of hidden units often increases the tendency to
overfit the training data,
thereby reducing generalization accuracy.
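As a concrete illustration of this design, the following sketch builds the forward pass of such a layered network in Python (an assumed reconstruction, not code from the text): a 30x32 = 960-pixel input, a single hidden layer of 3 sigmoid units, and 4 sigmoid output units, with feedforward connections from every unit in one layer to every unit in the next. For simplicity both weight layers are initialized to small random values here; the initialization actually used in these experiments is described below.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

n_inputs, n_hidden, n_outputs = 30 * 32, 3, 4   # 960 pixels, 3 hidden, 4 outputs

rng = np.random.default_rng(0)
# One weight per input plus a threshold (bias) weight for each unit.
W_hidden = rng.uniform(-0.05, 0.05, (n_hidden, n_inputs + 1))
W_output = rng.uniform(-0.05, 0.05, (n_outputs, n_hidden + 1))

def forward(pixels):
    """Forward pass through the two layers of sigmoid units."""
    x = np.append(pixels, 1.0)                      # constant input for the bias weight
    hidden = sigmoid(W_hidden @ x)                  # 3 hidden-unit activations
    output = sigmoid(W_output @ np.append(hidden, 1.0))
    return output                                   # 4 outputs; the largest names the direction

example_image = rng.uniform(0.0, 1.0, n_inputs)     # stand-in for a scaled 30x32 image
print(forward(example_image))
```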
Other learning algorithm parameters.
In these learning experiments the learning
rate η was set to 0.3, and the momentum α was set to 0.3. Lower values for both
parameters produced roughly equivalent generalization accuracy, but longer train-
ing times. If these values are set too high, training fails to converge to a network
with acceptable error over the training set. Full gradient descent was used in all
these experiments (in contrast to the stochastic approximation to gradient descent
in the algorithm of Table 4.2). Network weights in the output units were initial-
ized to small random values. However, input unit weights were initialized to zero,
because this yields much more intelligible visualizations of the learned weights
(see Figure 4.10), without any noticeable impact on generalization accuracy. The