This file is a collection of terms and definitions that are used to
describe benchmark problems and results in this benchmark collection. By
defining these terms here, in a central file, we hope to encourage precise
and uniform reporting of results without having to define a lot of terms
within every benchmark description file.
This glossary file is NOT meant to be a comprehensive survey of how these
terms are used in the literature. Where different researchers use a term
in different ways, we must choose -- perhaps arbitrarily -- which
definition to use here.
This file was created and is currently maintained by Scott E. Fahlman.
It is not copyrighted because I want researchers to have unrestricted
access to everything in this benchmark collection. However, I would ask
anyone reprinting a substantial portion of this file to acknowledge the
source, as a matter of simple courtesy.
---------------------------------------------------------------------------
ACTIVATION FUNCTION: In many learning architectures, the output of each
unit is some function (usually nonlinear) of the weighted inputs to that
unit. This function is called the "activation function" for that unit.
Among the popular activation functions are threshold, sigmoid, symmetric
sigmoid, and Gaussian.
BACKPROP: A colloquial expression for the BACK-PROPAGATION learning
algorithm.
BACK-PROPAGATION LEARNING: A supervised learning method for networks
without any directed loops. See [1] for a detailed description of the
algorithm and parameters.
BIAS UNIT: Many learning network architectures use something like the
following formula to describe the behavior of each unit:
output = f(sum-of-weighted-inputs + bias)
The function f is some nonlinear activation function. Both the weights
and the bias for each unit must be adjusted by the learning machinery.
The learning algorithm is simplified by replacing the bias unit with a
connection to a "bias unit" whose value is held constant and positive.
BIAS CONNECTION: The connection a unit receives from the bias unit.
CROSS ENTROPY ERROR FUNCTION: An alternative to the mean-squared error
function. In general, the values of d, the desired output for a given
training example, and y, the observed output for that case, will be real
numbers. We can interpret d as the desired probability that a
binary-valued output will assume a value of 1 in this case, and y as the
observed probability of seeing a 1 in that case.
The cross-entropy error, C is then expressed as
C = - sum [all cases and outputs] ( d log(y) + (1-d) log(1-y) )
The derivative of this error function for a given output and training
example (this is the value we actually back-propagate) is
dC/dy = - d/y + (1-d)(1-y)
Note that this back-propagated derivative goes to infinity as the
difference between y and d goes to +1 or -1. This can counteract the
tendency of the network to get stuck in regions where the derivative of
the sigmoid function approaches zero.
COMPLETE CONNECTIVITY: In most feed-forward networks the units are
arranged in regular layers. Layer L1 is said to be "completely
connected" to layer L2 if every unit in L2 receives a connection from
every unit in L1. A network is said to be "completely connected" if each
of its layers is completely connected to the next one, and if every unit
in the first hidden layer receives a connection from each of the
network's external inputs.
EPOCH: An epoch is defined as a single presentation of each of the
examples in the training set, either in fixed or random order. This
concept is only well-defined for problems with a fixed, finite set of
training examples. Modification of weights may occur after every epoch,
after every pattern presentation, or on some other schedule.
It is often convenient to measure learning time in epochs for problems
and algorithms in which the concept of epoch is well-defined; for other
problems it may be necessary to measure learning time in terms of
individual pattern presentations or in terms of the number of arithmetic
operations required.
ERROR FUNCTION: The task of supervised learning is generally to minimize
some function of the difference between the observed and the desired
outputs, taken over all the outputs and over some set of input/output
pairs. This function is called the "error function". The most common
error function is "mean squared error": the square of the difference
between desired and observed outputs is averaged over all outputs and
over the entire set of trials. The derivative of this measure with
respect to each output value is just the difference itself, so it is this
difference that produces the error signal that is back-propagated through
the network.
40-20-40 CORRECTNESS CRITERION: For problems in which the output units
are supposed to assume binary values -- logical zero and logical one --
there is a question of what range of output values are to be considered
"close enough" to these ideal targets. The 40-20-40 criterion says that
values in the lowest 40% of the output range are treated as logical zero,
values in the highest 40% are treated as logical one, and values in the
middle 20% are treated as indeterminate and therefore not correct. This
criterion is widely used because it does not require extreme accuracy in
the outputs but does require that the output values be distinct enough
for some amount of noise immunity. Do not confuse this measure with the
TARGET VALUES for the output units.
HYPERBOLIC ARCTANGENT ERROR FUNCTION: An error function under which the
derivative error value back-propagated into the network is the hyperbolic
arctangent of the difference between the desired and observed values,
rather than the difference itself. See [2] for details.
HYPERGEOMETRIC MEAN: The hypergeometric mean H of a set of N values {v1, v2,
... vN} can be computed by the formula
H = N / (1/v1 + 1/v2 ... + 1/vN)
This measure has been used by some researchers as a way of reporting
"average" learning times for problems in which a normal arithmetic mean
is unusable because some of the learning trials do not converge at all.
These trials go to zero in the summation above, and so they cause no
special problems. Note, however, that this method gives a result that is
strictly less than the arithmetic mean of the same value set because fast
learning trials are weighted more heavily than slow ones. This measure
is also known as the "harmonic mean".
MEAN SQUARED ERROR: See ERROR FUNCTION.
PATTERN PRESENTATION: See PRESENTATION.
PRESENTATION: Most supervised learning algorithms operate by presenting
training examples or training patterns to the network, one at a time.
Each such example consists of an input and a desired output. Learning
time is often measured by the total number of presentations required for
training the network to some specified level of proficiency.
QUICKPROP: A variation on the BACK-PROPAGATION learning algorithm that
uses a second-order weight-update function, based on measurement of the
error gradient at two successive points, to speed up convergence over
simple first-order gradient descent. See reference [2] for details.
REINFORCEMENT LEARNING: A learning scheme in which the network is trained
by giving it inputs, observing the outputs, and indicating how good those
outputs were according to some criterion known to the trainer. The
desired outputs are not presented to the network in explicit form.
RESTART REPORTING METHOD: This is a method for reporting "average"
learning times in situations where some trials do not converge or take an
unusually long time. The experimenter picks some number of epochs as a
"restart" value. If a learning trial has not been completed by the time
this value is reached, the trial is terminated and started over with a
new set of random initial values. When the problem ultimately is solved,
the time reported includes the time invested before the restart as well
as the time after.
SHORTCUT CONNECTIONS: In most feed-forward networks the units are
arranged in regular layers. Units in one layer receive incoming
connections only from units in the previous layer; units in the first
layer receive connections only from the external inputs. A "shortcut
connection" skips over layers in the network. In a network with
"complete shortcut connectivity", each unit receives direct connections
from all units in ALL previous layers and also from all the external
inputs.
SIGMOID ACTIVATION FUNCTION: A monotonic, continuously differentiable
activation function often used in back-propagation networks. Defined as
output = 1 / ( 1 + exp (-x/T) )
where x is the weighted sum of inputs and T is a scaling parameter
controlling the steepness of the slope. This is equivalent to
output = ( tanh (x/T) + 1 ) / 2
The range of this function is from 0 to +1.
SQUARED ERROR: See ERROR FUNCTION.
SUPERVISED LEARNING: A learning scheme in which the desired output for each
training input is presented to the network explicitly.
SYMMETRIC SIGMOID ACTIVATION FUNCTION: A version of the sigmoid function
shifted to that its range is symmetric around 0. Some researchers use a
range of -1 to +1, while others use -1/2 to +1/2. It is best to specify
which convention is in use.
TARGET VALUE: This is the value presented as the desired output during
training. For problems with binary outputs, the choice of target values
is up to the algorithm designer. Many researchers use .1 and .9 rather
than the extreme values of 0.0 and 1.0, which are unreachable with
sigmoid output units.
THRESHOLD ACTIVATION FUNCTION: An activation function that is 0 for all
input sums below 0 or some other threshold value, and 1 for all input
sums above this threshold. This is the limiting case of the sigmoid
activation function as T goes to 0. Used in the Perceptron, but
unsuitable for back-propagation, which requires a continuous,
differentiable activation function.
TWICE-MEDIAN REPORTING METHOD: This is an alternative to the restart
reporting method or the use of the hypergeometric mean for reporting
learning-time experiments in which some trials do not terminate. We
record the time required by all trials, with very long trials being
terminated and given a value of infinity. We compute the median of all
trials, and classify any trial that took longer than twice the median
time to be "unsuccessful". We report the median value, the average of
all successful trials, and the number of unsuccessful trials. The
advantage of this method over the restart method is that it avoids the
guesswork involved in choosing a good restart value. Of course, it is
only applicable if at least half of the trials converge.
UNSUPERVISED LEARNING: A learning scheme in which the desired outputs are
not presented to the network; only the inputs are seen. The task of the
learning algorithm is generally to learn to reproduce the input
distribution, to complete partial inputs, or to classify the inputs
according to some observed regularities.
---------------------------------------------------------------------------
REFERENCES:
1. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal
Representations by Error Propagation", in Parallel Distributed
Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland (eds.), MIT
Press, 1986.
2. Scott E. Fahlman, "Faster-Learning Variations on Back-Propagation: An
Empirical Study", in Proceedings of the 1988 Connectionist Models Summer
School, Morgan Kaufmann, 1988.