
Neural Networks

The selection of the name ‘neural network’ was one of the great PR successes of the Twentieth Century.  It certainly sounds more exciting than ... “A network of weighted, additive values with nonlinear transfer functions.”
— Phillip Sherrod, DTREG: predictive modeling software

Terminology

The “neural network” is the successor to the “perceptron,” whose originator claimed it was based on the physiology of biological neurons. In fact, it was based on an analogy with an incomplete and largely incorrect theory of such neurons. The name “neural network” is a pretentious misnomer.

The biology of the nervous system is infinitely more complex than these mechanistic “neural networks.” The nodes are not like neurons, by which we mean real, biological neurons. What the activity of a neuron means is unknown, but what seems to be true is that it is the rate of firing that affects what it is connected to, not an individual firing. For example, a muscle cell contracts with more force when its motor neuron is activated at a higher frequency. There is nothing like, or even analogous to, backpropagation in neurons. The dendrites are not like wires; indeed, a single axon is “as complicated as a computer,” to quote the late neurophysiologist Jerome Lettvin.

The conventional terminology makes no sense: perceptrons that perceive nothing; pattern recognition – as the general field is sometimes called – in which nothing is recognized.

Men design the machine, modeled or built, to sort input data (from a limited universe) into groups, after certain machine parameters have been determined by the user feeding in many data sets along with the group each belongs to. The iterative process is called “training,” though it is no more training than the Newton-Raphson iterative method of finding the zero of a function.
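
For comparison, here is the Newton-Raphson iteration in a short Python sketch (illustrative only; the function and starting point are made up); nobody would call this repetition “training”:

def newton_raphson(f, fprime, x, steps=20):
    # iterate x <- x - f(x)/f'(x) until (with luck) f(x) is near zero
    for _ in range(steps):
        x = x - f(x) / fprime(x)
    return x

# the square root of 2, found as the zero of f(x) = x*x - 2
print(newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, 1.0))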

Another quirk in the conventional terminology of neural networks is that the middle node layer is called “hidden.” Hidden from whom? And “pattern” is used in the sense of one of several inputs, for example a pixelated picture; it is not used in the sense of a logical connection.

The conventional terminology is harmless enough if we keep in mind that it is whimsical and has no connection to biology or mental processes.



What it Does

Say you have a vast, potentially unlimited, collection of pictures that can be pixelated, that is, divided into small areas each of which can be specified by a number, in the simplest case 0 or 1. The “pictures” can have any dimension, so we will call them patterns. (In our example program, the pattern is one dimensional, 20 pixels long, and a pattern consists of from one to five of the pixels being bright, the others dim.)

You have some evaluation function that categorizes each pattern. For example, if the patterns were two-dimensional pictures, each of one dog, cat, or mule, you might want to categorize each pattern as a dog, a cat, or a mule. (In our program, the category is the number of bright pixels among the 20.)
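
A minimal Python sketch of this setup (the names make_pattern and category are ours, for illustration, not the example program’s):

import random

N_PIXELS = 20

def make_pattern():
    # one to five bright pixels (1s), the others dim (0s)
    count = random.randint(1, 5)
    bright = set(random.sample(range(N_PIXELS), count))
    return [1 if i in bright else 0 for i in range(N_PIXELS)]

def category(pattern):
    # the evaluation function of our example: the count of bright pixels
    return sum(pattern)

p = make_pattern()
print(p, "-> category", category(p))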

A so-called “artificial neural network” contains several layers of “artificial neurons”: an input layer, one or more middle or “hidden” layers, and an output layer. Regardless of the dimension of the patterns, how each layer of nodes is arranged is immaterial; they might as well be strung out in a line. The number of nodes of the input layer equals the total number of pixels in a pattern (which, if the pattern is N-dimensional, would be the product of the number of pixels on its N sides). The number of nodes of the output layer equals the number of categories. (In our program there are five output nodes, one for each possible count.)
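
Concretely, the connections between adjacent layers reduce to two weight matrices. A sketch of their shapes for the example’s dimensions (20 inputs, 5 outputs, and the 15 hidden nodes used later in the text; numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
n_input, n_hidden, n_output = 20, 15, 5

weightHI = rng.uniform(-0.5, 0.5, (n_hidden, n_input))   # hidden-from-input weights
weightOH = rng.uniform(-0.5, 0.5, (n_output, n_hidden))  # output-from-hidden weights

The random starting values anticipate the training described below.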

You can present a pattern to the input layer of the neural network. By a process shown in the program, the nodes of the output layer light up. The brightnesses of the output layer nodes are determined by the input layer and the weights of the connections from the input layer to the hidden layer and from the hidden layer to the output layer (and between the hidden layers if there is more than one). You want just one of the nodes to stand out as the brightest, and that one to indicate the correct answer to “What is the category of the input pattern?”

There are two phases in using the network, training and testing. The network starts with random weights for the interconnections. During training, a large number and variety of patterns are presented to it, patterns you know the category of, that is, you know the correct answer. The network will provide an answer based on its current weight matrices, and it will most likely get it wrong. The error is propagated back to adjust the weights so it will do better next time. The collection of training patterns is gone through over and over until the network gets the answer right for all the patterns in the training collection.
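
A hedged numpy sketch of such a training loop, assuming one hidden layer, sigmoid units, squared error, plain batch gradient descent, and a made-up learning rate; the example program’s own details may differ:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(patterns, categories, n_hidden=15, rate=0.5, max_epochs=10000):
    X = np.asarray(patterns, dtype=float)            # one row per training pattern
    n_in, n_out = X.shape[1], 5
    T = np.zeros((len(X), n_out))
    T[np.arange(len(X)), np.asarray(categories) - 1] = 1.0   # one-hot targets
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))    # weightHI, random start
    W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden))   # weightOH, random start
    for _ in range(max_epochs):
        H = sigmoid(X @ W1.T)                        # hidden layer, all patterns at once
        Y = sigmoid(H @ W2.T)                        # output layer
        if np.all(Y.argmax(axis=1) == T.argmax(axis=1)):
            break                                    # every training pattern classified right
        dY = (Y - T) * Y * (1.0 - Y)                 # output-layer error signal
        dH = (dY @ W2) * H * (1.0 - H)               # error propagated back to the hidden layer
        W2 -= rate * (dY.T @ H) / len(X)
        W1 -= rate * (dH.T @ X) / len(X)
    return W1, W2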

Then there is the testing phase. You try giving the network patterns it didn’t train on, and see how it does.

Amazingly, it can do quite well after training on only a fraction of the possible patterns. In the example program, there are over four thousand possible patterns consisting of five pixels among twenty spaced no closer than two pixels apart. If you have a network with 15 nodes in one hidden layer, and train it on only 300 of those patterns until it gets them all right, then it will almost always give the right answer when you give it one of the roughly four thousand patterns it never saw before.
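
The “over four thousand” can be checked directly, reading “no closer than two pixels apart” as a minimum gap of two positions between bright pixels:

from itertools import combinations

spaced = [c for c in combinations(range(20), 5)
          if all(b - a >= 2 for a, b in zip(c, c[1:]))]
print(len(spaced))   # 4368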

The neural network program below finds two matrices that can be used to count the number of bright pixels in a pattern. These matrices operate on the input variables, considered as a column matrix, to determine the output variables. The output variable with the highest value (ideally equal to 1, the others ideally equal to 0) indicates the category of the input pattern.

Again, these two matrices are determined by an iterative process. Known patterns are placed in the input, each coded as a sequence of 0s and 1s. By comparing the output to what it should be, the matrix parameters (called weights) are modified, and the process is repeated. With luck it converges. The process is called “training.”

Afterwards you can use the matrices, now with fixed parameters, to classify future patterns. The future patterns could even vary somewhat from the original ones, and the matrices might – it doesn’t always work – classify them with the nearest original pattern.

Classifying somewhat different patterns is the whole point of the enterprise. If you knew beforehand that the input pattern exactly matched one of the training patterns, it would be trivial to find which one by simply comparing its input array with that of each pattern in turn.
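
For contrast, here is that trivial exact-match classifier in Python; unlike the network, it has no answer at all for a pattern it has not seen:

def classify_by_lookup(pattern, training_patterns, training_categories):
    # compare the input array with that of each training pattern in turn
    for known, cat in zip(training_patterns, training_categories):
        if pattern == known:
            return cat
    return None   # a never-before-seen pattern defeats the lookup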



More Detail

The line weights between the layers specify the current status of the neural net. Each node of one layer is connected to each node of the other, so these weights can be represented by matrices. In the case of one hidden layer there are three layers and two weight matrices:

weightHI(hidden, input)
weightOH(output, hidden)

where “input,” “hidden,” and “output” are indices specifying a node in the input, hidden, and output layer respectively.

The input is represented by a one-column matrix of numbers each either 0 or 1 (or in some applications, between 0 and 1 inclusive):

pattern(input)

and the output by a one-column matrix of numbers each between 0 and 1 inclusive:

outO(output)

Letting an asterisk indicate matrix multiplication:

netH  =  weightHI * pattern  :  outH  =  Transfer(netH)
netO  =  weightOH * outH  :  outO  =  Transfer(netO)
where Transfer is a “squishing” function that operates on all elements of a column matrix, taking each entry and forcing it into the range 0 to 1. One such is the sigmoid function:
Transfer(x)  =  1 / (1 + e^(-x))
For example, the special cases:
Transfer(-∞)  =  0,  Transfer(0)  =  .5,  Transfer(+∞)  =  1.
In the literature on this subject, the transfer function is sometimes called the activation function.
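
In code, those two lines are just two matrix-vector multiplications with Transfer applied elementwise. A numpy sketch using the text’s own names:

import numpy as np

def Transfer(x):
    # the sigmoid "squishing" function, applied to every element
    return 1.0 / (1.0 + np.exp(-x))

def forward(weightHI, weightOH, pattern):
    netH = weightHI @ pattern    # one entry per hidden node
    outH = Transfer(netH)
    netO = weightOH @ outH       # one entry per output node
    outO = Transfer(netO)        # each entry squeezed into 0..1
    return outO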

The “training” – the iterative process – finds the matrices  weightHI()  and  weightOH()  so that if  pattern()  belongs to the  kth  category then the  kth  entry of  outO()  is  1  and the other entries of  outO()  are  0.

In practice you can stop iterating long before  outO()  consists of 0s and 1s and simply use the largest entry as the answer. You can think of  outO(output)  as the probability that the pattern belongs to category  output.
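
Continuing the forward sketch above, taking the largest entry is one line:

outO = forward(weightHI, weightOH, np.asarray(pattern, dtype=float))
answer = int(outO.argmax()) + 1   # +1 because the example's categories run 1 to 5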

There are a lot of parameters: inputs × hiddens + hiddens × outputs of them (with the example’s 20 inputs, 15 hidden nodes, and 5 outputs, that is 20 × 15 + 15 × 5 = 375 weights), so it’s no surprise a solution exists. The problem is finding it.

The fact that the input layer and hidden layer are arranged in a line might be seen as a restriction but it isn’t. Each node is connected to every other node of adjacent layers, so how the nodes are arranged is immaterial. The input itself can be of any dimension.

A certain minimum number of nodes, depending on the problem, is required to train up to a 100% success rate.

To figure out the diagram the program draws, use the left column of the Color Key for the value of the network lines and sum them going into a node. Then use the right column of the Color Key to determine the value and color of the node. See, however, the second comment above.

Node color  =  color of Transfer(sum of net lines going in)

Here is the program:
A “Neural Network” with Post-Train Testing