Saturday 13 July 2024

How to Code a Neural Network with Backpropagation In Python (from scratch)

 The backpropagation algorithm is used in the classical feed-forward artificial neural network.

It is the technique still used to train large deep learning networks.

In this tutorial, you will discover how to implement the backpropagation algorithm for a neural network from scratch with Python.

After completing this tutorial, you will know:

  • How to forward-propagate an input to calculate an output.
  • How to back-propagate error and train a network.
  • How to apply the backpropagation algorithm to a real-world predictive modeling problem.

    Description

    This section provides a brief introduction to the Backpropagation Algorithm and the Wheat Seeds dataset that we will be using in this tutorial.

    Backpropagation Algorithm

    The Backpropagation algorithm is a supervised learning method for multilayer feed-forward networks from the field of Artificial Neural Networks.

    Feed-forward neural networks are inspired by the information processing of one or more neural cells, called neurons. A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon carries the signal out to synapses, which are the connections of a cell’s axon to other cells’ dendrites.

    The principle of the backpropagation approach is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system’s output and a known expected output is presented to the system and used to modify its internal state.

    Technically, the backpropagation algorithm is a method for training the weights in a multilayer feed-forward neural network. As such, it requires a network structure to be defined of one or more layers where one layer is fully connected to the next layer. A standard network structure is one input layer, one hidden layer, and one output layer.

    Backpropagation can be used for both classification and regression problems, but we will focus on classification in this tutorial.

    In classification problems, best results are achieved when the network has one neuron in the output layer for each class value. For example, consider a 2-class or binary classification problem with the class values of A and B. These expected outputs would have to be transformed into binary vectors with one column for each class value, such as [1, 0] and [0, 1] for A and B respectively. This is called a one hot encoding.

    Wheat Seeds Dataset

    The seeds dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

    There are 201 records and 7 numerical input variables. It is a classification problem with 3 output classes. The scale of each numeric input variable varies, so some data normalization may be required for use with algorithms that weight inputs, like the backpropagation algorithm.

    Below is a sample of the first 5 rows of the dataset.

    Using the Zero Rule algorithm that predicts the most common class value, the baseline accuracy for the problem is 28.095%.

    You can learn more and download the seeds dataset from the UCI Machine Learning Repository.

    Download the seeds dataset and place it into your current working directory with the filename seeds_dataset.csv.

    The dataset is in tab-separated format, so you must convert it to CSV using a text editor or a spreadsheet program.

    Update: you can also download the dataset in CSV format directly.

    Tutorial

    This tutorial is broken down into 6 parts:

    1. Initialize Network.
    2. Forward Propagate.
    3. Back Propagate Error.
    4. Train Network.
    5. Predict.
    6. Seeds Dataset Case Study.

    These steps will provide the foundation that you need to implement the backpropagation algorithm from scratch and apply it to your own predictive modeling problems.

    1. Initialize Network

    Let’s start with something easy, the creation of a new network ready for training.

    Each neuron has a set of weights that need to be maintained. One weight for each input connection and an additional weight for the bias. We will need to store additional properties for a neuron during training, therefore we will use a dictionary to represent each neuron and store properties by names such as ‘weights‘ for the weights.

    A network is organized into layers. The input layer is really just a row from our training dataset. The first real layer is the hidden layer. This is followed by the output layer that has one neuron for each class value.

    We will organize layers as arrays of dictionaries and treat the whole network as an array of layers.

    It is good practice to initialize the network weights to small random numbers. In this case, we will use random numbers in the range of 0 to 1.

    Below is a function named initialize_network() that creates a new neural network ready for training. It accepts three parameters, the number of inputs, the number of neurons to have in the hidden layer and the number of outputs.

    You can see that for the hidden layer we create n_hidden neurons and each neuron in the hidden layer has n_inputs + 1 weights, one for each input column in a dataset and an additional one for the bias.

    You can also see that the output layer that connects to the hidden layer has n_outputs neurons, each with n_hidden + 1 weights. This means that each neuron in the output layer connects to (has a weight for) each neuron in the hidden layer.

    Let’s test out this function. Below is a complete example that creates a small network.
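
    One way this could look in code is sketched below, using Python’s built-in random module; the seed value is fixed only so the run is repeatable.

from random import seed, random

# Initialize a network as a list of layers; each layer is a list of neuron dictionaries
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = list()
    # hidden layer: n_hidden neurons, each with n_inputs weights plus a bias weight
    hidden_layer = [{'weights': [random() for _ in range(n_inputs + 1)]} for _ in range(n_hidden)]
    network.append(hidden_layer)
    # output layer: n_outputs neurons, each with n_hidden weights plus a bias weight
    output_layer = [{'weights': [random() for _ in range(n_hidden + 1)]} for _ in range(n_outputs)]
    network.append(output_layer)
    return network

# Test: a network with 2 inputs, 1 hidden neuron and 2 outputs
seed(1)
network = initialize_network(2, 1, 2)
for layer in network:
    print(layer)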

    Running the example, you can see that the code prints out each layer one by one. You can see the hidden layer has one neuron with 2 input weights plus the bias. The output layer has 2 neurons, each with 1 weight plus the bias.

    Now that we know how to create and initialize a network, let’s see how we can use it to calculate an output.

    2. Forward Propagate

    We can calculate an output from a neural network by propagating an input signal through each layer until the output layer outputs its values.

    We call this forward-propagation.

    It is the technique we will need to generate predictions during training that will need to be corrected, and it is the method we will need after the network is trained to make predictions on new data.

    We can break forward propagation down into three parts:

    1. Neuron Activation.
    2. Neuron Transfer.
    3. Forward Propagation.

    2.1. Neuron Activation

    The first step is to calculate the activation of one neuron given an input.

    The input could be a row from our training dataset, as in the case of the hidden layer. It may also be the outputs from each neuron in the hidden layer, in the case of the output layer.

    Neuron activation is calculated as the weighted sum of the inputs, much like linear regression:

    activation = sum(weight_i * input_i) + bias

    Where weight is a network weight, input is an input, i is the index of a weight or an input, and bias is a special weight that has no input to multiply with (or you can think of the input as always being 1.0).

    Below is an implementation of this in a function named activate(). You can see that the function assumes that the bias is the last weight in the list of weights. This helps here and later to make the code easier to read.
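
    A minimal sketch of such a function:

# Calculate neuron activation (weighted sum of inputs plus bias) for an input row
def activate(weights, inputs):
    activation = weights[-1]  # the bias is the last weight in the list
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation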

    Now, let’s see how to use the neuron activation.

    2.2. Neuron Transfer

    Once a neuron is activated, we need to transfer the activation to see what the neuron output actually is.

    Different transfer functions can be used. It is traditional to use the sigmoid activation function, but you can also use the tanh (hyperbolic tangent) function to transfer outputs. More recently, the rectifier transfer function has been popular with large deep learning networks.

    The sigmoid activation function looks like an S shape; it is also called the logistic function. It can take any input value and produce a number between 0 and 1 on an S-curve. It is also a function whose derivative (slope) we can easily calculate, which we will need later when backpropagating error.

    We can transfer an activation using the sigmoid function as follows:

    output = 1 / (1 + e^(-activation))

    Where e is the base of the natural logarithms (Euler’s number).

    Below is a function named transfer() that implements the sigmoid equation.
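
    A sketch of this function, using math.exp:

from math import exp

# Transfer a neuron activation using the sigmoid (logistic) function
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))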

    Now that we have the pieces, let’s see how they are used.

    2.3. Forward Propagation

    Forward propagating an input is straightforward.

    We work through each layer of our network calculating the outputs for each neuron. All of the outputs from one layer become inputs to the neurons on the next layer.

    Below is a function named forward_propagate() that implements the forward propagation for a row of data from our dataset with our neural network.
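
    A sketch of this function, assuming the activate() and transfer() functions defined above are in scope:

# Forward propagate an input row through the network and return the output layer's outputs
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)  # store the output on the neuron
            new_inputs.append(neuron['output'])
        inputs = new_inputs  # outputs of this layer become inputs to the next layer
    return inputs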

    You can see that a neuron’s output value is stored in the neuron with the name ‘output‘. You can also see that we collect the outputs for a layer in an array named new_inputs that becomes the array inputs and is used as inputs for the following layer.

    The function returns the outputs from the last layer, also called the output layer.

    Let’s put all of these pieces together and test out the forward propagation of our network.

    We define our network inline with one hidden neuron that expects 2 input values and an output layer with two neurons.
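
    A sketch of the complete example; the inline weight values are illustrative placeholders, and activate(), transfer() and forward_propagate() from above are assumed to be defined in the same script:

# A fixed network defined inline: one hidden neuron (2 inputs + bias) and
# two output neurons (1 input + bias each); the weight values are illustrative
network = [
    [{'weights': [0.13, 0.85, 0.76]}],
    [{'weights': [0.26, 0.50]}, {'weights': [0.45, 0.65]}],
]
row = [1, 0, None]  # two input values; the class placeholder is ignored
output = forward_propagate(network, row)
print(output)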

    Running the example propagates the input pattern [1, 0] and produces an output value that is printed. Because the output layer has two neurons, we get a list of two numbers as output.

    The actual output values are just nonsense for now, but next, we will start to learn how to make the weights in the neurons more useful.

    3. Back Propagate Error

    The backpropagation algorithm is named for the way in which weights are trained.

    Error is calculated between the expected outputs and the outputs forward propagated from the network. These errors are then propagated backward through the network from the output layer to the hidden layer, assigning blame for the error and updating weights as they go.

    The math for backpropagating error is rooted in calculus, but we will remain high level in this section and focus on what is calculated and how rather than why the calculations take this particular form.

    This part is broken down into two sections.

    1. Transfer Derivative.
    2. Error Backpropagation.

    3.1. Transfer Derivative

    Given an output value from a neuron, we need to calculate its slope.

    We are using the sigmoid transfer function, the derivative of which can be calculated as follows:

    derivative = output * (1.0 - output)

    Below is a function named transfer_derivative() that implements this equation.
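
    A sketch of this function:

# Calculate the derivative (slope) of the sigmoid transfer function from a neuron's output
def transfer_derivative(output):
    return output * (1.0 - output)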

    Now, let’s see how this can be used.

    3.2. Error Backpropagation

    The first step is to calculate the error for each output neuron; this will give us our error signal (input) to propagate backwards through the network.

    The error for a given neuron can be calculated as follows:

    error = (expected - output) * transfer_derivative(output)

    Where expected is the expected output value for the neuron, output is the output value for the neuron, and transfer_derivative() calculates the slope of the neuron’s output value, as shown above.

    This error calculation is used for neurons in the output layer. The expected value is the class value itself. In the hidden layer, things are a little more complicated.

    The error signal for a neuron in the hidden layer is calculated as the weighted error of each neuron in the output layer. Think of the error traveling back along the weights of the output layer to the neurons in the hidden layer.

    The back-propagated error signal is accumulated and then used to determine the error for a neuron in the hidden layer, as follows:

    error = (weight_k * error_j) * transfer_derivative(output)

    Where error_j is the error signal from the jth neuron in the output layer, weight_k is the weight that connects the kth neuron to the current neuron, and output is the output for the current neuron.

    Below is a function named backward_propagate_error() that implements this procedure.
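
    A sketch of this procedure, assuming the transfer_derivative() function above is in scope and using the error calculations described above:

# Backpropagate error and store a per-neuron 'delta' value
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network) - 1:
            # hidden layer: accumulate weighted deltas from the layer in front
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += neuron['weights'][j] * neuron['delta']
                errors.append(error)
        else:
            # output layer: difference between expected and actual output
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(expected[j] - neuron['output'])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])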

    You can see that the error signal calculated for each neuron is stored with the name ‘delta’. You can see that the layers of the network are iterated in reverse order, starting at the output and working backwards. This ensures that the neurons in the output layer have ‘delta’ values calculated first that neurons in the hidden layer can use in the subsequent iteration. I chose the name ‘delta’ to reflect the change the error implies on the neuron (e.g. the weight delta).

    You can see that the error signal for neurons in the hidden layer is accumulated from neurons in the output layer where the hidden neuron number j is also the index of the neuron’s weight in the output layer neuron[‘weights’][j].

    Let’s put all of the pieces together and see how it works.

    We define a fixed neural network with output values and backpropagate an expected output pattern. The complete example is listed below.
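
    A sketch of such an example; the weights and outputs below are illustrative placeholder values, and backward_propagate_error() and transfer_derivative() from above are assumed to be defined in the same script:

# A fixed network with precomputed 'output' values (illustrative numbers)
network = [
    [{'weights': [0.13, 0.85, 0.76], 'output': 0.71}],
    [{'weights': [0.26, 0.50], 'output': 0.62},
     {'weights': [0.45, 0.65], 'output': 0.65}],
]
expected = [0, 1]
backward_propagate_error(network, expected)
for layer in network:
    print(layer)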

    Running the example prints the network after the backpropagation of error is complete. You can see that error values are calculated and stored in the neurons for the output layer and the hidden layer.

    Now let’s use the backpropagation of error to train the network.

    4. Train Network

    The network is trained using stochastic gradient descent.

    This involves multiple iterations of exposing a training dataset to the network and for each row of data forward propagating the inputs, backpropagating the error and updating the network weights.

    This part is broken down into two sections:

    1. Update Weights.
    2. Train Network.

    4.1. Update Weights

    Once errors are calculated for each neuron in the network via the back propagation method above, they can be used to update weights.

    Network weights are updated as follows:

    weight = weight + learning_rate * error * input

    Where weight is a given weight, learning_rate is a parameter that you must specify, error is the error calculated by the backpropagation procedure for the neuron, and input is the input value that caused the error.

    The same procedure can be used for updating the bias weight, except there is no input term, or input is the fixed value of 1.0.

    The learning rate controls how much to change the weight to correct for the error. For example, a value of 0.1 will update the weight by 10% of the amount that it possibly could be updated. Small learning rates are preferred because they cause slower learning over a large number of training iterations. This increases the likelihood of the network finding a good set of weights across all layers, rather than the fastest set of weights that minimizes error (called premature convergence).

    Below is a function named update_weights() that updates the weights for a network given an input row of data and a learning rate; it assumes that forward and backward propagation have already been performed.
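
    A sketch of this function, consistent with the weight update rule above:

# Update network weights with error; assumes forward and backward passes have been run
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]  # inputs to the first layer come from the data row
        if i != 0:
            # inputs to later layers are the outputs of the previous layer
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
            # the bias weight has no input term (the input is effectively 1.0)
            neuron['weights'][-1] += l_rate * neuron['delta']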

    Remember that the input for the output layer is a collection of outputs from the hidden layer.

    Now we know how to update network weights, let’s see how we can do it repeatedly.

    4.2. Train Network

    As mentioned, the network is updated using stochastic gradient descent.

    This involves first looping for a fixed number of epochs and within each epoch updating the network for each row in the training dataset.

    Because updates are made for each training pattern, this type of learning is called online learning. If errors were accumulated across an epoch before updating the weights, this is called batch learning or batch gradient descent.

    Below is a function that implements the training of an already initialized neural network with a given training dataset, learning rate, fixed number of epochs and an expected number of output values.
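
    A sketch of such a function, assuming forward_propagate(), backward_propagate_error() and update_weights() from above are defined in the same script:

# Train a network for a fixed number of epochs using stochastic gradient descent
def train_network(network, train, l_rate, n_epoch, n_outputs):
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            outputs = forward_propagate(network, row)
            # one hot encode the integer class value held in the last column of the row
            expected = [0 for _ in range(n_outputs)]
            expected[row[-1]] = 1
            sum_error += sum((expected[i] - outputs[i]) ** 2 for i in range(len(expected)))
            backward_propagate_error(network, expected)
            update_weights(network, row, l_rate)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))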

    The expected number of output values is used to transform class values in the training data into a one hot encoding. That is a binary vector with one column for each class value to match the output of the network. This is required to calculate the error for the output layer.

    You can also see that the sum squared error between the expected output and the network output is accumulated each epoch and printed. This is helpful to create a trace of how much the network is learning and improving each epoch.

    We now have all of the pieces to train the network. We can put together an example that includes everything we have seen so far, including network initialization, and train a network on a small dataset.

    Below is a small contrived dataset that we can use to test out training our neural network.
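
    As a stand-in, here is a small, hypothetical two-class dataset with two numeric inputs and an integer class label in the last column:

# A small, hypothetical two-class dataset: two numeric inputs and a class label (0 or 1)
dataset = [
    [2.78, 2.55, 0],
    [1.46, 2.36, 0],
    [3.39, 4.40, 0],
    [1.38, 1.85, 0],
    [3.06, 3.00, 0],
    [7.62, 2.75, 1],
    [5.33, 2.08, 1],
    [6.92, 1.77, 1],
    [8.67, -0.24, 1],
    [7.67, 3.50, 1],
]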

    Below is the complete example. We will use 2 neurons in the hidden layer. It is a binary classification problem (2 classes) so there will be two neurons in the output layer. The network will be trained for 20 epochs with a learning rate of 0.5, which is high because we are training for so few iterations.
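
    A sketch of the driver code, assuming all of the functions defined earlier and the dataset above are in one script:

from random import seed

seed(1)
n_inputs = len(dataset[0]) - 1                      # 2 input columns
n_outputs = len(set(row[-1] for row in dataset))    # 2 classes
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, dataset, 0.5, 20, n_outputs)
for layer in network:
    print(layer)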

    Running the example first prints the sum squared error each training epoch. We can see a trend of this error decreasing with each epoch.

    Once trained, the network is printed, showing the learned weights. Also still in the network are output and delta values that can be ignored. We could update our training function to delete these data if we wanted.

    Once a network is trained, we need to use it to make predictions.

    5. Predict

    Making predictions with a trained neural network is easy enough.

    We have already seen how to forward-propagate an input pattern to get an output. This is all we need to do to make a prediction. We can use the output values themselves directly as the probability of a pattern belonging to each output class.

    It may be more useful to turn this output back into a crisp class prediction. We can do this by selecting the class value with the largest probability. This is also called the arg max function.

    Below is a function named predict() that implements this procedure. It returns the index in the network output that has the largest probability. It assumes that class values have been converted to integers starting at 0.
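
    A sketch of this function:

# Make a prediction with a network: the index of the largest output is the predicted class
def predict(network, row):
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))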

    We can put this together with our code above for forward propagating input and with our small contrived dataset to test making predictions with an already-trained network. The example hardcodes a network trained from the previous step.

    The complete example is listed below.
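
    A sketch of such an example; rather than hardcoding the trained weights, this version simply retrains the small network and then makes a prediction for each row (the functions and dataset from earlier are assumed to be in the same script):

from random import seed

seed(1)
n_inputs = len(dataset[0]) - 1
n_outputs = len(set(row[-1] for row in dataset))
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, dataset, 0.5, 20, n_outputs)
for row in dataset:
    prediction = predict(network, row)
    print('Expected=%d, Got=%d' % (row[-1], prediction))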

    Running the example prints the expected output for each record in the training dataset, followed by the crisp prediction made by the network.

    It shows that the network achieves 100% accuracy on this small dataset.

    Now we are ready to apply our backpropagation algorithm to a real world dataset.

    6. Wheat Seeds Dataset

    This section applies the Backpropagation algorithm to the wheat seeds dataset.

    The first step is to load the dataset and convert the loaded data to numbers that we can use in our neural network. For this we will use the helper function load_csv() to load the file, str_column_to_float() to convert string numbers to floats and str_column_to_int() to convert the class column to integer values.

    Input values vary in scale and need to be normalized to the range of 0 to 1. It is generally good practice to normalize input values to the range of the chosen transfer function, in this case, the sigmoid function that outputs values between 0 and 1. The dataset_minmax() and normalize_dataset() helper functions were used to normalize the input values.

    We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 201/5=40.2 or 40 records will be in each fold. We will use the helper functions evaluate_algorithm() to evaluate the algorithm with cross-validation and accuracy_metric() to calculate the accuracy of predictions.

    A new function named back_propagation() was developed to manage the application of the Backpropagation algorithm, first initializing a network, training it on the training dataset and then using the trained network to make predictions on a test dataset.
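
    A sketch of such a function, reusing initialize_network(), train_network() and predict() from earlier:

# Backpropagation with stochastic gradient descent, packaged for evaluate_algorithm()
def back_propagation(train, test, l_rate, n_epoch, n_hidden):
    n_inputs = len(train[0]) - 1
    n_outputs = len(set(row[-1] for row in train))
    network = initialize_network(n_inputs, n_hidden, n_outputs)
    train_network(network, train, l_rate, n_epoch, n_outputs)
    predictions = [predict(network, row) for row in test]
    return predictions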

    The complete example is listed below.
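
    A sketch of the case study, assuming the network functions developed earlier (initialize_network(), train_network(), predict(), back_propagation() and their dependencies) are defined in the same script; the cross_validation_split() helper name is an assumption, and the helper implementations below are approximations of the helpers named in the text:

from random import seed, randrange
from csv import reader

# Load a CSV file into a list of rows (a header-free seeds_dataset.csv is assumed)
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        for row in reader(file):
            if row:
                dataset.append(row)
    return dataset

# Convert a string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert the class column to integer values 0..n-1
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    lookup = {value: i for i, value in enumerate(sorted(set(class_values)))}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Find the min and max value of each column
def dataset_minmax(dataset):
    return [[min(col), max(col)] for col in zip(*dataset)]

# Rescale the input columns (all but the class column) to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row) - 1):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Split a dataset into k folds (helper name assumed here)
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using k-fold cross-validation
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = [row for f in folds if f is not fold for row in f]
        test_set = [list(row) for row in fold]
        for row in test_set:
            row[-1] = None  # hide the class value from the algorithm
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        scores.append(accuracy_metric(actual, predicted))
    return scores

# Evaluate backpropagation on the seeds dataset
seed(1)
dataset = load_csv('seeds_dataset.csv')
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
str_column_to_int(dataset, len(dataset[0]) - 1)
minmax = dataset_minmax(dataset)
normalize_dataset(dataset, minmax)
n_folds, l_rate, n_epoch, n_hidden = 5, 0.3, 500, 5
scores = evaluate_algorithm(dataset, back_propagation, n_folds, l_rate, n_epoch, n_hidden)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))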

    A network with 5 neurons in the hidden layer and 3 neurons in the output layer was constructed. The network was trained for 500 epochs with a learning rate of 0.3. These parameters were found with a little trial and error, but you may be able to do much better.

    Running the example prints the average classification accuracy on each fold as well as the average performance across all folds.

    You can see that backpropagation with the chosen configuration achieved a mean classification accuracy of about 93%, which is dramatically better than the Zero Rule algorithm’s baseline of slightly better than 28% accuracy.

    Extensions

    This section lists extensions to the tutorial that you may wish to explore.

    • Tune Algorithm Parameters. Try larger or smaller networks trained for more or fewer epochs. See if you can get better performance on the seeds dataset.
    • Additional Methods. Experiment with different weight initialization techniques (such as small random numbers) and different transfer functions (such as tanh).
    • More Layers. Add support for more hidden layers, trained in just the same way as the one hidden layer used in this tutorial.
    • Regression. Change the network so that there is only one neuron in the output layer and that a real value is predicted. Pick a regression dataset to practice on. A linear transfer function could be used for neurons in the output layer, or the output values of the chosen dataset could be scaled to values between 0 and 1.
    • Batch Gradient Descent. Change the training procedure from online to batch gradient descent and update the weights only at the end of each epoch.

    Did you try any of these extensions?
    Share your experiences in the comments below.
