Tuesday 3 September 2024

How to Configure Multilayer Perceptron Network for Time Series Forecasting

It can be difficult to get started on a new predictive modeling project with neural networks.

There is so much to configure, and no clear idea where to start.

It is important to be systematic. A systematic approach lets you discard bad assumptions quickly and home in on configurations that work and on areas of further investigation that are likely to pay off.

In this tutorial, you will discover how to use exploratory configuration of multilayer perceptron (MLP) neural networks to find good first-cut models for time series forecasting.

After completing this tutorial, you will know:

  • How to design a robust experimental test harness to evaluate MLP models for time series forecasting.
  • Systematic experimental designs for varying epochs, neurons, and lag configurations.
  • How to interpret results and use diagnostics to learn more about well-performing models.

    Tutorial Overview

    This tutorial is broken down into 6 parts. They are:

    1. Shampoo Sales Dataset
    2. Experimental Test Harness
    3. Vary Training Epochs
    4. Vary Hidden Layer Neurons
    5. Vary Hidden Layer Neurons with Lag
    6. Review of Results

    Environment

    This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

    This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend.

    This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Next, let’s take a look at a standard time series forecasting problem that we can use as context for this experiment.

    Shampoo Sales Dataset

    This dataset describes the monthly number of sales of shampoo over a 3-year period.

    The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

The example below loads the dataset and creates a line plot of it.
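A minimal loading sketch follows, assuming the dataset has been saved locally as shampoo-sales.csv (the filename and the nominal '190x' year prefix used to parse the Month column are assumptions of this sketch):

```python
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

# the 'Month' column holds year-month strings like '1-01';
# prefix a nominal year so they parse as proper dates
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = [parser(x) for x in df.index]
series = df.iloc[:, 0]  # take the single 'Sales' column as a Series

print(series.head())
series.plot()
pyplot.show()
```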

    Running the example loads the dataset as a Pandas Series and prints the first 5 rows.

    A line plot of the series is then created showing a clear increasing trend.

Line Plot of Shampoo Sales Dataset

    Next, we will take a look at the model configuration and test harness used in the experiment.

    Experimental Test Harness

    This section describes the test harness used in this tutorial.

    Data Split

    We will split the Shampoo Sales dataset into two parts: a training and a test set.

    The first two years of data will be taken for the training dataset and the remaining one year of data will be used for the test set.

    Models will be developed using the training dataset and will make predictions on the test dataset.

The persistence forecast (naive forecast) on the test dataset achieves an error of 136.761 monthly shampoo sales. This provides a baseline of acceptable performance on the test set; a skillful model must achieve a lower error.
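A minimal sketch of how this baseline can be computed, assuming series has been loaded as in the snippet above and the last 12 observations form the test set:

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

# walk-forward persistence: forecast each month as the previous month's value
values = series.values
train, test = values[:-12], values[-12:]
history = list(train)
predictions = []
for t in range(len(test)):
    predictions.append(history[-1])  # naive forecast: repeat the last observation
    history.append(test[t])          # reveal the actual value for the next step
rmse = sqrt(mean_squared_error(test, predictions))
print('Persistence RMSE: %.3f' % rmse)
```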

    Model Evaluation

    A rolling-forecast scenario will be used, also called walk-forward model validation.

The test dataset will be stepped through one time step at a time. A model will be used to make a forecast for the time step, then the actual value from the test set will be taken and made available to the model for the forecast on the next time step.

    This mimics a real-world scenario where new Shampoo Sales observations would be available each month and used in the forecasting of the following month.

    This will be simulated by the structure of the train and test datasets.

    All forecasts on the test dataset will be collected and an error score calculated to summarize the skill of the model. The root mean squared error (RMSE) will be used as it punishes large errors and results in a score that is in the same units as the forecast data, namely monthly shampoo sales.

    Data Preparation

    Before we can fit an MLP model to the dataset, we must transform the data.

    The following three data transforms are performed on the dataset prior to fitting a model and making a forecast.

    1. Transform the time series data so that it is stationary. Specifically, a lag=1 differencing to remove the increasing trend in the data.
2. Transform the time series into a supervised learning problem. Specifically, organize the data into input and output patterns where the observation at the previous time step is used as the input to forecast the observation at the current time step.
    3. Transform the observations to have a specific scale. Specifically, to rescale the data to values between -1 and 1.

These transforms are inverted on forecasts to return them to their original scale before calculating an error score.
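Hedged sketches of the three transforms and their inversions are given below; the function names follow the conventions used in the rest of this tutorial:

```python
import numpy
from pandas import DataFrame, Series, concat
from sklearn.preprocessing import MinMaxScaler

# 1. lag-1 differencing to remove the trend
def difference(dataset, interval=1):
    return Series([dataset[i] - dataset[i - interval]
                   for i in range(interval, len(dataset))])

# invert differencing for one forecast, given the raw history
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

# 2. frame the series as supervised learning: inputs t-1..t-lag, output t
def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    return concat(columns, axis=1).fillna(0)

# 3. rescale to [-1, 1], fitting the scaler on the training data only
def scale(train, test):
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train)
    return scaler, scaler.transform(train), scaler.transform(test)

# invert scaling for one forecast appended to its input row
def invert_scale(scaler, X, yhat):
    row = numpy.array(list(X) + [yhat]).reshape(1, len(X) + 1)
    return scaler.inverse_transform(row)[0, -1]
```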

    MLP Model

We will use a base MLP model with a hidden layer of 1 neuron, a rectified linear activation function on the hidden neurons, and a linear activation function on the output neuron.

    A batch size of 4 is used where possible, with the training data truncated to ensure the number of patterns is divisible by 4. In some cases a batch size of 2 is used.

Normally, the training dataset is shuffled prior to each epoch, which can aid in fitting the training dataset on classification and regression problems. Shuffling was turned off for all experiments as it seemed to result in better performance. More studies are needed to confirm this result for time series forecasting.

    The model will be fit using the efficient ADAM optimization algorithm and the mean squared error loss function.
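Put together, the model definition and fit might look like the sketch below (with recent installs the imports may need the tensorflow.keras prefix):

```python
from keras.models import Sequential
from keras.layers import Dense

# one hidden layer: rectified linear hidden neurons, a linear output neuron
def fit_model(train, batch_size, nb_epoch, neurons):
    X, y = train[:, 0:-1], train[:, -1]
    model = Sequential()
    model.add(Dense(neurons, activation='relu', input_dim=X.shape[1]))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # shuffling turned off, per the experimental setup described above
    model.fit(X, y, epochs=nb_epoch, batch_size=batch_size,
              verbose=0, shuffle=False)
    return model
```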

    Experimental Runs

Each experimental scenario will be run 30 times and the RMSE score on the test set will be recorded at the end of each run.

    Let’s dive into the experiments.

    Vary Training Epochs

    In this first experiment, we will investigate varying the number of training epochs for a simple MLP with one hidden layer and one neuron in the hidden layer.

    We will use a batch size of 4 and evaluate training epochs 50, 100, 500, 1000, and 2000.

    The complete code listing is provided below.

    This code listing will be used as the basis for all following experiments, with only the changes to this code provided in subsequent sections.
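In place of the full listing, the condensed sketch below shows the experiment() function and the driver; it reuses the transform helpers and fit_model() sketched in the previous section, along with the series loaded earlier. Trimming the 23 training patterns to 20, so they divide evenly by the batch size of 4, is an assumption of this sketch:

```python
from math import sqrt
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot

# run one configuration: repeated fit/forecast cycles, collecting test RMSE
def experiment(repeats, series, epochs):
    # transform the data: stationary, supervised, scaled
    raw_values = series.values
    diff_values = difference(raw_values, 1)
    supervised_values = timeseries_to_supervised(diff_values, 1).values
    train, test = supervised_values[0:-12], supervised_values[-12:]
    scaler, train_scaled, test_scaled = scale(train, test)
    batch_size = 4
    train_trimmed = train_scaled[3:, :]  # 20 patterns, divisible by 4
    error_scores = []
    for r in range(repeats):
        model = fit_model(train_trimmed, batch_size, epochs, 1)
        # walk-forward validation over the test set
        predictions = []
        for i in range(len(test_scaled)):
            X = test_scaled[i, 0:-1]
            yhat = model.predict(X.reshape(1, len(X)), verbose=0)[0, 0]
            yhat = invert_scale(scaler, X, yhat)
            yhat = inverse_difference(raw_values, yhat, len(test_scaled) + 1 - i)
            predictions.append(yhat)
        rmse = sqrt(mean_squared_error(raw_values[-12:], predictions))
        print('%d) Test RMSE: %.3f' % (r + 1, rmse))
        error_scores.append(rmse)
    return error_scores

# vary training epochs, 30 repeats per configuration
repeats = 30
results = DataFrame()
for epochs in [50, 100, 500, 1000, 2000]:
    results[str(epochs)] = experiment(repeats, series, epochs)
print(results.describe())
results.boxplot()
pyplot.savefig('boxplot_epochs.png')
```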

    Running the experiment prints the test set RMSE at the end of each experimental run.

At the end of all runs, a table of summary statistics is provided, with one row for each statistic and one column for each configuration.

    Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The summary statistics suggest that, on average, 1000 training epochs resulted in the best performance, with a general decreasing trend in error as the number of training epochs increased.

    A box and whisker plot of the distribution of test RMSE scores for each configuration was also created and saved to file.

    The plot highlights that each configuration shows the same general spread in test RMSE scores (box), with the median (green line) trending downward with the increase of training epochs.

The results confirm that the configured MLP trained for 1000 epochs is a good starting point on this problem.

Box and Whisker Plot of Varying Training Epochs for Time Series Forecasting on the Shampoo Sales Dataset

    Another angle to consider with a network configuration is how it behaves over time as the model is being fit.

We can evaluate the model on the training and test datasets after each training epoch to get an idea of whether the configuration is overfitting or underfitting the problem.

    We will use this diagnostic approach on the top result from each set of experiments. A total of 10 repeats of the configuration will be run and the train and test RMSE scores after each training epoch plotted on a line plot.

    In this case, we will use this diagnostic on the MLP fit for 1000 epochs.

    The complete diagnostic code listing is provided below.

    As with the previous code listing, the code listing below will be used as the basis for all diagnostics in this tutorial and only the changes to this listing will be provided in subsequent sections.
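A condensed sketch of that diagnostic follows, again reusing the transform helpers and the loaded series. fit() trains one epoch at a time so train and test RMSE can be scored after every epoch, and evaluate() assumes raw_data is aligned so that raw_data[i] is the raw observation immediately prior to row i of the split:

```python
from math import sqrt
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot

# score a fitted model on one split, reporting RMSE in the original units
def evaluate(model, raw_data, scaled_dataset, scaler, batch_size):
    X, y = scaled_dataset[:, 0:-1], scaled_dataset[:, -1]
    output = model.predict(X, batch_size=batch_size, verbose=0)
    predictions = []
    for i in range(len(output)):
        yhat = invert_scale(scaler, X[i], output[i, 0])
        predictions.append(yhat + raw_data[i])  # invert lag-1 differencing
    return sqrt(mean_squared_error(raw_data[1:], predictions))

# fit an MLP one epoch at a time, tracking train and test RMSE
def fit(train, test, raw, scaler, batch_size, nb_epoch, neurons):
    X, y = train[:, 0:-1], train[:, -1]
    model = Sequential()
    model.add(Dense(neurons, activation='relu', input_dim=X.shape[1]))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    raw_train = raw[-(len(train) + len(test) + 1):-len(test)]
    raw_test = raw[-(len(test) + 1):]
    train_rmse, test_rmse = [], []
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        train_rmse.append(evaluate(model, raw_train, train, scaler, batch_size))
        test_rmse.append(evaluate(model, raw_test, test, scaler, batch_size))
    history = DataFrame()
    history['train'], history['test'] = train_rmse, test_rmse
    return history

def run():
    # prepare the data exactly as in the experiment listing
    raw_values = series.values
    diff_values = difference(raw_values, 1)
    supervised_values = timeseries_to_supervised(diff_values, 1).values
    train, test = supervised_values[0:-12], supervised_values[-12:]
    scaler, train_scaled, test_scaled = scale(train, test)
    train_trimmed = train_scaled[3:, :]
    # 10 diagnostic repeats of the 1-neuron MLP fit for 1000 epochs
    for i in range(10):
        history = fit(train_trimmed, test_scaled, raw_values, scaler, 4, 1000, 1)
        print('%d) TrainRMSE=%.3f, TestRMSE=%.3f' %
              (i + 1, history['train'].iloc[-1], history['test'].iloc[-1]))
        pyplot.plot(history['train'], color='blue')
        pyplot.plot(history['test'], color='orange')
    pyplot.savefig('diagnostic_epochs.png')

run()
```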

    Running the diagnostic prints the final train and test RMSE for each run. More interesting is the final line plot created.

    The line plot shows the train RMSE (blue) and test RMSE (orange) after each training epoch.

    In this case, the diagnostic plot shows little difference in train and test RMSE after about 400 training epochs. Both train and test performance level out on a near flat line.

    This rapid leveling out suggests the model is reaching capacity and may benefit from more information in terms of lag observations or additional neurons.

Diagnostic Line Plot of Train and Test Performance of 1000 Epochs on the Shampoo Sales Dataset

    Vary Hidden Layer Neurons

    In this section, we will look at varying the number of neurons in the single hidden layer.

    Increasing the number of neurons can increase the learning capacity of the network at the risk of overfitting the training data.

    We will explore increasing the number of neurons from 1 to 5 and fit the network for 1000 epochs.

    The differences in the experiment script are listed below.
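A sketch of the assumed changes: experiment() takes a neurons argument in place of epochs (with the epochs fixed at 1000 inside), and the driver loops over the hidden layer sizes:

```python
# inside experiment(), the fit call becomes:
#     model = fit_model(train_trimmed, batch_size, 1000, neurons)

# driver: vary hidden neurons instead of training epochs
results = DataFrame()
for neurons in [1, 2, 3, 4, 5]:
    results[str(neurons)] = experiment(repeats, series, neurons)
print(results.describe())
results.boxplot()
pyplot.savefig('boxplot_neurons.png')
```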

    Running the experiment prints summary statistics for each configuration.

    Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Looking at the average performance, the results suggest a decrease in test RMSE with an increase in the number of neurons in the single hidden layer.

    The best results appear to be with 3 neurons.

    A box and whisker plot is also created to summarize and compare the distributions of results.

The plot confirms the suggestion that 3 neurons perform well compared to the other configurations, and further suggests that the spread of results is smaller. This may indicate a more stable configuration.

Box and Whisker Plot of Varying Hidden Neurons for Time Series Forecasting on the Shampoo Sales Dataset

    Again, we can dive a little deeper by reviewing diagnostics of the chosen configuration of 3 neurons fit for 1000 epochs.

    The changes to the diagnostic script are limited to the run() function and listed below.
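The assumed change is a single line inside run(), passing 3 hidden neurons to fit():

```python
# inside run(): 3 hidden neurons instead of 1, still 1000 epochs, batch size 4
history = fit(train_trimmed, test_scaled, raw_values, scaler, 4, 1000, 3)
```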

    Running the diagnostic script provides a line plot of train and test RMSE for each training epoch.

    The diagnostics suggest a flattening out of model skill, perhaps around 400 epochs. The plot also suggests a possible situation of overfitting where there is a slight increase in test RMSE over the last 500 training epochs, but not a strong increase in training RMSE.

Diagnostic Line Plot of Train and Test Performance of 3 Hidden Neurons on the Shampoo Sales Dataset

    Vary Hidden Layer Neurons with Lag

    In this section, we will look at increasing the lag observations as input, whilst at the same time increasing the capacity of the network.

    Increased lag observations will automatically scale the number of input neurons. For example, 3 lag observations as input will result in 3 input neurons.

    The added input will require additional capacity in the network. As such, we will also scale the number of neurons in the one hidden layer with the number of lag observations used as input.

We will use odd numbers of lag observations as input (1, 3, 5, and 7) and use the same number of hidden neurons in each case.

    The change to the number of inputs affects the total number of training patterns during the conversion of the time series data to a supervised learning problem. As such, the batch size was reduced from 4 to 2 for all experiments in this section.

    A total of 1000 training epochs are used in each experimental run.

    The changes from the base experiment script are limited to the experiment() function and the running of the experiment, listed below.
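A hedged sketch of those changes: experiment() takes the number of lags, frames the supervised problem with that many inputs, drops the zero-padded leading rows, sizes the hidden layer to match, and uses a batch size of 2:

```python
def experiment(repeats, series, lags):
    raw_values = series.values
    diff_values = difference(raw_values, 1)
    # frame with 'lags' input columns and drop the zero-padded leading rows
    supervised_values = timeseries_to_supervised(diff_values, lags).values[lags:, :]
    train, test = supervised_values[0:-12], supervised_values[-12:]
    scaler, train_scaled, test_scaled = scale(train, test)
    batch_size = 2
    # trim so the training patterns divide evenly by the batch size
    train_trimmed = train_scaled[len(train_scaled) % batch_size:, :]
    error_scores = []
    for r in range(repeats):
        # hidden neurons scaled with the number of lag inputs
        model = fit_model(train_trimmed, batch_size, 1000, lags)
        predictions = []
        for i in range(len(test_scaled)):
            X = test_scaled[i, 0:-1]
            yhat = model.predict(X.reshape(1, len(X)), verbose=0)[0, 0]
            yhat = invert_scale(scaler, X, yhat)
            yhat = inverse_difference(raw_values, yhat, len(test_scaled) + 1 - i)
            predictions.append(yhat)
        error_scores.append(sqrt(mean_squared_error(raw_values[-12:], predictions)))
    return error_scores

# driver: vary the number of lag inputs (and matching hidden neurons)
results = DataFrame()
for lags in [1, 3, 5, 7]:
    results[str(lags)] = experiment(30, series, lags)
print(results.describe())
results.boxplot()
pyplot.savefig('boxplot_lags.png')
```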

    Running the experiment summarizes the results using descriptive statistics for each configuration.

    Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The results suggest that each increase in lag inputs, with the matching increase in hidden neurons, decreased performance.

Of note is the 1-neuron, 1-input configuration, which produced a similar mean and standard deviation compared to the results from the previous section.

    It is possible that the decrease in performance is related to the smaller batch size and that the results from the 1-neuron/1-lag case are insufficient to tease this out.

    A box and whisker plot of the distribution of results was also created allowing configurations to be compared.

    Interestingly, the use of 3 neurons and 3 input variables shows a tighter spread compared to the other configurations. This is similar to the observation from 3 neurons and 1 input variable seen in the previous section.

Box and Whisker Plot of Varying Lag Features and Hidden Neurons for Time Series Forecasting on the Shampoo Sales Dataset

    We can also use diagnostics to tease out how the dynamics of the model might have changed while fitting the model.

    The results for 3-lags/3-neurons are interesting and we will investigate them further.

    The changes to the diagnostic script are confined to the run() function.
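The assumed changes inside run() reframe the data with 3 lag inputs and fit 3 hidden neurons with a batch size of 2:

```python
# inside run(): frame with 3 lag inputs and drop the 3 zero-padded rows
supervised_values = timeseries_to_supervised(diff_values, 3).values[3:, :]
train, test = supervised_values[0:-12], supervised_values[-12:]
scaler, train_scaled, test_scaled = scale(train, test)
train_trimmed = train_scaled  # 20 patterns already divide evenly by batch size 2
history = fit(train_trimmed, test_scaled, raw_values, scaler, 2, 1000, 3)
```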

    Running the diagnostics script creates a line plot showing the train and test RMSE after each training epoch for 10 experimental runs.

    The results suggest good learning during the first 500 epochs and perhaps overfitting in the remaining epochs with the test RMSE showing an increasing trend and the train RMSE showing a decreasing trend.

Diagnostic Line Plot of Train and Test Performance of 3 Hidden Neurons and Lag Features on the Shampoo Sales Dataset

    Review of Results

    We have covered a lot of ground in this tutorial. Let’s review.

    • Epochs. We looked at how model skill varied with the number of training epochs and found that 1000 might be a good starting point.
    • Neurons. We looked at varying the number of neurons in the hidden layer and found that 3 neurons might be a good configuration.
• Lag Inputs. We looked at varying the number of lag observations as inputs whilst at the same time increasing the number of neurons in the hidden layer, and found that results generally got worse, but again, 3 neurons in the hidden layer remained of interest. Poor results may have been related to the change of batch size from 4 to 2 compared to the other experiments.

The results suggest using 1 lag input, 3 neurons in the hidden layer, and fitting for 1000 epochs as a first-cut model configuration.

    This can be improved upon in many ways; the next section lists some ideas.

    Extensions

    This section lists extensions and follow-up experiments you might like to explore.

• Shuffle vs No Shuffle. No shuffling of the training data was used, which is unusual. Develop an experiment to compare shuffling to no shuffling of the training set when fitting the model for time series forecasting.
• Normalization Method. Data was rescaled to the range -1 to 1, which is typical for a tanh activation function that was not used in these model configurations. Explore other rescaling, such as 0-1 normalization and standardization, and the impact on model performance.
    • Multiple Layers. Explore the use of multiple hidden layers to add network capacity to learn more complex multi-step patterns.
    • Feature Engineering. Explore the use of additional features, such as an error time series and even elements of the date-time of each observation.

      Summary

      In this tutorial, you discovered how to use systematic experiments to explore the configuration of a multilayer perceptron for time series forecasting and develop a first-cut model.

      Specifically, you learned:

      • How to develop a robust test harness for evaluating MLP models for time series forecasting.
      • How to systematically evaluate training epochs, hidden layer neurons, and lag inputs.
      • How to use diagnostics to help interpret results and suggest follow-up experiments.

      Do you have any questions about this tutorial?
      Ask your questions in the comments below and I will do my best to answer.
