
Sunday, 30 June 2024

How To Implement Baseline Machine Learning Algorithms From Scratch With Python

 It is important to establish baseline performance on a predictive modeling problem.

A baseline provides a point of comparison for the more advanced methods that you evaluate later.

In this tutorial, you will discover how to implement baseline machine learning algorithms from scratch in Python.

After completing this tutorial, you will know:

  • How to implement the random prediction algorithm.
  • How to implement the zero rule prediction algorithm.

    Description

    There are many machine learning algorithms to choose from. Hundreds in fact.

    You must know whether the predictions for a given algorithm are good or not. But how do you know?

    The answer is to use a baseline prediction algorithm. A baseline prediction algorithm provides a set of predictions that you can evaluate as you would any predictions for your problem, such as classification accuracy or RMSE.

    The scores from these algorithms provide the required point of comparison when evaluating all other machine learning algorithms on your problem.

    Once established, you can comment on how much better a given algorithm is as compared to the naive baseline algorithm, providing context on just how good a given method actually is.

    The two most commonly used baseline algorithms are:

    • Random Prediction Algorithm.
    • Zero Rule Algorithm.

    When starting on a new problem that does not fit neatly into a conventional classification or regression framing, it is a good idea to first devise a random prediction algorithm that is specific to your prediction problem. Later, you can improve upon this and devise a zero rule algorithm.

    Let’s implement these algorithms and see how they work.

    Tutorial

    This tutorial is divided into 2 parts:

    1. Random Prediction Algorithm.
    2. Zero Rule Algorithm.

    These steps will provide the foundations you need to handle implementing and calculating baseline performance for your machine learning algorithms.

    1. Random Prediction Algorithm

    The random prediction algorithm predicts a random outcome as observed in the training data.

    It is perhaps the simplest algorithm to implement.

    It requires that you store all of the distinct outcome values observed in the training data, which could be a large set on regression problems with many distinct values.

    Because random numbers are used to make decisions, it is a good idea to fix the random number seed prior to using the algorithm. This is to ensure that we get the same set of random numbers, and in turn the same decisions each time the algorithm is run.

    Below is an implementation of the Random Prediction Algorithm in a function named random_algorithm().

    The function takes both a training dataset that includes output values and a test dataset for which output values must be predicted.

    The function will work for both classification and regression problems. It assumes that the output value in the training data is the final column for each row.

    First, the set of unique output values is collected from the training data. Then a random output value is selected from that set for each row in the test set.
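
    A minimal sketch of such a function, following the description above, is shown here. The only imports needed come from the standard library random module (seed() is imported for use in the test example further below).

    from random import seed, randrange

    # Make random predictions by drawing from the unique output
    # values observed in the training data.
    def random_algorithm(train, test):
        output_values = [row[-1] for row in train]
        unique = list(set(output_values))
        predicted = list()
        for _ in range(len(test)):
            index = randrange(len(unique))
            predicted.append(unique[index])
        return predicted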

    We can test this function with a small dataset that only contains the output column for simplicity.

    The output values in the training dataset are either “0” or “1”, meaning that the set of predictions the algorithm will choose from is {0, 1}. The test set also contains a single column, with no data as the predictions are not known.
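
    A small test of the function might look as follows, with the random number seed fixed first so that each run produces the same predictions:

    seed(1)
    train = [[0], [1], [0], [1], [0], [1]]
    test = [[None], [None], [None], [None]]
    predictions = random_algorithm(train, test)
    print(predictions)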

    Running the example calculates random predictions for the test dataset and prints those predictions.

    The random prediction algorithm is easy to implement and fast to run, but we could do better as a baseline.

    2. Zero Rule Algorithm

    The Zero Rule Algorithm is a better baseline than the random algorithm.

    It uses more information about a given problem to create one rule in order to make predictions. This rule is different depending on the problem type.

    Let’s start with classification problems, predicting a class label.

    Classification

    For classification problems, the one rule is to predict the class value that is most common in the training dataset. This means that if a training dataset has 90 instances of class “0” and 10 instances of class “1” that it will predict “0” and achieve a baseline accuracy of 90/100 or 90%.

    This is much better than the random prediction algorithm, which would achieve only about 82% accuracy on average. That estimate assumes the random predictions are drawn in proportion to the class frequencies observed in the training data: the expected accuracy is 0.9 × 0.9 + 0.1 × 0.1 = 0.82, or 82%. (Drawing uniformly from the set of unique class values, as in the implementation above, would average only 50%.)

    Below is a function named zero_rule_algorithm_classification() that implements this for the classification case.
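
    A minimal sketch of that function, following directly from the description of the rule:

    # Predict the class value that is most common in the training data.
    def zero_rule_algorithm_classification(train, test):
        output_values = [row[-1] for row in train]
        prediction = max(set(output_values), key=output_values.count)
        predicted = [prediction for _ in range(len(test))]
        return predicted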

    The function makes use of the max() function with the key argument, which is a little clever.

    Given the list of class values observed in the training data, max() is applied to the set of unique class values, and the key calls count() on the full list of class values for each class value in the set.

    The result is that it returns the class value that has the highest count of observed values in the list of class values observed in the training dataset.

    If all class values have the same count, then we will choose the first class value observed in the dataset.

    Once we select a class value, it is used to make a prediction for each row in the test dataset.

    Below is a worked example with a contrived dataset that contains 4 examples of class “0” and 2 examples of class “1”. We would expect the algorithm to choose the class value “0” as the prediction for each row in the test dataset.
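
    A sketch of that worked example:

    train = [['0'], ['0'], ['0'], ['0'], ['1'], ['1']]
    test = [[None], [None], [None], [None]]
    predictions = zero_rule_algorithm_classification(train, test)
    print(predictions)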

    Running this example makes the predictions and prints them to screen. As expected, the class value of “0” was chosen and predicted.

    Now, let’s see the Zero Rule Algorithm for regression problems.

    Regression

    Regression problems require the prediction of a real value.

    A good default prediction for real values is to predict the central tendency. This could be the mean or the median.

    A good default is to use the mean (also called the average) of the output value observed in the training data.

    This is likely to have a lower error than random prediction which will return any observed output value.

    Below is a function to do that named zero_rule_algorithm_regression(). It works by calculating the mean value for the observed output values.

    Once calculated, the mean is then predicted for each row in the test data.
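
    A minimal sketch of the function:

    # Predict the mean of the output values observed in the training data.
    def zero_rule_algorithm_regression(train, test):
        output_values = [row[-1] for row in train]
        prediction = sum(output_values) / float(len(output_values))
        predicted = [prediction for _ in range(len(test))]
        return predicted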

    This function can be tested with a simple example.

    We can contrive a small dataset where the mean value is known to be 15.

    Below is the complete example. We would expect that the mean value of 15 will be predicted for each of the 4 rows in the test dataset.
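
    One way to contrive such an example is with six training values whose mean is 15 and four empty test rows:

    train = [[10], [15], [12], [15], [18], [20]]
    test = [[None], [None], [None], [None]]
    predictions = zero_rule_algorithm_regression(train, test)
    print(predictions)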

    Running the example calculates the predicted output values that are printed. As expected, the mean value of 15 is predicted for each row in the test dataset.

    Extensions

    Below are a few extensions to the baseline algorithms that you may wish to investigate and implement as an extension to this tutorial.

    • Alternate Central Tendency where the median, mode or other central tendency calculations are predicted instead of the mean.
    • Moving Average for time series problems where the mean of the last n records is predicted.

    Review

    In this tutorial, you discovered the importance of calculating a baseline of performance on your machine learning problem.

    You now know:

    • How to implement a random prediction algorithm for classification and regression problems.
    • How to implement a zero rule algorithm for classification and regression problems.

Saturday, 29 June 2024

How To Create an Algorithm Test Harness From Scratch With Python

 We cannot know which algorithm will be best for a given problem.

Therefore, we need to design a test harness that we can use to evaluate different machine learning algorithms.

In this tutorial, you will discover how to develop a machine learning algorithm test harness from scratch in Python.

After completing this tutorial, you will know:

  • How to implement a train-test algorithm test harness.
  • How to implement a k-fold cross-validation algorithm test harness.

    Description

    A test harness provides a consistent way to evaluate machine learning algorithms on a dataset.

    It involves 3 elements:

    1. The resampling method used to split up the dataset.
    2. The machine learning algorithm to evaluate.
    3. The performance measure by which to evaluate predictions.

    The loading and preparation of a dataset is a prerequisite step that must have been completed prior to using the test harness.

    The test harness must allow for different machine learning algorithms to be evaluated, whilst the dataset, resampling method and performance measures are kept constant.

    In this tutorial, we are going to demonstrate the test harnesses with a real dataset.

    The dataset used is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns. All of the values in the file are numeric, specifically floating point values.

    The Zero Rule algorithm will be evaluated as part of the tutorial. The Zero Rule algorithm always predicts the class that has the most observations in the training dataset.

    Tutorial

    This tutorial is broken down into two main sections:

    1. Train-Test Algorithm Test Harness.
    2. Cross-Validation Algorithm Test Harness.

    These test harnesses will give you the foundation that you need to evaluate a suite of machine learning algorithms on a given predictive modeling problem.

    1. Train-Test Algorithm Test Harness

    The train-test split is a simple resampling method that can be used to evaluate a machine learning algorithm.

    As such, it is a good starting point for developing a test harness.

    We can assume the prior development of a function to split a dataset into train and test sets and a function to evaluate the accuracy of a set of predictions.

    We need a function that can take a dataset and an algorithm and return a performance score.

    Below is a function named evaluate_algorithm() that achieves this. It takes three fixed arguments: the dataset, the algorithm function and the split percentage for the train-test split.
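
    A minimal sketch of the function is shown here. The train_test_split() and accuracy_metric() helpers are assumed from earlier tutorials, so minimal versions of both are included to keep the sketch self-contained.

    from random import randrange

    # Split a dataset into a train set and a test set
    # (a minimal version of the helper assumed from an earlier tutorial).
    def train_test_split(dataset, split=0.60):
        train = list()
        train_size = split * len(dataset)
        dataset_copy = list(dataset)
        while len(train) < train_size:
            index = randrange(len(dataset_copy))
            train.append(dataset_copy.pop(index))
        return train, dataset_copy

    # Calculate accuracy percentage
    # (a minimal version of the helper assumed from an earlier tutorial).
    def accuracy_metric(actual, predicted):
        correct = 0
        for i in range(len(actual)):
            if actual[i] == predicted[i]:
                correct += 1
        return correct / float(len(actual)) * 100.0

    # Evaluate an algorithm using a train-test split.
    def evaluate_algorithm(dataset, algorithm, split, *args):
        train, test = train_test_split(dataset, split)
        test_set = list()
        for row in test:
            row_copy = list(row)
            row_copy[-1] = None  # clear the output value to prevent cheating
            test_set.append(row_copy)
        predicted = algorithm(train, test_set, *args)
        actual = [row[-1] for row in test]
        return accuracy_metric(actual, predicted)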

    First, the dataset is split into train and test elements. Next, a copy of the test set is made and each output value is cleared by setting it to the None value, to prevent the algorithm from cheating accidentally.

    The algorithm provided as a parameter is a function that expects the train and test datasets on which to prepare and then make predictions. The algorithm may require additional configuration parameters. This is handled by using the variable arguments *args in the evaluate_algorithm() function and passing them on to the algorithm function.

    The algorithm function is expected to return a list of predictions, one for each row in the test dataset. These are compared to the actual output values from the unmodified test dataset by the accuracy_metric() function.

    Finally, the accuracy is returned.

    The evaluation function does make some strong assumptions, but they can easily be changed if needed.

    Specifically, it assumes that the last column in each row of the dataset is always the output value. A different column could be used. The use of the accuracy_metric() function assumes that the problem is a classification problem, but this could be changed to mean squared error for regression problems.

    Let’s piece this together with a worked example.

    We will use the Pima Indians diabetes dataset and evaluate the Zero Rule algorithm.

    The dataset was split into 60% for training the model and 40% for evaluating it.

    Notice how the name of the Zero Rule algorithm zero_rule_algorithm_classification was passed as an argument to the evaluate_algorithm() function. You can see how this test harness may be used again and again with different algorithms.
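
    Pieced together, the complete example might look like the following. The CSV-loading helpers are minimal versions of those assumed from earlier tutorials, zero_rule_algorithm_classification() is as defined in the previous tutorial, and the filename is an assumption: use the path of your local copy of the dataset.

    from csv import reader
    from random import seed

    # Load a CSV file (a minimal version of the assumed helper).
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            for row in reader(file):
                if row:
                    dataset.append(row)
        return dataset

    # Convert a string column to float (a minimal version of the assumed helper).
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    seed(1)
    filename = 'pima-indians-diabetes.csv'  # assumed local path to the dataset
    dataset = load_csv(filename)
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    split = 0.60
    accuracy = evaluate_algorithm(dataset, zero_rule_algorithm_classification, split)
    print('Accuracy: %.3f%%' % accuracy)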

    Running the example above prints out the accuracy of the model.

    2. Cross-Validation Algorithm Test Harness

    Cross-validation is a resampling technique that provides more reliable estimates of algorithm performance on unseen data.

    It requires the creation and evaluation of k models on different subsets of your data and, as such, is more computationally expensive. Nevertheless, it is the gold standard for evaluating machine learning algorithms.

    As in the previous section, we need to create a function that ties together the resampling method, the evaluation of the algorithm on the dataset and the performance calculation method.

    Unlike above, the algorithm must be evaluated on different subsets of the dataset many times. This means we need additional loops within our evaluate_algorithm() function.

    Below is a function that implements algorithm evaluation with cross-validation.
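
    A minimal sketch follows, with a minimal cross_validation_split() helper included (assumed developed in an earlier tutorial); accuracy_metric() is as defined in the previous section.

    from random import randrange

    # Split a dataset into k folds
    # (a minimal version of the helper assumed from an earlier tutorial).
    def cross_validation_split(dataset, n_folds):
        dataset_split = list()
        dataset_copy = list(dataset)
        fold_size = int(len(dataset) / n_folds)
        for _ in range(n_folds):
            fold = list()
            while len(fold) < fold_size:
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split

    # Evaluate an algorithm using k-fold cross-validation.
    def evaluate_algorithm(dataset, algorithm, n_folds, *args):
        folds = cross_validation_split(dataset, n_folds)
        scores = list()
        for fold in folds:
            train_set = list(folds)
            train_set.remove(fold)
            train_set = sum(train_set, [])  # flatten folds into one list of rows
            test_set = list()
            for row in fold:
                row_copy = list(row)
                row_copy[-1] = None  # clear the output value to prevent cheating
                test_set.append(row_copy)
            predicted = algorithm(train_set, test_set, *args)
            actual = [row[-1] for row in fold]
            scores.append(accuracy_metric(actual, predicted))
        return scores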

    First, the dataset is split into n_folds groups called folds.

    Next, we loop, giving each fold an opportunity to be held out of training and used to evaluate the algorithm. A copy of the list of folds is created and the held-out fold is removed from this list. Then the list of folds is flattened into one long list of rows to match the algorithm's expectation of a training dataset. This is done using the sum() function.

    Once the training dataset is prepared the rest of the function within this loop is as above. A copy of the test dataset (the fold) is made and the output values are cleared to avoid accidental cheating by algorithms. The algorithm is prepared on the train dataset and makes predictions on the test dataset. The predictions are evaluated and stored in a list.

    Unlike the train-test algorithm test harness, a list of scores is returned, one for each cross-validation fold.

    Although slightly more complex in code and slower to run, this function provides a more robust estimate of algorithm performance.

    We can tie all of this together with a complete example on the diabetes dataset with the Zero Rule algorithm.

    A total of 5 cross-validation folds were used to evaluate the Zero Rule algorithm. As such, 5 scores were returned from the evaluate_algorithm() function.
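
    A sketch of the complete example, reusing the helpers from the previous section (the filename is again an assumption):

    seed(1)
    filename = 'pima-indians-diabetes.csv'  # assumed local path to the dataset
    dataset = load_csv(filename)
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    n_folds = 5
    scores = evaluate_algorithm(dataset, zero_rule_algorithm_classification, n_folds)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))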

    Running this example prints both the list of calculated scores and the mean score.

    You now have two different test harnesses that you can use to evaluate your own machine learning algorithms.

    Extensions

    This section lists extensions to this tutorial that you may wish to consider.

    • Parameterized Evaluation. Pass in the function used to evaluate predictions, allowing you to seamlessly work with regression problems.
    • Parameterized Resampling. Pass in the function used to calculate resampling splits, allowing you to easily switch between the train-test and cross-validation methods.
    • Standard Deviation Scores. Calculate the standard deviation to get an idea of the spread of scores when evaluating algorithms using cross-validation.

    Did you try any of these extensions?
    Share your experiences in the comments below.

    Review

    In this tutorial, you discovered how to create a test harness from scratch to evaluate your machine learning algorithms.

    Specifically, you now know:

    • How to implement and use a train-test algorithm test harness.
    • How to implement and use a cross-validation algorithm test harness.