Ensemble algorithms are a powerful class of machine learning algorithms that combine the predictions from multiple models.
A benefit of using Weka for applied machine learning is that it makes so many different ensemble machine learning algorithms available.
In this post you will discover how to use ensemble machine learning algorithms in Weka.
After reading this post you will know:
- About 5 top ensemble machine learning algorithms.
- How to use top ensemble algorithms in Weka.
- About key configuration parameters for ensemble algorithms in Weka.
Ensemble Algorithms Overview
We are going to take a tour of 5 top ensemble machine learning algorithms in Weka.
Each algorithm that we cover will be briefly described in terms of how it works, its key configuration parameters will be highlighted, and the algorithm will be demonstrated in the Weka Explorer interface.
The 5 algorithms that we will review are:
- Bagging
- Random Forest
- AdaBoost
- Voting
- Stacking
These are 5 algorithms that you can try on your problem in order to lift performance.
A standard machine learning classification problem will be used to demonstrate each algorithm.
Specifically, the Ionosphere binary classification problem. This is a good dataset for demonstrating classification algorithms because the input variables are numeric and all have the same scale, and the problem only has two classes to discriminate.
Each instance describes the properties of radar returns from the atmosphere and the task is to predict whether or not there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are on the order of 98% accuracy.
Start the Weka Explorer:
- Open the Weka GUI Chooser.
- Click the “Explorer” button to open the Weka Explorer.
- Load the Ionosphere dataset from the data/ionosphere.arff file.
- Click “Classify” to open the Classify tab.
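If you prefer to script your experiments, the same dataset can also be loaded with Weka's Java API. The sketch below assumes weka.jar is on the classpath and uses the data/ionosphere.arff path from the steps above; everything in this post can be reproduced in the Explorer alone, so treat this as an optional alternative.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadIonosphere {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (path assumed from the steps above).
        Instances data = DataSource.read("data/ionosphere.arff");
        // The class attribute is the last attribute in this dataset.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}
```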
Bootstrap Aggregation (Bagging)
Bootstrap Aggregation or Bagging for short is an ensemble algorithm that can be used for classification or regression.
Bootstrap is a statistical estimation technique where a statistical quantity like a mean is estimated from multiple random samples of your data (with replacement). It is a useful technique when you have a limited amount of data and you are interested in a more robust estimate of a statistical quantity.
This same principle can be used with machine learning models. Multiple random samples of your training data are drawn with replacement and used to train multiple different machine learning models. Each model is then used to make a prediction and the results are averaged to give a more robust prediction.
It is a technique that is best used with models that have a low bias and a high variance, meaning that the predictions they make are highly dependent on the specific data from which they were trained. The algorithm most often used for bagging that fits this requirement of high variance is the decision tree.
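The bootstrap idea on its own is easy to see in code. The sketch below is a minimal, Weka-free illustration that estimates a mean by averaging the means of many resamples drawn with replacement; the data values are made up purely for illustration.

```java
import java.util.Random;

public class BootstrapMean {
    public static void main(String[] args) {
        double[] sample = {4.2, 5.1, 3.9, 6.0, 5.5, 4.8, 5.2};  // made-up data
        Random rng = new Random(1);
        int numResamples = 1000;
        double sumOfMeans = 0.0;
        for (int i = 0; i < numResamples; i++) {
            // Draw a resample the same size as the original, with replacement.
            double total = 0.0;
            for (int j = 0; j < sample.length; j++) {
                total += sample[rng.nextInt(sample.length)];
            }
            sumOfMeans += total / sample.length;
        }
        // The bootstrap estimate is the average of the resample means.
        System.out.println("Bootstrap estimate of the mean: " + sumOfMeans / numResamples);
    }
}
```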
Choose the bagging algorithm:
- Click the “Choose” button and select “Bagging” under the “meta” group.
- Click on the name of the algorithm to review the algorithm configuration.
A key configuration parameter in bagging is the type of model being bagged. The default is REPTree, Weka's fast decision tree learner, which behaves much like a standard decision tree (a Classification and Regression Tree, or CART for short). This is specified in the classifier parameter.
The size of each random sample is specified in the bagSizePercent parameter, as a percentage of the size of the raw training dataset. The default is 100%, which will create a new random sample the same size as the training dataset, but with a different composition.
This is because the random sample is drawn with replacement, which means that each time an instance is randomly drawn from the training dataset and added to the sample, it is also added back into the training dataset (replaced) meaning that it can be chosen again and added twice or more times to the sample.
Finally, the number of bags (and number of classifiers) can be specified in the numIterations parameter. The default is 10, although it is common to use values in the hundreds or thousands. Continue to increase the value of numIterations until you no longer see an improvement in the model, or you run out of memory.
- Click “OK” to close the algorithm configuration.
- Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration, bagging achieves an accuracy of 91%.
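For reference, here is a sketch of the same bagging setup evaluated with 10-fold cross-validation through Weka's Java API. The setters are assumed to mirror the GUI parameters described above (classifier, bagSizePercent and numIterations), and the file path is the one used earlier.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new REPTree());  // model being bagged (the default)
        bagger.setBagSizePercent(100);        // each bootstrap sample is 100% of the training set
        bagger.setNumIterations(10);          // number of bags/models; try hundreds or more

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));  // 10-fold cross-validation
        System.out.println("Bagging accuracy: " + eval.pctCorrect() + "%");
    }
}
```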
Random Forest
Random Forest is an extension of bagging for decision trees that can be used for classification or regression.
A downside of bagged decision trees is that decision trees are constructed using a greedy algorithm that selects the best split point at each step in the tree-building process. As such, the resulting trees end up looking very similar, which reduces the diversity of the predictions made from the different bags and in turn hurts the robustness of the combined prediction.
Random Forest is an improvement upon bagged decision trees that disrupts the greedy splitting algorithm during tree creation so that split points can only be selected from a random subset of the input attributes. This simple change can have a big effect in decreasing the similarity between the bagged trees and, in turn, between their predictions.
Choose the random forest algorithm:
- Click the “Choose” button and select “RandomForest” under the “trees” group.
- Click on the name of the algorithm to review the algorithm configuration.
In addition to the parameters listed above for bagging, a key parameter for random forest is the number of attributes to consider at each split point. In Weka this is controlled by the numFeatures parameter, which by default is set to 0, meaning the number is selected automatically based on a rule of thumb.
- Click “OK” to close the algorithm configuration.
- Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration, random forest achieves an accuracy of 92%.
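A corresponding Java API sketch, assuming the setNumFeatures setter maps to the GUI's numFeatures parameter in your Weka version:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        forest.setNumFeatures(0);  // 0 = pick the number of split attributes by rule of thumb

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        System.out.println("Random forest accuracy: " + eval.pctCorrect() + "%");
    }
}
```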
AdaBoost
AdaBoost is an ensemble machine learning algorithm for classification problems.
It is part of a group of ensemble methods called boosting, which add new machine learning models in a series, where each subsequent model attempts to fix the prediction errors made by the prior models. AdaBoost was the first successful implementation of this type of model.
AdaBoost was designed to use short decision tree models, each with a single decision point. Such short trees are often referred to as decision stumps.
The first model is constructed as per normal. Each instance in the training dataset is weighted and the weights are updated based on the overall accuracy of the model and whether an instance was classified correctly or not. Subsequent models are trained and added until a minimum accuracy is achieved or no further improvements are possible. Each model is weighted based on its skill and these weights are used when combining the predictions from all of the models on new data.
Choose the AdaBoost algorithm:
- Click the “Choose” button and select “AdaBoostM1” under the “meta” group.
- Click on the name of the algorithm to review the algorithm configuration.
The weak learner within the AdaBoost model can be specified by the classifier parameter.
The default is the decision stump algorithm, but other algorithms can be used. A key parameter in addition to the weak learner is the number of models to create and add in series. This can be specified in the numIterations parameter and defaults to 10.
- Click “OK” to close the algorithm configuration.
- Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration, AdaBoost achieves an accuracy of 90%.
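The equivalent Java API sketch for AdaBoostM1, with the decision stump weak learner and numIterations set explicitly to their defaults:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AdaBoostExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());  // the weak learner (the default)
        boost.setNumIterations(10);                // number of boosted models added in series

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 10, new Random(1));
        System.out.println("AdaBoost accuracy: " + eval.pctCorrect() + "%");
    }
}
```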
Voting
Voting is perhaps the simplest ensemble algorithm, and is often very effective. It can be used for classification or regression problems.
Voting works by creating two or more sub-models. Each sub-model makes predictions which are combined in some way, such as by taking the mean or the mode of the predictions, allowing each sub-model to vote on what the outcome should be.
Choose the Vote algorithm:
- Click the “Choose” button and select “Vote” under the “meta” group.
- Click on the name of the algorithm to review the algorithm configuration.
The key parameter of a Vote ensemble is the selection of sub-models.
Models can be specified in Weka in the classifiers parameter. Clicking this parameter lets you add a number of classifiers.
Clicking the “Edit” button with a classifier selected lets you configure the details of that classifier. An objective in selecting sub-models is to choose models that make quite different predictions (uncorrelated predictions). As such, it is a good rule of thumb to select very different model types, such as trees, instance-based methods, functions and so on.
Another key parameter to configure for voting is how the predictions of the sub-models are combined. This is controlled by the combinationRule parameter, which is set to take the average of the probabilities by default.
- Click “OK” to close the algorithm configuration.
- Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration, Vote achieves an accuracy of 64%.
Obviously, this technique achieved poor results because only the ZeroR sub-model was selected. Try selecting a collection of 5-to-10 different sub-models, such as the mix sketched below.
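As a starting point, here is a sketch of a Vote ensemble built through the Java API with an arbitrary mix of four quite different sub-models (a tree, an instance-based method, a function-based method and a probabilistic method). This is an illustrative selection, not a tuned configuration, and the default combinationRule (average of probabilities) is left unchanged.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class VoteExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Vote vote = new Vote();
        // Deliberately different model types so the predictions are less correlated.
        vote.setClassifiers(new Classifier[] {
                new J48(), new IBk(), new Logistic(), new NaiveBayes() });

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(vote, data, 10, new Random(1));
        System.out.println("Vote accuracy: " + eval.pctCorrect() + "%");
    }
}
```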
Stacked Generalization (Stacking)
Stacked Generalization or Stacking for short is a simple extension to Voting ensembles that can be used for classification and regression problems.
In addition to selecting multiple sub-models, stacking allows you to specify another model to learn how to best combine the predictions from the sub-models. Because a meta model is used to best combine the predictions of sub-models, this technique is sometimes called blending, as in blending predictions together.
Choose the Stacking algorithm:
- Click the “Choose” button and select “Stacking” under the “meta” group.
- Click on the name of the algorithm to review the algorithm configuration.
As with the Vote classifier, you can specify the sub-models in the classifiers parameter.
The model that will be trained to learn how to best combine the predictions from the sub-models can be specified in the metaClassifier parameter, which is set to ZeroR (majority vote or mean) by default. It is common to use a linear algorithm like linear regression or logistic regression for regression and classification problems respectively. This is to achieve an output that is a simple linear combination of the predictions of the sub-models.
- Click “OK” to close the algorithm configuration.
- Click the “Start” button to run the algorithm on the Ionosphere dataset.
You can see that with the default configuration, Stacking achieves an accuracy of 64%.
Again, as with voting, Stacking achieved poor results because only the ZeroR sub-model was selected. Try selecting a collection of 5-to-10 different sub-models and a good model to combine their predictions, as sketched below.
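A sketch of one such configuration through the Java API, using three different sub-models and logistic regression as the meta classifier (an illustrative choice, not a tuned one):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/ionosphere.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stack = new Stacking();
        stack.setClassifiers(new Classifier[] { new J48(), new IBk(), new Logistic() });
        // Logistic regression learns how to combine the sub-model predictions
        // (a common choice for classification; the default is ZeroR).
        stack.setMetaClassifier(new Logistic());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stack, data, 10, new Random(1));
        System.out.println("Stacking accuracy: " + eval.pctCorrect() + "%");
    }
}
```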
Summary
In this post you discovered how to use ensemble machine learning algorithms in Weka.
Specifically you learned:
- About 5 ensemble machine learning algorithms that you can use on your problem.
- How to use ensemble machine learning algorithms in Weka.
- About the key configuration parameters for ensemble machine learning algorithms in Weka.
Do you have any questions about ensemble machine learning algorithms in Weka or about this post? Ask your questions in the comments and I will do my best to answer.