Tuesday 31 October 2023

Biggest Mistake I Made When Starting Machine Learning, And How To Avoid It

When I first got started in machine learning, I implemented algorithms by hand. It was slow going.

I was a terrible programmer at the time. I was trying to figure out the algorithms from books, how to use them on problems, and how to write code, all at the same time. This was the biggest mistake I made when getting started. It made everything three times harder and killed my motivation.

A friend of mine suggested I look at Weka. I resisted: I was a student and a C programmer, and I didn’t want to get into Java.

Later, I was looking into decision tree algorithms and I learned that Weka had an implementation of C4.5 (a really powerful method).

I downloaded Weka and discovered that in addition to the Java API, Weka had a fully interactive graphical interface for loading data, running algorithms and reviewing results. Basically, all of the things I was trying to figure out how to do and implement myself, but in a GUI.

Kick-start your project with my new book Machine Learning Mastery With Weka, including step-by-step tutorials and clear screenshots for all examples.

Weka Explorer Interface with the Iris dataset loaded

I was hooked. I started using it for class work and for my own experiments. Later, in grad school, I started my own research by writing third-party plugins for Weka (an LVQ algorithm, among others).

I now recommend it to programmers just getting started because it’s so quick to get meaningful results on a dataset. It also instills best practices like repeatable experiments and statistical methods for comparing results.

If you haven’t already, take a look at Weka.

I have a short tutorial in which you discover how to run a classifier in 5 minutes.

Good luck with your machine learning!

Monday 30 October 2023

How to Run Your First Classifier in Weka

Weka makes learning applied machine learning easy, efficient, and fun. It is a GUI tool that allows you to load datasets, run algorithms and design and run experiments with results statistically robust enough to publish.

I recommend Weka to beginners in machine learning because it lets them focus on learning the process of applied machine learning rather than getting bogged down by the mathematics and the programming — those can come later.

In this post, I want to show you how easy it is to load a dataset, run an advanced classification algorithm and review the results.

If you follow along, you will have machine learning results in under 5 minutes, and the knowledge and confidence to go ahead and try more datasets and more algorithms.

Kick-start your project with my new book Machine Learning Mastery With Weka, including step-by-step tutorials and clear screenshots for all examples.

1. Download Weka and Install

Visit the Weka Download page and locate a version of Weka suitable for your computer (Windows, Mac, or Linux).

Weka requires Java. You may already have Java installed and if not, there are versions of Weka listed on the download page (for Windows) that include Java and will install it for you. I’m on a Mac myself, and like everything else on Mac, Weka just works out of the box.

If you are interested in machine learning, then I know you can figure out how to download and install software onto your own computer. If you need help installing Weka, see my post with step-by-step instructions.

2. Start Weka

Start Weka. This may involve finding it in your program launcher or double-clicking the weka.jar file. This will start the Weka GUI Chooser.

The Weka GUI Chooser lets you choose one of the Explorer, Experimenter, KnowledgeFlow and the Simple CLI (command line interface).

Weka GUI Chooser

Click the “Explorer” button to launch the Weka Explorer.

This GUI lets you load datasets and run classification algorithms. It also provides other features, like data filtering, clustering, association rule extraction, and visualization, but we won’t be using these features right now.

Need more help with Weka for Machine Learning?

Take my free 14-day email course and discover how to use the platform step-by-step.

Click to sign-up and also get a free PDF Ebook version of the course.

3. Open the data/iris.arff Dataset

Click the “Open file…” button to open a dataset and double-click the “data” directory.

Weka provides a number of small common machine learning datasets that you can use to practice on.

Select the “iris.arff” file to load the Iris dataset.

Weka Explorer Interface with the Iris dataset loaded

The Iris Flower dataset is a famous dataset from statistics and is heavily used by researchers in machine learning. It contains 150 instances (rows), each with 4 input attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor, and virginica). You can read more about the Iris flower dataset on Wikipedia.
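The iris.arff file itself is plain text in Weka’s ARFF format: a header of @relation and @attribute declarations followed by @data rows. As a rough illustration of the layout, here is a toy Python parser for a shortened sample (a sketch only, not Weka’s own loader, which also handles quoting, comments, and sparse data):

```python
# A minimal sketch of the ARFF layout used by files like iris.arff.
sample_arff = """@relation iris-mini
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def parse_arff(text):
    attributes, rows = [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):       # skip blanks and comments
            continue
        if line.lower().startswith("@attribute"):  # header: attribute names
            attributes.append(line.split()[1])
        elif line.lower().startswith("@data"):     # everything after is data
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

attrs, rows = parse_arff(sample_arff)
print(attrs)      # 4 input attributes plus the class attribute
print(len(rows))  # 3
```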

4. Select and Run an Algorithm

Now that you have loaded a dataset, it’s time to choose a machine learning algorithm to model the problem and make predictions.

Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka.

You will note that the “ZeroR” algorithm is selected by default.

Click the “Start” button to run this algorithm.

Weka Results for the ZeroR algorithm on the Iris flower dataset

The ZeroR algorithm selects the majority class in the dataset (all three species of iris are equally present in the data, so it picks the first one: setosa) and uses that to make all predictions. This is the baseline for the dataset and the measure by which all algorithms can be compared. The result is 33%, as expected (3 classes, each equally represented, assigning one of the three to each prediction results in 33% classification accuracy).
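ZeroR is simple enough to sketch in a few lines. The following Python snippet is a conceptual illustration of the idea (not Weka’s implementation): predict the most common training class for every instance and measure the accuracy:

```python
from collections import Counter

def zero_r(train_labels, test_labels):
    """Predict the majority class from training data for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for actual in test_labels if actual == majority)
    return majority, correct / len(test_labels)

# Three equally represented classes, as in the Iris dataset (50 each).
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
prediction, accuracy = zero_r(labels, labels)
print(prediction, round(accuracy, 3))  # setosa 0.333
```

With the three classes tied at 50 instances each, the first class encountered wins the tie, and the accuracy is exactly one third, matching the 33% baseline Weka reports.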

You will also note that the test options select Cross Validation by default, with 10 folds. This means that the dataset is split into 10 parts: the first 9 are used to train the algorithm, and the 10th is used to assess it. This process is repeated, giving each of the 10 parts of the split dataset a chance to be the held-out test set. You can read more about cross validation here.

The ZeroR algorithm is important, but boring.

Click the “Choose” button in the “Classifier” section and click on “trees” and click on the “J48” algorithm.

This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm. You can read more about the C4.5 algorithm here.

Click the “Start” button to run the algorithm.

Weka J48 algorithm results on the Iris flower dataset

5. Review Results

After running the J48 algorithm, you can note the results in the “Classifier output” section.

The algorithm was run with 10-fold cross-validation: this means it was given an opportunity to make a prediction for each instance of the dataset (with different training folds) and the presented result is a summary of those predictions.

Just the results of the J48 algorithm on the Iris flower dataset in Weka

Firstly, note the Classification Accuracy. You can see that the model achieved a result of 144/150 correct, or 96%, which is a lot better than the baseline of 33%.

Secondly, look at the Confusion Matrix. It is a table of actual classes against predicted classes. You can see that there was 1 case where an Iris-setosa was classified as an Iris-versicolor, 2 cases where an Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-setosa (a total of 6 errors). This table helps to explain the accuracy achieved by the algorithm.
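A confusion matrix is just a table of counts over (actual, predicted) pairs. Here is an illustrative Python sketch using made-up predictions (not the J48 output above):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Count how often each actual class was predicted as each class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes   = ["setosa", "versicolor", "virginica"]
actual    = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",  "virginica", "virginica"]

# Rows are actual classes, columns are predicted classes; off-diagonal
# entries are the errors.
for row, name in zip(confusion_matrix(actual, predicted, classes), classes):
    print(row, "<-- actual", name)
```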

Summary

In this post, you loaded your first dataset and ran your first machine learning algorithm (an implementation of the C4.8 algorithm) in Weka. The ZeroR algorithm doesn’t really count: it’s just a useful baseline.

You now know how to load the datasets that are provided with Weka and how to run algorithms: go forth and try different algorithms and see what you come up with.

Leave a note in the comments if you can achieve better than 96% accuracy on the Iris dataset.

AI: Which AI-automated machines are available to quickly heal hot burns, chemical attacks, animal bites, and mosquito, worm, and reptile attacks?

Here are some AI-automated machines being developed to speed the healing of hot burns, chemical attacks, animal bites, and mosquito, worm, and reptile attacks:

Hot Burns

  • Hydrogel Dressings: These dressings are made of a hydrogel material that contains water and other ingredients that help to cool the burn and promote healing. Some hydrogel dressings also contain antimicrobial agents to help prevent infection.
  • Spray-on Skin: This is a type of skin graft that is sprayed onto the burn wound. It contains cells that help to regenerate the skin.
  • Exoskeleton Suit: This suit is worn by the patient and provides support to the burned area. It also helps to keep the burn wound clean and moist.

Chemical Attack

  • Chemical Decontamination Shower: This shower removes the chemical agent from the skin and clothing of the patient. It also helps to neutralize the chemical agent.
  • Chemical Burn Dressing: These dressings are made of a material that absorbs the chemical agent and prevents it from further damaging the skin.
  • Anti-Chemical Warfare Mask: This mask protects the patient from inhaling chemical agents.

Animal Bites

  • Wound Debridement Machine: This machine cleans the wound and removes any dead tissue or debris.
  • Wound Irrigation System: This system flushes the wound with water or saline to help remove any bacteria or other contaminants.
  • Antibiotic Treatment System: This system delivers antibiotics directly to the wound to help prevent infection.

Mosquito, Worm, and Reptile Attacks

  • Mosquito Net: This net protects the patient from being bitten by mosquitoes.
  • De-worming Medication: This medication kills and removes worms from the body.
  • Anti-venom: This medication neutralizes the venom of snakes, spiders, and other venomous creatures.

These machines are still in the early stages of development, but they have the potential to revolutionize the way that victims of burns, chemical attacks, animal bites, and mosquito, worm, and reptile attacks are treated. By automating many of the tasks involved in treating these injuries, these machines can help improve the quality of care that patients receive and reduce the risk of complications.

In addition to the above machines, there are also a number of AI-powered software programs that are being developed to help doctors and nurses diagnose and treat these injuries more effectively. For example, some programs can be used to identify the type of injury, assess its severity, and recommend the best course of treatment. Other programs can be used to monitor the healing process and to identify any potential complications.

AI has the potential to play a major role in the future of wound care. By developing new AI-powered machines and software programs, we can improve the way that patients are treated and help them to heal more quickly and effectively.

There are also various AI-driven technologies and automated machines designed to address injuries and conditions resulting from burns, chemical exposures, animal bites, and vector-borne diseases (e.g., mosquito-borne illnesses). These technologies and machines aid in diagnosis, treatment, and prevention. Keep in mind that the availability and maturity of such technologies varies. Here's an overview:

  1. Telemedicine and Diagnostic AI:

    • Telemedicine platforms and AI-based medical chatbots can provide initial guidance and triage for patients with burns, chemical exposures, or animal bites. They can help determine the severity of the condition and recommend appropriate actions, such as seeking immediate medical attention.
  2. Wound Healing AI:

    • There are AI-driven wound assessment tools that assist in monitoring the healing progress of burns and other types of wounds. These tools can analyze images of the wound, track healing trends, and alert healthcare professionals if intervention is needed.
  3. Chemical Exposure Detection and Management:

    • Some AI systems are designed to detect chemical exposures or spills in industrial settings. These systems can identify the type of chemical involved and recommend appropriate decontamination procedures.
  4. Antivenom Production Optimization:

    • AI can be used to optimize the production of antivenom for snake and reptile bites. Machine learning models can help predict the required antivenom doses based on the snake's species and the severity of the bite.
  5. Vector-Borne Disease Monitoring:

    • AI and machine learning can be employed to analyze data related to vector-borne diseases like malaria and dengue. Predictive models can forecast disease outbreaks, helping with early intervention and resource allocation.
  6. Robot-Assisted Surgery:

    • In the case of severe burns, robotic surgical systems, such as the da Vinci Surgical System, can be used by skilled surgeons to perform precise and minimally invasive procedures for burn wound treatment and skin grafting.
  7. Machine Learning for Drug Discovery:

    • AI is used in drug discovery to identify potential treatments for conditions resulting from chemical exposure, such as developing antidotes or therapeutic drugs.
  8. Robotic Insect and Pest Control:

    • For controlling pests like mosquitoes, there are autonomous robotic systems that use AI and sensors to locate and eliminate or trap pests. These robots can be used for pest control in various environments.
  9. Smart Bandages:

    • Smart bandages equipped with sensors and AI can monitor wounds and detect signs of infection, helping in the early treatment of wounds caused by animal bites, burns, or other injuries.

It's important to note that while AI and automation can be valuable tools in healthcare and emergency response, they do not replace the need for medical professionals and prompt medical attention in the case of severe injuries or health emergencies. The field of AI in healthcare is rapidly evolving, so it's advisable to keep up with the latest developments and technologies for specific applications.

AI: Which AI-automated machines are available to create a fiber web around the wound of a living being, to heal quickly and repair skin cells from burns, cuts, and wounds?

There are a number of AI-automated machines being developed to create a fiber web around the wound of a living being, helping it heal quickly and repairing skin cells after burns, cuts, and wounds. Here are a few examples:

  • Spider Silk Wound Dressing Machine: This machine uses a 3D printer to create a wound dressing made of spider silk. Spider silk is a very strong and elastic material that is also biocompatible, meaning that it does not cause an immune response in the body. This makes it ideal for wound dressings.
  • Nanofiber Wound Dressing Machine: This machine uses a process called electrospinning to create a wound dressing made of nanofibers. Nanofibers are extremely thin fibers that are smaller than the diameter of a human hair. They are also highly porous, which allows air and moisture to reach the wound. This helps to promote wound healing.
  • Skin Cell Regeneration Machine: This machine uses a process called 3D bioprinting to create a new layer of skin cells over a wound. This can be used to treat burns, cuts, and other wounds that are too large or complex to heal on their own.

These machines are still in the early stages of development, but they have the potential to revolutionize the way that wounds are treated. By creating a fiber web around the wound and/or regenerating new skin cells, these machines can help wounds to heal more quickly and efficiently. This could reduce the risk of infection and scarring, and it could also improve the patient's quality of life.

In addition to the above machines, there are also a number of AI-powered software programs that are being developed to help doctors and nurses diagnose and treat wounds more effectively. For example, some programs can be used to identify the type of wound, assess its severity, and recommend the best course of treatment. Other programs can be used to monitor the wound healing process and to identify any potential complications.

AI has the potential to play a major role in the future of wound care. By developing new AI-powered machines and software programs, we can improve the way that wounds are treated and help patients to heal more quickly and effectively.

There are also several emerging technologies in the field of wound healing, including AI-driven devices and materials designed to accelerate the healing process. While there may not be AI “machines” that literally create a fiber web around wounds, there are materials and technologies that leverage AI to enhance wound healing. Here are a few areas where AI plays a role:

  1. Smart Bandages: Smart bandages are wound dressings embedded with sensors and microelectronics that can monitor the wound's condition, detect infection, and release medications or growth factors to promote healing. AI and machine learning can be used to analyze data from these sensors to make real-time treatment decisions. For example, if an infection is detected, the bandage can release antibiotics directly to the wound.

  2. Biomedical Materials: Researchers are developing advanced wound dressings and biomaterials that incorporate nanofibers or other structures to accelerate tissue regeneration. While these materials may not be "machines," they can be designed with the help of AI and used to facilitate the healing of burns, cuts, and wounds.

  3. 3D Bioprinting: Although not necessarily an AI "machine," 3D bioprinting technology can create living tissues and skin substitutes that can be used for skin repair and wound healing. AI and machine learning can be used to optimize the bioprinting process and improve the structural integrity of the printed tissues.

  4. Telemedicine and Remote Monitoring: AI-driven telemedicine platforms can allow healthcare providers to remotely monitor wound healing progress. By analyzing images of the wound, AI can provide insights into the effectiveness of the treatment and recommend adjustments as needed.

While these technologies show promise in wound healing and tissue repair, they may not be widely available for all applications, and their effectiveness can vary. Advances in the field of regenerative medicine, tissue engineering, and AI-driven healthcare continue to progress, so it's essential to stay updated on the latest developments and consult with medical professionals for the most appropriate wound healing solutions.

Sunday 29 October 2023

How To Choose The Right Test Options When Evaluating Machine Learning Algorithms

The test options you use when evaluating machine learning algorithms can mean the difference between over-learning, a mediocre result, and a usable state-of-the-art result that you can confidently shout from the rooftops (you really do feel like doing that sometimes).

In this post you will discover the standard test options you can use in your algorithm evaluation test harness and how to choose the right options next time.

Randomness

The root of the difficulty in choosing the right test options is randomness. Most (almost all) machine learning algorithms use randomness in some way. The randomness may be explicit in the algorithm or may be in the sample of the data selected to train the algorithm.

Randomness
Photo by afoncubierta, some rights reserved

This does not mean that the algorithms produce random results; it means that they produce results with some noise or variance. We call this kind of limited variance stochastic, and the algorithms that exploit it stochastic algorithms.

Train and Test on Same Data

If you have a dataset, you may want to train the model on the dataset and then report the results of the model on that dataset. That’s how good the model is, right?

The problem with this approach of evaluating algorithms is that you will indeed know the performance of the algorithm on that dataset, but you will not have any indication of how it will perform on data the model was not trained on (so-called unseen data).

This matters only if you want to use the model to make predictions on unseen data.

Split Test

A simple way to use one dataset to both train and estimate the performance of the algorithm on unseen data is to split the dataset. You take the dataset, and split it into a training dataset and a test dataset. For example, you randomly select 66% of the instances for training and use the remaining 34% as a test dataset.

The algorithm is run on the training dataset, and the resulting model is assessed on the test dataset, giving you a performance score, let’s say 87% classification accuracy.
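The mechanics of the split can be sketched in plain Python (illustrative only; Weka does this for you when you select a percentage split):

```python
import random

def split_indices(n, train_fraction, seed):
    """Randomly split instance indices into disjoint train and test sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # fixed seed: repeatable split
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]

# A 66%/34% split of 150 instances, as in the Iris dataset.
train, test = split_indices(150, 0.66, seed=1)
print(len(train), len(test))  # 99 51
```

Every instance ends up in exactly one of the two sets; changing the seed changes which instances land where, which is exactly the source of the variance discussed next.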

Split tests are fast and great when you have a lot of data or when training a model is expensive (in resources or time). A split test on a very large dataset can produce an accurate estimate of the actual performance of the algorithm.

How good is the algorithm on the data? Can we confidently say it can achieve an accuracy of 87%?

A problem is that if we split the dataset again into a different 66%/34% split, we would get a different result from our algorithm. This is called model variance.

Multiple Split Tests

A solution to the problem of the split test giving different results on different splits of the dataset is to repeat the process many times, reducing the variance of the random process. We can collect the results from a fair number of runs (say 10) and take the average.

For example, let’s say we split our dataset 66%/34%, ran our algorithm and got an accuracy and we did this 10 times with 10 different splits. We might have 10 accuracy scores as follows: 87, 87, 88, 89, 88, 86, 88, 87, 88, 87.

The average performance of our model is 87.5, with a standard deviation of about 0.85.
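You can check that arithmetic with Python’s statistics module (using the sample standard deviation):

```python
import statistics

# The ten accuracy scores from ten different 66%/34% splits.
scores = [87, 87, 88, 89, 88, 86, 88, 87, 88, 87]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)  # sample standard deviation
print(mean, round(stdev, 2))      # 87.5 0.85
```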

Coin Toss
Photo by ICMA Photos, some rights reserved

A problem with multiple split tests is that some data instances may never be included for training or testing, whereas others may be selected multiple times. This may skew results and not give a meaningful idea of the accuracy of the algorithm.

Cross Validation

A solution to the problem of ensuring each instance is used for training and testing an equal number of times while reducing the variance of an accuracy score is to use cross validation. Specifically k-fold cross validation, where k is the number of splits to make in the dataset.

For example, let’s choose a value of k=10 (very common). This will split the dataset into 10 parts (10 folds) and the algorithm will be run 10 times. Each time the algorithm is run, it will be trained on 90% of the data and tested on 10%, and each run of the algorithm will change which 10% of the data the algorithm is tested on.

In this example, each data instance will be used as a training instance exactly 9 times and as a test instance once. The accuracy is not a mean with a standard deviation, but an exact accuracy score: how many correct predictions were made across all folds.
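The fold bookkeeping can be sketched in plain Python (an illustration only; Weka’s implementation also stratifies the folds so class proportions stay balanced):

```python
import random

def k_fold_indices(n, k, seed):
    """Shuffle instance indices and deal them into k test folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(150, 10, seed=1)

# Each instance appears in exactly one test fold...
assert sorted(i for fold in folds for i in fold) == list(range(150))
# ...and is therefore used for training in the other 9 runs.
print([len(fold) for fold in folds])  # ten folds of 15 instances each
```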

The k-fold cross validation method is the go-to method for evaluating the performance of an algorithm on a dataset. You want to choose k values that give you reasonably sized training and test datasets for your algorithm; not too disproportionate (too large or small for training or test). If you have a lot of data, you may have to resort to either sampling the data or reverting to a split test.

Cross validation does give an unbiased estimate of the algorithm’s performance on unseen data, but what if the algorithm itself uses randomness? The algorithm would produce different results for the same training data each time it was trained with a different random number seed (the start of the sequence of pseudo-randomness). Cross validation does not account for variance in the algorithm’s predictions.

Another point of concern is that cross validation itself uses randomness to decide how to split the dataset into k folds. Cross validation does not estimate how the algorithm performs with different sets of folds.

This only matters if you want to understand how robust the algorithm is on the dataset.

Multiple Cross Validation

A way to account for the variance in the algorithm itself is to run cross validation multiple times and take the mean and the standard deviation of the algorithm accuracy from each run.

This will give you an estimate of the performance of the algorithm on the dataset and an estimate of how robust that performance is (the size of the standard deviation).

If you have one mean and standard deviation for algorithm A and another mean and standard deviation for algorithm B and they differ (for example, algorithm A has a higher accuracy), how do you know if the difference is meaningful?

This only matters if you want to compare the results between algorithms.

Statistical Significance

A solution to comparing algorithm performance measures when using multiple runs of k-fold cross validation is to use statistical significance tests (like the Student’s t-test).

The results from multiple runs of k-fold cross validation are a list of numbers. We like to summarize these numbers using a mean and standard deviation. You can think of these numbers as a sample from an underlying population. A statistical significance test answers the question: were two samples drawn from the same population (i.e., is there no real difference)? If the answer is “yes”, then even if the means and standard deviations differ, the difference can be said to be not statistically significant.

We can use statistical significance tests to give meaning to the differences (or lack thereof) between algorithm results when using multiple runs (like multiple runs of k-fold cross validation with different random number seeds). This helps when we want to make accurate claims about results (algorithm A was better than algorithm B and the difference was statistically significant).
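As a sketch of the idea, here is Welch’s t statistic computed by hand in Python for two made-up sets of scores. A large absolute t suggests a real difference; turning it into a p-value still requires a t-distribution table or a statistics library:

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )

# Hypothetical accuracy scores from 5 runs of cross validation each.
algorithm_a = [87, 88, 87, 89, 88]
algorithm_b = [84, 85, 84, 86, 85]

t = welch_t(algorithm_a, algorithm_b)
print(round(t, 2))  # 5.67
```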

This is not the end of the story, because there are different statistical significance tests (parametric and nonparametric) and parameters to those tests (p-value). I’m going to draw the line here because if you have followed me this far, you now know enough about selecting test options to produce rigorous (publishable!) results.

Summary

In this post you have discovered the difference between the main test options available to you when designing a test harness to evaluate machine learning algorithms.

Specifically, you learned the utility and problems with:

  • Training and testing on the same dataset
  • Split tests
  • Multiple split tests
  • Cross validation
  • Multiple cross validation
  • Statistical significance

When in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical significance tests when you want to meaningfully compare algorithms on your dataset.

Saturday 28 October 2023

Design and Run your First Experiment in Weka

Weka is the perfect platform for learning machine learning. It provides a graphical user interface for exploring and experimenting with machine learning algorithms on datasets, without you having to worry about the mathematics or the programming.

A powerful feature of Weka is the Weka Experimenter interface. Unlike the Weka Explorer that is for filtering data and trying out different algorithms, the Experimenter is for designing and running experiments. The experimental results it produces are robust and are good enough to be published (if you know what you are doing).

In a previous post you learned how to run your first classifier in the Weka Explorer.

In this post you will discover the power of the Weka Experimenter. If you follow along with the step-by-step instructions, you will design and run your first machine learning experiment in under five minutes.

Kick-start your project with my new book Machine Learning Mastery With Weka, including step-by-step tutorials and clear screenshots for all examples.

First Experiment
Photo by mhofstrand, some rights reserved

1. Download and Install Weka

Visit the Weka Download page and locate a version of Weka suitable for your computer (Windows, Mac or Linux).

Weka requires Java. You may already have Java installed and if not, there are versions of Weka listed on the download page (for Windows) that include Java and will install it for you. I’m on a Mac myself, and like everything else on Mac, Weka just works out of the box.

If you are interested in machine learning, then I know you can figure out how to download and install software onto your own computer.

Need more help with Weka for Machine Learning?

Take my free 14-day email course and discover how to use the platform step-by-step.

Click to sign-up and also get a free PDF Ebook version of the course.

2. Start Weka

Start Weka. This may involve finding it in your program launcher or double-clicking the weka.jar file. This will start the Weka GUI Chooser.

Weka GUI Chooser

The Weka GUI Chooser lets you choose one of the Explorer, Experimenter, KnowledgeFlow and the Simple CLI (command line interface).

Click the “Experimenter” button to launch the Weka Experimenter.

The Weka Experimenter allows you to design your own experiments of running algorithms on datasets, run the experiments and analyze the results. It’s a powerful tool.

3. Design Experiment

Click the “New” button to create a new experiment configuration.

Weka Experimenter
Start a new Experiment

Test Options

The Experimenter configures the test options for you with sensible defaults. The experiment is configured to use Cross Validation with 10 folds. It is a “Classification” type problem, and each algorithm and dataset combination is run 10 times (iteration control).

Iris flower Dataset

Let’s start out by selecting the dataset.

  1. In the “Datasets” section, click the “Add new…” button.
  2. Open the “data” directory and choose the “iris.arff” dataset.

The Iris flower dataset is a famous dataset from statistics and is heavily used by researchers in machine learning. It contains 150 instances (rows), each with 4 input attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor, and virginica). You can read more about the Iris flower dataset on Wikipedia.

Let’s choose 3 algorithms to run our dataset.

ZeroR

  1. Click “Add new…” in the “Algorithms” section.
  2. Click the “Choose” button.
  3. Click “ZeroR” under the “rules” selection.

ZeroR is the simplest algorithm we can run. It picks the class value that is the majority in the dataset and gives that for all predictions. Given that all three class values have an equal share (50 instances each), it picks the first class value, “setosa”, and gives that as the answer for all predictions. Off the top of our heads, we know that the best result ZeroR can give is 33.33% (50/150). This is good to have as a baseline that we demand algorithms outperform.

OneR

  1. Click “Add new…” in the “Algorithms” section.
  2. Click the “Choose” button.
  3. Click “OneR” under the “rules” selection.

OneR is our second-simplest algorithm. It picks the one attribute that best correlates with the class value and splits it up to get the best prediction accuracy it can. Like ZeroR, the algorithm is so simple that you could implement it by hand, and we would expect more sophisticated algorithms to outperform it.
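OneR really is simple enough to implement by hand. Here is a Python sketch on a made-up nominal dataset (illustrative only; Weka’s OneR also handles numeric attributes by discretizing them):

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Pick the single attribute whose value -> majority-class rule
    makes the fewest errors on the training data."""
    best = None
    for a in range(len(instances[0])):
        # Majority class for each value of attribute a.
        by_value = defaultdict(Counter)
        for row, label in zip(instances, labels):
            by_value[row[a]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(1 for row, label in zip(instances, labels)
                     if rule[row[a]] != label)
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best  # (training errors, attribute index, value -> class rule)

# Toy data: attribute 0 is noise, attribute 1 predicts the class perfectly.
instances = [("x", "small"), ("y", "small"), ("x", "large"), ("y", "large")]
labels    = ["no", "no", "yes", "yes"]

errors, attribute, rule = one_r(instances, labels)
print(attribute, rule)  # 1 {'small': 'no', 'large': 'yes'}
```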

J48

  1. Click “Add new…” in the “Algorithms” section.
  2. Click the “Choose” button.
  3. Click “J48” under the “trees” selection.

J48 is a decision tree algorithm. It is an implementation of the C4.8 algorithm in Java (“J” for Java and 48 for C4.8). The C4.8 algorithm is a minor extension to the famous C4.5 algorithm and is a very powerful prediction algorithm.

Weka Experimenter: Configure the experiment

We are ready to run our experiment.

4. Run Experiment

Click the “Run” tab at the top of the screen.

This tab is the control panel for running the currently configured experiment.

Click the big “Start” button to start the experiment and watch the “Log” and “Status” sections to keep an eye on how it is doing.

Weka Experimenter: Run the experiment

Given that the dataset is small and the algorithms are fast, the experiment should complete in seconds.

5. Review Results

Click the “Analyse” tab at the top of the screen.

This will open up the experiment results analysis panel.

Weka Experimenter: Load the experiment results

Click the “Experiment” button in the “Source” section to load the results from the current experiment.

Algorithm Rank

The first thing we want to know is which algorithm was the best. We can do that by ranking the algorithms by the number of times a given algorithm beat the other algorithms.

  1. Click the “Select” button for the “Test base” and choose “Ranking“.
  2. Now Click the “Perform test” button.

Weka Experimenter: Rank the algorithms in the experiment results

The ranking table shows the number of statistically significant wins each algorithm has had against all other algorithms on the dataset. A win means an accuracy that is better than the accuracy of another algorithm, where the difference was statistically significant.

We can see that both J48 and OneR have one win each and that ZeroR has two losses. This is good: it means that OneR and J48 are both potential contenders, outperforming our baseline of ZeroR.
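Under the hood, a “win” is awarded when a paired significance test on the per-run accuracies rejects the hypothesis that two algorithms perform the same. A stdlib Python sketch of the standard paired t-test, using made-up accuracy scores (Weka's default tester is a corrected variant of this, so the numbers it reports will differ slightly):

```python
import math

def paired_t_statistic(a, b):
    """t statistic for the paired differences between two lists of run accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-run accuracies for OneR and ZeroR (10 runs each)
one_r = [0.93, 0.90, 0.87, 0.93, 1.00, 0.93, 0.93, 0.97, 0.90, 0.90]
zero_r = [1 / 3] * 10

t = paired_t_statistic(one_r, zero_r)
print(t > 2.262)  # True: exceeds the two-tailed 5% critical value for 9 df
```

A |t| above the critical value (about 2.262 for 9 degrees of freedom at the 5% level) counts as one significant win for the better algorithm.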

Algorithm Accuracy

Next we want to know what scores the algorithms achieved.

  1. Click the “Select” button for the “Test base” and choose the “ZeroR” algorithm in the list and click the “Select” button.
  2. Click the check-box next to “Show std. deviations“.
  3. Now click the “Perform test” button.

Weka Experimenter: Algorithm accuracy compared to ZeroR

In the “Test output” we can see a table with the results for the 3 algorithms. Each algorithm was run 10 times on the dataset, and the accuracy reported is the mean, with the standard deviation in brackets, of those 10 runs.

We can see that both the OneR and J48 algorithms have a little “v” next to their results. This means that the difference in accuracy between these algorithms and ZeroR is statistically significant. Their accuracies are also much higher, so we can say that both achieved a statistically significantly better result than the ZeroR baseline.

The score for J48 is higher than the score for OneR, so next we want to see if the difference between these two accuracy scores is significant.

  1. Click the “Select” button for the “Test base” and choose the “J48” algorithm in the list and click the “Select” button.
  2. Now click the “Perform test” button.

Weka Experimenter: Algorithm accuracy compared to J48

We can see that ZeroR has a “*” next to its results, indicating that its results are statistically different from those of J48. But we already knew this. We do not see a “*” next to the results for the OneR algorithm. This tells us that although the mean accuracies of J48 and OneR differ, the difference is not statistically significant.
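Because cross-validation runs share training data, a plain paired t-test is too eager to declare significance. To the best of my knowledge, Weka's default “Paired T-Tester (corrected)” applies the Nadeau and Bengio correction, which inflates the variance term. A stdlib Python sketch with made-up accuracy scores:

```python
import math

def corrected_paired_t(a, b, test_train_ratio=1 / 9):
    """Corrected resampled paired t-test (Nadeau and Bengio, 2003).
    For 10-fold cross-validation, n_test / n_train = 1/9."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt((1 / n + test_train_ratio) * var)

# Hypothetical per-run accuracies for J48 and OneR (10 runs each)
j48 =   [0.93, 0.97, 0.90, 0.97, 0.93, 1.00, 0.93, 0.90, 0.97, 0.93]
one_r = [0.93, 0.93, 0.93, 0.97, 0.90, 0.97, 0.93, 0.93, 0.93, 0.90]

t = corrected_paired_t(j48, one_r)
print(abs(t) < 2.262)  # True: the small gap is not significant at the 5% level
```

When |t| stays below the critical value, no “*” or “v” marker is printed, which is exactly the J48-versus-OneR situation above.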

All things being equal, we would choose the OneR algorithm to make predictions on this problem because it is the simpler of the two algorithms.

If we wanted to report the results, we would say that the OneR algorithm achieved a classification accuracy of 92.53% (+/- 5.47%) which is statistically significantly better than ZeroR at 33.33% (+/- 5.47%).

Summary

You discovered how to configure a machine learning experiment with one dataset and three algorithms in Weka. You also learned about how to analyse the results from an experiment and the importance of statistical significance when interpreting results.

You now have the skill to design and run experiments with any algorithms provided by Weka on datasets of your choosing and meaningfully and confidently report results that you achieve.
