When you are building a predictive model, you need a way to evaluate the capability of the model on unseen data.
This is typically done by estimating accuracy using data that was not used to train the model, such as a test set, or by using cross validation. The caret package in R provides a number of methods to estimate the accuracy of a machine learning algorithm.
In this post you will discover 5 approaches for estimating model performance on unseen data. You will also have access to recipes in R using the caret package for each method that you can copy and paste into your own project, right now.
Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.
Let’s get started.

Caret package in R, from the caret homepage
Estimating Model Accuracy
We have considered model accuracy before in the configuration of test options in a test harness. You can read more in the post: How To Choose The Right Test Options When Evaluating Machine Learning Algorithms.
In this post you are going to discover 5 different methods that you can use to estimate model accuracy.
They are as follows and each will be described in turn:
- Data Split
- Bootstrap
- k-fold Cross Validation
- Repeated k-fold Cross Validation
- Leave One Out Cross Validation
Generally, I would recommend Repeated k-fold Cross Validation, but each method has its features and benefits, especially when the amount of data or space and time complexity are considered. Consider which approach best suits your problem.
Data Split
Data splitting involves partitioning the data into an explicit training dataset used to prepare the model and an unseen test dataset used to evaluate the model's performance on unseen data.
It is useful when you have a very large dataset, so that the test dataset can provide a meaningful estimate of performance, or when you are using slow methods and need a quick approximation of performance.
The example below splits the iris dataset so that 80% is used for training a Naive Bayes model and 20% is used to evaluate the model's performance.
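A minimal sketch of this recipe, assuming the caret and klaR packages are installed (caret provides createDataPartition, klaR provides the NaiveBayes model), might look like the following:

```r
# load the libraries (assumes caret and klaR are installed)
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split_index <- createDataPartition(iris$Species, p=0.80, list=FALSE)
data_train <- iris[split_index,]
data_test <- iris[-split_index,]
# train a Naive Bayes model on the training data only
model <- NaiveBayes(Species~., data=data_train)
# make predictions on the unseen test data
predictions <- predict(model, data_test[,1:4])
# summarize accuracy against the true classes
confusionMatrix(predictions$class, data_test$Species)
```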
Bootstrap
Bootstrap resampling involves taking random samples from the dataset (with replacement) against which to evaluate the model. In aggregate, the results provide an indication of the variance of the model's performance. Typically, a large number of resampling iterations are performed (thousands or tens of thousands).
The following example uses a bootstrap with 10 resamples to prepare a Naive Bayes model.
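A minimal sketch, assuming caret's trainControl and train functions with the "nb" (Naive Bayes) method, which in turn requires the klaR package, might look like this:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: bootstrap with 10 resamples
train_control <- trainControl(method="boot", number=10)
# train a Naive Bayes model, evaluated on the bootstrap resamples
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the resampling results
print(model)
```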
k-fold Cross Validation
The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all the other subsets. This process is repeated until an accuracy is determined for each instance in the dataset, and an overall accuracy estimate is provided.
It is a robust method for estimating accuracy, and the size of k can be used to tune the amount of bias in the estimate, with popular values set to 3, 5, 7 and 10.
The following example uses 10-fold cross validation to estimate Naive Bayes on the iris dataset.
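A minimal sketch, again assuming caret with the "nb" method, might look like this:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: 10-fold cross validation
train_control <- trainControl(method="cv", number=10)
# train and evaluate a Naive Bayes model with cross validation
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the cross validation results
print(model)
```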
Repeated k-fold Cross Validation
The process of splitting the data into k folds can be repeated a number of times; this is called Repeated k-fold Cross Validation. The final model accuracy is taken as the mean across the repeats.
The following example uses 10-fold cross validation with 3 repeats to estimate Naive Bayes on the iris dataset.
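A minimal sketch, assuming the same caret setup, only changes the training control to the "repeatedcv" method:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: 10-fold cross validation with 3 repeats
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train and evaluate a Naive Bayes model with repeated cross validation
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the results across all folds and repeats
print(model)
```

Note that repeating the folds multiplies the number of models trained, so this option trades extra compute time for a more stable estimate.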
Leave One Out Cross Validation
In Leave One Out Cross Validation (LOOCV), a single data instance is left out and a model is constructed on all of the other data instances in the training set. This is repeated for all data instances.
The following example demonstrates LOOCV to estimate Naive Bayes on the iris dataset.
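A minimal sketch, assuming the same caret setup, uses the "LOOCV" method in the training control:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: leave one out cross validation
train_control <- trainControl(method="LOOCV")
# train and evaluate a Naive Bayes model, leaving out one instance at a time
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the results
print(model)
```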
Summary
In this post you discovered 5 different methods that you can use to estimate the accuracy of your model on unseen data.
Those methods were: Data Split, Bootstrap, k-fold Cross Validation, Repeated k-fold Cross Validation, and Leave One Out Cross Validation.
You can learn more about the caret package in R at the caret package homepage and the caret package CRAN page. If you would like to master the caret package, I would recommend the book written by the author of the package, titled: Applied Predictive Modeling, especially Chapter 4 on overfitting models.