When you are building a predictive model, you need a way to evaluate the capability of the model on unseen data.
This is typically done by estimating accuracy using data that was not used to train the model, such as a test set, or by using cross validation. The caret package in R provides a number of methods to estimate the accuracy of a machine learning algorithm.
In this post you will discover 5 approaches for estimating model performance on unseen data. You will also have access to recipes in R using the caret package for each method that you can copy and paste into your own project, right now.
Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.
Let’s get started.

Caret package in R, from the caret homepage
Estimating Model Accuracy
We have considered model accuracy before in the configuration of test options in a test harness. You can read more in the post: How To Choose The Right Test Options When Evaluating Machine Learning Algorithms.
In this post you are going to discover 5 different methods that you can use to estimate model accuracy.
They are as follows and each will be described in turn:
- Data Split
- Bootstrap
- k-fold Cross Validation
- Repeated k-fold Cross Validation
- Leave One Out Cross Validation
Generally, I would recommend Repeated k-fold Cross Validation, but each method has its features and benefits, especially when the amount of data or space and time complexity are considered. Consider which approach best suits your problem.
Data Split
Data splitting involves partitioning the data into an explicit training dataset used to prepare the model and an unseen test dataset used to evaluate the model's performance on unseen data.
It is useful when you have a very large dataset, so that the test dataset can provide a meaningful estimate of performance, or when you are using slow methods and need a quick approximation of performance.
The example below splits the iris dataset so that 80% is used for training a Naive Bayes model and 20% is used to evaluate the model's performance.
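A minimal sketch of this recipe, assuming the caret and klaR packages are installed (caret provides createDataPartition, klaR provides the NaiveBayes model), might look like the following:

```r
# load the libraries (assumes caret and klaR are installed)
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
split_index <- createDataPartition(iris$Species, p=0.80, list=FALSE)
data_train <- iris[split_index,]
data_test <- iris[-split_index,]
# train a Naive Bayes model on the training data only
model <- NaiveBayes(Species~., data=data_train)
# make predictions on the unseen test data
predictions <- predict(model, data_test[,1:4])
# summarize accuracy against the true classes
confusionMatrix(predictions$class, data_test$Species)
```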
Bootstrap
Bootstrap resampling involves taking random samples from the dataset (with replacement) against which to evaluate the model. In aggregate, the results provide an indication of the variance of the model's performance. Typically, a large number of resampling iterations are performed (thousands or tens of thousands).
The following example uses a bootstrap with 10 resamples to prepare a Naive Bayes model.
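A minimal sketch, assuming caret's trainControl and train functions with the "nb" (Naive Bayes) method, which in turn requires the klaR package, might look like this:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: bootstrap with 10 resamples
train_control <- trainControl(method="boot", number=10)
# train a Naive Bayes model, evaluated on the bootstrap resamples
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the resampling results
print(model)
```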
k-fold Cross Validation
The k-fold cross validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all the other subsets. This process is repeated until an accuracy is determined for each instance in the dataset, and an overall accuracy estimate is provided.
It is a robust method for estimating accuracy, and the size of k can be used to tune the amount of bias in the estimate, with popular values set to 3, 5, 7 and 10.
The following example uses 10-fold cross validation to estimate Naive Bayes on the iris dataset.
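A minimal sketch, again assuming caret with the "nb" method, might look like this:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: 10-fold cross validation
train_control <- trainControl(method="cv", number=10)
# train and evaluate a Naive Bayes model with cross validation
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the cross validation results
print(model)
```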
Repeated k-fold Cross Validation
The process of splitting the data into k folds can be repeated a number of times; this is called Repeated k-fold Cross Validation. The final model accuracy is taken as the mean across the repeats.
The following example uses 10-fold cross validation with 3 repeats to estimate Naive Bayes on the iris dataset.
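A minimal sketch, assuming the same caret setup, only changes the training control to the "repeatedcv" method:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: 10-fold cross validation with 3 repeats
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train and evaluate a Naive Bayes model with repeated cross validation
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the results across all folds and repeats
print(model)
```

Note that repeating the folds multiplies the number of models trained, so this option trades extra compute time for a more stable estimate.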
Leave One Out Cross Validation
In Leave One Out Cross Validation (LOOCV), a single data instance is left out and a model is constructed on all of the other data instances in the training set. This is repeated for all data instances.
The following example demonstrates LOOCV to estimate Naive Bayes on the iris dataset.
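A minimal sketch, assuming the same caret setup, uses the "LOOCV" method in the training control:

```r
# load the library
library(caret)
# load the iris dataset
data(iris)
# define training control: leave one out cross validation
train_control <- trainControl(method="LOOCV")
# train and evaluate a Naive Bayes model, leaving out one instance at a time
model <- train(Species~., data=iris, trControl=train_control, method="nb")
# summarize the results
print(model)
```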
Summary
In this post you discovered 5 different methods that you can use to estimate the accuracy of your model on unseen data.
Those methods were: Data Split, Bootstrap, k-fold Cross Validation, Repeated k-fold Cross Validation, and Leave One Out Cross Validation.
You can learn more about the caret package in R at the caret package homepage and the caret package CRAN page. If you would like to master the caret package, I would recommend the book written by the author of the package, titled: Applied Predictive Modeling, especially Chapter 4 on overfitting models.