Artificial Intelligence, Machine Learning and Data Science Hubspot

Saturday, 30 March 2024

scikit-learn Cookbook Book Review

The scikit-learn library is the premier library for machine learning in Python.

The online documentation is quite good but sometimes can feel fragmented or limited by narrow examples.

In this post you will discover the book Scikit-Learn Cookbook by Trent Hauck that provides a desktop reference to supplement the online documentation and help you get started with scikit-learn quickly.

Book Overview

The Scikit-Learn Cookbook is a focused book written by Trent Hauck and published by Packt Publishing.

The subtitle for the book is:

Over 50 recipes to incorporate scikit-learn into every step of the data science pipeline, from feature extraction to model building and model evaluation.

It was published at the end of 2014 and it is just under 200 pages long. I like the form factor. Thick reference texts really put me off these days (think Numerical Recipes which sits proudly on my shelf). I would prefer to have 10 smaller focused reference texts, like a mini encyclopedia series.

I like that it is a small, sharp, focused text on scikit-learn recipes.

Book Audience

The book is not for the machine learning beginner. Take note.

It assumes:

Familiarity with Python.
Familiarity with the SciPy stack.
Familiarity with machine learning.

These are reasonable assumptions for someone already using scikit-learn on projects, in which case the book becomes a desktop reference to consult for specific ad hoc machine learning tasks.

Book Contents

The book comprises more than 50 recipes (57 by my own count of the table of contents), separated into 5 chapters.

Chapter 1: Premodel Workflow
Chapter 2: Working with Linear Models
Chapter 3: Building Models with Distance Metrics
Chapter 4: Classifying Data with scikit-learn
Chapter 5: Postmodel Workflow

The chapters generally map onto the workflow of a standard data science project:

Acquire and prepare data.
Try some linear models.
Try some nonlinear models.
Try some more nonlinear models.
Finalize the model.

It is an okay structure for a book. The problem is that scikit-learn alone does not serve all of these steps well. It excels at the modeling part and does a fair job of data pre-processing, but it is poor at the data loading and data analysis steps, which are generally ignored.

Next we will step through each chapter in turn.

Chapter Walkthrough

In this section we take a closer look at the recipes in each of the five chapters.

Chapter 1: Premodel Workflow

This chapter focuses on data preparation, that is, reformatting the data to best expose the structure of the problem to the machine learning algorithms we may choose to use later on.

There are 17 recipes in this chapter and I would group them as follows:

Data Loading: Loading your own data and using the built-in datasets.
Data Cleaning: Tasks like imputing missing values.
Data Pre-Processing: Scaling and feature engineering.
Dimensionality Reduction: SVD, PCA and factor analysis.
Other: Pipelines, Gaussian Processes and gradient descent.
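
To make two of these groups concrete, here is a minimal sketch of data cleaning followed by dimensionality reduction. It is my own illustration rather than a recipe from the book: the tiny dataset is made up, and the class names assume a recent scikit-learn release, so they may differ from what a 2014 text prints.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Tiny made-up dataset with two missing values (np.nan).
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, np.nan],
    [10.0, 11.0, 12.0],
])

# Data cleaning: replace each missing value with its column mean.
imputer = SimpleImputer(strategy="mean")
X_clean = imputer.fit_transform(X)

# Dimensionality reduction: project the three columns onto two components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_clean)

print(X_clean)
print(X_reduced.shape)
```

Mean imputation is only one possible strategy; the median or a constant may suit other datasets better.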

I’m sad that I had to devise my own structure here. I’m also sad that there is an “other” category. It is indicative that the organization of the recipes in chapters could be cleaner.

I would like more and separate recipes on scaling methods. I find myself doing a lot of scaling on datasets before I can use them. It’s perhaps the most common pre-processing step required to get good results.
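
As an illustration of the kind of short scaling recipe I have in mind, the sketch below contrasts standardization with min-max normalization. The two-column array is made up for the example, and the choice of these two scalers is mine, not the book's.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Standardization: each column gets zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

In a real project the scaler should be fit on the training data only and then applied to the test data, which is easiest to get right by placing it inside a Pipeline.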

Chapter 2: Working with Linear Models

The focus of this chapter is linear models. This shorter chapter contains 9 recipes.

Generally, the recipes in this chapter cover:

Linear Regression
Regularized Regression
Logistic Regression
More exotic variations on regression like boosting.

This is another strange grouping of recipes.

I guess I feel that the focus of linear models could have extended further to LDA, Perceptron and other models supported by the platform, not limited to regression.
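
A rough sketch of what a broader linear models chapter could touch on: three linear classifiers evaluated the same way. The iris dataset, the 5-fold setup and the parameter values are my own assumptions for illustration, and the import paths are those of a recent scikit-learn release.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Three linear classifiers that share the same fit/predict interface.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "perceptron": Perceptron(random_state=0),
}

# Mean 5-fold cross-validation accuracy for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```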

Chapter 3: Building Models with Distance Metrics

Many algorithms do use a distance measure at their core.

The first that may come to mind is KNN, but in fact you could interpret this more broadly and pull in techniques like support vector machines and related techniques that use kernels.

This chapter is nominally about techniques that use distance measures, but it focuses almost exclusively on K-Means (8 of the 9 recipes). There is one KNN recipe at the end of the chapter.

The chapter should have been called clustering or K-Means.

Also, it is good to note my bias here: I don’t use clustering methods at all, as I find them utterly useless for predictive modeling.

Chapter 4: Classifying Data with scikit-learn

From the title, this chapter is about classification algorithms.

I would organize the 11 recipes in this chapter as follows:

Decision Trees (CART and Random Forest)
Support Vector Machines
Discriminant Analysis (LDA and QDA)
Naive Bayes
Other (semi-supervised learning, gradient descent, etc.)

I would put LDA and QDA in the linear models chapter (Chapter 2) and I would have added a ton more algorithms. A big benefit of scikit-learn is that it offers so many algorithms out of the box.

The algorithms that are covered in this chapter are fine. What I am saying is that I would double or triple the number and make recipes for algorithms the focus of the book.

Chapter 5: Postmodel Workflow

This chapter contains 11 recipes on general post modeling tasks.

This is technically not accurate, as you would perform these tasks as a part of modeling. Nevertheless, I see what the author was going for.

I would summarize the recipes in this chapter as follows:

Resampling methods (Cross validation and variations).
Algorithm Tuning (Grid search, random search, manual search, etc.).
Feature Selection.
Other (model persistence, model evaluation and baselines).
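
Two of these tasks, algorithm tuning and model persistence, fit in a few lines. The sketch below is my own illustration, not a recipe from the book: the parameter grid is arbitrary, and pickle is just one way to persist a model.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Algorithm tuning: try every combination in a small, arbitrary grid.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Model persistence: serialize the refit best model and load it back.
blob = pickle.dumps(search.best_estimator_)
restored = pickle.loads(blob)
same = (restored.predict(X) == search.best_estimator_.predict(X)).all()
print(same)
```

A pickled model should only be loaded with a compatible scikit-learn version and from a trusted source.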

A good chapter covering important topics. Very important topics.

Generally, I would introduce each algorithm in the context of k-fold cross validation, because evaluating algorithms any other way might not be a good idea for most use cases.
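
Here is a minimal sketch of what I mean by presenting an algorithm in the context of k-fold cross validation. The dataset, the choice of KNN and the 10 folds are assumptions I made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 10 folds with shuffling; iris is sorted by class, so shuffling matters here.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X, y, cv=cv)

print(f"accuracy: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

Reporting the mean together with the spread across folds gives a more honest picture than a single train/test split.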

I’m also surprised to see feature selection so late in the book. I would have expected this to have appeared in Chapter 1. It belongs up front with data preparation.
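
One way to treat feature selection as a data preparation step is to make it the first stage of a Pipeline, so that it is re-fit inside every cross-validation fold. This is a sketch on synthetic data; the scoring function and the choice of k=5 are my own assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, only 5 of which carry signal.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Feature selection runs before the model, as a preparation step.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"accuracy with 5 selected features: {scores.mean():.3f}")

# Fit once on all the data to inspect which columns were kept.
pipeline.fit(X, y)
kept = pipeline.named_steps["select"].get_support(indices=True)
print("kept feature indices:", kept)
```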

Thoughts On The Book

The book is just fine. I would recommend it for someone looking for a good desktop reference to support the online docs for scikit-learn.

I generally like the way each recipe is presented. In fact it is good to the point of verbosity, whereas in other books the recipes can be too brief. The structure is as follows:

Recipe name and description.
Getting ready (e.g. the preconditions or requirements).
How to do it (actual code and steps required to achieve a result).
How it works (additional explanation of the API or processes).
There’s more (optional additional variations on the recipe that are useful).

Given the above soft recommendation, I did note some things while reading.

I was frustrated with the content of many recipes. So much so that I would never use them or make them canon in my own library of scikit-learn recipes that I use from project to project.

I have used scikit-learn a fair bit and I took the time to read and try most of the API. Many recipes in the book are hand-crafted functions that actually already exist in the scikit-learn API. Maybe the API has been updated since publication, or not, but this did bother me. Less code is less maintenance and if you are using a library like scikit-learn then you should use all of it, and well.
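
To show the kind of duplication I mean, here is a hypothetical example. The hand-rolled helper below is my own stand-in and is not taken from the book; it simply computes the same value as a function that already ships with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])


def hand_rolled_mse(actual, predicted):
    """Hypothetical hand-written helper, not taken from the book."""
    return float(np.mean((actual - predicted) ** 2))


# Both lines compute the mean squared error of the same predictions.
manual = hand_rolled_mse(y_true, y_pred)
builtin = mean_squared_error(y_true, y_pred)
print(manual, builtin)
```

The library version is the one I would keep: it is tested, documented and one less function to maintain.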

Also, there are a few equations sprinkled through the explanations. They are mainly there to provide a shortcut description of a technique and avoid the exposition. It’s fine, but the equations may as well be left out in favor of a pointer to a good reference text, keeping a laser focus on the scikit-learn API.

Some recipes are too long. I like tight, focused and self-contained recipes: something I can copy and paste and use to jumpstart a process in my own project.

You cannot cover the whole scikit-learn API, and the coverage in this book was pretty good. It covered the key parts of the library. I would like to see it cover some aspects that differentiate the library such as Pipelines in greater detail, learning line graphs and model calibrations.
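
As a rough sketch of two of these features together, the example below wraps a model in a Pipeline and feeds it to scikit-learn's learning_curve helper. I am reading "learning line graphs" as learning curves, which is an assumption, and the dataset, model and training sizes are chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Pipeline: scaling and the model are fit together on each training subset.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Score the model at 20%, 50% and 100% of the available training data.
train_sizes, train_scores, valid_scores = learning_curve(
    model, X, y, train_sizes=[0.2, 0.5, 1.0], cv=5
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"{int(n):5d} samples: train={tr:.3f}, validation={va:.3f}")
```

Plotting the two columns against the number of samples gives the usual learning curve graph; the plotting step is left out to keep the sketch free of extra dependencies.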

Summary

In this post you discovered the book Scikit-Learn Cookbook by Trent Hauck.

You learned that it is a book of 50+ recipes for using scikit-learn covering topics such as:

Data preparation.
Linear and nonlinear algorithms.
Model evaluation and algorithm tuning.

It is a reasonable cookbook that can be used as a desktop reference to supplement the online documentation for the scikit-learn library.

Do you have any questions about the book? Have you read the book? Leave a comment and let me know what you thought of it.