A simple technique for ensembling decision trees involves training trees on subsamples of the training dataset.
Taking subsets of the rows in the training data to train individual trees is called bagging. When subsets of the columns (features) are also taken when calculating each split point, this is called random forest.
These techniques can also be used in the gradient tree boosting model in a technique called stochastic gradient boosting.
In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.
After reading this post you will know:
- The rationale behind training trees on subsamples of data and how this can be used in gradient boosting.
- How to tune row-based subsampling in XGBoost using scikit-learn.
- How to tune column-based subsampling by both tree and split-point in XGBoost.
Stochastic Gradient Boosting
Gradient boosting is a greedy procedure.
New decision trees are added to the model to correct the residual error of the existing model.
Each decision tree is created using a greedy search procedure to select split points that best minimize an objective function. This can result in trees that use the same attributes and even the same split points again and again.
Bagging is a technique where a collection of decision trees are created, each from a different random subset of rows from the training data. The effect is that better performance is achieved by the ensemble because the randomness in the sampling allows slightly different trees to be created, adding diversity to the ensembled predictions.
Random forest takes this one step further, by allowing the features (columns) to be subsampled when choosing split points, adding further variance to the ensemble of trees.
These same techniques can be used in the construction of decision trees in gradient boosting in a variation called stochastic gradient boosting.
It is common to use aggressive sub-samples of the training data such as 40% to 80%.
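In the XGBoost library, each of these sampling schemes is exposed as a constructor parameter of the scikit-learn wrapper classes. The following is a minimal sketch with illustrative (non-default) values; the parameters are tuned properly later in this tutorial.

```python
# Sketch: the sampling parameters behind stochastic gradient boosting in XGBoost.
# The values below are illustrative examples only, not recommendations.
from xgboost import XGBClassifier

model = XGBClassifier(
    subsample=0.5,          # each tree is trained on 50% of the rows
    colsample_bytree=0.8,   # each tree is built from 80% of the columns
    colsample_bylevel=0.8   # columns are resampled again at each tree depth level
)
```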
Tutorial Overview
In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.
We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:
- Subsampling of rows in the dataset when creating each tree.
- Subsampling of columns in the dataset when creating each tree.
- Subsampling of columns for each split in the dataset when creating each tree.
Problem Description: Otto Dataset
In this tutorial we will use the Otto Group Product Classification Challenge dataset.
This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file into your working directory.
This dataset describes more than 61,000 products using 93 obfuscated input features, with the products grouped into 10 product categories (e.g. fashion, electronics, etc.). The input attributes are counts of different events of some kind.
The goal is to make predictions for new products as an array of probabilities for each of the 10 categories and models are evaluated using multiclass logarithmic loss (also called cross entropy).
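As a small aside, multiclass log loss can be computed with scikit-learn's log_loss function; the probabilities below are made up purely for illustration.

```python
# Sketch: multiclass logarithmic loss on a tiny made-up example.
from sklearn.metrics import log_loss

y_true = [0, 2, 1]                  # true class indices for three products
y_prob = [[0.7, 0.2, 0.1],          # predicted probabilities over 3 classes
          [0.1, 0.1, 0.8],
          [0.2, 0.6, 0.2]]
print(log_loss(y_true, y_prob, labels=[0, 1, 2]))  # smaller is better
```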
This competition was completed in May 2015. The dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class variable as integers).
Tuning Row Subsampling in XGBoost
Row subsampling involves selecting a random sample of the training dataset without replacement.
Row subsampling can be specified in the scikit-learn wrapper for XGBoost via the subsample parameter. The default is 1.0, which means no subsampling.
We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.
There are 10 variations of subsample and each model will be evaluated using 10-fold cross validation, meaning that 10×10 or 100 models need to be trained and tested.
The complete code listing is provided below.
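The listing below is a minimal sketch of the experiment: it assumes the unzipped train.csv file is in the working directory, drops the id column, label encodes the string target, and grid searches subsample with 10-fold stratified cross validation scored on negative log loss.

```python
# Sketch: tune the XGBoost subsample parameter on the Otto dataset.
# Assumes train.csv (unzipped from Kaggle) is in the working directory.
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot

# load data and separate the 93 input features from the string target
data = read_csv('train.csv')
X = data.drop(['id', 'target'], axis=1).values
y = LabelEncoder().fit_transform(data['target'])

# grid search row subsampling ratios from 0.1 to 1.0
model = XGBClassifier()
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(subsample=subsample)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
for mean, stdev, param in zip(means, stds, grid_result.cv_results_['params']):
    print("%f (%f) with: %r" % (mean, stdev, param))

# plot mean (negated) log loss with standard deviation error bars
pyplot.errorbar(subsample, means, yerr=stds)
pyplot.title("XGBoost subsample vs Log Loss")
pyplot.xlabel('subsample')
pyplot.ylabel('Log Loss')
pyplot.savefig('subsample.png')
```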
Running this example prints the best configuration as well as the log loss for each tested configuration.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We can see that the best result was achieved with subsample=0.3, or training each tree on a 30% sample of the training dataset.
We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.
We can see that indeed 30% has the best mean performance, but we can also see that as the ratio is increased, the variance in performance grows quite markedly.
It is interesting to note that the mean performance of every subsampled configuration outperforms the mean performance without subsampling (subsample=1.0).
Tuning Column Subsampling in XGBoost By Tree
We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.
In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter.
The default value is 1.0 meaning that all columns are used in each decision tree. We can evaluate values for colsample_bytree between 0.1 and 1.0 incrementing by 0.1.
The full code listing is provided below.
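A sketch of this listing follows; it is identical to the row subsampling example except for the parameter grid and the plot labels.

```python
# Sketch: tune the XGBoost colsample_bytree parameter on the Otto dataset.
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot

data = read_csv('train.csv')
X = data.drop(['id', 'target'], axis=1).values
y = LabelEncoder().fit_transform(data['target'])

# grid search per-tree column subsampling ratios from 0.1 to 1.0
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(colsample_bytree=colsample_bytree)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
for mean, stdev, param in zip(means, stds, grid_result.cv_results_['params']):
    print("%f (%f) with: %r" % (mean, stdev, param))

# plot mean (negated) log loss with standard deviation error bars
pyplot.errorbar(colsample_bytree, means, yerr=stds)
pyplot.title("XGBoost colsample_bytree vs Log Loss")
pyplot.xlabel('colsample_bytree')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bytree.png')
```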
Running this example prints the best configuration as well as the log loss for each tested configuration.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We can see that the best performance for the model was achieved with colsample_bytree=1.0. This suggests that subsampling columns on this problem does not add value.
Plotting the results, we can see the performance of the model plateau (at least at this scale) with values between 0.5 and 1.0.
Tuning Column Subsampling in XGBoost By Split
Rather than subsample the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest.
We can set the size of the sample of columns used at each split with the colsample_bylevel parameter in the XGBoost wrapper classes for scikit-learn (strictly speaking, this parameter resamples the columns once for each new depth level of the tree rather than for each individual split).
As before, we will vary the ratio from 10% to the default of 100%.
The full code listing is provided below.
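Again, a sketch of the listing; only the parameter grid and the plot labels change.

```python
# Sketch: tune the XGBoost colsample_bylevel parameter on the Otto dataset.
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot

data = read_csv('train.csv')
X = data.drop(['id', 'target'], axis=1).values
y = LabelEncoder().fit_transform(data['target'])

# grid search per-level column subsampling ratios from 0.1 to 1.0
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(colsample_bylevel=colsample_bylevel)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(XGBClassifier(), param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
for mean, stdev, param in zip(means, stds, grid_result.cv_results_['params']):
    print("%f (%f) with: %r" % (mean, stdev, param))

# plot mean (negated) log loss with standard deviation error bars
pyplot.errorbar(colsample_bylevel, means, yerr=stds)
pyplot.title("XGBoost colsample_bylevel vs Log Loss")
pyplot.xlabel('colsample_bylevel')
pyplot.ylabel('Log Loss')
pyplot.savefig('colsample_bylevel.png')
```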
Running this example prints the best configuration as well as the log loss for each tested configuration.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
We can see that the best results were achieved by setting colsample_bylevel to 70%, resulting in an (inverted) log loss of -0.001062, which is better than -0.001239 seen when setting the per-tree column sampling to 100%.
This suggests not giving up on column subsampling if the per-tree results suggest using 100% of columns, and instead trying per-split column subsampling.
We can plot the performance of each colsample_bylevel variation. The results show relatively low variance and seemingly a plateau in performance after a value of 0.3 at this scale.
Summary
In this post you discovered stochastic gradient boosting with XGBoost in Python.
Specifically, you learned:
- About stochastic gradient boosting and how you can subsample your training data to improve the generalization of your model.
- How to tune row subsampling with XGBoost in Python and scikit-learn.
- How to tune column subsampling with XGBoost both per-tree and per-split.
Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.