Artificial Intelligence , Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Tuesday, 2 April 2024

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Many machine learning algorithms make assumptions about your data.

It is often a very good idea to prepare your data in such way to best expose the structure of the problem to the machine learning algorithms that you intend to use.

In this post you will discover how to prepare your data for machine learning in Python using scikit-learn.

Need For Data Preprocessing

You almost always need to preprocess your data. It is a required step.

A difficulty is that different algorithms make different assumptions about your data and may require different transforms. Further, when you follow all of the rules and prepare your data, sometimes algorithms can deliver better results without the preprocessing.

Generally, I would recommend creating many different views and transforms of your data, then exercise a handful of algorithms on each view of your dataset. This will help you to flush out which data transforms might be better at exposing the structure of your problem in general.

Need help with Machine Learning in Python?

Take my free 2-week email course and discover data prep, algorithms and more (with code).

Click to sign-up now and also get a free PDF Ebook version of the course.

Preprocessing Machine Learning Recipes

This section lists 4 different data preprocessing recipes for machine learning.

All of the recipes were designed to be complete and standalone.

You can copy and paste them directly into your project and start working.

The Pima Indian diabetes dataset is used in each recipe. This is a binary classification problem where all of the attributes are numeric and have different scales. It is a great example of dataset that can benefit from pre-processing.

Each recipe follows the same structure:

Load the dataset from a URL.
Split the dataset into the input and output variables for machine learning.
Apply a preprocessing transform to the input variables.
Summarize the data to show the change.

The transforms are calculated in such a way that they can be applied to your training data and any samples of data you may have in the future.

The scikit-learn documentation has some information on how to use various different preprocessing methods. You can review the preprocess API in scikit-learn here.

1. Rescale Data

When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to all have the same scale.

Often this is referred to as normalization and attributes are often rescaled into the range between 0 and 1. This is useful for optimization algorithms in used in the core of machine learning algorithms like gradient descent. It is also useful for algorithms that weight inputs like regression and neural networks and algorithms that use distance measures like K-Nearest Neighbors.

You can rescale your data using scikit-learn using the MinMaxScaler class.

# Rescale data (between 0 and 1)

import pandas

import scipy

import numpy

from sklearn.preprocessing import MinMaxScaler

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))

rescaledX = scaler.fit_transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(rescaledX[0:5,:])

After rescaling you can see that all of the values are in the range between 0 and 1.

[[ 0.353 0.744 0.59 0.354 0. 0.501 0.234 0.483]

[ 0.059 0.427 0.541 0.293 0. 0.396 0.117 0.167]

[ 0.471 0.92 0.525 0. 0. 0.347 0.254 0.183]

[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0. ]

[ 0. 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]

2. Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminate analysis.

You can standardize data using scikit-learn with the StandardScaler class.

# Standardize data (0 mean, 1 stdev)

from sklearn.preprocessing import StandardScaler

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = StandardScaler().fit(X)

rescaledX = scaler.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(rescaledX[0:5,:])

The values for each attribute now have a mean value of 0 and a standard deviation of 1.

[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]

[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]

[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]

[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]

[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]

3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

You can normalize data in Python with scikit-learn using the Normalizer class.

# Normalize data (length of 1)

from sklearn.preprocessing import Normalizer

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

scaler = Normalizer().fit(X)

normalizedX = scaler.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(normalizedX[0:5,:])

The rows are normalized to length 1.

[[ 0.034 0.828 0.403 0.196 0. 0.188 0.004 0.28 ]

[ 0.008 0.716 0.556 0.244 0. 0.224 0.003 0.261]

[ 0.04 0.924 0.323 0. 0. 0.118 0.003 0.162]

[ 0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]

[ 0. 0.596 0.174 0.152 0.731 0.188 0.01 0.144]]

4. Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing your data or threshold your data. It can be useful when you have probabilities that you want to make crisp values. It is also useful when feature engineering and you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using scikit-learn with the Binarizer class.

# binarization

from sklearn.preprocessing import Binarizer

import pandas

import numpy

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

dataframe = pandas.read_csv(url, names=names)

array = dataframe.values

# separate array into input and output components

X = array[:,0:8]

Y = array[:,8]

binarizer = Binarizer(threshold=0.0).fit(X)

binaryX = binarizer.transform(X)

# summarize transformed data

numpy.set_printoptions(precision=3)

print(binaryX[0:5,:])

You can see that all values equal or less than 0 are marked 0 and all of those above 0 are marked 1.

[[ 1. 1. 1. 1. 0. 1. 1. 1.]

[ 1. 1. 1. 1. 0. 1. 1. 1.]

[ 1. 1. 1. 0. 0. 1. 1. 1.]

[ 1. 1. 1. 1. 1. 1. 1. 1.]

[ 0. 1. 1. 1. 1. 1. 1. 1.]]

Summary

In this post you discovered how you can prepare your data for machine learning in Python using scikit-learn.

You now have recipes to:

Rescale data.
Standardize data.
Normalize data.
Binarize data.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with data preprocesing in scikit-learn.

Do you have any questions about data preprocessing in Python or this post? Ask in the comments and I will do my best to answer.

Artificial Intelligence , Machine Learning and Data Science Hubspot

Tuesday, 2 April 2024

How To Prepare Your Data For Machine Learning in Python with Scikit-Learn

Need For Data Preprocessing

Need help with Machine Learning in Python?

Preprocessing Machine Learning Recipes

1. Rescale Data

2. Standardize Data

3. Normalize Data

4. Binarize Data (Make Binary)

Summary

No comments:

Post a Comment

AI:List the estimated population in the sky scraping building and in the middle class families and lower backward families and number of male female disputes every year and mental illness cases

Report Abuse

Labels

"Donate for a Noble Cause