Artificial Intelligence, Machine Learning and Data Science Hubspot


Monday, 4 December 2023

Rescaling Data for Machine Learning in Python with Scikit-Learn

Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.

In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.

Let’s get started.

Data Rescaling

Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.

Many machine learning methods expect, or are more effective when, the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.
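To see why mixed scales matter, here is a minimal sketch using two made-up records described by a dollar amount and a weight in kilograms. The values are hypothetical and chosen only for illustration. It shows how the attribute with the larger scale dominates a Euclidean distance.

```python
import numpy as np

# Two hypothetical records: (annual spend in dollars, weight in kilograms).
a = np.array([52000.0, 60.0])
b = np.array([51000.0, 95.0])

# Euclidean distance on the raw, unscaled values.
raw_distance = np.linalg.norm(a - b)

# Share of the squared distance contributed by the dollar attribute.
squared_diffs = (a - b) ** 2
share_of_dollars = squared_diffs[0] / squared_diffs.sum()

print(raw_distance)
print(share_of_dollars)
```

Almost all of the distance comes from the dollar attribute, even though the difference in kilograms is large relative to typical weights.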

Data Normalization

Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.

It is useful to scale the input attributes for a model that relies on the magnitude of values, such as the distance measures used in k-nearest neighbors and the fitting of coefficients in regression.

The example below demonstrates data normalization of the Iris flowers dataset.

# Normalize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the iris dataset
iris = load_iris()
print(iris.data.shape)

# separate the data from the target attributes
X = iris.data
y = iris.target

# normalize the data attributes: rescale each column into the range 0 to 1
# (preprocessing.normalize would instead rescale each row to unit length)
normalized_X = preprocessing.minmax_scale(X)

For more information see the minmax_scale function in the API documentation.
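As a rough sketch of what rescaling each attribute into the range 0 to 1 involves, the block below computes (x - min) / (max - min) per column by hand and checks it against scikit-learn's MinMaxScaler. For contrast, it also shows that preprocessing.normalize works per row: by default it rescales each sample to unit Euclidean length, not each attribute to a fixed range. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X = load_iris().data

# Min-max rescaling by hand: each column is mapped onto the range [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual_minmax = (X - col_min) / (col_max - col_min)

# The same rescaling using scikit-learn's MinMaxScaler.
sklearn_minmax = preprocessing.MinMaxScaler().fit_transform(X)
print(np.allclose(manual_minmax, sklearn_minmax))

# By contrast, preprocessing.normalize operates on rows (samples):
# with its default settings every row ends up with Euclidean length 1.
row_normalized = preprocessing.normalize(X)
row_lengths = np.linalg.norm(row_normalized, axis=1)
print(row_lengths[:3])
```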

Data Standardization

Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.

The example below demonstrates data standardization of the Iris flowers dataset.

# Standardize the data attributes for the Iris dataset.
from sklearn.datasets import load_iris
from sklearn import preprocessing

# load the Iris dataset
iris = load_iris()
print(iris.data.shape)

# separate the data and target attributes
X = iris.data
y = iris.target

# standardize the data attributes
standardized_X = preprocessing.scale(X)

For more information see the scale function in the API documentation.
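Standardization can also be written out by hand, which may make the definition more concrete. The sketch below subtracts each column's mean and divides by its standard deviation, then checks the result against preprocessing.scale. It assumes the population standard deviation (NumPy's default, ddof=0), which is what scale uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X = load_iris().data

# Standardize by hand: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0, NumPy's default).
manual_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Compare with scikit-learn's scale function.
sklearn_standardized = preprocessing.scale(X)
print(np.allclose(manual_standardized, sklearn_standardized))

# Each column now has a mean of about 0 and a standard deviation of about 1.
print(manual_standardized.mean(axis=0).round(6))
print(manual_standardized.std(axis=0).round(6))
```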

Tip: Which Method To Use

It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.

A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
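One possible way to set up such a race is sketched below. The choice of k-nearest neighbors, 5-fold cross-validation and accuracy as the score are assumptions made for illustration, not a prescribed harness. Rather than building rescaled copies of the dataset up front, this sketch places each scaler in a pipeline so that it is fit only on the training portion of each fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# One candidate per rescaling option; the algorithm is the same in each.
candidates = {
    "raw": make_pipeline(KNeighborsClassifier()),
    "normalized": make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
    "standardized": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Evaluate each candidate with 5-fold cross-validation and report mean accuracy.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```

Repeating the loop with a few other algorithms gives a quick view of which rescaling, if any, is worth pursuing for your data.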

Summary

Data rescaling is an important part of data preparation before applying machine learning algorithms.

In this post you discovered where data rescaling fits into the process of applied machine learning and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.
