Your data must be prepared before you can build models. The data preparation process can involve three steps: data selection, data preprocessing and data transformation.
In this post you will discover two simple data transformation methods you can apply to your data in Python using scikit-learn.
Let’s get started.
Data Rescaling
Your preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms and sales volume.
Many machine learning methods expect or are more effective if the data attributes have the same scale. Two popular data scaling methods are normalization and standardization.
Data Normalization
Normalization refers to rescaling real-valued numeric attributes into the range 0 to 1.
It is useful to scale the input attributes for a model that relies on the magnitude of values, such as distance measures used in k-nearest neighbors and in the preparation of coefficients in regression.
The example below demonstrates data normalization of the Iris flowers dataset.
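A minimal sketch of such an example, assuming scikit-learn is installed and using its bundled copy of the Iris dataset. It calls the normalize function mentioned in this post, which rescales each row to unit length. The MinMaxScaler lines are an addition that is not part of the original post; they show one way to map each attribute into the range 0 to 1. The variable names are illustrative.

```python
# Normalize the Iris flowers dataset (a sketch, not a definitive recipe).
from sklearn.datasets import load_iris
from sklearn import preprocessing

# Load the input attributes (X) and class labels (y).
iris = load_iris()
X = iris.data
y = iris.target

# normalize() rescales each sample (row) to unit L2 norm.
normalized_X = preprocessing.normalize(X)
print(normalized_X[:5])

# Assumption (not from the original post): MinMaxScaler rescales each
# attribute (column) so its minimum maps to 0 and its maximum to 1.
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)
print(minmax_X[:5])
```

Because the Iris measurements are all non-negative, both results happen to fall between 0 and 1, but only the second is rescaled attribute by attribute.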
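For more information see the normalize function in the API documentation. Note that normalize rescales each sample (row) to unit norm; to rescale each attribute into the range 0 to 1, see MinMaxScaler.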
Data Standardization
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).
It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.
The example below demonstrates data standardization of the Iris flowers dataset.
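A minimal sketch of such an example, assuming scikit-learn is installed and using its bundled copy of the Iris dataset. It uses the scale function mentioned in this post; the variable names and the printed checks are illustrative additions.

```python
# Standardize the Iris flowers dataset (a sketch, not a definitive recipe).
from sklearn.datasets import load_iris
from sklearn import preprocessing

# Load the input attributes (X) and class labels (y).
iris = load_iris()
X = iris.data
y = iris.target

# scale() centers each attribute (column) to zero mean and
# rescales it to unit standard deviation.
standardized_X = preprocessing.scale(X)

# Check the result: per-attribute means near 0, standard deviations near 1.
print(standardized_X.mean(axis=0))
print(standardized_X.std(axis=0))
```

The printed means will not be exactly zero because of floating point rounding, but they should be very close to it.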
For more information see the scale function in the API documentation.
Tip: Which Method To Use
It is hard to know whether rescaling your data will improve the performance of your algorithms before you apply them. It often can, but not always.
A good tip is to create rescaled copies of your dataset and race them against each other using your test harness and a handful of algorithms you want to spot check. This can quickly highlight the benefits (or lack thereof) of rescaling your data with given models, and which rescaling method may be worthy of further investigation.
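The spot check described above could be sketched as follows. Everything here is an assumption made for illustration: the test harness is 5-fold cross-validation, the algorithms are k-nearest neighbors and a support vector classifier, and the rescaling is done with MinMaxScaler and StandardScaler. Rather than materializing rescaled copies of the dataset, this sketch wraps each scaler in a pipeline so that it is fit only on the training part of each fold.

```python
# Race raw, normalized and standardized data across a few algorithms.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Rescaling options to compare; None means no rescaling.
scalers = {
    "raw": None,
    "normalized": MinMaxScaler,
    "standardized": StandardScaler,
}

# A handful of algorithms to spot check (illustrative choices).
algorithms = {
    "knn": KNeighborsClassifier,
    "svm": SVC,
}

results = {}
for scaler_name, scaler in scalers.items():
    for algo_name, algo in algorithms.items():
        # Build a fresh model, with the scaler in front when one is given.
        model = algo() if scaler is None else make_pipeline(scaler(), algo())
        # Mean accuracy over 5 cross-validation folds.
        scores = cross_val_score(model, X, y, cv=5)
        results[(scaler_name, algo_name)] = scores.mean()
        print(f"{scaler_name:>12} + {algo_name}: {scores.mean():.3f}")
```

On a small, similarly scaled dataset like Iris the differences may be slight; the same loop can be pointed at your own data and algorithms.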
Summary
Data rescaling is an important part of data preparation before applying machine learning algorithms.
In this post you discovered where data rescaling fits into the process of applied machine learning and two methods, normalization and standardization, that you can use to rescale your data in Python using the scikit-learn library.