You must know how to load data before you can use it to train a machine learning model.
When starting out, it is a good idea to stick with small in-memory datasets stored in a standard file format such as comma-separated values (.csv).
In this tutorial you will discover how to load your data in Python from scratch, including:
- How to load a CSV file.
- How to convert strings from a file to floating point numbers.
- How to convert class values from a file to integers.
Description
Comma Separated Values
The standard file format for small datasets is Comma Separated Values or CSV.
In its simplest form, a CSV file is made up of rows of data. Each row is divided into columns using a comma (“,”).
You can learn more about the CSV file format in RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.
In this tutorial, we are going to practice loading two different standard machine learning datasets in CSV format.
Pima Indians Diabetes Dataset
The first is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns.
All of the values in the file are numeric and can be represented as floating point values. We will learn how to load the file first, then later how to convert the loaded strings to numeric values.
Iris Flower Species Dataset
The second dataset we will work with is the iris flowers dataset.
It contains 150 rows and 5 columns. The first 4 columns are numeric. It differs from the first dataset in that the class value (final column) is a string indicating a species of flower. We will learn how to convert the numeric columns from strings to numbers and how to convert the flower species string into an integer that we can use consistently.
Tutorial
This tutorial is divided into 3 parts:
- Load a file.
- Load a file and convert strings to floats.
- Load a file and convert strings to integers.
These steps will provide the foundations you need to handle loading your own data.
1. Load CSV File
The first step is to load the CSV file.
We will use the csv module, which is part of the Python standard library.
The reader() function in the csv module takes an open file object as an argument and returns an iterator over the rows of the file.
We will create a function called load_csv() to wrap this behavior that will take a filename and return our dataset. We will represent the loaded dataset as a list of lists. The outer list holds the observations or rows, and each inner list holds the column values for one row.
Below is the complete function for loading a CSV file.
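A minimal sketch of what such a function could look like is shown here. The body is an assumption that follows the description above (csv.reader wrapped in a function that returns a list of lists); it is not necessarily the original listing.

```python
from csv import reader

def load_csv(filename):
    """Load a CSV file and return it as a list of rows, each row a list of strings."""
    # The csv module documentation recommends newline='' for files passed to reader().
    with open(filename, 'r', newline='') as file:
        return list(reader(file))
```

Every value comes back as a string at this point; type conversion is handled separately.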
We can test this function by loading the Pima Indians dataset. Download the dataset and place it in the current working directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.
Taking a peek at the first 5 rows of the raw data file we can see the following:
The data is numeric and separated by commas, and we can expect the rest of the file to follow the same layout.
Let’s use the new function and load the dataset. Once loaded we can report some simple details such as the number of rows and columns loaded.
Putting all of this together, we get the following:
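One way the combined example might look is sketched here. So that it runs without the download, it writes a two-row stand-in file with made-up values (an assumption for illustration only); point filename at pima-indians-diabetes.csv in your working directory to load the real data.

```python
from csv import reader
import os
import tempfile

def load_csv(filename):
    """Load a CSV file and return it as a list of rows, each row a list of strings."""
    with open(filename, 'r', newline='') as file:
        return list(reader(file))

# Stand-in file with made-up values and 9 columns, like the Pima file.
# Replace `filename` with 'pima-indians-diabetes.csv' to load the real dataset.
filename = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(filename, 'w', newline='') as f:
    f.write('1,2,3,4,5,6.5,0.7,8,1\n9,8,7,6,5,4.5,0.3,2,0\n')

dataset = load_csv(filename)
# Report the number of rows and the number of columns in the first row.
message = 'Loaded data file {0} with {1} rows and {2} columns'.format(
    os.path.basename(filename), len(dataset), len(dataset[0]))
print(message)
```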
Running this example we see:
A limitation of this function is that it will load empty lines from data files and add them to our list of rows. We can overcome this by adding rows of data one at a time to our dataset and skipping empty rows.
Below is the updated example with this new improved version of the load_csv() function.
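A sketch of the row-by-row version follows, assuming the same function name and return type as before. It relies on the fact that csv.reader yields an empty list for a completely blank line.

```python
from csv import reader

def load_csv(filename):
    """Load a CSV file as a list of rows, skipping completely empty lines."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            # csv.reader returns [] for a blank line, so skip those rows.
            if not row:
                continue
            dataset.append(row)
    return dataset
```

Note that a line containing only spaces is not empty from the reader's point of view and would still be loaded as a one-column row.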
Running this example we see:
2. Convert String to Floats
Most, if not all, machine learning algorithms prefer to work with numbers.
Specifically, floating point numbers are preferred.
Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:
This produces output like:
We can write a small function to convert specific columns of our loaded dataset to floating point values.
Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values, taking care to strip any whitespace from each value before making the conversion.
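A minimal sketch of such a function, assuming the list-of-lists dataset described earlier and an in-place conversion. The two rows used to exercise it are made up for illustration.

```python
def str_column_to_float(dataset, column):
    """Convert one column of a list-of-lists dataset from strings to floats, in place."""
    for row in dataset:
        # strip() removes stray whitespace around the value before conversion.
        row[column] = float(row[column].strip())

# Made-up rows with some surrounding whitespace, for illustration only.
rows = [[' 6', '148 ', '0.627'], ['1', '85', '0.351']]
for i in range(len(rows[0])):
    str_column_to_float(rows, i)
print(rows[0])
```

A value that cannot be parsed as a number raises ValueError, which is a reasonable default while the data is still being explored.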
We can test this function by combining it with our load CSV function above, and convert all of the numeric data in the Pima Indians dataset to floating point values.
The complete example is below.
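The combined steps might be sketched as follows. As an assumption for illustration, the code writes a small stand-in file with made-up values so that it runs without the download; use pima-indians-diabetes.csv as the filename to work with the real data.

```python
from csv import reader
import os
import tempfile

def load_csv(filename):
    """Load a CSV file as a list of rows, skipping completely empty lines."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if not row:
                continue
            dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column from strings to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

# Stand-in file with made-up values; replace with 'pima-indians-diabetes.csv'.
filename = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with open(filename, 'w', newline='') as f:
    f.write('1,2,3,4,5,6.5,0.7,8,1\n9,8,7,6,5,4.5,0.3,2,0\n')

dataset = load_csv(filename)
print(dataset[0])  # first row before conversion: all strings
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])  # first row after conversion: all floats
```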
Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.
3. Convert String to Integers
The iris flowers dataset is like the Pima Indians dataset, in that the columns contain numeric data.
The difference is the final column, traditionally used to hold the outcome or value to be predicted for a given row. The final column in the iris flowers data is the iris flower species as a string.
Download the dataset and place it in the current working directory with the file name iris.csv. Open the file and delete any empty lines at the bottom.
For example, below are the first 5 rows of the raw dataset.
Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value.
We can convert the class value in the iris flowers dataset to an integer by creating a map.
- First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
- Next, we assign an integer value to each, such as: 0, 1 and 2.
- Finally, we replace all occurrences of class string values with their corresponding integer values.
Below is a function to do just that called str_column_to_int(). Like the previously introduced str_column_to_float() it operates on a single column in the dataset.
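A sketch of the three steps above, assuming the same list-of-lists dataset. Sorting the unique values is a design choice made here so that the mapping is identical on every run; iterating over a plain set would also work but can assign the integers in a different order each time. The sample rows are for illustration only.

```python
def str_column_to_int(dataset, column):
    """Replace the string values in one column with integer codes, in place.

    Returns the dictionary that maps each original string to its integer.
    """
    # Step 1: collect the unique class values (sorted for a repeatable mapping).
    unique = sorted(set(row[column] for row in dataset))
    # Step 2: assign an integer to each unique value.
    lookup = {value: i for i, value in enumerate(unique)}
    # Step 3: replace every string with its integer code.
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Illustrative rows: one measurement and the species label.
rows = [['5.1', 'Iris-setosa'], ['7.0', 'Iris-versicolor'],
        ['6.3', 'Iris-virginica'], ['4.9', 'Iris-setosa']]
lookup = str_column_to_int(rows, 1)
print(lookup)
```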
We can test this new function alongside the previous two functions for loading a CSV file and converting columns to floating point values. Note that str_column_to_int() also returns the dictionary mapping of class values to integer values, in case any users downstream want to convert predictions back to string values again.
The example below loads the iris dataset then converts the first 4 columns to floats and the final column to integer values.
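A sketch of how the three functions could be combined, assuming the usual iris layout of numeric measurement columns followed by the species in the final column. It writes a three-row stand-in file so that it runs without the download (an assumption for illustration); use iris.csv as the filename for the real data.

```python
from csv import reader
import os
import tempfile

def load_csv(filename):
    """Load a CSV file as a list of rows, skipping completely empty lines."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if not row:
                continue
            dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    """Convert one column from strings to floats, in place."""
    for row in dataset:
        row[column] = float(row[column].strip())

def str_column_to_int(dataset, column):
    """Replace strings in one column with integer codes; return the mapping."""
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Three-row stand-in file for illustration; replace with 'iris.csv'.
filename = os.path.join(tempfile.mkdtemp(), 'iris-sample.csv')
with open(filename, 'w', newline='') as f:
    f.write('5.1,3.5,1.4,0.2,Iris-setosa\n'
            '7.0,3.2,4.7,1.4,Iris-versicolor\n'
            '6.3,3.3,6.0,2.5,Iris-virginica\n')

dataset = load_csv(filename)
print(dataset[0])  # before conversion
last = len(dataset[0]) - 1
# Every column except the last holds a measurement.
for i in range(last):
    str_column_to_float(dataset, i)
# The last column holds the species label.
lookup = str_column_to_int(dataset, last)
print(dataset[0])  # after conversion
print(lookup)
```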
Running this example produces the output below.
We can see the first row of the dataset before and after the data type conversions. We can also see the dictionary mapping of class values to integers.
Extensions
You learned how to load CSV files and perform basic data conversions.
Data loading can be a difficult task given the variety of data cleaning and conversion that may be required from problem to problem.
There are many extensions that would make these examples more robust to new and different data files. Below are just a few ideas you can consider researching and implementing yourself:
- Detect and remove empty lines at the top or bottom of the file.
- Detect and handle missing values in a column.
- Detect and handle rows that do not match expectations for the rest of the file.
- Support other delimiters such as “|” (pipe) or white space.
- Support more efficient data structures such as arrays.
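As a starting point, two of these ideas could be sketched as follows. The delimiter and missing parameters are hypothetical additions introduced here, not part of the functions developed in the tutorial; the delimiter is passed straight through to csv.reader, which accepts a single-character delimiter.

```python
from csv import reader
import os
import tempfile

def load_csv(filename, delimiter=','):
    """Load a delimited file, skipping rows that are empty or only whitespace."""
    dataset = []
    with open(filename, 'r', newline='') as file:
        for row in reader(file, delimiter=delimiter):
            # any() is False for [] and for rows whose fields are all blank.
            if not any(field.strip() for field in row):
                continue
            dataset.append(row)
    return dataset

def str_column_to_float(dataset, column, missing=None):
    """Convert a column to floats, substituting `missing` for empty fields."""
    for row in dataset:
        value = row[column].strip()
        row[column] = float(value) if value else missing

# Made-up pipe-delimited file with a leading blank line and one missing value.
filename = os.path.join(tempfile.mkdtemp(), 'sample.psv')
with open(filename, 'w', newline='') as f:
    f.write('\n1|2.5\n\n3|\n')

dataset = load_csv(filename, delimiter='|')
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset)
```

Using None as the placeholder keeps missing values visible; what to do with them afterwards (drop the row, impute a value) depends on the problem.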
Two libraries you may wish to use in practice for loading CSV data are NumPy and Pandas.
NumPy offers the loadtxt() function for loading data files as NumPy arrays. Pandas provides the read_csv() function, which gives you a lot of flexibility over data types, file headers and more.
Review
In this tutorial, you discovered how you can load your machine learning data from scratch in Python.
Specifically, you learned:
- How to load a CSV file into memory.
- How to convert string values to floating point values.
- How to convert a string class value into an integer encoding.
Do you have any questions about loading machine learning data or about this post?
Ask your question in the comments and I will do my best to answer.