Artificial Intelligence , Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog. Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Thursday, 4 July 2024

How to Load Machine Learning Data From Scratch In Python

You must know how to load data before you can use it to train a machine learning model.

When starting out, it is a good idea to stick with small in-memory datasets using standard file formats like comma separated value (.csv).

In this tutorial you will discover how to load your data in Python from scratch, including:

How to load a CSV file.
How to convert strings from a file to floating point numbers.
How to convert class values from a file to integers.

Description

Comma Separated Values

The standard file format for small datasets is Comma Separated Values or CSV.

In its simplest form, a CSV file consists of rows of data. Each row is divided into columns using a comma (“,”).

You can learn more about the CSV file format in RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.
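As a small illustration of the format, the sketch below parses two lines of CSV text held in memory using the csv module from the standard library. The sample text and variable names are assumptions made for this illustration and are not part of the datasets used in this tutorial. It also shows that a field wrapped in double quotes may itself contain a comma, as RFC 4180 describes.

```python
from csv import reader
from io import StringIO

# Hypothetical sample text: two rows, the second with a quoted field
sample_text = 'a,b,c\n1,"hello, world",3\n'

# csv.reader accepts any iterable of lines, including a file-like object
rows = list(reader(StringIO(sample_text)))

print(rows[0])  # ['a', 'b', 'c']
print(rows[1])  # ['1', 'hello, world', '3']
```

Note that every value comes back as a string, even the ones that look like numbers.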

In this tutorial, we are going to practice loading two different standard machine learning datasets in CSV format.

Pima Indians Diabetes Dataset

The first is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns.

All of the values in the file are numeric, specifically floating point values. We will learn how to load the file first, then later how to convert the loaded strings to numeric values.

Iris Flower Species Dataset

The second dataset we will work with is the iris flowers dataset.

It contains 150 rows and 5 columns. The first 4 columns are numeric. It is different in that the class value (final column) is a string, indicating a species of flower. We will learn how to convert the numeric columns from strings to numbers and how to convert the flower species string into an integer that we can use consistently.

Tutorial

This tutorial is divided into 3 parts:

Load a file.
Load a file and convert Strings to Floats.
Load a file and convert Strings to Integers.

These steps will provide the foundations you need to handle loading your own data.

1. Load CSV File

The first step is to load the CSV file.

We will use the csv module that is a part of the standard library.

The reader() function in the csv module takes a file as an argument.

We will create a function called load_csv() to wrap this behavior that will take a filename and return our dataset. We will represent the loaded dataset as a list of lists. The first list is a list of observations or rows, and the second list is the list of column values for a given row.
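To make the list-of-lists layout concrete, here is a minimal sketch that uses two rows typed in by hand rather than loaded from a file. The values are copied from the first two rows of the Pima Indians file, and the variable names are assumptions for illustration only.

```python
# A tiny hand-written dataset in the same shape that load_csv() returns:
# the outer list holds rows, each inner list holds that row's column values
dataset = [
    ['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'],
    ['1', '85', '66', '29', '0', '26.6', '0.351', '31', '0'],
]

num_rows = len(dataset)        # number of observations
num_columns = len(dataset[0])  # number of values in the first row
second_row = dataset[1]        # one whole row
value = dataset[1][5]          # row 1, column 5 (both counted from zero)

print(num_rows, num_columns)  # 2 9
print(value)                  # 26.6
```

Indexing first by row and then by column is the access pattern used by the conversion functions later in this tutorial.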

Below is the complete function for loading a CSV file.

from csv import reader

# Load a CSV file
def load_csv(filename):
    # newline='' lets the csv module handle line endings itself
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

We can test this function by loading the Pima Indians dataset. Download the dataset and place it in the current working directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.

Taking a peek at the first 5 rows of the raw data file we can see the following:

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1

The data is numeric and separated by commas, and we can expect the rest of the file to follow the same pattern.

Let’s use the new function and load the dataset. Once loaded we can report some simple details such as the number of rows and columns loaded.

Putting all of this together, we get the following:

from csv import reader

# Load a CSV file
def load_csv(filename):
    # newline='' lets the csv module handle line endings itself
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

Running this example we see:

Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns

A limitation of this function is that it will load empty lines from data files and add them to our list of rows. We can overcome this by adding rows of data one at a time to our dataset and skipping empty rows.
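The limitation is easy to reproduce without a file on disk. The sketch below feeds the csv reader a short piece of in-memory text (a made-up sample, not the real dataset) that contains blank lines, first keeping every row and then skipping the empty ones.

```python
from csv import reader
from io import StringIO

# Hypothetical file contents with a blank line in the middle and at the end
sample_text = '6,148,72\n\n1,85,66\n\n'

# Loading everything at once keeps the blank lines as empty lists
all_rows = list(reader(StringIO(sample_text)))
print(all_rows)  # [['6', '148', '72'], [], ['1', '85', '66'], []]

# Skipping rows that are empty lists leaves only the real data
data_rows = [row for row in reader(StringIO(sample_text)) if row]
print(data_rows)  # [['6', '148', '72'], ['1', '85', '66']]
```

An empty list is falsy in Python, which is why a simple truth test on each row is enough to detect a blank line.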

Below is the updated example with this new improved version of the load_csv() function.

# Example of loading Pima Indians CSV dataset
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r', newline='') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # An empty line is returned as an empty list, so skip it
            if not row:
                continue
            dataset.append(row)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))

Running this example we see:

Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns

2. Convert String to Floats

Most, if not all, machine learning algorithms prefer to work with numbers.

Specifically, floating point numbers are preferred.

Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:

print(dataset[0])

This produces output like:

['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']

We can write a small function to convert specific columns of our loaded dataset to floating point values.

Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values in place, taking care to strip any whitespace from each value before making the conversion.

def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

We can test this function by combining it with our load CSV function above, and convert all of the numeric data in the Pima Indians dataset to floating point values.

The complete example is below.

from csv import reader

# Load a CSV file
def load_csv(filename):
    # Open in text mode: the csv module expects strings, not bytes
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])

# convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])

Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.

Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]

3. Convert String to Integers

The iris flowers dataset is like the Pima Indians dataset, in that the columns contain numeric data.

The difference is the final column, traditionally used to hold the outcome or value to be predicted for a given row. The final column in the iris flowers data is the iris flower species as a string.

Download the dataset and place it in the current working directory with the file name iris.csv. Open the file and delete any empty lines at the bottom.

For example, below are the first 5 rows of the raw dataset.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa

Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value.

We can convert the class value in the iris flowers dataset to an integer by creating a map.

First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
Next, we assign an integer value to each, such as: 0, 1 and 2.
Finally, we replace all occurrences of class string values with their corresponding integer values.

Below is a function to do just that called str_column_to_int(). Like the previously introduced str_column_to_float() it operates on a single column in the dataset.

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

We can test this new function together with the previous two functions for loading a CSV file and converting columns to floating point values. Note that str_column_to_int() also returns the dictionary mapping of class values to integer values, in case any users downstream want to convert predictions back to string values again.
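Converting integers back to class strings is a matter of inverting the returned dictionary. The sketch below is one possible way to do it, using a small made-up dataset. The names reverse_lookup, decoded and original_labels are assumptions for illustration and do not appear elsewhere in this tutorial.

```python
# Convert string column to integer (same logic as the tutorial's function)
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Hypothetical two-column dataset: one measurement and a class label
dataset = [['5.1', 'Iris-setosa'], ['7.0', 'Iris-versicolor'], ['4.9', 'Iris-setosa']]
original_labels = [row[1] for row in dataset]

lookup = str_column_to_int(dataset, 1)

# Invert the mapping so an integer can be turned back into its label
reverse_lookup = {index: label for label, index in lookup.items()}
decoded = [reverse_lookup[row[1]] for row in dataset]

print(decoded == original_labels)  # True
```

The round trip works whichever integer each label happens to receive, because the reverse dictionary is built from the same lookup.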

The example below loads the iris dataset, then converts the first 4 columns to floats and the final column to integer values.

from csv import reader

# Load a CSV file
def load_csv(filename):
    # Open in text mode: the csv module expects strings, not bytes
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Load iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])

# convert string columns to float
for i in range(4):
    str_column_to_float(dataset, i)

# convert class column to int
lookup = str_column_to_int(dataset, 4)
print(dataset[0])
print(lookup)

Running this example produces the output below.

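We can see the first row of the dataset before and after the data type conversions. We can also see the dictionary mapping of class values to integers. Because the unique class values are collected in a Python set, which has no guaranteed order, the integer assigned to each species may differ from the output shown here.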

Loaded data file iris.csv with 150 rows and 5 columns
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
[5.1, 3.5, 1.4, 0.2, 1]
{'Iris-virginica': 0, 'Iris-setosa': 1, 'Iris-versicolor': 2}
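If you need the same integer for each class on every run, one option is to sort the unique values before numbering them. The sketch below is a hypothetical variant named str_column_to_int_sorted(). It is not the function used in the example above, and the small dataset is made up for illustration.

```python
# Hypothetical variant of str_column_to_int() that sorts the unique values
def str_column_to_int_sorted(dataset, column):
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

dataset = [['Iris-virginica'], ['Iris-setosa'], ['Iris-versicolor'], ['Iris-setosa']]
lookup = str_column_to_int_sorted(dataset, 0)

print(lookup)   # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
print(dataset)  # [[2], [0], [1], [0]]
```

Sorting costs very little for a handful of class labels and makes results easier to compare between runs.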

Extensions

You learned how to load CSV files and perform basic data conversions.

Data loading can be a difficult task given the variety of data cleaning and conversion that may be required from problem to problem.

There are many extensions that would make these examples more robust to new and different data files. Below are just a few ideas you can consider researching and implementing yourself:

Detect and remove empty lines at the top or bottom of the file.
Detect and handle missing values in a column.
Detect and handle rows that do not match expectations for the rest of the file.
Support for other delimiters such as “|” (pipe) or white space.
Support more efficient data structures such as arrays.
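As a starting point for some of these ideas, here is a rough sketch that reads pipe-delimited text, skips blank lines, sets aside rows with an unexpected number of columns, and turns empty fields into None. The function names load_delimited() and str_column_to_float_or_none(), their parameters, and the sample text are all assumptions made for this illustration. Treat it as one possible approach rather than a finished solution.

```python
from csv import reader
from io import StringIO

# Hypothetical loader: works on any iterable of lines, such as an open file
def load_delimited(lines, delimiter=',', expected_columns=None):
    dataset = []
    skipped = []  # row numbers (counted from 1) that were rejected
    for row_number, row in enumerate(reader(lines, delimiter=delimiter), start=1):
        if not row:
            continue  # blank line
        if expected_columns is not None and len(row) != expected_columns:
            skipped.append(row_number)  # wrong number of columns
            continue
        dataset.append(row)
    return dataset, skipped

# Hypothetical converter: empty strings become None instead of raising an error
def str_column_to_float_or_none(dataset, column):
    for row in dataset:
        text = row[column].strip()
        row[column] = float(text) if text else None

# Made-up sample: a blank line, a missing value and a row that is too short
sample_text = '1.5|2.0|x\n\n3.5||y\n4.0|5.0\n'
dataset, skipped = load_delimited(StringIO(sample_text), delimiter='|', expected_columns=3)
for i in range(2):
    str_column_to_float_or_none(dataset, i)

print(dataset)  # [[1.5, 2.0, 'x'], [3.5, None, 'y']]
print(skipped)  # [4]
```

How to treat a missing value (drop the row, fill in a default, or keep a marker such as None) depends on the algorithm you plan to use, so the sketch only marks the gap and leaves that decision to you.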

Two libraries you may wish to use in practice for loading CSV data are NumPy and Pandas.

NumPy offers the loadtxt() function for loading data files as NumPy arrays. Pandas offers the read_csv() function that offers a lot of flexibility regarding data types, file headers and more.
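For comparison, the sketch below loads two rows of Pima-style data with NumPy's loadtxt(). It assumes NumPy is installed and reads from in-memory text so that it runs without a file. The sample text is copied from the first two rows shown earlier in this tutorial.

```python
from io import StringIO

import numpy as np

# Two rows copied from the Pima Indians sample shown earlier
sample_text = '6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n'

# loadtxt() parses the text and converts every value to float in one step
data = np.loadtxt(StringIO(sample_text), delimiter=',')

print(data.shape)  # (2, 9)
print(data[0, 5])  # 33.6
```

With a file on disk you would pass the filename in place of the StringIO object. The Pandas read_csv() function can be used in a similar way; for files without a header row, such as the two used here, its header=None argument is likely what you want.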

Review

In this tutorial, you discovered how you can load your machine learning data from scratch in Python.

Specifically, you learned:

How to load a CSV file into memory.
How to convert string values to floating point values.
How to convert a string class value into an integer encoding.

Do you have any questions about loading machine learning data or about this post?
Ask your question in the comments and I will do my best to answer.
