
Converting Pandas DataFrames to PyTorch DataLoaders for Custom Deep Learning Model Training.
Pandas DataFrames are powerful and versatile data manipulation and analysis tools. While the versatility of this data structure is undeniable, in some situations — like working with PyTorch — a more structured and batch-friendly format would be more efficient and better suited for training deep learning models, in which case the DataLoader class stands out as the preferred solution.
In this article, we show how to turn a dataset contained in a Pandas DataFrame object into a PyTorch DataLoader object, which allows mini-batches of data to be fed to a model efficiently during training and evaluation as part of custom deep learning workflows. We will illustrate the process by training and evaluating a simple, custom deep learning model for regression, namely to estimate house prices.
Preparing the Data: From DataFrame to DataLoader
The process starts by importing the Python libraries and modules we will need, primarily Pandas for loading the dataset into a DataFrame and preprocessing it, PyTorch modules like Dataset, DataLoader, and several more for modeling and training neural network models, and scikit-learn components for supporting initial data preparation.
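A minimal import list covering the components described above might look as follows (assuming pandas, PyTorch, and scikit-learn are installed):

```python
# Data handling
import pandas as pd

# PyTorch: dataset wrapping, batching, and model building
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# scikit-learn: dataset loading and preprocessing helpers
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```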
The next few instructions load the California housing dataset, containing over 20K instances described by eight predictor attributes — all of them numerical in this scikit-learn’s built-in version of the dataset — plus a target numerical label: median house value. The data is loaded into a Pandas DataFrame.
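A sketch of this loading step, with assumed variable names (`housing`, `df`); `fetch_california_housing` downloads the data on first use and caches it locally:

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California housing dataset and wrap it in a DataFrame
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target  # target: median house value

print(df.shape)  # (20640, 9): 20,640 instances, 8 predictors plus the target
```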
Before converting the data into a PyTorch-friendly format, we set the target variable aside in a variable named y, keeping the rest of the data inputs in a new DataFrame X. Both are split into training (80%) and test (20%) for later model training and testing, respectively. Since some attributes’ values move across much wider ranges than others, we are also applying feature scaling aided by scikit-learn’s StandardScaler class.
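The split-and-scale step can be sketched as follows. Random stand-in arrays are used here so the snippet runs standalone (in the article's workflow, `X` and `y` come from the housing DataFrame), and `random_state=42` is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-ins shaped like the housing data: 20,640 rows, 8 predictors
rng = np.random.default_rng(42)
X = rng.random((20640, 8))
y = rng.random(20640)

# 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features: fit on training data only, then reuse
# the training statistics on the test set to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```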
With these steps, data preprocessing is pretty much done! Now it's time to turn the data into two DataLoaders: one for the training set and one for the test set. We do so with the help of a custom class that inherits from PyTorch's Dataset. An object of this class, which we called HousingDataset, will contain two attributes: the inputs stemming from X, and the outputs stemming from y. Likewise, we override the special class methods __len__() and __getitem__(), which return the number of instances and the ith instance, respectively.
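A sketch of such a class, converting the inputs and targets to float32 tensors as PyTorch models expect:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class HousingDataset(Dataset):
    """Wraps feature and target arrays as float32 tensors."""

    def __init__(self, X, y):
        self.X = torch.tensor(np.asarray(X), dtype=torch.float32)
        # Reshape targets to a column vector to match the model's output shape
        self.y = torch.tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        # Number of instances in the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # The ith (input, target) pair
        return self.X[idx], self.y[idx]
```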
We now instantiate two objects of the newly created class, one for the training data/labels, and one for the test data/labels, as follows:
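The instantiation step might look like this. The class definition is repeated and random stand-ins are used for the scaled splits so the snippet runs standalone:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Minimal HousingDataset, repeated here for a self-contained example
class HousingDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(np.asarray(X), dtype=torch.float32)
        self.y = torch.tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Stand-ins for the scaled 80/20 splits produced earlier (assumed shapes)
X_train, y_train = np.random.rand(16512, 8), np.random.rand(16512)
X_test, y_test = np.random.rand(4128, 8), np.random.rand(4128)

train_dataset = HousingDataset(X_train, y_train)
test_dataset = HousingDataset(X_test, y_test)
```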
Finally, it’s time to create the two associated DataLoader instances. The one used for the training data will be shuffled to help prevent ordering bias during model training.
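A sketch of this step; the batch size of 64 is an assumption, and small TensorDataset stand-ins replace the HousingDataset objects so the snippet runs standalone:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the train/test HousingDataset objects built earlier
train_dataset = TensorDataset(torch.rand(100, 8), torch.rand(100, 1))
test_dataset = TensorDataset(torch.rand(20, 8), torch.rand(20, 1))

# Shuffle only the training data; evaluation order does not matter
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
```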
You may have noticed that both DataLoader objects are instantiated upon the two previous Dataset-based objects. A DataLoader object does not actually contain the data (that is the job of the custom HousingDataset class), but it is responsible for iterating over and managing its associated dataset.
Custom Model Training and Testing
Now that we have PyTorch DataLoader objects ready to use, why not build a simple neural network model for regression, train it, and evaluate it on the test data by using such objects? The code below defines a very straightforward neural network architecture consisting of two fully connected linear layers bridged by a ReLU activation function.
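A sketch of such an architecture; the hidden layer size (16) is an assumption:

```python
import torch
import torch.nn as nn

class HousingModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 16),  # first fully connected layer
            nn.ReLU(),                  # non-linearity between the two layers
            nn.Linear(16, 1),           # single output: predicted house value
        )

    def forward(self, x):
        return self.net(x)
```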
Next, we instantiate a model based on this architecture, specifying an input size equal to the number of predictor attributes in the dataset, set the loss function and optimizer, and run a training loop across 10 iterations or epochs. Ideally, the loss should gradually decrease after every epoch, a sign that the model is converging towards better predictions on the training data:
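A standalone sketch of this training loop; the model, the random stand-in data, and the hyperparameters (Adam with lr=0.001, batch size 64) are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for train_loader built in the previous section
train_loader = DataLoader(
    TensorDataset(torch.rand(256, 8), torch.rand(256, 1)),
    batch_size=64, shuffle=True,
)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()                                   # regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:  # mini-batches from the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {epoch_loss / len(train_loader):.4f}")
```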
Notice how the DataLoader for training data, train_loader, is used and split into batches during the training process. And finally, the following code can be used to quickly test the model’s prediction error (namely using Mean Squared Error, MSE) on the test instances, aided by the other DataLoader, test_loader:
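A standalone sketch of this evaluation step; the model and data are stand-ins for the objects built earlier, so the printed MSE will differ from the article's reported value:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the trained model and test_loader from the previous steps
test_loader = DataLoader(
    TensorDataset(torch.rand(64, 8), torch.rand(64, 1)), batch_size=32
)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()

model.eval()  # switch to evaluation mode
total_loss = 0.0
with torch.no_grad():  # no gradients needed for evaluation
    for X_batch, y_batch in test_loader:
        # Weight each batch's mean loss by its size for an exact overall MSE
        total_loss += criterion(model(X_batch), y_batch).item() * len(X_batch)

mse = total_loss / len(test_loader.dataset)
print(f"Test Mean Squared Error: {mse:.4f}")
```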
Output: Test Mean Squared Error: 0.3385
Wrapping Up
This article demonstrated the process of turning a dataset contained in Pandas DataFrames into PyTorch DataLoaders, an ideal approach to efficiently train and use custom deep learning models.
