
Converting Pandas DataFrames to PyTorch DataLoaders for Custom Deep Learning Model Training.
Pandas DataFrames are powerful and versatile data manipulation and analysis tools. While the versatility of this data structure is undeniable, in some situations — like working with PyTorch — a more structured and batch-friendly format would be more efficient and better suited for training deep learning models, in which case the DataLoader class stands out as the preferred solution.
In this article, we show how to turn a dataset contained in a Pandas DataFrame object into a PyTorch DataLoader object, which allows mini-batches of data to be fed to a model efficiently during training and evaluation as part of custom deep learning workflows. We will illustrate the process by training and evaluating a simple, custom deep learning model for regression, namely to estimate house prices.
Preparing the Data: From DataFrame to DataLoader
The process starts by importing the Python libraries and modules we will need, primarily Pandas for loading the dataset into a DataFrame and preprocessing it, PyTorch modules like Dataset, DataLoader, and several more for modeling and training neural network models, and scikit-learn components for supporting initial data preparation.
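A minimal import list covering the components described above might look as follows (assuming pandas, PyTorch, and scikit-learn are installed):

```python
# Data handling
import pandas as pd

# PyTorch: dataset wrapping, batching, and model building
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# scikit-learn: dataset loading and preprocessing helpers
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```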
The next few instructions load the California housing dataset, containing over 20K instances described by eight predictor attributes — all of them numerical in this scikit-learn’s built-in version of the dataset — plus a target numerical label: median house value. The data is loaded into a Pandas DataFrame.
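A sketch of this loading step, with assumed variable names (`housing`, `df`); `fetch_california_housing` downloads the data on first use and caches it locally:

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California housing dataset and wrap it in a DataFrame
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["MedHouseVal"] = housing.target  # target: median house value

print(df.shape)  # (20640, 9): 20,640 instances, 8 predictors plus the target
```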
Before converting the data into a PyTorch-friendly format, we set the target variable aside in a variable named y, keeping the rest of the data inputs in a new DataFrame X. Both are split into training (80%) and test (20%) for later model training and testing, respectively. Since some attributes’ values move across much wider ranges than others, we are also applying feature scaling aided by scikit-learn’s StandardScaler class.
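The split-and-scale step can be sketched as follows. Random stand-in arrays are used here so the snippet runs standalone (in the article's workflow, `X` and `y` come from the housing DataFrame), and `random_state=42` is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-ins shaped like the housing data: 20,640 rows, 8 predictors
rng = np.random.default_rng(42)
X = rng.random((20640, 8))
y = rng.random(20640)

# 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features: fit on training data only, then reuse
# the training statistics on the test set to avoid data leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```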
With these steps, data preprocessing is pretty much done! Now it's time to turn the data into two DataLoaders: one for the training set and one for the test set. We do so with the help of a custom class that inherits from PyTorch's Dataset. An object of this class, which we called HousingDataset, will contain two attributes: the inputs stemming from X, and the outputs stemming from y. Likewise, we override the special class methods __len__() and __getitem__(), which return the number of instances and the ith instance, respectively.
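A sketch of such a class, converting the inputs and targets to float32 tensors as PyTorch models expect:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class HousingDataset(Dataset):
    """Wraps feature and target arrays as float32 tensors."""

    def __init__(self, X, y):
        self.X = torch.tensor(np.asarray(X), dtype=torch.float32)
        # Reshape targets to a column vector to match the model's output shape
        self.y = torch.tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        # Number of instances in the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # The ith (input, target) pair
        return self.X[idx], self.y[idx]
```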
We now instantiate two objects of the newly created class, one for the training data/labels, and one for the test data/labels, as follows:
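The instantiation step might look like this. The class definition is repeated and random stand-ins are used for the scaled splits so the snippet runs standalone:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Minimal HousingDataset, repeated here for a self-contained example
class HousingDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(np.asarray(X), dtype=torch.float32)
        self.y = torch.tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Stand-ins for the scaled 80/20 splits produced earlier (assumed shapes)
X_train, y_train = np.random.rand(16512, 8), np.random.rand(16512)
X_test, y_test = np.random.rand(4128, 8), np.random.rand(4128)

train_dataset = HousingDataset(X_train, y_train)
test_dataset = HousingDataset(X_test, y_test)
```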
Finally, it’s time to create the two associated DataLoader instances. The one used for the training data will be shuffled to help prevent ordering bias during model training.
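A sketch of this step; the batch size of 64 is an assumption, and small TensorDataset stand-ins replace the HousingDataset objects so the snippet runs standalone:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the train/test HousingDataset objects built earlier
train_dataset = TensorDataset(torch.rand(100, 8), torch.rand(100, 1))
test_dataset = TensorDataset(torch.rand(20, 8), torch.rand(20, 1))

# Shuffle only the training data; evaluation order does not matter
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)
```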
You may have noticed that both DataLoader objects are instantiated upon the two previous Dataset-based objects. A DataLoader object does not actually contain the data (that is the job of the custom HousingDataset class), but it is responsible for iterating over and managing its associated dataset.
Custom Model Training and Testing
Now that we have PyTorch DataLoader objects ready to use, why not build a simple neural network model for regression, train it, and evaluate it on the test data by using such objects? The code below defines a very straightforward neural network architecture consisting of two fully connected linear layers bridged by a ReLU activation function.
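A sketch of such an architecture; the hidden layer size (16) is an assumption:

```python
import torch
import torch.nn as nn

class HousingModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, 16),  # first fully connected layer
            nn.ReLU(),                  # non-linearity between the two layers
            nn.Linear(16, 1),           # single output: predicted house value
        )

    def forward(self, x):
        return self.net(x)
```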
Next, we instantiate a model based on this architecture, specifying an input size equal to the number of predictor attributes in the dataset, set the loss function and optimizer, and run a training loop across 10 iterations or epochs. Ideally, the loss should gradually decrease after every epoch, a sign that the model is converging towards better predictions on the training data:
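A standalone sketch of this training loop; the model, the random stand-in data, and the hyperparameters (Adam with lr=0.001, batch size 64) are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for train_loader built in the previous section
train_loader = DataLoader(
    TensorDataset(torch.rand(256, 8), torch.rand(256, 1)),
    batch_size=64, shuffle=True,
)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()                                   # regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(10):
    epoch_loss = 0.0
    for X_batch, y_batch in train_loader:  # mini-batches from the DataLoader
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}, Loss: {epoch_loss / len(train_loader):.4f}")
```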
Notice how the DataLoader for training data, train_loader, is used and split into batches during the training process. And finally, the following code can be used to quickly test the model’s prediction error (namely using Mean Squared Error, MSE) on the test instances, aided by the other DataLoader, test_loader:
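A standalone sketch of this evaluation step; the model and data are stand-ins for the objects built earlier, so the printed MSE will differ from the article's reported value:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-ins for the trained model and test_loader from the previous steps
test_loader = DataLoader(
    TensorDataset(torch.rand(64, 8), torch.rand(64, 1)), batch_size=32
)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()

model.eval()  # switch to evaluation mode
total_loss = 0.0
with torch.no_grad():  # no gradients needed for evaluation
    for X_batch, y_batch in test_loader:
        # Weight each batch's mean loss by its size for an exact overall MSE
        total_loss += criterion(model(X_batch), y_batch).item() * len(X_batch)

mse = total_loss / len(test_loader.dataset)
print(f"Test Mean Squared Error: {mse:.4f}")
```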
Output: Test Mean Squared Error: 0.3385
Wrapping Up
This article demonstrated the process of turning a dataset contained in Pandas DataFrames into PyTorch DataLoaders, an ideal approach to efficiently train and use custom deep learning models.
